[SR-Users] out of shm without any visible reason

Wed Mar 11 12:16:05 CET 2020

On 11.03.20 09:04, Juha Heinanen wrote:
> Daniel-Constantin Mierla writes:
>
>> It seems to be the case of a retransmission timeout:
>>
>> #17 0x00007f7dc04d4aca in acc_onreply (t=0x7f7d9e3b0650, req=0x7f7d9e357650, reply=0xffffffffffffffff, code=408) at acc_logic.c:604
>>
>> Code is 408 and the reply is faked value. This case is happening in
>> timer process.
> That explains it.  But isn't it risky that in this kind of situation
> the timer process (the only one) handles the reply and accounting?
There are many cases where delays can increase the risk of
malfunctioning, no matter whether they happen in the timer module or in
a SIP routing worker. If a process is blocked, slots in internal hash
tables (e.g., user location) can stay locked, and no other process that
needs them can continue until the blocked process unlocks them.
Interactions with external systems such as databases, API servers or
DNS services are the typical candidates for adding significant delay.
For specific deployments, there are solutions that keep blocking
operations to a minimum, but it is probably impossible to do that
everywhere when dealing with external systems. Examples are the
async insert added to db_mysql quite some time ago, mqueue+rtimer, or
the async modules.
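
To illustrate the general idea behind queue-based offloading (this is
not Kamailio code, only a minimal Python sketch of the pattern): the
time-critical path only enqueues a record, and a separate worker
performs the slow write. The names slow_db_insert, start_writer and
on_timeout are hypothetical, and the sleep merely stands in for a slow
database call.

```python
import queue
import threading
import time


def slow_db_insert(record, store):
    """Hypothetical stand-in for a blocking call to an external system."""
    time.sleep(0.05)  # simulated database latency
    store.append(record)


def start_writer(jobs, store):
    """Start a background worker that drains the queue and does the slow writes."""
    def run():
        while True:
            record = jobs.get()
            try:
                if record is None:  # sentinel: stop the worker
                    break
                slow_db_insert(record, store)
            finally:
                jobs.task_done()

    worker = threading.Thread(target=run, daemon=True)
    worker.start()
    return worker


def on_timeout(jobs, call_id, code=408):
    """Time-critical path: only enqueue, never wait for the external system."""
    jobs.put({"call_id": call_id, "code": code})
```

With this split, a slow database delays only the worker draining the
queue, while the caller of on_timeout() returns immediately. The cost
is that records are written later, and may be lost if the process
exits before the queue is drained.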
>
> The problem is related to db_cluster/mariadb/debian.  If db_cluster is
> not used, everything works fine.  With db_cluster, accounting hangs the
> timer process at regular (about 2 hour) intervals.

If it happens periodically, maybe you can track down why: try to
identify applications accessing the database for backups, CDR
generation, etc., as well as infrastructure maintenance operations
(e.g., VM backup snapshots).
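
One way to pin down the period before looking for a matching job is to
note the time of each hang and compute the gaps between them. A small
Python sketch, assuming the timestamps have been collected by hand or
from logs in "YYYY-MM-DD HH:MM:SS" form (the function names and the
format are assumptions, nothing Kamailio provides):

```python
from datetime import datetime
from statistics import median


def hang_intervals(timestamps, fmt="%Y-%m-%d %H:%M:%S"):
    """Return the gaps, in minutes, between consecutive hang timestamps."""
    times = sorted(datetime.strptime(ts, fmt) for ts in timestamps)
    return [(b - a).total_seconds() / 60 for a, b in zip(times, times[1:])]


def typical_interval_minutes(timestamps):
    """Median gap in minutes, or None with fewer than two timestamps."""
    gaps = hang_intervals(timestamps)
    return median(gaps) if gaps else None
```

A stable median close to a round value (e.g., 120 minutes) suggests a
scheduled job, which can then be compared against cron entries, backup
schedules or snapshot windows on the database and virtualization hosts.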

Cheers,
Daniel

-- 
Daniel-Constantin Mierla -- www.asipto.com
www.twitter.com/miconda -- www.linkedin.com/in/miconda
Kamailio World Conference - April 27-29, 2020, in Berlin -- www.kamailioworld.com