On 11.03.20 09:04, Juha Heinanen wrote:
Daniel-Constantin Mierla writes:
It seems to be the case of a retransmission timeout:
#17 0x00007f7dc04d4aca in acc_onreply (t=0x7f7d9e3b0650, req=0x7f7d9e357650, reply=0xffffffffffffffff, code=408) at acc_logic.c:604
The code is 408 and the reply pointer is a faked value, not a received message. This case happens in the timer process.
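As a small illustration of how to read that frame: reply=0xffffffffffffffff is what a pointer sentinel of -1 looks like on a 64-bit system. As far as I know the tm module uses such a sentinel (FAKED_REPLY) for replies it generates itself, for example the 408 on timeout. The sketch below only mimics the arithmetic; describe_reply and the constant names are hypothetical helpers, not Kamailio code.

```python
# Illustration only: shows why a C pointer sentinel of -1 appears as
# 0xffffffffffffffff in a 64-bit backtrace. Names are hypothetical.
POINTER_BITS = 64

# Two's-complement representation of -1 in a 64-bit pointer.
FAKED_REPLY = (-1) & ((1 << POINTER_BITS) - 1)


def describe_reply(reply_ptr: int, code: int) -> str:
    """Classify a reply pointer value as seen in a backtrace frame."""
    if reply_ptr == FAKED_REPLY:
        return f"locally generated {code} (no SIP message received)"
    return f"{code} received from the network"


if __name__ == "__main__":
    print(hex(FAKED_REPLY))
    print(describe_reply(0xFFFFFFFFFFFFFFFF, 408))
```

Any callback that receives such a value must compare against the sentinel before dereferencing the pointer.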
That explains it. But isn't it risky that in this kind of situation the timer process (the only one) handles the reply and accounting?
There are many cases where delays increase the risk of malfunction, whether in the timer process or in a SIP routing worker. If a process is blocked, slots in internal hash tables (e.g., user location) can stay locked, and no other process that needs the same slot can continue until the blocked process unlocks it. Interactions with external systems such as databases, API servers or DNS services are the typical candidates for adding significant delay. For specific deployments there are solutions that keep blocking operations to a minimum, but it is probably impossible to do that everywhere when dealing with external systems. Examples are the async-insert added to db_mysql quite some time ago, mqueue+rtimer, or the async modules.
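The queue-based approach (the idea behind mqueue+rtimer) can be sketched in a few lines. This is not Kamailio configuration or code, just a minimal Python illustration under my own assumptions: timer_callback, writer_loop, slow_db_insert and the 50 ms delay are all made-up names and values. The point is that the time-critical context only enqueues the record, and a separate worker performs the slow write.

```python
import queue
import threading
import time


def slow_db_insert(sink, record):
    """Stand-in for a blocking database write (assumed 50 ms)."""
    time.sleep(0.05)
    sink.append(record)


def timer_callback(q, record):
    """Runs in the time-critical context: enqueue only, never block."""
    try:
        q.put_nowait(record)
        return True
    except queue.Full:
        return False  # caller decides whether to drop, log or fall back


def writer_loop(q, sink):
    """Separate worker that drains the queue and does the slow writes."""
    while True:
        record = q.get()
        if record is None:  # shutdown marker
            break
        slow_db_insert(sink, record)


def run_demo(n=5):
    q = queue.Queue(maxsize=1024)
    written = []
    worker = threading.Thread(target=writer_loop, args=(q, written))
    worker.start()

    start = time.monotonic()
    for i in range(n):
        timer_callback(q, {"callid": f"call-{i}", "code": 408})
    enqueue_time = time.monotonic() - start  # time spent in the "timer"

    q.put(None)
    worker.join()
    return enqueue_time, len(written)


if __name__ == "__main__":
    elapsed, count = run_demo()
    print(f"enqueued in {elapsed:.4f}s, records written: {count}")
```

The trade-off is that records sit in memory until the worker writes them, so a bounded queue and a policy for the full-queue case are needed.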
The problem is related to db_cluster/mariadb/debian. If db_cluster is not used, everything works fine. With db_cluster, accounting hangs the timer process at regular intervals (about every 2 hours).
If it happens periodically, maybe you can track down why: try to identify applications accessing the database for backup, CDR generation, etc., as well as infrastructure maintenance operations (e.g., VM backup snapshots).
Cheers, Daniel