Frequent hangs in Kamailio, probably related to lock contention in the xhttp_prom module.
Environment:
The systems use 32 Kamailio worker processes for the relevant network interface. The configuration also performs more than 30 Prometheus counter increment operations during INVITE processing. Kamailio otherwise uses no database or other I/O-related services. The Kamailio version is 5.8.3, but no relevant changes to the xhttp_prom module could be found since then.
Quick summary of the findings:
Multiple systems in a customer setup showed frequent hangs of their Kamailio servers. Usually after a few hours, all Kamailio processes get blocked and no more traffic can be processed on the affected system.
I have analysed three stack traces of Kamailio on one of the systems that showed the behaviour: two taken while the server was running without problems, and one taken during a period when the server was hanging.
Details:
Here are some details of the stack traces from a problematic case.
The relevant processes are from PID 494551 to 494582.
The majority of these processes are blocked in paths related to the Prometheus module (PID 494551 to 494576):
PID 494551:
#1 0x00007fc69ea5f053 in futex_get (lock=0x7fc4a147acd0) at ../../core/mem/../futexlock.h:108
v = 2
i = 1024
#2 0x00007fc69ea70f99 in prom_counter_inc (s_name=0x7fffe339ffa0, number=1, l1=0x7fffe339ff90, l2=0x0, l3=0x0) at prom_metric.c:1154
p = 0x6b
__func__ = "prom_counter_inc"
[…]
#14 0x00000000005ab179 in receive_msg (buf=0x9f47e0 <buf> "INVITE sip:+1YYYYYYYYYY737@10.XXX.XXX.107 SIP/2.0\r\nRecord-Route: <sip:10.XXX.XXX.104;lr=on;ftag=HK507HSy55p9F;dlgcor=62b91.985c3>\r\nRecord-Route: <sip:10.XXX.XXX.117;r2=on;lr;ftag=HK507HSy55p9F>\r\nRecord-R"..., len=2819, rcv_info=0x7fffe33a28d0) at core/receive.c:518
Most of the worker processes are in the same state as shown above.
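For context, frame #1 in futex_get() together with v = 2 is consistent with the usual three-state futex lock scheme (0 = free, 1 = locked, 2 = locked with waiters), and i = 1024 looks like an exhausted spin counter. The following is only a generic illustration of that scheme, not the actual futexlock.h code, to show why the blocked workers end up sleeping in the kernel:

```c
/* Generic three-state futex lock sketch (illustration only, NOT the
 * Kamailio futexlock.h implementation). States: 0 = free,
 * 1 = locked without waiters, 2 = locked with waiters. */
#include <linux/futex.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

static void futex_lock(atomic_int *lock)
{
	int c = 0;
	/* fast path: take the free lock (0 -> 1) */
	if (atomic_compare_exchange_strong(lock, &c, 1))
		return;
	/* slow path: mark the lock as contended (state 2) and sleep in the
	 * kernel until the holder wakes us; this is where the backtraced
	 * worker processes sit, with the lock value reading 2 */
	if (c != 2)
		c = atomic_exchange(lock, 2);
	while (c != 0) {
		syscall(SYS_futex, lock, FUTEX_WAIT, 2, NULL, NULL, 0);
		c = atomic_exchange(lock, 2);
	}
}

static void futex_unlock(atomic_int *lock)
{
	/* if there were waiters (state 2), wake one of them up */
	if (atomic_exchange(lock, 0) == 2)
		syscall(SYS_futex, lock, FUTEX_WAKE, 1, NULL, NULL, 0);
}
```

With 32 workers competing for the same lock, most of them spend their time in that FUTEX_WAIT instead of in SIP processing.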
Some of the processes are busy in other Prometheus-related operations:
PID 494554:
#0 prom_metric_timeout_delete (p_m=0x7fc49ea102f0) at prom_metric.c:646
current = 0x7fc4a2810fd0
ts = 1744143945433
__func__ = "prom_metric_timeout_delete"
l = 0x7fc4abffc808
#1 0x00007fc69ea676ce in prom_metric_list_timeout_delete () at prom_metric.c:668
p = 0x7fc49ea102f0
Problem hypothesis:
My hypothesis is that the hang is caused by lock contention in the Prometheus module. The relevant code uses only a single lock, and this, together with the extensive use of counter increments in the configuration, probably causes these issues under high load.
The majority of the worker processes are occupied in the Prometheus path and are not processing SIP packets. This of course causes the UDP receive queue to grow, which leads to the described problems.
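To make the contention pattern concrete, here is a rough sketch of what the backtrace suggests the increment path looks like; the names and data layout are simplified and do not match prom_metric.c, and a pthread mutex stands in for the module's shared-memory lock:

```c
/* Simplified illustration of a single-lock counter path (NOT the actual
 * prom_metric.c code): every increment from every worker has to take the
 * same module-wide lock just to find and bump one counter. */
#include <pthread.h>
#include <string.h>

typedef struct counter {
	char name[64];
	unsigned long long value;
	struct counter *next;
} counter_t;

static counter_t *counter_list;                 /* shared metric list */
static pthread_mutex_t metric_lock = PTHREAD_MUTEX_INITIALIZER; /* one lock */

static void counter_inc(const char *name, unsigned long long n)
{
	pthread_mutex_lock(&metric_lock);       /* all 32 workers serialize here */
	for (counter_t *c = counter_list; c != NULL; c = c->next) {
		if (strcmp(c->name, name) == 0) {
			c->value += n;
			break;
		}
	}
	pthread_mutex_unlock(&metric_lock);
}
```

With more than 30 such calls per INVITE and 32 workers, every request multiplies the pressure on that single lock.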
In order to test this hypothesis, we temporarily removed the Prometheus logic from the Kamailio cfg to see whether the issue still persists. The issue did not show up again after two days of testing, whereas before it had been observed within a few hours.
Possible solutions:
The Prometheus module probably needs some improvements to better support high-load and high-concurrency situations. One common approach is to split the lock, e.g. using a per-process lock or counter array, and then combining the individual values in a second pass when the metrics are read from outside.
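As an illustration of that split-lock idea (this is only a sketch with made-up names, not a patch for the module): each worker increments its own cache-line-padded slot, and the exporter aggregates the slots when the metrics are scraped, so the hot path needs no shared lock at all.

```c
/* Sketch of a sharded counter: one slot per worker process, aggregated
 * on read. The names are invented for this example. */
#include <stdatomic.h>
#include <stdint.h>

#define MAX_WORKERS 64          /* >= number of Kamailio worker processes */

typedef struct {
	_Atomic uint64_t value;
	char pad[64 - sizeof(_Atomic uint64_t)];  /* avoid false sharing */
} counter_slot_t;

typedef struct {
	counter_slot_t slot[MAX_WORKERS];         /* would live in shared memory */
} sharded_counter_t;

/* Hot path: each worker touches only its own slot, no shared lock. */
static inline void sharded_counter_inc(sharded_counter_t *c,
		int worker_rank, uint64_t n)
{
	atomic_fetch_add_explicit(&c->slot[worker_rank].value, n,
			memory_order_relaxed);
}

/* Read path (metrics scrape): combine the per-worker values in a second pass. */
static uint64_t sharded_counter_read(sharded_counter_t *c)
{
	uint64_t sum = 0;
	for (int i = 0; i < MAX_WORKERS; i++)
		sum += atomic_load_explicit(&c->slot[i].value,
				memory_order_relaxed);
	return sum;
}
```

In Kamailio the slot array would have to live in shm and be indexed by the worker's process rank; the scrape path can afford to be slower, since it runs far less often than the increments.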
Alternatively, the xhttp_prom module should be used only with care in high-concurrency setups.
I have the full backtrace available; just let me know if it would be helpful.