Hello list, hope all are doing well!
We are running load tests against our Kamailio server, which is just making inbound and outbound calls, and eventually (there is no identified pattern) Kamailio freezes and of course all calls start to fail. It does not crash; it simply stops responding and has to be killed with -9. When this happens, SIP messages are not processed, DMQ keepalives fail (so the other node reports it as down) and dialog keepalives are not sent, but registrations from UAC still seem to go out (logs from local_route are seen). We don't have a high call rate, at most 3 or 4 CPS, and it reaches around 1900 active calls. We are currently using Kamailio 5.2.8 installed from the repo on a CentOS 7 server. Dialog keepalives are active and DMQ (with 2 workers) is used on an active-active pair.
From the investigation using GDB, as pasted below, I can see the UDP workers are stuck on a lock, either in a callback from t_relay...

#0 0x00007ffb74e9bbf9 in syscall () from /lib64/libc.so.6
#1 0x00007ffb2b1bce08 in futex_get (lock=0x7ffb35217b90) at ../../core/futexlock.h:108
#2 0x00007ffb2b1bec44 in bcast_dmq_message1 (peer=0x7ffb35e8bf38, body=0x7fff2e95ffb0, except=0x0, resp_cback=0x7ffb2a8a0ab0 <dlg_dmq_resp_callback>, max_forwards=1, content_type=0x7ffb2a8a0a70 <dlg_dmq_content_type>, incl_inactive=0) at dmq_funcs.c:156
#3 0x00007ffb2b1bf46b in bcast_dmq_message (peer=0x7ffb35e8bf38, body=0x7fff2e95ffb0, except=0x0, resp_cback=0x7ffb2a8a0ab0 <dlg_dmq_resp_callback>, max_forwards=1, content_type=0x7ffb2a8a0a70 <dlg_dmq_content_type>) at dmq_funcs.c:188
#4 0x00007ffb2a6448fa in dlg_dmq_send (body=0x7fff2e95ffb0, node=0x0) at dlg_dmq.c:88
#5 0x00007ffb2a64da5d in dlg_dmq_replicate_action (action=DLG_DMQ_UPDATE, dlg=0x7ffb362ea3c8, needlock=1, node=0x0) at dlg_dmq.c:628
#6 0x00007ffb2a61f28e in dlg_on_send (t=0x7ffb36c98120, type=16, param=0x7fff2e9601e0) at dlg_handlers.c:739
#7 0x00007ffb2ef285b6 in run_trans_callbacks_internal (cb_lst=0x7ffb36c98198, type=16, trans=0x7ffb36c98120, params=0x7fff2e9601e0) at t_hooks.c:260
#8 0x00007ffb2ef286d0 in run_trans_callbacks (type=16, trans=0x7ffb36c98120, req=0x7ffb742f27e0, rpl=0x0, code=-1) at t_hooks.c:287
#9 0x00007ffb2ef38ac1 in prepare_new_uac (t=0x7ffb36c98120, i_req=0x7ffb742f27e0, branch=0, uri=0x7fff2e9603e0, path=0x7fff2e9603c0, next_hop=0x7ffb742f2a58, fsocket=0x7ffb73e3e968, snd_flags=..., fproto=0, flags=2, instance=0x7fff2e9603b0, ruid=0x7fff2e9603a0, location_ua=0x7fff2e960390) at t_fwd.c:381
#10 0x00007ffb2ef3d02d in add_uac (t=0x7ffb36c98120, request=0x7ffb742f27e0, uri=0x7ffb742f2a58, next_hop=0x7ffb742f2a58, path=0x7ffb742f2e20, proxy=0x0, fsocket=0x7ffb73e3e968, snd_flags=..., proto=0, flags=2, instance=0x7ffb742f2e30, ruid=0x7ffb742f2e48, location_ua=0x7ffb742f2e58) at t_fwd.c:811
#11 0x00007ffb2ef4535a in t_forward_nonack (t=0x7ffb36c98120, p_msg=0x7ffb742f27e0, proxy=0x0, proto=0) at t_fwd.c:1699
#12 0x00007ffb2ef20505 in t_relay_to (p_msg=0x7ffb742f27e0, proxy=0x0, proto=0, replicate=0) at t_funcs.c:334
or loose_route...

#0 0x00007ffb74e9bbf9 in syscall () from /lib64/libc.so.6
#1 0x00007ffb2b1bce08 in futex_get (lock=0x7ffb35217b90) at ../../core/futexlock.h:108
#2 0x00007ffb2b1bec44 in bcast_dmq_message1 (peer=0x7ffb35e8bf38, body=0x7fff2e9629d0, except=0x0, resp_cback=0x7ffb2a8a0ab0 <dlg_dmq_resp_callback>, max_forwards=1, content_type=0x7ffb2a8a0a70 <dlg_dmq_content_type>, incl_inactive=0) at dmq_funcs.c:156
#3 0x00007ffb2b1bf46b in bcast_dmq_message (peer=0x7ffb35e8bf38, body=0x7fff2e9629d0, except=0x0, resp_cback=0x7ffb2a8a0ab0 <dlg_dmq_resp_callback>, max_forwards=1, content_type=0x7ffb2a8a0a70 <dlg_dmq_content_type>) at dmq_funcs.c:188
#4 0x00007ffb2a6448fa in dlg_dmq_send (body=0x7fff2e9629d0, node=0x0) at dlg_dmq.c:88
#5 0x00007ffb2a64da5d in dlg_dmq_replicate_action (action=DLG_DMQ_STATE, dlg=0x7ffb363e0c10, needlock=0, node=0x0) at dlg_dmq.c:628
#6 0x00007ffb2a62b3bf in dlg_onroute (req=0x7ffb742f11d0, route_params=0x7fff2e962ce0, param=0x0) at dlg_handlers.c:1538
#7 0x00007ffb2e7db203 in run_rr_callbacks (req=0x7ffb742f11d0, rr_param=0x7fff2e962d80) at rr_cb.c:96
#8 0x00007ffb2e7eb2f9 in after_loose (_m=0x7ffb742f11d0, preloaded=0) at loose.c:945
#9 0x00007ffb2e7eb990 in loose_route (_m=0x7ffb742f11d0) at loose.c:979
or t_check_trans:

#0 0x00007ffb74e9bbf9 in syscall () from /lib64/libc.so.6
#1 0x00007ffb2a5ea9c6 in futex_get (lock=0x7ffb35e78804) at ../../core/futexlock.h:108
#2 0x00007ffb2a5f1c46 in dlg_lookup_mode (h_entry=1609, h_id=59882, lmode=0) at dlg_hash.c:709
#3 0x00007ffb2a5f27aa in dlg_get_by_iuid (diuid=0x7ffb36326bd0) at dlg_hash.c:777
#4 0x00007ffb2a61ba1d in dlg_onreply (t=0x7ffb36952988, type=2, param=0x7fff2e963bf0) at dlg_handlers.c:437
#5 0x00007ffb2ef285b6 in run_trans_callbacks_internal (cb_lst=0x7ffb36952a00, type=2, trans=0x7ffb36952988, params=0x7fff2e963bf0) at t_hooks.c:260
#6 0x00007ffb2ef286d0 in run_trans_callbacks (type=2, trans=0x7ffb36952988, req=0x7ffb3675c360, rpl=0x7ffb742f1930, code=200) at t_hooks.c:287
#7 0x00007ffb2ee7037f in t_reply_matching (p_msg=0x7ffb742f1930, p_branch=0x7fff2e963ebc) at t_lookup.c:997
#8 0x00007ffb2ee725e4 in t_check_msg (p_msg=0x7ffb742f1930, param_branch=0x7fff2e963ebc) at t_lookup.c:1101
#9 0x00007ffb2eee44c7 in t_check_trans (msg=0x7ffb742f1930) at tm.c:2351
And the DMQ workers are here:

#0 0x00007ffb74e9bbf9 in syscall () from /lib64/libc.so.6
#1 0x00007ffb2b1d6c81 in futex_get (lock=0x7ffb35217c34) at ../../core/futexlock.h:108
#2 0x00007ffb2b1d7c3a in worker_loop (id=1) at worker.c:86
#3 0x00007ffb2b1d5d35 in child_init (rank=0) at dmq.c:300
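A note on what these frames mean: futex_get() is Kamailio's shared-memory lock primitive; it tries to take an atomic flag and, if the lock is already held, parks the process in the futex() syscall until the holder releases it, which is why every blocked process shows syscall() at frame #0. A minimal sketch of that kind of lock is below; it is illustrative only and not the actual code from core/futexlock.h:

/* Minimal sketch of a futex-backed lock, roughly the shape of what the
 * backtraces show. Names and constants are illustrative, not the real
 * Kamailio implementation. */
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdatomic.h>

typedef struct { atomic_int val; } ftlock_t;   /* 0=free, 1=locked, 2=locked+waiters */

static void ft_get(ftlock_t *lock)
{
    int expected = 0;
    /* fast path: grab a free lock without entering the kernel */
    if (atomic_compare_exchange_strong(&lock->val, &expected, 1))
        return;
    /* slow path: mark the lock contended and sleep until the owner wakes us;
     * this FUTEX_WAIT is the syscall() seen at frame #0 in the traces */
    while (atomic_exchange(&lock->val, 2) != 0)
        syscall(SYS_futex, &lock->val, FUTEX_WAIT, 2, NULL, NULL, 0);
}

static void ft_release(ftlock_t *lock)
{
    /* if someone was waiting (value 2), wake one of them up */
    if (atomic_exchange(&lock->val, 0) == 2)
        syscall(SYS_futex, &lock->val, FUTEX_WAKE, 1, NULL, NULL, 0);
}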
Currently I am not able to upgrade to the latest 5.4 version to try to reproduce the error, and since 5.2.8 has already reached end of life, is there perhaps anything I can do in the configuration to avoid this condition? Any ideas are welcome!
Kind regards, Patrick Wakano
Sorry to bother again, but another problem we are facing, which now seems related, is increasing memory usage that looks like a memory leak. During the load test, Kamailio's shmem usage starts growing fast until it runs out of shared memory. Today it happened again and I could retrieve some info; it looks like the "leak" is most likely related to the DMQ workers. The kamcmd shmem stats showed high usage in core and dmq:

# kamcmd mod.stats all shm
Module: core {
    sip_msg_shm_clone(496): 959632704
    counters_prefork_init(207): 61440
    cfg_clone_str(130): 392
    cfg_shmize(217): 1496
    main_loop(1295): 8
    init_pt(113): 8
    init_pt(108): 8
    init_pt(107): 5920
    cfg_parse_str(906): 80
    register_timer(1012): 432
    cfg_register_ctx(47): 96
    init_tcp(4714): 8192
    init_tcp(4708): 32768
    init_tcp(4700): 8
    init_tcp(4693): 8
    init_tcp(4686): 8
    init_tcp(4680): 8
    init_tcp(4668): 8
    init_avps(90): 8
    init_avps(89): 8
    init_dst_blacklist(437): 16384
    init_dst_blacklist(430): 8
    timer_alloc(515): 96
    init_dns_cache(358): 8
    init_dns_cache(350): 16384
    init_dns_cache(343): 16
    init_dns_cache(336): 8
    init_timer(284): 8
    init_timer(283): 16384
    init_timer(282): 8
    init_timer(281): 8
    init_timer(270): 8
    init_timer(238): 8
    init_timer(221): 278544
    init_timer(220): 8
    init_timer(207): 8
    cfg_child_cb_new(828): 64
    sr_cfg_init(360): 8
    sr_cfg_init(353): 8
    sr_cfg_init(346): 8
    sr_cfg_init(334): 8
    sr_cfg_init(322): 8
    qm_shm_lock_init(1202): 8
    Total: 960071600
}
Module: dmq {
    alloc_job_queue(250): 64
    shm_str_dup(723): 48
    build_dmq_node(164): 896
    add_peer(68): 312
    mod_init(240): 8
    mod_init(233): 48
    init_dmq_node_list(70): 24
    init_peer_list(33): 24
    job_queue_push(286): 15369848
    Total: 15371272
}
From GDB I could see both workers were stuck here:
Thread 1 (Thread 0x7f4a3cba7740 (LWP 17401)):
#0 0x00007f4a3c29dbf9 in syscall () from /lib64/libc.so.6
#1 0x00007f49f1a44bdd in futex_get (lock=0x7fff1d14f564) at ../../core/futexlock.h:121
#2 0x00007f49f1a4fe57 in dmq_send_all_dlgs (dmq_node=0x0) at dlg_dmq.c:657
        mypid = 17401
        index = 2543
        entry = {first = 0x0, last = 0x0, next_id = 107271, lock = {val = 2}, locker_pid = {val = 17393}, rec_lock_level = 0}
        dlg = 0x0
        __FUNCTION__ = "dmq_send_all_dlgs"
#3 0x00007f49f1a4c000 in dlg_dmq_handle_msg (msg=0x7f49fe88b2b8, resp=0x7fff1d14f8e0, node=0x7f49fd4be5d8) at dlg_dmq.c:391
#4 0x00007f49f25da34a in worker_loop (id=1) at worker.c:113
#5 0x00007f49f25d7d35 in child_init (rank=0) at dmq.c:300
Thread 1 (Thread 0x7f4a3cba7740 (LWP 17400)):
#0 0x00007f4a3c29dbf9 in syscall () from /lib64/libc.so.6
#1 0x00007f49f1a44bdd in futex_get (lock=0x7fff1d14f564) at ../../core/futexlock.h:121
#2 0x00007f49f1a4fe57 in dmq_send_all_dlgs (dmq_node=0x0) at dlg_dmq.c:657
        mypid = 17400
        index = 1080
        entry = {first = 0x7f49fe575ad0, last = 0x7f49fe575ad0, next_id = 54971, lock = {val = 2}, locker_pid = {val = 17385}, rec_lock_level = 0}
        dlg = 0x0
        __FUNCTION__ = "dmq_send_all_dlgs"
#3 0x00007f49f1a4c000 in dlg_dmq_handle_msg (msg=0x7f49fe437878, resp=0x7fff1d14f8e0, node=0x7f49fd4be5d8) at dlg_dmq.c:391
#4 0x00007f49f25da34a in worker_loop (id=0) at worker.c:113
#5 0x00007f49f25d7d35 in child_init (rank=0) at dmq.c:300
From my analysis, it seems memory was increasing because every new call was adding a new job to the DMQ workers' queue, but the workers were stuck on the mutex and not consuming the jobs. We could not figure out which process held the mutex: from GDB, locker_pid pointed to a UDP worker in one case and a timer process in the other, but both of those processes seemed to be just waiting in normal operation.
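That would also explain the shmem numbers above: every new call keeps pushing work (DMQ jobs and shm-cloned messages) that the blocked processes never consume or free, so shared memory grows with call volume until it is exhausted. Below is a toy producer/consumer sketch of that failure mode; the types and function names are made up for illustration and are not the dmq module's actual job_queue API:

/* Toy producer/consumer illustrating why shared memory can keep growing
 * when the consumers stall. All names are hypothetical. */
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

typedef struct job {
    struct job *next;
    char *cloned_msg;           /* stands in for a shm-cloned SIP message */
    size_t size;
} job_t;

typedef struct {
    job_t *head, *tail;
    size_t queued_bytes;        /* keeps climbing if nothing is ever popped */
    pthread_mutex_t lock;
    pthread_cond_t notify;
} job_queue_t;

/* Producer side (SIP worker): called for every new call/DMQ message; returns quickly. */
void queue_push(job_queue_t *q, const char *msg, size_t len)
{
    job_t *j = malloc(sizeof(*j));          /* shm_malloc() in reality */
    j->cloned_msg = malloc(len);
    memcpy(j->cloned_msg, msg, len);
    j->size = sizeof(*j) + len;
    j->next = NULL;

    pthread_mutex_lock(&q->lock);
    if (q->tail) q->tail->next = j; else q->head = j;
    q->tail = j;
    q->queued_bytes += j->size;             /* grows like job_queue_push() above */
    pthread_cond_signal(&q->notify);
    pthread_mutex_unlock(&q->lock);
}

/* Consumer side (DMQ worker): if handle() blocks forever on some other lock,
 * as in dmq_send_all_dlgs above, the loop never comes back to pop the next
 * job and queued_bytes grows with every call taken since the freeze. */
void worker_loop(job_queue_t *q, void (*handle)(char *msg))
{
    for (;;) {
        pthread_mutex_lock(&q->lock);
        while (q->head == NULL)
            pthread_cond_wait(&q->notify, &q->lock);
        job_t *j = q->head;
        q->head = j->next;
        if (q->head == NULL) q->tail = NULL;
        q->queued_bytes -= j->size;
        pthread_mutex_unlock(&q->lock);

        handle(j->cloned_msg);              /* <-- the workers are stuck in here */
        free(j->cloned_msg);
        free(j);
    }
}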
Also, I didn't create a ticket because 5.2 is deprecated, but I thought it was worth reporting this potential problem to the list, since the DMQ code looks similar between 5.4 and 5.2.
On Tue, 27 Oct 2020 at 09:22, Patrick Wakano pwakano@gmail.com wrote:
Hello,
I have not analysed it in detail, but this indeed looks a bit odd:
job_queue_push(286): 15369848 – 15MB allocated in this function alone.
My suggestion would be to try to reproduce it in your lab/testing environment, and then test a newer version and investigate the reasons for this issue.
As soon as you can reproduce it with a recent 5.3 or 5.4 version, please create a ticket for it.
Cheers,
Henning
-- Henning Westerholt – https://skalatan.de/blog/ Kamailio services – https://gilawa.com/
From: sr-users sr-users-bounces@lists.kamailio.org On Behalf Of Patrick Wakano Sent: Tuesday, October 27, 2020 7:20 AM To: Kamailio (SER) - Users Mailing List sr-users@lists.kamailio.org Subject: Re: [SR-Users] Kamailio unresponsive with Dialog+DMQ
The increase in memory is very likely a side effect of processes not handling the traffic; sip_msg_shm_clone() is mainly used when creating transactions. They can accumulate if the timer processes are stuck or other callback functions block the transaction hash table. As said in the other email, try to get the output of kamctl trap to have information about what all the processes are doing at that moment.
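To illustrate that lifetime: the shm clone of a request lives as long as the transaction that owns it, and it is only given back when the transaction is destroyed, normally after the wait timer fires. A rough sketch follows; the struct and function names are hypothetical and not the tm module's real API:

/* Rough sketch of the lifetime described above; not the tm module's real API. */
#include <stdlib.h>
#include <string.h>

struct cell {                       /* stands in for a tm transaction */
    char  *uas_request;             /* shm clone of the incoming request */
    size_t req_len;
};

/* Created for every new call: the raw request is cloned into (shared)
 * memory and kept for retransmissions, failure routes, callbacks, ... */
struct cell *t_new(const char *raw_msg, size_t len)
{
    struct cell *t = malloc(sizeof(*t));    /* shm_malloc() in reality */
    t->uas_request = malloc(len);
    memcpy(t->uas_request, raw_msg, len);
    t->req_len = len;
    return t;
}

/* The clone is only given back when the transaction itself is destroyed.
 * If the process that would do this (or a callback holding the hash lock)
 * is blocked, this never runs and every call accepted since the freeze
 * leaves another clone behind. */
void t_free(struct cell *t)
{
    free(t->uas_request);
    free(t);
}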
Cheers, Daniel
On 27.10.20 07:20, Patrick Wakano wrote:
Hi Daniel! Thanks for your reply! I managed to get the kamctl trap output; it is attached here. This is a case where Kamailio is unresponsive with the UDP workers stuck on the mutex... I agree with your observations; it goes along with what we were thinking...
Kind regards, Patrick Wakano
On Tue, 27 Oct 2020 at 19:17, Daniel-Constantin Mierla miconda@gmail.com wrote:
Hello, For reference, we updated Kamailio to version 5.4.2 and the problem still persists, so I opened a ticket (https://github.com/kamailio/kamailio/issues/2535) to track the issue.
Thank you, Patrick Wakano
On Wed, 28 Oct 2020 at 10:38, Patrick Wakano pwakano@gmail.com wrote:
Hello,
You say you can reproduce it as you run some load tests, so it is better to get the output of:
kamctl trap
It writes the gdb bt full for all kamailio processes in a file that you can attach here.
All the locks you listed in your email can be a side effect of other blocking operations, because at first sight the lock() inside bcast_dmq_message1() has a corresponding unlock().
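For what it's worth, paired lock()/unlock() calls do not by themselves rule out a deadlock: it is enough that two code paths take two different locks in opposite order. One plausible reading of the traces above is exactly that shape (a SIP worker holding a dialog hash-entry lock while waiting for a DMQ lock, while a DMQ worker in dmq_send_all_dlgs() holds its side and waits for a dialog entry). A minimal sketch of that pattern, with made-up lock names:

/* Minimal AB-BA deadlock: every lock has a matching unlock, yet with
 * unlucky timing the two paths block each other forever. The lock names
 * only mirror the shape suggested by the backtraces, nothing more. */
#include <pthread.h>

pthread_mutex_t dlg_entry_lock = PTHREAD_MUTEX_INITIALIZER; /* per-dialog hash entry */
pthread_mutex_t dmq_lock       = PTHREAD_MUTEX_INITIALIZER; /* dmq peer/job lock     */

/* Path 1: SIP worker in a dialog callback that replicates over DMQ */
void *sip_worker(void *arg)
{
    pthread_mutex_lock(&dlg_entry_lock);   /* dialog entry taken first   */
    pthread_mutex_lock(&dmq_lock);         /* ...then waits here forever */
    /* build and queue the DMQ replication job */
    pthread_mutex_unlock(&dmq_lock);
    pthread_mutex_unlock(&dlg_entry_lock); /* unlocks are all paired...  */
    return NULL;
}

/* Path 2: DMQ worker walking all dialogs to replicate them */
void *dmq_worker(void *arg)
{
    pthread_mutex_lock(&dmq_lock);         /* dmq side taken first       */
    pthread_mutex_lock(&dlg_entry_lock);   /* ...then waits here forever */
    /* iterate the dialog hash table and send each entry */
    pthread_mutex_unlock(&dlg_entry_lock);
    pthread_mutex_unlock(&dmq_lock);       /* ...but nobody gets here    */
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, sip_worker, NULL);
    pthread_create(&b, NULL, dmq_worker, NULL);
    pthread_join(a, NULL);                 /* with unlucky timing: hangs */
    pthread_join(b, NULL);
    return 0;
}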
Cheers, Daniel