[SR-Users] Kamailio unresponsive with Dialog+DMQ

Daniel-Constantin Mierla miconda at gmail.com
Tue Oct 27 09:13:50 CET 2020


Hello,

Since you say you can reproduce it when running load tests, it is better to
get the output of:

kamctl trap

It writes the gdb "bt full" output for all kamailio processes to a file
that you can attach here.

All the locks you listed in your email can be a side effect of another
blocking operation, because at first sight the lock() inside
bcast_dmq_message1() has a corresponding unlock().
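
As a rough aid for reading such a backtrace file, the blocked processes can
be grouped by the lock address they wait on in futex_get(). Below is a
minimal sketch in Python, assuming the file holds plain gdb backtraces
shaped like the ones quoted further down. The group_waiters() helper and
the regular expressions are illustrative only; they are not part of
Kamailio or kamctl.

```python
import re
from collections import defaultdict

# Frame blocked on a lock, for example:
#   #1  0x... in futex_get (lock=0x7ffb35217b90) at ../../core/futexlock.h:108
FUTEX_RE = re.compile(r"in futex_get \(lock=(0x[0-9a-fA-F]+)\)")
# Any frame header: "#2  0x... in bcast_dmq_message1 ..."
FRAME_RE = re.compile(r"#\d+\s+0x[0-9a-fA-F]+ in (\w+)")


def group_waiters(bt_text):
    """Map each lock address to the functions that called futex_get on it."""
    waiters = defaultdict(list)
    pending_lock = None
    for raw in bt_text.splitlines():
        # Drop email quote markers ("> ") and surrounding whitespace.
        line = raw.lstrip("> ").strip()
        lock = FUTEX_RE.search(line)
        if lock:
            # Remember the lock; the next frame header names the caller.
            pending_lock = lock.group(1)
            continue
        frame = FRAME_RE.match(line)
        if frame and pending_lock is not None:
            waiters[pending_lock].append(frame.group(1))
            pending_lock = None
    return dict(waiters)
```

Calling group_waiters(open(path).read()) on the file (path being wherever
the backtraces were saved) gives one entry per lock address. Many callers
on the same address suggest that the process holding that lock is the one
to inspect first.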

Cheers,
Daniel

On 26.10.20 23:22, Patrick Wakano wrote:
> Hello list,
> Hope all are doing well!
>
> We are running load tests in our Kamailio server, that is just making
> inbound and outbound calls and eventually (there is no identified
> pattern) Kamailio freezes and of course all calls start to fail. It
> does not crash, it just stops responding and it has to be killed -9.
> When this happens, SIP messages are not processed, dmq keepalive fails
> (so the other node reports as down), dialog KA are not sent, but
> Registrations from UAC seem to still go out (logs from local_route are
> seen).
> We don't have a high amount of cps, it is max 3 or 4 per sec, and it
> gets around 1900 active calls. We are now using Kamailio 5.2.8
> installed from the repo on a CentOS7 server. Dialog has KA active and
> DMQ (with 2 workers) is being used on an active-active instance.
> From investigation using GDB as pasted below, I can see UDP workers
> are stuck on a lock either on a callback from t_relay...
> #0  0x00007ffb74e9bbf9 in syscall () from /lib64/libc.so.6
> #1  0x00007ffb2b1bce08 in futex_get (lock=0x7ffb35217b90) at
> ../../core/futexlock.h:108
> #2  0x00007ffb2b1bec44 in bcast_dmq_message1 (peer=0x7ffb35e8bf38,
> body=0x7fff2e95ffb0, except=0x0, resp_cback=0x7ffb2a8a0ab0
> <dlg_dmq_resp_callback>, max_forwards=1, content_type=0x7ffb2a8a0a70
> <dlg_dmq_content_type>, incl_inactive=0) at dmq_funcs.c:156
> #3  0x00007ffb2b1bf46b in bcast_dmq_message (peer=0x7ffb35e8bf38,
> body=0x7fff2e95ffb0, except=0x0, resp_cback=0x7ffb2a8a0ab0
> <dlg_dmq_resp_callback>, max_forwards=1, content_type=0x7ffb2a8a0a70
> <dlg_dmq_content_type>) at dmq_funcs.c:188
> #4  0x00007ffb2a6448fa in dlg_dmq_send (body=0x7fff2e95ffb0, node=0x0)
> at dlg_dmq.c:88
> #5  0x00007ffb2a64da5d in dlg_dmq_replicate_action
> (action=DLG_DMQ_UPDATE, dlg=0x7ffb362ea3c8, needlock=1, node=0x0) at
> dlg_dmq.c:628
> #6  0x00007ffb2a61f28e in dlg_on_send (t=0x7ffb36c98120, type=16,
> param=0x7fff2e9601e0) at dlg_handlers.c:739
> #7  0x00007ffb2ef285b6 in run_trans_callbacks_internal
> (cb_lst=0x7ffb36c98198, type=16, trans=0x7ffb36c98120,
> params=0x7fff2e9601e0) at t_hooks.c:260
> #8  0x00007ffb2ef286d0 in run_trans_callbacks (type=16,
> trans=0x7ffb36c98120, req=0x7ffb742f27e0, rpl=0x0, code=-1) at
> t_hooks.c:287
> #9  0x00007ffb2ef38ac1 in prepare_new_uac (t=0x7ffb36c98120,
> i_req=0x7ffb742f27e0, branch=0, uri=0x7fff2e9603e0,
> path=0x7fff2e9603c0, next_hop=0x7ffb742f2a58, fsocket=0x7ffb73e3e968,
> snd_flags=..., fproto=0, flags=2, instance=0x7fff2e9603b0,
> ruid=0x7fff2e9603a0, location_ua=0x7fff2e960390) at t_fwd.c:381
> #10 0x00007ffb2ef3d02d in add_uac (t=0x7ffb36c98120,
> request=0x7ffb742f27e0, uri=0x7ffb742f2a58, next_hop=0x7ffb742f2a58,
> path=0x7ffb742f2e20, proxy=0x0, fsocket=0x7ffb73e3e968, snd_flags=...,
> proto=0, flags=2, instance=0x7ffb742f2e30, ruid=0x7ffb742f2e48,
> location_ua=0x7ffb742f2e58) at t_fwd.c:811
> #11 0x00007ffb2ef4535a in t_forward_nonack (t=0x7ffb36c98120,
> p_msg=0x7ffb742f27e0, proxy=0x0, proto=0) at t_fwd.c:1699
> #12 0x00007ffb2ef20505 in t_relay_to (p_msg=0x7ffb742f27e0, proxy=0x0,
> proto=0, replicate=0) at t_funcs.c:334
>
> or loose_route...
> #0  0x00007ffb74e9bbf9 in syscall () from /lib64/libc.so.6
> #1  0x00007ffb2b1bce08 in futex_get (lock=0x7ffb35217b90) at
> ../../core/futexlock.h:108
> #2  0x00007ffb2b1bec44 in bcast_dmq_message1 (peer=0x7ffb35e8bf38,
> body=0x7fff2e9629d0, except=0x0, resp_cback=0x7ffb2a8a0ab0
> <dlg_dmq_resp_callback>, max_forwards=1, content_type=0x7ffb2a8a0a70
> <dlg_dmq_content_type>, incl_inactive=0) at dmq_funcs.c:156
> #3  0x00007ffb2b1bf46b in bcast_dmq_message (peer=0x7ffb35e8bf38,
> body=0x7fff2e9629d0, except=0x0, resp_cback=0x7ffb2a8a0ab0
> <dlg_dmq_resp_callback>, max_forwards=1, content_type=0x7ffb2a8a0a70
> <dlg_dmq_content_type>) at dmq_funcs.c:188
> #4  0x00007ffb2a6448fa in dlg_dmq_send (body=0x7fff2e9629d0, node=0x0)
> at dlg_dmq.c:88
> #5  0x00007ffb2a64da5d in dlg_dmq_replicate_action
> (action=DLG_DMQ_STATE, dlg=0x7ffb363e0c10, needlock=0, node=0x0) at
> dlg_dmq.c:628
> #6  0x00007ffb2a62b3bf in dlg_onroute (req=0x7ffb742f11d0,
> route_params=0x7fff2e962ce0, param=0x0) at dlg_handlers.c:1538
> #7  0x00007ffb2e7db203 in run_rr_callbacks (req=0x7ffb742f11d0,
> rr_param=0x7fff2e962d80) at rr_cb.c:96
> #8  0x00007ffb2e7eb2f9 in after_loose (_m=0x7ffb742f11d0, preloaded=0)
> at loose.c:945
> #9  0x00007ffb2e7eb990 in loose_route (_m=0x7ffb742f11d0) at loose.c:979
>
> or  t_check_trans:
> #0  0x00007ffb74e9bbf9 in syscall () from /lib64/libc.so.6
> #1  0x00007ffb2a5ea9c6 in futex_get (lock=0x7ffb35e78804) at
> ../../core/futexlock.h:108
> #2  0x00007ffb2a5f1c46 in dlg_lookup_mode (h_entry=1609, h_id=59882,
> lmode=0) at dlg_hash.c:709
> #3  0x00007ffb2a5f27aa in dlg_get_by_iuid (diuid=0x7ffb36326bd0) at
> dlg_hash.c:777
> #4  0x00007ffb2a61ba1d in dlg_onreply (t=0x7ffb36952988, type=2,
> param=0x7fff2e963bf0) at dlg_handlers.c:437
> #5  0x00007ffb2ef285b6 in run_trans_callbacks_internal
> (cb_lst=0x7ffb36952a00, type=2, trans=0x7ffb36952988,
> params=0x7fff2e963bf0) at t_hooks.c:260
> #6  0x00007ffb2ef286d0 in run_trans_callbacks (type=2,
> trans=0x7ffb36952988, req=0x7ffb3675c360, rpl=0x7ffb742f1930,
> code=200) at t_hooks.c:287
> #7  0x00007ffb2ee7037f in t_reply_matching (p_msg=0x7ffb742f1930,
> p_branch=0x7fff2e963ebc) at t_lookup.c:997
> #8  0x00007ffb2ee725e4 in t_check_msg (p_msg=0x7ffb742f1930,
> param_branch=0x7fff2e963ebc) at t_lookup.c:1101
> #9  0x00007ffb2eee44c7 in t_check_trans (msg=0x7ffb742f1930) at tm.c:2351
>
> And the DMQ workers are here:
> #0  0x00007ffb74e9bbf9 in syscall () from /lib64/libc.so.6
> #1  0x00007ffb2b1d6c81 in futex_get (lock=0x7ffb35217c34) at
> ../../core/futexlock.h:108
> #2  0x00007ffb2b1d7c3a in worker_loop (id=1) at worker.c:86
> #3  0x00007ffb2b1d5d35 in child_init (rank=0) at dmq.c:300
>
> Currently I will not be able to upgrade to latest 5.4 version to try
> to reproduce the error and since 5.2.8 has already reached
> end-of-life, maybe is there anything I can do on the configuration to
> avoid such condition?
> Any ideas are welcome!
>
> Kind regards,
> Patrick Wakano
>
> _______________________________________________
> Kamailio (SER) - Users Mailing List
> sr-users at lists.kamailio.org
> https://lists.kamailio.org/cgi-bin/mailman/listinfo/sr-users

-- 
Daniel-Constantin Mierla -- www.asipto.com
www.twitter.com/miconda -- www.linkedin.com/in/miconda
Funding: https://www.paypal.me/dcmierla
