### Description
When running a load test, Kamailio eventually becomes unresponsive and stops processing calls. Kamailio is configured to use DMQ replication for dialog and usrloc, and the dialog keepalive is enabled.
### Troubleshooting
From the investigation, the problem happens faster and more easily when there is some network degradation causing packet loss and/or retransmissions, but even without any noticeable network issue the freeze eventually happens.
#### Reproduction
Run a simple load test making calls at a rate of ~5 cps and keeping ~2000 calls connected all the time. A higher cps seems to make the problem easier to reproduce. Adding network degradation to the environment makes the problem happen; when using a tool such as SIPp for the load test, retransmissions can also be forced by simply killing the SIPp instance receiving the calls, which then forces Kamailio to retransmit.
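For reference, a minimal sketch of such a test using SIPp's built-in scenarios (an assumed invocation; the exact scenario files and numbers used in the original test are not part of this report). At ~5 cps with each call held for ~400 seconds, roughly 2000 calls stay connected; killing the UAS side mid-test forces Kamailio to retransmit.

```
# UAS side, behind Kamailio: answers and holds the calls.
# Kill this process mid-test to force retransmissions on Kamailio's side.
sipp -sn uas -i <uas-ip> -p 5080

# UAC side: send calls through Kamailio at ~5 cps, hold each call ~400 s,
# capping at 2000 concurrent calls.
sipp -sn uac <kamailio-ip>:5060 -s <called-number> -r 5 -d 400000 -l 2000
```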
#### Debugging Data
Output of `kamctl trap`: [gdb_kamailio_20201028_213030.txt](https://github.com/kamailio/kamailio/files/5455020/gdb_kamailio_20201028_213...)
#### Log Messages
Locally generated requests show up in the log, but are not sent on the network.
### Possible Solutions
None found so far.
### Additional Information
* **Kamailio Version** - output of `kamailio -v`
```
# kamailio -v
version: kamailio 5.4.2 (x86_64/linux) c3b91f
flags: USE_TCP, USE_TLS, USE_SCTP, TLS_HOOKS, USE_RAW_SOCKS, DISABLE_NAGLE, USE_MCAST, DNS_IP_HACK, SHM_MMAP, PKG_MALLOC, Q_MALLOC, F_MALLOC, TLSF_MALLOC, DBG_SR_MEMORY, USE_FUTEX, FAST_LOCK-ADAPTIVE_WAIT, USE_DNS_CACHE, USE_DNS_FAILOVER, USE_NAPTR, USE_DST_BLACKLIST, HAVE_RESOLV_RES
ADAPTIVE_WAIT_LOOPS 1024, MAX_RECV_BUFFER_SIZE 262144, MAX_URI_SIZE 1024, BUF_SIZE 65535, DEFAULT PKG_SIZE 8MB
poll method support: poll, epoll_lt, epoll_et, sigio_rt, select.
id: c3b91f
compiled on 13:50:37 Oct 27 2020 with gcc 4.8.5
```
* **Operating System**:
```
# cat /etc/centos-release
CentOS Linux release 7.8.2003 (Core)
# uname -a
Linux hostname 3.10.0-1127.19.1.el7.x86_64 #1 SMP Tue Aug 25 17:23:54 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
```
Thanks for reproducing it with a recent version and opening the issue.
Hello, sorry for being pushy, but I am hoping someone could acknowledge the problem or tell me whether some configuration on my side is causing the issue. This is a major problem, worse than a crash: a manual intervention with `kill -9` is necessary to restore the service.
There is not much configuration needed for DMQ, so my guess is that it is either something related to the particular environment or a bug in the code. Maybe @charlesrchance can comment on it.
Thanks henningw! I was just wondering whether some setting in DMQ or dialog could be causing the issue, because it looks like others in the community are using DMQ + dialog without problems.
Can you try a couple of variants to isolate the cause? Try each one independently of the others (see the config sketch after the list).
1) use `modparam("dmq", "num_workers", 1)`
2) use `modparam("dmq", "worker_usleep", 1000)`
3) use `modparam("dialog", "track_cseq_updates", 1)` - this may break authentication done with `uac` module, but I understood you test and can reproduce it in the lab
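For clarity, the same suggestions collected as a kamailio.cfg fragment (a sketch only; module loading and the rest of the configuration are assumed from the reporter's setup, and the `track_cseq_updates` value reflects the correction later in the thread, where `0` is what disables the feature):

```
loadmodule "dmq.so"
loadmodule "dialog.so"

# variant 1: run a single DMQ worker process
modparam("dmq", "num_workers", 1)

# variant 2: make the DMQ worker sleep between jobs (microseconds)
modparam("dmq", "worker_usleep", 1000)

# variant 3: disable CSeq tracking in the dialog module (0 = disabled)
modparam("dialog", "track_cseq_updates", 0)
```

Each variant should be tested on its own, reverting the previous change before the next run.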
Just to copy from the list discussion: the use of the vSphere Fault Tolerance feature is identified as a source of network latency and probably also CPU allocation latency, which increases the chances of the mentioned issue happening.
The investigation done so far regarding the deadlock shows that two processes entered a deadlock: a UDP worker and the timer process. The timer process was trying to delete a dialog reference, got the dialog lock (the d_table->entry lock) and then tried to get the DMQ node list lock. On the other hand, the UDP worker was trying to broadcast a DMQ message, got the DMQ node list lock and then tried to get the d_table entry lock for the same dialog as the timer process.

```
N 21068 main process - attendant
Y 21247 udp receiver child=0 sock=124.47.168.242:5060 TRYING TO GET LOCK 0x7f03675bdc48
Y 21248 udp receiver child=1 sock=124.47.168.242:5060 TRYING TO GET LOCK 0x7f03675bdc48
Y 21249 udp receiver child=2 sock=124.47.168.242:5060 TRYING TO GET LOCK 0x7f03675bdc48
Y 21250 udp receiver child=3 sock=124.47.168.242:5060 TRYING TO GET LOCK 0x7f03675bdc48
Y 21251 udp receiver child=4 sock=124.47.168.242:5060 TRYING TO GET LOCK 0x7f03675bdc48
Y 21252 udp receiver child=5 sock=124.47.168.242:5060 TRYING TO GET LOCK 0x7f03675bdc48
Y 21253 udp receiver child=6 sock=124.47.168.242:5060 TRYING TO GET LOCK 0x7f036822441c HAS LOCK 0x7f03675bdc48
Y 21254 udp receiver child=7 sock=124.47.168.242:5060
Y 21255 udp receiver child=0 sock=10.51.1.72:5060
Y 21257 udp receiver child=1 sock=10.51.1.72:5060
Y 21258 udp receiver child=2 sock=10.51.1.72:5060
Y 21260 udp receiver child=3 sock=10.51.1.72:5060
Y 21261 udp receiver child=4 sock=10.51.1.72:5060
Y 21262 udp receiver child=5 sock=10.51.1.72:5060
Y 21263 udp receiver child=6 sock=10.51.1.72:5060
Y 21264 udp receiver child=7 sock=10.51.1.72:5060
N 21265 slow timer
Y 21266 timer TRYING TO GET LOCK 0x7f03675bdc48 HAS LOCK 0x7f036822441c
N 21267 secondary timer
N 21268 ctl handler
N 21269 JSONRPCS FIFO
Y 21270 DMQ WORKER TRYING TO GET LOCK 0x7f03675bdcd4
Y 21271 DMQ WORKER TRYING TO GET LOCK 0x7f03675bdcec
Y 21273 Dialog KA Timer
Y 21275 Dialog Clean Timer TRYING TO GET LOCK 0x7f036822441c
N 21278 WEBSOCKET KEEPALIVE
N 21279 WEBSOCKET TIMER
N 21281 TIMER NH
N 21282 TIMER NH
Y 21283 TIMER UAC REG
N 21284 tcp receiver (generic) child=0
N 21285 tcp receiver (generic) child=1
N 21286 tcp receiver (generic) child=2
N 21287 tcp receiver (generic) child=3
N 21289 tcp receiver (generic) child=4
N 21291 tcp receiver (generic) child=5
N 21292 tcp receiver (generic) child=6
N 21293 tcp receiver (generic) child=7
N 21294 tcp main process
```

21266 # timer BT:

```
#0  0x00007f0386d30bf9 in syscall () from /lib64/libc.so.6
#1  0x00007f035d4effa9 in futex_get (lock=0x7f03675bdc48) at ../../core/futexlock.h:121  <- TRYING TO GET LOCK 0x7f03675bdc48
#2  0x00007f035d4f205b in bcast_dmq_message1 (peer=0x7f0368231ff0, body=0x7ffe95bc0920, except=0x0, resp_cback=0x7f035cbc9af0 <dlg_dmq_resp_callback>, max_forwards=1, content_type=0x7f035cbc9ab0 <dlg_dmq_content_type>, incl_inactive=0) at dmq_funcs.c:156
#3  0x00007f035d4f28d4 in bcast_dmq_message (peer=0x7f0368231ff0, body=0x7ffe95bc0920, except=0x0, resp_cback=0x7f035cbc9af0 <dlg_dmq_resp_callback>, max_forwards=1, content_type=0x7f035cbc9ab0 <dlg_dmq_content_type>) at dmq_funcs.c:188
#4  0x00007f035c96704d in dlg_dmq_send (body=0x7ffe95bc0920, node=0x0) at dlg_dmq.c:88
#5  0x00007f035c9709d5 in dlg_dmq_replicate_action (action=DLG_DMQ_RM, dlg=0x7f03684b77f8, needlock=0, node=0x0) at dlg_dmq.c:628
#6  0x00007f035c90a39a in destroy_dlg (dlg=0x7f03684b77f8) at dlg_hash.c:367
#7  0x00007f035c914064 in dlg_unref_helper (dlg=0x7f03684b77f8, cnt=2, fname=0x7f035c9b222d "dlg_handlers.c", fline=1212) at dlg_hash.c:1081  << --- GOT LOCK 0x7f036822441c
(gdb) p *d_entry
$2 = {first = 0x7f0369205248, last = 0x7f036855c680, next_id = 30442, lock = {val = 2}, locker_pid = {val = 21266}, rec_lock_level = 0}
(gdb) p dlg->h_entry
$3 = 2340
(gdb) p &d_entry->lock
$5 = (gen_lock_t *) 0x7f036822441c
#8  0x00007f035c945bd2 in unref_dlg_from_cb (t=0x7f036936b678, type=131072, param=0x7ffe95bc0e20) at dlg_handlers.c:1212
#9  0x00007f03612d3165 in run_trans_callbacks_internal (cb_lst=0x7f036936b6f0, type=131072, trans=0x7f036936b678, params=0x7ffe95bc0e20) at t_hooks.c:258
#10 0x00007f03612d3297 in run_trans_callbacks (type=131072, trans=0x7f036936b678, req=0x0, rpl=0x0, code=0) at t_hooks.c:285
#11 0x00007f03611f9b8f in free_cell_helper (dead_cell=0x7f036936b678, silent=0, fname=0x7f0361313363 "timer.c", fline=643) at h_table.c:165
#12 0x00007f03612bfc4d in wait_handler (ti=1856451739, wait_tl=0x7f036936b700, data=0x7f036936b678) at timer.c:643
#13 0x00000000004ea094 in timer_list_expire (t=1856451739, h=0x7f036587d380, slow_l=0x7f036587f948, slow_mark=5689) at core/timer.c:857
#14 0x00000000004ea53d in timer_handler () at core/timer.c:922
#15 0x00000000004ea9e7 in timer_main () at core/timer.c:961
#16 0x000000000042a838 in main_loop () at main.c:1753
#17 0x0000000000433a96 in main (argc=10, argv=0x7ffe95bc1968) at main.c:2856
```

21253 # udp receiver child=6 sock=124.47.168.242:5060 BT:

```
#0  0x00007f0386d30bf9 in syscall () from /lib64/libc.so.6
#1  0x00007f035c9061a7 in futex_get (lock=0x7f036822441c) at ../../core/futexlock.h:121
#2  0x00007f035c90ee00 in dlg_lookup_mode (h_entry=2340, h_id=30422, lmode=0) at dlg_hash.c:781  <- TRYING TO GET LOCK HERE
(gdb) p *d_entry
$1 = {first = 0x7f0369205248, last = 0x7f036855c680, next_id = 30442, lock = {val = 2}, locker_pid = {val = 21266}, rec_lock_level = 0}
(gdb) p h_entry
$2 = 2340
#3  0x00007f035c90fa27 in dlg_get_by_iuid (diuid=0x7f035cbcbd40 <_dlg_ctx+64>) at dlg_hash.c:849
#4  0x00007f035c9abfd1 in dlg_get_ctx_dialog () at dlg_var.c:940
#5  0x00007f035c945c03 in dlg_lookup_msg_dialog (msg=0x7ffe95bbc8c0, dir=0x7ffe95bbc8bc) at dlg_handlers.c:1227
#6  0x00007f035c956f77 in dlg_cseq_msg_sent (evp=0x7ffe95bbd2d0) at dlg_cseq.c:393
#7  0x00000000004ec14b in sr_event_exec (type=2, evp=0x7ffe95bbd2d0) at core/events.c:240
#8  0x00007f03612c1e88 in msg_send_buffer (dst=0x7f036a0a4870, buf=0x7f036bfa1af8 "KDMQ sip:dialog@10.51.1.71:5060 SIP/2.0\r\nVia: SIP/2.0/UDP 10.51.1.72;branch=z9hG4bK6492.a407a601", '0' <repeats 24 times>, ".0\r\nTo: sip:dialog@10.51.1.71:5060\r\nFrom: sip:dialog@10.51.1.72:5060;tag=cf4"..., len=703, flags=0) at ../../core/forward.h:148
#9  0x00007f03612c743e in send_pr_buffer (rb=0x7f036a0a4820, buf=0x7f036bfa1af8, len=703) at t_funcs.c:69
#10 0x00007f036128d315 in send_prepared_request_impl (request=0x7f036a0a4820, retransmit=1, branch=0) at uac.c:669
#11 0x00007f036128e89e in t_uac_with_ids (uac_r=0x7ffe95bbd8f0, ret_index=0x0, ret_label=0x0) at uac.c:753
#12 0x00007f036128e7ca in t_uac (uac_r=0x7ffe95bbd8f0) at uac.c:721
#13 0x00007f0361291a62 in request (uac_r=0x7ffe95bbd8f0, ruri=0x7ffe95bbd8c0, to=0x7ffe95bbd8c0, from=0x7ffe95bbd8d0, next_hop=0x0) at uac.c:1089
#14 0x00007f035d4f4053 in dmq_send_message (peer=0x7f0368231ff0, body=0x7ffe95bbdca0, node=0x7f0368464068, resp_cback=0x7f035cbc9af0 <dlg_dmq_resp_callback>, max_forwards=1, content_type=0x7f035cbc9ab0 <dlg_dmq_content_type>) at dmq_funcs.c:251
#15 0x00007f035d4f24ff in bcast_dmq_message1 (peer=0x7f0368231ff0, body=0x7ffe95bbdca0, except=0x0, resp_cback=0x7f035cbc9af0 <dlg_dmq_resp_callback>, max_forwards=1, content_type=0x7f035cbc9ab0 <dlg_dmq_content_type>, incl_inactive=0) at dmq_funcs.c:170  <----- GOT LOCK 0x7f03675bdc48
(gdb) p &dmq_node_list->lock
$1 = (gen_lock_t *) 0x7f03675bdc48
#16 0x00007f035d4f28d4 in bcast_dmq_message (peer=0x7f0368231ff0, body=0x7ffe95bbdca0, except=0x0, resp_cback=0x7f035cbc9af0 <dlg_dmq_resp_callback>, max_forwards=1, content_type=0x7f035cbc9ab0 <dlg_dmq_content_type>) at dmq_funcs.c:188
#17 0x00007f035c96704d in dlg_dmq_send (body=0x7ffe95bbdca0, node=0x0) at dlg_dmq.c:88
#18 0x00007f035c9709d5 in dlg_dmq_replicate_action (action=DLG_DMQ_STATE, dlg=0x7f0369205248, needlock=0, node=0x0) at dlg_dmq.c:628
#19 0x00007f035c94ba78 in dlg_onroute (req=0x7f03860006e0, route_params=0x7ffe95bbe030, param=0x0) at dlg_handlers.c:1559
#20 0x00007f0360b665c6 in run_rr_callbacks (req=0x7f03860006e0, rr_param=0x7ffe95bbe120) at rr_cb.c:96
#21 0x00007f0360b799b2 in after_loose (_m=0x7f03860006e0, preloaded=0) at loose.c:984
#22 0x00007f0360b7a0dc in loose_route (_m=0x7f03860006e0) at loose.c:1018
#23 0x00007f0360b7e240 in w_loose_route (msg=0x7f03860006e0, p1=0x0, p2=0x0) at rr_mod.c:276
#24 0x00000000005d0bc5 in do_action (h=0x7ffe95bbecd0, a=0x7f0385daa880, msg=0x7f03860006e0) at core/action.c:1076
#25 0x00000000005ddf78 in run_actions (h=0x7ffe95bbecd0, a=0x7f0385daa880, msg=0x7f03860006e0) at core/action.c:1581
#26 0x00000000005de5f4 in run_actions_safe (h=0x7ffe95bc09f0, a=0x7f0385daa880, msg=0x7f03860006e0) at core/action.c:1645
#27 0x0000000000446834 in rval_get_int (h=0x7ffe95bc09f0, msg=0x7f03860006e0, i=0x7ffe95bbf5fc, rv=0x7f0385daaac0, cache=0x0) at core/rvalue.c:915
#28 0x000000000044b365 in rval_expr_eval_int (h=0x7ffe95bc09f0, msg=0x7f03860006e0, res=0x7ffe95bbf5fc, rve=0x7f0385daaab8) at core/rvalue.c:1913
#29 0x00000000005d06a6 in do_action (h=0x7ffe95bc09f0, a=0x7f0385db3280, msg=0x7f03860006e0) at core/action.c:1052
#30 0x00000000005ddf78 in run_actions (h=0x7ffe95bc09f0, a=0x7f0385da9ee8, msg=0x7f03860006e0) at core/action.c:1581
#31 0x00000000005cd315 in do_action (h=0x7ffe95bc09f0, a=0x7f0385d24ea0, msg=0x7f03860006e0) at core/action.c:700
#32 0x00000000005ddf78 in run_actions (h=0x7ffe95bc09f0, a=0x7f0385d24c68, msg=0x7f03860006e0) at core/action.c:1581
#33 0x00000000005d0b34 in do_action (h=0x7ffe95bc09f0, a=0x7f0385d3e420, msg=0x7f03860006e0) at core/action.c:1067
#34 0x00000000005ddf78 in run_actions (h=0x7ffe95bc09f0, a=0x7f0385d1ae68, msg=0x7f03860006e0) at core/action.c:1581
#35 0x00000000005de6bc in run_top_route (a=0x7f0385d1ae68, msg=0x7f03860006e0, c=0x0) at core/action.c:1666
#36 0x00000000005e5c40 in receive_msg (buf=0xabcb80 <buf.7133> "ACK sip:61283214906@10.51.1.50:5060;alias=10.51.1.50~5060~1 SIP/2.0\r\nVia: SIP/2.0/UDP 124.47.168.253:6080;branch=z9hG4bK-18807-894-4\r\nFrom: test_894 sip:61289993111@124.47.168.253:6080;tag=18807SIPp"..., len=724, rcv_info=0x7ffe95bc10e0) at core/receive.c:423
#37 0x00000000004c1442 in udp_rcv_loop () at core/udp_server.c:543
#38 0x0000000000429d47 in main_loop () at main.c:1683
#39 0x0000000000433a96 in main (argc=10, argv=0x7ffe95bc1968) at main.c:2856
```
Full kamctl trap: [gdb_kamailio_20201102_110450.txt](https://github.com/kamailio/kamailio/files/5478787/gdb_kamailio_20201102_110...)
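For readers following along, here is a minimal standalone sketch of the lock-order inversion described above, using plain pthreads instead of Kamailio's shared-memory futex locks (so it is only an illustration, not Kamailio code): one thread mimics the timer path (dialog entry lock, then DMQ node-list lock), the other mimics the UDP worker path (DMQ node-list lock, then dialog entry lock). When run, both threads block forever, which is exactly the observed freeze.

```c
/* Illustration only -- NOT Kamailio code. Mimics the two call paths:
 *   timer path      : dialog entry lock  -> DMQ node list lock
 *   udp worker path : DMQ node list lock -> dialog entry lock
 * Build: gcc -o inversion inversion.c -lpthread
 * Running it deadlocks on purpose.
 */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t dlg_entry_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t dmq_node_list_lock = PTHREAD_MUTEX_INITIALIZER;

static void *timer_path(void *arg)
{
	(void)arg;
	/* destroy_dlg() runs with the dialog entry lock already held ... */
	pthread_mutex_lock(&dlg_entry_lock);
	sleep(1); /* widen the race window so the inversion always triggers */
	/* ... then bcast_dmq_message1() wants the DMQ node list lock */
	pthread_mutex_lock(&dmq_node_list_lock);
	puts("timer path: got both locks (never printed once deadlocked)");
	pthread_mutex_unlock(&dmq_node_list_lock);
	pthread_mutex_unlock(&dlg_entry_lock);
	return NULL;
}

static void *udp_worker_path(void *arg)
{
	(void)arg;
	/* bcast_dmq_message1() holds the DMQ node list lock ... */
	pthread_mutex_lock(&dmq_node_list_lock);
	sleep(1);
	/* ... then dlg_cseq_msg_sent() -> dlg_lookup_mode() wants the entry lock */
	pthread_mutex_lock(&dlg_entry_lock);
	puts("udp worker path: got both locks (never printed once deadlocked)");
	pthread_mutex_unlock(&dlg_entry_lock);
	pthread_mutex_unlock(&dmq_node_list_lock);
	return NULL;
}

int main(void)
{
	pthread_t t1, t2;
	pthread_create(&t1, NULL, timer_path, NULL);
	pthread_create(&t2, NULL, udp_worker_path, NULL);
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);
	return 0;
}
```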
Hi Daniel, thanks for your attention! Regarding the tests you suggested, we ran each change individually; neither the `num_workers` nor the `worker_usleep` change helped, and the problem still happened. The `track_cseq_updates` test is still running, already for 7 hours without hitting the issue, whereas the freeze usually happened within the first 2 hours of testing. Also, considering the previous analysis we did, we can see the function dlg_cseq_msg_sent being called when the deadlock happened, so I guess this is strong evidence there is something wrong with this feature + DMQ? I also saw this commit https://github.com/kamailio/kamailio/commit/a309a0ee48278923871197391a30d485... from yesterday; is this potentially a fix for this problem?
Actually, I forgot to mention that `track_cseq_updates` was set to 0, not 1 as in your comment; 0 is what breaks the authentication, and it was already set to 1 in our configuration anyway.
Indeed, `track_cseq_updates` had to be set to `0`; I forgot to update it after copy&paste from the readme (I updated my comment for future reference).
My commit from yesterday was because I discovered that CSeq update processing was done for DMQ requests. It is useless there, but I did not get the time to dig further into whether that can expose any issue.
I suggested disabling it because it was being executed and it is not a common deployment, so I wanted to isolate whether it was somehow leading to the deadlock.
And indeed, after looking further in this direction, helped also by your troubleshooting findings posted in a comment above, the processing of CSeq updates for DMQ was leading to the deadlock.
The commit I pushed yesterday (which was backported to the 5.4 and 5.3 branches as well) should fix it. Can you try with it and `track_cseq_updates` set back to 1?
Hi Daniel, thanks for your review, appreciate it! We will test that change tomorrow, but isn't it then also possible to hit the deadlock when receiving/processing an INVITE?
I cannot say that there are no other deadlocks, but this case is not going to happen anymore when processing an INVITE or other requests that are part of the dialog.
By trying to update the CSeq in the locally generated DMQ request (which is used to propagate the dialog states to other peers), it ended up in a spiral of accessing both the dialog and the DMQ peer lists.
Hi Daniel, we've run the test, but unfortunately the problem happened again after about 5 hours. The UDP receiver had the DMQ node list lock (0x7f28bd321478) and tried to get the dialog lock (0x7f28bdf8b5b4), while the timer process had the dialog lock (0x7f28bdf8b5b4) and tried to get the DMQ node list lock (0x7f28bd321478)...
Check the backtraces of the processes in deadlock:
1331 udp receiver child=3 sock=45.116.131.61:5060

```
#0  0x00007f28dca971c9 in syscall () from /lib64/libc.so.6
#1  0x00007f28b269e5cd in futex_get (lock=0x7f28bdf8b5b4) at ../../core/mem/../futexlock.h:121
#2  0x00007f28b26a7226 in dlg_lookup_mode (h_entry=2827, h_id=70761, lmode=0) at dlg_hash.c:781
#3  0x00007f28b26a7e4d in dlg_get_by_iuid (diuid=0x7f28b2958d40 <_dlg_ctx+64>) at dlg_hash.c:849
#4  0x00007f28b27393e0 in dlg_get_ctx_dialog () at dlg_var.c:940
#5  0x00007f28b26c44fb in dlg_lookup_msg_dialog (msg=0x7ffe79df6bd0, dir=0x7ffe79df6bcc) at dlg_handlers.c:1227
#6  0x00007f28b26e4386 in dlg_cseq_msg_sent (evp=0x7ffe79df75e0) at dlg_cseq.c:400
(gdb) p ((struct to_body*)(&msg)->to->parsed)->tag_value.len
$1 = 0
(gdb) p msg.first_line.u.request.method_value
$2 = 16384
(gdb) p *(((struct to_body*)(&msg)->to->parsed))
$9 = {error = 1,
  body = {s = 0x7f28dbbb0714 "sip:dialog@10.61.187.66:5060\r\nFrom: sip:dialog@10.61.187.75:5060;tag=9bf3c67f7017dfd69402f08ee6e7069c-d6bc8b8a\r\nCSeq: 10 KDMQ\r\nCall-ID: 4a4bd13913afec15-1331@45.116.131.61\r\nContent-Length: 310\r\nUs"..., len = 32},
  uri = {s = 0x7f28dbbb0715 "sip:dialog@10.61.187.66:5060>\r\nFrom: sip:dialog@10.61.187.75:5060;tag=9bf3c67f7017dfd69402f08ee6e7069c-d6bc8b8a\r\nCSeq: 10 KDMQ\r\nCall-ID: 4a4bd13913afec15-1331@45.116.131.61\r\nContent-Length: 310\r\nUse"..., len = 28},
  display = {s = 0x0, len = 0}, tag_value = {s = 0x0, len = 0},
  parsed_uri = {user = {s = 0x0, len = 0}, passwd = {s = 0x0, len = 0}, host = {s = 0x0, len = 0}, port = {s = 0x0, len = 0}, params = {s = 0x0, len = 0}, sip_params = {s = 0x0, len = 0}, headers = {s = 0x0, len = 0}, port_no = 0, proto = 0, type = ERROR_URI_T, flags = (unknown: 0), transport = {s = 0x0, len = 0}, ttl = {s = 0x0, len = 0}, user_param = {s = 0x0, len = 0}, maddr = {s = 0x0, len = 0}, method = {s = 0x0, len = 0}, lr = {s = 0x0, len = 0}, r2 = {s = 0x0, len = 0}, gr = {s = 0x0, len = 0}, transport_val = {s = 0x0, len = 0}, ttl_val = {s = 0x0, len = 0}, user_param_val = {s = 0x0, len = 0}, maddr_val = {s = 0x0, len = 0}, method_val = {s = 0x0, len = 0}, lr_val = {s = 0x0, len = 0}, r2_val = {s = 0x0, len = 0}, gr_val = {s = 0x0, len = 0}},
  param_lst = 0x0, last_param = 0x0}
#7  0x0000000000503088 in sr_event_exec (type=2, evp=0x7ffe79df75e0) at core/events.c:240
#8  0x00007f28b7035a43 in msg_send_buffer (dst=0x7f28bf3a8c28, buf=0x7f28bee53140 "KDMQ sip:dialog@10.61.187.66:5060 SIP/2.0\r\nVia: SIP/2.0/UDP 10.61.187.75;branch=z9hG4bK4646.29f4a7c1", '0' <repeats 24 times>, ".0\r\nTo: sip:dialog@10.61.187.66:5060\r\nFrom: sip:dialog@10.61.187.75:5060"..., len=716, flags=0) at ../../core/forward.h:148
#9  0x00007f28b703aff9 in send_pr_buffer (rb=0x7f28bf3a8bd8, buf=0x7f28bee53140, len=716) at t_funcs.c:69
#10 0x00007f28b700d8eb in send_prepared_request_impl (request=0x7f28bf3a8bd8, retransmit=1, branch=0) at uac.c:669
#11 0x00007f28b700ee74 in t_uac_with_ids (uac_r=0x7ffe79df7c00, ret_index=0x0, ret_label=0x0) at uac.c:753
#12 0x00007f28b700eda0 in t_uac (uac_r=0x7ffe79df7c00) at uac.c:721
#13 0x00007f28b7012038 in request (uac_r=0x7ffe79df7c00, ruri=0x7ffe79df7bd0, to=0x7ffe79df7bd0, from=0x7ffe79df7be0, next_hop=0x0) at uac.c:1089
#14 0x00007f28b3281053 in dmq_send_message (peer=0x7f28bdf954a8, body=0x7ffe79df7fb0, node=0x7f28be1bf270, resp_cback=0x7f28b2956af0 <dlg_dmq_resp_callback>, max_forwards=1, content_type=0x7f28b2956ab0 <dlg_dmq_content_type>) at dmq_funcs.c:251
#15 0x00007f28b327f4ff in bcast_dmq_message1 (peer=0x7f28bdf954a8, body=0x7ffe79df7fb0, except=0x0, resp_cback=0x7f28b2956af0 <dlg_dmq_resp_callback>, max_forwards=1, content_type=0x7f28b2956ab0 <dlg_dmq_content_type>, incl_inactive=0) at dmq_funcs.c:170
(gdb) p &dmq_node_list->lock
$1 = (gen_lock_t *) 0x7f28bd321478
#16 0x00007f28b327f8d4 in bcast_dmq_message (peer=0x7f28bdf954a8, body=0x7ffe79df7fb0, except=0x0, resp_cback=0x7f28b2956af0 <dlg_dmq_resp_callback>, max_forwards=1, content_type=0x7f28b2956ab0 <dlg_dmq_content_type>) at dmq_funcs.c:188
#17 0x00007f28b26f445c in dlg_dmq_send (body=0x7ffe79df7fb0, node=0x0) at dlg_dmq.c:88
#18 0x00007f28b26fdde4 in dlg_dmq_replicate_action (action=DLG_DMQ_STATE, dlg=0x7f28be36a058, needlock=0, node=0x0) at dlg_dmq.c:628
#19 0x00007f28b26ca370 in dlg_onroute (req=0x7f28dbee2088, route_params=0x7ffe79df8340, param=0x0) at dlg_handlers.c:1559
#20 0x00007f28b68f35c6 in run_rr_callbacks (req=0x7f28dbee2088, rr_param=0x7ffe79df8430) at rr_cb.c:96
#21 0x00007f28b69069b2 in after_loose (_m=0x7f28dbee2088, preloaded=0) at loose.c:984
#22 0x00007f28b69070dc in loose_route (_m=0x7f28dbee2088) at loose.c:1018
#23 0x00007f28b690b240 in w_loose_route (msg=0x7f28dbee2088, p1=0x0, p2=0x0) at rr_mod.c:276
#24 0x0000000000472e48 in do_action (h=0x7ffe79df8fe0, a=0x7f28dbb11a38, msg=0x7f28dbee2088) at core/action.c:1076
#25 0x00000000004801fb in run_actions (h=0x7ffe79df8fe0, a=0x7f28dbb11a38, msg=0x7f28dbee2088) at core/action.c:1581
#26 0x0000000000480877 in run_actions_safe (h=0x7ffe79dfad00, a=0x7f28dbb11a38, msg=0x7f28dbee2088) at core/action.c:1645
#27 0x0000000000446834 in rval_get_int (h=0x7ffe79dfad00, msg=0x7f28dbee2088, i=0x7ffe79df990c, rv=0x7f28dbb11c78, cache=0x0) at core/rvalue.c:915
#28 0x000000000044b365 in rval_expr_eval_int (h=0x7ffe79dfad00, msg=0x7f28dbee2088, res=0x7ffe79df990c, rve=0x7f28dbb11c70) at core/rvalue.c:1913
#29 0x0000000000472929 in do_action (h=0x7ffe79dfad00, a=0x7f28dbb1a438, msg=0x7f28dbee2088) at core/action.c:1052
#30 0x00000000004801fb in run_actions (h=0x7ffe79dfad00, a=0x7f28dbb110a0, msg=0x7f28dbee2088) at core/action.c:1581
#31 0x000000000046f598 in do_action (h=0x7ffe79dfad00, a=0x7f28dba8c058, msg=0x7f28dbee2088) at core/action.c:700
#32 0x00000000004801fb in run_actions (h=0x7ffe79dfad00, a=0x7f28dba8be20, msg=0x7f28dbee2088) at core/action.c:1581
#33 0x0000000000472db7 in do_action (h=0x7ffe79dfad00, a=0x7f28dbaa55d8, msg=0x7f28dbee2088) at core/action.c:1067
#34 0x00000000004801fb in run_actions (h=0x7ffe79dfad00, a=0x7f28dba82020, msg=0x7f28dbee2088) at core/action.c:1581
#35 0x000000000048093f in run_top_route (a=0x7f28dba82020, msg=0x7f28dbee2088, c=0x0) at core/action.c:1666
#36 0x00000000005ecc77 in receive_msg (buf=0xabca80 <buf.7134> "ACK sip:61283214906@10.61.162.48:5090;alias=10.61.162.48~5090~1 SIP/2.0\r\nVia: SIP/2.0/UDP 45.116.131.37:6060;branch=z9hG4bK-13506-36982-4\r\nFrom: test_36982 sip:61289993111@45.116.131.37:6060;tag=135"..., len=740, rcv_info=0x7ffe79dfb3f0) at core/receive.c:423
#37 0x00000000004d837f in udp_rcv_loop () at core/udp_server.c:543
#38 0x0000000000429d47 in main_loop () at main.c:1683
#39 0x0000000000433a96 in main (argc=12, argv=0x7ffe79dfbc78) at main.c:2856
```

1345 timer:

```
#0  0x00007f28dca971c9 in syscall () from /lib64/libc.so.6
#1  0x00007f28b327cfa9 in futex_get (lock=0x7f28bd321478) at ../../modules/tm/../../core/mem/../futexlock.h:121
#2  0x00007f28b327f05b in bcast_dmq_message1 (peer=0x7f28bdf954a8, body=0x7ffe79dfac30, except=0x0, resp_cback=0x7f28b2956af0 <dlg_dmq_resp_callback>, max_forwards=1, content_type=0x7f28b2956ab0 <dlg_dmq_content_type>, incl_inactive=0) at dmq_funcs.c:156
#3  0x00007f28b327f8d4 in bcast_dmq_message (peer=0x7f28bdf954a8, body=0x7ffe79dfac30, except=0x0, resp_cback=0x7f28b2956af0 <dlg_dmq_resp_callback>, max_forwards=1, content_type=0x7f28b2956ab0 <dlg_dmq_content_type>) at dmq_funcs.c:188
#4  0x00007f28b26f445c in dlg_dmq_send (body=0x7ffe79dfac30, node=0x0) at dlg_dmq.c:88
#5  0x00007f28b26fdde4 in dlg_dmq_replicate_action (action=DLG_DMQ_RM, dlg=0x7f28be825d20, needlock=0, node=0x0) at dlg_dmq.c:628
#6  0x00007f28b26a27c0 in destroy_dlg (dlg=0x7f28be825d20) at dlg_hash.c:367
#7  0x00007f28b26ac867 in dlg_unref_helper (dlg=0x7f28be825d20, cnt=2, fname=0x7f28b273d21d "dlg_handlers.c", fline=431) at dlg_hash.c:1095
(gdb) p &d_entry->lock
$1 = (gen_lock_t *) 0x7f28bdf8b5b4
#8  0x00007f28b26b9b29 in dlg_ontdestroy (t=0x7f28bef7d488, type=131072, param=0x7ffe79dfb130) at dlg_handlers.c:431
#9  0x00007f28b6fb8286 in run_trans_callbacks_internal (cb_lst=0x7f28bef7d500, type=131072, trans=0x7f28bef7d488, params=0x7ffe79dfb130) at t_hooks.c:258
#10 0x00007f28b6fb83b8 in run_trans_callbacks (type=131072, trans=0x7f28bef7d488, req=0x0, rpl=0x0, code=0) at t_hooks.c:285
#11 0x00007f28b6f8c96c in free_cell_helper (dead_cell=0x7f28bef7d488, silent=0, fname=0x7f28b709c2d3 "timer.c", fline=643) at h_table.c:165
#12 0x00007f28b7033808 in wait_handler (ti=241915320, wait_tl=0x7f28bef7d510, data=0x7f28bef7d488) at timer.c:643
#13 0x0000000000500fd1 in timer_list_expire (t=241915320, h=0x7f28bb5e4378, slow_l=0x7f28bb5e80e0, slow_mark=14259) at core/timer.c:857
#14 0x000000000050147a in timer_handler () at core/timer.c:922
#15 0x0000000000501924 in timer_main () at core/timer.c:961
#16 0x000000000042a838 in main_loop () at main.c:1753
#17 0x0000000000433a96 in main (argc=12, argv=0x7ffe79dfbc78) at main.c:2856
```
Full kamctl trap: [gdb_kamailio_20201104_035531.txt](https://github.com/kamailio/kamailio/files/5485522/gdb_kamailio_20201104_035...)
Also, continuing the investigation, we saw that the function dlg_cseq_msg_sent was processing a KDMQ message and went past the condition added in your commit. However, from the conversation it seemed DMQ messages should not go through the CSeq tracking logic. After checking the patch, the condition checking the To-tag length > 0 will always be false for the initial KDMQ request, so based on the comment we changed the check to == 0. Is that a correct assumption?
Right, the commit with the supposed fix had the wrong condition for the To-tag length. I pushed a follow-up commit to fix that.
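To make the condition discussion concrete, here is a simplified sketch of the kind of guard being discussed (purely illustrative, with hypothetical types; it is not the actual dlg_cseq.c code): the locally generated initial KDMQ request carries no To-tag, so a `len > 0` test never matches and the request is not skipped, whereas a `len == 0` test skips it as intended.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical, simplified view of a SIP message -- for illustration only. */
struct sip_msg_view {
	const char *method;  /* e.g. "KDMQ", "INVITE" */
	const char *to_tag;  /* NULL or "" when the To header has no tag */
};

/* Return 1 when CSeq tracking should be skipped for this outgoing request. */
static int skip_cseq_tracking(const struct sip_msg_view *msg)
{
	size_t taglen = (msg->to_tag != NULL) ? strlen(msg->to_tag) : 0;

	if (strcmp(msg->method, "KDMQ") != 0)
		return 0; /* not a DMQ request: keep normal processing */

	/* broken check: the initial KDMQ has taglen == 0, so this never fired */
	/* if (taglen > 0) return 1; */

	/* corrected check: skip the tag-less, locally generated KDMQ request */
	if (taglen == 0)
		return 1;

	return 0;
}

int main(void)
{
	struct sip_msg_view kdmq = { "KDMQ", "" };     /* initial KDMQ: no To-tag */
	struct sip_msg_view invite = { "INVITE", "" }; /* initial INVITE */
	printf("skip KDMQ: %d, skip INVITE: %d\n",
			skip_cseq_tracking(&kdmq), skip_cseq_tracking(&invite));
	return 0;
}
```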
After our change to == 0, the load test ran for over 20 hours with no problems! Just to be aligned, we compiled the latest code from the 5.4 branch and started running the test again; so far, 3+ hours in with no problems. I think it is safe to say the deadlock is not happening anymore. Thank you for your effort, Daniel! May I also ask when the next 5.4 release is expected?
It is worth mentioning that the INVITE authentication, which depends on the CSeq tracking, still works fine after the change.
New releases will come in the near future, during the next few weeks. You can use the nightly builds from the stable branches; at least Debian/Ubuntu packages are built every night, see the deb.kamailio.org site.
Closing this one. Thanks!
Closed #2535.