<!-- Kamailio Project uses GitHub Issues only for bugs in the code or feature requests. Please use this template only for bug reports.
If you have questions about using Kamailio or related to its configuration file, ask on sr-users mailing list:
* http://lists.kamailio.org/cgi-bin/mailman/listinfo/sr-users
If you have questions about developing extensions to Kamailio or its existing C code, ask on sr-dev mailing list:
* http://lists.kamailio.org/cgi-bin/mailman/listinfo/sr-dev
Please try to fill this template as much as possible for any issue. It helps the developers to troubleshoot the issue.
If there is no content to be filled in a section, the entire section can be removed.
You can delete the comments from the template sections when filling.
You can delete next line and everything above before submitting (it is a comment). -->
### Description Kamailio v5.2.2 rev. 9bc44f randomly crashes once a week. The problem seems to be related to TM and ASYNC modules as seen in back-trace of dump core below.
<!-- Explain what you did, what you expected to happen, and what actually happened. -->
### Troubleshooting None. We have dump core file available for analysis.
#### Reproduction Randomly, roughly once every 5 to 7 days.
<!-- If the issue can be reproduced, describe how it can be done. -->
#### Debugging Data
<!-- If you got a core dump, use gdb to extract troubleshooting data - full backtrace, local variables and the list of the code at the issue location.
gdb /path/to/kamailio /path/to/corefile bt full info locals list
If you are familiar with gdb, feel free to attach more of what you consider to be relevant. -->
``` Program terminated with signal 11, Segmentation fault. #0 0xb4ecdb33 in faked_env (t=0x7ad779f8, msg=0x0, is_async_env=1) at t_reply.c:813 813 t_reply.c: No such file or directory.
(gdb) info locals __FUNCTION__ = "faked_env"
(gdb) list 808 in t_reply.c
(gdb) bt full #0 0xb4ecdb33 in faked_env (t=0x7ad779f8, msg=0x0, is_async_env=1) at t_reply.c:813 __FUNCTION__ = "faked_env" #1 0xb4eedba2 in t_continue_helper (hash_index=48512, label=393930873, rtact=0xb5ba6184, cbname=0x0, cbparam=0x0) at t_suspend.c:337 t = 0x7ad779f8 backup_T = 0xffffffff backup_T_branch = -1 faked_req = 0x7afdf1bc brpl = 0x0 erpl = 0x8055fd2 faked_req_len = 3176 cancel_data = {cancel_bitmap = 3076137104, reason = {cause = 26896, u = {text = {s = 0xb51cc7e0 "\254\026\003", len = 134606304}, e2e_cancel = 0xb51cc7e0, packed_hdrs = {s = 0xb51cc7e0 "\254\026\003", len = 134606304}}}} branch = 0 uac = 0x7ad77b64 ret = 0 cb_type = 2 msg_status = -1256442488 last_uac_status = -1245318100 reply_status = 137942287 do_put_on_wait = 134606304 hdr = 0xb55987d8 prev = 0x0 tmp = 0x0 route_type_bk = 1 keng = 0x0 evname = {s = 0xb4f31305 "tm:continue", len = 11} __FUNCTION__ = "t_continue_helper" #2 0xb4ef0a2d in t_continue (hash_index=48512, label=393930873, route=0xb5ba6184) at t_suspend.c:583 No locals. #3 0xb43ceac3 in async_exec_task (param=0x7a94bf54) at async_sleep.c:233 atp = 0x7a94bf54 keng = 0x0 cbname = {s = 0x0, len = 0} evname = {s = 0xb43d1a94 "async:task-exec", len = 15} __FUNCTION__ = "async_exec_task" #4 0x080894c6 in async_task_run (idx=4) at core/async_task.c:282 ptask = 0x7a94bf4c received = 4 __FUNCTION__ = "async_task_run" #5 0x08087ca1 in async_task_child_init (rank=0) at core/async_task.c:185 pid = 0 i = 3 __FUNCTION__ = "async_task_child_init" #6 0x08201d26 in init_child (rank=0) at core/sr_module.c:871 ret = 62 #7 0x0806b3d9 in main_loop () at main.c:1737 i = 4 pid = 5982 si = 0x0 si_desc = "udp receiver child=3 sock=85.118.42.43:5080\000\000\002a\267гn\267p\037\000\000\240\263n\267\364\237n\267\240\263n\267`\025\316\b\220W\371\277\315\002a\267\220\000\000\000`\025\316\b\364\237n\267\025\000\000\000\060X\371\277\355\ng\267\060/\316\b0/\316\b@q\245\265\000@\000\000@\220n\267\000\000\000" nrprocs = 4 woneinit = 1 __FUNCTION__ = "main_loop" #8 0x08071e84 in main (argc=13, argv=0xbff95be4) at main.c:2696 cfg_stream = 0x8bd3008 c = -1 r = 0 tmp = 0xbff96822 "" tmp_len = -1218858976 port = 2209 proto = 1 options = 0x83779a8 ":f:cm:M:dVIhEeb:l:L:n:vKrRDTN:W:w:t:u:g:P:G:SQ:O:a:A:x:X:Y:" ret = -1 seed = 1217010649 rfd = 4 debug_save = 0 debug_flag = 0 dont_fork_cnt = 0 n_lst = 0xbff95be4 p = 0x805e474 "[\201\303\060\037@" st = {st_dev = 14, __pad1 = 0, st_ino = 10873, st_mode = 16832, st_nlink = 2, st_uid = 0, st_gid = 0, st_rdev = 0, __pad2 = 0, st_size = 120, st_blksize = 4096, st_blocks = 0, st_atim = {tv_sec = 1559687937, tv_nsec = 2808493}, st_mtim = {tv_sec = 1566372756, tv_nsec = 587274417}, st_ctim = { tv_sec = 1567450779, tv_nsec = 139516276}, __unused4 = 0, __unused5 = 0} __FUNCTION__ = "main"
```
#### Log Messages
<!-- Check the syslog file and if there are relevant log messages printed by Kamailio, add them next, or attach to issue, or provide a link to download them (e.g., to a pastebin site). -->
``` Sep 3 03:02:03 webrtc-as1 kernel: [7785063.653846] kamailio[5987]: segfault at 318 ip b4ecdb33 sp bff94f90 error 6 in tm.so[b4e15000+12a000] ```
#### SIP Traffic Not available
### Possible Solutions N/A
### Additional Information
* **Kamailio Version** - output of `kamailio -v`
``` version: kamailio 5.2.2 (i386/linux) 9bc44f flags: STATS: Off, USE_TCP, USE_TLS, USE_SCTP, TLS_HOOKS, USE_RAW_SOCKS, DISABLE_NAGLE, USE_MCAST, DNS_IP_HACK, SHM_MEM, SHM_MMAP, PKG_MALLOC, Q_MALLOC, F_MALLOC, TLSF_MALLOC, DBG_SR_MEMORY, USE_FUTEX, FAST_LOCK-ADAPTIVE_WAIT, USE_DNS_CACHE, USE_DNS_FAILOVER, USE_NAPTR, USE_DST_BLACKLIST, HAVE_RESOLV_RES ADAPTIVE_WAIT_LOOPS=1024, MAX_RECV_BUFFER_SIZE 262144 MAX_URI_SIZE 1024, BUF_SIZE 65535, DEFAULT PKG_SIZE 8MB poll method support: poll, epoll_lt, epoll_et, sigio_rt, select. id: 9bc44f -dirty compiled on 11:33:22 Apr 29 2019 with gcc 4.7.2 ```
* **Operating System**:
<!-- Details about the operating system, the type: Linux (e.g.,: Debian 8.4, Ubuntu 16.04, CentOS 7.1, ...), MacOS, xBSD, Solaris, ...; Kernel details (output of `uname -a`) -->
``` Linux webrtc-as1 3.16.0-0.bpo.4-686-pae #1 SMP Debian 3.16.36-1+deb8u2~bpo70+1 (2016-10-19) i686 GNU/Linux ```
Can you get the output for the next gdb commands:
``` frame 0 p *t ```
Here it is,
``` (gdb) frame 0 #0 0xb4ecdb33 in faked_env (t=0x7ad779f8, msg=0x0, is_async_env=1) at t_reply.c:813 813 in t_reply.c (gdb) p *t $1 = {next_c = 0x0, prev_c = 0x0, hash_index = 0, label = 0, flags = 0, nr_of_outgoings = 0, fcount = 0, ref_count = {val = 0}, from = {s = 0x0, len = 0}, callid = { s = 0x0, len = 0}, cseq_n = {s = 0x0, len = 0}, to = {s = 0x0, len = 0}, method = {s = 0x0, len = 0}, tmcb_hl = {first = 0x0, reg_types = 0}, wait_timer = { next = 0x0, prev = 0x0, expire = 0, initial_timeout = 0, data = 0x0, f = 0, flags = 0, slow_idx = 0}, wait_start = 0, uas = {request = 0x0, end_request = 0x0, response = {rbtype = 0, flags = 0, t_active = 0, branch = 0, buffer_len = 0, buffer = 0x0, my_T = 0x0, timer = {next = 0x0, prev = 0x0, expire = 0, initial_timeout = 0, data = 0x0, f = 0, flags = 0, slow_idx = 0}, dst = {send_sock = 0x0, to = {s = {sa_family = 0, sa_data = '\000' <repeats 13 times>}, sin = {sin_family = 0, sin_port = 0, sin_addr = {s_addr = 0}, sin_zero = "\000\000\000\000\000\000\000"}, sin6 = {sin6_family = 0, sin6_port = 0, sin6_flowinfo = 0, sin6_addr = {__in6_u = {__u6_addr8 = '\000' <repeats 15 times>, __u6_addr16 = {0, 0, 0, 0, 0, 0, 0, 0}, __u6_addr32 = {0, 0, 0, 0}}}, sin6_scope_id = 0}}, id = 0, proto = 0 '\000', send_flags = {f = 0, blst_imask = 0}}, retr_expire = 0, fr_expire = 0}, local_totag = {s = 0x0, len = 0}, cancel_reas = 0x0, status = 0}, uac = 0x0, async_backup = {backup_route = 0, backup_branch = 0, blind_uac = 0, ruri_new = 0}, fwded_totags = 0x0, uri_avps_from = 0x0, uri_avps_to = 0x0, user_avps_from = 0x0, user_avps_to = 0x0, domain_avps_from = 0x0, domain_avps_to = 0x0, xavps_list = 0x0, reply_mutex = { val = 0}, reply_locker_pid = {val = 0}, reply_rec_lock_level = 0, fr_timeout = 0, fr_inv_timeout = 0, rt_t1_timeout_ms = 0, rt_t2_timeout_ms = 0, end_of_life = 0, relayed_reply_branch = 0, on_failure = 0, on_branch_failure = 0, on_reply = 0, on_branch = 0, on_branch_delayed = 0, md5 = 0x7ad77b44 "92164e4777075c44c4f5de28a2b6f050\377\377\377\377"} ```
Seems to be a removed transaction structure. Can the async operation take very long till it resumes the transaction processing? What kind of operations are done in this async execution?
The async operation basically related to chat history setup on our production server. It fetches chat sessions and relevent info from backend (REDIS + MySQL or Elastic Search) using mod_perl.
According to our load test results, the whole operation takes roughly 500 to 800 ms (<1 second) under normal conditions but may prolong upto 5000 ms (5 seconds) in case backend service (MySQL or ES) failure.
The server logs show that transaction in question was started at 03:02:00 CET on Sept. 03, 2019 and kamailio crashed at 03:02:03 CET, there was a backend failure (REDIS was unreachable at the time).
Do you still have the sip traffic for that transaction (requests/replies received/sent by kamailio)? If yes, you can send me the pcap to try to match with code execution.
Hi,
Sorry for late reply, i was travelling and didn't get chance to look at this issue.
Anyways, no, we do not have sip trace. I only have syslogs which indicate that right before crash the REDIS connectivity failed.
``` Sep 3 03:02:03 webrtc-service kamailio[5987]: ERROR: app_perl [kamailioxs.xs:1041]: XS_Kamailio__Message_log(): perl error: Could not connect to Redis server at 172.16.x.x:6379: Connection timed out at /usr/share/perl5/Redis.pm line 420#012#011Redis::__build_sock('Redis=HASH(0x93a1948)') called at /usr/share/perl5/Redis.pm line 407#012#011eval {...} called at /usr/share/perl5/Redis.pm line 407#012#011Redis::__connect('Redis=HASH(0x93a1948)') called at /usr/share/perl5/Redis.pm line 65#012#011Redis::new('Redis', 'server', '172.16.x.x:6379', 'reconnect', 2, 'every', 100, 'encoding', undef, ...) called at /usr/local/kamailio/etc/kamailio/chat_history.pl line 2272#012#011main::get_chat_sessions('Kamailio::Message=SCALAR(0x8cddb64)', 502100012130) called at /usr/local/kamailio/etc/kamailio/chat_history.pl line 0#012#011eval{...} called at /usr/local/kamailio/etc/kamailio/chat_history.pl line 0#012#011...propagated at /usr/share/perl5/Redis.pm line 410.#012 Sep 3 03:02:03 webrtc-service kernel: [7785063.653846] kamailio[5987]: segfault at 318 ip b4ecdb33 sp bff94f90 error 6 in tm.so[b4e15000+12a000] ```
Here is the code fragment that was executing in async process when crash happened,
``` xlog("L_NOTICE", "[$pr:$fU@$si:$sp]: Found '$avp(rows)' db records for SIP SCI command '$avp(sip_sci_cmd)' \n"); append_to_reply("X-Message-Count: $avp(rows)\r\n"); $avp(reason) = "Accepted: sending " + $avp(rows) + " messages"; send_reply("201", "$avp(reason)"); ... perl_exec("get_chat_sessions", $fU); xlog("L_INFO", "[$pr:$fU@$si:$sp]: Updated Chat History Summary: $avp(new_body) \n");
#### self message with json body #### $uac_req(all)=$null; $uac_req(method)="MESSAGE"; $uac_req(furi)=$fu; $uac_req(turi)=$fu; #$uac_req(evroute)=1; $uac_req(callid)=$uuid(r); $uac_req(hdrs)="Content-Type: application/json\r\nDate: " + $TF + "\r\n"; $uac_req(body)=$avp(new_body);
$uac_req(ruri)=$avp(sender_ruri); $uac_req(ouri)=$avp(sender_duri); $uac_req(sock)=$avp(sender_sock);
uac_req_send(); ```
Note that variable **$avp(new_body)** was set by PERL script. There is a theory from one of my developers is that since REDIS failed to connect so this variable was never set and when UAC tried to set NULL as JSON body for SIP MESSAGE request, the crashed must have happened. However, the core dump attached in this ticket does not confirm that.
This resulted in this syslog messages,
``` Sep 3 02:59:56 webrtc-service kamailio[5987]: NOTICE: <script>: [wss:xxxxxxxx@x.x.x.x:52025]: Found '26' db records for SIP SCI command 'chat_summary' Sep 3 03:02:03 webrtc-service kamailio[5987]: ERROR: app_perl [kamailioxs.xs:1041]: XS_Kamailio__Message_log(): perl error: Could not connect to Redis server at 172.16.x.x:6379: Connection timed out at /usr/share/perl5/Redis.pm line 420#012#011Redis::__build_sock('Redis=HASH(0x93a1948)') called at /usr/share/perl5/Redis.pm line 407#012#011eval {...} called at /usr/share/perl5/Redis.pm line 407#012#011Redis::__connect('Redis=HASH(0x93a1948)') called at /usr/share/perl5/Redis.pm line 65#012#011Redis::new('Redis', 'server', '172.16.x.x:6379', 'reconnect', 2, 'every', 100, 'encoding', undef, ...) called at /usr/local/kamailio/etc/kamailio/chat_history.pl line 2272#012#011main::get_chat_sessions('Kamailio::Message=SCALAR(0x8cddb64)', 502100012130) called at /usr/local/kamailio/etc/kamailio/chat_history.pl line 0#012#011eval{...} called at /usr/local/kamailio/etc/kamailio/chat_history.pl line 0#012#011...propagated at /usr/share/perl5/Redis.pm line 410.#012 Sep 3 03:02:03 webrtc-service kernel: [7785063.653846] kamailio[5987]: segfault at 318 ip b4ec ```
Let know if you have any further queries.
Closed #2055.
My expectation is that the crash happened because the transaction was resumed more or less when it was about to be destroyed. I pushed a patch to prevent that happening -- I will back port it. If you still get the crash with it, reopen.