### Description After upgrading from Kamailio 5.2.x, a high volume Kamailio 5.4.4 instance randomly crashes with either a general protection or segfault error message in `siptrace.so` during use of `sip_trace` function from one of its child processes (which cascades to the parent crashing). This appears to occur once about every 36 hours on average, but has not yet appeared to correspond with any particular event.
We are continuing to collect debug information and will be populating this ticket as more information becomes available. However, this issue has been observed.
Sip trace function is applied in this example snippet: ``` # ------- siptrace -------- modparam("siptrace", "hep_mode_on", 1) modparam("siptrace", "hep_version", 3) modparam("siptrace", "trace_to_database", 0) modparam("siptrace", "trace_flag", 22) modparam("siptrace", "trace_on", 1)
request_route { #.... if ( is_method("INVITE") && !has_totag() ) { # Only start sip_trace on initial INVITE sip_trace("HEP_URL","$ci-MY_IP","d"); } setflag(22); #... } ```
### Troubleshooting We attempted packet collection with Homer v5 and Homer v7 and changed between HEP protocol v2 and v3.
#### Reproduction We have not determined a means of reproducing this issue without simply letting the server run until a crash occurs. There are four almost identical servers all experiencing the same random crashing but not at the same time.
#### Debugging Data Our next troubleshooting case will be to simply comment out the `sip_trace` function, but this effectively disables the `siptrace` module completely rather than addressing an underlying problem.
Core dumps are still in-progress for retrieval. Debug logs should also be more readily available soon. There will be delays since these are high volume production servers.
<!-- If you got a core dump, use gdb to extract troubleshooting data - full backtrace, local variables and the list of the code at the issue location.
gdb /path/to/kamailio /path/to/corefile bt full info locals list
If you are familiar with gdb, feel free to attach more of what you consider to be relevant. -->
#### Log Messages
All of them have randomly crashed with the following example log entry. Regardless of troubleshooting tactics to date: ``` kernel: traps: kamailio[7579] general protection ip:7fb1a64e2dbf sp:7ffc60f04180 error:0 in siptrace.so[7fb1a64b8000+4e000] systemd: kamailio.service: main process exited, code=exited, status=1/FAILURE systemd: Unit kamailio.service entered failed state. systemd: kamailio.service failed. ```
#### SIP Traffic To date, there is no corresponding SIP Traffic with the crash.
### Possible Solutions To date, only disabling the `siptrace` module seems to be the solution.
### Additional Information
* **Kamailio Version** - output of `kamailio -v`
``` version: kamailio 5.4.4 (x86_64/linux) e16352 flags: USE_TCP, USE_TLS, USE_SCTP, TLS_HOOKS, USE_RAW_SOCKS, DISABLE_NAGLE, USE_MCAST, DNS_IP_HACK, SHM_MMAP, PKG_MALLOC, Q_MALLOC, F_MALLOC, TLSF_MALLOC, DBG_SR_MEMORY, USE_FUTEX, FAST_LOCK-ADAPTIVE_WAIT, USE_DNS_CACHE, USE_DNS_FAILOVER, USE_NAPTR, USE_DST_BLACKLIST, HAVE_RESOLV_RES ADAPTIVE_WAIT_LOOPS 1024, MAX_RECV_BUFFER_SIZE 262144, MAX_URI_SIZE 1024, BUF_SIZE 65535, DEFAULT PKG_SIZE 8MB poll method support: poll, epoll_lt, epoll_et, sigio_rt, select. id: e16352 compiled on 15:56:46 Feb 15 2021 with gcc 4.8.5 ```
* **Operating System**:
<!-- Details about the operating system, the type: Linux (e.g.,: Debian 8.4, Ubuntu 16.04, CentOS 7.1, ...), MacOS, xBSD, Solaris, ...; Kernel details (output of `uname -a`) -->
``` Linux <hostname> 3.10.0-957.27.2.el7.x86_64 #1 SMP Mon Jul 29 17:46:05 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux ```
Getting `bt full` with gdb from all core dump files would be useful, otherwise is no lead at this moment where the issue can be.
Any hints on getting a core dump? I have been working my way through everything in this documentation on CentOS https://www.kamailio.org/wiki/tutorials/troubleshooting/coredumpfile and it appears to absolutely refuse everything ulimit, sysctl parameters, CORE_DUMP=yes, etc... I am sure I will find it eventually but in the meantime I am testing with "kill -ABRT"
ALERT: <core> [main.c:786]: handle_sigs(): core was not generated
I'll find it eventually I am sure...
In case someone else runs into this issue -- it was far from transparent, but my best guess is it was caused by some combination of file permission issues. These are the various things on CentOS 7 that were tweaked to allow a core dump
The following combination of things were changed:
/etc/sysctl.d/99-sysctl.conf ``` kernel.core_pattern = /tmp/core.%e.%p.%h.%t kernel.core_uses_pid=1 fs.suid_dumpable = 2 ```
`sysctl --system` to reload kernel parameters
/etc/sysconfig/kamailio ``` DUMP_CORE=yes ```
May be required but unclear: `ulimit -c unlimited`
Tested by running `kill -ABRT <PID of Kamailio Child>`
@rampageai: thanks for the notes for CentOS, I added them to wiki:
* https://www.kamailio.org/wiki/tutorials/troubleshooting/coredumpfile#centos_...
The `ulimit` is useful if the core file is going to have a large size, it may be already set to a value large enough, however setting it to `unlimited` ensures the size won't hit any system limit.
Core dumps for parent and child using `bt full` in gdb.
NOTE: I am most definitely not a gdb ninja so I focused on running the full backtrace command and hoping the output of it provides some further input from anyone with additional knowledge.
PARENT: ``` (gdb) bt full #0 0x00007efc7d7152c7 in raise () from /lib64/libc.so.6 No symbol table info available. #1 0x00007efc7d7169b8 in abort () from /lib64/libc.so.6 No symbol table info available. #2 0x000000000041e300 in sig_alarm_abort () No symbol table info available. #3 <signal handler called> No symbol table info available. #4 0x00007efc7d7c1ec7 in sched_yield () from /lib64/libc.so.6 No symbol table info available. #5 0x00000000004f72ad in timer_del_safe () No symbol table info available. #6 0x00007efc765f5799 in unlink_timers () from /usr/lib64/kamailio/modules/tm.so No symbol table info available. #7 0x00007efc765f65e5 in free_cell_helper () from /usr/lib64/kamailio/modules/tm.so No symbol table info available. #8 0x00007efc765f8dec in free_hash_table () from /usr/lib64/kamailio/modules/tm.so No symbol table info available. #9 0x00007efc766a6b51 in tm_shutdown () from /usr/lib64/kamailio/modules/tm.so No symbol table info available. #10 0x000000000058916f in destroy_modules () No symbol table info available. #11 0x000000000041cd77 in cleanup () No symbol table info available. #12 0x000000000041e652 in shutdown_children () No symbol table info available. #13 0x00000000004216fa in handle_sigs () No symbol table info available. #14 0x000000000042b9b6 in main_loop () No symbol table info available. #15 0x0000000000433f34 in main () No symbol table info available. ```
CHILD: ``` #0 0x00007efc73a6adbf in msg_send_buffer () from /usr/lib64/kamailio/modules/siptrace.so No symbol table info available. #1 0x00007efc73a7371d in trace_send_hep2_duplicate () from /usr/lib64/kamailio/modules/siptrace.so No symbol table info available. #2 0x00007efc73a6dfa9 in trace_send_hep_duplicate () from /usr/lib64/kamailio/modules/siptrace.so No symbol table info available. #3 0x00007efc73a4c517 in sip_trace_store () from /usr/lib64/kamailio/modules/siptrace.so No symbol table info available. #4 0x00007efc73a5c760 in trace_onreply_out () from /usr/lib64/kamailio/modules/siptrace.so No symbol table info available. #5 0x00007efc766b1ec2 in run_trans_callbacks_internal () from /usr/lib64/kamailio/modules/tm.so No symbol table info available. #6 0x00007efc766b2147 in run_trans_callbacks_off_params () from /usr/lib64/kamailio/modules/tm.so No symbol table info available. #7 0x00007efc7664a1b2 in _reply_light () from /usr/lib64/kamailio/modules/tm.so No symbol table info available. #8 0x00007efc7664b1fb in _reply () from /usr/lib64/kamailio/modules/tm.so No symbol table info available. #9 0x00007efc766570aa in t_reply_str_unsafe () No symbol table info available. #10 0x00007efc766cc03e in ki_t_reply () from /usr/lib64/kamailio/modules/tm.so No symbol table info available. #11 0x00007efc766cc8d4 in w_t_reply () from /usr/lib64/kamailio/modules/tm.so No symbol table info available. #12 0x000000000046617e in do_action () No symbol table info available. #13 0x000000000047341b in run_actions () No symbol table info available. #14 0x0000000000465fd7 in do_action () No symbol table info available. #15 0x000000000047341b in run_actions () No symbol table info available. #16 0x0000000000465fd7 in do_action () No symbol table info available. #17 0x000000000047341b in run_actions () No symbol table info available. #18 0x0000000000473b5f in run_top_route () No symbol table info available. #19 0x00007efc7664e860 in run_failure_handlers () from /usr/lib64/kamailio/modules/tm.so No symbol table info available. #20 0x00007efc76652edb in t_should_relay_response () from /usr/lib64/kamailio/modules/tm.so No symbol table info available. #21 0x00007efc7665843d in relay_reply () from /usr/lib64/kamailio/modules/tm.so No symbol table info available. #22 0x00007efc7669cce3 in fake_reply () from /usr/lib64/kamailio/modules/tm.so No symbol table info available. #23 0x00007efc7669d150 in final_response_handler () from /usr/lib64/kamailio/modules/tm.so No symbol table info available. #24 0x00007efc7669d211 in retr_buf_handler () from /usr/lib64/kamailio/modules/tm.so No symbol table info available. #25 0x00000000004f8c0e in slow_timer_main () No symbol table info available. #26 0x000000000042a602 in main_loop () No symbol table info available. #27 0x0000000000433f34 in main () No symbol table info available. ```
The one for parent is a side effect, due to the crash of the child process.
You have to install kamailio debug symbols package (kamailio-dbg or so) and then get again the `bt full`, it will contain more useful details. It should work with existing core files.
Acknowledged. I am learning the debug system better at least. I was able to install `kamailio-debuginfo` CentOS package from the repository, test by forcing a core dump, and prove that it does indeed carry more information in the full backtrace. I have to wait until one of my instances generates a random crash and core dump this following week, though, so it will be some time.
You do not need to wait for new core dumps. As I wrote above, it should work with existing core files, after you installed debugging info package, just re-do the gdb commands and should get backtraces with more debugging details.
Thank you for pointing that out again. I did miss that. I attached the gdb logging file for the core dump of the child process (since it's more interesting) to this message. I really hope it's helpful.
[gdb-20210503.txt](https://github.com/kamailio/kamailio/files/6415372/gdb-20210503.txt)
I need further details from the core file, run the following commands in the gdb and paste the output here:
``` frame 0 list p *dst ```
Hi, here is what I got from running those commands in order: ``` (gdb) frame 0 #0 0x00007efc73a6adbf in msg_send_buffer (dst=0x7efc0a6064b8, buf=0x7efc79a0e1e8 "\002\020\002\021\023\304\023\304\064\bɀ4\bɀ\330\037\214`\237\002\f", len=753, flags=1) at ../../core/forward.h:220 220 if (unlikely((dst->send_sock==0) || (gdb) list 215 goto done; 216 } 217 #endif 218 219 if (likely(dst->proto==PROTO_UDP)){ 220 if (unlikely((dst->send_sock==0) || 221 (dst->send_sock->flags & SI_IS_MCAST))){ 222 new_dst=*dst; 223 new_dst.send_sock=get_send_socket(0, &dst->to, dst->proto); 224 if (unlikely(new_dst.send_sock==0)){ (gdb) p *dst $1 = {send_sock = 0x726177726f462d78, to = {s = {sa_family = 29540, sa_data = ": 70\r\nContent-"}, sin = { sin_family = 29540, sin_port = 8250, sin_addr = {s_addr = 168636471}, sin_zero = "Content-"}, sin6 = { sin6_family = 29540, sin6_port = 8250, sin6_flowinfo = 168636471, sin6_addr = {__in6_u = { __u6_addr8 = "Content-Length: ", __u6_addr16 = {28483, 29806, 28261, 11636, 25932, 26478, 26740, 8250}, __u6_addr32 = {1953394499, 762605157, 1735288140, 540698740}}}, sin6_scope_id = 1426722096}, sas = { ss_family = 29540, __ss_padding = ": 70\r\nContent-Length: 0\r\nUser-Agent: Skyetel\r\n\r\n", '\000' <repeats 69 times>, __ss_align = 0}}, id = 0, send_flags = {f = 0, blst_imask = 0}, proto = 1 '\001', proto_pad0 = 0 '\000', proto_pad1 = 0} ```
Hi, any thoughts or updates on this issue? It looks roughly like there is garbled data making it into the networking.
Probably the field `dst->send_sock` is no longer valid, it can confirmed by getting:
``` frame 0 p *dst->send_sock ```
If the address is invalid, can be the case that the SIP transaction is destroyed at same time due to a race.
I see you use dialog mode tracing - `d` mode:
``` # Only start sip_trace on initial INVITE sip_trace("HEP_URL","$ci-MY_IP","d"); ```
Which was not in 5.2.x, so it might be a bug of that feature. I haven't implemented it, so I do not have proper insights at this moment, it will take me some time to get into it. Meanwhile, if you still face crashes, maybe you can switch to tracing using:
``` modparam("siptrace", "trace_mode", 1) ```
and:
``` event_route[siptrace:msg] { if(!is_method("INVITE|ACK|BYE|CANCEL|PRACK|UPDATE")) { drop(); } } ```
as I understood that you want to do sip trace for traffic related to calls.
Thank you for the excellent insight. I think you may be right about the race condition on dialogs since it is a combination of a random crash on a high-use server. I have to deal with some pre-requisites later this week, but I will try to give more gdb feedback confirming the `dst->send_sock` field you mentioned when I do.
You suggested a workaround which I am trying to implement *carefully*. If I wanted to totally avoid the "dialog" mode but still populate my Homer (HEP) server with records that give feedback on full calls, how would I think in terms of 'transaction' or 'message' modes instead? Would I just use sip_trace on all request_route transactions? I am only asking here because it doesn't seem clear about what these modes entail in the 5.4.x module documentation.
"dialog" seemed like the clear choice for me since I was attempting to provide call debug information to my Homer server so, to be honest, I am a little confused about the use-case for the other two modes?
Per the doc:
In message tracing mode only the current message is stored or mirrored. In transaction tracing mode, all the messages of the current transaction are stored or mirrored. In dialog tracing mode, all the messages of the current dialog are stored or mirrored.
Hi @miconda We have 2 instances of this crash from two different servers. If it helps I'm attaching the 2 backtraces. At the time of the crash we had the following.
`# ----- siptrace params ----- modparam("siptrace", "duplicate_uri","sip:REDACTED:9060") modparam("siptrace", "hep_mode_on",1) modparam("siptrace", "hep_version", 2) modparam("siptrace", "hep_capture_id", 11) modparam("siptrace", "trace_to_database",0) modparam("siptrace", "trace_flag",22) modparam("siptrace", "trace_on", 1)
...
sip_trace("", "", "d"); `
And the syslog looked like this.
`Jul 1 11:37:21 REDACTED kernel: [19379376.973389] kamailio[20796]: segfault at c0c0c110 ip 00007fb91985c3b7 sp 00007ffc3a691cd0 error 4 in siptrace.so[7fb919831000+4e000] Jul 1 11:37:21 REDACTED /usr/local/sbin/kamailio[20697]: ALERT: <core> [main.c:788]: handle_sigs(): child process 20796 exited by a signal 11 Jul 1 11:37:21 REDACTED /usr/local/sbin/kamailio[20697]: ALERT: <core> [main.c:791]: handle_sigs(): core was generated Jul 1 11:37:21 REDACTED /usr/local/sbin/kamailio[20697]: INFO: <core> [main.c:813]: handle_sigs(): terminating due to SIGCHLD`
Hopefully that is helpful. I'm also happy to provide any other info if needed. Thanks!
[kamailio-bt-5-5-1-ServerA.txt](https://github.com/kamailio/kamailio/files/6748649/kamailio-bt-5-5-1-ServerA...) [kamailio-bt-5-5-1-ServerB.txt](https://github.com/kamailio/kamailio/files/6748650/kamailio-bt-5-5-1-ServerB...)
@jimdss Are you still observing this issue or was it solved by some other means?
@jimdss Are you still observing this issue or was it solved by some other means?
@henningw We've since upgraded to the latest 5.5.x release and have had no problem since.
Thank you for the feedback, close it then.
Closed #2718.