Hi,
we have crashes on some of our Kamailio hosts:
```
/usr/sbin/kamailio[1833]: CRITICAL: <core> [core/pass_fd.c:277]: receive_fd(): EOF on 9
/usr/sbin/kamailio[1820]: ALERT: <core> [main.c:738]: handle_sigs(): child process 1831 exited by a signal 6
/usr/sbin/kamailio[1820]: ALERT: <core> [main.c:741]: handle_sigs(): core was not generated
/usr/sbin/kamailio[1820]: INFO: <core> [main.c:764]: handle_sigs(): terminating due to SIGCHLD
```
## Problem
Kamailio segfaults when processing TCP replies.
## Config file and use case
We're running a somewhat complex KEMI/Python configuration. It routes internal UDP traffic from Asterisk to PSTN providers over various protocols (UDP/TCP/TLS).
I was able to isolate the problem to a simple test setup on a test rig. It happens when Kamailio uses an outgoing TCP socket; UDP is not affected. The segfault always occurs after approx. 12-13 register/auth/ok cycles (8 packets each). No other messages or calls are involved at all. With a short register timeout I can now reproduce the case within 4-5 minutes. Registers are not processed directly in Kamailio (no usrloc at all).
A downgrade to 5.1.6 shows a similar problem.
## Kamailio version / os
Prebuilt 5.2.1 from deb.kamailio.org in an Ubuntu 16.04 LXC container.
kamailio -V
```
version: kamailio 5.2.1 (x86_64/linux)
flags: STATS: Off, USE_TCP, USE_TLS, USE_SCTP, TLS_HOOKS, USE_RAW_SOCKS, DISABLE_NAGLE, USE_MCAST, DNS_IP_HACK, SHM_MEM, SHM_MMAP, PKG_MALLOC, Q_MALLOC, F_MALLOC, TLSF_MALLOC, DBG_SR_MEMORY, USE_FUTEX, FAST_LOCK-ADAPTIVE_WAIT, USE_DNS_CACHE, USE_DNS_FAILOVER, USE_NAPTR, USE_DST_BLACKLIST, HAVE_RESOLV_RES
ADAPTIVE_WAIT_LOOPS=1024, MAX_RECV_BUFFER_SIZE 262144
MAX_URI_SIZE 1024, BUF_SIZE 65535, DEFAULT PKG_SIZE 8MB
poll method support: poll, epoll_lt, epoll_et, sigio_rt, select.
id: unknown
compiled with gcc 5.3.1
```
## gdb debug info
Core dumps were analyzed with the source and dbg packages from deb.kamailio.org.
(gdb) list
```
251             clear_branches();
252
253             if(unlikely(ksr_route_locks_set!=NULL && msg->callid && msg->callid->body.s
254                         && msg->callid->body.len >0)) {
255                 cidlockidx = get_hash1_raw(msg->callid->body.s, msg->callid->body.len);
256                 cidlockidx = cidlockidx % ksr_route_locks_set->size;
257                 cidlockset = 1;
258             }
259
260             if(msg->first_line.type == SIP_REQUEST) {
```
(gdb) bt full
```
#0  0x0000000000596b86 in receive_msg (
    buf=0x2969590 "SIP/2.0 401 Unauthorized\r\nVia: SIP/2.0/TCP 81.27.163.30:5062;branch=z9hG4bK4473.59e6c5c0a45c67b063e075a8f9f8eccb.0\r\nVia: SIP/2.0/UDP 100.72.253.100:5060;rport=5060;received=100.72.253.100;branch=z9hG4"...,
    len=573, rcv_info=0x7f256b187e08) at core/receive.c:256
        msg = 0x7f2577648d90
        ctx = {rec_lev = -878175776, run_flags = 32767, last_retcode = 6442811, jmp_env = {{__jmpbuf = {8108147, 134217728, 8388608, 6, 140736610180220, 139798687284720, 1, 70364449210377}, __mask_was_saved = 878175761, __saved_mask = {__val = {139798687284848, 140736610179776, 139798687284720, 42949672963, 139798687286050, 8388608, 139798687286053, 140736610179792, 6445147, 0, 0, 255, 0, 140736610180220, 139798687284720, 2314885530818453536}}}}}
        bctx = 0x0
        ret = 1
        stats_on = 0
        tvb = {tv_sec = 140736610180220, tv_usec = 139798687285480}
        tve = {tv_sec = 139798687284720, tv_usec = 38654722047}
        tz = {tz_minuteswest = 2014554944, tz_dsttime = 1}
        diff = 0
        inb = {s = 0x2969590 "SIP/2.0 401 Unauthorized\r\nVia: SIP/2.0/TCP 81.27.163.30:5062;branch=z9hG4bK4473.59e6c5c0a45c67b063e075a8f9f8eccb.0\r\nVia: SIP/2.0/UDP 100.72.253.100:5060;rport=5060;received=100.72.253.100;branch=z9hG4"..., len = 573}
        netinfo = {data = {s = 0x7bb873 "INFO", len = 134217728}, rcv = 0x800000, dst = 0x6}
        keng = 0x0
        evp = {data = 0x7fffcba81900, rcv = 0x7f256b187e08, dst = 0x0}
        cidlockidx = 2555451508
        cidlockset = 0
        errsipmsg = 0
        __func__ = "receive_msg"
#1  0x000000000062cb6c in receive_tcp_msg (
    tcpbuf=0x7f256b1880e8 "SIP/2.0 401 Unauthorized\r\nVia: SIP/2.0/TCP 81.27.163.30:5062;branch=z9hG4bK4473.59e6c5c0a45c67b063e075a8f9f8eccb.0\r\nVia: SIP/2.0/UDP 100.72.253.100:5060;rport=5060;received=100.72.253.100;branch=z9hG4"...,
    len=573, rcv_info=0x7f256b187e08, con=0x7f256b187df0) at core/tcp_read.c:1448
        buf = 0x2969590 "SIP/2.0 401 Unauthorized\r\nVia: SIP/2.0/TCP 81.27.163.30:5062;branch=z9hG4bK4473.59e6c5c0a45c67b063e075a8f9f8eccb.0\r\nVia: SIP/2.0/UDP 100.72.253.100:5060;rport=5060;received=100.72.253.100;branch=z9hG4"...
        bsize = 65535
        blen = 65535
        __func__ = "receive_tcp_msg"
#2  0x000000000062ec28 in tcp_read_req (con=0x7f256b187df0, bytes_read=0x7fffcba81c78, read_flags=0x7fffcba81c7c) at core/tcp_read.c:1631
        bytes = 573
        total_bytes = 573
        resp = 1
        size = 24
        req = 0x7f256b187e70
        dst = {send_sock = 0x0, to = {s = {sa_family = 61216, sa_data = "\255\000\000\000\000\000\270\034\250\313\377\177\000"}, sin = {sin_family = 61216, sin_port = 173, sin_addr = {s_addr = 0}, sin_zero = "\270\034\250\313\377\177\000"}, sin6 = {sin6_family = 61216, sin6_port = 173, sin6_flowinfo = 0, sin6_addr = {__in6_u = {__u6_addr8 = "\270\034\250\313\377\177\000\000\b\000\000\000\000\000\000", __u6_addr16 = {7352, 52136, 32767, 0, 8, 0, 0, 0}, __u6_addr32 = {3416792248, 32767, 8, 0}}}, sin6_scope_id = 20}}, id = 0, proto = 1 '\001', send_flags = {f = 0, blst_imask = 1}}
        c = 57 '9'
        ret = -878175312
        __func__ = "tcp_read_req"
#3  0x00000000006318c1 in handle_io (fm=0x7f2577636468, events=1, idx=-1) at core/tcp_read.c:1804
        ret = 8
        n = 8
        read_flags = 1
        con = 0x7f256b187df0
        s = 9
        resp = 140736610180352
        t = 32549
        __func__ = "handle_io"
#4  0x000000000062017f in io_wait_loop_epoll (h=0xadef20 <io_w>, t=2, repeat=0) at core/io_wait.h:1065
        n = 1
        r = 0
        fm = 0x7f2577636468
        revents = 1
        __func__ = "io_wait_loop_epoll"
#5  0x000000000063375a in tcp_receive_loop (unix_sock=25) at core/tcp_read.c:1974
        __func__ = "tcp_receive_loop"
#6  0x000000000051a6ed in tcp_init_children () at core/tcp_main.c:4853
        r = 0
        i = 1
        reader_fd_1 = 25
        pid = 0
        si_desc = "tcp receiver (generic)\000\000\r\216U\000\000\000\000\000\bqbw%\177\000\000\320x`w\001", '\000' <repeats 11 times>, "M\307]\000\000\000\000\000Q\243|\000\000\000\000\000\250\271aw%\177\000\000`Fx\000\000\000\000\000\000\000\000\000N\031\000\000s\270{\000\000\000\000\000\320_\347w%\177\000\000s\270{\000\000\000\000\000\000\000\000\b\000\000\000"
        si = 0x0
        __func__ = "tcp_init_children"
#7  0x000000000042619b in main_loop () at main.c:1735
        i = 2
        pid = 6477
        si = 0x0
        si_desc = "udp receiver child=1 sock=81.27.163.30:5060 (81.27.163.30:5060)\000\004\000\000\000\000\000\000\000\000\260\354j%\177\000\000\000\000\000\000\000\000\000\000\250\347\027k%\177\000\000p \250\313\377\177\000\000\034,M\000\000\000\000\000", '\377' <repeats 12 times>, "\n\000\000"
        nrprocs = 2
        woneinit = 1
        __func__ = "main_loop"
#8  0x000000000042c9a5 in main (argc=15, argv=0x7fffcba82318) at main.c:2675
        cfg_stream = 0x27eb010
        c = -1
        r = 0
        tmp = 0x7fffcba83eb8 ""
        tmp_len = 1
        port = 0
        proto = 2021155176
        options = 0x76c218 ":f:cm:M:dVIhEeb:l:L:n:vKrRDTN:W:w:t:u:g:P:G:SQ:O:a:A:x:X:Y:"
        ret = -1
        seed = 856022880
        rfd = 4
        debug_save = 0
        debug_flag = 0
        dont_fork_cnt = 0
        n_lst = 0x7f2578786718
        p = 0xffffffff <error: Cannot access memory at address 0xffffffff>
        st = {st_dev = 93, st_ino = 98, st_nlink = 2, st_mode = 16832, st_uid = 378, st_gid = 378, __pad0 = 0, st_rdev = 0, st_size = 60, st_blksize = 4096, st_blocks = 0, st_atim = {tv_sec = 1549008770, tv_nsec = 369207448}, st_mtim = {tv_sec = 1549009064, tv_nsec = 813217177}, st_ctim = {tv_sec = 1549009064, tv_nsec = 813217177}, __glibc_reserved = {0, 0, 0}}
        __func__ = "main"
```
Let me know if you need more info; I can fully reproduce the problem.
Cheers,
Thomas
Did you set the `route_locks_size` global parameter? If yes, what is the value?
Can you provide from gdb the output for:
```
p ksr_route_locks_set
p *ksr_route_locks_set
```
No, this parameter is not set; it should have the default value.
@miconda not sure if it's useful, but I tried to start Kamailio with the mentioned `route_locks_size` set and could not get it to start (tried with values 1, 2, 8, 16, 20).
```
Starting Kamailio (OpenSER) - the Open Source SIP Server...
Feb 01 09:47:57 ifens224 kamailio[11130]: INFO: pv [pv_shv.c:60]: shvar_init_locks(): locks array size 16
Feb 01 09:47:57 ifens224 kamailio[11130]: ERROR: <core> [core/receive.c:82]: ksr_route_locks_set_init(): failed to allocate route locks set
```
You haven't provided the output of the gdb commands I listed in my previous comment, I need to see what's there to be able to troubleshoot further.
Sorry, I overlooked it.
Output is:
```
(gdb) p ksr_route_locks_set
$1 = (rec_lock_set_t *) 0x7c00000100
(gdb) p *ksr_route_locks_set
Cannot access memory at address 0x7c00000100
```
That should be NULL (0x0) by default, so somewhere there is an invalid write/buffer overflow.
Can you just start Kamailio with your config, but do not send any SIP traffic to it -- then attach with gdb to one of the running processes and run the same commands as above, to see whether the overwrite happens at startup or at runtime?
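If the pointer is still NULL at startup, a watchpoint might catch the moment it gets overwritten later. A debugging sketch (assuming the overwrite happens in the process you attach to):
```
(gdb) watch ksr_route_locks_set
(gdb) continue
```
gdb should then stop at the code that writes to the pointer.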
Ok, I attached to a random child and got the expected output:
```
(gdb) p ksr_route_locks_set
$1 = (rec_lock_set_t *) 0x0
(gdb) p *ksr_route_locks_set
Cannot access memory at address 0x0
```
Strange. Can I check for some specific included module, or is there some other global config option that might affect this setting?
@miconda, I was able to isolate the problem to a single line of Python script. I did so by writing two sipp scenarios to simulate our Asterisk and the ISP, and then successively removing logic from the Kamailio script.
The segfault happens when I use this selector within `ksr_reply_route`:
`KSR.pv.get("$sel(@hf_value.Via)")`
A simpler `KSR.pv.get("$sel(via)")` works without crashes. I could surely omit the "hf_value" part to work around the problem, but it looks like some KEMI/Python binding problem. Using e.g. "hf_value.From" does not result in a crash either. I tried replies with multiple Via lines, and also multiple Vias in one header line, and could not find any difference in behavior.
As stated above, the segfault happens only after a certain number of packets (typically 14-20), but the pace of the messages seems to play a role as well.
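For illustration, the relevant part of my script is essentially just this (a minimal sketch, not the full test script; the commented `$hdr(Via)` line is another workaround I would expect to work, but have not tested):

```python
import KSR as KSR

# reply route as used in the failing test setup (sketch)
def ksr_reply_route(msg):
    # this select crashes kamailio after a number of replies:
    via = KSR.pv.get("$sel(@hf_value.Via)")
    # while the plain core select works fine:
    #   via = KSR.pv.get("$sel(via)")
    # untested alternative via the $hdr(...) pseudo-variable:
    #   via = KSR.pv.get("$hdr(Via)")
    KSR.info("top Via: " + str(via) + "\n")
    return 1
```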
Hope it helps!
Cheers, Thomas
Trying to understand how I can reproduce it in order to troubleshoot ...
So, I have to use `KSR.pv.get("$sel(@hf_value.Via)")` inside `ksr_reply_route()` of a Python (app_python) script? Then it crashes after a while of handling traffic (14-20 packets -- requests and replies, or only replies)?
At first thought, it should not be a KEMI export issue, but a `select` variable issue. In the code strictly related to KEMI, `KSR.pv.get("$sel(via)")` and `KSR.pv.get("$sel(@hf_value.Via)")` run the same code; but the latter ends up executing code from the textopsx module to get `$sel(@hf_value.Via)`, while I think `$sel(via)` is implemented in the core. The `select` variables were implemented by the SER developers while Kamailio was a separate project (during 2005-2008), and we got the code with the merge in 2008-2009. So, overall, I am not that familiar with that code, but it has to be fixed, and being able to reproduce it will help a lot.
Daniel, I will shrink the script as much as possible and then I can surely provide a full test case, including sipp scenarios. Give me a few days, though. For now I can say that this is the only place where I used an "@hf_value" selector. Maybe the segfault also occurs if one uses it in a request instead of a reply; that would simplify the test case greatly (e.g. one could then maybe use sipsak).
I prepared some files in a zip: https://drive.google.com/open?id=1qVap6BhbGpJwHL-Gf3aUCBOSSbekscs5, including instructions on how to reproduce. The Kamailio config is down to 15 lines (5 modules only) and the Python script is also bare bones, at about 30 lines.
It boils down to:
1. setup kamailio and sipp
1. run kamailio
1. start the "isp" sipp scenario in one shell
1. start the "asterisk" sipp scenario in another shell

(more details can be found in the zip)
I also did a few more checks. One initial assumption about the problem was wrong: it is reproducible with a single UDP socket as well, so it is not related to TCP as initially stated. I also tried to reproduce it with a native Kamailio config file but could not; that script is included in the zip file as well.
Good luck hunting it down :-)
Thanks for the instructions to reproduce the issue; they helped to identify the problem and fix it. The select framework requires a function to be executed in order to initialize some of its internal buffers, and that was not happening when executing KEMI scripts. I pushed fixes to the master branch and am going to backport them to the stable branches in the coming days.
If you want to test with the master branch right now, let me know the results on your side.
Closed #1829.
@miconda great! Glad I could help. I will try it out in a few days. See you at Kamailio World!
Cheers