Hi,
we have a machine running 16 Kamailio instances, and while upgrading to 4.4.1 (from 4.3.5), 8 of them wouldn't start. When downgrading to 4.3.5, they all start again.
All of them have nearly identical configuration files, differing only in IPs, ports, and some code enabled or disabled via defines. After comparing a working and a non-working configuration and adjusting them setting by setting, we finally ended up with a working configuration. The difference is that Kamailio won't start when a particular part of the code DOES NOT get included. If it gets included, it starts.
This is the relevant part of our main route:
#!ifdef ENABLE_INV_RATELIMIT
    # Check for INVITE limit
    if (is_method("INVITE") && $au == $null && !($ua =~ "sipgate") ) {
        $var(invcount) = $shtcn(invcount=>%~$fU);
        xlog("L_INFO", "INVITE Requests from $fU in last 30 seconds: $var(invcount)\n");

        if ($var(invcount) < 12) {
            $var(uniqcid) = $ci + $Ts + $ft;
            $var(tkey) = $fU + '-' + $(var(uniqcid){s.md5}{s.substr,0,10});
            $sht(invcount=>$(var(tkey))) = 1;
            $var(uniqcid) = $null;
            $var(tkey) = $null;
        }

        if ($var(invcount) > 10) {
            if ($var(invcount) == 11 ) {
                xlog("L_NOTICE", "User $fU ($var(domain2use)) over ratelimit for new calls, rejecting.\n");
            }
            # Enable this only after evaluating the impact!
            append_to_reply("Retry-After: 30\r\n");
            sl_send_reply("503", "Call Rate Limit Exceeded");
            exit;
        }
    }
#!endif
If we put this line at the top of the configuration file, everything works:
#!define ENABLE_INV_RATELIMIT
If we delete this line, startup fails. The process just sits in ps for one minute without forking and then gets terminated.
We enabled a bit of debugging, and this is apparently the error causing Kamailio to shut down:
May 25 14:50:15 kammel /usr/sbin/kamailio[24989]: DEBUG: <core> [sr_module.c:920]: init_mod_child(): rank 53: nathelper
May 25 14:50:15 kammel /usr/sbin/kamailio[24987]: DEBUG: <core> [local_timer.c:61]: init_local_timer(): timer_list between 0x9f0428 and 0xa34428
May 25 14:50:15 kammel /usr/sbin/kamailio[24987]: DEBUG: <core> [io_wait.h:376]: io_watch_add(): DBG: io_watch_add(0x9f0240, 82, 1, (nil)), fd_no=0
May 25 14:50:15 kammel /usr/sbin/kamailio[24987]: ERROR: <core> [io_wait.h:459]: io_watch_add(): epoll_ctl failed: Bad file descriptor [9]
May 25 14:50:15 kammel /usr/sbin/kamailio[24987]: CRITICAL: <core> [tcp_read.c:1747]: tcp_receive_loop(): failed to add tcp main socket to the fd list
May 25 14:50:15 kammel /usr/sbin/kamailio[24987]: CRITICAL: <core> [tcp_read.c:1815]: tcp_receive_loop(): exiting...
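For reference, "Bad file descriptor [9]" is errno 9 (EBADF), which epoll_ctl returns when the descriptor it is asked to add is not a valid open descriptor in the calling process. The following is only a minimal Python sketch of that error condition on Linux. It is not Kamailio code, and the function name epoll_add_closed_fd is made up for illustration. It says nothing about why the descriptor (82 in the log above) was invalid in that TCP worker.

```python
import errno
import os
import select


def epoll_add_closed_fd():
    """Try to add an already-closed descriptor to an epoll set.

    Returns the errno of the failure, or None if the call succeeded.
    """
    # Create the epoll instance first so that its own descriptor cannot
    # reuse the number of the pipe end we are about to close.
    ep = select.epoll()
    read_fd, write_fd = os.pipe()
    os.close(read_fd)  # read_fd no longer refers to an open file
    try:
        # Internally this is epoll_ctl(epfd, EPOLL_CTL_ADD, read_fd, ...)
        ep.register(read_fd, select.EPOLLIN)
    except OSError as exc:
        return exc.errno
    finally:
        ep.close()
        os.close(write_fd)
    return None


if __name__ == "__main__":
    err = epoll_add_closed_fd()
    print(err, errno.errorcode.get(err), os.strerror(err))
```

The same errno would show up if a process tried to watch a descriptor number that it never inherited or that was closed before the call.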
I have no idea how this part of the code could lead to this error, but it is reproducible: at least on this system, setting or removing this define fixes or breaks the startup.
Does anybody have an idea what's happening there?
Best Regards, Sebastian
Hello,
so it is not a crash, right? No core dump or segfault report, it just doesn't start. Did I get that correctly?
Given that you run a lot of instances, maybe you are running out of file descriptors. Can you check the OS limits for them?
Also, running out of memory might result in such behaviour.
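One way to check the descriptor limit of a running instance on Linux is to read /proc/<pid>/limits and count the entries in /proc/<pid>/fd. The sketch below assumes a Linux /proc filesystem and enough privileges to read another process's entries. The helper name fd_usage is made up for illustration.

```python
import os


def fd_usage(pid="self"):
    """Return (open_fds, soft_limit, hard_limit) for a process.

    A limit is None when the kernel reports it as "unlimited".
    The open_fds count is approximate when pid is "self", because
    listing the directory briefly opens a descriptor of its own.
    """
    soft = hard = None
    with open(f"/proc/{pid}/limits") as fh:
        for line in fh:
            # Example line: "Max open files   1024   4096   files"
            if line.startswith("Max open files"):
                fields = line.split()
                soft = int(fields[3]) if fields[3].isdigit() else None
                hard = int(fields[4]) if fields[4].isdigit() else None
                break
    open_fds = len(os.listdir(f"/proc/{pid}/fd"))
    return open_fds, soft, hard


if __name__ == "__main__":
    import sys

    target = sys.argv[1] if len(sys.argv) > 1 else "self"
    used, soft, hard = fd_usage(target)
    print(f"pid {target}: {used} open, soft limit {soft}, hard limit {hard}")
```

Running it once per Kamailio PID shows whether any instance is close to its soft limit. Note that the limit applies per process, so the number of instances on the machine matters only for system-wide limits.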
Cheers, Daniel
On 25/05/16 15:03, Sebastian Damm wrote:
[...]
Hi,
it is definitely not a memory or open-files issue. I stopped all processes and tried to start one of the faulty instances by itself, without luck.
Actually, there are core files. I don't know whether they say something interesting, but I just saw that each start attempt produced a core file.
This is the backtrace of one of them:
root@kammel:~# gdb /usr/sbin/kamailio /var/cores/core_kamailio_24894_6_110_1464180675
GNU gdb (GDB) 7.4.1-debian
Copyright (C) 2012 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
For bug reporting instructions, please see:
http://www.gnu.org/software/gdb/bugs/...
Reading symbols from /usr/sbin/kamailio...Reading symbols from /usr/lib/debug/.build-id/0e/a2651a480b84540fc37d4d0d640aa3a4db078c.debug...done.
done.
[New LWP 24894]

warning: Can't read pathname for load map: Input/output error.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/usr/sbin/kamailio -f /etc/kamailio/kamailio_sip_lb_sipconnect_de_v6_1.cfg -P /'.
Program terminated with signal 6, Aborted.
#0  0x00007fef1812b125 in raise () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) bt
#0  0x00007fef1812b125 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007fef1812e3a0 in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x00000000004ab7d9 in sig_alarm_abort (signo=<optimized out>) at main.c:649
#3  <signal handler called>
#4  0x00007fef181d3b77 in syscall () from /lib/x86_64-linux-gnu/libc.so.6
#5  0x00007fef0fc69d50 in futex_get (lock=<optimized out>) at ../../parser/../mem/../futexlock.h:108
#6  mod_destroy () at rtpengine.c:1970
#7  0x0000000000580092 in destroy_modules () at sr_module.c:811
#8  0x00000000004ac407 in cleanup (show_status=show_status@entry=1) at main.c:524
#9  0x00000000004acf4f in shutdown_children (show_status=1, sig=15) at main.c:666
#10 0x00000000004adac7 in handle_sigs () at main.c:758
#11 0x00000000004b22e6 in main_loop () at main.c:1733
#12 0x0000000000427e2b in main (argc=<optimized out>, argv=<optimized out>) at main.c:2616
Best Regards, Sebastian
Hello,
the backtrace is from the shutdown cleanup, which was triggered because Kamailio couldn't start for some other reason, so it doesn't provide any useful information. Do all the cores give a similar backtrace?
Can you try with '-x qm', just to see whether there is a memory overwriting issue around?
Cheers, Daniel
On 25/05/16 15:42, Sebastian Damm wrote:
[...]
After examining this issue a bit more, I'm even more confused.
Findings:
* When changing the memory manager to qm, Kamailio starts. I tried three times with fm and three times with qm: three failures with fm, three successes with qm.
* When running with the fm memory manager, I can get it running by adding these lines basically anywhere in the main route:

    if (is_method("INVITE")) {
        xlog("L_INFO", "Hello World\n");
    }

Again, I tried it three times.
Do you have any idea how to debug this any further? I'm all out of ideas. And it was running without problems under 4.3.
I do have two strace outputs from a successful and a failed start, which I could send you if you want to look into this.
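To narrow down where the two traces diverge, it can help to strip the parts that differ on every run (PIDs, timestamps, pointer values) before diffing. The sketch below is one rough way to do that. It assumes the traces were written with something like strace -f -o <file>, so that each line starts with a PID, and the function names are made up for illustration.

```python
import difflib
import re
import sys

# Leading "[pid 1234] " or "1234 " as written by strace -f
_PID_PREFIX = re.compile(r"^(\[pid\s+\d+\]\s+|\d+\s+)")
# Optional wall-clock timestamp as written by strace -t / -tt
_TIMESTAMP = re.compile(r"^\d{2}:\d{2}:\d{2}(\.\d+)?\s+")
# Pointer values and other hex numbers that change between runs
_HEX = re.compile(r"0x[0-9a-fA-F]+")


def normalize(line):
    """Remove run-specific noise from one strace line."""
    line = _PID_PREFIX.sub("", line.rstrip("\n"))
    line = _TIMESTAMP.sub("", line)
    return _HEX.sub("0xADDR", line)


def diff_traces(good_lines, bad_lines):
    """Return a unified diff of two normalized strace outputs."""
    good = [normalize(line) for line in good_lines]
    bad = [normalize(line) for line in bad_lines]
    return list(difflib.unified_diff(good, bad, "good", "bad",
                                     lineterm="", n=1))


if __name__ == "__main__":
    # Usage: python this_script.py <good-trace> <bad-trace>
    with open(sys.argv[1]) as good_fh, open(sys.argv[2]) as bad_fh:
        for out_line in diff_traces(good_fh.readlines(), bad_fh.readlines()):
            print(out_line)
```

Because Kamailio forks many children, the interleaving of lines will differ between runs. Filtering each trace to a single process first (for example the TCP worker that logs the epoll_ctl failure) should give a much more readable diff.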
Best Regards, Sebastian