It would be good to add some SIP keepalive monitoring (e.g., a cron job with sipsak sending OPTIONS requests) that will alert or restart Kamailio when there is no response. The monit tool can also send SIP keepalives and take action when there is no response.
On a deadlock, checking the process table is not enough. There should have been high CPU usage, though, if you monitored that.
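As a rough illustration of the keepalive idea, here is a minimal Python sketch of an OPTIONS probe over UDP. It is not how sipsak or monit implement their checks; the function names `build_options` and `sip_alive`, the `keepalive` user part, and the default timeout are all assumptions made for the example. Any SIP response, even an error status, is treated as proof that the SIP stack is still answering.

```python
import socket
import uuid


def build_options(host: str, port: int, local_ip: str, local_port: int) -> bytes:
    """Build a minimal SIP OPTIONS request for a UDP keepalive (hypothetical helper)."""
    branch = "z9hG4bK" + uuid.uuid4().hex  # RFC 3261 branch magic cookie
    lines = [
        f"OPTIONS sip:{host}:{port} SIP/2.0",
        f"Via: SIP/2.0/UDP {local_ip}:{local_port};branch={branch};rport",
        "Max-Forwards: 70",
        f"From: <sip:keepalive@{local_ip}>;tag={uuid.uuid4().hex[:8]}",
        f"To: <sip:{host}:{port}>",
        f"Call-ID: {uuid.uuid4().hex}@{local_ip}",
        "CSeq: 1 OPTIONS",
        "Content-Length: 0",
        "",  # the two empty entries produce the blank line
        "",  # that terminates the header section
    ]
    return "\r\n".join(lines).encode("ascii")


def sip_alive(host: str, port: int = 5060, timeout: float = 2.0) -> bool:
    """Return True if any SIP response arrives within `timeout` seconds."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.connect((host, port))  # fixes the peer and selects a local address
        local_ip, local_port = sock.getsockname()
        sock.send(build_options(host, port, local_ip, local_port))
        try:
            reply = sock.recv(4096)
        except OSError:  # timeout, ICMP port unreachable, etc.
            return False
    # Any status line counts: the point is that the SIP stack answered at all.
    return reply.startswith(b"SIP/2.0 ")
```

A cron wrapper could call `sip_alive("127.0.0.1")` and send an alert or restart the service after a few consecutive failures. Unlike a process-table check, this would also catch a proxy whose processes are alive but no longer answering.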
Cheers, Daniel
On 27/03/15 12:47, Alex Balashov wrote:
This was a rather peculiar crash:
From the logs, it would appear that Kamailio simply stopped processing messages at some point. There's about 8 minutes of zero log output at a time of constantly incoming traffic.
Eventually the situation was resolved when someone manually restarted Kamailio and all of its processes exited on a normal SIGTERM:
Mar 26 20:40:10 Proxy1 /usr/local/sbin/kamailio[27498]: NOTICE: <core> [main.c:739]: handle_sigs(): Thank you for flying kamailio!!!
Mar 26 20:40:10 Proxy1 /usr/local/sbin/kamailio[27535]: INFO: <core> [main.c:850]: sig_usr(): signal 15 received.
...
But there are a few things here that are difficult to explain from the log:
- Why was there no SIP stack response for 8 minutes, no logging activity, etc.?
- We have a script that checks every second whether Kamailio processes are running and restarts Kamailio if they are not. It also sends an e-mail informing us of that development.
It's a rather naive check:
ps aux | grep kamailio | grep -v 'grep kamailio' | wc -l
But in this case, the script was not triggered, which would imply that some Kamailio processes--perhaps all--remained running.
There is no indication in the logs that any process died for any reason, except for the 'signal 15' received by all processes at the time of manual restart.
- Why was a core dump generated at the time of the restart, if nothing crashed?
The third question is most interesting to me, because if it were some other problem, e.g. blocking of SIP worker threads for some reason, then I wouldn't expect a core dump upon service shutdown.
There is no other indication of any child process dying with SIGSEGV or SIGABRT.
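Since the symptom described above was several minutes of log silence while traffic kept arriving, that silence could itself serve as a crude hang signal alongside the process-table check. The following Python sketch assumes Kamailio's syslog output lands in a dedicated file and that traffic is steady enough for silence to be abnormal; the function names and the 60-second threshold are hypothetical.

```python
import os
import time


def seconds_since_last_write(path: str) -> float:
    """Seconds elapsed since `path` was last modified."""
    return time.time() - os.path.getmtime(path)


def log_is_silent(path: str, max_silence: float = 60.0) -> bool:
    """True if nothing was written to `path` for more than `max_silence` seconds."""
    return seconds_since_last_write(path) > max_silence
```

This only makes sense on a system with constant traffic and sufficiently verbose logging; on a quiet proxy it would raise false alarms.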
-- Alex
On 03/27/2015 06:17 AM, Alex Balashov wrote:
Hello,
The system experienced another crash yesterday, but unfortunately the core dump is not very insightful, possibly due to being incomplete:
BFD: Warning: /tmp/./core.kamailio.500.1427402410.27498 is truncated: expected core file size >= 8602058752, found: 1769852928.
[New Thread 27498]
Cannot access memory at address 0x7f52891e3168
Cannot access memory at address 0x7f52891e3168
Cannot access memory at address 0x7f52891e3168
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Failed to read a valid object file image from memory.
Core was generated by `/usr/local/sbin/kamailio -P /var/run/kamailio.pid -m 8192 -u evaristesys -g eva'.
Program terminated with signal 11, Segmentation fault.
#0 0x00007f5286d97e45 in ?? ()
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.149.el6_6.5.x86_64
(gdb) where
#0 0x00007f5286d97e45 in ?? ()
Cannot access memory at address 0x7fffbe32a210
That's not much help at all, so I cannot possibly say it is for the same reasons as before.