Hello,
Very occasionally, we encounter what appear to be deadlocks in all UDP receiver threads. All Kamailio processes are running, but no SIP messages are being processed.
On one of our high-volume installations, this happens extremely infrequently -- maybe once every month or two. On these occasions, the operator restarts the proxy before we get a chance to go in and figure out what's going on.
So, I'm trying to provide the operator with a procedure to execute prior to restarting the proxy on these occasions, so that we can see a snapshot of where the receiver threads are stuck. As far as I can tell, unless Kamailio itself segfaults, there's no specific PID that one can attach GDB to in order to get an overhead snapshot of all the child processes.
Here's what I came up with:
---------------------------------------------
#!/bin/bash

kamcmd -s /tmp/kamailio_ctl ps > thread_log.txt
echo >> thread_log.txt

while read PID; do
    gdb --pid=$PID <<EOF >> thread_log.txt
set print elements 0
thread apply all bt full
generate-core-file
detach
EOF
done < <(kamcmd -s /tmp/kamailio_ctl ps | grep 'udp receiver' | awk '{print $1}')
---------------------------------------------
As far as I can tell, this should give me the most ample visibility into the state of the threads, with further core dumps to inspect if follow-up is needed. Hopefully this will result in some fixes back to the project.
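One refinement I may add (untested; the /var/tmp path is just an example) is naming each core explicitly, so the dumps don't all land as core.<PID> in whatever directory the script happens to run from:

---------------------------------------------
#!/bin/bash

while read PID; do
    gdb --pid=$PID <<EOF >> thread_log.txt
set print elements 0
thread apply all bt full
generate-core-file /var/tmp/kamailio_core.$PID
detach
EOF
done < <(kamcmd -s /tmp/kamailio_ctl ps | grep 'udp receiver' | awk '{print $1}')
---------------------------------------------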
However, if there are any other suggestions for information to grab in such a scenario, I'm all ears.
Thanks in advance!
-- Alex
We just encountered another one of these famed deadlocks. Any suggestions for how to analyse them beyond what I've already trotted out here?
There is 'kamctl trap', which takes a backtrace of all Kamailio processes, similar to what your script does. Use top to identify which processes are locked (100% CPU utilization), and after that ... code inspection.
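For example, something along these lines when the problem is observed (just a sketch; the file names are arbitrary):

---------------------------------------------
# snapshot per-process CPU so the spinning processes are visible afterwards
top -b -n 1 > /tmp/kamailio_top.$(date +%s).txt

# take backtraces of all kamailio processes
kamctl trap
---------------------------------------------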
-ovidiu
Were you pulling the backtraces based on the script you pasted in your previous email? That should be a good source of information for analyzing what Kamailio was doing.

As I already said, if there is a mutex deadlock, it will also be noticeable as high CPU usage. Was that the case, or do you not have access to any CPU usage history?
If it is just that no more SIP messages are being routed, but there is no high CPU usage, then:

- maybe processes were blocked in a lengthy I/O operation (e.g., a query to the database)
- maybe someone/something was resetting the network interface (so the sockets were still bound to the previous address) -- e.g., this can be done by some OS upgrades or by DHCP
- maybe some OS limits were reached and the packets were being filtered by the kernel (if you have CentOS with SELinux, be sure it is properly configured) -- a couple of quick checks for these are sketched below
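For example (just a sketch, to be adapted to the actual setup):

---------------------------------------------
# which local addresses the UDP listeners are bound to right now
netstat -ulnp | grep kamailio

# kernel-level UDP receive errors / dropped packets
netstat -su

# conntrack exhaustion shows up as "table full, dropping packet"
dmesg | grep -i conntrack | tail

# SELinux state
getenforce
---------------------------------------------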
Cheers, Daniel
Hi,
Thanks very much to you and Ovidiu for the responses. I didn't mean to leave this thread hanging. See inline:
On 09/28/2015 05:51 PM, Daniel-Constantin Mierla wrote:
> Were you pulling the backtraces based on the script you pasted in your previous email? That should be a good source of information for analyzing what Kamailio was doing.
Yes, although as yet I have not been able to actually get the operator to run a backtrace at the time of the deadlock. It's a psychological and political problem: they are so eager to restore service that they do not have the discipline to run my debug script, and jump straight to restarting Kamailio.
However, the biggest problem that I see is that if the backtraces reveal something interesting, it may invite follow-up, e.g. examination of other frames and values. That would require a core dump. Dumping core for all 8-12 child processes would take several minutes, as the shm pool is quite large (4 GB). This is a very high-volume installation. The operator would never go for that.
So, if I do get an intriguing backtrace, I don't really know what else to do to elaborate.
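One idea I have not tested yet: telling the kernel to leave the shared memory out of the dumps via /proc/<PID>/coredump_filter. As I understand it, clearing the shared-mapping bit keeps each process's private memory in the core but skips the 4 GB shm pool, which should shrink the dumps dramatically; the catch is that gdb's generate-core-file only honors the filter in fairly recent gdb versions.

---------------------------------------------
# Untested sketch: exclude shared mappings from the receivers' core dumps.
# coredump_filter bits: 0=anon private, 1=anon shared, 2=file-backed private,
# 3=file-backed shared, 4=ELF headers, ... (the default is usually 0x33).
for PID in $(kamcmd -s /tmp/kamailio_ctl ps | grep 'udp receiver' | awk '{print $1}'); do
    echo 0x31 > /proc/$PID/coredump_filter
done
---------------------------------------------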
> As I already said, if there is a mutex deadlock, it will also be noticeable as high CPU usage. Was that the case, or do you not have access to any CPU usage history?
I don't have CPU usage history, but I will try to get one next time this happens.
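In the meantime, I will probably leave something like this running so there is a record next time (a sketch; it assumes the sysstat package for pidstat, the log path is arbitrary, and the PID list goes stale if Kamailio is restarted):

---------------------------------------------
nohup pidstat -p $(pgrep -d, kamailio) 60 >> /var/log/kamailio_cpu.log &
---------------------------------------------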
> If it is just that no more SIP messages are being routed, but there is no high CPU usage, then:
> - maybe processes were blocked in a lengthy I/O operation (e.g., a query to the database)
That's certainly possible. The backtrace will surely reveal that.
> - maybe someone/something was resetting the network interface (so the sockets were still bound to the previous address) -- e.g., this can be done by some OS upgrades or by DHCP
No, that definitely is not the case.
> - maybe some OS limits were reached and the packets were being filtered by the kernel (if you have CentOS with SELinux, be sure it is properly configured)
I am aware of the ridiculous default ulimits in CentOS 6.6, and all of these have been appropriately raised to unlimited. SELinux is disabled.
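For the record, this is roughly how I verify that against the running processes (sketch):

---------------------------------------------
for PID in $(pgrep kamailio); do
    echo "== $PID =="
    egrep 'open files|core file|address space' /proc/$PID/limits
done
getenforce   # should print "Disabled"
---------------------------------------------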
I'll let you know what I find. Thanks for the input!
-- Alex
A backtrace should provide enough information to know where to look for issues, and taking one should not take long. Maybe you can use monit to monitor the CPU and, when the check fails, run 'kamctl trap' to get the backtrace:

if cpu is greater than 50% for 5 cycles then exec "/usr/sbin/kamctl trap"

Make sure that you have the debug RPM installed.
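If monit is not available, a rough cron-able equivalent could look like this (untested sketch; the awk column assumes the default top layout, and the kamctl path may differ on your system):

---------------------------------------------
#!/bin/bash
# sum the %CPU of all kamailio processes and take a trap when it is high
BUSY=$(top -b -n 1 | awk '/kamailio/ { sum += $9 } END { print int(sum) }')
if [ "${BUSY:-0}" -gt 50 ]; then
    /usr/sbin/kamctl trap
fi
---------------------------------------------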
-ovidiu