[SR-Users] Best practices for troubleshooting deadlocks?

Alex Balashov abalashov at evaristesys.com
Tue Sep 29 19:40:29 CEST 2015


Hi,

Thanks very much to you and Ovidiu for the responses. I didn't mean to 
leave this thread hanging. See inline:

On 09/28/2015 05:51 PM, Daniel-Constantin Mierla wrote:

> Were you pulling the backtraces based on the script you pasted in your
> previous email? That should be a good source of information for analyzing
> what kamailio was doing.

Yes, although as yet I have not been able to actually get the operator 
to run a backtrace at the time of the deadlock. It's a psychological and 
political problem: they are so eager to restore service that they do not 
have the discipline to run my debug script, and jump straight to 
restarting Kamailio.

However, the biggest problem that I see is that if the backtraces reveal 
something interesting, it may invite follow-up, e.g. examination of 
other frames and values. That would require a core dump. Dumping core 
for all 8-12 child processes would take several minutes, as the shm pool 
is quite large (4 GB). This is a very high-volume installation. The 
operator would never go for that.

So, if I do get an intriguing backtrace, I don't really know how to dig 
any further without a core dump.

> As I already said, if there is a mutex deadlock, it will also be noticeable
> through high CPU usage. Was that the case, or do you not have any access
> to CPU usage history?

I don't have CPU usage history, but I will try to get one next time this 
happens.
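
Where no monitoring history exists, per-process CPU usage can be sampled 
directly from `/proc/<pid>/stat` on Linux. This is a rough sketch, not an 
existing tool: the function names are hypothetical, and it assumes the 
standard procfs layout in which `utime` and `stime` are fields 14 and 15.

```python
import os
import time


def parse_cpu_ticks(stat_line: str) -> int:
    """Return utime + stime (in clock ticks) from a /proc/<pid>/stat line."""
    # The command name sits in parentheses and may itself contain spaces
    # or parentheses, so split after the LAST ')'.
    rest = stat_line.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state), so fields 14 and 15 are rest[11], rest[12].
    return int(rest[11]) + int(rest[12])


def read_cpu_ticks(pid: int) -> int:
    """Read the current CPU tick count of a live process."""
    with open(f"/proc/{pid}/stat") as f:
        return parse_cpu_ticks(f.read())


def cpu_percent(pid: int, interval_s: float = 1.0) -> float:
    """Approximate CPU usage of one process over a short interval."""
    ticks_per_second = os.sysconf("SC_CLK_TCK")
    before = read_cpu_ticks(pid)
    time.sleep(interval_s)
    after = read_cpu_ticks(pid)
    return 100.0 * (after - before) / (ticks_per_second * interval_s)
```

Logging `cpu_percent()` for each child just before a restart would show 
whether the processes were spinning or sitting idle.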

> If it is just no more SIP message routing, but no high CPU usage, then:
>
> - maybe processes were blocked in a lengthy I/O operation (e.g., a query
> to the database)

That's certainly possible. The backtrace will surely reveal that.

> - maybe someone/something was resetting the network interface (the
> sockets were bound to previous address) -- e.g., it can be done by some
> upgrades of OS or dhcp

No, that definitely is not the case.

> - maybe some OS limits were reached, or the packets were filtered by the
> kernel (if you have CentOS with SELinux, be sure it is properly configured)

I am aware of CentOS's ridiculous default ulimits in CentOS 6.6, and all 
of these have been appropriately set to infinity. SELinux is disabled.
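
Limits configured in a shell or init script do not always reach the 
running daemon, so it can be worth confirming what the live processes 
actually have. A minimal sketch, assuming Linux's `/proc/<pid>/limits` 
file and using hypothetical function names:

```python
import re
from typing import Dict, Tuple


def parse_limits(text: str) -> Dict[str, Tuple[str, str]]:
    """Map each limit name to its (soft, hard) values, kept as strings."""
    limits: Dict[str, Tuple[str, str]] = {}
    for line in text.splitlines()[1:]:  # skip the header row
        # Columns are padded with runs of spaces; limit names contain
        # only single spaces, so split on two or more.
        cols = re.split(r"\s{2,}", line.strip())
        if len(cols) >= 3:
            limits[cols[0]] = (cols[1], cols[2])
    return limits


def read_limits(pid: int) -> Dict[str, Tuple[str, str]]:
    """Read the effective resource limits of a running process."""
    with open(f"/proc/{pid}/limits") as f:
        return parse_limits(f.read())
```

For example, `read_limits(pid)["Max open files"]` returns the soft and 
hard file-descriptor limits in effect for that process; an unbounded 
limit appears as the string `unlimited`.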

I'll let you know what I find. Thanks for the input!

-- Alex

-- 
Alex Balashov | Principal | Evariste Systems LLC
303 Perimeter Center North, Suite 300
Atlanta, GA 30346
United States

Tel: +1-800-250-5920 (toll-free) / +1-678-954-0671 (direct)
Web: http://www.evaristesys.com/, http://www.csrpswitch.com/


