[SR-Users] Best practices for troubleshooting deadlocks?
Alex Balashov
abalashov at evaristesys.com
Tue Sep 29 19:40:29 CEST 2015
Hi,
Thanks very much to you and Ovidiu for the responses. I didn't mean to
leave this thread hanging. See inline:
On 09/28/2015 05:51 PM, Daniel-Constantin Mierla wrote:
> Were you pulling the backtraces based on the script you pasted in your
> previous email? That should be a good source of information to analyze
> what kamailio was doing.
Yes, although as yet I have not been able to actually get the operator
to run a backtrace at the time of the deadlock. It's a psychological and
political problem: they are so eager to restore service that they do not
have the discipline to run my debug script, and jump straight to
restarting Kamailio.
However, the biggest problem that I see is that if the backtraces reveal
something interesting, it may invite follow-up, e.g. examination of
other frames and values. That would require a core dump. Dumping core
for all 8-12 child processes would take several minutes, as the shm pool
is quite large (4 GB). This is a very high-volume installation. The
operator would never go for that.
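As a rough sanity check on the "several minutes" figure, here is a minimal
Python sketch of the arithmetic. It assumes that each core file includes the
full 4 GB shm mapping, treats that as 4 GiB, ignores each process's private
memory, and uses a purely hypothetical sustained write rate of 200 MiB/s;
none of these numbers were measured on the system in question.

```python
def estimate_dump_seconds(num_processes, shm_gib, write_mib_per_s):
    """Rough time to write one core file per process.

    Assumes every core file contains the whole shared memory pool and
    that the disk sustains write_mib_per_s (a hypothetical figure).
    """
    total_mib = num_processes * shm_gib * 1024
    return total_mib / write_mib_per_s


# 8 to 12 child processes, 4 GiB shm pool, assumed 200 MiB/s disk
for procs in (8, 12):
    secs = estimate_dump_seconds(procs, 4, 200)
    print(f"{procs} processes: about {secs / 60:.1f} minutes")
```

Under those assumptions the outage would be extended by roughly three to
four minutes, which is the cost the operator would have to accept.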
So, if I do get an intriguing backtrace, I don't really know what else
to do to elaborate.
> As I already said, if there is a mutex deadlock, it will also be
> noticeable as high cpu usage. Was that the case, or do you not have any
> access to cpu usage history?
I don't have CPU usage history, but I will try to get one next time this
happens.
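One way to collect that history without extra tooling is to sample
/proc/<pid>/stat for each Kamailio process. The following is a minimal
Python sketch, not a tested monitoring tool: the function names are made up
for illustration, and it assumes a Linux /proc filesystem where fields 14
and 15 of the stat line are utime and stime in clock ticks.

```python
import os
import time


def cpu_ticks(stat_line):
    """Return utime + stime (in clock ticks) from one /proc/<pid>/stat line."""
    # The command name (field 2) is parenthesised and may contain spaces,
    # so split after the last ')'. The remainder starts at field 3.
    rest = stat_line.rsplit(")", 1)[1].split()
    utime, stime = int(rest[11]), int(rest[12])  # fields 14 and 15
    return utime + stime


def cpu_percent(ticks_before, ticks_after, interval_s, ticks_per_s):
    """CPU usage over the interval, as a percentage of one core."""
    return 100.0 * (ticks_after - ticks_before) / (ticks_per_s * interval_s)


def sample(pids, interval_s=5.0):
    """Measure CPU usage of the given PIDs over interval_s seconds."""
    ticks_per_s = os.sysconf("SC_CLK_TCK")

    def read(pid):
        with open(f"/proc/{pid}/stat") as f:
            return cpu_ticks(f.read())

    before = {pid: read(pid) for pid in pids}
    time.sleep(interval_s)
    return {
        pid: cpu_percent(before[pid], read(pid), interval_s, ticks_per_s)
        for pid in pids
    }
```

Calling sample() periodically with the Kamailio child PIDs and appending the
result to a log with a timestamp would show whether processes were pinned
near 100% (consistent with spinning on a lock) or sitting idle at the time
of the incident.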
> If it is just no more sip message routing, but no high cpu usage, then:
>
> - maybe processes were blocked in a lengthy I/O operation (e.g., query
> to database)
That's certainly possible. The backtrace will surely reveal that.
> - maybe someone/something was resetting the network interface (the
> sockets were bound to previous address) -- e.g., it can be done by some
> upgrades of OS or dhcp
No, that definitely is not the case.
> - maybe some limits of OS were reached, the packets were filtered by
> kernel (if you have centos with selinux, be sure it is properly configured)
I am aware of CentOS's ridiculous default ulimits in CentOS 6.6, and all
of these have been appropriately set to infinity. SELinux is disabled.
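To confirm that the running processes actually inherited those limits, the
kernel's view can be read from /proc/<pid>/limits. Below is a minimal Python
sketch that lists every limit that is not "unlimited". The function name is
hypothetical, and the parsing assumes the usual column layout of that file
(columns separated by two or more spaces).

```python
import re


def finite_limits(limits_text):
    """Return {limit name: (soft, hard)} for limits that are not unlimited.

    limits_text is the content of /proc/<pid>/limits.
    """
    result = {}
    for line in limits_text.splitlines()[1:]:  # skip the header row
        cols = re.split(r"\s{2,}", line.strip())
        if len(cols) < 3:
            continue
        name, soft, hard = cols[0], cols[1], cols[2]
        if soft != "unlimited" or hard != "unlimited":
            result[name] = (soft, hard)
    return result


def finite_limits_for_pid(pid):
    """Read and filter the limits of a running process."""
    with open(f"/proc/{pid}/limits") as f:
        return finite_limits(f.read())
```

Some entries, such as the nice and realtime priority limits, are numeric by
nature, so the output is a list to review by hand rather than a pass/fail
check.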
I'll let you know what I find. Thanks for the input!
-- Alex
--
Alex Balashov | Principal | Evariste Systems LLC
303 Perimeter Center North, Suite 300
Atlanta, GA 30346
United States
Tel: +1-800-250-5920 (toll-free) / +1-678-954-0671 (direct)
Web: http://www.evaristesys.com/, http://www.csrpswitch.com/