A backtrace should provide enough information to know where to look
for issues, and collecting one should not take long.
Maybe you can use monit to monitor the CPU and, when the check fails,
run 'kamctl trap' to get the backtrace:
if cpu is greater than 50% for 5 cycles then exec "/usr/sbin/kamctl trap"
Make sure that you have the debuginfo rpm installed, so the backtraces
have symbols.
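A monit stanza along these lines should do it -- a rough sketch only, and
the pidfile and init script paths are just guesses for a typical install:

    check process kamailio with pidfile /var/run/kamailio/kamailio.pid
        start program = "/etc/init.d/kamailio start"
        stop program  = "/etc/init.d/kamailio stop"
        # grab backtraces from all kamailio processes when CPU stays high
        if cpu > 50% for 5 cycles then exec "/usr/sbin/kamctl trap"

The exec action only fires when the condition matches, so it will not
interfere with normal operation.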
-ovidiu
On Tue, Sep 29, 2015 at 1:40 PM, Alex Balashov
<abalashov(a)evaristesys.com> wrote:
Hi,
Thanks very much to you and Ovidiu for the responses. I didn't mean to leave
this thread hanging. See inline:
On 09/28/2015 05:51 PM, Daniel-Constantin Mierla wrote:
Were you pulling the backtraces based on the script you pasted in your
previous email? That should be a good source of information for
analyzing what Kamailio was doing.
Yes, although as yet I have not been able to actually get the operator to
run a backtrace at the time of the deadlock. It's a psychological and
political problem: they are so eager to restore service that they do not
have the discipline to run my debug script, and jump straight to restarting
Kamailio.
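For what it's worth, I am thinking of reducing it to a single wrapper they
can run before the restart -- a rough sketch, assuming gdb and the debuginfo
packages are present (paths are illustrative):

    #!/bin/sh
    # Grab 'bt full' from every kamailio process before the restart --
    # essentially what 'kamctl trap' does, but writing to a known location.
    OUT=/var/tmp/kamailio-debug-$(date +%Y%m%d-%H%M%S)
    mkdir -p "$OUT"
    for pid in $(pidof kamailio); do
        gdb -batch -p "$pid" -ex 'bt full' > "$OUT/bt.$pid.txt" 2>&1
    done
    echo "Backtraces saved in $OUT -- OK to restart Kamailio now."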
However, the biggest problem that I see is that if the backtraces reveal
something interesting, it may invite follow-up, e.g. examination of other
frames and values. That would require a core dump. Dumping core for all 8-12
child processes would take several minutes, as the shm pool is quite large
(4 GB). This is a very high-volume installation. The operator would never go
for that.
So, if I do get an intriguing backtrace, I don't really know what else to do
to elaborate.
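One middle ground I may propose: rather than dumping all 8-12 children,
gcore only the one or two processes whose backtrace looks suspicious. Since
the shm pool is mapped into every child, a single dump should already
contain it. Roughly (the PID is a placeholder):

    # dump a single interesting child; writes /var/tmp/kamailio.core.<pid>
    gcore -o /var/tmp/kamailio.core 12345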
As I already said, if there is a mutex deadlock, it will also be
noticeable as high CPU usage. Was that the case, or do you not have any
access to CPU usage history?
I don't have CPU usage history, but I will try to get one next time this
happens.
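In the meantime, I may leave something crude running so there is
per-process CPU history for the next incident -- a sketch, assuming the
sysstat package is installed (log path and interval are arbitrary):

    # log CPU usage of all kamailio processes every 5 seconds
    nohup pidstat -u -p "$(pidof kamailio | tr ' ' ',')" 5 \
        >> /var/log/kamailio-cpu.log 2>&1 &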
If it is just that SIP messages are no longer being routed, but there
is no high CPU usage, then:
- maybe processes were blocked in a lengthy I/O operation (e.g., a query
to the database)
That's certainly possible. The backtrace will surely reveal that.
- maybe someone/something was resetting the network interface (leaving
the sockets bound to the previous address) -- e.g., this can be done by
some OS upgrades or by dhcp
No, that definitely is not the case.
- maybe some OS limits were reached and packets were filtered by the
kernel (if you have CentOS with SELinux, be sure it is properly
configured)
I am aware of the ridiculously low default ulimits in CentOS 6.6, and all
of these have been raised appropriately (set to unlimited). SELinux is
disabled.
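For completeness, this is roughly how I check it against the running
processes rather than just the login shell (I expect 'unlimited' or large
values, and sestatus reporting disabled):

    for pid in $(pidof kamailio); do
        grep -E 'open files|core file size' /proc/$pid/limits
    done
    cat /proc/sys/fs/file-nr    # system-wide open-file usage vs. limit
    sestatus                    # confirm SELinux really is off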
I'll let you know what I find. Thanks for the input!
-- Alex
--
Alex Balashov | Principal | Evariste Systems LLC
303 Perimeter Center North, Suite 300
Atlanta, GA 30346
United States
Tel: +1-800-250-5920 (toll-free) / +1-678-954-0671 (direct)
Web: http://www.evaristesys.com/, http://www.csrpswitch.com/
--
VoIP Embedded, Inc.