Today we had an incident where SER (0.9.4) child processes consumed all
the CPU on one of our servers.
top showed:
  PID USER  PRI  NI  SIZE   RSS SHARE STAT  %CPU  %MEM  TIME  CPU COMMAND
17925 root   25   0  5644  5644  3888 R     25.5   0.2  6:26    1 ser
17929 root   25   0  5672  5672  3880 R     24.7   0.2  6:48    0 ser
17928 root   25   0  5688  5688  3872 R     24.3   0.2  6:25    1 ser
17933 root   25   0  4540  4540  3740 R     22.8   0.2  6:00    0 ser
And:
# ps -Al | grep ser
1 S 0 17901 1 0 85 0 - 14200 pause ? 00:00:00 ser
1 S 0 17916 17901 0 75 0 - 14200 pipe_w ? 00:00:00 ser
1 S 0 17917 17901 0 75 0 - 14418 schedu ? 00:00:22 ser
1 S 0 17918 17901 0 75 0 - 14422 schedu ? 00:00:23 ser
1 S 0 17919 17901 0 75 0 - 14423 schedu ? 00:00:24 ser
1 S 0 17920 17901 0 75 0 - 14447 schedu ? 00:00:22 ser
1 S 0 17921 17901 0 75 0 - 14421 schedu ? 00:00:22 ser
1 S 0 17922 17901 0 75 0 - 14424 schedu ? 00:00:22 ser
1 S 0 17923 17901 0 75 0 - 14428 schedu ? 00:00:21 ser
1 S 0 17924 17901 0 75 0 - 14424 schedu ? 00:00:22 ser
1 R 0 17925 17901 0 85 0 - 14448 - ? 00:06:22 ser
1 S 0 17926 17901 0 75 0 - 14457 schedu ? 00:00:49 ser
1 S 0 17927 17901 0 75 0 - 14453 schedu ? 00:00:50 ser
1 R 0 17928 17901 0 85 0 - 14477 - ? 00:06:20 ser
1 R 0 17929 17901 0 85 0 - 14455 - ? 00:06:44 ser
1 S 0 17930 17901 0 75 0 - 14452 schedu ? 00:00:50 ser
1 S 0 17931 17901 0 75 0 - 14448 schedu ? 00:00:50 ser
1 S 0 17932 17901 0 76 0 - 14448 schedu ? 00:00:49 ser
1 R 0 17933 17901 0 85 0 - 14235 - ? 00:05:55 ser
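As you can see, 4 children (PIDs 17925, 17928, 17929 and 17933) are in
the R (running) state with no wait channel, while the rest are sleeping.
It looks like those 4 dropped out of the scheduler.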
The only suspicious thing is that RTPProxy became unresponsive around
that time. At least that's the only thing the log shows:
Nov 22 15:56:17 /usr/local/sbin/ser[17931]: ERROR: send_rtpp_command:
timeout waiting reply from a RTP proxy
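
In case it helps anyone repeat the check, here is a small sketch that
pulls the busy (R state) ser children out of the ps listing. It assumes
the `ps -Al` column layout shown above (state in column 2, PID in
column 4, command name in the last column). The function name
running_ser_pids is made up for illustration.

```shell
# Print the PIDs of ser processes that are in the R (running) state.
# Assumes `ps -Al` style input on stdin:
#   $2 = process state, $4 = PID, last field = command name.
running_ser_pids() {
    awk '$NF == "ser" && $2 == "R" { print $4 }'
}

# Typical use (not run here): ps -Al | running_ser_pids
```

Run against the listing above, this should print the same four PIDs
that top reported.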
Any idea why these 4 children dropped out? Any hints on how to
troubleshoot this?
Thanks,
--
Andres
Network Admin
http://www.telesip.net