Today we had an incident where SER (0.9.4) children drained all the CPUs of one of our servers.
Top Showed: PID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM TIME CPU COMMAND 17925 root 25 0 5644 5644 3888 R 25.5 0.2 6:26 1 ser 17929 root 25 0 5672 5672 3880 R 24.7 0.2 6:48 0 ser 17928 root 25 0 5688 5688 3872 R 24.3 0.2 6:25 1 ser 17933 root 25 0 4540 4540 3740 R 22.8 0.2 6:00 0 ser
And .. # ps -Al | grep ser 1 S 0 17901 1 0 85 0 - 14200 pause ? 00:00:00 ser 1 S 0 17916 17901 0 75 0 - 14200 pipe_w ? 00:00:00 ser 1 S 0 17917 17901 0 75 0 - 14418 schedu ? 00:00:22 ser 1 S 0 17918 17901 0 75 0 - 14422 schedu ? 00:00:23 ser 1 S 0 17919 17901 0 75 0 - 14423 schedu ? 00:00:24 ser 1 S 0 17920 17901 0 75 0 - 14447 schedu ? 00:00:22 ser 1 S 0 17921 17901 0 75 0 - 14421 schedu ? 00:00:22 ser 1 S 0 17922 17901 0 75 0 - 14424 schedu ? 00:00:22 ser 1 S 0 17923 17901 0 75 0 - 14428 schedu ? 00:00:21 ser 1 S 0 17924 17901 0 75 0 - 14424 schedu ? 00:00:22 ser 1 R 0 17925 17901 0 85 0 - 14448 - ? 00:06:22 ser 1 S 0 17926 17901 0 75 0 - 14457 schedu ? 00:00:49 ser 1 S 0 17927 17901 0 75 0 - 14453 schedu ? 00:00:50 ser 1 R 0 17928 17901 0 85 0 - 14477 - ? 00:06:20 ser 1 R 0 17929 17901 0 85 0 - 14455 - ? 00:06:44 ser 1 S 0 17930 17901 0 75 0 - 14452 schedu ? 00:00:50 ser 1 S 0 17931 17901 0 75 0 - 14448 schedu ? 00:00:50 ser 1 S 0 17932 17901 0 76 0 - 14448 schedu ? 00:00:49 ser 1 R 0 17933 17901 0 85 0 - 14235 - ? 00:05:55 ser
As you can see it looks like 4 children dropped out of the scheduler. The only thing suspicious is that RTPProxy became non-responsive around that time. At least thats the only thing the log shows: Nov 22 15:56:17 /usr/local/sbin/ser[17931]: ERROR: send_rtpp_command: timeout waiting reply from a RTP proxy
Any idea why these 4 children dropped out? Any hints on how to troubleshoot this?
Thanks,
Andres, I believe I remember this issue being brought up a while back. If I remember correctly, ser children locked up when communicating to a locked rtpproxy over socket interface. The "solution" was to use udp over loopback to communicate as this would fail that specific call, but not lock the ser process. g-)
----- Original Message ----- From: "Andres" andres@telesip.net To: serusers@lists.iptel.org Sent: Tuesday, November 22, 2005 10:19 PM Subject: [Serusers] SER Children Misbehaving
Today we had an incident where SER (0.9.4) children drained all the CPUs of one of our servers. Top Showed: PID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM TIME CPU COMMAND 17925 root 25 0 5644 5644 3888 R 25.5 0.2 6:26 1 ser 17929 root 25 0 5672 5672 3880 R 24.7 0.2 6:48 0 ser 17928 root 25 0 5688 5688 3872 R 24.3 0.2 6:25 1 ser 17933 root 25 0 4540 4540 3740 R 22.8 0.2 6:00 0 ser
And .. # ps -Al | grep ser 1 S 0 17901 1 0 85 0 - 14200 pause ? 00:00:00 ser 1 S 0 17916 17901 0 75 0 - 14200 pipe_w ? 00:00:00 ser 1 S 0 17917 17901 0 75 0 - 14418 schedu ? 00:00:22 ser 1 S 0 17918 17901 0 75 0 - 14422 schedu ? 00:00:23 ser 1 S 0 17919 17901 0 75 0 - 14423 schedu ? 00:00:24 ser 1 S 0 17920 17901 0 75 0 - 14447 schedu ? 00:00:22 ser 1 S 0 17921 17901 0 75 0 - 14421 schedu ? 00:00:22 ser 1 S 0 17922 17901 0 75 0 - 14424 schedu ? 00:00:22 ser 1 S 0 17923 17901 0 75 0 - 14428 schedu ? 00:00:21 ser 1 S 0 17924 17901 0 75 0 - 14424 schedu ? 00:00:22 ser 1 R 0 17925 17901 0 85 0 - 14448 - ? 00:06:22 ser 1 S 0 17926 17901 0 75 0 - 14457 schedu ? 00:00:49 ser 1 S 0 17927 17901 0 75 0 - 14453 schedu ? 00:00:50 ser 1 R 0 17928 17901 0 85 0 - 14477 - ? 00:06:20 ser 1 R 0 17929 17901 0 85 0 - 14455 - ? 00:06:44 ser 1 S 0 17930 17901 0 75 0 - 14452 schedu ? 00:00:50 ser 1 S 0 17931 17901 0 75 0 - 14448 schedu ? 00:00:50 ser 1 S 0 17932 17901 0 76 0 - 14448 schedu ? 00:00:49 ser 1 R 0 17933 17901 0 85 0 - 14235 - ? 00:05:55 ser
As you can see it looks like 4 children dropped out of the scheduler. The only thing suspicious is that RTPProxy became non-responsive around that time. At least thats the only thing the log shows: Nov 22 15:56:17 /usr/local/sbin/ser[17931]: ERROR: send_rtpp_command: timeout waiting reply from a RTP proxy
Any idea why these 4 children dropped out? Any hints on how to troubleshoot this?
Thanks,
-- Andres Network Admin http://www.telesip.net
Serusers mailing list serusers@lists.iptel.org http://lists.iptel.org/mailman/listinfo/serusers
Hi Greger,
For this same reason we switched to a UDP interface about 6 months ago. But as you can see we still managed to get 4 Children locked up. It still was able to do millions of transaction since the last lockup (2 months ago) but we are still clueless about why this is happening. We have rtpproxy running again in the foreground outputting all messages to a file. Hopefully this will provide enough info to get to the bottom of this.
Thanks,
Ah, of course you knew about that :-) Well, then I cannot really see why the SER children locked up. Do you have any SER logs right before locking up? You are sure it's related to a non-responding rtpproxy? I remember that Maxim was not able to reproduce the rtpproxy lockup, so that's one hard bug to nail down. Hopefully you will get something useful out of the debugging. g-) ----- Original Message ----- From: "Andres" andres@telesip.net To: "Greger V. Teigre" greger@teigre.com Cc: serusers@lists.iptel.org Sent: Wednesday, November 23, 2005 4:15 PM Subject: Re: [Serusers] SER Children Misbehaving
Hi Greger,
For this same reason we switched to a UDP interface about 6 months ago. But as you can see we still managed to get 4 Children locked up. It still was able to do millions of transaction since the last lockup (2 months ago) but we are still clueless about why this is happening. We have rtpproxy running again in the foreground outputting all messages to a file. Hopefully this will provide enough info to get to the bottom of this.
Thanks,
-- Andres Network Admin http://www.telesip.net
Greger V. Teigre wrote:
Andres, I believe I remember this issue being brought up a while back. If I remember correctly, ser children locked up when communicating to a locked rtpproxy over socket interface. The "solution" was to use udp over loopback to communicate as this would fail that specific call, but not lock the ser process. g-)
----- Original Message ----- From: "Andres" andres@telesip.net To: serusers@lists.iptel.org Sent: Tuesday, November 22, 2005 10:19 PM Subject: [Serusers] SER Children Misbehaving
Today we had an incident where SER (0.9.4) children drained all the CPUs of one of our servers. Top Showed: PID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM TIME CPU COMMAND 17925 root 25 0 5644 5644 3888 R 25.5 0.2 6:26 1 ser 17929 root 25 0 5672 5672 3880 R 24.7 0.2 6:48 0 ser 17928 root 25 0 5688 5688 3872 R 24.3 0.2 6:25 1 ser 17933 root 25 0 4540 4540 3740 R 22.8 0.2 6:00 0 ser
And .. # ps -Al | grep ser 1 S 0 17901 1 0 85 0 - 14200 pause ? 00:00:00 ser 1 S 0 17916 17901 0 75 0 - 14200 pipe_w ? 00:00:00 ser 1 S 0 17917 17901 0 75 0 - 14418 schedu ? 00:00:22 ser 1 S 0 17918 17901 0 75 0 - 14422 schedu ? 00:00:23 ser 1 S 0 17919 17901 0 75 0 - 14423 schedu ? 00:00:24 ser 1 S 0 17920 17901 0 75 0 - 14447 schedu ? 00:00:22 ser 1 S 0 17921 17901 0 75 0 - 14421 schedu ? 00:00:22 ser 1 S 0 17922 17901 0 75 0 - 14424 schedu ? 00:00:22 ser 1 S 0 17923 17901 0 75 0 - 14428 schedu ? 00:00:21 ser 1 S 0 17924 17901 0 75 0 - 14424 schedu ? 00:00:22 ser 1 R 0 17925 17901 0 85 0 - 14448 - ? 00:06:22 ser 1 S 0 17926 17901 0 75 0 - 14457 schedu ? 00:00:49 ser 1 S 0 17927 17901 0 75 0 - 14453 schedu ? 00:00:50 ser 1 R 0 17928 17901 0 85 0 - 14477 - ? 00:06:20 ser 1 R 0 17929 17901 0 85 0 - 14455 - ? 00:06:44 ser 1 S 0 17930 17901 0 75 0 - 14452 schedu ? 00:00:50 ser 1 S 0 17931 17901 0 75 0 - 14448 schedu ? 00:00:50 ser 1 S 0 17932 17901 0 76 0 - 14448 schedu ? 00:00:49 ser 1 R 0 17933 17901 0 85 0 - 14235 - ? 00:05:55 ser
As you can see it looks like 4 children dropped out of the scheduler. The only thing suspicious is that RTPProxy became non-responsive around that time. At least thats the only thing the log shows: Nov 22 15:56:17 /usr/local/sbin/ser[17931]: ERROR: send_rtpp_command: timeout waiting reply from a RTP proxy
Any idea why these 4 children dropped out? Any hints on how to troubleshoot this?
Thanks,
-- Andres Network Admin http://www.telesip.net
Serusers mailing list serusers@lists.iptel.org http://lists.iptel.org/mailman/listinfo/serusers
Greger V. Teigre wrote:
Ah, of course you knew about that :-) Well, then I cannot really see why the SER children locked up. Do you have any SER logs right before locking up?
I looked at them but there was really nothing. Only the errors indicating the RTPPROXY stopped responding.
You are sure it's related to a non-responding rtpproxy?
No, but its too much of a coincidence. We monitor the RTPPROXY from a Nagios script every 5 minutes so we can confirm externally that it was not responding around the same time as the 4 children locked up.
I remember that Maxim was not able to reproduce the rtpproxy lockup, so that's one hard bug to nail down. Hopefully you will get something useful out of the debugging.
Yea I know. We left RTPROXY running in the foreground (to gather info for Maxim) for like 2 months. Millions of calls and all was perfect. We restarted it normally about 3 weeks ago and we got this problem. I am starting to think its more stable running in the foreground.
Thanks,
Hm. Strange indeed. I cannot really see why SER should lock up without an error. It will be interesting to hear if you find anything. g-) ----- Original Message ----- From: "Andres" andres@telesip.net To: "Greger V. Teigre" greger@teigre.com Cc: serusers@lists.iptel.org Sent: Thursday, November 24, 2005 5:46 PM Subject: Re: [Serusers] SER Children Misbehaving
Greger V. Teigre wrote:
Ah, of course you knew about that :-) Well, then I cannot really see why the SER children locked up. Do you have any SER logs right before locking up?
I looked at them but there was really nothing. Only the errors indicating the RTPPROXY stopped responding.
You are sure it's related to a non-responding rtpproxy?
No, but its too much of a coincidence. We monitor the RTPPROXY from a Nagios script every 5 minutes so we can confirm externally that it was not responding around the same time as the 4 children locked up.
I remember that Maxim was not able to reproduce the rtpproxy lockup, so that's one hard bug to nail down. Hopefully you will get something useful out of the debugging.
Yea I know. We left RTPROXY running in the foreground (to gather info for Maxim) for like 2 months. Millions of calls and all was perfect. We restarted it normally about 3 weeks ago and we got this problem. I am starting to think its more stable running in the foreground.
Thanks,
-- Andres Network Admin http://www.telesip.net