I'm using the latest rtpengine master and still getting crashes once every 1-2 days on a high call volume installation:
(gdb) where
#0  0x000000304e232925 in raise () from /lib64/libc.so.6
#1  0x000000304e234105 in abort () from /lib64/libc.so.6
#2  0x000000304e22ba4e in __assert_fail_base () from /lib64/libc.so.6
#3  0x000000304e22bb10 in __assert_fail () from /lib64/libc.so.6
#4  0x0000000000412d50 in stream_packet (fd=<value optimized out>, p=0x7f1dcc1b1030, u=<value optimized out>) at call.c:547
#5  stream_fd_readable (fd=<value optimized out>, p=0x7f1dcc1b1030, u=<value optimized out>) at call.c:819
#6  0x000000000040b4ce in poller_poll (p=0x10d3750, timeout=<value optimized out>) at poller.c:354
#7  0x000000000040722d in poller_loop (d=0x10d3750) at main.c:542
#8  0x000000000040bb5f in thread_detach_func (d=<value optimized out>) at aux.c:160
#9  0x000000304e6079d1 in start_thread () from /lib64/libpthread.so.0
#10 0x000000304e2e8b6d in clone () from /lib64/libc.so.6
Unfortunately, this backtrace is not very insightful. What's the easiest way to build rtpengine with debug symbols so that some of these values can be filled in?
On 06/27/14 14:35, Alex Balashov wrote:
Unfortunately, this backtrace is not very insightful. What's the easiest way to build rtpengine with debug symbols so that some of these values can be filled in?
The easiest way would be to build it with "make debug"; that should make everything visible.
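For reference, a minimal sketch of such a build from a git checkout (the daemon subdirectory is an assumption about the source layout):

    git clone https://github.com/sipwise/rtpengine.git
    cd rtpengine/daemon
    make debug     # per the above, builds with debug symbols so the optimized-out values become visible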
But this backtrace is quite helpful already, as it's a deliberate assertion that triggers. If you're lucky, you can find the call-id through the "sfd" struct ("print sfd->call->callid.s"), and then check the logs to see what was going on. If that doesn't work, you can send me both the binary and the core, I may be able to get some details out of it.
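A minimal gdb session along those lines might look like this (binary and core file paths are assumptions):

    gdb /usr/local/sbin/rtpengine /path/to/core
    (gdb) frame 4
    (gdb) print sfd->call->callid.s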
cheers
Hi Richard,
Yes indeed, I am able to pull sfd->call->callid.s in frame 4.
The assert is, of course, stream != NULL:
547 assert(stream != NULL);
And that's where it blows up:
(gdb) print stream
$6 = (struct packet_stream *) 0x0
I seem to have some "interesting" calls going through this system, such that I can deliver you a "monologue == NULL" crash bug one week and a stream-NULL abrt/exit the next. :-)
I looked up that Call-ID in the logs and saw this kind of call flow:
Jun 27 17:52:59 rtphost /usr/local/sbin/kamailio[30745]: INFO: [R-OUTBOUND-VENDOR-TRY:33DEF60F-53ADAF79000E9F75-1C569700] -> Channel utilisation for this gateway: 112/200
Jun 27 17:52:59 rtphost /usr/local/sbin/kamailio[30745]: INFO: [R-OUTBOUND-VENDOR-ROUTE-TRY:33DEF60F-53ADAF79000E9F75-1C569700] Sending media through rtpproxy
Jun 27 17:52:59 rtphost /usr/local/sbin/kamailio[30745]: INFO: [R-OUTBOUND-VENDOR-ROUTE-TRY:33DEF60F-53ADAF79000E9F75-1C569700] -> Final request URI: sip:14045372369@18.54.79.82:5060;user=phone
Jun 27 17:52:59 rtphost /usr/local/sbin/kamailio[30745]: INFO: [R-OUTBOUND-VENDOR-ROUTE-TRY-BRANCH:33DEF60F-53ADAF79000E9F75-1C569700] -> Sending P-Asserted-Identity
Jun 27 17:52:59 rtphost /usr/local/sbin/kamailio[30745]: INFO: [R-OUTBOUND-VENDOR-ROUTE-TRY-BRANCH:33DEF60F-53ADAF79000E9F75-1C569700] -> Sending NPDI parameters
Jun 27 17:53:01 rtphost /usr/local/sbin/kamailio[30745]: INFO: [R-MAIN:33DEF60F-53ADAF79000E9F75-1C569700] CANCEL received from 104.251.251.8 (RURI=sip:14045372369;npdi=yes;rn=14047890081@205.97.81.46:5060;user=phone)
Jun 27 17:53:02 rtphost /usr/local/sbin/kamailio[30749]: INFO: [R-OUTBOUND-VENDOR-ROUTE-ADVANCE:33DEF60F-53ADAF79000E9F75-1C569700] Transaction cancelled
I have another crash with this exact same assertion failure. I'll try to look up its Call-ID (if I still have the logs) and see if it follows a similar pattern.
On 06/27/14 17:26, Alex Balashov wrote:
I looked up that Call-ID in the logs and saw this kind of call flow:
...
Also look up the call-id in the rtpengine logs please. The full log (esp including the SDP) might allow me to reproduce this.
cheers
On 06/27/2014 05:28 PM, Richard Fuchs wrote:
Also look up the call-id in the rtpengine logs please. The full log (esp including the SDP) might allow me to reproduce this.
I will. I'm just afraid they may have been rotated out, or that the entry for this was rate-limited out. We forgot to disable rate-limiting on rsyslogd, and this is a very high call volume installation.
On 27/06/14 05:29 PM, Alex Balashov wrote:
I will. I'm just afraid they may have been rotated out, or that the entry for this was rate-limited out. We forgot to disable rate-limiting on rsyslogd, and this is a very high call volume installation.
After carefully considering the code again, I believe the assertion trigger may be a rare but valid case after all. I have thus removed the assertion and the code is now simply ignoring this case. I expect that fixes your problem.
https://github.com/sipwise/rtpengine/commit/2e9ae75fb0eeddaa877f27c9c4ba98a2c64c2d15
cheers
On 06/30/2014 09:48 AM, Richard Fuchs wrote:
After carefully considering the code again, I believe the assertion trigger may be a rare but valid case after all. I have thus removed the assertion and the code is now simply ignoring this case. I expect that fixes your problem.
https://github.com/sipwise/rtpengine/commit/2e9ae75fb0eeddaa877f27c9c4ba98a2c64c2d15
Thank you, Richard. Can you elaborate as to what this case might be?
If I get another crash in the next few days (seems quite likely), I will send you log output.
On 30/06/14 12:04 PM, Alex Balashov wrote:
Thank you, Richard. Can you elaborate as to what this case might be?
It's about how many ports are used for a particular media stream: normally it's two, but it can be different. If there's a disagreement about this between the two sides, then I believe some of the pointers could be left NULLed. I haven't actually tested this, though.
There's also the issue of a race condition due to incorrect locking. I've just pushed another commit to make this more robust. Either case requires reception of a stray UDP packet on a local port which is either unused or not yet used.
cheers
Hi Richard, Alex,
On Tue, Jul 1, 2014 at 6:58 AM, Richard Fuchs rfuchs@sipwise.com wrote:
It's about how many ports are used for a particular media stream: normally it's two, but it can be different. If there's a disagreement about this between the two sides, then I believe some of the pointers could be left NULLed. I haven't actually tested this, though.
I'm curious about it. Is there a way of simulating this? Under what conditions would this happen?
Regards, Carlos
On 01/07/14 12:59 PM, Carlos Ruiz Díaz wrote:
Hi Richard, Alex,
I'm curious about it. Is there a way of simulating this? Under what conditions would this happen?
The first case described (differing number of ports) I'm not 100% sure about. I'm thinking the little-used slash notation in the SDP m= line might be to blame: one side using it (say, 1234/2 to advertise 4 ports being used), the other side not using it. You'd end up with 4 ports open for each side, but some of them unused. And then UDP packets coming in on those ports.
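To illustrate, a hypothetical offer/answer mismatch (addresses and ports made up):

    m=audio 1234/2 RTP/AVP 0     <- offer uses the slash notation: 2 streams, 4 ports counting RTCP
    m=audio 5678 RTP/AVP 0       <- answer uses the plain notation: 1 stream, 2 ports counting RTCP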
The race condition is harder to test. You'd have to predict which ports will be opened when an "offer" is received, and have these ports hammered with UDP packets ahead of time, while the "offer" is being processed.
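A crude sketch of that from a shell, assuming bash and a made-up target address and port range (whether this actually wins the race is timing-dependent):

    # spray stray UDP packets at the guessed-at RTP port range
    # while the "offer" is being submitted in parallel
    for port in $(seq 30000 30063); do
        echo stray > /dev/udp/203.0.113.10/$port
    done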
cheers
Thank you, Richard.
Maybe Alex can tell us how much stress was put on rtpengine to make it crash.
Regards, Carlos
Related:
Every once in a while, rtpengine will go unresponsive despite continuing to run. Here, for instance, ping commands from Kamailio can be observed arriving, but go unanswered:
17:00:08.498680 IP 209.75.186.86.55305 > 209.75.186.78.mmcc: UDP, length 29
    E..9u.@.@.$RBQ.VBQ.N. ...%.u22440_43727 d7:command4:pinge
17:00:08.525165 IP 209.75.186.86.40538 > 209.75.186.78.mmcc: UDP, length 29
    E..9u.@.@.$SBQ.VBQ.N.Z...%."22439_43262 d7:command4:pinge
17:00:08.534777 IP 209.75.186.86.44189 > 209.75.186.78.mmcc: UDP, length 29
    E..9u.@.@.$ZBQ.VBQ.N.....%..22435_43463 d7:command4:pinge
17:00:08.568766 IP 209.75.186.86.44141 > 209.75.186.78.mmcc: UDP, length 29
    E..9u.@.@.$XBQ.VBQ.N.m...%..22434_43510 d7:command4:pinge
17:00:08.577805 IP 209.75.186.86.60187 > 209.75.186.78.mmcc: UDP, length 29
    E..9u.@.@.$OBQ.VBQ.N.....%.d22437_43626 d7:command4:pinge
17:00:08.599840 IP 209.75.186.86.52326 > 209.75.186.78.mmcc: UDP, length 29
    E..9u.@.@.$TBQ.VBQ.N.f...%..22438_43914 d7:command4:pinge
17:00:08.614967 IP 209.75.186.86.55637 > 209.75.186.78.mmcc: UDP, length 29
    E..9u.@.@.$RBQ.VBQ.N.U...%.'22441_43745 d7:command4:pinge
17:00:09.497714 IP 209.75.186.86.49158 > 209.75.186.78.mmcc: UDP, length 29
    E..9u.@.@.$XBQ.VBQ.N.....%.x22436_43834 d7:command4:pinge
17:00:09.499864 IP 209.75.186.86.55305 > 209.75.186.78.mmcc: UDP, length 29
    E..9u.@.@.$QBQ.VBQ.N. ...%.u22440_43727 d7:command4:pinge
17:00:09.526350 IP 209.75.186.86.40538 > 209.75.186.78.mmcc: UDP, length 29
    E..9u.@.@.$RBQ.VBQ.N.Z...%."22439_43262 d7:command4:pinge
17:00:09.535925 IP 209.75.186.86.44189 > 209.75.186.78.mmcc: UDP, length 29
    E..9u.@.@.$YBQ.VBQ.N.....%..22435_43463 d7:command4:pinge
17:00:09.569949 IP 209.75.186.86.44141 > 209.75.186.78.mmcc: UDP, length 29
    E..9u.@.@.$WBQ.VBQ.N.m...%..22434_43510 d7:command4:pinge
17:00:09.578973 IP 209.75.186.86.60187 > 209.75.186.78.mmcc: UDP, length 29
    E..9u @.@.$NBQ.VBQ.N.....%.d22437_43626 d7:command4:pinge
17:00:09.601007 IP 209.75.186.86.52326 > 209.75.186.78.mmcc: UDP, length 29
    E..9u.@.@.$SBQ.VBQ.N.f...%..22438_43914 d7:command4:pinge
17:00:09.616110 IP 209.75.186.86.55637 > 209.75.186.78.mmcc: UDP, length 29
    E..9u.@.@.$QBQ.VBQ.N.U...%.'22441_43745 d7:command4:pinge
There is no output in syslogs.
Other causes (e.g. firewall) have been ruled out.
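For what it's worth, the ping can also be issued by hand; a sketch using bash's /dev/udp pseudo-device (the cookie string is arbitrary, address and port are taken from the capture above):

    # ng protocol: an arbitrary cookie, a space, then the bencoded dictionary
    printf 'test_1 d7:command4:pinge' > /dev/udp/209.75.186.78/5050
    # a responsive instance should answer with the same cookie followed by: d6:result4:ponge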
On 07/02/14 13:01, Alex Balashov wrote:
Related:
Every once in a while, rtpengine will go unresponsive despite continuing to run. Here, for instance, ping commands from Kamailio can be observed, but are going unreplied:
Probably a deadlock. Either jump into the running process with gdb, or send a core-dumping signal and then load the core in gdb. Execute "thread apply all bt" to see where it's hung up.
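Concretely, something like this (the PID is the one from the strace above; the binary path is an assumption; gcore ships with gdb):

    gdb -p 30323                    # jump into the running process
    (gdb) thread apply all bt       # backtrace of every thread
    (gdb) detach
    # or, to capture a core without holding the process under gdb:
    gcore 30323
    gdb /usr/local/sbin/rtpengine core.30323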
cheers
Interestingly enough, the polling loop of rtpengine seems to be intact:
[root@sd-rtp01 log]# strace -p 30323
Process 30323 attached - interrupt to quit
restart_syscall(<... resuming interrupted call ...>) = 0
nanosleep({0, 100000000}, NULL) = 0
nanosleep({0, 100000000}, NULL) = 0
nanosleep({0, 100000000}, NULL) = 0
nanosleep({0, 100000000}, NULL) = 0
nanosleep({0, 100000000}, NULL) = 0
nanosleep({0, 100000000}, NULL) = 0
nanosleep({0, 100000000}, NULL) = 0
nanosleep({0, 100000000}, NULL) = 0
nanosleep({0, 100000000}, NULL) = 0
nanosleep({0, 100000000}, NULL) = 0
nanosleep({0, 100000000}, NULL) = 0
nanosleep({0, 100000000}, ^C <unfinished ...>
Process 30323 detached
So, I'm not sure why it wouldn't be processing the ping messages. Perhaps it's a programmatic issue? When I ran this strace and waited for one of these:
Jul 2 17:04:14 proxyhost /usr/local/sbin/kamailio[22442]: ERROR: rtpproxy-ng [rtpproxy.c:1549]: send_rtpp_command(): proxy udp:209.75.186.78:5050 does not respond, disable it
I saw nothing but this:
nanosleep({0, 100000000}, NULL) = 0
nanosleep({0, 100000000}, NULL) = 0
nanosleep({0, 100000000}, NULL) = 0
nanosleep({0, 100000000}, NULL) = 0
nanosleep({0, 100000000}, NULL) = 0
nanosleep({0, 100000000}, NULL) = 0
nanosleep({0, 100000000}, NULL) = 0
nanosleep({0, 100000000}, NULL) = 0
nanosleep({0, 100000000}, NULL) = 0
...
Re: the current thread state, with GDB attached:
(gdb) thread apply all bt
Thread 7 (Thread 0x7f874f280700 (LWP 30324)):
#0  0x000000304e233a1d in sigtimedwait () from /lib64/libc.so.6
#1  0x00000000004072f1 in sighandler (x=<value optimized out>) at main.c:114
#2  0x000000000040bb5f in thread_detach_func (d=<value optimized out>) at aux.c:160
#3  0x000000304e6079d1 in start_thread () from /lib64/libpthread.so.0
#4  0x000000304e2e8b6d in clone () from /lib64/libc.so.6

Thread 6 (Thread 0x7f874e87f700 (LWP 30325)):
#0  0x000000304e2accdd in nanosleep () from /lib64/libc.so.6
#1  0x000000304e2e1e54 in usleep () from /lib64/libc.so.6
#2  0x000000000040af82 in poller_timers_wait_run (p=0x269a740, max=100000) at poller.c:500
#3  0x000000000040725d in timer_loop (d=0x269a740) at main.c:535
#4  0x000000000040bb5f in thread_detach_func (d=<value optimized out>) at aux.c:160
#5  0x000000304e6079d1 in start_thread () from /lib64/libpthread.so.0
#6  0x000000304e2e8b6d in clone () from /lib64/libc.so.6

Thread 5 (Thread 0x7f874de7e700 (LWP 30326)):
#0  0x000000304e60b5bc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x0000000000415240 in cookie_cache_lookup (c=0x269e050, s=0x7f874de6d650) at cookie_cache.c:57
#2  0x0000000000415b20 in control_ng_incoming (obj=0x269e030, buf=<value optimized out>, sin=0x7f874de7d700, addr=0x7f874de7d6c0 "209.75.186.86:39889") at control_ng.c:86
#3  0x000000000041573e in udp_listener_incoming (fd=6, p=0x269d670, x=<value optimized out>) at udp_listener.c:52
#4  0x000000000040b4ce in poller_poll (p=0x269a740, timeout=<value optimized out>) at poller.c:354
#5  0x000000000040722d in poller_loop (d=0x269a740) at main.c:542
#6  0x000000000040bb5f in thread_detach_func (d=<value optimized out>) at aux.c:160
#7  0x000000304e6079d1 in start_thread () from /lib64/libpthread.so.0
#8  0x000000304e2e8b6d in clone () from /lib64/libc.so.6

Thread 4 (Thread 0x7f874d47d700 (LWP 30327)):
#0  0x000000304e60ac50 in pthread_rwlock_wrlock () from /lib64/libpthread.so.0
#1  0x000000000040f29f in call_destroy (c=0x7f87400b25c0) at call.c:1981
#2  0x0000000000411c16 in call_delete_branch (m=<value optimized out>, callid=<value optimized out>, branch=<value optimized out>, fromtag=<value optimized out>, totag=0x7f874d46c540, output=0x7f8740062c28) at call.c:2486
#3  0x000000000041f901 in call_delete_ng (input=<value optimized out>, m=0x269b400, output=0x7f8740062c28) at call_interfaces.c:649
#4  0x0000000000415fc8 in control_ng_incoming (obj=0x269e030, buf=<value optimized out>, sin=0x7f874d47c700, addr=0x7f874d47c6c0 "209.75.186.86:41930") at control_ng.c:117
#5  0x000000000041573e in udp_listener_incoming (fd=6, p=0x269d670, x=<value optimized out>) at udp_listener.c:52
#6  0x000000000040b4ce in poller_poll (p=0x269a740, timeout=<value optimized out>) at poller.c:354
#7  0x000000000040722d in poller_loop (d=0x269a740) at main.c:542
#8  0x000000000040bb5f in thread_detach_func (d=<value optimized out>) at aux.c:160
#9  0x000000304e6079d1 in start_thread () from /lib64/libpthread.so.0
#10 0x000000304e2e8b6d in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x7f874ca7c700 (LWP 30328)):
#0  0x000000304e60ac50 in pthread_rwlock_wrlock () from /lib64/libpthread.so.0
#1  0x000000000040f29f in call_destroy (c=0x7f8740a2c120) at call.c:1981
#2  0x0000000000411c16 in call_delete_branch (m=<value optimized out>, callid=<value optimized out>, branch=<value optimized out>, fromtag=<value optimized out>, totag=0x7f874ca6b540, output=0x7f8728222828) at call.c:2486
#3  0x000000000041f901 in call_delete_ng (input=<value optimized out>, m=0x269b400, output=0x7f8728222828) at call_interfaces.c:649
#4  0x0000000000415fc8 in control_ng_incoming (obj=0x269e030, buf=<value optimized out>, sin=0x7f874ca7b700, addr=0x7f874ca7b6c0 "209.75.186.86:39889") at control_ng.c:117
#5  0x000000000041573e in udp_listener_incoming (fd=6, p=0x269d670, x=<value optimized out>) at udp_listener.c:52
#6  0x000000000040b4ce in poller_poll (p=0x269a740, timeout=<value optimized out>) at poller.c:354
#7  0x000000000040722d in poller_loop (d=0x269a740) at main.c:542
#8  0x000000000040bb5f in thread_detach_func (d=<value optimized out>) at aux.c:160
#9  0x000000304e6079d1 in start_thread () from /lib64/libpthread.so.0
#10 0x000000304e2e8b6d in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x7f8747fff700 (LWP 30329)):
#0  0x000000304e60ac50 in pthread_rwlock_wrlock () from /lib64/libpthread.so.0
#1  0x000000000040f29f in call_destroy (c=0x7f872c22a1d0) at call.c:1981
#2  0x0000000000411c16 in call_delete_branch (m=<value optimized out>, callid=<value optimized out>, branch=<value optimized out>, fromtag=<value optimized out>, totag=0x7f8747fee540, output=0x7f872c2d4828) at call.c:2486
#3  0x000000000041f901 in call_delete_ng (input=<value optimized out>, m=0x269b400, output=0x7f872c2d4828) at call_interfaces.c:649
#4  0x0000000000415fc8 in control_ng_incoming (obj=0x269e030, buf=<value optimized out>, sin=0x7f8747ffe700, addr=0x7f8747ffe6c0 "209.75.186.86:36511") at control_ng.c:117
#5  0x000000000041573e in udp_listener_incoming (fd=6, p=0x269d670, x=<value optimized out>) at udp_listener.c:52
#6  0x000000000040b4ce in poller_poll (p=0x269a740, timeout=<value optimized out>) at poller.c:354
#7  0x000000000040722d in poller_loop (d=0x269a740) at main.c:542
#8  0x000000000040bb5f in thread_detach_func (d=<value optimized out>) at aux.c:160
#9  0x000000304e6079d1 in start_thread () from /lib64/libpthread.so.0
#10 0x000000304e2e8b6d in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7f874f2817c0 (LWP 30323)):
#0  0x000000304e2accdd in nanosleep () from /lib64/libc.so.6
#1  0x000000304e2e1e54 in usleep () from /lib64/libc.so.6
#2  0x000000000040a98a in main (argc=1, argv=0x7fffc99076d8) at main.c:562
-- Alex
I need to restart this process to get it back into production, but I'll get it to dump the core in case you have any other questions.
On 07/02/14 13:07, Alex Balashov wrote:
Interestingly enough, the polling loop of rtpengine seems to be intact:
[root@sd-rtp01 log]# strace -p 30323
Process 30323 attached - interrupt to quit
restart_syscall(<... resuming interrupted call ...>) = 0
nanosleep({0, 100000000}, NULL) = 0
nanosleep({0, 100000000}, NULL) = 0
nanosleep({0, 100000000}, NULL) = 0
Strace only hooks into the main thread of a process; other threads don't show up. If you use "ps -efL" to show the process list, you get to see all threads of a process, including their LWP IDs. Use those IDs to see what's going on in them with strace (you can specify multiple at the same time: strace -p $X -p $Y -p $Z ...).
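For example, with the LWP IDs from the backtrace earlier in the thread:

    ps -efL | grep rtpengine             # the LWP column lists each thread's ID
    strace -p 30324 -p 30325 -p 30326    # attach strace to several LWPs at once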
Re: the current thread state, with GDB attached:
Would you happen to know the git commit ID of your build, so I can correlate the line numbers?
cheers
On 07/02/2014 01:15 PM, Richard Fuchs wrote:
Strace only hooks into the main thread of a process; other threads don't show up. If you use "ps -efL" to show the process list, you get to see all threads of a process, including their LWP IDs. Use those IDs to see what's going on in them with strace (you can specify multiple at the same time: strace -p $X -p $Y -p $Z ...).
Ah, I was wondering if 'ps' could show LWPs. Thanks!
Would you happen to know the git commit ID of your build, so I can correlate the line numbers?
Sure, it's master:b0c0e396d62c4a04d1600fc5eb8a3462a0e6cefe.
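(For anyone following along, one way to do that correlation from a checkout, assuming the daemon sources live under daemon/ in the repository:)

    git show b0c0e396d62c4a04d1600fc5eb8a3462a0e6cefe:daemon/call.c | sed -n '1975,1990p'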
On 07/02/14 13:17, Alex Balashov wrote:
Sure, it's master:b0c0e396d62c4a04d1600fc5eb8a3462a0e6cefe.
Found one instance of a missing unlock. Do you perhaps see the error message "Invalid dialogue association" in your rtpengine logs?
cheers
On 07/02/2014 01:36 PM, Richard Fuchs wrote:
Found one instance of a missing unlock. Do you perhaps see the error message "Invalid dialogue association" in your rtpengine logs?
Yes, indeed, from yesterday, which is when it actually seems to have stopped functioning:
Jul 1 16:22:38 rtphost rtpengine[30323]: [03C7B67A-53B2E03D0002BE22-091CD700] Protocol error in packet from 209.75.186.86:36639: Invalid dialogue association [d3:sdp355:v=0#015#012o=Sonus_UAC 15054 20579 IN IP4 97.41.84.17.148#015#012s=SIP Media Capabilities#015#012c=IN IP4 97.41.84.17.148#015#012t=0 0#015#012m=audio 10594 RTP/AVP 0#015#012a=direction: passive#015#012a=rtpmap:0 PCMU/8000#015#012a=sendrecv#015#012a=ptime:20#015#012m=audio 0 RTP/AVP 0 8 101#015#012a=direction: both#015#012a=rtpmap:0 PCMU/8000#015#012a=rtpmap:8 PCMA/8000#015#012a=rtpmap:101 telephone-event/8000#015#012a=sendrecv#015#012a=maxptime:20#015#0125:flagsl13:trust-addresse7:replacel6:origine7:call-id34:03C7B67A-53B2E03D0002BE22-091CD70013:received-froml3:IP414:97.41.84.17.148e8:from-tag34:76D0517D-53B2E03D0002BED6-091CD7006:to-tag20:SDhivi899-gK0a90b54e7:command6:answere]
Jul 1 16:22:38 rtphost rtpengine[30323]: [03C7B67A-53B2E03D0002BE22-091CD700] Returning to SIP proxy: d6:result5:error12:error-reason28:Invalid dialogue associatione
Jul 1 16:53:51 rtphost rtpengine[30323]: [644986138-0-1622069164@110.4.8.1] Protocol error in packet from 209.75.186.86:34626: Invalid dialogue association [d3:sdp230:v=0#015#012o=Sansay-VSXi 188 1 IN IP4 154.50.206.12#015#012s=Session Controller#015#012c=IN IP4 154.50.206.15#015#012t=0 0#015#012m=audio 25376 RTP/AVP 0 101#015#012a=rtpmap:0 PCMU/8000#015#012a=rtpmap:101 telephone-event/8000#015#012a=fmtp:101 0-15#015#012a=sendrecv#015#012a=maxptime:20#015#0125:flagsl13:trust-addresse7:replacel6:origine7:call-id36:644986138-0-1622069164@110.4.8.113:received-froml3:IP414:107.23.121.178e8:from-tag24:sansay2472712924rdb840496:to-tag23:sansay1041790983rdb43617:command6:answere]
Jul 1 16:53:51 rtphost rtpengine[30323]: [644986138-0-1622069164@110.4.8.1] Returning to SIP proxy: d6:result5:error12:error-reason28:Invalid dialogue associatione
Jul 1 17:28:21 rtphost rtpengine[30323]: [68A5EB10-53B2EFB30005599A-BE083700] Protocol error in packet from 209.75.186.86:36511: Invalid dialogue association [d3:sdp354:v=0#015#012o=Sonus_UAC 28564 27914 IN IP4 97.41.84.17.148#015#012s=SIP Media Capabilities#015#012c=IN IP4 97.41.84.17.148#015#012t=0 0#015#012m=audio 10034 RTP/AVP 0#015#012a=direction: passive#015#012a=rtpmap:0 PCMU/8000#015#012a=sendrecv#015#012a=ptime:20#015#012m=audio 0 RTP/AVP 0 18 101#015#012a=direction: both#015#012a=rtpmap:0 PCMU/8000#015#012a=rtpmap:18 G729/8000#015#012a=rtpmap:101 telephone-event/8000#015#012a=sendrecv#015#012a=ptime:20#015#0125:flagsl13:trust-addresse7:replacel6:origine7:call-id34:68A5EB10-53B2EFB30005599A-BE08370013:received-froml3:IP414:97.41.84.17.148e8:from-tag34:3438543C-53B2EFB300055A36-BE0837006:to-tag20:SD2sl6999-gK07ba92697:command6:answere]
Jul 1 17:28:21 rtphost rtpengine[30323]: [68A5EB10-53B2EFB30005599A-BE083700] Returning to SIP proxy: d6:result5:error12:error-reason28:Invalid dialogue associatione
Jul 1 19:06:11 rtphost rtpengine[30323]: [5F059A5C-53B306A00009A3D6-5BFDE700] Protocol error in packet from 209.75.186.86:36511: Invalid dialogue association [d3:sdp328:v=0#015#012o=Sonus_UAC 30494 32324 IN IP4 97.41.84.17.148#015#012s=SIP Media Capabilities#015#012c=IN IP4 97.41.84.17.148#015#012t=0 0#015#012m=audio 11446 RTP/AVP 0#015#012a=direction: passive#015#012a=rtpmap:0 PCMU/8000#015#012a=sendrecv#015#012a=ptime:20#015#012m=audio 0 RTP/AVP 0 101#015#012a=direction: both#015#012a=rtpmap:0 PCMU/8000#015#012a=rtpmap:101 telephone-event/8000#015#012a=sendrecv#015#012a=ptime:20#015#0125:flagsl13:trust-addresse7:replacel6:origine7:call-id34:5F059A5C-53B306A00009A3D6-5BFDE70013:received-froml3:IP414:97.41.84.17.148e8:from-tag34:6243EDDE-53B306A00009A48C-5BFDE7006:to-tag20:SDe3pca99-gK09abfad37:command6:answere]
Jul 1 19:06:11 rtphost rtpengine[30323]: [5F059A5C-53B306A00009A3D6-5BFDE700] Returning to SIP proxy: d6:result5:error12:error-reason28:Invalid dialogue associatione
On 07/02/14 13:44, Alex Balashov wrote:
Yes, indeed, from yesterday, which is when it actually seems to have stopped functioning:
In that case, this should be for you:
https://github.com/sipwise/rtpengine/commit/15640991d7fc9d398749f77460b82129...
cheers
On 07/02/2014 01:45 PM, Richard Fuchs wrote:
In that case, this should be for you:
https://github.com/sipwise/rtpengine/commit/15640991d7fc9d398749f77460b82129...
Am I right to assume that this is a missing unlock introduced by error handling in the previous patch? :-)
On 07/02/14 13:46, Alex Balashov wrote:
Am I right to assume that this is a missing unlock introduced by error handling in the previous patch? :-)
No, completely unrelated screw-up, from the beginning of June actually :-P
cheers
On 07/02/2014 01:49 PM, Richard Fuchs wrote:
Am I right to assume that this is a missing unlock introduced by error handling in the previous patch? :-)
No, completely unrelated screw-up, from the beginning of June actually :-P
Ah well.
Thanks, your help and prompt responses are deeply appreciated.