Hello,
I'm running kamailio:4.2:df86f2a9a09339687af5914b85fe8bd8f8f1f575 and am getting a periodic crash, once every few days, in response to a CANCEL message.
The basic back trace is like this:
(gdb) where
#0  0x000000000044fed2 in del_nonshm_lump (lump_list=0x7f48f5681440) at data_lump.c:677
#1  0x00007f48f5404a15 in free_faked_req (faked_req=0x7f48f5680ea0, t=0x7f46ee6faeb0) at t_reply.c:975
#2  0x00007f48f5405bdf in run_failure_handlers (t=0x7f46ee6faeb0, rpl=0xffffffffffffffff, code=487, extra_flags=0) at t_reply.c:1061
#3  0x00007f48f54084e4 in t_should_relay_response (Trans=0x7f46ee6faeb0, new_code=487, branch=0, should_store=0x7fffce5605e0, should_relay=0x7fffce5605e4, cancel_data=0x7fffce5606b0, reply=0xffffffffffffffff) at t_reply.c:1406
#4  0x00007f48f540b045 in relay_reply (t=0x7f46ee6faeb0, p_msg=0xffffffffffffffff, branch=0, msg_status=487, cancel_data=0x7fffce5606b0, do_put_on_wait=1) at t_reply.c:1809
#5  0x00007f48f5386832 in cancel_branch (t=0x7f46ee6faeb0, branch=0, reason=0x0, flags=10) at t_cancel.c:276
#6  0x00007f48f53aff4a in e2e_cancel (cancel_msg=0x7f48f68d69d8, t_cancel=0x7f46ee8d9c30, t_invite=0x7f46ee6faeb0) at t_fwd.c:1373
#7  0x00007f48f53b4bd0 in t_relay_cancel (p_msg=0x7f48f68d69d8) at t_fwd.c:1967
#8  0x00007f48f53deaa7 in w_t_relay_cancel (p_msg=0x7f48f68d69d8, _foo=0x0, _bar=0x0) at tm.c:1743
#9  0x000000000041d364 in do_action (h=0x7fffce560fc0, a=0x7f48f6689f70, msg=0x7f48f68d69d8) at action.c:1088
#10 0x0000000000429a7a in run_actions (h=0x7fffce560fc0, a=0x7f48f6689f70, msg=0x7f48f68d69d8) at action.c:1583
#11 0x000000000042a0df in run_actions_safe (h=0x7fffce5622b0, a=0x7f48f6689f70, msg=0x7f48f68d69d8) at action.c:1648
#12 0x0000000000541158 in rval_get_int (h=0x7fffce5622b0, msg=0x7f48f68d69d8, i=0x7fffce561498, rv=0x7f48f668a1e0, cache=0x0) at rvalue.c:924
#13 0x0000000000545390 in rval_expr_eval_int (h=0x7fffce5622b0, msg=0x7f48f68d69d8, res=0x7fffce561498, rve=0x7f48f668a1d8) at rvalue.c:1918
#14 0x0000000000545786 in rval_expr_eval_int (h=0x7fffce5622b0, msg=0x7f48f68d69d8, res=0x7fffce561920, rve=0x7f48f668a948) at rvalue.c:1926
#15 0x000000000041ce4e in do_action (h=0x7fffce5622b0, a=0x7f48f668c260, msg=0x7f48f68d69d8) at action.c:1064
#16 0x0000000000429a7a in run_actions (h=0x7fffce5622b0, a=0x7f48f6689808, msg=0x7f48f68d69d8) at action.c:1583
#17 0x000000000041d2cd in do_action (h=0x7fffce5622b0, a=0x7f48f668c960, msg=0x7f48f68d69d8) at action.c:1079
#18 0x0000000000429a7a in run_actions (h=0x7fffce5622b0, a=0x7f48f667c628, msg=0x7f48f68d69d8) at action.c:1583
#19 0x000000000042a1a7 in run_top_route (a=0x7f48f667c628, msg=0x7f48f68d69d8, c=0x0) at action.c:1669
#20 0x0000000000507e1a in receive_msg (buf=0xa6f760 "CANCEL sip:yyy@xxx:5060 SIP/2.0\r\nVia: SIP /2.0/UDP xxx:5060;branch=z9hG4bK08f.3dc6f0e1.0\r\nFrom: "yyy" sip:yyy@xxx;tag=D78eD8FB3SDgc\r\nCall-ID: e5e48a99-48dd-1233-96b7-782bcb13da6a\r\nTo: sip:xxx@xxx:5060\r\nCSeq: 73049624 CANCEL\r\nMax-Forwards: 32\r\nUser-Agent: OpenSIPS (1.9.1-notls (x86_64/linux))\r\nContent-Length: 0\r\n\r\n", len=394, rcv_info=0x7fffce5625a0) at receive.c:216
#21 0x00000000006074ae in udp_rcv_loop () at udp_server.c:521
#22 0x00000000004a5f0b in main_loop () at main.c:1629
#23 0x00000000004ab8bf in main (argc=11, argv=0x7fffce5629c8) at main.c:2578
I'll send a 'thread apply all bt full' privately due to the number of private addresses in there, but a quick glance suggests a possible problem is here:
#5  0x00007f48f5386832 in cancel_branch (t=0x7f46ee6faeb0, branch=0, reason=0x0, flags=10) at t_cancel.c:276
        cancel = 0x1 <Address 0x1 out of bounds>
        len = 32584
        crb = 0x7f46ee6fb0b0
        irb = 0x7f46ee6fb028
        ret = 1
        tmp_cd = {cancel_bitmap = 0, reason = {cause = 0, u = {text = {s = 0x0, len = 0}, e2e_cancel = 0x0, packed_hdrs = {s = 0x0, len = 0}}}}
        pcbuf = 0x7f46ee6fb0c0
        __FUNCTION__ = "cancel_branch"
#6  0x00007f48f53aff4a in e2e_cancel (cancel_msg=0x7f48f68d69d8, t_cancel=0x7f46ee8d9c30, t_invite=0x7f46ee6faeb0) at t_fwd.c:1373
        cancel_bm = 1
        reason = 0x0
        free_reason = 0
        i = 0
        lowest_error = 0
        ret = 32584
        tmcb = {req = 0x137f66ce710, rpl = 0x7f48f68d69d8, param = 0xf48ab828, code = -158504488, flags = 32584, branch = 0, t_rbuf = 0xf80f668f9a0, dst = 0xce5622b0, send_buf = {s = 0x1ffffffff <Address 0x1ffffffff out of bounds>, len = -304107664}}
        __FUNCTION__ = "e2e_cancel"
#7  0x00007f48f53b4bd0 in t_relay_cancel (p_msg=0x7f48f68d69d8) at t_fwd.c:1967
        t_invite = 0x7f46ee6faeb0
        t = 0x7f46ee8d9c30
        ret = -323705680
        new_tran = 1
Thanks,
-- Alex
Hello,
why do you think it is a problem in frame 5?
Send the bt full for at least the first 3 frames.
Cheers, Daniel
On 19/03/15 19:49, Alex Balashov wrote:
Hello,
I'm running kamailio:4.2:df86f2a9a09339687af5914b85fe8bd8f8f1f575 and am getting a periodic crash, once every few days, in response to a CANCEL message.
Thanks,
-- Alex
On 03/19/2015 04:38 PM, Daniel-Constantin Mierla wrote:
Hello,
why do you think it is a problem in frame 5?
Send the bt full for at least the first 3 frames.
The appearance of out-of-bounds access there. But you're right, that is not necessarily a justifiable conclusion.
I will send bt full privately to you.
On 19/03/15 21:40, Alex Balashov wrote:
On 03/19/2015 04:38 PM, Daniel-Constantin Mierla wrote:
Hello,
why do you think it is a problem in frame 5?
Send the bt full for at least the first 3 frames.
The appearance of out-of-bounds access there. But you're right, that is not necessarily a justifiable conclusion.
That's a variable declared without initial value and not used according to the backtrace.
Are you doing any operations in the failure route for a canceled invite? Like add/remove headers to the invite?
Cheers, Daniel
On 03/19/2015 04:52 PM, Daniel-Constantin Mierla wrote:
Are you doing any operations in the failure route for a canceled invite? Like add/remove headers to the invite?
No; only handling for a CANCEL in main route block is:
if(is_method("CANCEL")) { xlog("L_INFO", "[R-MAIN:$ci] CANCEL received from $si (RURI=$ru)\n");
#!ifdef WITH_RTPPROXY
set_rtp_proxy_set("1"); unforce_rtp_proxy(); #!endif
if(!t_relay_cancel()) { sl_send_reply("500", "Internal Server Error"); exit; } exit;
}
And in failure_route:
if(t_is_canceled()) {
    xlog("L_INFO", "[R-OUTBOUND-VENDOR-LNP-DIP-CATCH:$ci] Transaction cancelled\n");

    # Nothing further necessary - CANCEL was already processed
    # in TM handling in main request route.

    exit;
}
The only other nuance, if it might be relevant, is that the initial INVITE was processed and relayed out of an rtimer process, after being t_suspended and being put on an mqueue and t_consumed out of rtimer. However, this is not the case with CANCEL or any other requests, sequential or initial.
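Roughly, the suspend-and-enqueue side of that pattern looks like this (a minimal sketch only; the queue name, route name and timer settings here are illustrative, not the real configuration):

    # mqueue + rtimer declarations (illustrative names)
    modparam("mqueue", "mqueue", "name=invite_q")
    modparam("rtimer", "timer", "name=async;interval=50000u;mode=1;")
    modparam("rtimer", "exec", "timer=async;route=ASYNC_RESUME")

    # called from the main request route after AAA
    route[ASYNC_SUSPEND] {
        if(!t_suspend()) {
            xlog("L_ERR", "[R-MAIN:$ci] failed to suspend transaction\n");
            sl_send_reply("500", "Internal Server Error");
            exit;
        }
        # store the suspend coordinates for the rtimer consumer
        mq_add("invite_q", "$T(id_index)", "$T(id_label)");
        exit;
    }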
-- Alex
On 03/19/2015 04:56 PM, Alex Balashov wrote:
t_consumed out of rtimer
Sorry, I meant t_continued().
The issue seems to be related to the invite transaction (which was canceled), apparently after running its failure handlers. By chance, do you know if the invite was relayed?
Daniel
On 03/19/2015 04:59 PM, Daniel-Constantin Mierla wrote:
The issue seems to be related to the invite transaction (which was canceled), apparently after running its failure handlers. By chance, do you know if the invite was relayed?
It does appear to have been:
[R-OUTBOUND-VENDOR-ROUTE-TRY:7842d8b9-4915-1233-75a9-180373eddd8e] -> Final request URI: sip:xxx@yyy:5060
Mar 19 20:00:44 Proxy1 /usr/local/sbin/kamailio[23976]: INFO: [R-INVITE-OUTBOUND-HANDLER-POSTROUTE:7842d8b9-4915-1233-75a9-180373eddd8e] -> Request processing delay: 53 ms
These are the last log lines emitted before my t_relay() call, so I would safely assume t_relay() was called.
-- Alex
I have two core dumps from the last few days with the same crash, and I took a look at the call flows for both Call-IDs.
Both seem to play out the same way as the crash I sent you:
Mar 19 13:22:57 Proxy1 /usr/local/sbin/kamailio[21580]: INFO: [R-OUTBOUND-VENDOR-ROUTE-TRY:e5e48a99-48dd-1233-96b7-782bcb13da6a] -> Final request URI: sip:yyy@xxx:5060
Mar 19 13:22:57 Proxy1 /usr/local/sbin/kamailio[21580]: INFO: [R-INVITE-OUTBOUND-HANDLER-POSTROUTE:e5e48a99-48dd-1233-96b7-782bcb13da6a] -> Request processing delay: 27 ms
Mar 19 13:22:57 Proxy1 /usr/local/sbin/kamailio[21580]: INFO: [R-OUTBOUND-VENDOR-ROUTE-TRY-BRANCH:e5e48a99-48dd-1233-96b7-782bcb13da6a] Sending media through rtpproxy
Mar 19 13:22:57 Proxy1 /usr/local/sbin/kamailio[21572]: INFO: [R-MAIN:e5e48a99-48dd-1233-96b7-782bcb13da6a] CANCEL received from xxx (RURI=sip:yyy@xxx:5060)
Mar 19 13:22:58 Proxy1 /usr/local/sbin/kamailio[21569]: INFO: [R-MAIN:e5e48a99-48dd-1233-96b7-782bcb13da6a] CANCEL received from xxx (RURI=sip:yyy@xxx:5060)
Mar 19 13:23:00 Proxy1 /usr/local/sbin/kamailio[21566]: INFO: [R-MAIN:e5e48a99-48dd-1233-96b7-782bcb13da6a] CANCEL received from xxx (RURI=sip:yyy@xxx:5060)
They also have another common characteristic:
The normal call flow is to relay the INVITE on branch .0 to an LNP gateway, then to capture the LRN from a redirect and relay the INVITE to its actual destination via branch .1. In all of these cases, however, there was a Redis cache hit on the LRN, so branch .0 was never needed.
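In rough config terms, the two paths look something like this (a heavily simplified sketch; the cache-lookup route and failure-route names are made up for illustration, and the redirect capture is shown with get_redirects() from uac_redirect, which may not match the actual config):

    # hypothetical cache lookup; sets $avp(lrn) on a hit
    route(LRN_CACHE_LOOKUP);

    if($avp(lrn) != $null) {
        # Redis cache hit: no LNP dip needed, go straight to the destination
        $rU = $avp(lrn);
        t_relay();
        exit;
    }

    # cache miss: branch .0 goes to the LNP gateway, redirect handled below
    t_on_failure("LNP_REDIRECT");
    t_relay();
    exit;

    failure_route[LNP_REDIRECT] {
        if(t_check_status("3[0-9][0-9]")) {
            # turn the 3xx Contacts into new branches (branch .1 onward)
            get_redirects("*");
            t_relay();
        }
    }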
-- Alex
Does it happen that you create the transaction for INVITE with t_newtran(), then add/remove headers and after that call t_suspend()?
Daniel
On 03/19/2015 05:13 PM, Daniel-Constantin Mierla wrote:
Does it happen that you create the transaction for INVITE with t_newtran(), then add/remove headers and after that call t_suspend()?
Yes, although both the t_newtran() call and any header manipulation happens after t_continue().
Furthermore, t_newtran() is only called if(!t_lookup_request()).
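In other words, the relevant fragment is essentially this (illustrative only, not the literal config):

    # create the transaction only if one does not already exist
    if(!t_lookup_request()) {
        t_newtran();
    }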
-- Alex
On 03/19/2015 05:16 PM, Alex Balashov wrote:
On 03/19/2015 05:13 PM, Daniel-Constantin Mierla wrote:
Does it happen that you create the transaction for INVITE with t_newtran(), then add/remove headers and after that call t_suspend()?
Yes, although both the t_newtran() call and any header manipulation happens after t_continue().
Furthermore, t_newtran() is only called if(!t_lookup_request()).
Well, that's not _entirely_ true. It turns out we do remove_hf("Route") before t_suspend().
On 19/03/15 22:16, Alex Balashov wrote:
On 03/19/2015 05:13 PM, Daniel-Constantin Mierla wrote:
Does it happen that you create the transaction for INVITE with t_newtran(), then add/remove headers and after that call t_suspend()?
Yes, although both the t_newtran() call and any header manipulation happens after t_continue().
But then the answer is no -- t_continue() should be after t_suspend().
Daniel
Furthermore, t_newtran() is only called if(!t_lookup_request()).
-- Alex
On 03/19/2015 05:18 PM, Daniel-Constantin Mierla wrote:
On 19/03/15 22:16, Alex Balashov wrote:
On 03/19/2015 05:13 PM, Daniel-Constantin Mierla wrote:
Does it happen that you create the transaction for INVITE with t_newtran(), then add/remove headers and after that call t_suspend()?
Yes, although both the t_newtran() call and any header manipulation happens after t_continue().
But then the answer is no -- t_continue() should be after t_suspend().
Indeed, it is. If you're asking if anything happens between t_suspend() and t_continue(), the answer is no. All further processing occurs after the rtimer thread dequeues and t_continues the transaction.
On 19/03/15 22:20, Alex Balashov wrote:
On 03/19/2015 05:18 PM, Daniel-Constantin Mierla wrote:
On 19/03/15 22:16, Alex Balashov wrote:
On 03/19/2015 05:13 PM, Daniel-Constantin Mierla wrote:
Does it happen that you create the transaction for INVITE with t_newtran(), then add/remove headers and after that call t_suspend()?
Yes, although both the t_newtran() call and any header manipulation happens after t_continue().
But then the answer is no -- t_continue() should be after t_suspend().
Indeed, it is. If you're asking if anything happens between t_suspend() and t_continue(), the answer is no. All further processing occurs after the rtimer thread dequeues and t_continues the transaction.
I asked about what happens before t_suspend() -- whether you have t_newtran() there and then some header/msg changes.
Daniel
On 03/19/2015 05:22 PM, Daniel-Constantin Mierla wrote:
I asked about what happens before t_suspend() -- whether you have t_newtran() there and then some header/msg changes.
Ah, I see. In that case, no, no transaction is created then.
In some other call flows - not these - it is possible for us to create a transaction in order to support nonce protection on authentication challenges, i.e.
t_lookup_request();
if(!auth_check("$Ri", "subscriber", "0")) { ...
This happens before t_suspend(), although no header manipulations are done apart from removing the Route header if present. Is this a bad idea?
However, in this case the source is IP-trusted, so this does not take place.
-- Alex
Can you cherry-pick the following commit and give it a try on branch 4.2?
- https://github.com/kamailio/kamailio/commit/0ee3dc5e3edc49cf62f97ddd87a40b12...
It will be backported if all goes fine.
Cheers, Daniel
Hello,
On 03/20/2015 08:00 AM, Daniel-Constantin Mierla wrote:
Can you cherry-pick the following commit and give it a try on branch 4.2?
https://github.com/kamailio/kamailio/commit/0ee3dc5e3edc49cf62f97ddd87a40b12...
It will be backported if all goes fine.
I've put this into production now, and will let you know if there are any more crashes. Thank you very much for your hard work in identifying the problem.
Hello,
The system experienced another crash yesterday, but unfortunately the core dump is not very insightful, possibly due to being incomplete:
BFD: Warning: /tmp/./core.kamailio.500.1427402410.27498 is truncated: expected core file size >= 8602058752, found: 1769852928.
[New Thread 27498]
Cannot access memory at address 0x7f52891e3168
Cannot access memory at address 0x7f52891e3168
Cannot access memory at address 0x7f52891e3168
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Failed to read a valid object file image from memory.
Core was generated by `/usr/local/sbin/kamailio -P /var/run/kamailio.pid -m 8192 -u evaristesys -g eva'.
Program terminated with signal 11, Segmentation fault.
#0  0x00007f5286d97e45 in ?? ()
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.149.el6_6.5.x86_64
(gdb) where
#0  0x00007f5286d97e45 in ?? ()
Cannot access memory at address 0x7fffbe32a210
That's not much help at all, so I cannot possibly say it is for the same reasons as before.
This was a rather peculiar crash:
From the logs, it would appear that Kamailio simply stopped processing messages at some point. There's about 8 minutes of zero log output during a period of constant incoming traffic.
The situation was eventually resolved when all Kamailio processes exited with a normal SIGTERM, after someone manually restarted the service:
Mar 26 20:40:10 Proxy1 /usr/local/sbin/kamailio[27498]: NOTICE: <core> [main.c:739]: handle_sigs(): Thank you for flying kamailio!!!
Mar 26 20:40:10 Proxy1 /usr/local/sbin/kamailio[27535]: INFO: <core> [main.c:850]: sig_usr(): signal 15 received.
...
But there are a few things here that are difficult to explain from the log:
1. Why was there no SIP stack response for 8 minutes, no logging activity, etc?
2. We have a script that checks every second whether Kamailio processes are running, and restarts Kamailio if they are not. It also sends an e-mail informing us when this happens.
It's a rather naive check:
ps aux | grep kamailio | grep -v 'grep kamailio' | wc -l
But in this case, the script was not triggered, which would imply that some Kamailio processes--perhaps all--remained running.
There is no indication in the logs that any process died for any reason, except for the 'signal 15' received by all processes at the time of manual restart.
3. Why was a core dump generated at the time of the restart, if nothing crashed?
#3 is most interesting to me, because if it were some other problem, e.g. blocking of SIP worker threads for some reason, then I wouldn't expect a core dump upon service shutdown.
There is no other indication of any child process dying with SIGSEGV or SIGABRT.
-- Alex
On 03/27/2015 06:17 AM, Alex Balashov wrote:
That's not much help at all, so I cannot possibly say it is for the same reasons as before.
Well, I do have this in 'dmesg':
kamailio[27498]: segfault at 42 ip 00007f5286d97e45 sp 00007fffbe32a210 error 4 in tm.so[7f5286d40000+10d000]
But I don't really know how to look up those offsets easily.
[root@Proxy1 ~]# nm /usr/local/sbin/kamailio | grep 7f5
[root@Proxy1 ~]# nm /usr/local/lib64/kamailio/modules/* | grep 7f5
0000000000007f52 t futex_release
0000000000087f58 r __FUNCTION__.23413
0000000000057f51 r __FUNCTION__.4494
0000000000017f5e T init_rpid_avp
000000000007f5da t w_t_replicate
You should be able to find some hints on Google about resolving the instruction pointer address into a source code line.
As a lead in the meantime: I guess you are using t_replicate() after t_continue(), right? Once I get some time, I will check whether something could have a negative impact in such a scenario...
Cheers, Daniel
On 03/27/2015 09:58 AM, Daniel-Constantin Mierla wrote:
As a lead in the meantime: I guess you are using t_replicate() after t_continue(), right? Once I get some time, I will check whether something could have a negative impact in such a scenario...
No, actually, t_replicate() is not used.
On 27/03/15 14:59, Alex Balashov wrote:
On 03/27/2015 09:58 AM, Daniel-Constantin Mierla wrote:
As a lead in the meantime: I guess you are using t_replicate() after t_continue(), right? Once I get some time, I will check whether something could have a negative impact in such a scenario...
No, actually, t_replicate() is not used.
Then the logs were misleading... maybe some corrupted core file ...
Daniel
On 03/27/2015 10:08 AM, Daniel-Constantin Mierla wrote:
Then the logs were misleading... maybe some corrupted core file ...
It does look that way.
This crash may just have to be treated as "idiopathic", since the core dump contains nothing of interest. Not unless you know how to get something useful from a core file deemed to be "truncated" ...
BFD: Warning: /tmp/core.kamailio.500.1427402410.27498 is truncated: expected core file size >= 8602058752, found: 1769852928.
[New Thread 27498]
Cannot access memory at address 0x7f52891e3168
Cannot access memory at address 0x7f52891e3168
Cannot access memory at address 0x7f52891e3168
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Failed to read a valid object file image from memory.
Core was generated by `/usr/local/sbin/kamailio -P /var/run/kamailio.pid -m 8192 -u evaristesys -g eva'.
Program terminated with signal 11, Segmentation fault.
#0  0x00007f5286d97e45 in ?? ()
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.149.el6_6.5.x86_64
Daniel,
Is there anything in master:0ee3dc5e3edc49cf62f97ddd87a40b12c59b73ff that might cause this kind of deadlock? I am wondering because I still have the intuitive sense that all the problems I am experiencing with Kamailio 4.2 here are caused in some way by my switch to asynchronous call processing.
I don't have any proof that this deadlock is also an extension of that general problem domain. I just have that feeling, because the Kamailio upgrade corresponded to a config upgrade where the major change made was precisely that.
-- Alex
Very unlikely; that commit restored something that used to be there in older versions.
On the other hand, I remembered the report from Jason that, in some very particular cases, a CANCEL can end up in a double lock over the reply mutex. The traces you sent me involved cancel handling, so it could be that case. Before commit 0ee3dc it showed up as a crash, because accessing the branch was invalid; with that fix in place, execution goes through that code and can end up in the lock.
I already pushed a patch for it: 3957db5fb51e23535a89b15c8f05463e5702424d
Maybe you can backport and report the results. It will be part of the upcoming minor release later this week.
Cheers, Daniel
On 27/03/15 19:11, Alex Balashov wrote:
Daniel,
Is there anything in master:0ee3dc5e3edc49cf62f97ddd87a40b12c59b73ff that might cause this kind of deadlock? I am wondering because I still have the intuitive sense that all the problems I am experiencing with Kamailio 4.2 here are caused in some way by my switch to asynchronous call processing.
I don't have any proof that this deadlock is also an extension of that general problem domain. I just have that feeling, because the Kamailio upgrade corresponded to a config upgrade where the major change made was precisely that.
-- Alex
Thank you for that, Daniel. I will certainly backport it and keep you apprised of developments.
On 03/30/2015 11:07 AM, Daniel-Constantin Mierla wrote:
Very unlikely; that commit restored something that used to be there in older versions.
On the other hand, I remembered the report from Jason that, in some very particular cases, a CANCEL can end up in a double lock over the reply mutex. The traces you sent me involved cancel handling, so it could be that case. Before commit 0ee3dc it showed up as a crash, because accessing the branch was invalid; with that fix in place, execution goes through that code and can end up in the lock.
I already pushed a patch for it: 3957db5fb51e23535a89b15c8f05463e5702424d
Maybe you can backport and report the results. It will be part of the upcoming minor release later this week.
Cheers, Daniel
I just encountered another crash with a truncated core dump which was useless, but I did receive this log output:
Mar 30 18:32:04 Proxy1 /usr/local/sbin/kamailio[6979]: NOTICE: <core> [main.c:739]: handle_sigs(): Thank you for flying kamailio!!!
Mar 30 18:32:04 Proxy1 /usr/local/sbin/kamailio[7009]: INFO: <core> [main.c:850]: sig_usr(): signal 15 received
Mar 30 18:32:04 Proxy1 /usr/local/sbin/kamailio[7007]: INFO: <core> [main.c:850]: sig_usr(): signal 15 received
Mar 30 18:32:04 Proxy1 /usr/local/sbin/kamailio[7004]: INFO: <core> [main.c:850]: sig_usr(): signal 15 received
Mar 30 18:32:04 Proxy1 /usr/local/sbin/kamailio[7006]: INFO: <core> [main.c:850]: sig_usr(): signal 15 received
Mar 30 18:32:04 Proxy1 /usr/local/sbin/kamailio[7000]: INFO: <core> [main.c:850]: sig_usr(): signal 15 received
Mar 30 18:32:04 Proxy1 /usr/local/sbin/kamailio[6980]: INFO: <core> [main.c:850]: sig_usr(): signal 15 received
Mar 30 18:32:04 Proxy1 /usr/local/sbin/kamailio[7008]: INFO: <core> [main.c:850]: sig_usr(): signal 15 received
Mar 30 18:32:04 Proxy1 /usr/local/sbin/kamailio[6998]: INFO: <core> [main.c:850]: sig_usr(): signal 15 received
Mar 30 18:32:04 Proxy1 /usr/local/sbin/kamailio[7005]: INFO: <core> [main.c:850]: sig_usr(): signal 15 received
Mar 30 18:32:04 Proxy1 /usr/local/sbin/kamailio[6984]: INFO: <core> [main.c:850]: sig_usr(): signal 15 received
Mar 30 18:32:04 Proxy1 /usr/local/sbin/kamailio[6989]: INFO: <core> [main.c:850]: sig_usr(): signal 15 received
Mar 30 18:32:04 Proxy1 /usr/local/sbin/kamailio[7003]: INFO: <core> [main.c:850]: sig_usr(): signal 15 received
Mar 30 18:32:04 Proxy1 /usr/local/sbin/kamailio[6988]: INFO: <core> [main.c:850]: sig_usr(): signal 15 received
Mar 30 18:32:04 Proxy1 /usr/local/sbin/kamailio[7002]: INFO: <core> [main.c:850]: sig_usr(): signal 15 received
Mar 30 18:32:04 Proxy1 /usr/local/sbin/kamailio[7001]: INFO: <core> [main.c:850]: sig_usr(): signal 15 received
Mar 30 18:32:04 Proxy1 /usr/local/sbin/kamailio[6999]: INFO: <core> [main.c:850]: sig_usr(): signal 15 received
Mar 30 18:32:04 Proxy1 /usr/local/sbin/kamailio[6992]: INFO: <core> [main.c:850]: sig_usr(): signal 15 received
Mar 30 18:32:04 Proxy1 /usr/local/sbin/kamailio[6994]: INFO: <core> [main.c:850]: sig_usr(): signal 15 received
Mar 30 18:32:04 Proxy1 /usr/local/sbin/kamailio[6981]: INFO: <core> [main.c:850]: sig_usr(): signal 15 received
Mar 30 18:32:04 Proxy1 /usr/local/sbin/kamailio[6993]: INFO: <core> [main.c:850]: sig_usr(): signal 15 received
Mar 30 18:32:04 Proxy1 /usr/local/sbin/kamailio[6991]: INFO: <core> [main.c:850]: sig_usr(): signal 15 received
Mar 30 18:32:04 Proxy1 /usr/local/sbin/kamailio[6990]: INFO: <core> [main.c:850]: sig_usr(): signal 15 received
Mar 30 18:32:04 Proxy1 /usr/local/sbin/kamailio[6987]: INFO: <core> [main.c:850]: sig_usr(): signal 15 received
Mar 30 18:32:04 Proxy1 /usr/local/sbin/kamailio[6982]: INFO: <core> [main.c:850]: sig_usr(): signal 15 received
Mar 30 18:32:04 Proxy1 /usr/local/sbin/kamailio[6995]: INFO: <core> [main.c:850]: sig_usr(): signal 15 received
Mar 30 18:32:04 Proxy1 /usr/local/sbin/kamailio[6997]: INFO: <core> [main.c:850]: sig_usr(): signal 15 received
Mar 30 18:32:04 Proxy1 /usr/local/sbin/kamailio[6996]: INFO: <core> [main.c:850]: sig_usr(): signal 15 received
Mar 30 18:32:04 Proxy1 /usr/local/sbin/kamailio[6985]: INFO: <core> [main.c:850]: sig_usr(): signal 15 received
Mar 30 18:32:04 Proxy1 /usr/local/sbin/kamailio[6986]: INFO: <core> [main.c:850]: sig_usr(): signal 15 received
Mar 30 18:32:04 Proxy1 /usr/local/sbin/kamailio[6983]: INFO: <core> [main.c:850]: sig_usr(): signal 15 received
Mar 30 18:32:04 Proxy1 /usr/local/sbin/kamailio[6979]: :
<core> [mem/f_malloc.c:586]: fm_free(): BUG: fm_free: bad pointer 0x4c36294300000000 (out of memory block!), called from tm: h_table.c: free_cell(162) - aborting
Any interpretation here? Is this an OOM error? I have given Kamailio 8 GB of SHM; that ought to be enough for anyone, right?
PID 6979 is the main process here (it is the one handling the signals, according to the log); it does not do anything at runtime, so it does not crash at runtime. The last error comes from the cleanup it tried to do, which can happen if another process was cleaning up the structure but did not finish because it was killed by the main process.
The reason for the crash should appear before the message with "Thank you for flying ..." -- can you check if you get other relevant log messages before it?
Cheers, Daniel
On 30/03/15 21:40, Alex Balashov wrote:
I just encountered another crash with a truncated core dump which was useless, but I did receive this log output:
<core> [mem/f_malloc.c:586]: fm_free(): BUG: fm_free: bad pointer 0x4c36294300000000 (out of memory block!), called from tm: h_table.c: free_cell(162) - aborting
Any interpretation here? Is this an OOM error? I have given Kamailio 8 GB of SHM; that ought to be enough for anyone, right?
On 03/30/2015 04:39 PM, Daniel-Constantin Mierla wrote:
The reason for the crash should appear before the message with "Thank you for flying ..." -- can you check if you get other relevant log messages before it?
I did, but unfortunately there are no relevant messages. Just normal call processing, and a sudden shutdown. I'll check again to see if a process might have crashed earlier.
On 03/30/2015 04:41 PM, Alex Balashov wrote:
I did, but unfortunately there are no relevant messages. Just normal call processing, and a sudden shutdown. I'll check again to see if a process might have crashed earlier.
Unfortunately, there are no other informational messages, such as those related to the death of a child process earlier on. What I sent is all I have.
In the absence of any other information, I am going to chalk this up to the fact that our SIP stack responsiveness check, which was added to the monitoring script to monitor for the deadlock I reported earlier, is too sensitive. It probably restarted Kamailio on a hair trigger, and the memory error in PID 6979 is just a side effect.
I'm going to disable the SIP check, and hope your patch fixes any possible deadlock.
Checking the code, that line corresponds to SIGTERM or SIGINT handling, which is normal termination of the application. So it seems that the application was simply stopped.
Cheers, Daniel
On 30/03/15 23:22, Alex Balashov wrote:
In the absence of any other information, I am going to chalk this up to the fact that our SIP stack responsiveness check, which was added to the monitoring script to monitor for the deadlock I reported earlier, is too sensitive. It probably restarted Kamailio on a hair trigger, and the memory error in PID 6979 is just a side effect.
I'm going to disable the SIP check, and hope your patch fixes any possible deadlock.
Based on the fact that I have seen no further crashes or deadlocks, I would be inclined to rule this campaign a success.
On 09/04/15 05:10, Alex Balashov wrote:
Based on the fact that I have seen no further crashes or deadlocks, I would be inclined to rule this campaign a success.
A proof the new embedded systemd supervising the internal eco-system is worth it ;-)
Anyhow, thanks for testing and troubleshooting, it is good to know it was fixed.
Cheers, Daniel
Unfortunately, we do keep running into total deadlocks with this approach under high load. We've had to revert async call processing (with mqueue + rtimer + t_suspend) back to non-async because the deadlocks were growing politically unacceptable.
I never managed to get a backtrace of the state of the processes at the time the deadlock occurs, because the customer would soon restart the proxy when this happened, as it would happen during busy times in the middle of production days. The only core dumps I have are the crashes on shutdown, which strikes me as incidental/irrelevant.
Your message is rather dry in the context of the previous long discussions, so it is hard to troubleshoot such a report. Can you provide details what version are you running now? Also, how often it happens, any error messages before it happens? Do you suspend more than one time the transaction?
Might not be the case, but if you use locking functions exposed by htable module, there was a recent fix for them.
Cheers, Daniel
On 08/05/15 23:43, Alex Balashov wrote:
Unfortunately, we do keep running into total deadlocks with this approach under high load. We've had to revert async call processing (with mqueue + rtimer + t_suspend) back to non-async because the deadlocks were growing politically unacceptable.
I never managed to get a backtrace of the state of the processes at the time the deadlock occurs, because the customer would soon restart the proxy when this happened, as it would happen during busy times in the middle of production days. The only core dumps I have are the crashes on shutdown, which strikes me as incidental/irrelevant.
Hello Daniel,
On 05/09/2015 03:56 AM, Daniel-Constantin Mierla wrote:
Your message is rather dry in the context of the previous long discussions, so it is hard to troubleshoot such a report.
Yes, I know. It is not a choice of attitude on my part, but rather a reflection of the constraints I myself face.
Can you provide details what version are you running now?
I am running git 4.2:2421e0b987119c083bd5cebe754ae8c97ebfa867.
Also, how often it happens, any error messages before it happens?
Oddly enough, no; at some point, call processing just stalls. It doesn't always completely stall, sometimes there is a trickle of call processing messages, and those look more or less normal. It's just a great reduction from the previous activity. This state of affairs lasts for a few minutes, until Kamailio is restarted manually by someone.
Do you suspend more than one time the transaction?
No.
Might not be the case, but if you use locking functions exposed by htable module, there was a recent fix for them.
I do not use those.
The general life cycle of an outbound INVITE in this installation is that AAA is done in the main SIP worker, including digest challenge authentication, which is used heavily here. I have also previously encountered crash bugs in this challenge scenario:
t_lookup_request(); # For nonce replay protection.
if(!auth_check("$Ri", "subscriber", "1")) {
    auth_challenge("$Ri", "1");
    t_release();
    exit;
}
consume_credentials();
... anyway, after this step, the transaction is t_suspended() (once), the index and label are loaded onto an mqueue, and the queue is consumed by one of several rtimer processes which poll it nearly continuously (interval 50000u). There, the transaction is t_continued() and all remaining call processing, including forking for failover, is performed.
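A rough sketch of what that consumer side looks like (queue, variable and route names are illustrative only, not the literal config):

    route[ASYNC_RESUME] {
        # drain the queue; each entry carries the suspend coordinates
        while(mq_fetch("invite_q")) {
            $var(tindex) = $mqk(invite_q);
            $var(tlabel) = $mqv(invite_q);
            # resume the suspended INVITE; outbound routing, forking for
            # failover, etc. run inside route[OUTBOUND]
            t_continue("$var(tindex)", "$var(tlabel)", "OUTBOUND");
        }
    }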
-- Alex
On Sunday 10 May 2015, Alex Balashov wrote:
Can you provide details what version are you running now?
I am running git 4.2:2421e0b987119c083bd5cebe754ae8c97ebfa867.
That one doesn't seem to exist in the public repository.
Make sure you have at least 2a0fab2d66b299cb6003cead79a986324d16560d in your branch. This fixes a timer bug. There might be a lot more bugs like this, as the compiler warnings for this type of error are squelched, which hides tons of warnings.
Hello,
On 11/05/15 09:37, Alex Hermann wrote:
On Sunday 10 May 2015, Alex Balashov wrote:
Can you provide details what version are you running now?
I am running git 4.2:2421e0b987119c083bd5cebe754ae8c97ebfa867.
That one doesn't seem to exist in the public repository.
Make sure you have at least 2a0fab2d66b299cb6003cead79a986324d16560d in your branch. This fixes a timer bug. There might be a lot more bugs like this, as the compiler warnings for this type of error are squelched, which hides tons of warnings.
I pushed a patch over the weekend that is related to a reported crash when removing a suspended transaction from the timer:
- https://github.com/kamailio/kamailio/commit/96b84ba4d7f5ee4a763fe7db81bbf4f9...
I haven't had the time for a deep analysis, but there were two different mutexes (the async and reply locks) used for handling the suspended transaction, and that may cause problems in some corner cases. The patch changes this to use the reply lock, which is re-entrant, so the process that acquired the transaction can handle it properly whether it continues or runs reply/failure route handling.
It will be backported as soon as more testing is performed.
Cheers, Daniel
It would be good to add some SIP keepalive monitoring (e.g., a cron job with sipsak sending OPTIONS) that will alert/restart in case of no response. The monit tool can also send SIP keepalives and take action on no response.
On a deadlock, checking the process table is not enough. There should have been high CPU usage, though, if you were monitoring that.
Cheers, Daniel
On 03/27/2015 10:00 AM, Daniel-Constantin Mierla wrote:
It would be good to add some SIP keepalive monitoring (e.g., a cron job with sipsak sending OPTIONS) that will alert/restart in case of no response. The monit tool can also send SIP keepalives and take action on no response.
On a deadlock, checking the process table is not enough. There should have been high CPU usage, though, if you were monitoring that.
I agree that this type of monitoring is insufficient. It was a short-term stop-gap measure to restart Kamailio if it crashes (due to the original crash bug that is the subject of this thread).
Normally, Kamailio is so stable we don't really worry about it crashing, so there wasn't much of a need for monitoring it or its SIP responsiveness. :-) It runs for months or years without issue.