[sr-dev] TM possible deadlock
Jason Penton
jason.penton at gmail.com
Tue Feb 3 08:29:17 CET 2015
Hi Daniel,
I have found another instance of a similar bug where the a process gets
itself into deadlock. The sequence of events:
1. INVITE request is relayed downstream
2. error response or locally generated 408 occurs
2. Failure route in cfg is armed and tries to relay to alternate
destination (voicemail/annoucnement server, etc) - using t_relay()
3. t_relay fails to destination for whatever reason (in this case the
result of the destination being blacklisted in TM)
4. Original reply is attempted to be relayed upstream (this is all done
safely - ie using t_reply_unsafe() as the REPLY_LOCK has already been
acquired in failure route
5. At or around line 562 of t_reply.c we check if the reply is > 200 (final
response). If it is and if its an INVITE transaction, we clean up timers
and cancel all UAC branches. This is where we hit a problem:
cancel_uacs( trans, &cancel_data, F_CANCEL_B_KILL, lock );
6. cancel_uacs calls cancel_branch if there was no response on the branch
by calling cancel_branch()
7. cancel branch calls LOCK_REPLIES(t) on the transaction and you hit
deadlock.... (original call to LOCK_REPLIES(t) happened in failure route
processing)
This use case is slightly different to the one you fixed related to
retransmission timers... (the beginning of this thread)
Suggestion: What about a pkg/process variable that can be used to check if
you have already acquired the reply lock to combat these non-reentrant
deadlocks. A side-effect of this would save on passing variables around in
various functions indicating intention to lock or not.
Thoughts?
Cheers
Jason
On Fri Apr 11 2014 at 14:48:59 Daniel-Constantin Mierla <miconda at gmail.com>
wrote:
>
> On 11/04/14 14:45, Jason Penton wrote:
>
> yes okay - got it. I will test and analyse.
>
> I meant I want to analyze before deciding to backport, but a second
> analysis is never bad :-)
>
> Cheers,
> Daniel
>
>
>
> Thanks Daniel!
>
>
> On Thu, Apr 10, 2014 at 4:35 PM, Daniel-Constantin Mierla <
> miconda at gmail.com> wrote:
>
>> Hello,
>>
>> iirc, there are several functions that script writer can use, like
>> t_reply_callid() from tmx. The idea is to analyze a bit in order to detect
>> if a forced reply may end up in canceling some pending branches -- the
>> reply on the branch doesnt matter anymore and should not be considered
>> anymore for relaying upstream, because the script writer already decided
>> what to send out.
>>
>> Cheers,
>> Daniel
>>
>>
>>
>> On 10/04/14 13:24, Jason Penton wrote:
>>
>> Hey Daniel,
>>
>> which reply functions are you referring to? API functions?
>>
>> Cheers
>> Jason
>>
>>
>> On Thu, Apr 10, 2014 at 12:53 PM, Daniel-Constantin Mierla <
>> miconda at gmail.com> wrote:
>>
>>> OK. I will leave it a bit in master to see if there are any new
>>> reports, then I will backport. I will also have to review the tm reply
>>> functions that can be used from config to align them to the new check.
>>>
>>> Cheers,
>>> Daniel
>>>
>>>
>>> On 10/04/14 09:06, Jason Penton wrote:
>>>
>>> oh excellent, I will look at it right away - was just getting ready to
>>> jump in myself ;)
>>>
>>> Cheers
>>> Jason
>>>
>>>
>>> On Thu, Apr 10, 2014 at 9:01 AM, Daniel-Constantin Mierla <
>>> miconda at gmail.com> wrote:
>>>
>>>> Hello Jason,
>>>>
>>>> I pushed a patch trying to fix this case, it is only on git master
>>>> branch. Can you test it? If all goes fine, we can consider backporting it.
>>>>
>>>> Cheers,
>>>> Daniel
>>>>
>>>>
>>>> On 09/04/14 23:26, Jason Penton wrote:
>>>>
>>>> Hey Daniel,
>>>>
>>>> nothing extraordinary...
>>>>
>>>> # -- TM params --
>>>> modparam("tm", "fr_timer", 20000);
>>>> modparam("tm", "fr_inv_timer", 10000)
>>>>
>>>>
>>>> Cheers
>>>> Jason
>>>>
>>>>
>>>> On Wed, Apr 9, 2014 at 10:32 PM, Jason Penton <jason.penton at gmail.com>
>>>> wrote:
>>>>
>>>>> Hey Daniel,
>>>>>
>>>>> Yes I did a test with a very basic config file and I am not able to
>>>>> re-create. However, with my *complex* cfg file I can re-create every time.
>>>>> Tomorrow I will compare what is different and report back... hopefully with
>>>>> fix ;)
>>>>>
>>>>> here is bt of timer process deadlocking itself:
>>>>>
>>>>> #0 syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:39
>>>>> #1 0x00007f5009f22004 in futex_get (lock=0x7f4fc55030d8) at
>>>>> ../../mem/../futexlock.h:123
>>>>> #2 0x00007f5009f223e1 in _lock (s=0x7f4fc55030d8, file=0x7f5009f90fd1
>>>>> "t_cancel.c", function=0x7f5009f91980 "cancel_branch", line=250) at
>>>>> lock.h:99
>>>>> #3 0x00007f5009f23271 in cancel_branch (t=0x7f4fc5501b40, branch=0,
>>>>> reason=0x7fff646d03a8, flags=3) at t_cancel.c:250
>>>>> #4 0x00007f5009f22c02 in cancel_uacs (t=0x7f4fc5501b40,
>>>>> cancel_data=0x7fff646d03a0, flags=1) at t_cancel.c:123
>>>>> #5 0x00007f5009f718c4 in _reply_light (trans=0x7f4fc5501b40,
>>>>> buf=0x7f500a24dc68 "SIP/2.0 500 Server error on LIR select next
>>>>> S-CSCF\r\nVia: SIP/2.0/UDP 10.0.1.167:6060;branch=z9hG4bKb7.2ae09f29ffbd0034cd6d58483053603b.1\r\nVia:
>>>>> SIP/2.0/UDP 10.0.1.166:4060;branch=z9hG4bKb7.3faa03ddea80"...,
>>>>> len=778, code=500, to_tag=0x7f500a1c7ae0
>>>>> "c82b15d7f12ef185f95fe4945457d449-8bab", to_tag_len=37, lock=0,
>>>>> bm=0x7fff646d0b60) at t_reply.c:660
>>>>> #6 0x00007f5009f7244c in _reply (trans=0x7f4fc5501b40,
>>>>> p_msg=0x7f500a1c6bc0, code=500, text=0x7f500a249a48 "Server error on LIR
>>>>> select next S-CSCF", lock=0) at t_reply.c:795
>>>>> #7 0x00007f5009f76436 in t_reply_unsafe (t=0x7f4fc5501b40,
>>>>> p_msg=0x7f500a1c6bc0, code=500, text=0x7f500a249a48 "Server error on LIR
>>>>> select next S-CSCF") at t_reply.c:1643
>>>>> #8 0x00007f5009f57621 in w_t_reply (msg=0x7f500a1c6bc0,
>>>>> p1=0x7f500a2497d8 "\340\332$\nP\177", p2=0x7f500a249870 "h\321$\nP\177") at
>>>>> tm.c:1324
>>>>> #9 0x000000000041a700 in do_action (h=0x7fff646d1d30,
>>>>> a=0x7f500a24cee8, msg=0x7f500a1c6bc0) at action.c:1119
>>>>> #10 0x0000000000423831 in run_actions (h=0x7fff646d1d30,
>>>>> a=0x7f500a24cee8, msg=0x7f500a1c6bc0) at action.c:1607
>>>>> #11 0x000000000041a5a4 in do_action (h=0x7fff646d1d30,
>>>>> a=0x7f500a24d478, msg=0x7f500a1c6bc0) at action.c:1102
>>>>> #12 0x0000000000423831 in run_actions (h=0x7fff646d1d30,
>>>>> a=0x7f500a249148, msg=0x7f500a1c6bc0) at action.c:1607
>>>>> #13 0x000000000041a54e in do_action (h=0x7fff646d1d30,
>>>>> a=0x7f500a24c500, msg=0x7f500a1c6bc0) at action.c:1098
>>>>> #14 0x0000000000423831 in run_actions (h=0x7fff646d1d30,
>>>>> a=0x7f500a247a28, msg=0x7f500a1c6bc0) at action.c:1607
>>>>> #15 0x0000000000423fdf in run_top_route (a=0x7f500a247a28,
>>>>> msg=0x7f500a1c6bc0, c=0x0) at action.c:1693
>>>>> #16 0x00007f5009f73815 in run_failure_handlers (t=0x7f4fc5501b40,
>>>>> rpl=0xffffffffffffffff, code=408, extra_flags=96) at t_reply.c:1061
>>>>> #17 0x00007f5009f7527a in t_should_relay_response
>>>>> (Trans=0x7f4fc5501b40, new_code=408, branch=1, should_store=0x7fff646d201c,
>>>>> should_relay=0x7fff646d2018, cancel_data=0x7fff646d2070,
>>>>> reply=0xffffffffffffffff) at t_reply.c:1416
>>>>> #18 0x00007f5009f76ede in relay_reply (t=0x7f4fc5501b40,
>>>>> p_msg=0xffffffffffffffff, branch=1, msg_status=408,
>>>>> cancel_data=0x7fff646d2070, do_put_on_wait=0) at t_reply.c:1819
>>>>> #19 0x00007f5009f44c88 in fake_reply (t=0x7f4fc5501b40, branch=1,
>>>>> code=408) at timer.c:354
>>>>> #20 0x00007f5009f450e7 in final_response_handler
>>>>> (r_buf=0x7f4fc5501e60, t=0x7f4fc5501b40) at timer.c:526
>>>>> #21 0x00007f5009f4518d in retr_buf_handler (ticks=260027386,
>>>>> tl=0x7f4fc5501e80, p=0x3e8) at timer.c:584
>>>>> #22 0x0000000000544119 in timer_list_expire (t=260027386,
>>>>> h=0x7f4fc527cbe0, slow_l=0x7f4fc527cdf0, slow_mark=0) at timer.c:894
>>>>> #23 0x0000000000544418 in timer_handler () at timer.c:959
>>>>> #24 0x00000000005446b2 in timer_main () at timer.c:998
>>>>> #25 0x0000000000471ddf in main_loop () at main.c:1689
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Apr 9, 2014 at 9:34 PM, Daniel-Constantin Mierla <
>>>>> miconda at gmail.com> wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> that should not be a very rare case and I would expect to be caught
>>>>>> so far, anyhow ... this looks like easy to reproduce, have you tried it?
>>>>>>
>>>>>> You can have two kamailio, one relying the invite to the second,
>>>>>> which will reply with 100, then wait for the timeout on the first instance.
>>>>>> You can add some debug messages in the code to see if the lock is called
>>>>>> twice.
>>>>>>
>>>>>> Cheers,
>>>>>> Daniel
>>>>>>
>>>>>>
>>>>>> On 09/04/14 17:51, Jason Penton wrote:
>>>>>>
>>>>>> Hi All,
>>>>>>
>>>>>> I have been experiencing a deadlock when a timeout occurs on a
>>>>>> t_relayed() INVITE. Going through the code I have noticed a possible chance
>>>>>> of deadlock (without re-entrant enabled). Here is my thinking:
>>>>>>
>>>>>> t_should_relay_response() is called with REPLY_LOCK when the timer
>>>>>> process fires on the fr_inv_timer (no response from the INVITE that was
>>>>>> relayed, other than 100 provisional) and a 408 is generated. However, from
>>>>>> within that function there are calls to run_failure_handlers() which in
>>>>>> turn *could* try and lock the reply (viz. somebody having a t_reply() call
>>>>>> in the cfg file - in failure route block). This would result in another
>>>>>> lock on the same transaction's REPLY_LOCK....
>>>>>>
>>>>>> Has anybody else experienced something like this?
>>>>>>
>>>>>> this is on master btw.
>>>>>>
>>>>>> Cheers
>>>>>> Jason
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> sr-dev mailing listsr-dev at lists.sip-router.orghttp://lists.sip-router.org/cgi-bin/mailman/listinfo/sr-dev
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Daniel-Constantin Mierla - http://www.asipto.comhttp://twitter.com/#!/miconda - http://www.linkedin.com/in/miconda
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> sr-dev mailing list
>>>>>> sr-dev at lists.sip-router.org
>>>>>> http://lists.sip-router.org/cgi-bin/mailman/listinfo/sr-dev
>>>>>>
>>>>>>
>>>>>
>>>>
>>>> --
>>>> Daniel-Constantin Mierla - http://www.asipto.comhttp://twitter.com/#!/miconda - http://www.linkedin.com/in/miconda
>>>>
>>>>
>>>
>>> --
>>> Daniel-Constantin Mierla - http://www.asipto.comhttp://twitter.com/#!/miconda - http://www.linkedin.com/in/miconda
>>>
>>>
>>
>> --
>> Daniel-Constantin Mierla - http://www.asipto.comhttp://twitter.com/#!/miconda - http://www.linkedin.com/in/miconda
>>
>>
>
> --
> Daniel-Constantin Mierla - http://www.asipto.comhttp://twitter.com/#!/miconda - http://www.linkedin.com/in/miconda
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.sip-router.org/pipermail/sr-dev/attachments/20150203/a9705693/attachment-0001.html>
More information about the sr-dev
mailing list