[sr-dev] TM possible deadlock
Daniel-Constantin Mierla
miconda at gmail.com
Thu Apr 10 12:53:08 CEST 2014
OK. I will leave it a bit in master to see if there are any new reports,
then I will backport. I will also have to review the tm reply functions
that can be used from config to align them to the new check.
Cheers,
Daniel
On 10/04/14 09:06, Jason Penton wrote:
> oh excellent, I will look at it right away - was just getting ready to
> jump in myself ;)
>
> Cheers
> Jason
>
>
> On Thu, Apr 10, 2014 at 9:01 AM, Daniel-Constantin Mierla
> <miconda at gmail.com <mailto:miconda at gmail.com>> wrote:
>
> Hello Jason,
>
> I pushed a patch trying to fix this case, it is only on git master
> branch. Can you test it? If all goes fine, we can consider
> backporting it.
>
> Cheers,
> Daniel
>
>
> On 09/04/14 23:26, Jason Penton wrote:
>> Hey Daniel,
>>
>> nothing extraordinary...
>>
>> # -- TM params --
>> modparam("tm", "fr_timer", 20000);
>> modparam("tm", "fr_inv_timer", 10000)
>>
>>
>> Cheers
>> Jason
>>
>>
>> On Wed, Apr 9, 2014 at 10:32 PM, Jason Penton
>> <jason.penton at gmail.com <mailto:jason.penton at gmail.com>> wrote:
>>
>> Hey Daniel,
>>
>> Yes I did a test with a very basic config file and I am not
>> able to re-create. However, with my *complex* cfg file I can
>> re-create every time. Tomorrow I will compare what is
>> different and report back... hopefully with fix ;)
>>
>> here is bt of timer process deadlocking itself:
>>
>> #0 syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:39
>> #1 0x00007f5009f22004 in futex_get (lock=0x7f4fc55030d8) at
>> ../../mem/../futexlock.h:123
>> #2 0x00007f5009f223e1 in _lock (s=0x7f4fc55030d8,
>> file=0x7f5009f90fd1 "t_cancel.c", function=0x7f5009f91980
>> "cancel_branch", line=250) at lock.h:99
>> #3 0x00007f5009f23271 in cancel_branch (t=0x7f4fc5501b40,
>> branch=0, reason=0x7fff646d03a8, flags=3) at t_cancel.c:250
>> #4 0x00007f5009f22c02 in cancel_uacs (t=0x7f4fc5501b40,
>> cancel_data=0x7fff646d03a0, flags=1) at t_cancel.c:123
>> #5 0x00007f5009f718c4 in _reply_light (trans=0x7f4fc5501b40,
>> buf=0x7f500a24dc68 "SIP/2.0 500 Server error on LIR
>> select next S-CSCF\r\nVia: SIP/2.0/UDP
>> 10.0.1.167:6060;branch=z9hG4bKb7.2ae09f29ffbd0034cd6d58483053603b.1\r\nVia:
>> SIP/2.0/UDP
>> 10.0.1.166:4060;branch=z9hG4bKb7.3faa03ddea80"..., len=778,
>> code=500, to_tag=0x7f500a1c7ae0
>> "c82b15d7f12ef185f95fe4945457d449-8bab", to_tag_len=37,
>> lock=0, bm=0x7fff646d0b60) at t_reply.c:660
>> #6 0x00007f5009f7244c in _reply (trans=0x7f4fc5501b40,
>> p_msg=0x7f500a1c6bc0, code=500, text=0x7f500a249a48 "Server
>> error on LIR select next S-CSCF", lock=0) at t_reply.c:795
>> #7 0x00007f5009f76436 in t_reply_unsafe (t=0x7f4fc5501b40,
>> p_msg=0x7f500a1c6bc0, code=500, text=0x7f500a249a48 "Server
>> error on LIR select next S-CSCF") at t_reply.c:1643
>> #8 0x00007f5009f57621 in w_t_reply (msg=0x7f500a1c6bc0,
>> p1=0x7f500a2497d8 "\340\332$\nP\177", p2=0x7f500a249870
>> "h\321$\nP\177") at tm.c:1324
>> #9 0x000000000041a700 in do_action (h=0x7fff646d1d30,
>> a=0x7f500a24cee8, msg=0x7f500a1c6bc0) at action.c:1119
>> #10 0x0000000000423831 in run_actions (h=0x7fff646d1d30,
>> a=0x7f500a24cee8, msg=0x7f500a1c6bc0) at action.c:1607
>> #11 0x000000000041a5a4 in do_action (h=0x7fff646d1d30,
>> a=0x7f500a24d478, msg=0x7f500a1c6bc0) at action.c:1102
>> #12 0x0000000000423831 in run_actions (h=0x7fff646d1d30,
>> a=0x7f500a249148, msg=0x7f500a1c6bc0) at action.c:1607
>> #13 0x000000000041a54e in do_action (h=0x7fff646d1d30,
>> a=0x7f500a24c500, msg=0x7f500a1c6bc0) at action.c:1098
>> #14 0x0000000000423831 in run_actions (h=0x7fff646d1d30,
>> a=0x7f500a247a28, msg=0x7f500a1c6bc0) at action.c:1607
>> #15 0x0000000000423fdf in run_top_route (a=0x7f500a247a28,
>> msg=0x7f500a1c6bc0, c=0x0) at action.c:1693
>> #16 0x00007f5009f73815 in run_failure_handlers
>> (t=0x7f4fc5501b40, rpl=0xffffffffffffffff, code=408,
>> extra_flags=96) at t_reply.c:1061
>> #17 0x00007f5009f7527a in t_should_relay_response
>> (Trans=0x7f4fc5501b40, new_code=408, branch=1,
>> should_store=0x7fff646d201c, should_relay=0x7fff646d2018,
>> cancel_data=0x7fff646d2070,
>> reply=0xffffffffffffffff) at t_reply.c:1416
>> #18 0x00007f5009f76ede in relay_reply (t=0x7f4fc5501b40,
>> p_msg=0xffffffffffffffff, branch=1, msg_status=408,
>> cancel_data=0x7fff646d2070, do_put_on_wait=0) at t_reply.c:1819
>> #19 0x00007f5009f44c88 in fake_reply (t=0x7f4fc5501b40,
>> branch=1, code=408) at timer.c:354
>> #20 0x00007f5009f450e7 in final_response_handler
>> (r_buf=0x7f4fc5501e60, t=0x7f4fc5501b40) at timer.c:526
>> #21 0x00007f5009f4518d in retr_buf_handler (ticks=260027386,
>> tl=0x7f4fc5501e80, p=0x3e8) at timer.c:584
>> #22 0x0000000000544119 in timer_list_expire (t=260027386,
>> h=0x7f4fc527cbe0, slow_l=0x7f4fc527cdf0, slow_mark=0) at
>> timer.c:894
>> #23 0x0000000000544418 in timer_handler () at timer.c:959
>> #24 0x00000000005446b2 in timer_main () at timer.c:998
>> #25 0x0000000000471ddf in main_loop () at main.c:1689
>>
>>
>>
>> On Wed, Apr 9, 2014 at 9:34 PM, Daniel-Constantin Mierla
>> <miconda at gmail.com <mailto:miconda at gmail.com>> wrote:
>>
>> Hello,
>>
>> that should not be a very rare case and I would expect to
>> be caught so far, anyhow ... this looks like easy to
>> reproduce, have you tried it?
>>
>> You can have two kamailio, one relying the invite to the
>> second, which will reply with 100, then wait for the
>> timeout on the first instance. You can add some debug
>> messages in the code to see if the lock is called twice.
>>
>> Cheers,
>> Daniel
>>
>>
>> On 09/04/14 17:51, Jason Penton wrote:
>>> Hi All,
>>>
>>> I have been experiencing a deadlock when a timeout
>>> occurs on a t_relayed() INVITE. Going through the code I
>>> have noticed a possible chance of deadlock (without
>>> re-entrant enabled). Here is my thinking:
>>>
>>> t_should_relay_response() is called with REPLY_LOCK when
>>> the timer process fires on the fr_inv_timer (no response
>>> from the INVITE that was relayed, other than 100
>>> provisional) and a 408 is generated. However, from
>>> within that function there are calls
>>> to run_failure_handlers() which in turn *could* try and
>>> lock the reply (viz. somebody having a t_reply() call in
>>> the cfg file - in failure route block). This would
>>> result in another lock on the same transaction's
>>> REPLY_LOCK....
>>>
>>> Has anybody else experienced something like this?
>>>
>>> this is on master btw.
>>>
>>> Cheers
>>> Jason
>>>
>>>
>>> _______________________________________________
>>> sr-dev mailing list
>>> sr-dev at lists.sip-router.org <mailto:sr-dev at lists.sip-router.org>
>>> http://lists.sip-router.org/cgi-bin/mailman/listinfo/sr-dev
>>
>> --
>> Daniel-Constantin Mierla -http://www.asipto.com
>> http://twitter.com/#!/miconda <http://twitter.com/#%21/miconda> -http://www.linkedin.com/in/miconda
>>
>>
>> _______________________________________________
>> sr-dev mailing list
>> sr-dev at lists.sip-router.org
>> <mailto:sr-dev at lists.sip-router.org>
>> http://lists.sip-router.org/cgi-bin/mailman/listinfo/sr-dev
>>
>>
>>
>
> --
> Daniel-Constantin Mierla -http://www.asipto.com
> http://twitter.com/#!/miconda <http://twitter.com/#%21/miconda> -http://www.linkedin.com/in/miconda
>
>
--
Daniel-Constantin Mierla - http://www.asipto.com
http://twitter.com/#!/miconda - http://www.linkedin.com/in/miconda
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.sip-router.org/pipermail/sr-dev/attachments/20140410/ebb6165c/attachment-0001.html>
More information about the sr-dev
mailing list