[sr-dev] TM possible deadlock

Thu Apr 10 12:53:08 CEST 2014

OK. I will leave it a bit in master to see if there are any new reports, 
then I will backport. I will also have to review the tm reply functions 
that can be used from config to align them to the new check.

Cheers,
Daniel

On 10/04/14 09:06, Jason Penton wrote:
> oh excellent, I will look at it right away - was just getting ready to 
> jump in myself ;)
>
> Cheers
> Jason
>
>
> On Thu, Apr 10, 2014 at 9:01 AM, Daniel-Constantin Mierla 
> <miconda at gmail.com <mailto:miconda at gmail.com>> wrote:
>
>     Hello Jason,
>
>     I pushed a patch trying to fix this case, it is only on git master
>     branch. Can you test it? If all goes fine, we can consider
>     backporting it.
>
>     Cheers,
>     Daniel
>
>
>     On 09/04/14 23:26, Jason Penton wrote:
>>     Hey Daniel,
>>
>>     nothing extraordinary...
>>
>>     # -- TM params --
>>     modparam("tm", "fr_timer", 20000);
>>     modparam("tm", "fr_inv_timer", 10000)
>>
>>
>>     Cheers
>>     Jason
>>
>>
>>     On Wed, Apr 9, 2014 at 10:32 PM, Jason Penton
>>     <jason.penton at gmail.com <mailto:jason.penton at gmail.com>> wrote:
>>
>>         Hey Daniel,
>>
>>         Yes I did a test with a very basic config file and I am not
>>         able to re-create. However, with my *complex* cfg file I can
>>         re-create every time. Tomorrow I will compare what is
>>         different and report back... hopefully with fix ;)
>>
>>         here is bt of timer process deadlocking itself:
>>
>>         #0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:39
>>         #1  0x00007f5009f22004 in futex_get (lock=0x7f4fc55030d8) at
>>         ../../mem/../futexlock.h:123
>>         #2  0x00007f5009f223e1 in _lock (s=0x7f4fc55030d8,
>>         file=0x7f5009f90fd1 "t_cancel.c", function=0x7f5009f91980
>>         "cancel_branch", line=250) at lock.h:99
>>         #3  0x00007f5009f23271 in cancel_branch (t=0x7f4fc5501b40,
>>         branch=0, reason=0x7fff646d03a8, flags=3) at t_cancel.c:250
>>         #4  0x00007f5009f22c02 in cancel_uacs (t=0x7f4fc5501b40,
>>         cancel_data=0x7fff646d03a0, flags=1) at t_cancel.c:123
>>         #5  0x00007f5009f718c4 in _reply_light (trans=0x7f4fc5501b40,
>>             buf=0x7f500a24dc68 "SIP/2.0 500 Server error on LIR
>>         select next S-CSCF\r\nVia: SIP/2.0/UDP
>>         10.0.1.167:6060;branch=z9hG4bKb7.2ae09f29ffbd0034cd6d58483053603b.1\r\nVia:
>>         SIP/2.0/UDP
>>         10.0.1.166:4060;branch=z9hG4bKb7.3faa03ddea80"..., len=778,
>>         code=500, to_tag=0x7f500a1c7ae0
>>         "c82b15d7f12ef185f95fe4945457d449-8bab", to_tag_len=37,
>>         lock=0, bm=0x7fff646d0b60) at t_reply.c:660
>>         #6  0x00007f5009f7244c in _reply (trans=0x7f4fc5501b40,
>>         p_msg=0x7f500a1c6bc0, code=500, text=0x7f500a249a48 "Server
>>         error on LIR select next S-CSCF", lock=0) at t_reply.c:795
>>         #7  0x00007f5009f76436 in t_reply_unsafe (t=0x7f4fc5501b40,
>>         p_msg=0x7f500a1c6bc0, code=500, text=0x7f500a249a48 "Server
>>         error on LIR select next S-CSCF") at t_reply.c:1643
>>         #8  0x00007f5009f57621 in w_t_reply (msg=0x7f500a1c6bc0,
>>         p1=0x7f500a2497d8 "\340\332$\nP\177", p2=0x7f500a249870
>>         "h\321$\nP\177") at tm.c:1324
>>         #9  0x000000000041a700 in do_action (h=0x7fff646d1d30,
>>         a=0x7f500a24cee8, msg=0x7f500a1c6bc0) at action.c:1119
>>         #10 0x0000000000423831 in run_actions (h=0x7fff646d1d30,
>>         a=0x7f500a24cee8, msg=0x7f500a1c6bc0) at action.c:1607
>>         #11 0x000000000041a5a4 in do_action (h=0x7fff646d1d30,
>>         a=0x7f500a24d478, msg=0x7f500a1c6bc0) at action.c:1102
>>         #12 0x0000000000423831 in run_actions (h=0x7fff646d1d30,
>>         a=0x7f500a249148, msg=0x7f500a1c6bc0) at action.c:1607
>>         #13 0x000000000041a54e in do_action (h=0x7fff646d1d30,
>>         a=0x7f500a24c500, msg=0x7f500a1c6bc0) at action.c:1098
>>         #14 0x0000000000423831 in run_actions (h=0x7fff646d1d30,
>>         a=0x7f500a247a28, msg=0x7f500a1c6bc0) at action.c:1607
>>         #15 0x0000000000423fdf in run_top_route (a=0x7f500a247a28,
>>         msg=0x7f500a1c6bc0, c=0x0) at action.c:1693
>>         #16 0x00007f5009f73815 in run_failure_handlers
>>         (t=0x7f4fc5501b40, rpl=0xffffffffffffffff, code=408,
>>         extra_flags=96) at t_reply.c:1061
>>         #17 0x00007f5009f7527a in t_should_relay_response
>>         (Trans=0x7f4fc5501b40, new_code=408, branch=1,
>>         should_store=0x7fff646d201c, should_relay=0x7fff646d2018,
>>         cancel_data=0x7fff646d2070,
>>             reply=0xffffffffffffffff) at t_reply.c:1416
>>         #18 0x00007f5009f76ede in relay_reply (t=0x7f4fc5501b40,
>>         p_msg=0xffffffffffffffff, branch=1, msg_status=408,
>>         cancel_data=0x7fff646d2070, do_put_on_wait=0) at t_reply.c:1819
>>         #19 0x00007f5009f44c88 in fake_reply (t=0x7f4fc5501b40,
>>         branch=1, code=408) at timer.c:354
>>         #20 0x00007f5009f450e7 in final_response_handler
>>         (r_buf=0x7f4fc5501e60, t=0x7f4fc5501b40) at timer.c:526
>>         #21 0x00007f5009f4518d in retr_buf_handler (ticks=260027386,
>>         tl=0x7f4fc5501e80, p=0x3e8) at timer.c:584
>>         #22 0x0000000000544119 in timer_list_expire (t=260027386,
>>         h=0x7f4fc527cbe0, slow_l=0x7f4fc527cdf0, slow_mark=0) at
>>         timer.c:894
>>         #23 0x0000000000544418 in timer_handler () at timer.c:959
>>         #24 0x00000000005446b2 in timer_main () at timer.c:998
>>         #25 0x0000000000471ddf in main_loop () at main.c:1689
>>
>>
>>
>>         On Wed, Apr 9, 2014 at 9:34 PM, Daniel-Constantin Mierla
>>         <miconda at gmail.com <mailto:miconda at gmail.com>> wrote:
>>
>>             Hello,
>>
>>             that should not be a very rare case and I would expect to
>>             be caught so far, anyhow ... this looks like easy to
>>             reproduce, have you tried it?
>>
>>             You can have two kamailio, one relying the invite to the
>>             second, which will reply with 100, then wait for the
>>             timeout on the first instance. You can add some debug
>>             messages in the code to see if the lock is called twice.
>>
>>             Cheers,
>>             Daniel
>>
>>
>>             On 09/04/14 17:51, Jason Penton wrote:
>>>             Hi All,
>>>
>>>             I have been experiencing a deadlock when a timeout
>>>             occurs on a t_relayed() INVITE. Going through the code I
>>>             have noticed a possible chance of deadlock (without
>>>             re-entrant enabled). Here is my thinking:
>>>
>>>             t_should_relay_response() is called with REPLY_LOCK when
>>>             the timer process fires on the fr_inv_timer (no response
>>>             from the INVITE that was relayed, other than 100
>>>             provisional) and a 408 is generated. However, from
>>>             within that function there are calls
>>>             to run_failure_handlers() which in turn *could* try and
>>>             lock the reply (viz. somebody having a t_reply() call in
>>>             the cfg file - in failure route block). This would
>>>             result in another lock on the same transaction's
>>>             REPLY_LOCK....
>>>
>>>             Has anybody else experienced something like this?
>>>
>>>             this is on master btw.
>>>
>>>             Cheers
>>>             Jason
>>>
>>>
>>>             _______________________________________________
>>>             sr-dev mailing list
>>>             sr-dev at lists.sip-router.org  <mailto:sr-dev at lists.sip-router.org>
>>>             http://lists.sip-router.org/cgi-bin/mailman/listinfo/sr-dev
>>
>>             -- 
>>             Daniel-Constantin Mierla -http://www.asipto.com
>>             http://twitter.com/#!/miconda  <http://twitter.com/#%21/miconda>  -http://www.linkedin.com/in/miconda
>>
>>
>>             _______________________________________________
>>             sr-dev mailing list
>>             sr-dev at lists.sip-router.org
>>             <mailto:sr-dev at lists.sip-router.org>
>>             http://lists.sip-router.org/cgi-bin/mailman/listinfo/sr-dev
>>
>>
>>
>
>     -- 
>     Daniel-Constantin Mierla -http://www.asipto.com
>     http://twitter.com/#!/miconda  <http://twitter.com/#%21/miconda>  -http://www.linkedin.com/in/miconda
>
>

-- 
Daniel-Constantin Mierla - http://www.asipto.com
http://twitter.com/#!/miconda - http://www.linkedin.com/in/miconda

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.sip-router.org/pipermail/sr-dev/attachments/20140410/ebb6165c/attachment-0001.html>