[sr-dev] TM possible deadlock

Daniel-Constantin Mierla miconda at gmail.com
Fri Apr 11 14:48:55 CEST 2014


On 11/04/14 14:45, Jason Penton wrote:
> yes okay - got it. I will test and analyse.
I meant I want to analyze before deciding to backport, but a second 
analysis is never bad :-)

Cheers,
Daniel

>
> Thanks Daniel!
>
>
> On Thu, Apr 10, 2014 at 4:35 PM, Daniel-Constantin Mierla
> <miconda at gmail.com> wrote:
>
>     Hello,
>
>     iirc, there are several functions that the script writer can use, like
>     t_reply_callid() from tmx. The idea is to analyze them a bit in order
>     to detect whether a forced reply may end up canceling some pending
>     branches -- in that case the reply on the branch doesn't matter anymore
>     and should not be considered for relaying upstream, because
>     the script writer already decided what to send out.
>
>     Cheers,
>     Daniel
>
>
>
>     On 10/04/14 13:24, Jason Penton wrote:
>>     Hey Daniel,
>>
>>     which reply functions are you referring to? API functions?
>>
>>     Cheers
>>     Jason
>>
>>
>>     On Thu, Apr 10, 2014 at 12:53 PM, Daniel-Constantin Mierla
>>     <miconda at gmail.com> wrote:
>>
>>         OK. I will leave it a bit in master to see if there are any
>>         new reports, then I will backport. I will also have to review
>>         the tm reply functions that can be used from config to align
>>         them to the new check.
>>
>>         Cheers,
>>         Daniel
>>
>>
>>         On 10/04/14 09:06, Jason Penton wrote:
>>>         oh excellent, I will look at it right away - was just
>>>         getting ready to jump in myself ;)
>>>
>>>         Cheers
>>>         Jason
>>>
>>>
>>>         On Thu, Apr 10, 2014 at 9:01 AM, Daniel-Constantin Mierla
>>>         <miconda at gmail.com> wrote:
>>>
>>>             Hello Jason,
>>>
>>>             I pushed a patch that tries to fix this case; it is only on
>>>             the git master branch. Can you test it? If all goes fine, we
>>>             can consider backporting it.
>>>
>>>             Cheers,
>>>             Daniel
>>>
>>>
>>>             On 09/04/14 23:26, Jason Penton wrote:
>>>>             Hey Daniel,
>>>>
>>>>             nothing extraordinary...
>>>>
>>>>             # -- TM params --
>>>>             modparam("tm", "fr_timer", 20000);
>>>>             modparam("tm", "fr_inv_timer", 10000)
>>>>
>>>>
>>>>             Cheers
>>>>             Jason
>>>>
>>>>
>>>>             On Wed, Apr 9, 2014 at 10:32 PM, Jason Penton
>>>>             <jason.penton at gmail.com> wrote:
>>>>
>>>>                 Hey Daniel,
>>>>
>>>>                 Yes, I did a test with a very basic config file and
>>>>                 I was not able to reproduce it. However, with my
>>>>                 *complex* cfg file I can reproduce it every time.
>>>>                 Tomorrow I will compare what is different and
>>>>                 report back... hopefully with a fix ;)
>>>>
>>>>                 Here is the backtrace of the timer process deadlocking itself:
>>>>
>>>>                 #0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:39
>>>>                 #1  0x00007f5009f22004 in futex_get (lock=0x7f4fc55030d8) at ../../mem/../futexlock.h:123
>>>>                 #2  0x00007f5009f223e1 in _lock (s=0x7f4fc55030d8, file=0x7f5009f90fd1 "t_cancel.c", function=0x7f5009f91980 "cancel_branch", line=250) at lock.h:99
>>>>                 #3  0x00007f5009f23271 in cancel_branch (t=0x7f4fc5501b40, branch=0, reason=0x7fff646d03a8, flags=3) at t_cancel.c:250
>>>>                 #4  0x00007f5009f22c02 in cancel_uacs (t=0x7f4fc5501b40, cancel_data=0x7fff646d03a0, flags=1) at t_cancel.c:123
>>>>                 #5  0x00007f5009f718c4 in _reply_light (trans=0x7f4fc5501b40, buf=0x7f500a24dc68 "SIP/2.0 500 Server error on LIR select next S-CSCF\r\nVia: SIP/2.0/UDP 10.0.1.167:6060;branch=z9hG4bKb7.2ae09f29ffbd0034cd6d58483053603b.1\r\nVia: SIP/2.0/UDP 10.0.1.166:4060;branch=z9hG4bKb7.3faa03ddea80"..., len=778, code=500, to_tag=0x7f500a1c7ae0 "c82b15d7f12ef185f95fe4945457d449-8bab", to_tag_len=37, lock=0, bm=0x7fff646d0b60) at t_reply.c:660
>>>>                 #6  0x00007f5009f7244c in _reply (trans=0x7f4fc5501b40, p_msg=0x7f500a1c6bc0, code=500, text=0x7f500a249a48 "Server error on LIR select next S-CSCF", lock=0) at t_reply.c:795
>>>>                 #7  0x00007f5009f76436 in t_reply_unsafe (t=0x7f4fc5501b40, p_msg=0x7f500a1c6bc0, code=500, text=0x7f500a249a48 "Server error on LIR select next S-CSCF") at t_reply.c:1643
>>>>                 #8  0x00007f5009f57621 in w_t_reply (msg=0x7f500a1c6bc0, p1=0x7f500a2497d8 "\340\332$\nP\177", p2=0x7f500a249870 "h\321$\nP\177") at tm.c:1324
>>>>                 #9  0x000000000041a700 in do_action (h=0x7fff646d1d30, a=0x7f500a24cee8, msg=0x7f500a1c6bc0) at action.c:1119
>>>>                 #10 0x0000000000423831 in run_actions (h=0x7fff646d1d30, a=0x7f500a24cee8, msg=0x7f500a1c6bc0) at action.c:1607
>>>>                 #11 0x000000000041a5a4 in do_action (h=0x7fff646d1d30, a=0x7f500a24d478, msg=0x7f500a1c6bc0) at action.c:1102
>>>>                 #12 0x0000000000423831 in run_actions (h=0x7fff646d1d30, a=0x7f500a249148, msg=0x7f500a1c6bc0) at action.c:1607
>>>>                 #13 0x000000000041a54e in do_action (h=0x7fff646d1d30, a=0x7f500a24c500, msg=0x7f500a1c6bc0) at action.c:1098
>>>>                 #14 0x0000000000423831 in run_actions (h=0x7fff646d1d30, a=0x7f500a247a28, msg=0x7f500a1c6bc0) at action.c:1607
>>>>                 #15 0x0000000000423fdf in run_top_route (a=0x7f500a247a28, msg=0x7f500a1c6bc0, c=0x0) at action.c:1693
>>>>                 #16 0x00007f5009f73815 in run_failure_handlers (t=0x7f4fc5501b40, rpl=0xffffffffffffffff, code=408, extra_flags=96) at t_reply.c:1061
>>>>                 #17 0x00007f5009f7527a in t_should_relay_response (Trans=0x7f4fc5501b40, new_code=408, branch=1, should_store=0x7fff646d201c, should_relay=0x7fff646d2018, cancel_data=0x7fff646d2070, reply=0xffffffffffffffff) at t_reply.c:1416
>>>>                 #18 0x00007f5009f76ede in relay_reply (t=0x7f4fc5501b40, p_msg=0xffffffffffffffff, branch=1, msg_status=408, cancel_data=0x7fff646d2070, do_put_on_wait=0) at t_reply.c:1819
>>>>                 #19 0x00007f5009f44c88 in fake_reply (t=0x7f4fc5501b40, branch=1, code=408) at timer.c:354
>>>>                 #20 0x00007f5009f450e7 in final_response_handler (r_buf=0x7f4fc5501e60, t=0x7f4fc5501b40) at timer.c:526
>>>>                 #21 0x00007f5009f4518d in retr_buf_handler (ticks=260027386, tl=0x7f4fc5501e80, p=0x3e8) at timer.c:584
>>>>                 #22 0x0000000000544119 in timer_list_expire (t=260027386, h=0x7f4fc527cbe0, slow_l=0x7f4fc527cdf0, slow_mark=0) at timer.c:894
>>>>                 #23 0x0000000000544418 in timer_handler () at timer.c:959
>>>>                 #24 0x00000000005446b2 in timer_main () at timer.c:998
>>>>                 #25 0x0000000000471ddf in main_loop () at main.c:1689
>>>>
>>>>
>>>>
>>>>                 On Wed, Apr 9, 2014 at 9:34 PM, Daniel-Constantin
>>>>                 Mierla <miconda at gmail.com> wrote:
>>>>
>>>>                     Hello,
>>>>
>>>>                     that should not be a very rare case and I would
>>>>                     have expected it to be caught by now, anyhow ... this
>>>>                     looks easy to reproduce, have you tried it?
>>>>
>>>>                     You can have two kamailio instances, one relaying the
>>>>                     INVITE to the second, which will reply with
>>>>                     100, then wait for the timeout on the first
>>>>                     instance. You can add some debug messages in
>>>>                     the code to see if the lock is taken twice.
>>>>
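One way the debug messages suggested above could look, as a rough C sketch: keep a small record next to the lock that remembers which process holds it and where it was taken, and complain when the same process asks for it again. Everything here (struct lock_debug and the dbg_* helpers) is hypothetical and not part of Kamailio; it only assumes the record is updated while the real lock is held.

```c
#define _XOPEN_SOURCE 700
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical debug record kept beside a lock; not a Kamailio structure. */
struct lock_debug {
    pid_t owner;       /* pid of the current holder, 0 when the lock is free */
    const char *file;  /* where the current holder took the lock */
    int line;
};

/* Call right before trying to acquire the lock.
 * Returns 1 (and logs) if the calling process already holds it. */
static int dbg_would_self_deadlock(const struct lock_debug *d,
                                   const char *file, int line)
{
    if (d->owner != 0 && d->owner == getpid()) {
        fprintf(stderr,
                "pid %ld already holds this lock (taken at %s:%d), "
                "requested again at %s:%d\n",
                (long)d->owner, d->file, d->line, file, line);
        return 1;
    }
    return 0;
}

/* Call right after the lock was acquired. */
static void dbg_acquired(struct lock_debug *d, const char *file, int line)
{
    d->owner = getpid();
    d->file = file;
    d->line = line;
}

/* Call right before the lock is released. */
static void dbg_released(struct lock_debug *d)
{
    d->owner = 0;
    d->file = NULL;
    d->line = 0;
}
```

Calling dbg_would_self_deadlock() with __FILE__ and __LINE__ at each lock site would print both the original and the second acquisition point, which is usually enough to see the offending call chain without attaching gdb.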
>>>>                     Cheers,
>>>>                     Daniel
>>>>
>>>>
>>>>                     On 09/04/14 17:51, Jason Penton wrote:
>>>>>                     Hi All,
>>>>>
>>>>>                     I have been experiencing a deadlock when a
>>>>>                     timeout occurs on a t_relay()ed INVITE. Going
>>>>>                     through the code I have noticed a possible
>>>>>                     deadlock (without re-entrant locking
>>>>>                     enabled). Here is my thinking:
>>>>>
>>>>>                     t_should_relay_response() is called with
>>>>>                     REPLY_LOCK held when the timer process fires on the
>>>>>                     fr_inv_timer (no response to the INVITE that
>>>>>                     was relayed, other than a 100 provisional) and a
>>>>>                     408 is generated. However, from within that
>>>>>                     function there are calls
>>>>>                     to run_failure_handlers(), which in turn
>>>>>                     *could* try to lock the reply (e.g. somebody
>>>>>                     having a t_reply() call in the cfg file, in the
>>>>>                     failure route block). This would result in
>>>>>                     a second attempt to take the same transaction's
>>>>>                     REPLY_LOCK.
>>>>>
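The pattern described above can be shown with a minimal C sketch. It uses POSIX mutexes purely as a stand-in (Kamailio has its own locking primitives, which are not modelled here), and txn_init(), timer_fires() and reply_from_failure_route() are hypothetical names. An error-checking mutex is used for the non-re-entrant case so that the second lock attempt returns EDEADLK instead of hanging the process as a plain lock would.

```c
#define _XOPEN_SOURCE 700
#include <errno.h>
#include <pthread.h>

/* Hypothetical stand-in for a transaction and its reply lock. */
struct txn {
    pthread_mutex_t reply_lock;
};

/* recursive != 0: re-entrant mutex, the owner may lock it again.
 * recursive == 0: error-checking mutex, a relock by the owner is
 * reported as EDEADLK (a normal mutex would simply block forever). */
static int txn_init(struct txn *t, int recursive)
{
    pthread_mutexattr_t attr;
    int rc = pthread_mutexattr_init(&attr);
    if (rc != 0)
        return rc;
    pthread_mutexattr_settype(&attr, recursive ? PTHREAD_MUTEX_RECURSIVE
                                               : PTHREAD_MUTEX_ERRORCHECK);
    rc = pthread_mutex_init(&t->reply_lock, &attr);
    pthread_mutexattr_destroy(&attr);
    return rc;
}

/* Stand-in for a reply sent from the failure route: it takes the
 * reply lock on its own, unaware that the caller already holds it. */
static int reply_from_failure_route(struct txn *t)
{
    int rc = pthread_mutex_lock(&t->reply_lock);
    if (rc != 0)
        return rc; /* EDEADLK in the non-re-entrant case */
    /* ... build and send the reply ... */
    pthread_mutex_unlock(&t->reply_lock);
    return 0;
}

/* Stand-in for the timer path: the lock is held while the failure
 * handlers run, so the nested reply tries to take it a second time. */
static int timer_fires(struct txn *t)
{
    int rc = pthread_mutex_lock(&t->reply_lock);
    if (rc != 0)
        return rc;
    rc = reply_from_failure_route(t);
    pthread_mutex_unlock(&t->reply_lock);
    return rc;
}
```

With txn_init(&t, 0) the nested call fails with EDEADLK, which is where a real non-re-entrant lock would hang; with txn_init(&t, 1) the same call sequence completes, which is the behaviour re-entrant locking is meant to give.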
>>>>>                     Has anybody else experienced something like this?
>>>>>
>>>>>                     this is on master btw.
>>>>>
>>>>>                     Cheers
>>>>>                     Jason
>>>>>
>>>>>
>>>>>                     _______________________________________________
>>>>>                     sr-dev mailing list
>>>>>                     sr-dev at lists.sip-router.org
>>>>>                     http://lists.sip-router.org/cgi-bin/mailman/listinfo/sr-dev
>>>>
>>>>                     -- 
>>>>                     Daniel-Constantin Mierla - http://www.asipto.com
>>>>                     http://twitter.com/#!/miconda - http://www.linkedin.com/in/miconda
>>>>
>>>>
>>>
>>
>

-- 
Daniel-Constantin Mierla - http://www.asipto.com
http://twitter.com/#!/miconda - http://www.linkedin.com/in/miconda
