[SR-Users] two Kamailio instances suddenly stopped handling packets

Daniel-Constantin Mierla miconda at gmail.com
Mon Jan 15 12:08:56 CET 2018


Hello,

good that it was sorted out and nothing unexpected inside kamailio
locking code...

Cheers,
Daniel


On 12.01.18 16:38, Moritz Maisel wrote:
> Hi Daniel,
>
> thanks for pointing out the details about the lock structure. After
> some investigation towards locking issues, we found out that the root
> cause was an external system that blocked kamailio's connection
> attempts which subsequently lead to blocked threads. Sorry for the
> noise and have a nice weekend!
>
> Regards,
> Moritz
>
>
> On Fri, Jan 12, 2018 at 1:50 PM, Daniel-Constantin Mierla
> <miconda at gmail.com <mailto:miconda at gmail.com>> wrote:
>
>     Hello,
>
>
>     On 12.01.18 11:48, Sebastian Damm wrote:
>>     Hi Daniel,
>>
>>     unless there is a silent exit from some module (those extra
>>     routes just write some information into redis), I can't find any
>>     other drop and exits than the one in the route posted. 
>>
>>     We have added some logging around the lock functions, and
>>     although we are missing some log lines for some packets, making
>>     it look like some locks don't get released, we do see that
>>     another PID can successfully lock the same key later.
>>
>>     There is no other lock/unlock in the config.
>>
>>     Is there a way to debug the lock array?
>     attaching with gdb and looking at the values is always an option.
>     Otherwise, I do not recall anything for this purpose written in
>     kamailio.
>
>>     One guess could be that the lock array is full, although we
>>     already have the lock size set to 10.
>
>     Lock array cannot become "full", if you thought like that if there
>     are 10 lock operations with different keys on an array of 10 locks.
>
>     Actually many string keys can use same lock/mutex -- a hashing
>     function is performed over the key value results in an integer
>     value which then is made modulo number of locks and the lock at
>     that index is chosen from the array. So if you do lock("abc") and
>     then you don't release it, then lock("xyz") can get blocked. Also,
>     doing unlock("xyz") without lock("xyz") is releasing the
>     lock("abc") -- note that "abc" and "xyz" are just random example
>     to explain what can happen, but does not mean it happens for these
>     keys.
>
>     Cheers,
>     Daniel
>
>
>>
>>     Regards,
>>     Sebastian
>>
>>     On Fri, Jan 12, 2018 at 11:08 AM, Daniel-Constantin Mierla
>>     <miconda at gmail.com <mailto:miconda at gmail.com>> wrote:
>>
>>         Hello,
>>
>>         have you checked that any of the route blocks used in between
>>         lock() and unlock() don't do an exit/drop (also not return 0)?
>>
>>         Do you have other lock() / unlock() in the config?
>>
>>         Cheers,
>>         Daniel
>>
>>
>>         On 12.01.18 08:49, Sebastian Damm wrote:
>>>         Hi,
>>>
>>>         this morning, two Kamailio servers suddenly stopped working
>>>         after having worked without a problem for months. Both are
>>>         running 5.0.2 on Debian Jessie. Those systems only analyze
>>>         mirrored traffic and write some information to a RabbitMQ.
>>>
>>>         I tried restarting, but after a few seconds they get stuck
>>>         the same way as before. Then I attached a gdb to some of the
>>>         UDP listeners, and they all look pretty much the same.
>>>
>>>
>>>         ===========8< process 1 =================
>>>         (gdb) bt
>>>         #0  0x00007f404ca435b9 in syscall () from
>>>         /lib/x86_64-linux-gnu/libc.so.6
>>>         #1  0x00007f4049f41260 in futex_get (lock=0x7f4041bccd00) at
>>>         ../../core/parser/../mem/../futexlock.h:108
>>>         #2  0x00007f4049f477d7 in cfg_lock_helper
>>>         (lkey=0x7fff3c048650, mode=0) at cfgutils.c:667
>>>         #3  0x00007f4049f481ed in w_cfg_lock_wrapper
>>>         (msg=0x7f404ba17f18, key=0x7f404ba15488, mode=0) at
>>>         cfgutils.c:712
>>>         #4  0x00007f4049f4823c in w_cfg_lock (msg=0x7f404ba17f18,
>>>         key=0x7f404ba15488 "\210a\241K@\177", s2=0x0) at cfgutils.c:717
>>>         #5  0x000000000045cd85 in do_action (h=0x7fff3c049410,
>>>         a=0x7f404b9d5170, msg=0x7f404ba17f18) at core/action.c:1060
>>>         #6  0x0000000000469afa in run_actions (h=0x7fff3c049410,
>>>         a=0x7f404b9d5170, msg=0x7f404ba17f18) at core/action.c:1552
>>>         #7  0x000000000045958d in do_action (h=0x7fff3c049410,
>>>         a=0x7f404b9ce558, msg=0x7f404ba17f18) at core/action.c:678
>>>         #8  0x0000000000469afa in run_actions (h=0x7fff3c049410,
>>>         a=0x7f404b9cdf70, msg=0x7f404ba17f18) at core/action.c:1552
>>>         #9  0x000000000046a2bd in run_top_route (a=0x7f404b9cdf70,
>>>         msg=0x7f404ba17f18, c=0x0) at core/action.c:1641
>>>         #10 0x0000000000580033 in receive_msg (
>>>             buf=0xa393f9 <buf+121> "REGISTER sip:domain
>>>         SIP/2.0\r\nVia: SIP/2.0/UDP
>>>         1.2.3.4;branch=z9hG4bKfc98.c762d7151f110c7eb71fc7d4e0648f1f.0\r\nVia:
>>>         SIP/2.0/UDP
>>>         192.168.0.3:5060;rport=38020;received=62.30.8.128;branch=z9hG4"...,
>>>             len=799, rcv_info=0x7fff3c049810) at core/receive.c:264
>>>         #11 0x00007f40483efcdf in parsing_hepv3_message
>>>         (buf=0xa39380 <buf> "HEP3\003\230", len=920) at hep.c:499
>>>         #12 0x00007f40483ee264 in hepv3_received (buf=0xa39380 <buf>
>>>         "HEP3\003\230", len=920, ri=0x7fff3c049a50) at hep.c:231
>>>         #13 0x00007f40483ec9cb in hep_msg_received
>>>         (data=0x7fff3c049a30) at hep.c:85
>>>         #14 0x000000000049e4e1 in sr_event_exec (type=7,
>>>         data=0x7fff3c049a30) at core/events.c:263
>>>         #15 0x000000000048731a in udp_rcv_loop () at
>>>         core/udp_server.c:466
>>>         #16 0x0000000000422d08 in main_loop () at main.c:1623
>>>         #17 0x000000000042a408 in main (argc=13,
>>>         argv=0x7fff3c049ef8) at main.c:2643
>>>         (gdb) quit
>>>         ===========8< process 1 =================
>>>
>>>
>>>         ===========8< process 2 =================
>>>         (gdb) bt
>>>         #0  0x00007f404ca435b9 in syscall () from
>>>         /lib/x86_64-linux-gnu/libc.so.6
>>>         #1  0x00007f4049f41260 in futex_get (lock=0x7f4041bccd00) at
>>>         ../../core/parser/../mem/../futexlock.h:108
>>>         #2  0x00007f4049f477d7 in cfg_lock_helper
>>>         (lkey=0x7fff3c048650, mode=0) at cfgutils.c:667
>>>         #3  0x00007f4049f481ed in w_cfg_lock_wrapper
>>>         (msg=0x7f404ba17f18, key=0x7f404ba15488, mode=0) at
>>>         cfgutils.c:712
>>>         #4  0x00007f4049f4823c in w_cfg_lock (msg=0x7f404ba17f18,
>>>         key=0x7f404ba15488 "\210a\241K@\177", s2=0x0) at cfgutils.c:717
>>>         #5  0x000000000045cd85 in do_action (h=0x7fff3c049600,
>>>         a=0x7f404b9d5170, msg=0x7f404ba17f18) at core/action.c:1060
>>>         #6  0x0000000000469afa in run_actions (h=0x7fff3c049600,
>>>         a=0x7f404b9d5170, msg=0x7f404ba17f18) at core/action.c:1552
>>>         #7  0x000000000045958d in do_action (h=0x7fff3c049600,
>>>         a=0x7f404b9cf920, msg=0x7f404ba17f18) at core/action.c:678
>>>         #8  0x0000000000469afa in run_actions (h=0x7fff3c049600,
>>>         a=0x7f404b9ce9c0, msg=0x7f404ba17f18) at core/action.c:1552
>>>         #9  0x000000000046a2bd in run_top_route (a=0x7f404b9ce9c0,
>>>         msg=0x7f404ba17f18, c=0x7fff3c049600) at core/action.c:1641
>>>         #10 0x0000000000580a69 in receive_msg (
>>>             buf=0xa393f9 <buf+121> "SIP/2.0 200 OK\r\nVia:
>>>         SIP/2.0/UDP
>>>         172.20.40.8;branch=z9hG4bKa69c.5cca5a54c4d270c515eeb4bbb5e0bb44.0\r\nVia:
>>>         SIP/2.0/UDP 2.3.4.5:5060 <http://2.3.4.5:5060>\r\nContact:
>>>         <sip:1234567 at 9.8.7.6:59669;transport=UDP>\r\nTo: "...,
>>>         len=506, rcv_info=0x7fff3c049810) at core/receive.c:327
>>>         #11 0x00007f40483efcdf in parsing_hepv3_message
>>>         (buf=0xa39380 <buf> "HEP3\002s", len=627) at hep.c:499
>>>         #12 0x00007f40483ee264 in hepv3_received (buf=0xa39380 <buf>
>>>         "HEP3\002s", len=627, ri=0x7fff3c049a50) at hep.c:231
>>>         #13 0x00007f40483ec9cb in hep_msg_received
>>>         (data=0x7fff3c049a30) at hep.c:85
>>>         #14 0x000000000049e4e1 in sr_event_exec (type=7,
>>>         data=0x7fff3c049a30) at core/events.c:263
>>>         #15 0x000000000048731a in udp_rcv_loop () at
>>>         core/udp_server.c:466
>>>         #16 0x0000000000422d08 in main_loop () at main.c:1623
>>>         #17 0x000000000042a408 in main (argc=13,
>>>         argv=0x7fff3c049ef8) at main.c:2643
>>>         (gdb) quit
>>>         ===========8< process 2 =================
>>>
>>>         Looks to me as if Kamailio gets stuck trying to get a lock
>>>         for this packet. 
>>>
>>>         The config file when handling those packets looks like this:
>>>
>>>         request_route {
>>>                 route(initialize_variables);
>>>                 route(foo);
>>>         }
>>>
>>>         onreply_route {
>>>                 route(initialize_variables);
>>>                 route(foo);
>>>         }
>>>
>>>         route[initialize_variables] {
>>>                 $vn(bar) = $null;
>>>                 $vn(baz) = $null;
>>>                 $vn(barbaz) = $null;
>>>                 $vn(foobar) = $null;
>>>
>>>         }
>>>
>>>         route[foo] {
>>>                 lock("$ci");
>>>                 xlog("L_DBG", "Obtained lock, calling lua...\n");
>>>                 if(!lua_run("handle_packet")) {
>>>                         xlog("L_ERR", "SCRIPT: failed to execute lua
>>>         function!\n");
>>>                 }
>>>
>>>                 if ($vn(bar) != $null) {
>>>                   route(mediaIpToRedis);
>>>                 }
>>>
>>>                 if ($vn(baz) != $null) {
>>>                   route(channelInfoToRedis);
>>>                 }
>>>
>>>                 if ($vn(barbaz) != $null) {
>>>                         route(sendToQueue);
>>>                 }
>>>                 xlog("L_DBG", "All finished, releasing lock...\n");
>>>                 unlock("$ci");
>>>                 xlog("L_DBG", "Released lock...\n");
>>>
>>>                 # update stats
>>>                 if ($vn(foobar) != $null) {
>>>                         update_stat("$vn(foobar)", "+1");
>>>                 }
>>>
>>>                 drop;
>>>                 exit;
>>>         }
>>>
>>>         Is there any race condition I am missing? Until this
>>>         morning, it ran without problems and only every 2 minutes or
>>>         so, 4 packets were sent to the RabbitMQ. So the system
>>>         throws away nearly 100% of the traffic (which is around 20
>>>         Mbit). 
>>>
>>>         In the log file, there are no errors at all, no aborted lua
>>>         scripts or whatsoever.
>>>
>>>         Does anybody have a hint for me? Thanks in advance.
>>>
>>>         Regards
>>>         Sebastian
>>>
>>>
>>>         _______________________________________________
>>>         Kamailio (SER) - Users Mailing List
>>>         sr-users at lists.kamailio.org <mailto:sr-users at lists.kamailio.org>
>>>         https://lists.kamailio.org/cgi-bin/mailman/listinfo/sr-users
>>>         <https://lists.kamailio.org/cgi-bin/mailman/listinfo/sr-users>
>>
>>         -- 
>>         Daniel-Constantin Mierla
>>         www.twitter.com/miconda <http://www.twitter.com/miconda> -- www.linkedin.com/in/miconda <http://www.linkedin.com/in/miconda>
>>         Kamailio Advanced Training - March 5-7, 2018, Berlin - www.asipto.com <http://www.asipto.com>
>>         Kamailio World Conference - May 14-16, 2018 - www.kamailioworld.com <http://www.kamailioworld.com>
>>
>>
>>
>>
>>
>>
>>     _______________________________________________
>>     Kamailio (SER) - Users Mailing List
>>     sr-users at lists.kamailio.org <mailto:sr-users at lists.kamailio.org>
>>     https://lists.kamailio.org/cgi-bin/mailman/listinfo/sr-users
>>     <https://lists.kamailio.org/cgi-bin/mailman/listinfo/sr-users>
>
>     -- 
>     Daniel-Constantin Mierla
>     www.twitter.com/miconda <http://www.twitter.com/miconda> -- www.linkedin.com/in/miconda <http://www.linkedin.com/in/miconda>
>     Kamailio Advanced Training - March 5-7, 2018, Berlin - www.asipto.com <http://www.asipto.com>
>     Kamailio World Conference - May 14-16, 2018 - www.kamailioworld.com <http://www.kamailioworld.com>
>
>
>     _______________________________________________
>     Kamailio (SER) - Users Mailing List
>     sr-users at lists.kamailio.org <mailto:sr-users at lists.kamailio.org>
>     https://lists.kamailio.org/cgi-bin/mailman/listinfo/sr-users
>     <https://lists.kamailio.org/cgi-bin/mailman/listinfo/sr-users>
>
>
>
>
> -- 
> sipgate GmbH - Gladbacher Str. 74 - 40219 Düsseldorf
> HRB Düsseldorf 39841 - Geschäftsführer: Thilo Salmon, Tim Mois
> Steuernummer: 106/5724/7147, Umsatzsteuer-ID: DE219349391
>
> www.sipgate.de <http://www.sipgate.de> - www.sipgate.co.uk
> <http://www.sipgate.co.uk>

-- 
Daniel-Constantin Mierla
www.twitter.com/miconda -- www.linkedin.com/in/miconda
Kamailio Advanced Training - March 5-7, 2018, Berlin - www.asipto.com
Kamailio World Conference - May 14-16, 2018 - www.kamailioworld.com

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.kamailio.org/pipermail/sr-users/attachments/20180115/280ff7a6/attachment.html>


More information about the sr-users mailing list