[SR-Users] two Kamailio instances suddenly stopped handling packets
Daniel-Constantin Mierla
miconda at gmail.com
Mon Jan 15 12:08:56 CET 2018
Hello,
Good that it was sorted out and that nothing unexpected turned up in
kamailio's locking code...
Cheers,
Daniel
On 12.01.18 16:38, Moritz Maisel wrote:
> Hi Daniel,
>
> thanks for pointing out the details about the lock structure. After
> some investigation into locking issues, we found out that the root
> cause was an external system that blocked kamailio's connection
> attempts, which subsequently led to blocked threads. Sorry for the
> noise and have a nice weekend!
>
> Regards,
> Moritz
>
>
> On Fri, Jan 12, 2018 at 1:50 PM, Daniel-Constantin Mierla
> <miconda at gmail.com> wrote:
>
> Hello,
>
>
> On 12.01.18 11:48, Sebastian Damm wrote:
>> Hi Daniel,
>>
>> Unless there is a silent exit from some module (those extra routes
>> just write some information into redis), I can't find any drops or
>> exits other than the ones in the posted route.
>>
>> We have added some logging around the lock functions. Although we
>> are missing some log lines for some packets -- which makes it look
>> like some locks don't get released -- we do see that another PID can
>> successfully lock the same key later.
>>
>> There is no other lock/unlock in the config.
>>
>> Is there a way to debug the lock array?
> Attaching with gdb and looking at the values is always an option.
> Otherwise, I do not recall anything written in kamailio specifically
> for this purpose.
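>
> For example, a minimal gdb sketch against one of the stuck UDP workers
> could look like the session below -- the symbol names _cfg_lock_set and
> _cfg_lock_size are assumptions based on the cfgutils module sources,
> and the struct layout depends on the lock implementation compiled in:
>
> (gdb) attach <pid of a stuck UDP worker>
> (gdb) bt
> (gdb) print _cfg_lock_size
> (gdb) print *_cfg_lock_set
> (gdb) print _cfg_lock_set->locks[0]
>
> With the futex-based locks, a non-zero value in a slot would suggest
> that the lock at that index is currently held.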
>
>> One guess could be that the lock array is full, although we
>> already have the lock size set to 10.
>
> The lock array cannot become "full" -- not even if there are 10 lock
> operations with different keys on an array of 10 locks.
>
> Actually, many string keys can use the same lock/mutex: a hash
> function is applied to the key value, the resulting integer is taken
> modulo the number of locks, and the lock at that index is chosen from
> the array. So if you do lock("abc") and don't release it, then
> lock("xyz") can get blocked. Also, doing unlock("xyz") without a prior
> lock("xyz") can end up releasing the lock taken via lock("abc") --
> note that "abc" and "xyz" are just random examples to explain what
> can happen; it does not mean it happens for these keys.
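>
> As a minimal config sketch of such a collision (route names and keys
> are invented for the illustration, they are not from your config):
>
> route[example_a] {   # executed by one worker process
>     lock("abc");
>     # ... long-running work while the mutex at index
>     # hash("abc") modulo lock-array-size is held ...
>     unlock("abc");
> }
>
> route[example_b] {   # executed by another worker at the same time
>     # if "xyz" maps to the same index as "abc", this call blocks
>     # until route[example_a] reaches unlock("abc"):
>     lock("xyz");
>     unlock("xyz");
> }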
>
> Cheers,
> Daniel
>
>
>>
>> Regards,
>> Sebastian
>>
>> On Fri, Jan 12, 2018 at 11:08 AM, Daniel-Constantin Mierla
>> <miconda at gmail.com> wrote:
>>
>> Hello,
>>
>> have you checked that none of the route blocks used between lock()
>> and unlock() do an exit or drop (and that none of them return 0)?
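>>
>> For example, a pattern like the following sketch would leave the
>> mutex held for good (route and variable names are invented for
>> illustration):
>>
>> route[locked_work] {
>>     lock("$ci");
>>     route(do_checks);   # if route[do_checks] hits exit or drop,
>>                         # the unlock() below is never reached
>>     unlock("$ci");
>> }
>>
>> route[do_checks] {
>>     if ($rm == "OPTIONS") {
>>         exit;   # config execution stops here, the lock stays held
>>     }
>> }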
>>
>> Do you have other lock() / unlock() in the config?
>>
>> Cheers,
>> Daniel
>>
>>
>> On 12.01.18 08:49, Sebastian Damm wrote:
>>> Hi,
>>>
>>> this morning, two Kamailio servers suddenly stopped working
>>> after having worked without a problem for months. Both are
>>> running 5.0.2 on Debian Jessie. Those systems only analyze
>>> mirrored traffic and write some information to RabbitMQ.
>>>
>>> I tried restarting, but after a few seconds they got stuck the
>>> same way as before. Then I attached gdb to some of the UDP
>>> listeners, and they all look pretty much the same.
>>>
>>>
>>> ===========8< process 1 =================
>>> (gdb) bt
>>> #0 0x00007f404ca435b9 in syscall () from
>>> /lib/x86_64-linux-gnu/libc.so.6
>>> #1 0x00007f4049f41260 in futex_get (lock=0x7f4041bccd00) at
>>> ../../core/parser/../mem/../futexlock.h:108
>>> #2 0x00007f4049f477d7 in cfg_lock_helper
>>> (lkey=0x7fff3c048650, mode=0) at cfgutils.c:667
>>> #3 0x00007f4049f481ed in w_cfg_lock_wrapper
>>> (msg=0x7f404ba17f18, key=0x7f404ba15488, mode=0) at
>>> cfgutils.c:712
>>> #4 0x00007f4049f4823c in w_cfg_lock (msg=0x7f404ba17f18,
>>> key=0x7f404ba15488 "\210a\241K@\177", s2=0x0) at cfgutils.c:717
>>> #5 0x000000000045cd85 in do_action (h=0x7fff3c049410,
>>> a=0x7f404b9d5170, msg=0x7f404ba17f18) at core/action.c:1060
>>> #6 0x0000000000469afa in run_actions (h=0x7fff3c049410,
>>> a=0x7f404b9d5170, msg=0x7f404ba17f18) at core/action.c:1552
>>> #7 0x000000000045958d in do_action (h=0x7fff3c049410,
>>> a=0x7f404b9ce558, msg=0x7f404ba17f18) at core/action.c:678
>>> #8 0x0000000000469afa in run_actions (h=0x7fff3c049410,
>>> a=0x7f404b9cdf70, msg=0x7f404ba17f18) at core/action.c:1552
>>> #9 0x000000000046a2bd in run_top_route (a=0x7f404b9cdf70,
>>> msg=0x7f404ba17f18, c=0x0) at core/action.c:1641
>>> #10 0x0000000000580033 in receive_msg (
>>> buf=0xa393f9 <buf+121> "REGISTER sip:domain
>>> SIP/2.0\r\nVia: SIP/2.0/UDP
>>> 1.2.3.4;branch=z9hG4bKfc98.c762d7151f110c7eb71fc7d4e0648f1f.0\r\nVia:
>>> SIP/2.0/UDP
>>> 192.168.0.3:5060;rport=38020;received=62.30.8.128;branch=z9hG4"...,
>>> len=799, rcv_info=0x7fff3c049810) at core/receive.c:264
>>> #11 0x00007f40483efcdf in parsing_hepv3_message
>>> (buf=0xa39380 <buf> "HEP3\003\230", len=920) at hep.c:499
>>> #12 0x00007f40483ee264 in hepv3_received (buf=0xa39380 <buf>
>>> "HEP3\003\230", len=920, ri=0x7fff3c049a50) at hep.c:231
>>> #13 0x00007f40483ec9cb in hep_msg_received
>>> (data=0x7fff3c049a30) at hep.c:85
>>> #14 0x000000000049e4e1 in sr_event_exec (type=7,
>>> data=0x7fff3c049a30) at core/events.c:263
>>> #15 0x000000000048731a in udp_rcv_loop () at
>>> core/udp_server.c:466
>>> #16 0x0000000000422d08 in main_loop () at main.c:1623
>>> #17 0x000000000042a408 in main (argc=13,
>>> argv=0x7fff3c049ef8) at main.c:2643
>>> (gdb) quit
>>> ===========8< process 1 =================
>>>
>>>
>>> ===========8< process 2 =================
>>> (gdb) bt
>>> #0 0x00007f404ca435b9 in syscall () from
>>> /lib/x86_64-linux-gnu/libc.so.6
>>> #1 0x00007f4049f41260 in futex_get (lock=0x7f4041bccd00) at
>>> ../../core/parser/../mem/../futexlock.h:108
>>> #2 0x00007f4049f477d7 in cfg_lock_helper
>>> (lkey=0x7fff3c048650, mode=0) at cfgutils.c:667
>>> #3 0x00007f4049f481ed in w_cfg_lock_wrapper
>>> (msg=0x7f404ba17f18, key=0x7f404ba15488, mode=0) at
>>> cfgutils.c:712
>>> #4 0x00007f4049f4823c in w_cfg_lock (msg=0x7f404ba17f18,
>>> key=0x7f404ba15488 "\210a\241K@\177", s2=0x0) at cfgutils.c:717
>>> #5 0x000000000045cd85 in do_action (h=0x7fff3c049600,
>>> a=0x7f404b9d5170, msg=0x7f404ba17f18) at core/action.c:1060
>>> #6 0x0000000000469afa in run_actions (h=0x7fff3c049600,
>>> a=0x7f404b9d5170, msg=0x7f404ba17f18) at core/action.c:1552
>>> #7 0x000000000045958d in do_action (h=0x7fff3c049600,
>>> a=0x7f404b9cf920, msg=0x7f404ba17f18) at core/action.c:678
>>> #8 0x0000000000469afa in run_actions (h=0x7fff3c049600,
>>> a=0x7f404b9ce9c0, msg=0x7f404ba17f18) at core/action.c:1552
>>> #9 0x000000000046a2bd in run_top_route (a=0x7f404b9ce9c0,
>>> msg=0x7f404ba17f18, c=0x7fff3c049600) at core/action.c:1641
>>> #10 0x0000000000580a69 in receive_msg (
>>> buf=0xa393f9 <buf+121> "SIP/2.0 200 OK\r\nVia:
>>> SIP/2.0/UDP
>>> 172.20.40.8;branch=z9hG4bKa69c.5cca5a54c4d270c515eeb4bbb5e0bb44.0\r\nVia:
>>> SIP/2.0/UDP 2.3.4.5:5060\r\nContact:
>>> <sip:1234567 at 9.8.7.6:59669;transport=UDP>\r\nTo: "...,
>>> len=506, rcv_info=0x7fff3c049810) at core/receive.c:327
>>> #11 0x00007f40483efcdf in parsing_hepv3_message
>>> (buf=0xa39380 <buf> "HEP3\002s", len=627) at hep.c:499
>>> #12 0x00007f40483ee264 in hepv3_received (buf=0xa39380 <buf>
>>> "HEP3\002s", len=627, ri=0x7fff3c049a50) at hep.c:231
>>> #13 0x00007f40483ec9cb in hep_msg_received
>>> (data=0x7fff3c049a30) at hep.c:85
>>> #14 0x000000000049e4e1 in sr_event_exec (type=7,
>>> data=0x7fff3c049a30) at core/events.c:263
>>> #15 0x000000000048731a in udp_rcv_loop () at
>>> core/udp_server.c:466
>>> #16 0x0000000000422d08 in main_loop () at main.c:1623
>>> #17 0x000000000042a408 in main (argc=13,
>>> argv=0x7fff3c049ef8) at main.c:2643
>>> (gdb) quit
>>> ===========8< process 2 =================
>>>
>>> It looks to me as if Kamailio gets stuck trying to acquire a lock
>>> for this packet.
>>>
>>> The part of the config that handles those packets looks like this:
>>>
>>> request_route {
>>> route(initialize_variables);
>>> route(foo);
>>> }
>>>
>>> onreply_route {
>>> route(initialize_variables);
>>> route(foo);
>>> }
>>>
>>> route[initialize_variables] {
>>> $vn(bar) = $null;
>>> $vn(baz) = $null;
>>> $vn(barbaz) = $null;
>>> $vn(foobar) = $null;
>>>
>>> }
>>>
>>> route[foo] {
>>> lock("$ci");
>>> xlog("L_DBG", "Obtained lock, calling lua...\n");
>>> if(!lua_run("handle_packet")) {
>>> xlog("L_ERR", "SCRIPT: failed to execute lua
>>> function!\n");
>>> }
>>>
>>> if ($vn(bar) != $null) {
>>> route(mediaIpToRedis);
>>> }
>>>
>>> if ($vn(baz) != $null) {
>>> route(channelInfoToRedis);
>>> }
>>>
>>> if ($vn(barbaz) != $null) {
>>> route(sendToQueue);
>>> }
>>> xlog("L_DBG", "All finished, releasing lock...\n");
>>> unlock("$ci");
>>> xlog("L_DBG", "Released lock...\n");
>>>
>>> # update stats
>>> if ($vn(foobar) != $null) {
>>> update_stat("$vn(foobar)", "+1");
>>> }
>>>
>>> drop;
>>> exit;
>>> }
>>>
>>> Is there any race condition I am missing? Until this morning it
>>> ran without problems, and only about 4 packets every 2 minutes or
>>> so were sent to RabbitMQ -- the system throws away nearly 100% of
>>> the traffic (which is around 20 Mbit/s).
>>>
>>> In the log file there are no errors at all, no aborted lua
>>> scripts or anything of the sort.
>>>
>>> Does anybody have a hint for me? Thanks in advance.
>>>
>>> Regards
>>> Sebastian
>>>
>>>
>>
>> --
>> Daniel-Constantin Mierla
>> www.twitter.com/miconda -- www.linkedin.com/in/miconda
>> Kamailio Advanced Training - March 5-7, 2018, Berlin - www.asipto.com
>> Kamailio World Conference - May 14-16, 2018 - www.kamailioworld.com
>>
>
> --
> Daniel-Constantin Mierla
> www.twitter.com/miconda -- www.linkedin.com/in/miconda
> Kamailio Advanced Training - March 5-7, 2018, Berlin - www.asipto.com
> Kamailio World Conference - May 14-16, 2018 - www.kamailioworld.com
>
> --
> sipgate GmbH - Gladbacher Str. 74 - 40219 Düsseldorf
> HRB Düsseldorf 39841 - Geschäftsführer: Thilo Salmon, Tim Mois
> Steuernummer: 106/5724/7147, Umsatzsteuer-ID: DE219349391
>
> www.sipgate.de - www.sipgate.co.uk
--
Daniel-Constantin Mierla
www.twitter.com/miconda -- www.linkedin.com/in/miconda
Kamailio Advanced Training - March 5-7, 2018, Berlin - www.asipto.com
Kamailio World Conference - May 14-16, 2018 - www.kamailioworld.com