This is not 100% related to this issue, but I wanted to share the setup we use for cases where an rtpengine fails under high traffic load, to minimize the impact on the kamailio processes.

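# query at most 2 nodes of a set before giving up
modparam("rtpengine", "queried_nodes_limit", 2)
# retry a timed-out command up to 2 times
modparam("rtpengine", "rtpengine_retr", 2)
# wait up to 350 ms for each reply
modparam("rtpengine", "rtpengine_tout_ms", 350)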

We chose these values because we don't use sets with more than 2 rtpengine instances, at least for retry attempts. They also assume that your rtpengine instances are on the same network.

This works quite well for us: there are a few seconds of impact while the rtpengine is being marked as disabled, but the system recovers fine.
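
For a rough sense of what these three values bound, here is a back-of-the-envelope sketch in Python. It assumes (this is not taken from the module source, and may differ between versions) that a worker waits up to rtpengine_tout_ms for each of rtpengine_retr attempts, against each of up to queried_nodes_limit nodes. The function name is made up for illustration.

```python
def worst_case_block_ms(queried_nodes_limit: int, retries: int, timeout_ms: int) -> int:
    """Upper-bound estimate of how long one worker may wait on dead rtpengines.

    Assumption: every queried node gets `retries` attempts and every
    attempt waits up to `timeout_ms` before giving up.
    """
    return queried_nodes_limit * retries * timeout_ms


# Values from the modparam lines above: 2 nodes, 2 retries, 350 ms each.
estimate_ms = worst_case_block_ms(2, 2, 350)
print(estimate_ms)  # 1400
```

Under that assumption a single request is held for about 1.4 seconds at most, which would be in line with the few seconds of impact we observe.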

-----Original Message-----
From: "Daniel-Constantin Mierla" <miconda@gmail.com>
Sent: Friday, December 28, 2018 9:15am
To: "Juha Heinanen" <jh@tutpro.com>
Cc: "Kamailio (SER) - Users Mailing List" <sr-users@lists.kamailio.org>
Subject: Re: [SR-Users] kamailio does not responde if an rtpengine is unreachable

I just pushed a series of commits trying to rework how loading (and
reloading) of the rtpengines list is done, to avoid the sync'ed probing,
which can take long if any of the rtpengines is down.

Now, building the local (per process) structures/sockets for rtpengines
during kamailio start up is done without locking. This is guarded by the
fact a reload command can be executed only after all children were
initialized (added also with these commits). Moreover, the probing of
rtpengines is done only by child process 1, because the status is stored
in a shared memory list, so it is visible in all children. Based on my
understanding there, doing probing from all processes is useless now,
that was probably kept from the time when the list was not stored in
shared memory, from the early rtpproxy times.

There is also a restriction on how often the rtpengine list can be
reloaded, now having a 10 seconds interval guard. I added this because
the reload is done over the old list, not building a new list to swap
with the old one. So it requires some time to walk through the existing
list and update based on the new records. I went this way for now, even
though building a new list may be better/safer in the long term, because
that would require more work. I also wanted to avoid being very intrusive right
now, given that those patches would need to be backported.
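
To illustrate the interval guard described above, here is a minimal Python
sketch. It is not the module's C code: the class and method names are
hypothetical, and only the 10 seconds value comes from the message. The
clock is injectable so the behaviour can be checked without sleeping.

```python
import time

RELOAD_MIN_INTERVAL_S = 10  # interval mentioned in the message


class ReloadGuard:
    """Rejects a reload that arrives too soon after the previous one."""

    def __init__(self, min_interval_s=RELOAD_MIN_INTERVAL_S, clock=time.monotonic):
        self._min_interval_s = min_interval_s
        self._clock = clock
        self._last_reload = None  # no reload accepted yet

    def try_reload(self):
        """Return True if the reload may proceed, False if it is too early."""
        now = self._clock()
        if self._last_reload is not None and now - self._last_reload < self._min_interval_s:
            return False
        self._last_reload = now
        return True


# Usage with a fake clock (a one-element list holding the current time).
fake_now = [100.0]
guard = ReloadGuard(clock=lambda: fake_now[0])
first = guard.try_reload()    # accepted, nothing ran before
fake_now[0] += 3
second = guard.try_reload()   # rejected, only 3 s have passed
fake_now[0] += 7
third = guard.try_reload()    # accepted, 10 s since the first reload
```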

The last relevant change was to use a version number to discover when a
reload was done. So far, as I understood, it was relying on the number
of rtpengines, but one may trigger a reload with same rtpengines, but
different attributes (e.g., disabled or not). Having a version number is
better in detecting when each worker needs to rebuild its local list of
sockets, as well as for troubleshooting, because a value is increased
with each reload, so it is easier to track if it was done or not.
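
A small Python sketch of why a version number detects reloads that a
node count misses. It is only an illustration under stated assumptions:
the class names, field names and the rtpengine addresses are made up,
and the real module keeps this state in C structures in shared memory.

```python
from dataclasses import dataclass, field


@dataclass
class SharedList:
    """Stands in for the shared-memory rtpengine list."""
    nodes: dict = field(default_factory=dict)  # url -> disabled flag
    version: int = 0

    def reload(self, nodes):
        self.nodes = dict(nodes)
        self.version += 1  # increased with each reload


@dataclass
class Worker:
    """Stands in for one child process with its local copy."""
    seen_version: int = -1
    local: dict = field(default_factory=dict)

    def sync(self, shared):
        """Rebuild the local copy only when the shared version changed."""
        if self.seen_version == shared.version:
            return False
        self.local = dict(shared.nodes)
        self.seen_version = shared.version
        return True


shared = SharedList()
worker = Worker()

# Hypothetical addresses, both nodes enabled.
shared.reload({"udp:10.0.0.1:2223": False, "udp:10.0.0.2:2223": False})
rebuilt_first = worker.sync(shared)
count_before = len(shared.nodes)

# Same two nodes, but the second one is now disabled.
shared.reload({"udp:10.0.0.1:2223": False, "udp:10.0.0.2:2223": True})
count_unchanged = len(shared.nodes) == count_before  # a count check sees no change
rebuilt_second = worker.sync(shared)                 # the version check does
```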

I didn't have time for any tests, so it would be good if you can test
and report if it works as expected.

All related commits are in master; if they prove to work fine, we can
backport all those patches.

Cheers,
Daniel

On 26.12.18 12:46, Juha Heinanen wrote:
> Daniel-Constantin Mierla writes:
>
>> I pushed a quick fix for the case when db support is not enabled,
>> because these locks are useless in that case, so all children will do
>> the rtpengine init at the same time, without waiting for the others:
> Still took in rtpengine db mode about 2 minutes before kamailio became
> responsive after start.
>
> -- Juha

--
Daniel-Constantin Mierla -- www.asipto.com
www.twitter.com/miconda -- www.linkedin.com/in/miconda