Hello Daniel,
Thanks a lot for the update. We will also test it.
It is not 100% related to this issue, but I just wanted to share the setup we have for cases where an rtpengine fails under high traffic load, to minimize the impact on the Kamailio processes:
modparam("rtpengine", "queried_nodes_limit", 2)
modparam("rtpengine", "rtpengine_retr", 2)
modparam("rtpengine", "rtpengine_tout_ms", 350)
This assumes we don't use sets with more than 2 rtpengine instances (at least for retry attempts), and that the rtpengine instances are on the same network.
This works quite well for us: there are a few seconds of impact while the failed rtpengine is being marked as disabled, but the system recovers without trouble.
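For reference, a rough worst-case estimate of the blocking per offer with those values, assuming rtpengine_retr counts the send attempts per node and rtpengine_tout_ms is the wait per attempt (check the module docs for the exact semantics of these parameters on your version):

/* rough worst-case blocking while a node is unreachable (sketch only) */
#include <stdio.h>

int main(void)
{
    int queried_nodes_limit = 2;   /* nodes tried before giving up        */
    int rtpengine_retr      = 2;   /* send attempts per node (assumption) */
    int rtpengine_tout_ms   = 350; /* wait per attempt, in milliseconds   */

    printf("worst case: %d ms\n",
           queried_nodes_limit * rtpengine_retr * rtpengine_tout_ms); /* 1400 ms */
    return 0;
}

That is in the same ballpark as the few seconds of impact we observe before the node gets disabled.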
Best regards,
David
-----Original Message-----
From: "Daniel-Constantin Mierla" miconda@gmail.com
Sent: Friday, December 28, 2018 9:15am
To: "Juha Heinanen" jh@tutpro.com
Cc: "Kamailio (SER) - Users Mailing List" sr-users@lists.kamailio.org
Subject: Re: [SR-Users] kamailio does not responde if an rtpengine is unreachable
I just pushed a series of commits trying to rework how loading (and reloading) of the rtpengines list is done, to avoid that synchronized probing, which can take a long time if any of the rtpengines is down.
Now, building the local (per process) structures/sockets for the rtpengines during Kamailio startup is done without locking. This is guarded by the fact that a reload command can be executed only after all children have been initialized (a check also added with these commits). Moreover, the probing of the rtpengines is done only by child process 1, because the status is stored in a shared memory list, so it is visible to all children. Based on my understanding, probing from all processes is useless now; it was probably kept from the time when the list was not stored in shared memory, from the early rtpproxy days.
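To illustrate the idea (only a sketch with made-up helper names, not the actual module code): the per-process sockets are built lock-free, and a single worker does the probing, with the result kept in shared memory where every child can see it.

/* sketch only; helper names are illustrative */
static int child_init(int rank)
{
    /* per-process sockets are built without locking at startup */
    if (build_local_rtpp_sockets() < 0)
        return -1;

    /* only worker 1 probes the rtpengines; the enabled/disabled
     * status lives in the shared memory list, so the other children
     * see the result without probing themselves */
    if (rank == 1)
        probe_all_rtpengines();

    return 0;
}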
There is also a restriction on how often the rtpengine list can be reloaded, now with a 10-second interval guard. I added this because the reload is done over the old list, not by building a new list to swap with the old one, so it takes some time to walk through the existing list and update it based on the new records. I went this way for now; building a new list may be better/safer in the long term, but it would require more work, and I also wanted to avoid being very intrusive right now, given that these patches would need to be backported.
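Something along these lines (only a sketch; the names and the way the timestamp is stored are illustrative):

#include <time.h>

#define RELOAD_MIN_INTERVAL 10 /* seconds between accepted reloads */

static time_t *last_reload_time; /* allocated in shared memory */

/* reject a reload if the previous one was too recent, so the in-place
 * walk over the existing list cannot be retriggered while it settles */
static int reload_allowed(void)
{
    time_t now = time(NULL);

    if (now - *last_reload_time < RELOAD_MIN_INTERVAL)
        return 0;

    *last_reload_time = now;
    return 1;
}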
The last relevant change was to use a version number to detect when a reload was done. So far, as I understood it, this relied on the number of rtpengines, but one may trigger a reload with the same rtpengines and different attributes (e.g., disabled or not). Having a version number is better for detecting when each worker needs to rebuild its local list of sockets, as well as for troubleshooting, because the value is increased with each reload, so it is easier to track whether it was done or not.
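Conceptually it looks like this (a sketch with illustrative names, not the actual module symbols):

/* shared version counter versus a per-process copy */
static unsigned int *rtpp_set_version; /* in shared memory, bumped on each reload */
static unsigned int local_version;     /* per worker process */

static int ensure_local_sockets(void)
{
    if (local_version != *rtpp_set_version) {
        /* a reload happened since this worker last checked: rebuild
         * the local sockets even if the number of nodes is unchanged */
        if (rebuild_local_sockets() < 0)
            return -1;
        local_version = *rtpp_set_version;
    }
    return 0;
}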
I didn't have time for any tests, so it would be good if you could test and report whether it works as expected.
All related commits are in master; if they prove to work fine, we can backport all those patches.
Cheers, Daniel
On 26.12.18 12:46, Juha Heinanen wrote:
Daniel-Constantin Mierla writes:
I pushed a quick fix for the case when db support is not enabled, because these locks are useless in that case, so all children will do the rtpengine init at the same time, without waiting for the others:
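A minimal sketch of that idea (illustrative names, not the actual module code): serialize the init only when the rtpengine records come from the database, otherwise let the children initialize in parallel.

static int rtpengine_child_init_sketch(void)
{
    if (rtpp_db_url.s != NULL) {
        /* db mode: children wait on each other around the db load */
        lock_get(rtpp_init_lock);
        init_rtpengines_from_db();
        lock_release(rtpp_init_lock);
    } else {
        /* no db: nothing shared to guard, init in parallel */
        build_local_rtpp_sockets();
    }
    return 0;
}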
In rtpengine db mode it still took about 2 minutes before Kamailio became responsive after start.
-- Juha