## Description
When one rtpengine instance becomes unavailable on a Kamailio instance with moderate call traffic, even where there are multiple rtpengine instances configured for high availability, all processes can become blocked and Kamailio stops responding to all requests.
We produced this in a real world test, with two Kamailio servers running (20 child processes) and two rtpengine servers running. We loaded the Kamailio servers with calls, some answered, some ringing/busy/etc, and then made one rtpengine server drop all packets. Both Kamailio servers were able to briefly detect that the rtpengine instance was unavailable when it attempted to send an offer to it, but then as existing calls began to complete, both Kamailio servers became congested attempting to send "delete" commands to rtpengine. Both Kamailio servers then became unavailable and stopped responding to all messages for 10 minutes, until we restored rtpengine.
As such, an rtpengine server that crashes can cause a complete Kamailio outage.
## Troubleshooting
By using the reproduction steps below, we were easily able to make Kamailio stop responding to all incoming SIP messages when taking down an rtpengine instance.
The logs below show the sequence of events when this is reproduced for a single call.
Configuration:
- `rtpengine_tout_ms` was not configured and had the default value of 1000ms. - `rtpengine_retr` was not configured and had the default value of 5. - `queried_nodes_limit` was not configured and had the default value (of 30?). - TM's `exec_time_check` was not configured and had the default value of 5000ms.
**A call is forwarded using t_relay and rtpengine_manage. rtpengine is killed mid-call. The call ends with 486.**
When the t_on_failure route is called, the config makes a singular call to rtpengine_manage, however what follows is 30 errors, all mentioning the same rtpengine instance:
``` ERROR: rtpengine [rtpengine.c:3129]: send_rtpp_command(): timeout waiting reply for command "delete" from RTPEngine udp:192.168.X.X:2223 ```
Followed by the below:
``` ERROR: rtpengine [rtpengine.c:2729]: rtpp_function_call(): queried nodes limit reached WARNING: tm [t_reply.c:1094]: run_failure_handlers(): failure route execution took too long: 150153261 us ```
Each timeout is 5 seconds (5x retries of 1000ms), which corresponds to about 150 seconds over 30 attempts. This matches the warning generated by tm.
In this case, there was about 5 minutes of time where a process was blocked. 150 seconds for the child leg's failure route, and 150 seconds for the parent leg's failure route. When this happens for many calls at once, congestion is guaranteed.
The 30 attempts to delete the call from rtpengine comes from the default value of `queried_nodes_limit`. This behavior doesn't make sense to me -- it's a retry loop that runs a retry loop.
## Reproduction
_tldr: kill rtpengine via iptables mid-call with the default rtpengine settings and you'll see lots of errors._
You can reproduce this with one or multiple rtpengine instances configured in Kamailio. It's easier with just one.
Standard setup, but importantly:
- forwarding a call using t_relay - t_on_branch is configured which calls rtpengine_manage - t_on_failure is configured which calls rtpengine_manage
To reproduce easily, we configured the destination server to send a 100, a 180, wait 20 seconds, then send a 486.
Place a number of calls to Kamailio (we used sipp) -- importantly, exceeding the number of child processes configured. You can reproduce the issue with just one call, but this wont make Kamailio unresponsive.
Once the calls are placed, and still in progress (ie before the 486 is returned), use iptables on the rtpengine server to drop all traffic on its control port.
When the call(s) start receiving the 486, you will notice a flood of error messages `timeout waiting reply for command "delete" from RTPEngine` for approximately 5 minutes per call. This blocks the process.
If it's now attempting to send the delete command to rtpengine for a number of calls greater than your child process count: attempt to send any other SIP messages to Kamailio and you will not receive a response.
## Log Messages
This is a log for a single call, forwarded using t_relay. All calls to rtpengine_manage are immediately surrounded by info log messages "calling rtpengine_manage" and "completed calling rtpengine_manage".
``` Feb 12 23:40:54 debian-kamailio kamailio[27715]: INFO: {1 1 INVITE 1-27864@127.0.1.1} <script>: new branch [0] to sip:0421XXXXXX@192.168.X.X:9060 Feb 12 23:40:54 debian-kamailio kamailio[27715]: INFO: {1 1 INVITE 1-27864@127.0.1.1} <script>: branch_route: calling rtpengine_manage Feb 12 23:40:54 debian-kamailio kamailio[27715]: INFO: {1 1 INVITE 1-27864@127.0.1.1} <script>: branch_route: completed calling rtpengine_manage Feb 12 23:40:54 debian-kamailio kamailio[27732]: INFO: {2 1 INVITE 1-27864@127.0.1.1} <script>: onreply_route: calling rtpengine_manage Feb 12 23:40:54 debian-kamailio kamailio[27732]: INFO: {2 1 INVITE 1-27864@127.0.1.1} <script>: onreply_route: completed calling rtpengine_manage Feb 12 23:40:55 debian-kamailio kamailio[27730]: INFO: {1 63577835 INVITE 720bf6d9-25fb-123c-cb91-80fa5d718101} <script>: new branch [0] to sip:61421764018@192.168.X.X:9092;transport=udp Feb 12 23:40:55 debian-kamailio kamailio[27730]: INFO: {1 63577835 INVITE 720bf6d9-25fb-123c-cb91-80fa5d718101} <script>: branch_route: calling rtpengine_manage Feb 12 23:40:55 debian-kamailio kamailio[27730]: INFO: {1 63577835 INVITE 720bf6d9-25fb-123c-cb91-80fa5d718101} <script>: branch_route: completed calling rtpengine_manage Feb 12 23:40:55 debian-kamailio kamailio[27717]: INFO: {2 63577835 INVITE 720bf6d9-25fb-123c-cb91-80fa5d718101} <script>: onreply_route: calling rtpengine_manage Feb 12 23:40:55 debian-kamailio kamailio[27717]: INFO: {2 63577835 INVITE 720bf6d9-25fb-123c-cb91-80fa5d718101} <script>: onreply_route: completed calling rtpengine_manage Feb 12 23:40:55 debian-kamailio kamailio[27714]: INFO: {2 63577835 INVITE 720bf6d9-25fb-123c-cb91-80fa5d718101} <script>: onreply_route: calling rtpengine_manage Feb 12 23:40:55 debian-kamailio kamailio[27714]: INFO: {2 63577835 INVITE 720bf6d9-25fb-123c-cb91-80fa5d718101} <script>: onreply_route: completed calling rtpengine_manage Feb 12 23:40:55 debian-kamailio kamailio[27735]: INFO: {2 1 INVITE 1-27864@127.0.1.1} <script>: onreply_route: calling rtpengine_manage Feb 12 23:40:55 debian-kamailio kamailio[27735]: INFO: {2 1 INVITE 1-27864@127.0.1.1} <script>: onreply_route: completed calling rtpengine_manage Feb 12 23:41:15 debian-kamailio kamailio[27713]: INFO: {1 63577835 INVITE 720bf6d9-25fb-123c-cb91-80fa5d718101} <script>: failure_route: calling rtpengine_manage Feb 12 23:41:20 debian-kamailio kamailio[27713]: ERROR: {1 63577835 INVITE 720bf6d9-25fb-123c-cb91-80fa5d718101} rtpengine [rtpengine.c:3129]: send_rtpp_command(): timeout waiting reply for command "delete" from RTPEngine udp:192.168.X.X:2223 Feb 12 23:41:25 debian-kamailio kamailio[27713]: ERROR: {1 63577835 INVITE 720bf6d9-25fb-123c-cb91-80fa5d718101} rtpengine [rtpengine.c:3129]: send_rtpp_command(): timeout waiting reply for command "delete" from RTPEngine udp:192.168.X.X:2223 Feb 12 23:41:30 debian-kamailio kamailio[27713]: ERROR: {1 63577835 INVITE 720bf6d9-25fb-123c-cb91-80fa5d718101} rtpengine [rtpengine.c:3129]: send_rtpp_command(): timeout waiting reply for command "delete" from RTPEngine udp:192.168.X.X:2223 Feb 12 23:41:35 debian-kamailio kamailio[27713]: ERROR: {1 63577835 INVITE 720bf6d9-25fb-123c-cb91-80fa5d718101} rtpengine [rtpengine.c:3129]: send_rtpp_command(): timeout waiting reply for command "delete" from RTPEngine udp:192.168.X.X:2223 Feb 12 23:41:40 debian-kamailio kamailio[27713]: ERROR: {1 63577835 INVITE 720bf6d9-25fb-123c-cb91-80fa5d718101} rtpengine [rtpengine.c:3129]: send_rtpp_command(): timeout waiting reply for command "delete" from RTPEngine udp:192.168.X.X:2223 Feb 12 23:41:45 debian-kamailio kamailio[27713]: ERROR: {1 63577835 INVITE 720bf6d9-25fb-123c-cb91-80fa5d718101} rtpengine [rtpengine.c:3129]: send_rtpp_command(): timeout waiting reply for command "delete" from RTPEngine udp:192.168.X.X:2223 Feb 12 23:41:50 debian-kamailio kamailio[27713]: ERROR: {1 63577835 INVITE 720bf6d9-25fb-123c-cb91-80fa5d718101} rtpengine [rtpengine.c:3129]: send_rtpp_command(): timeout waiting reply for command "delete" from RTPEngine udp:192.168.X.X:2223 Feb 12 23:41:55 debian-kamailio kamailio[27740]: INFO: {2 1 INVITE 1-27864@127.0.1.1} <script>: onreply_route: calling rtpengine_manage Feb 12 23:41:55 debian-kamailio kamailio[27740]: INFO: {2 1 INVITE 1-27864@127.0.1.1} <script>: onreply_route: completed calling rtpengine_manage Feb 12 23:41:55 debian-kamailio kamailio[27713]: ERROR: {1 63577835 INVITE 720bf6d9-25fb-123c-cb91-80fa5d718101} rtpengine [rtpengine.c:3129]: send_rtpp_command(): timeout waiting reply for command "delete" from RTPEngine udp:192.168.X.X:2223 Feb 12 23:42:00 debian-kamailio kamailio[27713]: ERROR: {1 63577835 INVITE 720bf6d9-25fb-123c-cb91-80fa5d718101} rtpengine [rtpengine.c:3129]: send_rtpp_command(): timeout waiting reply for command "delete" from RTPEngine udp:192.168.X.X:2223 Feb 12 23:42:05 debian-kamailio kamailio[27713]: ERROR: {1 63577835 INVITE 720bf6d9-25fb-123c-cb91-80fa5d718101} rtpengine [rtpengine.c:3129]: send_rtpp_command(): timeout waiting reply for command "delete" from RTPEngine udp:192.168.X.X:2223 Feb 12 23:42:10 debian-kamailio kamailio[27713]: ERROR: {1 63577835 INVITE 720bf6d9-25fb-123c-cb91-80fa5d718101} rtpengine [rtpengine.c:3129]: send_rtpp_command(): timeout waiting reply for command "delete" from RTPEngine udp:192.168.X.X:2223 Feb 12 23:42:16 debian-kamailio kamailio[27713]: ERROR: {1 63577835 INVITE 720bf6d9-25fb-123c-cb91-80fa5d718101} rtpengine [rtpengine.c:3129]: send_rtpp_command(): timeout waiting reply for command "delete" from RTPEngine udp:192.168.X.X:2223 Feb 12 23:42:21 debian-kamailio kamailio[27713]: ERROR: {1 63577835 INVITE 720bf6d9-25fb-123c-cb91-80fa5d718101} rtpengine [rtpengine.c:3129]: send_rtpp_command(): timeout waiting reply for command "delete" from RTPEngine udp:192.168.X.X:2223 Feb 12 23:42:26 debian-kamailio kamailio[27713]: ERROR: {1 63577835 INVITE 720bf6d9-25fb-123c-cb91-80fa5d718101} rtpengine [rtpengine.c:3129]: send_rtpp_command(): timeout waiting reply for command "delete" from RTPEngine udp:192.168.X.X:2223 Feb 12 23:42:31 debian-kamailio kamailio[27713]: ERROR: {1 63577835 INVITE 720bf6d9-25fb-123c-cb91-80fa5d718101} rtpengine [rtpengine.c:3129]: send_rtpp_command(): timeout waiting reply for command "delete" from RTPEngine udp:192.168.X.X:2223 Feb 12 23:42:36 debian-kamailio kamailio[27713]: ERROR: {1 63577835 INVITE 720bf6d9-25fb-123c-cb91-80fa5d718101} rtpengine [rtpengine.c:3129]: send_rtpp_command(): timeout waiting reply for command "delete" from RTPEngine udp:192.168.X.X:2223 Feb 12 23:42:41 debian-kamailio kamailio[27713]: ERROR: {1 63577835 INVITE 720bf6d9-25fb-123c-cb91-80fa5d718101} rtpengine [rtpengine.c:3129]: send_rtpp_command(): timeout waiting reply for command "delete" from RTPEngine udp:192.168.X.X:2223 Feb 12 23:42:46 debian-kamailio kamailio[27713]: ERROR: {1 63577835 INVITE 720bf6d9-25fb-123c-cb91-80fa5d718101} rtpengine [rtpengine.c:3129]: send_rtpp_command(): timeout waiting reply for command "delete" from RTPEngine udp:192.168.X.X:2223 Feb 12 23:42:51 debian-kamailio kamailio[27713]: ERROR: {1 63577835 INVITE 720bf6d9-25fb-123c-cb91-80fa5d718101} rtpengine [rtpengine.c:3129]: send_rtpp_command(): timeout waiting reply for command "delete" from RTPEngine udp:192.168.X.X:2223 Feb 12 23:42:55 debian-kamailio kamailio[27734]: INFO: {2 1 INVITE 1-27864@127.0.1.1} <script>: onreply_route: calling rtpengine_manage Feb 12 23:42:55 debian-kamailio kamailio[27734]: INFO: {2 1 INVITE 1-27864@127.0.1.1} <script>: onreply_route: completed calling rtpengine_manage Feb 12 23:42:56 debian-kamailio kamailio[27713]: ERROR: {1 63577835 INVITE 720bf6d9-25fb-123c-cb91-80fa5d718101} rtpengine [rtpengine.c:3129]: send_rtpp_command(): timeout waiting reply for command "delete" from RTPEngine udp:192.168.X.X:2223 Feb 12 23:43:01 debian-kamailio kamailio[27713]: ERROR: {1 63577835 INVITE 720bf6d9-25fb-123c-cb91-80fa5d718101} rtpengine [rtpengine.c:3129]: send_rtpp_command(): timeout waiting reply for command "delete" from RTPEngine udp:192.168.X.X:2223 Feb 12 23:43:06 debian-kamailio kamailio[27713]: ERROR: {1 63577835 INVITE 720bf6d9-25fb-123c-cb91-80fa5d718101} rtpengine [rtpengine.c:3129]: send_rtpp_command(): timeout waiting reply for command "delete" from RTPEngine udp:192.168.X.X:2223 Feb 12 23:43:11 debian-kamailio kamailio[27713]: ERROR: {1 63577835 INVITE 720bf6d9-25fb-123c-cb91-80fa5d718101} rtpengine [rtpengine.c:3129]: send_rtpp_command(): timeout waiting reply for command "delete" from RTPEngine udp:192.168.X.X:2223 Feb 12 23:43:16 debian-kamailio kamailio[27713]: ERROR: {1 63577835 INVITE 720bf6d9-25fb-123c-cb91-80fa5d718101} rtpengine [rtpengine.c:3129]: send_rtpp_command(): timeout waiting reply for command "delete" from RTPEngine udp:192.168.X.X:2223 Feb 12 23:43:21 debian-kamailio kamailio[27713]: ERROR: {1 63577835 INVITE 720bf6d9-25fb-123c-cb91-80fa5d718101} rtpengine [rtpengine.c:3129]: send_rtpp_command(): timeout waiting reply for command "delete" from RTPEngine udp:192.168.X.X:2223 Feb 12 23:43:26 debian-kamailio kamailio[27713]: ERROR: {1 63577835 INVITE 720bf6d9-25fb-123c-cb91-80fa5d718101} rtpengine [rtpengine.c:3129]: send_rtpp_command(): timeout waiting reply for command "delete" from RTPEngine udp:192.168.X.X:2223 Feb 12 23:43:31 debian-kamailio kamailio[27713]: ERROR: {1 63577835 INVITE 720bf6d9-25fb-123c-cb91-80fa5d718101} rtpengine [rtpengine.c:3129]: send_rtpp_command(): timeout waiting reply for command "delete" from RTPEngine udp:192.168.X.X:2223 Feb 12 23:43:36 debian-kamailio kamailio[27713]: ERROR: {1 63577835 INVITE 720bf6d9-25fb-123c-cb91-80fa5d718101} rtpengine [rtpengine.c:3129]: send_rtpp_command(): timeout waiting reply for command "delete" from RTPEngine udp:192.168.X.X:2223 Feb 12 23:43:41 debian-kamailio kamailio[27713]: ERROR: {1 63577835 INVITE 720bf6d9-25fb-123c-cb91-80fa5d718101} rtpengine [rtpengine.c:3129]: send_rtpp_command(): timeout waiting reply for command "delete" from RTPEngine udp:192.168.X.X:2223 Feb 12 23:43:46 debian-kamailio kamailio[27713]: ERROR: {1 63577835 INVITE 720bf6d9-25fb-123c-cb91-80fa5d718101} rtpengine [rtpengine.c:3129]: send_rtpp_command(): timeout waiting reply for command "delete" from RTPEngine udp:192.168.X.X:2223 Feb 12 23:43:46 debian-kamailio kamailio[27713]: ERROR: {1 63577835 INVITE 720bf6d9-25fb-123c-cb91-80fa5d718101} rtpengine [rtpengine.c:2729]: rtpp_function_call(): queried nodes limit reached Feb 12 23:43:46 debian-kamailio kamailio[27713]: INFO: {1 63577835 INVITE 720bf6d9-25fb-123c-cb91-80fa5d718101} <script>: failure_route: completed calling rtpengine_manage Feb 12 23:43:46 debian-kamailio kamailio[27713]: WARNING: tm [t_reply.c:1094]: run_failure_handlers(): failure route execution took too long: 150153207 us Feb 12 23:43:46 debian-kamailio kamailio[27736]: INFO: {1 63577835 ACK 720bf6d9-25fb-123c-cb91-80fa5d718101} <script>: ACK: calling rtpengine_manage Feb 12 23:43:46 debian-kamailio kamailio[27736]: INFO: {1 63577835 ACK 720bf6d9-25fb-123c-cb91-80fa5d718101} <script>: ACK: completed calling rtpengine_manage Feb 12 23:43:46 debian-kamailio kamailio[27738]: INFO: {1 1 INVITE 1-27864@127.0.1.1} <script>: failure_route: calling rtpengine_manage Feb 12 23:43:51 debian-kamailio kamailio[27738]: ERROR: {1 1 INVITE 1-27864@127.0.1.1} rtpengine [rtpengine.c:3129]: send_rtpp_command(): timeout waiting reply for command "delete" from RTPEngine udp:192.168.X.X:2223 Feb 12 23:43:56 debian-kamailio kamailio[27738]: ERROR: {1 1 INVITE 1-27864@127.0.1.1} rtpengine [rtpengine.c:3129]: send_rtpp_command(): timeout waiting reply for command "delete" from RTPEngine udp:192.168.X.X:2223 Feb 12 23:44:01 debian-kamailio kamailio[27738]: ERROR: {1 1 INVITE 1-27864@127.0.1.1} rtpengine [rtpengine.c:3129]: send_rtpp_command(): timeout waiting reply for command "delete" from RTPEngine udp:192.168.X.X:2223 Feb 12 23:44:06 debian-kamailio kamailio[27738]: ERROR: {1 1 INVITE 1-27864@127.0.1.1} rtpengine [rtpengine.c:3129]: send_rtpp_command(): timeout waiting reply for command "delete" from RTPEngine udp:192.168.X.X:2223 Feb 12 23:44:11 debian-kamailio kamailio[27738]: ERROR: {1 1 INVITE 1-27864@127.0.1.1} rtpengine [rtpengine.c:3129]: send_rtpp_command(): timeout waiting reply for command "delete" from RTPEngine udp:192.168.X.X:2223 Feb 12 23:44:16 debian-kamailio kamailio[27738]: ERROR: {1 1 INVITE 1-27864@127.0.1.1} rtpengine [rtpengine.c:3129]: send_rtpp_command(): timeout waiting reply for command "delete" from RTPEngine udp:192.168.X.X:2223 Feb 12 23:44:21 debian-kamailio kamailio[27738]: ERROR: {1 1 INVITE 1-27864@127.0.1.1} rtpengine [rtpengine.c:3129]: send_rtpp_command(): timeout waiting reply for command "delete" from RTPEngine udp:192.168.X.X:2223 Feb 12 23:44:26 debian-kamailio kamailio[27738]: ERROR: {1 1 INVITE 1-27864@127.0.1.1} rtpengine [rtpengine.c:3129]: send_rtpp_command(): timeout waiting reply for command "delete" from RTPEngine udp:192.168.X.X:2223 Feb 12 23:44:31 debian-kamailio kamailio[27738]: ERROR: {1 1 INVITE 1-27864@127.0.1.1} rtpengine [rtpengine.c:3129]: send_rtpp_command(): timeout waiting reply for command "delete" from RTPEngine udp:192.168.X.X:2223 Feb 12 23:44:36 debian-kamailio kamailio[27738]: ERROR: {1 1 INVITE 1-27864@127.0.1.1} rtpengine [rtpengine.c:3129]: send_rtpp_command(): timeout waiting reply for command "delete" from RTPEngine udp:192.168.X.X:2223 Feb 12 23:44:41 debian-kamailio kamailio[27738]: ERROR: {1 1 INVITE 1-27864@127.0.1.1} rtpengine [rtpengine.c:3129]: send_rtpp_command(): timeout waiting reply for command "delete" from RTPEngine udp:192.168.X.X:2223 Feb 12 23:44:46 debian-kamailio kamailio[27738]: ERROR: {1 1 INVITE 1-27864@127.0.1.1} rtpengine [rtpengine.c:3129]: send_rtpp_command(): timeout waiting reply for command "delete" from RTPEngine udp:192.168.X.X:2223 Feb 12 23:44:51 debian-kamailio kamailio[27738]: ERROR: {1 1 INVITE 1-27864@127.0.1.1} rtpengine [rtpengine.c:3129]: send_rtpp_command(): timeout waiting reply for command "delete" from RTPEngine udp:192.168.X.X:2223 Feb 12 23:44:56 debian-kamailio kamailio[27738]: ERROR: {1 1 INVITE 1-27864@127.0.1.1} rtpengine [rtpengine.c:3129]: send_rtpp_command(): timeout waiting reply for command "delete" from RTPEngine udp:192.168.X.X:2223 Feb 12 23:45:01 debian-kamailio kamailio[27738]: ERROR: {1 1 INVITE 1-27864@127.0.1.1} rtpengine [rtpengine.c:3129]: send_rtpp_command(): timeout waiting reply for command "delete" from RTPEngine udp:192.168.X.X:2223 Feb 12 23:45:06 debian-kamailio kamailio[27738]: ERROR: {1 1 INVITE 1-27864@127.0.1.1} rtpengine [rtpengine.c:3129]: send_rtpp_command(): timeout waiting reply for command "delete" from RTPEngine udp:192.168.X.X:2223 Feb 12 23:45:11 debian-kamailio kamailio[27738]: ERROR: {1 1 INVITE 1-27864@127.0.1.1} rtpengine [rtpengine.c:3129]: send_rtpp_command(): timeout waiting reply for command "delete" from RTPEngine udp:192.168.X.X:2223 Feb 12 23:45:16 debian-kamailio kamailio[27738]: ERROR: {1 1 INVITE 1-27864@127.0.1.1} rtpengine [rtpengine.c:3129]: send_rtpp_command(): timeout waiting reply for command "delete" from RTPEngine udp:192.168.X.X:2223 Feb 12 23:45:21 debian-kamailio kamailio[27738]: ERROR: {1 1 INVITE 1-27864@127.0.1.1} rtpengine [rtpengine.c:3129]: send_rtpp_command(): timeout waiting reply for command "delete" from RTPEngine udp:192.168.X.X:2223 Feb 12 23:45:26 debian-kamailio kamailio[27738]: ERROR: {1 1 INVITE 1-27864@127.0.1.1} rtpengine [rtpengine.c:3129]: send_rtpp_command(): timeout waiting reply for command "delete" from RTPEngine udp:192.168.X.X:2223 Feb 12 23:45:31 debian-kamailio kamailio[27738]: ERROR: {1 1 INVITE 1-27864@127.0.1.1} rtpengine [rtpengine.c:3129]: send_rtpp_command(): timeout waiting reply for command "delete" from RTPEngine udp:192.168.X.X:2223 Feb 12 23:45:36 debian-kamailio kamailio[27738]: ERROR: {1 1 INVITE 1-27864@127.0.1.1} rtpengine [rtpengine.c:3129]: send_rtpp_command(): timeout waiting reply for command "delete" from RTPEngine udp:192.168.X.X:2223 Feb 12 23:45:41 debian-kamailio kamailio[27738]: ERROR: {1 1 INVITE 1-27864@127.0.1.1} rtpengine [rtpengine.c:3129]: send_rtpp_command(): timeout waiting reply for command "delete" from RTPEngine udp:192.168.X.X:2223 Feb 12 23:45:46 debian-kamailio kamailio[27738]: ERROR: {1 1 INVITE 1-27864@127.0.1.1} rtpengine [rtpengine.c:3129]: send_rtpp_command(): timeout waiting reply for command "delete" from RTPEngine udp:192.168.X.X:2223 Feb 12 23:45:51 debian-kamailio kamailio[27738]: ERROR: {1 1 INVITE 1-27864@127.0.1.1} rtpengine [rtpengine.c:3129]: send_rtpp_command(): timeout waiting reply for command "delete" from RTPEngine udp:192.168.X.X:2223 Feb 12 23:45:56 debian-kamailio kamailio[27738]: ERROR: {1 1 INVITE 1-27864@127.0.1.1} rtpengine [rtpengine.c:3129]: send_rtpp_command(): timeout waiting reply for command "delete" from RTPEngine udp:192.168.X.X:2223 Feb 12 23:46:01 debian-kamailio kamailio[27738]: ERROR: {1 1 INVITE 1-27864@127.0.1.1} rtpengine [rtpengine.c:3129]: send_rtpp_command(): timeout waiting reply for command "delete" from RTPEngine udp:192.168.X.X:2223 Feb 12 23:46:06 debian-kamailio kamailio[27738]: ERROR: {1 1 INVITE 1-27864@127.0.1.1} rtpengine [rtpengine.c:3129]: send_rtpp_command(): timeout waiting reply for command "delete" from RTPEngine udp:192.168.X.X:2223 Feb 12 23:46:11 debian-kamailio kamailio[27738]: ERROR: {1 1 INVITE 1-27864@127.0.1.1} rtpengine [rtpengine.c:3129]: send_rtpp_command(): timeout waiting reply for command "delete" from RTPEngine udp:192.168.X.X:2223 Feb 12 23:46:16 debian-kamailio kamailio[27738]: ERROR: {1 1 INVITE 1-27864@127.0.1.1} rtpengine [rtpengine.c:3129]: send_rtpp_command(): timeout waiting reply for command "delete" from RTPEngine udp:192.168.X.X:2223 Feb 12 23:46:16 debian-kamailio kamailio[27738]: ERROR: {1 1 INVITE 1-27864@127.0.1.1} rtpengine [rtpengine.c:2729]: rtpp_function_call(): queried nodes limit reached Feb 12 23:46:16 debian-kamailio kamailio[27738]: INFO: {1 1 INVITE 1-27864@127.0.1.1} <script>: failure_route: completed calling rtpengine_manage Feb 12 23:46:16 debian-kamailio kamailio[27738]: WARNING: tm [t_reply.c:1094]: run_failure_handlers(): failure route execution took too long: 150153261 us ```
### Possible Solutions
To minimize the issue, `rtpengine_tout_ms` and `rtpengine_retr` can be reduced. This doesn't exactly the fix the problem, as you need reasonable values here.
Additionally, `queried_nodes_limit` can be reduced to limit the number of unnecessary retries, at the expense of the functionality it gives you to initially select candidate nodes.
**Actual solution:** I think that, when a call has already selected an rtpengine instance successfully, and that's in the hash table, then `queried_nodes_limit` should not apply. It should only attempt to send commands with a total number of attempts = `rtpengine_retr`, not `rtpengine_retr * queried_nodes_limit`.
By changing the behavior as above, tweaking `rtpengine_tout_ms` and `rtpengine_retr`, and increasing the number of child workers, an rtpengine becoming unavailable should cause much less impactful blocking.
## Additional Information
**Kamailio version:** (attempted on 5.4.1 and latest master)
``` version: kamailio 5.4.1 (x86_64/linux) flags: USE_TCP, USE_TLS, USE_SCTP, TLS_HOOKS, USE_RAW_SOCKS, DISABLE_NAGLE, USE_MCAST, DNS_IP_HACK, SHM_MMAP, PKG_MALLOC, Q_MALLOC, F_MALLOC, TLSF_MALLOC, DBG_SR_MEMORY, USE_FUTEX, FAST_LOCK-ADAPTIVE_WAIT, USE_DNS_CACHE, USE_DNS_FAILOVER, USE_NAPTR, USE_DST_BLACKLIST, HAVE_RESOLV_RES, TLS_PTHREAD_MUTEX_SHARED ADAPTIVE_WAIT_LOOPS 1024, MAX_RECV_BUFFER_SIZE 262144, MAX_URI_SIZE 1024, BUF_SIZE 65535, DEFAULT PKG_SIZE 8MB poll method support: poll, epoll_lt, epoll_et, sigio_rt, select. id: unknown compiled on 23:22:15 Aug 15 2022 with gcc 8.3.0 ```
``` version: kamailio 5.7.0-dev3 (x86_64/linux) 2384c3 flags: USE_TCP, USE_TLS, USE_SCTP, TLS_HOOKS, USE_RAW_SOCKS, DISABLE_NAGLE, USE_MCAST, DNS_IP_HACK, SHM_MMAP, PKG_MALLOC, MEM_JOIN_FREE, Q_MALLOC, F_MALLOC, TLSF_MALLOC, DBG_SR_MEMORY, USE_FUTEX, FAST_LOCK-ADAPTIVE_WAIT, USE_DNS_CACHE, USE_DNS_FAILOVER, USE_NAPTR, USE_DST_BLOCKLIST, HAVE_RESOLV_RES, TLS_PTHREAD_MUTEX_SHARED ADAPTIVE_WAIT_LOOPS 1024, MAX_RECV_BUFFER_SIZE 262144, MAX_URI_SIZE 1024, BUF_SIZE 65535, DEFAULT PKG_SIZE 8MB poll method support: poll, epoll_lt, epoll_et, sigio_rt, select. id: 2384c3 compiled on 23:22:22 Feb 12 2023 with gcc 8.3.0 ```
**Operating System**:
<!-- Details about the operating system, the type: Linux (e.g.,: Debian 8.4, Ubuntu 16.04, CentOS 7.1, ...), MacOS, xBSD, Solaris, ...; Kernel details (output of `lsb_release -a` and `uname -a`) -->
``` Distributor ID: Debian Description: Debian GNU/Linux 10 (buster) Release: 10 Codename: buster
Linux debian-kamailio 4.19.0-20-amd64 #1 SMP Debian 4.19.235-1 (2022-03-17) x86_64 GNU/Linux ```
Thanks for the detailed report. So you are actually using 30 different rtpengine daemons? Or are you saying that you are using e.g. 1 rtpengine deamon, but rtpengine is still trying it 30 times?
In this test it was 1 rtpengine daemon. The delete method was tried 30 times to the same IP:port, each try was 5 inner retries, so 150 times in total.
You get the same effect with more rtpengine daemons where a call is attempting to complete using a failed daemon.
In the logs, 192.168.X.X represents the same IP address throughout the logs.
Thanks, this sounds indeed like a bug. Maybe @rfuchs could comment as well, if there was any special reason for this behaviour.
Much of the load-balancing code in the module isn't mine, but it sure looks like a bug to me. From a quick glance it looks like there's a break condition missing.
@smititelu I would suggest to take the test against `queried_nodes_ptr` also into account in the `select_rtpp_node_old` code path, so that it doesn't keep returning the same node again. Ack?
@shaneqld are you able to retest your scenario with 3da4029 applied?
@rfuchs Thank you for this. I can confirm that this fixes the issue.
``` Feb 14 20:33:51 debian-kamailio kamailio[5066]: INFO: {1 63658613 INVITE 98dbb579-2773-123c-48af-80fa5d718101} <script>: failure_route: calling RTPMANAGE Feb 14 20:33:56 debian-kamailio kamailio[5066]: ERROR: {1 63658613 INVITE 98dbb579-2773-123c-48af-80fa5d718101} rtpengine [rtpengine.c:3129]: send_rtpp_command(): timeout waiting reply for command "delete" from RTPEngine udp:192.168.X.X:2223 Feb 14 20:33:56 debian-kamailio kamailio[5066]: ERROR: {1 63658613 INVITE 98dbb579-2773-123c-48af-80fa5d718101} rtpengine [rtpengine.c:3382]: select_rtpp_node(): rtpengine node for callid=98dbb579-2773-123c-48af-80fa5d718101 is known (udp:192.168.X.X:2223) but it has already been queried, therefore not returning it Feb 14 20:33:56 debian-kamailio kamailio[5066]: ERROR: {1 63658613 INVITE 98dbb579-2773-123c-48af-80fa5d718101} rtpengine [rtpengine.c:2735]: rtpp_function_call(): no available proxies Feb 14 20:33:56 debian-kamailio kamailio[5066]: INFO: {1 63658613 INVITE 98dbb579-2773-123c-48af-80fa5d718101} <script>: failure_route: completed calling RTPMANAGE ```
I can also confirm everything still works normally when rtpengine is up; and when there are multiple rtpengine instances and one is unavailable (ie initial node selection works as normal).
The only very minor observation I have is that the call remains in the hash table until `hash_table_tout` as observed with `kamcmd rtpengine.get_hash_total`. I assume this happened before, and it's very minor, but thought I'd mention it.
Apologies if info on release cycles is somewhere already & my eagerness, but can I ask when I should look out for a release with this patch?
Thanks for the confirmation. Regarding the release policy question - a minor release happens usually every few month for the respective branches, there should be probably soon a release.
Sorry for later reply. @rfuchs LGTM, ty for the fix! @shaneqld ty for bug report! I probably didn't think/test what happens with existing calls, only with new calls, when implementing the selection algo.
@rfuchs if you have some time, please merge your fix to git master, then we can also backport it.
Closed #3370 as completed via bd0ce126109f40f5217053f20e687b93b2268164.