Hi,
I am using Kamailio v5.2 as a WebRTC proxy for SIP clients. As part of my setup, I call a REST API with http_client_query() in event_route[xhttp:request], before calling ws_handle_handshake(), to authenticate the user connecting via WebSockets and retrieve some user details (routing etc.).
This worked fine until increased load on the system made API responses slow, which had a knock-on effect on users who were already connected. Specifically, we saw mass WebSocket disconnects for existing users, believed to be due to overly aggressive timeout settings on the reverse proxy in front of Kamailio.
My understanding is that event_route[xhttp:request] runs in the same TCP worker processes that handle SIP traffic, so could slow API requests over the network potentially block all of those workers? Could this also affect the keepalive processing that the WebSocket module uses to check existing WebSocket connections? We were using ping keepalives, again with pretty aggressive timers...
To resolve this issue, I am looking at whether I can make the API requests asynchronously so that the TCP workers are not blocked. My first thought was to use http_async_query() in event_route[xhttp:request] and call ws_handle_handshake() in the HTTP_REPLY route once the API call had completed, but I get "ERROR: websocket [ws_handshake.c:143]: ws_handle_handshake(): retrieving connection". I presume this approach is a dead end. Looking at newer Kamailio versions (5.5), there still doesn't seem to be any way to do this, correct?
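For illustration, the attempt looks roughly like the sketch below. It is simplified rather than my exact config: the URL is a placeholder and the status check is only an assumption about how one might gate the handshake.

```cfg
event_route[xhttp:request] {
    # Placeholder URL, not the real API endpoint.
    # The query is sent without blocking this worker; the reply is
    # handled later in route[HTTP_REPLY].
    http_async_query("http://api.example.invalid/auth", "HTTP_REPLY");
}

route[HTTP_REPLY] {
    # $http_ok and $http_rs are provided by http_async_client.
    if ($http_ok && $http_rs == 200) {
        # This is the call that logs "retrieving connection" for me,
        # presumably because the original connection context is not
        # available in the reply route.
        ws_handle_handshake();
    }
}
```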
As an alternative to http_async_query() in event_route[xhttp:request], I presume I could set a short timeout for http_client_query(), but my concern is that slow API calls for new WebSocket connections could still block existing ones in this case...
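As far as I can tell from the http_client module documentation, the timeout would be set with the connection_timeout module parameter, in seconds. The value below is only an example, not something I have tuned:

```cfg
# Illustrative value: give up on the API call after 2 seconds.
modparam("http_client", "connection_timeout", 2)
```

Even with this, each slow call would still hold a TCP worker for up to the timeout.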
My other thought was to move the API call out of event_route[xhttp:request] and into my route handler for REGISTER requests, so that I could offload new REGISTER requests to async workers and make the API call there without blocking existing connections. Does this seem like a reasonable approach to the problem? Given that my current script does not use any async workers, how would one go about tuning the ratio of plain SIP TCP workers to async workers? Are there any guides for this?
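A minimal sketch of what I have in mind, assuming the tm and async modules are loaded. The worker count, the route name REGISTER_API and the URL are all placeholders I made up for the example:

```cfg
# Core parameter: number of async worker processes (placeholder value).
async_workers=4

request_route {
    # Other request handling omitted.
    if (is_method("REGISTER")) {
        # Hand the request over to an async worker so the TCP worker
        # that received it is free again.
        async_task_route("REGISTER_API");
        exit;
    }
}

route[REGISTER_API] {
    # The blocking call now runs in an async worker process.
    http_client_query("http://api.example.invalid/auth", "$var(result)");
    if ($rc != 200) {
        sl_send_reply("403", "Forbidden");
        exit;
    }
    # Normal REGISTER handling would continue here.
}
```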
In general, for any potentially blocking API or other call over the network, is it best practice to offload it to async workers? For example, we run RTPEngine on separate hosts from Kamailio, so should I be wrapping rtpengine_offer/answer calls in async workers in case of a slow network?
Apologies for the long text, but I would really appreciate any help to understand these problems.
Thank you