[Devel] tcp keepalive question

Tue Dec 19 10:34:51 CET 2006

On Tuesday 19 December 2006 08:59, Juha Heinanen wrote:
> Dan Pascu writes:
>  > When you combine this fact with the blocking nature of OpenSER, it
>  > can raise serious issues when many TCP clients are connected and
>  > there is a high churn rate among them combined with high SIP
>  > traffic.
>
> so your conclusion is that there is nothing openser can do to keep nat
> bindings open to tcp UAs and that they should take care of that
> themselves like the nokia phones do?

Not at all. Sorry if I caused this confusion; it is probably because I 
was highlighting a different, but related, issue.

Using some form of keepalive (be it client-based or server-based) is 
needed to keep the NAT binding from expiring; nobody is arguing with that. 
Also, a server-based solution like the one with TCP_KEEPALIVE has the 
advantage that it works with UAs that do not support TCP keepalives.
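For illustration, a server-side keepalive can be as simple as turning on 
the kernel's TCP keepalive for each accepted connection. This is a minimal 
sketch assuming Linux, where the per-socket knobs are TCP_KEEPIDLE, 
TCP_KEEPINTVL and TCP_KEEPCNT (other systems name them differently); 
enable_tcp_keepalive() and the values passed to it are hypothetical, not 
OpenSER code or settings. The idle time has to be shorter than the NAT's 
binding timeout for the probes to keep the binding alive.

```c
#define _DEFAULT_SOURCE
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Hypothetical helper: enable kernel TCP keepalives on one socket.
 * Assumes Linux option names.
 *   idle  - seconds without traffic before the first probe
 *   intvl - seconds between unanswered probes
 *   cnt   - unanswered probes before the kernel drops the connection */
static int enable_tcp_keepalive(int fd, int idle, int intvl, int cnt)
{
    int on = 1;

    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof(idle)) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl, sizeof(intvl)) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &cnt, sizeof(cnt)) < 0)
        return -1;
    return 0;
}
```

With idle=30, intvl=10 and cnt=3, a peer that disappears silently is only 
declared dead roughly 30 + 3*10 = 60 seconds after the last traffic on the 
connection, which leaves a window in which the server still believes the 
connection is alive.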

What I was saying is that, whether the UA does it or the server does it, 
when the moment comes for the server to send a request to the client it 
may block, even if the NAT binding was kept open.

Imagine the case where UA1 uses TCP and the NAT binding is kept open 
(whether from the client side or the server side doesn't matter). Assume 
the keepalive procedure was just performed and, a few seconds later, the 
connection with UA1 is severed because of some network connectivity 
issue. The server doesn't know this, since the TCP connection was not 
deliberately closed and the next TCP keepalive procedure is not due for a 
while. If at that moment a request comes in for UA1, openser will block 
until the send times out. It doesn't matter that this timeout may be 
small (say 3 seconds); the fact that all TCP workers may be blocked at 
some point for 3 seconds, unable to do anything, is bad.
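To make the blocking concrete, here is a minimal sketch of a bounded send 
of the kind a TCP worker might perform, assuming a non-blocking socket and 
Linux's MSG_NOSIGNAL. send_with_timeout() is a hypothetical helper, not 
OpenSER's actual write path. Once the socket cannot accept more data and 
no error comes back from the peer, the call simply sits in poll() until 
timeout_ms expires, and the worker running it cannot serve anything else 
meanwhile.

```c
#define _DEFAULT_SOURCE
#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <time.h>

/* Hypothetical helper: write len bytes to a non-blocking socket, waiting
 * at most timeout_ms in total. Returns 0 on success, -1 on error or on
 * timeout (errno == ETIMEDOUT). The caller is stuck here the whole time. */
static int send_with_timeout(int fd, const char *buf, size_t len, int timeout_ms)
{
    struct timespec start, now;

    clock_gettime(CLOCK_MONOTONIC, &start);
    while (len > 0) {
        ssize_t n = send(fd, buf, len, MSG_NOSIGNAL);
        if (n > 0) {                       /* progress: keep writing */
            buf += n;
            len -= (size_t)n;
            continue;
        }
        if (n < 0 && errno != EAGAIN && errno != EWOULDBLOCK && errno != EINTR)
            return -1;                     /* hard error, e.g. peer reset */

        clock_gettime(CLOCK_MONOTONIC, &now);
        long elapsed = (now.tv_sec - start.tv_sec) * 1000L
                     + (now.tv_nsec - start.tv_nsec) / 1000000L;
        if (elapsed >= timeout_ms) {       /* silent peer: give up */
            errno = ETIMEDOUT;
            return -1;
        }
        /* Sleep until the socket is writable or the time is up. */
        struct pollfd pfd = { .fd = fd, .events = POLLOUT };
        if (poll(&pfd, 1, (int)(timeout_ms - elapsed)) < 0 && errno != EINTR)
            return -1;
    }
    return 0;
}
```

A peer that never reads (standing in for a silently disconnected client) 
makes the call last the full timeout; a healthy peer returns at once.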

This cannot be solved with the current architecture, and the more TCP 
clients one has, the more likely it is that this scenario will happen.

IMO, OpenSER's architecture is well suited for UDP clients and, at most, 
a few TCP/TLS peers. But even in this case, if a peer is down and there 
are many connections towards that peer, again many TCP workers will be 
blocked, and other TCP peers cannot be handled meanwhile. In this case, 
though, the worst that can happen is that for a while (tens of seconds), 
until the connection with the disconnected peer times out and is closed, 
there will be a short gap in which TCP/TLS connections will not work or 
will work poorly; after that, things resume.

However, with many TCP clients, the chance that many of them get 
disconnected and keep TCP workers stuck waiting for a timeout, unable to 
process anything else, is much higher.
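A back-of-the-envelope model shows why this grows with the number of 
clients. Assume (a simplification, not a measurement) that every silent 
disconnect is hit by one request before keepalive notices it, and that 
each such request ties up a worker for a full send timeout. By Little's 
law, the average number of stuck workers is then the rate of such stalls 
times the timeout. expected_blocked_workers() and all numbers below are 
hypothetical.

```c
/* Hypothetical model, not OpenSER code.
 * Little's law: average number in the system = arrival rate * time spent.
 * Here the "arrivals" are requests that hit a silently dead connection,
 * and the "time spent" is the send timeout each of them waits out. */
static double expected_blocked_workers(double clients,
                                       double silent_drops_per_client_per_s,
                                       double send_timeout_s)
{
    double stall_rate = clients * silent_drops_per_client_per_s; /* stalls/s */
    return stall_rate * send_timeout_s;                          /* workers  */
}
```

For example, 20000 clients that each drop silently once every two hours 
produce about 2.8 stalls per second; with a 3 second timeout that keeps 
roughly 8.3 workers blocked on average, to be compared with the number of 
TCP workers configured. Doubling the clients doubles the figure, and 
bursts (say, an access network going down) are far worse than the average.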

The only way out of this is an architecture based around an event-driven 
reactor, or multithreading. Of course, in addition to these (which only 
guarantee that openser won't block on a rogue client), a TCP keepalive 
mechanism is still required to keep the NAT binding open; there is no 
argument about that.
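For comparison, a minimal single-threaded sketch of the event-driven 
option using poll(): one loop waits on all connections at once, writes to 
whichever is ready, and drops only the connection whose own deadline has 
passed. struct conn, reactor_flush() and MAX_CONNS are hypothetical names, 
this is not OpenSER code, and it assumes Linux (MSG_NOSIGNAL) and 
non-blocking sockets. It covers only the reactor variant; a multithreaded 
design would instead let each blocking send stall only its own thread.

```c
#define _DEFAULT_SOURCE
#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

#define MAX_CONNS 64            /* hypothetical sketch limit */

/* Hypothetical per-connection state. */
struct conn {
    int fd;                     /* non-blocking socket; -1 once dropped */
    const char *out;            /* bytes still to send */
    size_t out_len;
    long deadline_ms;           /* give up on this connection after this */
};

static long now_ms(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1000L + ts.tv_nsec / 1000000L;
}

/* Flush pending output on all connections from a single thread. A dead
 * client costs only its own deadline; others are served meanwhile.
 * Returns when nothing is left to send. */
static void reactor_flush(struct conn *c, int n)
{
    struct pollfd pfd[MAX_CONNS];
    int idx[MAX_CONNS];

    if (n > MAX_CONNS)
        return;
    for (;;) {
        long now = now_ms(), wait_ms = -1;
        int nfds = 0;

        for (int i = 0; i < n; i++) {
            if (c[i].fd < 0 || c[i].out_len == 0)
                continue;
            if (now >= c[i].deadline_ms) {   /* presumed dead: drop it only */
                close(c[i].fd);
                c[i].fd = -1;
                continue;
            }
            long left = c[i].deadline_ms - now;
            if (wait_ms < 0 || left < wait_ms)
                wait_ms = left;              /* wake at the nearest deadline */
            pfd[nfds] = (struct pollfd){ .fd = c[i].fd, .events = POLLOUT };
            idx[nfds++] = i;
        }
        if (nfds == 0)
            return;
        if (poll(pfd, nfds, (int)wait_ms) < 0) {
            if (errno == EINTR)
                continue;
            return;
        }
        for (int k = 0; k < nfds; k++) {
            struct conn *cc = &c[idx[k]];
            if (!(pfd[k].revents & (POLLOUT | POLLERR | POLLHUP)))
                continue;
            ssize_t w = send(cc->fd, cc->out, cc->out_len, MSG_NOSIGNAL);
            if (w > 0) {
                cc->out += w;
                cc->out_len -= (size_t)w;
            } else if (w < 0 && errno != EAGAIN && errno != EWOULDBLOCK
                       && errno != EINTR) {
                close(cc->fd);               /* hard error: drop this one */
                cc->fd = -1;
            }
        }
    }
}
```

In this sketch a healthy connection polled in the same loop as a dead one 
is written to as soon as it is writable, instead of waiting behind it.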

-- 
Dan