This was discussed a lot during the early days of SER, and many times
since over the past 15 years.
Transaction persistence is not easy, because of its complex relations
with the retransmission timers over UDP, and also with connections for
TCP/TLS. Each transaction has many states, bound in particular to each
outgoing branch, and the branches can be at different phases.
After a restart, a connection to a device behind NAT cannot be
re-established, so all those TCP/TLS transactions are dead anyhow. For
UDP, each branch can be in a different state, with provisional replies
received or not, and with some of the branches at different
retransmission intervals (now even more complex, given that with tsilo
one can add a branch at any time).
It means investing a lot of resources to get a solution for 0.01% of
the cases or less, most of which sort themselves out fairly nicely. If
someone has those resources and considers this critical for them, I
won't have anything against it, provided that the start of the server
is not significantly delayed (or an option to turn this off exists).
You can do some more workarounds from the config to narrow down these cases:
- drop replies if associated transactions don't exist
- use htable to store where branches are going and then send CANCEL to
them (htable can save to db at shutdown)
- instead of htable, external systems can be used for storage, like
redis or any sql via sqlops. Anyhow, you will hit them only when the
INVITE transaction doesn't exist. You can also enable storage in them
via some rpc command (e.g. turning a $shv() on), a few minutes before
you expect a restart.
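
A rough config sketch of the first two workarounds above, not a tested
setup. It assumes the tm, sl, pv and htable modules are loaded. The
table name "cdst", the shared variable "persist", the route names
RELAY_INVITE and HANDLE_CANCEL, and the db_url credentials are
placeholders made up for illustration; route(RELAY) is the one from the
stock config. It keeps a single destination per Call-ID, so parallel
forking would need a per-branch key. Whether the next hop matches the
statelessly forwarded CANCEL to the original INVITE depends on the Via
branch generated for it, which this sketch does not address.

```
# Hypothetical names: "cdst" (htable) and "persist" (shared variable).
modparam("htable", "db_url", "mysql://kamailio:kamailiorw@localhost/kamailio")
# dbmode=1 writes the table back to the database at shutdown.
modparam("htable", "htable", "cdst=>size=10;autoexpire=300;dbtable=htable;dbmode=1")
# Off by default; turn it on via RPC shortly before a planned restart.
modparam("pv", "shvset", "persist=i:0")

# Drop replies that do not match an existing transaction.
reply_route {
    if (!t_check_trans()) {
        drop;
    }
}

# Call this for an INVITE once $du holds the selected destination.
route[RELAY_INVITE] {
    if ($shv(persist) == 1) {
        $sht(cdst=>$ci) = $du;
    }
    if (!t_relay()) {
        sl_reply_error();
    }
    exit;
}

# CANCEL: stateful when the INVITE transaction exists, otherwise fall
# back to the stored destination.
route[HANDLE_CANCEL] {
    if (t_check_trans()) {
        route(RELAY);
        exit;
    }
    if ($sht(cdst=>$ci) != $null) {
        $du = $sht(cdst=>$ci);
        forward();
    }
    exit;
}
```

The htable lookup is reached only when the INVITE transaction is
missing, so the normal path costs at most one extra htable write per
call while storage is enabled.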
Also, with the newly proposed option of writing routing logic in
embedded scripting languages in 5.0, restarts should be needed less and
less, as routing rules/blocks can be reloaded. So it should help on
this topic as well.
Cheers,
Daniel
On 14/03/16 22:00, Alex Balashov wrote:
Currently (AFAIK), restarting Kamailio amidst production call
processing is basically "safe"; except for no availability during the
few seconds it takes to restart (which will just result in
retransmissions until it is available), most things will happen
"correctly" after restart even though TM state has been lost:
(1) Initial requests will be routed as initial requests always were;
(2) In-dialog requests will be loose-routed as sequential requests
always were;
(3) Replies to open transactions will fall back to stateless routing
but will be delivered correctly to their destinations based on SIP
fundamentals (i.e. Via).
(4) rtpproxy & rtpengine control messages are grouped by Call-ID, so
also stateless. If the proper destroy/remove functions are not called
from failure_route[] due to lack of TM state, it's not so bad;
rtpproxy & rtpengine will see an RTP timeout after a while and expire
the bindings on their own.
If dialog state is used, it will be lost, but assuming one is willing
to live with that, it's okay. I don't know if there has been any work
done to create a persistence layer for dialog that can be re-read
completely on startup, and if it actually works - does it? - but it's
a relatively small price to pay if it's important to integrate a
change into production in the middle of the day.
The one exception is CANCEL handling. CANCEL is a special animal,
since it's a hop-by-hop (branch-level) request, so CANCELs sent from a
caller apply to the 'caller -> Kamailio' branch. Kamailio generates
separate CANCELs endogenously for one or more 'Kamailio -> gateway'
branches.
Stateful CANCEL handling with TM is implemented using t_check_trans()
or t_relay_cancel(). For example, in the stock config[1]
    # CANCEL processing
    if (is_method("CANCEL")) {
        if (t_check_trans()) {
            route(RELAY);
        }
        exit;
    }
Or, as in our case, more folklorically:
    if(is_method("CANCEL")) {
        if(!t_relay_cancel()) {
            # Corresponding INVITE transaction found, but error
            # occurred.
            sl_send_reply("500", "Internal Server Error");
            exit;
        }
        # Corresponding INVITE transaction for CANCEL was not
        # found.
        exit;
    }
In both cases, the corresponding INVITE transaction must exist.
Unfortunately, there's no good alternative. According to RFC 3261
Section 16.11 ("Stateless Proxy"):
Stateless proxies MUST NOT perform special processing for
CANCEL requests. They are processed by the above rules
as any other requests.
So, in other words, route_logic(CANCEL) == route_logic(initial INVITE).
Sometimes, this is possible - with considerable config logic labour -
but other times the path taken by the CANCEL is not so deterministic,
as for example with round-robin load balancing, random distribution,
complex LCR, etc. One is basically left in these situations with the
choice of implementing one's own Call-ID/branch => destination
persistent state database of some kind, which is, to put it mildly,
complicated and undesirable.
Now, if the INVITE transaction receives a final negative reply, this
will get back to the calling UAC, and it will process it correctly.
However, some calls get answered with 2xx. Many UACs will behave
reasonably in this situation: when they don't receive a 200 OK for
their CANCELs but later receive an answer, they will go ahead and send
the end-to-end ACK, then BYE the call. However, they cannot be
reliably counted upon to do this. Some simply drop the INVITE
transaction after their CANCEL has gone unreplied for a short time,
regardless of whether they receive a final negative reply for it.
Is there a better way? Perhaps a feature can be devised by which
Kamailio keeps some kind of lightweight and restart-persistent map to
which to send the CANCELs? Or perhaps TM is due for a feature that
allows the shm transaction table to be dumped to disk and persisted
across restarts?
Comments welcome. Also, if I'm missing something, please let me know!
-- Alex
[1]
https://github.com/kamailio/kamailio/blob/master/etc/kamailio.cfg#L466
--
Daniel-Constantin Mierla
http://www.asipto.com