[sr-dev] DMQ design issues and resulting problems

Daniel-Constantin Mierla miconda at gmail.com
Mon Apr 19 16:02:58 CEST 2021


Hello,

as I added support for other transports (tcp, tls, ...) to the DMQ
inter-node communication and tried to configure it for a few testing
scenarios, I ran into some design issues/limitations. Two of them seem
to prevent usual operations, therefore I want to get more opinions and
see what would be the best way forward.

1) There is no decoupling between server_address and the server
socket. The server socket is kept internally, but it is built from the
server_address parameter. The actual problem appears when trying to use
an FQDN as notification_address, because each node ends up with
duplicate peer addresses for the same node.

For example:

  - server1 with IP 1.2.3.4 and domain server1.sip.com

  - server2 with IP 5.6.7.8 and domain server2.sip.com

On server1:

modparam("dmq", "server_address", "sip:1.2.3.4:5060")
modparam("dmq", "notification_address", "sip:server2.sip.com:5060")

On server2:

modparam("dmq", "server_address", "sip:5.6.7.8:5060")
modparam("dmq", "notification_address", "sip:server1.sip.com:5060")

Each node then ends up with 4 peer nodes.

On server1:

  - sip:1.2.3.4:5060 (local=1)
  - sip:server1.sip.com:5060 (local=0)
  - sip:5.6.7.8:5060 (local=0)
  - sip:server2.sip.com:5060 (local=0)

On server2:

  - sip:1.2.3.4:5060 (local=0)
  - sip:server1.sip.com:5060 (local=0)
  - sip:5.6.7.8:5060 (local=1)
  - sip:server2.sip.com:5060 (local=0)

Practically, each server considers its local FQDN to be a remote peer.

There are KDMQ requests sent to itself, but the really problematic
issue is that presence replication is broken (as I tested; it could be
the case for other modules as well), because instead of an update a
replace happens. The case was a PUBLISH with a body having state open,
then after 30 seconds a PUBLISH to refresh using the same ETag and an
empty body (as per the spec); instead of just updating the expires
value, it also replaced the body with an empty string.
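For reference, a minimal sketch of such a refresh PUBLISH as per RFC
3903 (the ETag value and the AoR are made up, other mandatory SIP
headers are omitted); the entity-tag from the previous 200 OK goes into
SIP-If-Match and the body is empty:

  PUBLISH sip:alice@sip.com SIP/2.0
  Event: presence
  SIP-If-Match: a1b2c3d4
  Expires: 30
  Content-Length: 0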

If notification_address uses the IP address of the other server
instead of its FQDN, everything works as expected, with the body being
kept on refresh and only the expires value being updated.
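For example, on server1 the working variant would be:

modparam("dmq", "server_address", "sip:1.2.3.4:5060")
modparam("dmq", "notification_address", "sip:5.6.7.8:5060")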

The looping/spiralling defeats the purpose of the KDMQ replication
action.

Use of an FQDN is needed for the TLS transport in order to be able to
validate the domain against the attributes in the certificate.
Currently it does not work to have server_address with an FQDN (maybe
it would work with an advertised address on the listen socket, but that
would force its use in the SIP headers, which is not wanted).
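For illustration, what one would want to write for TLS, but which
currently does not work, would be something like this (the port and the
transport parameter are my assumption):

modparam("dmq", "server_address", "sip:server1.sip.com:5061;transport=tls")
modparam("dmq", "notification_address", "sip:server2.sip.com:5061;transport=tls")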

To solve it I would introduce a server_socket modparam; if it is not
set, it is computed from server_address as it is now. This keeps
backward compatibility.
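A sketch of how that could look on server1 (server_socket does not
exist yet, and the value format here is only a guess):

modparam("dmq", "server_address", "sip:server1.sip.com:5061;transport=tls")
modparam("dmq", "server_socket", "sip:1.2.3.4:5061;transport=tls")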

2) The second issue is somewhat related. As there was a need to remove
the FQDN and change notification_address to the IP of the other node, I
restarted each node, but the FQDNs stayed there, because after the
restart of server 1, it got the list with the FQDNs from server 2. Then
restarting server 2 ended up syncing with server 1 and receiving all
the addresses again. Practically, the only solution was to shut down
all the nodes, which is something one likely does not want to do,
because the entire platform is down with no active node; even if only
for a short time, in cases where data is kept only in memory (e.g.,
htable items, or in-memory-only presence) everything is lost.

In other words, there is no way to remove a peer address that still
points to an active node but is no longer wanted, because it persists
in the other running nodes. The same applies when changing the domain
used for the notification address.

This issue feeds back into 1), because one node then appears several
times (old FQDN and new FQDN), leading to loops/spirals with
unwanted/broken side effects.

I haven't thought much about it, but one solution could be an RPC
command to remove unwanted addresses from the list of peers. There can
still be a race if a sync happens immediately after the removal via
RPC, before the address is removed from the other node, but then one
can check with the RPC dmq.list_nodes and re-run the removal command if
that is the case.
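For example, with dmq.remove_node as a hypothetical name for the new
command:

kamcmd dmq.remove_node sip:server2.sip.com:5060
kamcmd dmq.list_nodes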

Given that Kamailio is in the testing phase for 5.5, I want to see if
anyone can think of other solutions to fix these issues than
introducing a new parameter (which, again, keeps backward
compatibility) and a new RPC command (which does not break existing
behaviour).

Cheers,
Daniel

-- 
Daniel-Constantin Mierla -- www.asipto.com
www.twitter.com/miconda -- www.linkedin.com/in/miconda
Kamailio Advanced Training - Online
May 17-20, 2021 (Europe Timezone) - June 7-10, 2021 (America Timezone)
  * https://www.asipto.com/sw/kamailio-advanced-training-online/



