Description

When handling a PUBLISH we call handle_publish() and NOTIFYs are sent to all of the corresponding active_watchers (as expected). However, when NOTIFYs time out (408), the corresponding entries in the active_watchers table are not deleted as expected, even though we have timeout_rm_subs enabled (see the configuration below). Furthermore, we've noticed that NOTIFYs are being sent to active_watchers that have already expired (i.e. expires < UNIX_TIMESTAMP()), and when we run kamcmd presence.cleanup, no expired entries are removed from the active_watchers table.

We suspect that all of these issues are related; the common theme is that records aren't deleted when expected.

Troubleshooting

Reproduction

In our setup, we're using Kamailio as a "presence server" (via the presence, presence_dialoginfo, and presence_xml modules). We're using subs_db_mode 3 (DB-only scheme) and we have multiple Kamailio instances connected to a shared database (MySQL 8.0.27).

Overall, everything seems to work as expected. However, as stale entries accumulate in the active_watchers table, we find ourselves wasting more and more time sending NOTIFYs to black holes. This generates a lot of traffic, and waiting for those NOTIFYs to time out is causing memory issues and backlogs.

Here are the relevant portions of our kamailio.cfg file:

# ----- presence params -----
modparam("presence", "db_table_lock_type", 0)  # Disable locking; MySQL has issues with this is enabled.
modparam("presence", "db_update_period", -1)  # Disable synchronization.
modparam("presence", "db_url", PRESENCE_DB_URL)
modparam("presence", "expires_offset", 60)  # Force the client to send an UPDATE before the old PUBLISH expires.
modparam("presence", "max_expires", 1800)
modparam("presence", "min_expires", 1700)
modparam("presence", "publ_cache", 0)  # Disable the PUBLISH cache since the database is shared.
modparam("presence", "server_address", "sip:$CLUSTER_DOMAIN_NAME:5060")  # This becomes the value of the Contact header.
modparam("presence", "sip_uri_match", 1)  # Use case insensitive URI matching.
modparam("presence", "subs_db_mode", 3)  # Database-only scheme; everything is stored in the database.
modparam("presence", "notifier_processes", 0)  # Caution! Under load a race condition can cause CSeq's to be reused.
modparam("presence", "timeout_rm_subs", 1)

# ----- presence_dialoginfo params -----
modparam("presence_dialoginfo", "force_single_dialog", 1)  # Maybe not all phones support multiple "dialog" elements?
modparam("presence_dialoginfo", "force_dummy_dialog", 1)  # Maybe not all phones support a null body?

# ----- presence_xml params -----
modparam("presence_xml", "db_url", PRESENCE_DB_URL)
modparam("presence_xml", "force_active", 1)  # Skip permission/XCAP checks.
modparam("presence_xml", "force_dummy_presence", 1)  # Default to a simple "open" status when presentity info is unavailable.

# ...

route[PRESENCE] {
    if (!is_method("PUBLISH|SUBSCRIBE")) {
        return;
    }

    if (!t_newtran()) {
        sl_reply_error();
        exit;
    }

    if (is_method("PUBLISH")) {
        handle_publish();
        t_release();
    } else if (is_method("SUBSCRIBE")) {
        handle_subscribe();
        t_release();
    }
    exit;
}

SIP Traffic

Here's a somewhat sanitized example (the message itself seems OK to us; however, the Subscription-State: terminated;reason=timeout header does make us wonder: do we, as the sender, know that the client has terminated/timed out?):

2022/04/05 21:09:55.209846 10.21.3.12:5060 -> 10.31.0.226:6060
NOTIFY sip:SomeUser@192.168.86.24:54639;alias=123.21.125.232~54639~1 SIP/2.0
Via: SIP/2.0/UDP presence-w.staging.internal:5060;branch=z9hG4bK43ea.648a1952000000000000000000000000.0
To: <sip:SomeOtherUser@9bfadf66-a77b-6a69-25f3-02d96d4aa946>;tag=2607596073
From: <sip:SomeUser@9bfadf66-a77b-6a69-25f3-02d96d4aa946>;tag=69309ea83adcd977af8788878e9f31b3-42e32342
CSeq: 66 NOTIFY
Call-ID: 0_2607659559@192.168.86.24
Route: <sip:10.31.0.226:6060;r2=on;lr;ftag=2607596073>, <sip:55.8.122.110;r2=on;lr;ftag=2607596073>
Content-Length: 710
Max-Forwards: 70
Event: dialog
Contact: <sip:presence-w.staging.internal:5060>
Subscription-State: terminated;reason=timeout
Content-Type: application/dialog-info+xml

<?xml version="1.0"?>
<dialog-info xmlns="urn:ietf:params:xml:ns:dialog-info" version="66" state="full" entity="sip:SomeUser@9bfadf66-a77b-6a69-25f3-02d96d4aa946">
  <dialog id="0_1364146118@192.168.1.244" call-id="0_1364146118@192.168.1.244" direction="initiator">
    <state>confirmed</state>
    <remote>
      <identity>sip:4355558565@9bfadf66-a77b-6a69-25f3-02d96d4aa945:5060</identity>
      <target uri="sip:4355558565@9bfadf66-a77b-6a69-25f3-02d96d4aa946:5060"/>
    </remote>
    <local>
      <identity>sip:SomeUser@9bfadf66-a77b-6a69-25f3-02d96d4aa946:5060</identity>
      <target uri="sip:SomeUser@123.130.50.202:58872"/>
    </local>
  </dialog>
</dialog-info>

Possible Solutions

We didn't see any functions in the presence module that we could call directly to clean things up. One thought we had was to manually run some database commands from event_route[presence:notify-reply] (or in a reply_route); rough sketches of what we have in mind are below. We've also noticed that once the problematic entries are manually removed from the database, Kamailio no longer attempts to send NOTIFYs to the defunct destinations.
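To make the event_route idea concrete, here is a rough, untested sketch using sqlops (the "pres" connection name is our placeholder; it would point at the same database as PRESENCE_DB_URL). Two assumptions to flag: we have not verified that tmx's $T_reply_code exposes the NOTIFY's final response code inside event_route[presence:notify-reply], and we are assuming the NOTIFY's Call-ID ($ci) matches the callid column of active_watchers.

loadmodule "sqlops.so"

# Hypothetical connection; point it at the same database as PRESENCE_DB_URL.
modparam("sqlops", "sqlcon", "pres=>mysql://kamailio:xxxxxxxx@db-host/kamailio")

event_route[presence:notify-reply] {
    # Requires the tmx module. Assumption: $T_reply_code holds the final
    # response code of the NOTIFY transaction when this event_route runs.
    if ($T_reply_code == 408) {
        # Assumption: $ci is the subscription dialog's Call-ID, i.e. the
        # callid column of active_watchers. Matching on callid alone may be
        # too loose; to_tag/from_tag could be added to the WHERE clause.
        # (Value escaping is omitted for brevity.)
        sql_query("pres", "delete from active_watchers where callid='$ci'");
    }
}

A cruder stop-gap would be to replicate what we expected kamcmd presence.cleanup to do and periodically purge expired rows ourselves, e.g. with rtimer (the route name and the 300-second interval are arbitrary):

loadmodule "rtimer.so"

modparam("rtimer", "timer", "name=preswatch;interval=300;mode=1;")
modparam("rtimer", "exec", "timer=preswatch;route=PRESENCE_GC")

route[PRESENCE_GC] {
    # Remove watchers whose absolute expiration time is already in the past,
    # mirroring the expires < UNIX_TIMESTAMP() check described above.
    sql_query("pres", "delete from active_watchers where expires < UNIX_TIMESTAMP()");
}

Both of these are workarounds, though; ideally the module itself (or presence.cleanup) would remove the rows so that behavior stays consistent across our instances.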

Additional Information

version: kamailio 5.5.4 (x86_64/linux) 
flags: USE_TCP, USE_TLS, USE_SCTP, TLS_HOOKS, USE_RAW_SOCKS, DISABLE_NAGLE, USE_MCAST, DNS_IP_HACK, SHM_MMAP, PKG_MALLOC, Q_MALLOC, F_MALLOC, TLSF_MALLOC, DBG_SR_MEMORY, USE_FUTEX, FAST_LOCK-ADAPTIVE_WAIT, USE_DNS_CACHE, USE_DNS_FAILOVER, USE_NAPTR, USE_DST_BLOCKLIST, HAVE_RESOLV_RES, TLS_PTHREAD_MUTEX_SHARED
ADAPTIVE_WAIT_LOOPS 1024, MAX_RECV_BUFFER_SIZE 262144, MAX_URI_SIZE 1024, BUF_SIZE 65535, DEFAULT PKG_SIZE 8MB
poll method support: poll, epoll_lt, epoll_et, sigio_rt, select.
id: unknown 
compiled with gcc 10.2.1
$ lsb_release -a
No LSB modules are available.
Distributor ID:	Debian
Description:	Debian GNU/Linux 11 (bullseye)
Release:	11
Codename:	bullseye

$ uname -a
Linux ip-10-21-3-12 5.10.0-13-cloud-amd64 #1 SMP Debian 5.10.106-1 (2022-03-17) x86_64 GNU/Linux

