[sr-dev] [kamailio/kamailio] [presence] `active_watchers` entries not removed on timeout or expiration. (Issue #3074)

Mike Moore notifications at github.com
Wed Apr 6 00:06:01 CEST 2022


### Description

<!--
Explain what you did, what you expected to happen, and what actually happened.
-->
 
When handling a `PUBLISH` we call [handle_publish()](https://kamailio.org/docs/modules/5.5.x/modules/presence.html#presence.f.handle_publish) and `NOTIFY`s are sent to all of the corresponding `active_watchers` (as expected). However, when `NOTIFY`s timeout (`408`), we're seeing that the corresponding entries in the `active_watchers` table are _not_ being deleted as expected. Furthermore, we've noticed that `NOTIFY`s are being sent to `active_watchers` which are expired (i.e. `expires < UNIX_TIMESTAMP()`) and when we run `kamcmd presence.cleanup`, no expired entries are removed from the `active_watchers` table.

We suspect that all of these things might be related--the common theme being that records aren't deleted when expected. 

### Troubleshooting

#### Reproduction

<!--
If the issue can be reproduced, describe how it can be done.
-->

In our setup, we're using Kamailio as a "presence server" (via the [presence](https://kamailio.org/docs/modules/5.5.x/modules/presence.html), [presence_dialoginfo](https://kamailio.org/docs/modules/5.5.x/modules/presence_dialoginfo.html), and [presence_xml](https://kamailio.org/docs/modules/5.5.x/modules/presence_xml.html) modules). We're using [`subs_db_mode`](https://kamailio.org/docs/modules/5.5.x/modules/presence.html#presence.p.subs_db_mode) `3` (DB-only scheme) and we have multiple Kamailio instances connected to a shared database (MySQL `8.0.27`). 

Everything seems to be working as expected. However, as we accumulated stale entries in the `active_watchers` table we're finding that we're wasting more and more time on sending `NOTIFY`s to black holes. We're generating a _lot_ of traffic and waiting for the timeouts to hit is causing memory issues and backups.

Here are the relevant portions of our `kamailio.cfg` file:
```
# ----- presence params -----
modparam("presence", "db_table_lock_type", 0)  # Disable locking; MySQL has issues with this is enabled.
modparam("presence", "db_update_period", -1)  # Disable synchronization.
modparam("presence", "db_url", PRESENCE_DB_URL)
modparam("presence", "expires_offset", 60)  # Force the client to send an UPDATE before the old PUBLISH expires.
modparam("presence", "max_expires", 1800)
modparam("presence", "min_expires", 1700)
modparam("presence", "publ_cache", 0)  # Disable the PUBLISH cache since the database is shared.
modparam("presence", "server_address", "sip:$CLUSTER_DOMAIN_NAME:5060")  # This becomes the value of the Contact header.
modparam("presence", "sip_uri_match", 1)  # Use case insensitive URI matching.
modparam("presence", "subs_db_mode", 3)  # Database-only scheme; everything is stored in the database.
modparam("presence", "notifier_processes", 0)  # Caution! Under load a race condition can cause CSeq's to be reused.
modparam("presence", "timeout_rm_subs", 1)

# ----- presence_dialoginfo params -----
modparam("presence_dialoginfo", "force_single_dialog", 1)  # Maybe not all phones support multiple "dialog" elements?
modparam("presence_dialoginfo", "force_dummy_dialog", 1)  # Maybe not all phones support a null body?

# ----- presence_xml params -----
modparam("presence_xml", "db_url", PRESENCE_DB_URL)
modparam("presence_xml", "force_active", 1)  # Skip permission/XCAP checks.
modparam("presence_xml", "force_dummy_presence", 1)  # Default to a simple "open" status when presentity info is unavailable.

# ...

route[PRESENCE] {
    if (!is_method("PUBLISH|SUBSCRIBE")) {
        return;
    }

    if (!t_newtran()) {
        sl_reply_error();
        exit;
    }

    if (is_method("PUBLISH")) {
        handle_publish();
        t_release();
    } else if (is_method("SUBSCRIBE")) {
        handle_subscribe();
        t_release();
    }
    exit;
}
```

#### SIP Traffic

<!--
If the issue is exposed by processing specific SIP messages, grab them with ngrep or save in a pcap file, then add them next, or attach to issue, or provide a link to download them (e.g., to a pastebin site).
-->

Here's a somewhat sanitized example (the message seems OK to us; however, the `Subscription-State: terminated; reason=timeout` does make us wonder--do we as the sender _know_ that the client is terminated/timed-out?):
```
2022/04/05 21:09:55.209846 10.21.3.12:5060 -> 10.31.0.226:6060
NOTIFY sip:SomeUser at 192.168.86.24:54639;alias=123.21.125.232~54639~1 SIP/2.0
Via: SIP/2.0/UDP presence-w.staging.internal:5060;branch=z9hG4bK43ea.648a1952000000000000000000000000.0
To: <sip:SomeOtherUser at 9bfadf66-a77b-6a69-25f3-02d96d4aa946>;tag=2607596073
From: <sip:SomeUser at 9bfadf66-a77b-6a69-25f3-02d96d4aa946>;tag=69309ea83adcd977af8788878e9f31b3-42e32342
CSeq: 66 NOTIFY
Call-ID: 0_2607659559 at 192.168.86.24
Route: <sip:10.31.0.226:6060;r2=on;lr;ftag=2607596073>, <sip:55.8.122.110;r2=on;lr;ftag=2607596073>
Content-Length: 710
Max-Forwards: 70
Event: dialog
Contact: <sip:presence-w.staging.internal:5060>
Subscription-State: terminated;reason=timeout
Content-Type: application/dialog-info+xml

<?xml version="1.0"?>
<dialog-info xmlns="urn:ietf:params:xml:ns:dialog-info" version="66" state="full" entity="sip:SomeUser at 9bfadf66-a77b-6a69-25f3-02d96d4aa946">
  <dialog id="0_1364146118 at 192.168.1.244" call-id="0_1364146118 at 192.168.1.244" direction="initiator">
    <state>confirmed</state>
    <remote>
      <identity>sip:4355558565 at 9bfadf66-a77b-6a69-25f3-02d96d4aa945:5060</identity>
      <target uri="sip:4355558565 at 9bfadf66-a77b-6a69-25f3-02d96d4aa946:5060"/>
    </remote>
    <local>
      <identity>sip:SomeUser at 9bfadf66-a77b-6a69-25f3-02d96d4aa946:5060</identity>
      <target uri="sip:SomeUser at 123.130.50.202:58872"/>
    </local>
  </dialog>
</dialog-info>
```

### Possible Solutions

<!--
If you found a solution or workaround for the issue, describe it. Ideally, provide a pull request with a fix.
-->

We didn't see any functions in the `presence` module that we could call directly to clean things up. One thought we had was to manually run some database commands from `event_route[presence:notify-reply]` (or in a `reply_route`). We've noticed that once the problematic entries are manually removed from the database that we no longer attempt to send `NOTIFY`s to the defunct destinations.

### Additional Information

  * **Kamailio Version** - output of `kamailio -v`

```
version: kamailio 5.5.4 (x86_64/linux) 
flags: USE_TCP, USE_TLS, USE_SCTP, TLS_HOOKS, USE_RAW_SOCKS, DISABLE_NAGLE, USE_MCAST, DNS_IP_HACK, SHM_MMAP, PKG_MALLOC, Q_MALLOC, F_MALLOC, TLSF_MALLOC, DBG_SR_MEMORY, USE_FUTEX, FAST_LOCK-ADAPTIVE_WAIT, USE_DNS_CACHE, USE_DNS_FAILOVER, USE_NAPTR, USE_DST_BLOCKLIST, HAVE_RESOLV_RES, TLS_PTHREAD_MUTEX_SHARED
ADAPTIVE_WAIT_LOOPS 1024, MAX_RECV_BUFFER_SIZE 262144, MAX_URI_SIZE 1024, BUF_SIZE 65535, DEFAULT PKG_SIZE 8MB
poll method support: poll, epoll_lt, epoll_et, sigio_rt, select.
id: unknown 
compiled with gcc 10.2.1
```

* **Operating System**:

<!--
Details about the operating system, the type: Linux (e.g.,: Debian 8.4, Ubuntu 16.04, CentOS 7.1, ...), MacOS, xBSD, Solaris, ...;
Kernel details (output of `lsb_release -a` and `uname -a`)
-->

```
$ lsb_release -a
No LSB modules are available.
Distributor ID:	Debian
Description:	Debian GNU/Linux 11 (bullseye)
Release:	11
Codename:	bullseye

$ uname -a
Linux ip-10-21-3-12 5.10.0-13-cloud-amd64 #1 SMP Debian 5.10.106-1 (2022-03-17) x86_64 GNU/Linux
```


-- 
Reply to this email directly or view it on GitHub:
https://github.com/kamailio/kamailio/issues/3074
You are receiving this because you are subscribed to this thread.

Message ID: <kamailio/kamailio/issues/3074 at github.com>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.kamailio.org/pipermail/sr-dev/attachments/20220405/c06a5073/attachment-0001.htm>


More information about the sr-dev mailing list