Hi all,
We're running Kamailio 6.1.3 (also seen on 6.1.2) with the `siptrace` module sending HEP duplicates to a single HOMER over TCP.
We use the .deb packages from the official Kamailio repository, currently on Debian 11 and 13.
On every Kamailio instance in our fleet, idle TCP connections to HOMER accumulate indefinitely on the `tcp_main` process, and on the higher-latency instances this eventually triggers a CRITICAL cascade that we can only recover from with a restart.
Relevant config:
tcp_connection_lifetime = 600
tcp_max_connections = 8192
modparam("siptrace", "duplicate_uri", "sip:<homer>:9060;transport=tcp")
modparam("siptrace", "hep_mode_on", 1)
modparam("siptrace", "hep_version", 3)
modparam("siptrace", "trace_mode", 1)
modparam("siptrace", "force_send_sock", "sip:<our_ip>:5060;transport=tcp")
`sip_trace()` is called once near the top of `request_route`, after `sanity_check`, with no method filtering — every accepted request is forwarded to HOMER.
UDP is unfortunately no longer an option for us (WebRTC/mobile SDP payloads exceed UDP MTU).
What we observe:
1. A continuous stream of:
ERROR: <core> [core/tcp_main.c:655]: _wbufq_add(): write queue full or timeout
ERROR: siptrace [siptrace_hep.c:234]: trace_send_hep3_duplicate(): cannot send hep duplicate message
at ~3,000–6,000/hour.
2. `ss` shows ~2,200 ESTAB sockets from one kamailio to HOMER, all owned by the `tcp_main` PID,
all `Recv-Q=0 Send-Q=0` with only kernel keepalive timers.
Exactly one of them is the actively-used socket; the rest sit idle and are not released.
New zombies accumulate at ~160/hour, so `tcp_max_connections=8192` is hit in ~50 hours.
3. Eventually `tcp_main` emits:
CRITICAL: <core> [core/io_wait.h:596]: io_watch_del(): invalid fd 8378, not in [0, 200)
followed by a flood of `handle_ser_child(): received CON_ERROR` with rising connection IDs.
After this the Kamailio process is impaired and only a restart recovers it.
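For reference, the ~50-hour figure in item 2 is simple arithmetic. Below is a minimal sketch, assuming the leak rate stays constant at the observed ~160/hour; the function name `hours_until_limit` is ours and purely illustrative, not anything from Kamailio.

```python
def hours_until_limit(max_connections: int,
                      leak_per_hour: float,
                      already_open: int = 0) -> float:
    """Hours until the connection limit is reached at a constant leak rate.

    Assumption: the leak rate does not change over time, which is a
    simplification of what we actually observe.
    """
    if leak_per_hour <= 0:
        raise ValueError("leak_per_hour must be positive")
    remaining = max(max_connections - already_open, 0)
    return remaining / leak_per_hour


# Values taken from the observations above.
from_restart = hours_until_limit(8192, 160)          # roughly 51 hours
from_snapshot = hours_until_limit(8192, 160, 2200)   # roughly 37 hours

print(f"from a fresh restart: {from_restart:.0f} h")
print(f"from the ~2,200-socket snapshot: {from_snapshot:.0f} h")
```

The second call shows that an instance already at ~2,200 idle sockets has noticeably less headroom than the 50-hour figure suggests.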
The rate of zombie accumulation, and consequently how often the CRITICAL hits,
correlates strongly with RTT to HOMER: kamailios in the same region as HOMER barely leak;
kamailios ~100 ms away leak several times faster and are the ones where the cascade recurs.
The `_wbufq_add` timeouts being the precursor is consistent with this:
at higher RTT, a single TCP socket drains much more slowly, so the
per-socket write queue fills faster under the same HEP load.
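To illustrate why RTT matters here, a back-of-the-envelope sketch using the usual window/RTT bound on a single TCP connection's throughput. The 64 KiB in-flight window and the 1 ms in-region RTT are assumptions chosen for illustration; we have not measured the effective window on these sockets, and real stacks with window scaling will behave differently.

```python
def max_drain_rate(window_bytes: int, rtt_seconds: float) -> float:
    """Upper bound on bytes/second one TCP connection can drain.

    Standard approximation: at most one window of data can be in
    flight per round trip, so throughput <= window / RTT.
    """
    if rtt_seconds <= 0:
        raise ValueError("rtt_seconds must be positive")
    return window_bytes / rtt_seconds


WINDOW = 64 * 1024  # hypothetical in-flight window, NOT a measured value

near = max_drain_rate(WINDOW, 0.001)  # assumed ~1 ms RTT, same region
far = max_drain_rate(WINDOW, 0.100)   # ~100 ms RTT, as on the affected hosts

print(f"same-region ceiling: {near / 1e6:.1f} MB/s")
print(f"100 ms ceiling:      {far / 1e6:.2f} MB/s")
print(f"ratio:               {near / far:.0f}x")
```

Under these assumptions the distant hosts have about a hundredth of the per-socket drain capacity, so the same HEP load backs up into the write queue much sooner.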
We've done an AI-assisted (Claude) source-level reading of the relevant paths in `tcp_main`
(`tcp_send`, `_tcpconn_find`, `_wbufq_add`, `tcpconn_main_timeout`).
From that reading, the connection lookup and the per-connection timeout cleanup
(`tcpconn_main_timeout`, which is what should close idle connections after `tcp_connection_lifetime`)
look correct in isolation.
It seems the connections become invisible to that cleanup path somewhere we haven't
tracked down, most plausibly because they end up with `reader_pid != 0` and the reader's
idle handling doesn't release them.
However, we couldn't confirm this from static reading alone.
Has anyone else seen this pattern?
Is `reader_pid` holding idle outbound connections indefinitely a known limitation?
We have a much longer write-up with `ss` snapshots, per-hour error counts, and the exact lines we walked through in the 6.1 source.
We're happy to share it on-list or open a GitHub issue with it, whichever is preferred.
Thanks and best regards,
Florian Floimair