Ozzyboshi created an issue (kamailio/kamailio#4503)
Hello,
on my Kamailio installation I am experiencing a significant memory leak in SHM. Here are the details of my system:
```
version: kamailio 6.0.3 (x86_64/linux)
flags: USE_TCP, USE_TLS, USE_SCTP, TLS_HOOKS, USE_RAW_SOCKS, DISABLE_NAGLE, USE_MCAST, NO_SIG_DEBUG, DNS_IP_HACK, SHM_MMAP, PKG_MALLOC, MEM_JOIN_FREE, Q_MALLOC, F_MALLOC, TLSF_MALLOC, DBG_SR_MEMORY, USE_FUTEX, FAST_LOCK-ADAPTIVE_WAIT, USE_DNS_CACHE, USE_DNS_FAILOVER, USE_NAPTR, USE_DST_BLOCKLIST, HAVE_RESOLV_RES, TLS_PTHREAD_MUTEX_SHARED
ADAPTIVE_WAIT_LOOPS: 1024
MAX_RECV_BUFFER_SIZE: 262144
MAX_SEND_BUFFER_SIZE: 262144
MAX_URI_SIZE: 1024
BUF_SIZE: 65535
DEFAULT PKG_SIZE: 8MB
poll method support: poll, epoll_lt, epoll_et, sigio_rt, select
compiled with gcc 14.2.0
```
The memory leak appears only when the presence feature is enabled.
When presence is active, Kamailio starts running dialog_publish(), whose code is here:
https://github.com/kamailio/kamailio/blob/9dc160d1d2bdf0542d3d9d8ae090bb1352...
This function does not send the PUBLISH directly: it calls pua_send_publish(), which is a function pointer referring to the send_publish() implementation in the pua module. Then send_publish() eventually calls set_uac_req() and tmb.t_request():
https://github.com/kamailio/kamailio/blob/9dc160d1d2bdf0542d3d9d8ae090bb1352...
Digging further, tmb.t_request() maps to request() in the TM module, which calls t_uac_with_ids() and then t_uac_prepare().
Now comes the suspicious part:
If I comment out the call to t_uac_prepare(), the memory leak disappears. This doesn’t necessarily mean the bug is inside t_uac_prepare(), but it’s a strong hint.
t_uac_prepare() allocates a new struct cell and returns it:
https://github.com/kamailio/kamailio/blob/9dc160d1d2bdf0542d3d9d8ae090bb1352...
My concern is: is this cell always freed?
The matching cleanup function is free_cell(), used only here:
https://github.com/kamailio/kamailio/blob/9dc160d1d2bdf0542d3d9d8ae090bb1352...
From what I can tell, free_cell() is called only if all these conditions are true:
- dst_cell == 0
- is_ack == 1
- dst_req == 0

In my situation no ACK is involved (Kamailio is a proxy that sends PUBLISH and immediately gets a 200 OK). Therefore, is_ack is always false, meaning the free_cell() cleanup logic is skipped entirely.
I tried forcing free_cell() unconditionally, but it leads to crashes, so clearly other parts of the code still rely on this structure.
Does the current free_cell() logic look correct to you? Is it expected that struct cell allocated by t_uac_prepare() remains unfreed in cases where PUBLISH → 200 OK occurs without ACK?
Any guidance on how to proceed or where else to look would be greatly appreciated.
Thanks
miconda left a comment (kamailio/kamailio#4503)
The ACK is only for INVITE requests, never for PUBLISH. ACK is also stateless, no transaction can be kept for it, because it does not receive responses.
A transaction that is created is kept in the internal hash table and it is supposed to be deleted by a timer routine, a few seconds after the transaction is completed (a final response is processed for it).
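To make that lifecycle concrete, here is a minimal C sketch. It is a toy model and not the tm module's code: every name in it (`toy_transaction`, `toy_create`, `toy_complete`, `toy_timer_run`, the fixed-size table) is hypothetical and chosen only for illustration. It shows the one property that matters for this issue: the timer routine only frees entries that were marked completed and whose wait period has elapsed.

```c
#include <stddef.h>

/* Hypothetical toy model of "kept in a table, freed by a timer".
 * None of these names exist in Kamailio. */
enum { TOY_TABLE_SIZE = 8 };

struct toy_transaction {
	int in_use;        /* slot is occupied */
	int completed;     /* a final response was processed */
	long wait_expires; /* tick after which the slot may be freed */
};

static struct toy_transaction toy_table[TOY_TABLE_SIZE];

/* Store a new transaction; returns the slot index, or -1 if the table is full. */
int toy_create(void)
{
	for(size_t i = 0; i < TOY_TABLE_SIZE; i++) {
		if(!toy_table[i].in_use) {
			toy_table[i].in_use = 1;
			toy_table[i].completed = 0;
			toy_table[i].wait_expires = 0;
			return (int)i;
		}
	}
	return -1;
}

/* Mark a transaction as completed and arm its wait period. */
void toy_complete(int slot, long now, long wait_ticks)
{
	if(slot < 0 || slot >= TOY_TABLE_SIZE || !toy_table[slot].in_use)
		return;
	toy_table[slot].completed = 1;
	toy_table[slot].wait_expires = now + wait_ticks;
}

/* Timer routine: free completed entries whose wait period has elapsed.
 * Returns the number of entries freed on this run. */
int toy_timer_run(long now)
{
	int freed = 0;
	for(size_t i = 0; i < TOY_TABLE_SIZE; i++) {
		struct toy_transaction *t = &toy_table[i];
		if(t->in_use && t->completed && now >= t->wait_expires) {
			t->in_use = 0;
			freed++;
		}
	}
	return freed;
}

/* Number of transactions currently held in the table. */
int toy_count(void)
{
	int n = 0;
	for(size_t i = 0; i < TOY_TABLE_SIZE; i++)
		n += toy_table[i].in_use;
	return n;
}
```

In this model an entry that never reaches `toy_complete()` stays in the table no matter how often the timer runs, which is the kind of symptom a growing transaction list would point to.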
Do you see the PUBLISH transaction with `kamctl rpc tm.list`?
Ozzyboshi left a comment (kamailio/kamailio#4503)
I generated some calls, then stopped. This is what I got:
```
# ngcp-kamcmd proxy tm.list | grep PUB | wc -l
48
# ngcp-kamcmd proxy tm.list | grep PUB | wc -l
26
# ngcp-kamcmd proxy tm.list | grep PUB | wc -l
12
# ngcp-kamcmd proxy tm.list | grep PUB | wc -l
8
# ngcp-kamcmd proxy tm.list | grep PUB | wc -l
2
# ngcp-kamcmd proxy tm.list | grep PUB | wc -l
0
```
So, as far as I can tell, there are no transactions left in memory and the memory should be freed.
However, if I keep hammering Kamailio for, say, 15 minutes with the same call, run thousands of calls and then stop, I get this:
```
# ngcp-kamcmd proxy tm.list
ERROR: reply too big
```
So, according to this, in some heavy traffic situations, it seems the transactions are not freed.
After that I ran ngcp-kamcmd proxy tm.clean and the whole memory leak disappeared.
The whole process is summarized here
<img width="661" height="288" alt="Image" src="https://github.com/user-attachments/assets/6da97b55-f5b2-4728-a77d-8669b078c5c3" />
This graph shows:
- 15.52: Started spawning thousands of calls.
- 16.07: Stopped spawning calls (the ongoing calls should last 2 or 3 minutes).
- 16.12: Calls should be terminated and no more dialogs or transactions should be open, but SHM is still at about 300MB. This is a plateau, and to me it looked like a memory leak. If I run kamcmd proxy tm.list I get a "too big reply" error, meaning that a lot of transactions are still open?
- 16.27: I ran tm.clean and SHM usage dropped immediately to almost zero.
So I was probably wrong: Kamailio is not really leaking, it is just keeping transactions in memory for some reason. Is there a way to increase the buffer so that kamcmd replies with some data instead of giving me the "reply too big" error?
Closed #4503 as completed.
Ozzyboshi left a comment (kamailio/kamailio#4503)
I did not want to close the issue ( i clicked on the wrong button ), is it possible to reopen at least to have some suggestions about how to proceed from @miconda ?
Reopened #4503.
henningw left a comment (kamailio/kamailio#4503)
Re-opened it. Regarding the "reply too big" error, have a look here: https://www.kamailio.org/docs/modules/stable/modules/ctl.html#binrpc_max_bod... - most probably you need to adapt the buffer size. Regarding the memory usage, are you using Linux tools or looking at the actual shared memory stats from Kamailio?
miconda left a comment (kamailio/kamailio#4503)
You should use `kamctl rpc tm.list`, it is more flexible for larger rpc responses. Or get the rpc response stored in a file with `RPCSTOREPATH=/tmp/kamailio-rpc-response.json kamctl rpc tm.list`.
Ozzyboshi left a comment (kamailio/kamailio#4503)
Sorry for the late reply. I added [binrpc_max_body_size](https://www.kamailio.org/docs/modules/stable/modules/ctl.html#binrpc_max_bod...) to my configuration with a very high number, something like this:
```
modparam("ctl", "binrpc", "unix:/run/kamailio/ctl.proxy.sock")
modparam("ctl", "binrpc", "tcp:127.0.0.1:5012")
modparam("ctl", "mode", 0600)
modparam("ctl", "user", "kamailio")
modparam("ctl", "group", "kamailio")
modparam("ctl", "binrpc_max_body_size", 1000000)
modparam("ctl", "binrpc_struct_max_body_size", 1000000)
modparam("ctl", "binrpc_buffer_size", 1000000)
```
but still
```
/usr/sbin/kamcmd -s /run/kamailio/ctl.proxy.sock tm.list
ERROR: reply too big
```
Apart from that, I think there is no memory leak inside Kamailio. However, it seems to me that in high-traffic situations, with a lot of transactions, Kamailio is unable to free all transactions and frees only a portion of them. Can you please point me to where "the timer routine" mentioned by miconda is implemented so I can take a look?
Answering @henningw: I wish I could use regular memory checkers and tools such as valgrind with Kamailio, but I do not think this is possible because Kamailio uses its own memory management routines. My data is extracted by our automated tools that poll Kamailio regularly.
Ozzyboshi left a comment (kamailio/kamailio#4503)
```
/* read reply */
memset(&in_pkt, 0, sizeof(in_pkt));
if((ret = get_reply(
		s, reply_buf, MAX_REPLY_SIZE, cookie, &in_pkt, &msg_body)) < 0) {
```
Maybe I am wrong, but it seems it all comes down to MAX_REPLY_SIZE, which is hardcoded:
#define MAX_REPLY_SIZE 128 * 1024
Ozzyboshi left a comment (kamailio/kamailio#4503)
Hello again, I would like to share the results of my investigation.
As a first step, I built Kamailio with a larger MAX_REPLY_SIZE to allow me to inspect stuck transactions. It would be nice to have this value configurable, but there is a problem: in several parts of the code there are local arrays sized using MAX_REPLY_SIZE, and if the value is too large the stack memory is not sufficient.
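One way around the stack limit, sketched here as an assumption and not as a proposed patch, is to allocate the reply buffer on the heap with a size chosen at run time. The names `reply_buf`, `reply_buf_init` and `reply_buf_free` are hypothetical and do not exist in kamcmd; the sketch only illustrates the idea of replacing a fixed-size local array.

```c
#include <stdlib.h>

/* Hypothetical heap-backed reply buffer; not part of kamcmd. */
struct reply_buf {
	unsigned char *data; /* zero-initialized storage, or NULL */
	size_t size;         /* number of bytes available in data */
};

/* Allocate a zeroed buffer of the requested size.
 * Returns 0 on success, -1 on invalid arguments or allocation failure. */
int reply_buf_init(struct reply_buf *rb, size_t size)
{
	if(rb == NULL || size == 0)
		return -1;
	rb->data = (unsigned char *)calloc(1, size);
	if(rb->data == NULL) {
		rb->size = 0;
		return -1;
	}
	rb->size = size;
	return 0;
}

/* Release the buffer and reset the struct so a double free is harmless. */
void reply_buf_free(struct reply_buf *rb)
{
	if(rb == NULL)
		return;
	free(rb->data);
	rb->data = NULL;
	rb->size = 0;
}
```

With a buffer like this, the size could come from a command-line option or environment variable instead of a compile-time constant, at the cost of one allocation per request.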
Aside from this limitation, increasing the value allowed me to capture some Call-IDs associated with memory leaks. All of them had one particular thing in common: an error that caused our Kamailio configuration file to execute the drop; instruction.
It seems that when drop; is executed after branching, the function wait_handler() (where I would expect the SHM memory associated with the transaction cell to be freed) is not executed, resulting in a memory leak.
So I would like to ask the community whether this could explain the memory leak I’m seeing. If I remove the drop; instruction, everything works as expected: when calls end, SHM memory goes back to almost zero.
Has anyone encountered a similar issue? What can I do to continue this investigation? How is it possible that the drop; instruction causes this behavior?
henningw left a comment (kamailio/kamailio#4503)
Thanks for the update. Are you using drop; in a branch_route, effectively discarding the branch, or are you executing it in the main request_route? Do you have maybe a small cfg that shows the problem?
Ozzyboshi left a comment (kamailio/kamailio#4503)
Hello @henningw and thanks for your help. As stated in my previous message, the "drop;" is executed after branching, to be more specific I do something like:
` t_on_branch(route1) `
and then
```
branch_route[route1] {
	route(route2);
}
```
inside route2 I invoke rtpengine to open audio ports
```
if(!rtpengine_offer($var(rtpp_flags), "auto-next")) {
	...
	route(ROUTE_DROP);
}
```
The memory leak occurs when rtpengine_offer() returns false; this causes the execution of ROUTE_DROP, which contains the "infamous" drop; instruction.
```
route[ROUTE_DROP] {
	# do some logging here
	drop;
}
```
When this happens I get some hanging transactions; I can see them using kamcmd:
` /usr/sbin/kamcmd tm.list `
The only way to get rid of them is to run
` /usr/sbin/kamcmd tm.clean `
Otherwise they stay there forever, no matter whether the transaction got a reply or not. I can easily reproduce this on a lab machine: if I replace the if(!rtpengine_offer($var(rtpp_flags), "auto-next")) line with if (1), I get the memory leak every time.
github-actions[bot] left a comment (kamailio/kamailio#4503)
This issue is stale because it has been open 6 weeks with no activity. Remove stale label or comment or this will be closed in 2 weeks.
Ozzyboshi left a comment (kamailio/kamailio#4503)
Hello @henningw, did you have time to look into this issue? I found the section in the core where drop; is executed, but I do not see any point where the transaction would be freed. Can you tell me where that happens, and whether this behaviour is by design?
henningw left a comment (kamailio/kamailio#4503)
Hello @Ozzyboshi, sorry, I did not manage to look into it. I would not expect a memory leak in standard configuration tasks. Could you provide a small configuration that shows this issue, to make it easier to reproduce?
Ozzyboshi left a comment (kamailio/kamailio#4503)
I do not have time right now to give you detailed instructions on how to reproduce. All I can say is: send an INVITE to Kamailio, then branch, then drop;, then run /usr/sbin/kamcmd tm.list
The transaction, I guess, should be there. I will try to give you more detailed instructions on how to reproduce when I get some more time.
github-actions[bot] left a comment (kamailio/kamailio#4503)
This issue is stale because it has been open 6 weeks with no activity. Remove stale label or comment or this will be closed in 2 weeks.
NormB left a comment (kamailio/kamailio#4503)
@Ozzyboshi I hope this patch resolves the problem you are having. Are you able to test it?
Ozzyboshi left a comment (kamailio/kamailio#4503)
We installed your small patch and from our point of view nothing has changed: we make a call, we branch, we issue a drop; command and, according to kamctl tm list, the transaction is still there and is not removed.
henningw left a comment (kamailio/kamailio#4503)
Thank you for the feedback. I have set the linked PR to draft for now.
NormB left a comment (kamailio/kamailio#4503)
@henningw Thank you for moving the PR to draft status. While it has resolved one leak, it clearly hasn't resolved the problem @Ozzyboshi is experiencing. I'll keep digging into it as there must be an explanation and fix for it somewhere.
@henningw Thanks for providing a link to the other document, I'll read it.
NormB left a comment (kamailio/kamailio#4503)
@Ozzyboshi Thanks for testing. I'd like to understand why the patch didn't help in your case so I can fix it properly.
I've added diagnostic logging to `t_unref()` to capture the actual `kr` value at cleanup time. This is the single value that determines whether the transaction gets cleaned up or leaked.
Could you apply this on top of your 6.0.3 source, rebuild, send one INVITE that hits the drop path, and share the log line?
```diff
--- a/src/modules/tm/t_lookup.c
+++ b/src/modules/tm/t_lookup.c
@@ -2056,6 +2056,14 @@ int t_unref(struct sip_msg *p_msg)
 	if(p_msg->first_line.type == SIP_REQUEST) {
 		kr = get_kr();
+		if(unlikely(kr != 0 && kr != REQ_FWDED)) {
+			LM_WARN("t_unref kr=%d (hex 0x%x) T=%p method=%.*s callid=%.*s\n",
+					kr, kr, T,
+					p_msg->first_line.u.request.method.len,
+					p_msg->first_line.u.request.method.s,
+					p_msg->callid->body.len,
+					p_msg->callid->body.s);
+		}
 		if(unlikely(kr == REQ_ERR_DELAYED)) {
```
The log will show something like `t_unref kr=16 (hex 0x10)` or `t_unref kr=17 (hex 0x11)`. Knowing the exact value tells me which cleanup path is failing and why.
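The difference between those two example values matters because `kr` is a bit mask. A minimal sketch, assuming `REQ_FWDED` is 1 and `REQ_ERR_DELAYED` is 16 (values inferred from the `kr=16` / `kr=17` examples above; check the tm headers of your build), shows how an equality test and a bit test disagree once a second bit is set. The `TOY_` constants and both function names are hypothetical.

```c
/* Assumed bit values, inferred from the kr=16 / kr=17 log examples.
 * They are illustrative; verify them against the tm headers. */
enum toy_kill_reason {
	TOY_REQ_FWDED = 1,
	TOY_REQ_RPLD = 2,
	TOY_REQ_RLSD = 4,
	TOY_REQ_ERR_DELAYED = 16
};

/* Equality test: true only when ERR_DELAYED is the sole bit set. */
int toy_delayed_exact(int kr)
{
	return kr == TOY_REQ_ERR_DELAYED;
}

/* Bit test: true whenever the ERR_DELAYED bit is set,
 * regardless of which other bits accompany it. */
int toy_delayed_mask(int kr)
{
	return (kr & TOY_REQ_ERR_DELAYED) != 0;
}
```

Under these assumptions, `kr=16` satisfies both tests, while `kr=17` (delayed error plus forwarded) satisfies only the bit test, so a cleanup path guarded by equality would be skipped for it.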
Two other things that would help:
1. After the patch is applied and the call hits the drop path, are you waiting at least 10 seconds before checking `kamcmd tm.list`? The fix makes `kill_transaction()` run, which sends a 500 and puts the transaction on the wait timer (`wt_timer`, default 5s). During those 5 seconds the transaction is still visible in `tm.list` — that's normal, it should disappear after the timer fires.
2. Since you're on Sipwise NGCP — does the Sipwise build carry patches to the TM module? If their `t_unref()` or `set_kr()` code differs from upstream 6.0.3, the patch might need adjustment.
NormB left a comment (kamailio/kamailio#4503)
@Ozzyboshi I found why the patch didn't help.
The initial fix only addressed one of two code paths that cause this leak. When `T_DISABLE_INTERNAL_REPLY` is active, the error handling in `t_relay_to()` skips `set_kr(REQ_ERR_DELAYED)` entirely — the flag the initial patch was checking for is never set. `T_DISABLE_INTERNAL_REPLY` is activated by:
- calling `t_set_disable_internal_reply(1)` before `t_relay()`
- passing flags that include 2 to `t_relay_to()`, e.g. `t_relay_to("proxy", "2")` or any of 2, 3, 6, 7
Does your config use any of these?
I've updated PR #4644 with a second change — a fallback in `t_unref()` that cleans up any transaction that ended up with no forwarded branches and no cleanup scheduled. Here are the test results on vanilla 6.0.3, 10,000 calls per scenario at 500/sec on 8 cores:
**Without `T_DISABLE_INTERNAL_REPLY`:**
| Config | No patch | Initial fix | Updated fix |
|--------|:--------:|:-----------:|:-----------:|
| `t_relay()` | 3548 | 0 | 0 |
| `t_relay_to()` | 3546 | 0 | 0 |
| `t_relay_to_udp("host","port")` | 3548 | 0 | 0 |
| `t_relay_to("proxy","1")` | 3707 | 0 | 0 |
| `t_relay_to("proxy","4")` | 3549 | 0 | 0 |
| `t_relay_to("proxy","5")` | 3703 | 0 | 0 |
**With `T_DISABLE_INTERNAL_REPLY`:**
| Config | No patch | Initial fix | Updated fix |
|--------|:--------:|:-----------:|:-----------:|
| `t_relay_to("proxy","2")` | 3549 | **3547** | 0 |
| `t_relay_to("proxy","3")` | 3703 | **3703** | 0 |
| `t_relay_to("proxy","6")` | 3547 | **3544** | 0 |
| `t_relay_to("proxy","7")` | 3702 | **3708** | 0 |
| `t_set_disable_internal_reply(1)` + `t_relay()` | 3545 | **3547** | 0 |
| `t_set_disable_internal_reply(1)` + `t_relay_to()` | 3542 | **3544** | 0 |
| `t_set_disable_internal_reply(1)` + `t_relay_to_udp()` | 3548 | **3548** | 0 |
The bold values show where the initial fix still leaked — every config where `T_DISABLE_INTERNAL_REPLY` is active. The updated fix resolves all 14 tested configurations.
Documented flags: 1 (no auto 100), 2 (no internal reply), 4 (no DNS failover). Flags 3, 5, 6, 7 are undocumented combinations tested for completeness.
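The split between the two tables follows from treating the flag value as a bit mask. A small sketch, using only the three bit meanings listed above and hypothetical `TOY_` names, enumerates which values from 0 to 7 carry the "no internal reply" bit:

```c
/* Bit meanings as listed above; the TOY_ names are hypothetical. */
enum toy_relay_flag {
	TOY_NO_AUTO_100 = 1,
	TOY_NO_INTERNAL_REPLY = 2,
	TOY_NO_DNS_FAILOVER = 4
};

/* True when the flag value includes the "no internal reply" bit. */
int toy_disables_internal_reply(unsigned flags)
{
	return (flags & TOY_NO_INTERNAL_REPLY) != 0;
}

/* Collect every value in 0..7 that includes the bit, in ascending
 * order. Returns how many values were written to out. */
int toy_list_disabling(unsigned out[8])
{
	int n = 0;
	for(unsigned flags = 0; flags < 8; flags++) {
		if(toy_disables_internal_reply(flags))
			out[n++] = flags;
	}
	return n;
}
```

The values it collects are 2, 3, 6 and 7, which are exactly the `t_relay_to()` rows in the second table; 1, 4 and 5 leave the bit clear and fall into the first.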
Would you be able to test the updated PR on your NGCP setup?
Ozzyboshi left a comment (kamailio/kamailio#4503)
sorry for delay, trying to test your patch today
Ozzyboshi left a comment (kamailio/kamailio#4503)
I installed this
```
 	if(p_msg->first_line.type == SIP_REQUEST) {
 		kr = get_kr();
-		if(unlikely(kr == REQ_ERR_DELAYED)) {
+		if(unlikely(kr & REQ_ERR_DELAYED)) {
 			LM_DBG("delayed error reply generation(%d)\n", tm_error);
 			if(unlikely(is_route_type(FAILURE_ROUTE))) {
 				LM_BUG("called w/ kr=REQ_ERR_DELAYED in failure"
@@ -2074,15 +2074,12 @@ int t_unref(struct sip_msg *p_msg)
 						&& !(kr & REQ_RLSD)))) {
 			LM_WARN("script writer didn't release transaction\n");
 			t_release_transaction(T);
-		} else if(unlikely((kr & REQ_ERR_DELAYED)
-						   && (kr
-							   & ~(REQ_RLSD | REQ_RPLD | REQ_ERR_DELAYED
-								   | REQ_FWDED)))) {
-			LM_BUG("REQ_ERR DELAYED should have been caught much"
-				   " earlier for %p: %d (hex %x)\n",
-					T, kr, kr);
+		} else if(unlikely(T->nr_of_outgoings == 0
+						   && !(T->flags & T_IN_AGONY))) {
+			LM_WARN("dropping orphaned transaction without branches"
+					" T=%p kr=%d\n", T, kr);
 			t_release_transaction(T);
-		}
+		}
 	}
```
and so far it seems the leak is gone. I will leave my test running for the whole day and then report back here.
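The new `else if` condition in that patch can be read as a small predicate: release the transaction when it has no outgoing branches and is not already being torn down. The sketch below models only that test. `toy_cell`, `TOY_T_IN_AGONY` and its bit value are hypothetical stand-ins, not the tm module's definitions.

```c
/* Hypothetical stand-in for the two fields the patch inspects. */
struct toy_cell {
	int nr_of_outgoings; /* number of branches that were sent out */
	unsigned flags;      /* state bits */
};

/* Hypothetical bit value; the real T_IN_AGONY may differ. */
#define TOY_T_IN_AGONY (1u << 0)

/* True when the transaction has no branches and no teardown is
 * already in progress, i.e. nothing else would clean it up. */
int toy_is_orphaned(const struct toy_cell *t)
{
	return t->nr_of_outgoings == 0 && !(t->flags & TOY_T_IN_AGONY);
}
```

A request dropped in a branch route before anything was forwarded would match this test in the model, while a transaction with at least one forwarded branch, or one already flagged for teardown, would not.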
Ozzyboshi left a comment (kamailio/kamailio#4503)
After a full day of tests the patch above seems to work perfectly. Thanks to @NormB for his help, much appreciated!
NormB left a comment (kamailio/kamailio#4503)
@Ozzyboshi @henningw Thanks for your patience in helping to get the fix tested. It's results like this that, for me, make it all worthwhile.
Norm
Ozzyboshi left a comment (kamailio/kamailio#4503)
is it possible to have https://github.com/kamailio/kamailio/pull/4644 merged into master so i can close this issue?
henningw left a comment (kamailio/kamailio#4503)
Thank you for testing and the feedback. It will be reviewed again and then merged, sure. Then the issue will be closed as well.
Closed #4503 as completed via #4644.