Hi,
I have a theory about the leak. I am not fully convinced this is the cause, since someone surely must have hit the same issue before me, so I'd like to hear a second opinion :)
When new memory cells are created in tm, in new_t() (tm/t_lookup.c:1275), their reference count is set to 2 (line 1298). As the comment there explains, one reference is for the hash table, and the other is for the global variable T, which points to the current transaction in that Kamailio process.
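To make the counting concrete, here is a minimal sketch in C of how I picture it. The names (demo_cell, demo_cell_init, demo_cell_unref) are made up for illustration and are not the real tm structures or functions; the only thing taken from the source is the initial count of 2.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a tm transaction cell; NOT the real struct cell. */
struct demo_cell {
    int ref_count; /* number of holders that still point at this cell */
    int freed;     /* set instead of calling free(), so the demo can inspect it */
};

/* Mirrors the idea described for new_t(): start with two references,
 * one for the hash table and one for the per-process current transaction. */
static void demo_cell_init(struct demo_cell *c)
{
    assert(c != NULL);
    c->ref_count = 2;
    c->freed = 0;
}

/* Drop one reference; mark the cell as released only when nobody holds it. */
static void demo_cell_unref(struct demo_cell *c)
{
    assert(c != NULL && c->ref_count > 0);
    if (--c->ref_count == 0)
        c->freed = 1;
}
```

In this model a cell needs exactly two unref calls before it is released; a single unref leaves it allocated with a count of 1.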
Here is a gdb step-through of calling t_continue_helper() (tm/t_suspend.c:166) with a transaction that remains unfreed:
(gdb) def tn
Type commands for definition of "tn".
End with a line saying just "end".
print t->ref_count
end
(gdb)
(gdb) break tm/t_suspend.c:195
Breakpoint 1 at 0x7f921d2bb492: file t_suspend.c, line 195.

Breakpoint 1, t_continue_helper (hash_index=16625, label=2123948819, rtact=0x7f921f8a5840, cbname=0x0, cbparam=0x0) at t_suspend.c:195
warning: Source file is more recent than executable.
195 if (!(t->flags & T_ASYNC_SUSPENDED)) {
(gdb) n
200 if (t->flags & T_CANCELED) {
(gdb)
212 LOCK_ASYNC_CONTINUE(t);
(gdb)
214 t->flags |= T_ASYNC_CONTINUE; /* we can now know anywhere in kamailio
(gdb)
218 t->flags &= ~T_ASYNC_SUSPENDED;
(gdb) tn
$1 = {val = 2}
221 cb_type = FAILURE_CB_TYPE;
(gdb)
$2 = {val = 2}
222 switch (t->async_backup.backup_route) {
(gdb)
$3 = {val = 2}
224 cb_type = FAILURE_CB_TYPE;
(gdb)
$4 = {val = 2}
225 break;
(gdb)
$5 = {val = 2}
237 if(t->async_backup.backup_route != TM_ONREPLY_ROUTE) {
(gdb)
$6 = {val = 2}
240 branch = t->async_backup.blind_uac;
(gdb)
$7 = {val = 2}
241 if (branch >= 0) {
(gdb)
$8 = {val = 2}
242 stop_rb_timers(&t->uac[branch].request);
(gdb)
$9 = {val = 2}
244 if (t->uac[branch].last_received != 0) {
(gdb)
$10 = {val = 2}
262 t->uac[branch].last_received=500;
(gdb)
$11 = {val = 2}
263 if(t->uac[branch].reply!=NULL) {
(gdb)
$12 = {val = 2}
269 t->uac[branch].reply=FAKED_REPLY;
(gdb)
$13 = {val = 2}
271 uac = &t->uac[branch];
(gdb)
$14 = {val = 2}
285 faked_req = fake_req(t->uas.request, 0 /* extra flags */, uac,
(gdb)
$15 = {val = 2}
287 if (faked_req==NULL) {
(gdb)
$16 = {val = 2}
292 faked_env( t, faked_req, 1);
(gdb)
$17 = {val = 2}
294 route_type_bk = get_route_type();
(gdb)
$18 = {val = 2}
295 set_route_type(FAILURE_ROUTE);
(gdb)
$19 = {val = 2}
297 if (exec_pre_script_cb(faked_req, cb_type)>0) {
(gdb)
$20 = {val = 2}
298 if(rtact!=NULL) {
(gdb)
$21 = {val = 2}
299 if (run_top_route(rtact, faked_req, 0)<0) {
(gdb)
$22 = {val = 2}
322 exec_post_script_cb(faked_req, cb_type);
(gdb)
$23 = {val = 2}
324 set_route_type(route_type_bk);
(gdb)
$24 = {val = 2}
329 faked_env( t, 0, 1);
(gdb)
$25 = {val = 2}
331 t->uas.request->flags = faked_req->flags;
(gdb)
$26 = {val = 2}
333 free_faked_req(faked_req, faked_req_len);
(gdb)
$27 = {val = 2}
336 if (t->uas.status < 200) {
(gdb)
$28 = {val = 2}
340 for ( branch = 0;
(gdb)
$29 = {val = 2}
341 branch < t->nr_of_outgoings;
(gdb)
$30 = {val = 2}
340 for ( branch = 0;
(gdb)
$31 = {val = 2}
344 if (t->uac[branch].last_received < 200)
(gdb)
$32 = {val = 2}
342 branch++
(gdb)
$33 = {val = 2}
341 branch < t->nr_of_outgoings;
(gdb)
$34 = {val = 2}
340 for ( branch = 0;
(gdb)
$35 = {val = 2}
344 if (t->uac[branch].last_received < 200)
(gdb)
$36 = {val = 2}
345 break;
(gdb)
$37 = {val = 2}
348 if (branch == t->nr_of_outgoings) {
(gdb)
$38 = {val = 2}
482 t->flags &= ~T_ASYNC_CONTINUE;
(gdb)
$39 = {val = 2}
483 if(t->async_backup.backup_route == TM_ONREPLY_ROUTE) {
(gdb)
$40 = {val = 2}
491 UNLOCK_ASYNC_CONTINUE(t);
(gdb)
$41 = {val = 2}
493 if(t->async_backup.backup_route != TM_ONREPLY_ROUTE){
(gdb)
$42 = {val = 2}
496 t_unref(t->uas.request);
(gdb)
$43 = {val = 2}
543 return 0;
(gdb)
$44 = {val = 1}
570 }
(gdb)
$45 = {val = 1}
t_continue (hash_index=16625, label=2123948819, route=0x7f921f8a5840) at t_suspend.c:576
576 }
As you can see, at the end there is still one reference to the cell. If I understand how this works correctly, after a while the request should be freed by a timer. However, line 242 (stop_rb_timers(&t->uac[branch].request)) disables that timer, hence the leak.
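Here is a toy model in C of what I suspect happens. It is only a sketch of my hypothesis, not how tm actually releases cells: the struct, the function names, and the assumption that a timer drops the last reference are all mine.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a suspended transaction; NOT the real tm cell. */
struct demo_txn {
    int ref_count;   /* outstanding references */
    int timer_armed; /* 1 while a cleanup timer is scheduled */
    int freed;       /* 1 once the last reference has been dropped */
};

/* Drop one reference and mark the transaction freed at zero. */
static void demo_txn_unref(struct demo_txn *t)
{
    assert(t != NULL);
    if (t->ref_count > 0 && --t->ref_count == 0)
        t->freed = 1;
}

/* Models the suspected effect of stop_rb_timers(): no cleanup will fire. */
static void demo_stop_timer(struct demo_txn *t)
{
    t->timer_armed = 0;
}

/* Assumption: when the timer fires, it drops the remaining reference. */
static void demo_timer_fire(struct demo_txn *t)
{
    if (t->timer_armed) {
        t->timer_armed = 0;
        demo_txn_unref(t);
    }
}

/* Walks the path from the trace: optionally stop the timer, drop the
 * per-process reference (as t_unref() does), then let the timer run.
 * Returns 1 if the transaction ended up freed, 0 if it leaked. */
static int demo_continue(int stop_timer)
{
    struct demo_txn t = { 2, 1, 0 };

    if (stop_timer)
        demo_stop_timer(&t);
    demo_txn_unref(&t);   /* count goes 2 -> 1, as in the gdb trace */
    demo_timer_fire(&t);  /* no-op if the timer was stopped */
    return t.freed;
}
```

In this model demo_continue(1) leaves the count at 1 and the cell is never freed, while demo_continue(0) frees it. Whether the real timer is what drops that last reference is exactly the part I am unsure about.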
Can you share your thoughts on this? If this is indeed the issue, I am struggling to come up with a decent solution.
Greetings, Ivo
On 05/29/2018 12:02 AM, Daniel-Constantin Mierla wrote:
Hello,
could you look at those transactions and get more of their details? You can try the RPC command:
- https://www.kamailio.org/docs/modules/stable/modules/tm.html#tm.rpc.list
Or also with gdb if you are familiar with this tool.
The goal is to figure out, among other things, whether the related call was completed and whether the transaction was resumed/continued.
Is this running on a virtual machine/cloud? If yes, what kind?
Cheers, Daniel
On 28.05.18 17:01, Ivaylo Markov wrote:
Hello,
I am trying to set up Kamailio as a push notifications proxy, closely following the example in the "Kamailio in a Mobile World" presentation (https://www.slideshare.net/FedericoCabiddu/kamailioinamobileworld-51617342). I am running Debian 9 and Kamailio 5.1.3 from the official Debian repositories. I believe the main modules involved in the issue below are tm, tmx, and tsilo.
Every call passing through the proxy leads to a small memory leak in the tm module - there is a large number of "delayed free" memory cells from tm's internal hash table. At some point the shared memory runs out and Kamailio restarts. Using the "kamcmd corex.shm_summary" command I was able to see that the top users of shared memory are "tm: h_table.c: build_cell" and "core: core/sip_msg_clone.c: sip_msg_shm_clone", both with the same allocation count.
I experimented with removing different parts of the configuration and noticed that commenting out the "t_continue(...)" call in the "PUSHJOIN" route (see slide #22) prevents the leak from happening. Maybe something in that function is incrementing the reference counter to the hash table cell, but it is not decrementing the counter when done?
I tried looking around the source code of the tm and tmx modules, but saw nothing suspicious. I also tried using gdb with a breakpoint in t_continue_helper (tm/t_suspend.c:166) hoping to see what else is accessing the htable cell, but was unable to find anything of use.
Has anyone encountered anything like this? Can you give me more directions on debugging this? I can provide some bits of the configuration, but an entire test setup would be rather difficult, unfortunately.
Thank you for your time, Ivo