[SR-Users] Memory leak in tm with push notifications

Ivaylo Markov ivo at schupen.net
Tue May 29 18:04:50 CEST 2018


Hi,

I have a theory about the leak. I am not fully convinced this is the
cause, since someone surely must have hit the same issue before me, so
I'd like to hear a second opinion :)

When a new memory cell is created in tm, in new_t()
(tm/t_lookup.c:1275), its reference count is set to 2 (line 1298).
As the comment there explains, one reference is for the hash table,
and one is for the global variable T pointing to the current
transaction in that Kamailio process.

Here is a gdb step-through of calling t_continue_helper()
(tm/t_suspend.c:166) with a transaction that remains unfreed:

(gdb) def tn
Type commands for definition of "tn".
End with a line saying just "end".
>print t->ref_count
>end
(gdb)

(gdb) break tm/t_suspend.c:195
Breakpoint 1 at 0x7f921d2bb492: file t_suspend.c, line 195.

Breakpoint 1, t_continue_helper (hash_index=16625, label=2123948819, rtact=0x7f921f8a5840, cbname=0x0, cbparam=0x0) at t_suspend.c:195
warning: Source file is more recent than executable.
195     if (!(t->flags & T_ASYNC_SUSPENDED)) {
(gdb) n
200     if (t->flags & T_CANCELED) {
(gdb) 
212     LOCK_ASYNC_CONTINUE(t);
(gdb) 
214     t->flags |= T_ASYNC_CONTINUE;   /* we can now know anywhere in kamailio
(gdb) 
218     t->flags &= ~T_ASYNC_SUSPENDED;
(gdb) tn
$1 = {val = 2}
221     cb_type =  FAILURE_CB_TYPE;
(gdb) 
$2 = {val = 2}
222     switch (t->async_backup.backup_route) {
(gdb) 
$3 = {val = 2}
224             cb_type = FAILURE_CB_TYPE;
(gdb) 
$4 = {val = 2}
225             break;
(gdb) 
$5 = {val = 2}
237     if(t->async_backup.backup_route != TM_ONREPLY_ROUTE) {
(gdb) 
$6 = {val = 2}
240         branch = t->async_backup.blind_uac;
(gdb) 
$7 = {val = 2}
241         if (branch >= 0) {
(gdb) 
$8 = {val = 2}
242             stop_rb_timers(&t->uac[branch].request);
(gdb) 
$9 = {val = 2}
244             if (t->uac[branch].last_received != 0) {
(gdb) 
$10 = {val = 2}
262             t->uac[branch].last_received=500;
(gdb) 
$11 = {val = 2}
263             if(t->uac[branch].reply!=NULL) {
(gdb) 
$12 = {val = 2}
269                 t->uac[branch].reply=FAKED_REPLY;
(gdb) 
$13 = {val = 2}
271             uac = &t->uac[branch];
(gdb) 
$14 = {val = 2}
285         faked_req = fake_req(t->uas.request, 0 /* extra flags */, uac,
(gdb)
$15 = {val = 2}
287         if (faked_req==NULL) {
(gdb) 
$16 = {val = 2}
292         faked_env( t, faked_req, 1);
(gdb) 
$17 = {val = 2}
294         route_type_bk = get_route_type();
(gdb) 
$18 = {val = 2}
295         set_route_type(FAILURE_ROUTE);
(gdb) 
$19 = {val = 2}
297         if (exec_pre_script_cb(faked_req, cb_type)>0) {
(gdb) 
$20 = {val = 2}
298             if(rtact!=NULL) {
(gdb) 
$21 = {val = 2}
299                 if (run_top_route(rtact, faked_req, 0)<0) {
(gdb) 
$22 = {val = 2}
322             exec_post_script_cb(faked_req, cb_type);
(gdb) 
$23 = {val = 2}
324         set_route_type(route_type_bk);
(gdb) 
$24 = {val = 2}
329         faked_env( t, 0, 1);
(gdb) 
$25 = {val = 2}
331         t->uas.request->flags = faked_req->flags;
(gdb) 
$26 = {val = 2}
333         free_faked_req(faked_req, faked_req_len);
(gdb) 
$27 = {val = 2}
336         if (t->uas.status < 200) {
(gdb) 
$28 = {val = 2}
340             for (   branch = 0;
(gdb) 
$29 = {val = 2}
341                 branch < t->nr_of_outgoings;
(gdb) 
$30 = {val = 2}
340             for (   branch = 0;
(gdb) 
$31 = {val = 2}
344                 if (t->uac[branch].last_received < 200)
(gdb) 
$32 = {val = 2}
342                 branch++
(gdb) 
$33 = {val = 2}
341                 branch < t->nr_of_outgoings;
(gdb) 
$34 = {val = 2}
340             for (   branch = 0;
(gdb) 
$35 = {val = 2}
344                 if (t->uac[branch].last_received < 200)
(gdb) 
$36 = {val = 2}
345                     break;
(gdb) 
$37 = {val = 2}
348             if (branch == t->nr_of_outgoings) {
(gdb) 
$38 = {val = 2}
482     t->flags &= ~T_ASYNC_CONTINUE;
(gdb) 
$39 = {val = 2}
483     if(t->async_backup.backup_route == TM_ONREPLY_ROUTE) {
(gdb) 
$40 = {val = 2}
491     UNLOCK_ASYNC_CONTINUE(t);
(gdb) 
$41 = {val = 2}
493     if(t->async_backup.backup_route != TM_ONREPLY_ROUTE){
(gdb) 
$42 = {val = 2}
496         t_unref(t->uas.request);
(gdb)
$43 = {val = 2}
543     return 0;
(gdb)
$44 = {val = 1}
570 }
(gdb) 
$45 = {val = 1}
t_continue (hash_index=16625, label=2123948819, route=0x7f921f8a5840) at t_suspend.c:576
576 }


As you can see, at the end there is still one reference to the cell.
If I understand this correctly, the cell should eventually be freed by
a timer. However, line 242
(stop_rb_timers(&t->uac[branch].request)) stops that timer, hence the
leak.

Can you share your thoughts on this? If this is indeed the issue, I am
struggling to come up with a decent solution.

Greetings,
Ivo


On 05/29/2018 12:02 AM, Daniel-Constantin Mierla wrote:
> Hello,
>
> could you look at those transactions and see more of their details? You
> can try with rpc command:
>
>   - https://www.kamailio.org/docs/modules/stable/modules/tm.html#tm.rpc.list
>
> Or also with gdb if you are familiar with this tool.
>
> Among the scopes is to figure out if the related call was completed, if
> the transaction was resumed/continued...
>
> Is this running on a virtual machine/cloud? If yes, what kind?
>
> Cheers,
> Daniel
>
>
> On 28.05.18 17:01, Ivaylo Markov wrote:
>> Hello,
>>
>> I am trying to set up Kamailio as a push notifications proxy, closely
>> following the example in the "Kamailio in a Mobile World" presentation
>> (https://www.slideshare.net/FedericoCabiddu/kamailioinamobileworld-51617342).
>> I am running Debian 9 and Kamailio 5.1.3 from the official Debian
>> repositories.
>> I believe the main modules involved in the issue below are tm, tmx, and
>> tsilo.
>>
>> Every call passing through the proxy leads to a small memory leak in the tm
>> module - there is a large amount of "delayed free" memory cells from tm's
>> internal hash table. At some point the shared memory runs out and Kamailio
>> restarts. Using the "kamcmd corex.shm_summary" command I was able to see
>> that the top users of shared memory are "tm: h_table.c: build_cell" and
>> "core: core/sip_msg_clone.c: sip_msg_shm_clone" with the same allocation
>> count.
>>
>> I experimented with removing different parts of the configuration and
>> noticed that commenting out the "t_continue(...)" call in the "PUSHJOIN"
>> route
>> (see slide #22) prevents the leak from happening. Maybe something in that
>> function is incrementing the reference counter to the hash table cell, but
>> it is not decrementing the counter when done?
>>
>> I tried looking around the source code of the tm and tmx modules, but saw
>> nothing suspicious. I also tried using gdb with a breakpoint in
>> t_continue_helper (tm/t_suspend.c:166) hoping to see what else is accessing
>> the htable cell, but was unable to find anything of use.
>>
>> Has someone encountered anything like this? Can you provide more directions
>> on debugging this? I can provide some bits of configuration, but an entire
>> test setup would be rather difficult, unfortunately.
>>
>> Thank you for your time,
>> Ivo
>>
>>
>> _______________________________________________
>> Kamailio (SER) - Users Mailing List
>> sr-users at lists.kamailio.org
>> https://lists.kamailio.org/cgi-bin/mailman/listinfo/sr-users



