Hello,
I am trying to set up Kamailio as a push notifications proxy, closely following the example in the "Kamailio in a Mobile World" presentation (https://www.slideshare.net/FedericoCabiddu/kamailioinamobileworld-51617342). I am running Debian 9 and Kamailio 5.1.3 from the official Debian repositories. I believe the main modules involved in the issue below are tm, tmx, and tsilo.
Every call passing through the proxy leads to a small memory leak in the tm module - there are a large number of "delayed free" memory cells from tm's internal hash table. At some point the shared memory runs out and Kamailio restarts. Using the "kamcmd corex.shm_summary" command I was able to see that the top users of shared memory are "tm: h_table.c: build_cell" and "core: core/sip_msg_clone.c: sip_msg_shm_clone", both with the same allocation count.
I experimented with removing different parts of the configuration and noticed that commenting out the "t_continue(...)" call in the "PUSHJOIN" route (see slide #22) prevents the leak from happening. Maybe something in that function is incrementing the reference counter to the hash table cell, but it is not decrementing the counter when done?
I tried looking around the source code of the tm and tmx modules, but saw nothing suspicious. I also tried using gdb with a breakpoint in t_continue_helper (tm/t_suspend.c:166) hoping to see what else is accessing the htable cell, but was unable to find anything of use.
Has someone encountered anything like this? Can you provide more directions on debugging this? I can provide some bits of configuration, but an entire test setup would be rather difficult, unfortunately.
Thank you for your time, Ivo
Hello,
could you look at those transactions and see more of their details? You can try the RPC command:
- https://www.kamailio.org/docs/modules/stable/modules/tm.html#tm.rpc.list
Or also with gdb if you are familiar with this tool.
The goal is to figure out whether the related call was completed, whether the transaction was resumed/continued...
Is this running on a virtual machine/cloud? If yes, what kind?
Cheers, Daniel
On 28.05.18 17:01, Ivaylo Markov wrote:
Kamailio (SER) - Users Mailing List sr-users@lists.kamailio.org https://lists.kamailio.org/cgi-bin/mailman/listinfo/sr-users
> Hello,
> could you look at those transactions and see more of their details? You can try the RPC command:
The "delayed free" transactions are not visible using the RPC command. You would think it is something external that keeps a reference to them, but the leak stops if I remove the call to t_continue(), so it sounds like something is going on inside tm or tmx.
> The goal is to figure out whether the related call was completed, whether the transaction was resumed/continued...
If the INVITE times out, the transaction is freed as it should be - which makes sense, since it does not reach the t_continue() call (see above).
I am running Kamailio in a VM on ESXi 5.1.
I guess I will keep digging around in the source and try to trace things with gdb, this time starting from t_continue() in tmx.
Greetings, Ivo
On 05/29/2018 12:02 AM, Daniel-Constantin Mierla wrote:
Hi,
I have a theory about the leak. I am not fully convinced this is the cause, since someone surely must have hit the same issue before me, so I'd like to hear a second opinion :)
When new memory cells are created in tm, in new_t() (tm/t_lookup.c:1275), their reference count is set to 2 (line 1298). As explained in the comment, one for the hash table, and one for the global variable T pointing to the current transaction in that Kamailio process.
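To make the counting rule concrete, here is a minimal C model of it. It is a sketch under assumptions: struct model_cell, model_new_t() and model_unref() are hypothetical names, the struct keeps nothing but a reference counter, and the real tm code does considerably more work around creation and release.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical, simplified model of a tm transaction cell: only the
 * reference counter is kept; the real cell has many more fields. */
struct model_cell {
    atomic_int ref_count;
};

/* Mirrors the rule described for new_t(): a fresh cell starts with two
 * references, one owned by the hash table and one by the per-process
 * global T that points to the current transaction. */
static struct model_cell *model_new_t(void)
{
    struct model_cell *c = malloc(sizeof *c);
    if (c == NULL)
        return NULL;
    atomic_init(&c->ref_count, 2);
    return c;
}

/* Drops one reference; frees the cell and returns true only when the
 * last reference goes away. */
static bool model_unref(struct model_cell *c)
{
    int before = atomic_fetch_sub(&c->ref_count, 1);
    assert(before > 0); /* catch an unbalanced unref */
    if (before == 1) {
        free(c);
        return true;
    }
    return false;
}

/* Reads the current number of references. */
static int model_refs(struct model_cell *c)
{
    return atomic_load(&c->ref_count);
}
```

In this model a cell is freed only by the second unref; a cell whose count stays at 1 simply remains allocated, which is the symptom the gdb session below is looking for.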
Here is a gdb step-through of calling t_continue_helper() (tm/t_suspend.c:166) with a transaction that remains unfreed:
(gdb) def tn
Type commands for definition of "tn".
End with a line saying just "end".
print t->ref_count
end
(gdb)
(gdb) break tm/t_suspend.c:195
Breakpoint 1 at 0x7f921d2bb492: file t_suspend.c, line 195.

Breakpoint 1, t_continue_helper (hash_index=16625, label=2123948819, rtact=0x7f921f8a5840, cbname=0x0, cbparam=0x0) at t_suspend.c:195
warning: Source file is more recent than executable.
195 if (!(t->flags & T_ASYNC_SUSPENDED)) {
(gdb) n
200 if (t->flags & T_CANCELED) {
(gdb)
212 LOCK_ASYNC_CONTINUE(t);
(gdb)
214 t->flags |= T_ASYNC_CONTINUE; /* we can now know anywhere in kamailio
(gdb)
218 t->flags &= ~T_ASYNC_SUSPENDED;
(gdb) tn
$1 = {val = 2}
221 cb_type = FAILURE_CB_TYPE;
(gdb)
$2 = {val = 2}
222 switch (t->async_backup.backup_route) {
(gdb)
$3 = {val = 2}
224 cb_type = FAILURE_CB_TYPE;
(gdb)
$4 = {val = 2}
225 break;
(gdb)
$5 = {val = 2}
237 if(t->async_backup.backup_route != TM_ONREPLY_ROUTE) {
(gdb)
$6 = {val = 2}
240 branch = t->async_backup.blind_uac;
(gdb)
$7 = {val = 2}
241 if (branch >= 0) {
(gdb)
$8 = {val = 2}
242 stop_rb_timers(&t->uac[branch].request);
(gdb)
$9 = {val = 2}
244 if (t->uac[branch].last_received != 0) {
(gdb)
$10 = {val = 2}
262 t->uac[branch].last_received=500;
(gdb)
$11 = {val = 2}
263 if(t->uac[branch].reply!=NULL) {
(gdb)
$12 = {val = 2}
269 t->uac[branch].reply=FAKED_REPLY;
(gdb)
$13 = {val = 2}
271 uac = &t->uac[branch];
(gdb)
$14 = {val = 2}
285 faked_req = fake_req(t->uas.request, 0 /* extra flags */, uac,
(gdb)
$15 = {val = 2}
287 if (faked_req==NULL) {
(gdb)
$16 = {val = 2}
292 faked_env( t, faked_req, 1);
(gdb)
$17 = {val = 2}
294 route_type_bk = get_route_type();
(gdb)
$18 = {val = 2}
295 set_route_type(FAILURE_ROUTE);
(gdb)
$19 = {val = 2}
297 if (exec_pre_script_cb(faked_req, cb_type)>0) {
(gdb)
$20 = {val = 2}
298 if(rtact!=NULL) {
(gdb)
$21 = {val = 2}
299 if (run_top_route(rtact, faked_req, 0)<0) {
(gdb)
$22 = {val = 2}
322 exec_post_script_cb(faked_req, cb_type);
(gdb)
$23 = {val = 2}
324 set_route_type(route_type_bk);
(gdb)
$24 = {val = 2}
329 faked_env( t, 0, 1);
(gdb)
$25 = {val = 2}
331 t->uas.request->flags = faked_req->flags;
(gdb)
$26 = {val = 2}
333 free_faked_req(faked_req, faked_req_len);
(gdb)
$27 = {val = 2}
336 if (t->uas.status < 200) {
(gdb)
$28 = {val = 2}
340 for ( branch = 0;
(gdb)
$29 = {val = 2}
341 branch < t->nr_of_outgoings;
(gdb)
$30 = {val = 2}
340 for ( branch = 0;
(gdb)
$31 = {val = 2}
344 if (t->uac[branch].last_received < 200)
(gdb)
$32 = {val = 2}
342 branch++
(gdb)
$33 = {val = 2}
341 branch < t->nr_of_outgoings;
(gdb)
$34 = {val = 2}
340 for ( branch = 0;
(gdb)
$35 = {val = 2}
344 if (t->uac[branch].last_received < 200)
(gdb)
$36 = {val = 2}
345 break;
(gdb)
$37 = {val = 2}
348 if (branch == t->nr_of_outgoings) {
(gdb)
$38 = {val = 2}
482 t->flags &= ~T_ASYNC_CONTINUE;
(gdb)
$39 = {val = 2}
483 if(t->async_backup.backup_route == TM_ONREPLY_ROUTE) {
(gdb)
$40 = {val = 2}
491 UNLOCK_ASYNC_CONTINUE(t);
(gdb)
$41 = {val = 2}
493 if(t->async_backup.backup_route != TM_ONREPLY_ROUTE){
(gdb)
$42 = {val = 2}
496 t_unref(t->uas.request);
(gdb)
$43 = {val = 2}
543 return 0;
(gdb)
$44 = {val = 1}
570 }
(gdb)
$45 = {val = 1}
t_continue (hash_index=16625, label=2123948819, route=0x7f921f8a5840) at t_suspend.c:576
576 }
As you can see, at the end there is still one reference to the cell. If I understand how this works correctly, after a while the request should be freed by a timer. However, line 242 (stop_rb_timers(&t->uac[branch].request)) disables that timer, hence the leak.
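The theory can be replayed in a small, self-contained C model. It only illustrates the mechanism proposed here, not a confirmed diagnosis: struct model_txn, model_continue() and the other model_* helpers are hypothetical, a single boolean stands in for whichever timer is expected to release the hash table's reference, and tm's real timer handling and locking are not reproduced.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the leak theory: the hash table's reference is
 * expected to be released later by a timer. If that timer is stopped and
 * nothing else releases the reference, the count stays at 1. */
struct model_txn {
    int ref_count;    /* plain int: this model is single-threaded */
    bool timer_armed; /* stands in for the timer that would release the cell */
};

/* Stand-in for stop_rb_timers(): disarms the timer. */
static void model_txn_stop_timers(struct model_txn *t)
{
    t->timer_armed = false;
}

/* Drops one reference, refusing to go below zero. */
static void model_txn_unref(struct model_txn *t)
{
    assert(t->ref_count > 0);
    t->ref_count--;
}

/* If the timer is still armed, it fires once and drops the hash table's
 * reference; otherwise nothing happens. */
static void model_txn_timer_tick(struct model_txn *t)
{
    if (t->timer_armed) {
        t->timer_armed = false;
        model_txn_unref(t);
    }
}

/* Replays the order seen in the gdb session: enter with ref_count == 2,
 * optionally stop the timer (line 242), drop the global T reference
 * (t_unref() at line 496), then let time pass. Returns the final count:
 * 0 means the cell was released, 1 means it is leaked. */
static int model_continue(bool stop_timer)
{
    struct model_txn t = { .ref_count = 2, .timer_armed = true };
    if (stop_timer)
        model_txn_stop_timers(&t);
    model_txn_unref(&t);      /* t_unref(): the global T reference */
    model_txn_timer_tick(&t); /* later: the timer would drop the last one */
    return t.ref_count;
}
```

model_continue(false) ends at 0, while model_continue(true) ends at 1, the same value the gdb trace shows after t_unref().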
Can you share your thoughts on this? If this is indeed the issue, I am struggling to come up with a decent solution.
Greetings, Ivo
Just in case anyone runs into this kind of problem in the future - commits 72f5eaeeef0239ebd16a2d645b83e83eb1a2b506 and 5fe2a1a1c67b550431dcae3c98701073f7edd953 (currently in the master branch only) seem to fix this.
Hi,
I think I have the same issue, but I'm using 4.4 and can't switch to master for now. I tried to cherry-pick these two commits, but unfortunately it does not work.
For example, commit 5fe2a1a1c67b550431dcae3c98701073f7edd953 makes changes in the function t_continue_helper, but 4.4 does not have such a function; it has one with a slightly different name, t_continue.
The same commit adds line 258 in src/modules/tm/t_suspend.c, but in 4.4 this part of the code is slightly different: there is no "t->flags &= ~T_ASYNC_CONTINUE;" line in the same if statement.
There is no way to remove line 390 from the same file, because in 4.4 that part of the code differs quite a lot.
With the second commit, 72f5eaeeef0239ebd16a2d645b83e83eb1a2b506, there were far fewer problems, but there is still a big difference in the code near line 592 of that commit. Probably in 4.4 I just need to update line 527 and change "UNREF_FREE(new_cell);" to "UNREF_FREE(new_cell, 0);".
Is it worth trying to cherry-pick these two commits, or are there too many changes between 4.4 and master to make this work properly?
Please advise.
Thank you!
With kind regards,
Jurijs
On Tue, Jun 12, 2018 at 3:25 PM, Ivaylo Markov ivo@schupen.net wrote:
On Wednesday, 13 June 2018, 09:51:45 CEST, Jurijs Ivolga wrote:
> Is it worth trying to cherry-pick these two commits, or are there too many changes between 4.4 and master to make this work properly?
Hello Juris,
I don't know many details about your setup, but if you don't have a lot of custom code that needs to be adapted before you can move to 5.1, the update should not be difficult. There are also other important fixes, some of them security-relevant, that you miss if you stay on 4.4.
Best regards,
Henning
Hi Henning,
Thanks a lot for your input.
But I was asking whether there is any point in creating a patch from these two commits and applying it to 4.4. Is it worth it, or is there no way to make this work properly on 4.4? As far as I can see, some of the code touched by these two commits differs quite a lot, so I'm a bit afraid to create a patch and apply it to our production servers, especially since I don't have a clue what it affects. :)
With kind regards,
Jurijs
On Wed, Jun 13, 2018 at 11:02 PM, Henning Westerholt hw@kamailio.org wrote:
-- If you like the work that I do in Kamailio, please consider supporting me on Patreon: https://www.patreon.com/henningw
On Thursday, 14 June 2018, 08:31:58 CEST, Jurijs Ivolga wrote:
> Thanks a lot for your input.
> But I was asking whether there is any point in creating a patch from these two commits and applying it to 4.4. Is it worth it, or is there no way to make this work properly on 4.4? As far as I can see, some of the code touched by these two commits differs quite a lot, so I'm a bit afraid to create a patch and apply it to our production servers, especially since I don't have a clue what it affects. :)
Hello Juris,
In my opinion there is indeed a risk that after applying the patch to 4.4 you will run into other problems, because the patch does not fit 100%. TM is one of the most complicated modules; I would not recommend fiddling with it if you don't have a clue, as you mentioned. ;-) There is, of course, the option of getting somebody else to port the patch for you.
But as I already wrote, there are other important bugs that are fixed only in 5.0 and 5.1. As a matter of project policy, we maintain only the last two stable releases.
So I would recommend that you update your production systems instead of trying to re-fit this individual patch into the older code base.
Best regards,
Henning
Hi Henning,
Thanks a lot!
With kind regards,
Jurijs
On Thu, Jun 14, 2018 at 5:02 PM, Henning Westerholt hw@kamailio.org wrote: