### Description

When restarting kamailio nodes in our infrastructure, we noticed that under traffic some nodes started using 100% of the CPU. With the precious help of @giavac we were able to track the issue down to an infinite loop inside the htable module when synchronizing somewhat big (60K) htables via dmq.
### Troubleshooting
#### Reproduction

Have one kamailio instance with a 60K+ htable and start a new instance: the first instance will try to send the whole table to the new one and will enter an infinite loop that consumes 100% of the CPU.
This is caused by a double call to **ht_dmq_cell_group_flush**, which creates a circular list in the JSON structure hierarchy. The second call happens in this block of code (which is why a 60K+ htable is required): https://github.com/kamailio/kamailio/blob/5.2/src/modules/htable/ht_dmq.c#L5...
When this happens, **ht_dmq_cell_group_flush** tries to add **ht_dmq_jdoc_cell_group.jdoc_cells** to **ht_dmq_jdoc_cell_group.jdoc->root**, but this root already has **json_cells** as its child, so when **srjson_AddItemToObject** is called (and, in turn, **srjson_AddItemToArray**) the node ends up appended as a sibling of itself, i.e. its `next` pointer points back to itself: https://github.com/kamailio/kamailio/blob/master/src/lib/srutils/srjson.c#L8...
This circular structure then causes an infinite loop when **srjson_PrintUnformatted** is called, because the **print_object** function iterates over the now-circular child list: https://github.com/kamailio/kamailio/blob/master/src/lib/srutils/srjson.c#L6...
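To make the failure mode concrete, here is a minimal, self-contained model of the tail-append logic behind **srjson_AddItemToArray** (the `node_t` type and `add_item` function are simplified stand-ins for illustration, not the actual srjson code): appending a node that is already the tail of the parent's child list leaves the node's `next` pointer referring to itself, so any walk along the sibling chain, like the one in **print_object**, never terminates.

```
/* Minimal model of the sibling-list append performed by
 * srjson_AddItemToArray() (simplified stand-in, not the actual srjson
 * code). Appending a node that is already the tail of the parent's
 * child list leaves its `next` pointer referring to itself, so any
 * traversal of the sibling chain never terminates. */
#include <stdio.h>
#include <stdlib.h>

typedef struct node {
	struct node *next;
	struct node *prev;
	struct node *child;
	const char *name;
} node_t;

static node_t *node_new(const char *name) {
	node_t *n = calloc(1, sizeof(*n));
	n->name = name;
	return n;
}

/* Append `item` to `parent`'s child list: walk to the tail, then link. */
static void add_item(node_t *parent, node_t *item) {
	node_t *c = parent->child;
	if (c == NULL) {
		parent->child = item;
		return;
	}
	while (c->next) c = c->next;  /* find the current tail */
	c->next = item;               /* if c == item, this creates a self-loop */
	item->prev = c;
}

int main(void) {
	node_t *root  = node_new("root");
	node_t *cells = node_new("cells");

	add_item(root, cells);  /* first flush: cells becomes root's child */
	add_item(root, cells);  /* second flush on the same doc: cells is
	                         * already the tail, so cells->next = cells */

	printf("cells->next == cells? %s\n",
			cells->next == cells ? "yes (cycle)" : "no");

	/* A printer walking child->next (like print_object) would now spin
	 * forever; bound the walk here just to show the cycle. */
	int steps = 0;
	for (node_t *c = root->child; c && steps < 5; c = c->next, steps++)
		printf("visiting %s\n", c->name);
	printf("stopped after %d steps (unbounded otherwise)\n", steps);

	free(cells);
	free(root);
	return 0;
}
```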
### Possible Solutions
One possible solution could be to destroy and re-initialize the **ht_dmq_jdoc_cell_group** structure after calling the flush:

```
if (ht_dmq_jdoc_cell_group.size >= dmq_cell_group_max_size) {
	LM_DBG("sending group count[%d]size[%d]\n",
			ht_dmq_jdoc_cell_group.count, ht_dmq_jdoc_cell_group.size);
	if (ht_dmq_cell_group_flush(node) < 0) {
		ht_slot_unlock(ht, i);
		goto error;
	}
	ht_dmq_cell_group_destroy();
	ht_dmq_cell_group_init();
}
```

But we are not sure about the performance implications.
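On the performance question: in the snippet above the destroy/init pair sits inside the size-threshold branch, so it would run once per flushed group rather than once per replicated cell. A rough back-of-the-envelope sketch of that ratio (all figures are made-up assumptions, including the **dmq_cell_group_max_size** value and the interpretation of `size` as serialized bytes):

```
/* Back-of-the-envelope model for the performance concern: the reset runs
 * only inside the size-threshold branch above, i.e. once per flushed
 * group, not once per cell. All figures are made-up assumptions,
 * including the dmq_cell_group_max_size value and the interpretation of
 * `size` as serialized bytes. */
#include <stdio.h>

int main(void) {
	const long cells           = 60000; /* cells replicated during the sync */
	const long bytes_per_cell  = 1100;  /* assumed serialized size per cell */
	const long group_max_bytes = 60000; /* assumed dmq_cell_group_max_size  */

	long total_bytes = cells * bytes_per_cell;
	long flushes     = (total_bytes + group_max_bytes - 1) / group_max_bytes;

	printf("total serialized bytes: %ld\n", total_bytes);
	printf("destroy/init cycles   : %ld (one per flush, vs %ld cells)\n",
			flushes, cells);
	return 0;
}
```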
### Additional Information

```
# kamailio -v
version: kamailio 5.2.1 (x86_64/linux) 44e488
flags: STATS: Off, USE_TCP, USE_TLS, USE_SCTP, TLS_HOOKS, USE_RAW_SOCKS, DISABLE_NAGLE, USE_MCAST, DNS_IP_HACK, SHM_MEM, SHM_MMAP, PKG_MALLOC, Q_MALLOC, F_MALLOC, TLSF_MALLOC, DBG_SR_MEMORY, USE_FUTEX, FAST_LOCK-ADAPTIVE_WAIT, USE_DNS_CACHE, USE_DNS_FAILOVER, USE_NAPTR, USE_DST_BLACKLIST, HAVE_RESOLV_RES
ADAPTIVE_WAIT_LOOPS=1024, MAX_RECV_BUFFER_SIZE 262144 MAX_URI_SIZE 1024, BUF_SIZE 65535, DEFAULT PKG_SIZE 8MB
poll method support: poll, epoll_lt, epoll_et, sigio_rt, select.
id: 44e488
compiled on 11:52:58 Feb 21 2019 with gcc 5.4.0
```
* **Operating System**: ubuntu:xenial docker container

```
# uname -a
Linux kama-0 4.4.0-135-generic #161-Ubuntu SMP Mon Aug 27 10:45:01 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
```
Thanks for reporting - I will take a look and fix accordingly.
Please can you try with the above and confirm whether or not it is fixed?
Thank you @charlesrchance, I did some tests with this setup:

kamailio.cfg (meaningful lines):

```
fork=yes
children=1
tcp_connection_lifetime=3605
pv_buffer_size=8192

# ----- dmq params -----
modparam("dmq", "server_address", DMQ_SERVER_ADDRESS)
modparam("dmq", "notification_address", DMQ_NOTIFICATION_ADDRESS)
modparam("dmq", "multi_notify", 1)
modparam("dmq", "num_workers", 1)
modparam("dmq", "ping_interval", 15)
modparam("dmq", "worker_usleep", 1000)

# ----- htable params -----
modparam("htable", "enable_dmq", 1)
modparam("htable", "dmq_init_sync", 1)
modparam("htable", "htable", "ht=>size=16;dmqreplicate=1;autoexpire=10800;")  # Keep track of concurrent channels for accounts. Should be same as dialog
modparam("htable", "htable", "ht1=>size=16;dmqreplicate=1;autoexpire=10800;") # Keep track of concurrent channels for accounts. Should be same as dialog
modparam("htable", "htable", "ht2=>size=16;dmqreplicate=1;autoexpire=10800;") # Keep track of concurrent channels for accounts. Should be same as dialog
modparam("htable", "htable", "ht3=>size=16;dmqreplicate=1;autoexpire=10800;") # Keep track of concurrent channels for accounts. Should be same as dialog

#!define ONEK "really 1 k chars, believe me :)"

event_route[htable:mod-init] {
	$var(name) = POD_NAME + "\n";
	xlog("L_ALERT", "$var(name)");
	if(POD_NAME == "kama-0") {
		$var(count) = 0;
		while($var(count) < 99) {
			$sht(ht=>$var(count)) = ONEK;
			$sht(ht1=>$var(count)) = ONEK;
			$sht(ht2=>$var(count)) = ONEK;
			$sht(ht3=>$var(count)) = ONEK;
			$var(count) = $var(count) + 1;
		}
	}
}

request_route {
	if ($rm == "KDMQ") {
		dmq_handle_message();
	}
	exit;
}
```
I started kama-0, which now has 4 htables of ~99K size each, then started 10 kubernetes pods and launched kamailio 100 times with a timeout of 3 seconds on each pod. So we have roughly 1000 kamailios trying to get these htables from kama-0. I didn't see any dangerous CPU spike, and the loop doesn't happen anymore.
There's something I'm worried about, though: the memory usage of the DMQ worker (measured from top), which usually stays around 0.1%, is now stable at 1.4% and is not going down again.
I fear there's a memory leak somewhere, but I'm not sure where. While debugging the loop issue I had some doubts about how the json_t structures are freed, but that could just be down to my limited knowledge of the code. Can you give us any hints to help investigate this further?
Thanks
Thanks for testing so extensively!
My (limited) understanding is that the call to srjson_DeleteItemFromObject() - by virtue of its internal call to srjson_Delete() - takes care of freeing the entire "cells" structure.
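For illustration, here is a simplified model of that recursive delete (the `node_t` type and `node_delete` function below are stand-ins, not the actual srjson_Delete() code): removing the "cells" node releases every cell item reachable through its child/next pointers in a single call, which is why no per-cell cleanup should be needed after srjson_DeleteItemFromObject().

```
/* Simplified model of the recursive free behind srjson_Delete()
 * (illustration only, not the actual srjson code): deleting the "cells"
 * node walks its children and siblings and frees the whole subtree. */
#include <stdio.h>
#include <stdlib.h>

typedef struct node {
	struct node *next;
	struct node *child;
	const char *name;
} node_t;

static node_t *node_new(const char *name) {
	node_t *n = calloc(1, sizeof(*n));
	n->name = name;
	return n;
}

static void node_delete(node_t *n) {
	while (n) {
		node_t *next = n->next;
		if (n->child)
			node_delete(n->child); /* free the whole subtree first */
		printf("freeing %s\n", n->name);
		free(n);
		n = next;                  /* then continue along the siblings */
	}
}

int main(void) {
	/* cells -> [cell1, cell2], roughly the shape after two cells have
	 * been appended to the group */
	node_t *cells = node_new("cells");
	node_t *c1 = node_new("cell1");
	node_t *c2 = node_new("cell2");
	cells->child = c1;
	c1->next = c2;

	node_delete(cells); /* one call frees the array node and both cells */
	return 0;
}
```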
I will do some testing here to be sure there is no leak but in the meantime, can you confirm whether you see the memory increase linearly in line with the number of times you launch a new Kamailio instance? For example, if you increase from 100 to 200 times, does the memory usage then settle at 2.7%?
Hi @charlesrchance, I tried but couldn't reproduce the issue; I'm wondering how it happened in the first place.
@charlesrchance do you need more tests before merging this fix?
I've been unable to reproduce myself, so if you're also unable to then I think it's good to merge - I'll submit the PR now.
Hello, sorry to bother you; we'd just like to know what timeline to expect for this issue. Unfortunately, we currently have a production deployment with this latent behaviour, which can be triggered simply by restarting one kamailio instance.
@paolovisintin - as soon as the above PR (#1872) is merged (will also be backported).
Fix has been merged/backported. Please reopen if still experiencing issues.
Closed #1863.