Description

When restarting kamailio nodes in our infrastructure, we noticed that under traffic some nodes started using 100% of the CPU. With the precious help of @giavac, we were able to track the issue down to an infinite loop inside the htable module when synchronizing somewhat large (60K) htables via dmq.

Troubleshooting

Reproduction

Run one kamailio instance with a 60K+ htable and start a new instance. The first instance will try to send the whole table to the new one and will enter an infinite loop that consumes 100% of the CPU.

This is caused by a double call to ht_dmq_cell_group_flush, which creates a circular list in the JSON structure hierarchy. The second call happens in this block of code (which is why a 60K htable is required):
https://github.com/kamailio/kamailio/blob/5.2/src/modules/htable/ht_dmq.c#L509

When this happens, ht_dmq_cell_group_flush tries to add ht_dmq_jdoc_cell_group.jdoc_cells to ht_dmq_jdoc_cell_group.jdoc->root, but this root already has jdoc_cells as its child,
so when srjson_AddItemToObject is called (and in turn srjson_AddItemToArray), it gets appended as a child of itself:
https://github.com/kamailio/kamailio/blob/master/src/lib/srutils/srjson.c#L813

This circular structure then causes an infinite loop when srjson_PrintUnformatted is called, because the print_object function iterates over the circular list:
https://github.com/kamailio/kamailio/blob/master/src/lib/srutils/srjson.c#L679
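
To make the failure mode concrete, here is a minimal sketch of how appending the same item twice to a cJSON-style sibling list produces a cycle. It is not the real srjson code: node_t, add_item, count_children and double_add_creates_cycle are hypothetical names, and the append logic is only assumed to resemble what srjson_AddItemToArray does (walk to the last child, then link the new item after it, without checking whether the item is already in the list).

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for a cJSON-style node (not srjson_t). */
typedef struct node {
    struct node *next;   /* next sibling */
    struct node *prev;   /* previous sibling */
    struct node *child;  /* first child */
} node_t;

/* Append item as the last child of parent: walk to the tail and link.
 * Assumption: like the real code, it does not check for duplicates. */
static void add_item(node_t *parent, node_t *item)
{
    node_t *c = parent->child;

    if (item == NULL)
        return;
    if (c == NULL) {
        parent->child = item;
        return;
    }
    while (c->next != NULL)
        c = c->next;
    c->next = item;
    item->prev = c;
}

/* Walk the children the way a printer would, but give up after `limit`
 * steps. Returns the number of children, or -1 if the limit was hit. */
static int count_children(const node_t *parent, int limit)
{
    const node_t *c;
    int n = 0;

    for (c = parent->child; c != NULL; c = c->next) {
        if (n >= limit)
            return -1;
        n++;
    }
    return n;
}

/* Returns 1 if adding the same item twice leaves it linked to itself. */
static int double_add_creates_cycle(void)
{
    node_t root = {0};
    node_t cells = {0};

    add_item(&root, &cells);            /* first flush: root -> cells */
    if (count_children(&root, 1000) != 1)
        return 0;

    add_item(&root, &cells);            /* second flush: cells->next = cells */
    return cells.next == &cells && count_children(&root, 1000) == -1;
}
```

In this model the second add_item call finds cells as the tail of the list and links cells after itself, so any unbounded walk over the children never reaches NULL. That is the same shape of loop we observe in print_object.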

Possible Solutions

One possible solution would be to destroy and re-initialize the ht_dmq_jdoc_cell_group structure after calling the flush:

if (ht_dmq_jdoc_cell_group.size >= dmq_cell_group_max_size) {
  LM_DBG("sending group count[%d]size[%d]\n", ht_dmq_jdoc_cell_group.count, ht_dmq_jdoc_cell_group.size);
  if (ht_dmq_cell_group_flush(node) < 0) {
    ht_slot_unlock(ht, i);
    goto error;
  }
  ht_dmq_cell_group_destroy();
  ht_dmq_cell_group_init();
}

But we are not sure about the performance implications.

Additional Information

# kamailio -v
version: kamailio 5.2.1 (x86_64/linux) 44e488
flags: STATS: Off, USE_TCP, USE_TLS, USE_SCTP, TLS_HOOKS, USE_RAW_SOCKS, DISABLE_NAGLE, USE_MCAST, DNS_IP_HACK, SHM_MEM, SHM_MMAP, PKG_MALLOC, Q_MALLOC, F_MALLOC, TLSF_MALLOC, DBG_SR_MEMORY, USE_FUTEX, FAST_LOCK-ADAPTIVE_WAIT, USE_DNS_CACHE, USE_DNS_FAILOVER, USE_NAPTR, USE_DST_BLACKLIST, HAVE_RESOLV_RES
ADAPTIVE_WAIT_LOOPS=1024, MAX_RECV_BUFFER_SIZE 262144 MAX_URI_SIZE 1024, BUF_SIZE 65535, DEFAULT PKG_SIZE 8MB
poll method support: poll, epoll_lt, epoll_et, sigio_rt, select.
id: 44e488
compiled on 11:52:58 Feb 21 2019 with gcc 5.4.0

