Hello,
can you get the output of the next gdb commands:
- first core file
frame 2
p *p_cell
list
- second core file
frame 8
p *p_cell
list
Is there any way that I can get access to the system with the core files via ssh? It would help speeding up investigating the core files. If yes, I can provide my ssh key.
Cheers,
Daniel
Hi guys, recently we had some issues with kamailio in production, a proxy at the edge of the infrastructure was crashing seemingly randomly, it happened a couple of times in a timespan of 4-5 days until a point where we found it in an endless loop, unable to process SIP packets anymore and printing in loop messages like these:
50(60) CRITICAL: <core> [core/mem/q_malloc.c:512]: qm_free(): BUG: freeing already freed pointer (0x7ff40a359450), called from core: core/usr_avp.c: destroy_avp_list_unsafe(626), first free core: core/usr_avp.c: destroy_avp_list_unsafe(626) - ignoring50(60) CRITICAL: <core> [core/mem/q_malloc.c:512]: qm_free(): BUG: freeing already freed pointer (0x7ff409c4fa58), called from core: core/usr_avp.c: destroy_avp_list_unsafe(626), first free core: core/usr_avp.c: destroy_avp_list_unsafe(626) - ignoring
We managed to replicate the crash and the loop in a dev environment with a lot of difficulties. It seems not related to the peak load, but to the growing number of processed callsThe kamailio version is:version: kamailio 5.2.2 (x86_64/linux)flags: STATS: Off, USE_TCP, USE_TLS, USE_SCTP, TLS_HOOKS, USE_RAW_SOCKS, DISABLE_NAGLE, USE_MCAST, DNS_IP_HACK, SHM_MEM, SHM_MMAP, PKG_MALLOC, Q_MALLOC, F_MALLOC, TLSF_MALLOC, DBG_SR_MEMORY, USE_FUTEX, FAST_LOCK-ADAPTIVE_WAIT, USE_DNS_CACHE, USE_DNS_FAILOVER, USE_NAPTR, USE_DST_BLACKLIST, HAVE_RESOLV_RESADAPTIVE_WAIT_LOOPS=1024, MAX_RECV_BUFFER_SIZE 262144 MAX_URI_SIZE 1024, BUF_SIZE 65535, DEFAULT PKG_SIZE 8MBpoll method support: poll, epoll_lt, epoll_et, sigio_rt, select.id: unknowncompiled with gcc 5.3.1
Every time this happened it was the timer pid that was crashing or looping, we managed to get 2 core dumps of the crash and 1 of the loop (got it after killing the timer with SIGQUIT).The timer process crashes after attempting to free memory that it's not correctly managed. In particular we saw that it happens inside destroy_avp_list_unsafe() called by the tm timer, when uri_avps_from points to an invalid memory address.We don't use dialog, uac, http_async_client, but we do use topos.
In the crash dumps it seems that there's something wrong inside free_cell_helper(), when trying to clean the dead_cell something goes wrong while freeing uri_avps_from, here's the crash log:
61(76) CRITICAL: <core> [core/pass_fd.c:277]: receive_fd(): EOF on 660(15) ALERT: <core> [main.c:740]: handle_sigs(): child process 65 exited by a signal 110(15) ALERT: <core> [main.c:743]: handle_sigs(): core was generated0(15) INFO: <core> [main.c:766]: handle_sigs(): terminating due to SIGCHLD<childs getting the SIGTERM signal>
0(15) CRITICAL: <core> [core/mem/q_malloc.c:512]: qm_free(): BUG: freeing already freed pointer (0x7f4d60a2e168), called from htable: ht_api.c: ht_cell_free(217), first free core: core/usr_avp.c: destroy_avp_list_unsafe(626) - ignoring0(15) INFO: <core> [core/sctp_core.c:53]: sctp_core_destroy(): SCTP API not initialized
Before the crash we have no warnings on errors about shared memory allocation
backtrace:
(gdb) bt full#0 0x00000000006371d7 in destroy_avp_list_unsafe (list=0x7f34f7a411b0) at core/usr_avp.c:625avp = 0x2foo = 0x2__func__ = "destroy_avp_list_unsafe"#1 0x00007f351c6909ab in free_cell_helper (dead_cell=0x7f34f7a41018, silent=0, fname=0x7f351c796a02 "timer.c", fline=689) at h_table.c:260b = 0x0i = 1rpl = 0x0tt = 0x0foo = 0x11c68daf2cbs = 0x0cbs_tmp = 0x7f34f80c3790__func__ = "free_cell_helper"#2 0x00007f351c7291e6 in wait_handler (ti=1984410416, wait_tl=0x7f34f7a410a0, data=0x7f34f7a41018) at timer.c:689p_cell = 0x7f34f7a41018ret = 1unlinked = 0rcount = 1__func__ = "wait_handler"#3 0x00000000004d0f56 in timer_list_expire (t=1984410416, h=0x7f34f6be7850, slow_l=0x7f34f6bea368, slow_mark=2702) at core/timer.c:874tl = 0x7f34f7a410a0ret = 0#4 0x00000000004d13bd in timer_handler () at core/timer.c:939saved_ticks = 1984410416run_slow_timer = 0i = 654__func__ = "timer_handler"#5 0x00000000004d1845 in timer_main () at core/timer.c:978No locals.#6 0x0000000000425cc8 in main_loop () at main.c:1727i = 8pid = 0si = 0x0si_desc = "udp receiver child=7 sock=127.0.0.1:5060\000\000\000\000q\000\000\000@6y\000\000\000\000\000\000\b\206|\210X\322:\032\301\255\000\000\000\000\000\000\000\000 \000\000\000\000\000\000\000\002\000\000\000\000\063\275{\000\000\000\000\000>", '\000' <repeats 15 times>, "\320\220\377\226\374\177\000\000\351\022a\000\000\000\000"nrprocs = 8woneinit = 1__func__ = "main_loop"#7 0x000000000042ca76 in main (argc=10, argv=0x7ffc96ff9398) at main.c:2696cfg_stream = 0x25ca010c = -1r = 0tmp = 0x7ffc96ffa4c9 ""tmp_len = -1761635728port = 32764proto = -1761635632options = 0x76c598 ":f:cm:M:dVIhEeb:l:L:n:vKrRDTN:W:w:t:u:g:P:G:SQ:O:a:A:x:X:Y:"ret = -1seed = 1492539644rfd = 4debug_save = 0debug_flag = 0dont_fork_cnt = 2n_lst = 0x7f351f6f6718p = 0xffffffff <error: Cannot access memory at address 0xffffffff>st = {st_dev = 2049, st_ino = 1282695, st_nlink = 2, st_mode = 16895, st_uid = 0, st_gid = 0, __pad0 = 0, st_rdev = 0, st_size = 4096, st_blksize = 4096, st_blocks = 8, st_atim = {tv_sec = 1552931478, tv_nsec = 972761643},st_mtim = {tv_sec = 1552931478, tv_nsec = 972761643}, st_ctim = {tv_sec = 1552931478, tv_nsec = 972761643}, __glibc_reserved = {0, 0, 0}}__func__ = "main"
In the loop dump we have a different situation, there's a loop in the uri_avps_from structure as you can see from the next pointers:
(gdb) p *dead_cell->uri_avps_from$3 = {id = 66, flags = 275, next = 0x7ff40a359450, d = {p = 0x7ff40a9ad540, l = 140686126667072, data = "@\325\232\n\364\177\000"}}(gdb) p *dead_cell->uri_avps_from.next$4 = {id = 66, flags = 275, next = 0x7ff409c4fa58, d = {p = 0x7ff40a359480, l = 140686120031360, data = "\200\224\065\n\364\177\000"}}(gdb) p *dead_cell->uri_avps_from.next.next$5 = {id = 66, flags = 275, next = 0x7ff40a359450, d = {p = 0x7ff409c4fa88, l = 140686112651912, data = "\210\372\304\t\364\177\000"}}
backtrace:(gdb) bt full#0 0x00007ff43152e67e in _IO_default_xsputn () from /lib/x86_64-linux-gnu/libc.so.6No symbol table info available.#1 0x00007ff43150150b in vfprintf () from /lib/x86_64-linux-gnu/libc.so.6No symbol table info available.#2 0x00007ff431502ef1 in ?? () from /lib/x86_64-linux-gnu/libc.so.6No symbol table info available.#3 0x00007ff43150032d in vfprintf () from /lib/x86_64-linux-gnu/libc.so.6No symbol table info available.#4 0x00007ff4315087f7 in fprintf () from /lib/x86_64-linux-gnu/libc.so.6No symbol table info available.#5 0x0000000000686e43 in qm_free (qmp=0x7ff409336000, p=0x7ff409c4fa58, file=0x7e0a75 "core: core/usr_avp.c", func=0x7e2d70 <__func__.8439> "destroy_avp_list_unsafe", line=626, mname=0x7e0a70 "core") at core/mem/q_malloc.c:512__llevel = -3qm = 0x7ff409336000f = 0x7ff409c4fa20size = 72next = 0x0prev = 0x0__func__ = "qm_free"#6 0x0000000000637209 in destroy_avp_list_unsafe (list=0x7ff409e1b6c8) at core/usr_avp.c:626avp = 0x7ff40a359450foo = 0x7ff409c4fa58__func__ = "destroy_avp_list_unsafe"#7 0x00007ff42ee5d9ab in free_cell_helper (dead_cell=0x7ff409e1b530, silent=0, fname=0x7ff42ef63a02 "timer.c", fline=689) at h_table.c:260b = 0x0i = 1rpl = 0x0tt = 0x0foo = 0x12ee5aaf2cbs = 0x0cbs_tmp = 0x7ff409c5d420__func__ = "free_cell_helper"#8 0x00007ff42eef61e6 in wait_handler (ti=89094309, wait_tl=0x7ff409e1b5b8, data=0x7ff409e1b530) at timer.c:689p_cell = 0x7ff409e1b530ret = 1unlinked = 0rcount = 1__func__ = "wait_handler"#9 0x00000000004d0f56 in timer_list_expire (t=89094309, h=0x7ff4093b4850, slow_l=0x7ff4093b8058, slow_mark=5981) at core/timer.c:874tl = 0x7ff409e1b5b8ret = 0#10 0x00000000004d13bd in timer_handler () at core/timer.c:939saved_ticks = 89094309run_slow_timer = 0i = 861__func__ = "timer_handler"#11 0x00000000004d1845 in timer_main () at core/timer.c:978No locals.#12 0x0000000000425cc8 in main_loop () at main.c:1727i = 8pid = 0si = 0x0si_desc = "udp receiver child=7 sock=127.0.0.1:5060\000\000\000\000q\000\000\000@6y\000\000\000\000\000\000\032 t\220\260\242\024\032\301\255\000\000\000\000\000\000\000\000 \000\000\000\000\000\000\000\002\000\000\000\000\063\275{\000\000\000\000\000>", '\000' <repeats 15 times>, "\220\306\336\071\377\177\000\000\351\022a\000\000\000\000"nrprocs = 8woneinit = 1__func__ = "main_loop"#13 0x000000000042ca76 in main (argc=10, argv=0x7fff39dec958) at main.c:2696cfg_stream = 0x2292010c = -1r = 0tmp = 0x7fff39dee4c9 ""tmp_len = 970901552port = 32767proto = 970901648options = 0x76c598 ":f:cm:M:dVIhEeb:l:L:n:vKrRDTN:W:w:t:u:g:P:G:SQ:O:a:A:x:X:Y:"ret = -1seed = 2710999419rfd = 4debug_save = 0debug_flag = 0dont_fork_cnt = 2n_lst = 0x7ff431ec3718p = 0xffffffff <error: Cannot access memory at address 0xffffffff>st = {st_dev = 2049, st_ino = 1282695, st_nlink = 2, st_mode = 16895, st_uid = 0, st_gid = 0, __pad0 = 0, st_rdev = 0, st_size = 4096, st_blksize = 4096, st_blocks = 8, st_atim = {tv_sec = 1552931523, tv_nsec = 918435869},st_mtim = {tv_sec = 1552994274, tv_nsec = 814438687}, st_ctim = {tv_sec = 1552994274, tv_nsec = 814438687}, __glibc_reserved = {0, 0, 0}}__func__ = "main"
We are not sure on how to keep debugging this issue which is causing some serious troubles in our environment, any help is appreciated
Here you can find the dumps (beware, 12M download but 1.5G once uncompressed):
Due thanks to Giacomo Vacca who helped us to dig into this issue.
Thanks
_______________________________________________ Kamailio (SER) - Users Mailing List sr-users@lists.kamailio.org https://lists.kamailio.org/cgi-bin/mailman/listinfo/sr-users
-- Daniel-Constantin Mierla -- www.asipto.com www.twitter.com/miconda -- www.linkedin.com/in/miconda Kamailio World Conference - May 6-8, 2019 -- www.kamailioworld.com Kamailio Advanced Training - Mar 25-27, 2019, in Washington, DC, USA -- www.asipto.com