Hello,

can you get the output of the next gdb commands:

  - first core file

frame 2
p *p_cell
list

  - second core file

frame 8
p *p_cell
list

Is there any way that I can get access to the system with the core files via ssh? It would help speeding up investigating the core files. If yes, I can provide my ssh key.

Cheers,
Daniel

On 19.03.19 16:12, Enrico Bandiera wrote:
Hi guys, recently we had some issues with kamailio in production, a proxy at the edge of the infrastructure was crashing seemingly randomly, it happened a couple of times in a timespan of 4-5 days until a point where we found it in an endless loop, unable to process SIP packets anymore and printing in loop messages like these:

50(60) CRITICAL: <core> [core/mem/q_malloc.c:512]: qm_free(): BUG: freeing already freed pointer (0x7ff40a359450), called from core: core/usr_avp.c: destroy_avp_list_unsafe(626), first free core: core/usr_avp.c: destroy_avp_list_unsafe(626) - ignoring
50(60) CRITICAL: <core> [core/mem/q_malloc.c:512]: qm_free(): BUG: freeing already freed pointer (0x7ff409c4fa58), called from core: core/usr_avp.c: destroy_avp_list_unsafe(626), first free core: core/usr_avp.c: destroy_avp_list_unsafe(626) - ignoring

We managed to replicate the crash and the loop in a dev environment with a lot of difficulties. It seems not related to the peak load, but to the growing number of processed calls
The kamailio version is:
version: kamailio 5.2.2 (x86_64/linux) 
flags: STATS: Off, USE_TCP, USE_TLS, USE_SCTP, TLS_HOOKS, USE_RAW_SOCKS, DISABLE_NAGLE, USE_MCAST, DNS_IP_HACK, SHM_MEM, SHM_MMAP, PKG_MALLOC, Q_MALLOC, F_MALLOC, TLSF_MALLOC, DBG_SR_MEMORY, USE_FUTEX, FAST_LOCK-ADAPTIVE_WAIT, USE_DNS_CACHE, USE_DNS_FAILOVER, USE_NAPTR, USE_DST_BLACKLIST, HAVE_RESOLV_RES
ADAPTIVE_WAIT_LOOPS=1024, MAX_RECV_BUFFER_SIZE 262144 MAX_URI_SIZE 1024, BUF_SIZE 65535, DEFAULT PKG_SIZE 8MB
poll method support: poll, epoll_lt, epoll_et, sigio_rt, select.
id: unknown 
compiled with gcc 5.3.1


Every time this happened it was the timer pid that was crashing or looping, we managed to get 2 core dumps of the crash and 1 of the loop (got it after killing the timer with SIGQUIT).
The timer process crashes after attempting to free memory that it's not correctly managed. In particular we saw that it happens inside destroy_avp_list_unsafe() called by the tm timer, when uri_avps_from points to an invalid memory address.
We don't use dialog, uac, http_async_client, but we do use topos.

In the crash dumps it seems that there's something wrong inside free_cell_helper(), when trying to clean the dead_cell something goes wrong while freeing uri_avps_from, here's the crash log:

61(76) CRITICAL: <core> [core/pass_fd.c:277]: receive_fd(): EOF on 66
 0(15) ALERT: <core> [main.c:740]: handle_sigs(): child process 65 exited by a signal 11
 0(15) ALERT: <core> [main.c:743]: handle_sigs(): core was generated
 0(15) INFO: <core> [main.c:766]: handle_sigs(): terminating due to SIGCHLD
<childs getting the SIGTERM signal>
 0(15) CRITICAL: <core> [core/mem/q_malloc.c:512]: qm_free(): BUG: freeing already freed pointer (0x7f4d60a2e168), called from htable: ht_api.c: ht_cell_free(217), first free core: core/usr_avp.c: destroy_avp_list_unsafe(626) - ignoring
 0(15) INFO: <core> [core/sctp_core.c:53]: sctp_core_destroy(): SCTP API not initialized

Before the crash we have no warnings on errors about shared memory allocation

backtrace:

(gdb) bt full
#0  0x00000000006371d7 in destroy_avp_list_unsafe (list=0x7f34f7a411b0) at core/usr_avp.c:625
        avp = 0x2
        foo = 0x2
        __func__ = "destroy_avp_list_unsafe"
#1  0x00007f351c6909ab in free_cell_helper (dead_cell=0x7f34f7a41018, silent=0, fname=0x7f351c796a02 "timer.c", fline=689) at h_table.c:260
        b = 0x0
        i = 1
        rpl = 0x0
        tt = 0x0
        foo = 0x11c68daf2
        cbs = 0x0
        cbs_tmp = 0x7f34f80c3790
        __func__ = "free_cell_helper"
#2  0x00007f351c7291e6 in wait_handler (ti=1984410416, wait_tl=0x7f34f7a410a0, data=0x7f34f7a41018) at timer.c:689
        p_cell = 0x7f34f7a41018
        ret = 1
        unlinked = 0
        rcount = 1
        __func__ = "wait_handler"
#3  0x00000000004d0f56 in timer_list_expire (t=1984410416, h=0x7f34f6be7850, slow_l=0x7f34f6bea368, slow_mark=2702) at core/timer.c:874
        tl = 0x7f34f7a410a0
        ret = 0
#4  0x00000000004d13bd in timer_handler () at core/timer.c:939
        saved_ticks = 1984410416
        run_slow_timer = 0
        i = 654
        __func__ = "timer_handler"
#5  0x00000000004d1845 in timer_main () at core/timer.c:978
No locals.
#6  0x0000000000425cc8 in main_loop () at main.c:1727
        i = 8
        pid = 0
        si = 0x0
        si_desc = "udp receiver child=7 sock=127.0.0.1:5060\000\000\000\000q\000\000\000@6y\000\000\000\000\000\000\b\206|\210X\322:\032\301\255\000\000\000\000\000\000\000\000 \000\000\000\000\000\000\000\002\000\000\000\000\063\275{\000\000\000\000\000>", '\000' <repeats 15 times>, "\320\220\377\226\374\177\000\000\351\022a\000\000\000\000"
        nrprocs = 8
        woneinit = 1
        __func__ = "main_loop"
#7  0x000000000042ca76 in main (argc=10, argv=0x7ffc96ff9398) at main.c:2696
        cfg_stream = 0x25ca010
        c = -1
        r = 0
        tmp = 0x7ffc96ffa4c9 ""
        tmp_len = -1761635728
        port = 32764
        proto = -1761635632
        options = 0x76c598 ":f:cm:M:dVIhEeb:l:L:n:vKrRDTN:W:w:t:u:g:P:G:SQ:O:a:A:x:X:Y:"
        ret = -1
        seed = 1492539644
        rfd = 4
        debug_save = 0
        debug_flag = 0
        dont_fork_cnt = 2
        n_lst = 0x7f351f6f6718
        p = 0xffffffff <error: Cannot access memory at address 0xffffffff>
        st = {st_dev = 2049, st_ino = 1282695, st_nlink = 2, st_mode = 16895, st_uid = 0, st_gid = 0, __pad0 = 0, st_rdev = 0, st_size = 4096, st_blksize = 4096, st_blocks = 8, st_atim = {tv_sec = 1552931478, tv_nsec = 972761643}, 
          st_mtim = {tv_sec = 1552931478, tv_nsec = 972761643}, st_ctim = {tv_sec = 1552931478, tv_nsec = 972761643}, __glibc_reserved = {0, 0, 0}}
        __func__ = "main"







In the loop dump we have a different situation, there's a loop in the uri_avps_from structure as you can see from the next pointers:

(gdb) p *dead_cell->uri_avps_from
$3 = {id = 66, flags = 275, next = 0x7ff40a359450, d = {p = 0x7ff40a9ad540, l = 140686126667072, data = "@\325\232\n\364\177\000"}}
(gdb) p *dead_cell->uri_avps_from.next
$4 = {id = 66, flags = 275, next = 0x7ff409c4fa58, d = {p = 0x7ff40a359480, l = 140686120031360, data = "\200\224\065\n\364\177\000"}}
(gdb) p *dead_cell->uri_avps_from.next.next
$5 = {id = 66, flags = 275, next = 0x7ff40a359450, d = {p = 0x7ff409c4fa88, l = 140686112651912, data = "\210\372\304\t\364\177\000"}}

backtrace:
(gdb) bt full
#0  0x00007ff43152e67e in _IO_default_xsputn () from /lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.
#1  0x00007ff43150150b in vfprintf () from /lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.
#2  0x00007ff431502ef1 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.
#3  0x00007ff43150032d in vfprintf () from /lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.
#4  0x00007ff4315087f7 in fprintf () from /lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.
#5  0x0000000000686e43 in qm_free (qmp=0x7ff409336000, p=0x7ff409c4fa58, file=0x7e0a75 "core: core/usr_avp.c", func=0x7e2d70 <__func__.8439> "destroy_avp_list_unsafe", line=626, mname=0x7e0a70 "core") at core/mem/q_malloc.c:512
        __llevel = -3
        qm = 0x7ff409336000
        f = 0x7ff409c4fa20
        size = 72
        next = 0x0
        prev = 0x0
        __func__ = "qm_free"
#6  0x0000000000637209 in destroy_avp_list_unsafe (list=0x7ff409e1b6c8) at core/usr_avp.c:626
        avp = 0x7ff40a359450
        foo = 0x7ff409c4fa58
        __func__ = "destroy_avp_list_unsafe"
#7  0x00007ff42ee5d9ab in free_cell_helper (dead_cell=0x7ff409e1b530, silent=0, fname=0x7ff42ef63a02 "timer.c", fline=689) at h_table.c:260
        b = 0x0
        i = 1
        rpl = 0x0
        tt = 0x0
        foo = 0x12ee5aaf2
        cbs = 0x0
        cbs_tmp = 0x7ff409c5d420
        __func__ = "free_cell_helper"
#8  0x00007ff42eef61e6 in wait_handler (ti=89094309, wait_tl=0x7ff409e1b5b8, data=0x7ff409e1b530) at timer.c:689
        p_cell = 0x7ff409e1b530
        ret = 1
        unlinked = 0
        rcount = 1
        __func__ = "wait_handler"
#9  0x00000000004d0f56 in timer_list_expire (t=89094309, h=0x7ff4093b4850, slow_l=0x7ff4093b8058, slow_mark=5981) at core/timer.c:874
        tl = 0x7ff409e1b5b8
        ret = 0
#10 0x00000000004d13bd in timer_handler () at core/timer.c:939
        saved_ticks = 89094309
        run_slow_timer = 0
        i = 861
        __func__ = "timer_handler"
#11 0x00000000004d1845 in timer_main () at core/timer.c:978
No locals.
#12 0x0000000000425cc8 in main_loop () at main.c:1727
        i = 8
        pid = 0
        si = 0x0
        si_desc = "udp receiver child=7 sock=127.0.0.1:5060\000\000\000\000q\000\000\000@6y\000\000\000\000\000\000\032 t\220\260\242\024\032\301\255\000\000\000\000\000\000\000\000 \000\000\000\000\000\000\000\002\000\000\000\000\063\275{\000\000\000\000\000>", '\000' <repeats 15 times>, "\220\306\336\071\377\177\000\000\351\022a\000\000\000\000"
        nrprocs = 8
        woneinit = 1
        __func__ = "main_loop"
#13 0x000000000042ca76 in main (argc=10, argv=0x7fff39dec958) at main.c:2696
        cfg_stream = 0x2292010
        c = -1
        r = 0
        tmp = 0x7fff39dee4c9 ""
        tmp_len = 970901552
        port = 32767
        proto = 970901648
        options = 0x76c598 ":f:cm:M:dVIhEeb:l:L:n:vKrRDTN:W:w:t:u:g:P:G:SQ:O:a:A:x:X:Y:"
        ret = -1
        seed = 2710999419
        rfd = 4
        debug_save = 0
        debug_flag = 0
        dont_fork_cnt = 2
        n_lst = 0x7ff431ec3718
        p = 0xffffffff <error: Cannot access memory at address 0xffffffff>
        st = {st_dev = 2049, st_ino = 1282695, st_nlink = 2, st_mode = 16895, st_uid = 0, st_gid = 0, __pad0 = 0, st_rdev = 0, st_size = 4096, st_blksize = 4096, st_blocks = 8, st_atim = {tv_sec = 1552931523, tv_nsec = 918435869}, 
          st_mtim = {tv_sec = 1552994274, tv_nsec = 814438687}, st_ctim = {tv_sec = 1552994274, tv_nsec = 814438687}, __glibc_reserved = {0, 0, 0}}
        __func__ = "main"



We are not sure on how to keep debugging this issue which is causing some serious troubles in our environment, any help is appreciated

Here you can find the dumps (beware, 12M download but 1.5G once uncompressed):


Due thanks to Giacomo Vacca who helped us to dig into this issue.

Thanks

_______________________________________________
Kamailio (SER) - Users Mailing List
sr-users@lists.kamailio.org
https://lists.kamailio.org/cgi-bin/mailman/listinfo/sr-users
-- 
Daniel-Constantin Mierla -- www.asipto.com
www.twitter.com/miconda -- www.linkedin.com/in/miconda
Kamailio World Conference - May 6-8, 2019 -- www.kamailioworld.com
Kamailio Advanced Training - Mar 25-27, 2019, in Washington, DC, USA -- www.asipto.com