[SR-Users] Kamailio crashing and infinite-looping when trying to free memory

Daniel-Constantin Mierla miconda at gmail.com
Tue Mar 19 17:07:20 CET 2019


Hello,

can you get the output of the next gdb commands:

  - first core file

frame 2
p *p_cell
list

  - second core file

frame 8
p *p_cell
list

Is there any way that I can get access to the system with the core files
via ssh? It would help speeding up investigating the core files. If yes,
I can provide my ssh key.

Cheers,
Daniel

On 19.03.19 16:12, Enrico Bandiera wrote:
> Hi guys, recently we had some issues with kamailio in production, a
> proxy at the edge of the infrastructure was crashing seemingly
> randomly, it happened a couple of times in a timespan of 4-5 days
> until a point where we found it in an endless loop, unable to process
> SIP packets anymore and printing in loop messages like these:
>
> 50(60) CRITICAL: <core> [core/mem/q_malloc.c:512]: qm_free(): BUG:
> freeing already freed pointer (0x7ff40a359450), called from core:
> core/usr_avp.c: destroy_avp_list_unsafe(626), first free core:
> core/usr_avp.c: destroy_avp_list_unsafe(626) - ignoring
> 50(60) CRITICAL: <core> [core/mem/q_malloc.c:512]: qm_free(): BUG:
> freeing already freed pointer (0x7ff409c4fa58), called from core:
> core/usr_avp.c: destroy_avp_list_unsafe(626), first free core:
> core/usr_avp.c: destroy_avp_list_unsafe(626) - ignoring
>
> We managed to replicate the crash and the loop in a dev environment
> with a lot of difficulties. It seems not related to the peak load, but
> to the growing number of processed calls
> The kamailio version is:
> version: kamailio 5.2.2 (x86_64/linux) 
> flags: STATS: Off, USE_TCP, USE_TLS, USE_SCTP, TLS_HOOKS,
> USE_RAW_SOCKS, DISABLE_NAGLE, USE_MCAST, DNS_IP_HACK, SHM_MEM,
> SHM_MMAP, PKG_MALLOC, Q_MALLOC, F_MALLOC, TLSF_MALLOC, DBG_SR_MEMORY,
> USE_FUTEX, FAST_LOCK-ADAPTIVE_WAIT, USE_DNS_CACHE, USE_DNS_FAILOVER,
> USE_NAPTR, USE_DST_BLACKLIST, HAVE_RESOLV_RES
> ADAPTIVE_WAIT_LOOPS=1024, MAX_RECV_BUFFER_SIZE 262144 MAX_URI_SIZE
> 1024, BUF_SIZE 65535, DEFAULT PKG_SIZE 8MB
> poll method support: poll, epoll_lt, epoll_et, sigio_rt, select.
> id: unknown 
> compiled with gcc 5.3.1
>
>
> Every time this happened it was the timer pid that was crashing or
> looping, we managed to get 2 core dumps of the crash and 1 of the loop
> (got it after killing the timer with SIGQUIT).
> The timer process crashes after attempting to free memory that it's
> not correctly managed. In particular we saw that it happens inside
> destroy_avp_list_unsafe() called by the tm timer, when uri_avps_from
> points to an invalid memory address.
> We don't use dialog, uac, http_async_client, but we do use topos.
>
> In the crash dumps it seems that there's something wrong inside
> free_cell_helper(), when trying to clean the dead_cell something goes
> wrong while freeing uri_avps_from, here's the crash log:
>
> 61(76) CRITICAL: <core> [core/pass_fd.c:277]: receive_fd(): EOF on 66
>  0(15) ALERT: <core> [main.c:740]: handle_sigs(): child process 65
> exited by a signal 11
>  0(15) ALERT: <core> [main.c:743]: handle_sigs(): core was generated
>  0(15) INFO: <core> [main.c:766]: handle_sigs(): terminating due to
> SIGCHLD
> <childs getting the SIGTERM signal>
>  0(15) CRITICAL: <core> [core/mem/q_malloc.c:512]: qm_free(): BUG:
> freeing already freed pointer (0x7f4d60a2e168), called from htable:
> ht_api.c: ht_cell_free(217), first free core: core/usr_avp.c:
> destroy_avp_list_unsafe(626) - ignoring
>  0(15) INFO: <core> [core/sctp_core.c:53]: sctp_core_destroy(): SCTP
> API not initialized
>
> Before the crash we have no warnings on errors about shared memory
> allocation
>
> backtrace:
>
> (gdb) bt full
> #0  0x00000000006371d7 in destroy_avp_list_unsafe
> (list=0x7f34f7a411b0) at core/usr_avp.c:625
>         avp = 0x2
>         foo = 0x2
>         __func__ = "destroy_avp_list_unsafe"
> #1  0x00007f351c6909ab in free_cell_helper (dead_cell=0x7f34f7a41018,
> silent=0, fname=0x7f351c796a02 "timer.c", fline=689) at h_table.c:260
>         b = 0x0
>         i = 1
>         rpl = 0x0
>         tt = 0x0
>         foo = 0x11c68daf2
>         cbs = 0x0
>         cbs_tmp = 0x7f34f80c3790
>         __func__ = "free_cell_helper"
> #2  0x00007f351c7291e6 in wait_handler (ti=1984410416,
> wait_tl=0x7f34f7a410a0, data=0x7f34f7a41018) at timer.c:689
>         p_cell = 0x7f34f7a41018
>         ret = 1
>         unlinked = 0
>         rcount = 1
>         __func__ = "wait_handler"
> #3  0x00000000004d0f56 in timer_list_expire (t=1984410416,
> h=0x7f34f6be7850, slow_l=0x7f34f6bea368, slow_mark=2702) at
> core/timer.c:874
>         tl = 0x7f34f7a410a0
>         ret = 0
> #4  0x00000000004d13bd in timer_handler () at core/timer.c:939
>         saved_ticks = 1984410416
>         run_slow_timer = 0
>         i = 654
>         __func__ = "timer_handler"
> #5  0x00000000004d1845 in timer_main () at core/timer.c:978
> No locals.
> #6  0x0000000000425cc8 in main_loop () at main.c:1727
>         i = 8
>         pid = 0
>         si = 0x0
>         si_desc = "udp receiver child=7 sock=127.0.0.1:5060
> <http://127.0.0.1:5060>\000\000\000\000q\000\000\000 at 6y\000\000\000\000\000\000\b\206|\210X\322:\032\301\255\000\000\000\000\000\000\000\000
> \000\000\000\000\000\000\000\002\000\000\000\000\063\275{\000\000\000\000\000>",
> '\000' <repeats 15 times>,
> "\320\220\377\226\374\177\000\000\351\022a\000\000\000\000"
>         nrprocs = 8
>         woneinit = 1
>         __func__ = "main_loop"
> #7  0x000000000042ca76 in main (argc=10, argv=0x7ffc96ff9398) at
> main.c:2696
>         cfg_stream = 0x25ca010
>         c = -1
>         r = 0
>         tmp = 0x7ffc96ffa4c9 ""
>         tmp_len = -1761635728
>         port = 32764
>         proto = -1761635632
>         options = 0x76c598
> ":f:cm:M:dVIhEeb:l:L:n:vKrRDTN:W:w:t:u:g:P:G:SQ:O:a:A:x:X:Y:"
>         ret = -1
>         seed = 1492539644
>         rfd = 4
>         debug_save = 0
>         debug_flag = 0
>         dont_fork_cnt = 2
>         n_lst = 0x7f351f6f6718
>         p = 0xffffffff <error: Cannot access memory at address 0xffffffff>
>         st = {st_dev = 2049, st_ino = 1282695, st_nlink = 2, st_mode =
> 16895, st_uid = 0, st_gid = 0, __pad0 = 0, st_rdev = 0, st_size =
> 4096, st_blksize = 4096, st_blocks = 8, st_atim = {tv_sec =
> 1552931478, tv_nsec = 972761643}, 
>           st_mtim = {tv_sec = 1552931478, tv_nsec = 972761643},
> st_ctim = {tv_sec = 1552931478, tv_nsec = 972761643}, __glibc_reserved
> = {0, 0, 0}}
>         __func__ = "main"
>
>
>
>
>
>
>
> In the loop dump we have a different situation, there's a loop in the
> uri_avps_from structure as you can see from the next pointers:
>
> (gdb) p *dead_cell->uri_avps_from
> $3 = {id = 66, flags = 275, next = 0x7ff40a359450, d = {p =
> 0x7ff40a9ad540, l = 140686126667072, data = "@\325\232\n\364\177\000"}}
> (gdb) p *dead_cell->uri_avps_from.next
> $4 = {id = 66, flags = 275, next = 0x7ff409c4fa58, d = {p =
> 0x7ff40a359480, l = 140686120031360, data = "\200\224\065\n\364\177\000"}}
> (gdb) p *dead_cell->uri_avps_from.next.next
> $5 = {id = 66, flags = 275, next = 0x7ff40a359450, d = {p =
> 0x7ff409c4fa88, l = 140686112651912, data = "\210\372\304\t\364\177\000"}}
>
> backtrace:
> (gdb) bt full
> #0  0x00007ff43152e67e in _IO_default_xsputn () from
> /lib/x86_64-linux-gnu/libc.so.6
> No symbol table info available.
> #1  0x00007ff43150150b in vfprintf () from /lib/x86_64-linux-gnu/libc.so.6
> No symbol table info available.
> #2  0x00007ff431502ef1 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
> No symbol table info available.
> #3  0x00007ff43150032d in vfprintf () from /lib/x86_64-linux-gnu/libc.so.6
> No symbol table info available.
> #4  0x00007ff4315087f7 in fprintf () from /lib/x86_64-linux-gnu/libc.so.6
> No symbol table info available.
> #5  0x0000000000686e43 in qm_free (qmp=0x7ff409336000,
> p=0x7ff409c4fa58, file=0x7e0a75 "core: core/usr_avp.c", func=0x7e2d70
> <__func__.8439> "destroy_avp_list_unsafe", line=626, mname=0x7e0a70
> "core") at core/mem/q_malloc.c:512
>         __llevel = -3
>         qm = 0x7ff409336000
>         f = 0x7ff409c4fa20
>         size = 72
>         next = 0x0
>         prev = 0x0
>         __func__ = "qm_free"
> #6  0x0000000000637209 in destroy_avp_list_unsafe
> (list=0x7ff409e1b6c8) at core/usr_avp.c:626
>         avp = 0x7ff40a359450
>         foo = 0x7ff409c4fa58
>         __func__ = "destroy_avp_list_unsafe"
> #7  0x00007ff42ee5d9ab in free_cell_helper (dead_cell=0x7ff409e1b530,
> silent=0, fname=0x7ff42ef63a02 "timer.c", fline=689) at h_table.c:260
>         b = 0x0
>         i = 1
>         rpl = 0x0
>         tt = 0x0
>         foo = 0x12ee5aaf2
>         cbs = 0x0
>         cbs_tmp = 0x7ff409c5d420
>         __func__ = "free_cell_helper"
> #8  0x00007ff42eef61e6 in wait_handler (ti=89094309,
> wait_tl=0x7ff409e1b5b8, data=0x7ff409e1b530) at timer.c:689
>         p_cell = 0x7ff409e1b530
>         ret = 1
>         unlinked = 0
>         rcount = 1
>         __func__ = "wait_handler"
> #9  0x00000000004d0f56 in timer_list_expire (t=89094309,
> h=0x7ff4093b4850, slow_l=0x7ff4093b8058, slow_mark=5981) at
> core/timer.c:874
>         tl = 0x7ff409e1b5b8
>         ret = 0
> #10 0x00000000004d13bd in timer_handler () at core/timer.c:939
>         saved_ticks = 89094309
>         run_slow_timer = 0
>         i = 861
>         __func__ = "timer_handler"
> #11 0x00000000004d1845 in timer_main () at core/timer.c:978
> No locals.
> #12 0x0000000000425cc8 in main_loop () at main.c:1727
>         i = 8
>         pid = 0
>         si = 0x0
>         si_desc = "udp receiver child=7 sock=127.0.0.1:5060
> <http://127.0.0.1:5060>\000\000\000\000q\000\000\000 at 6y\000\000\000\000\000\000\032
> t\220\260\242\024\032\301\255\000\000\000\000\000\000\000\000
> \000\000\000\000\000\000\000\002\000\000\000\000\063\275{\000\000\000\000\000>",
> '\000' <repeats 15 times>,
> "\220\306\336\071\377\177\000\000\351\022a\000\000\000\000"
>         nrprocs = 8
>         woneinit = 1
>         __func__ = "main_loop"
> #13 0x000000000042ca76 in main (argc=10, argv=0x7fff39dec958) at
> main.c:2696
>         cfg_stream = 0x2292010
>         c = -1
>         r = 0
>         tmp = 0x7fff39dee4c9 ""
>         tmp_len = 970901552
>         port = 32767
>         proto = 970901648
>         options = 0x76c598
> ":f:cm:M:dVIhEeb:l:L:n:vKrRDTN:W:w:t:u:g:P:G:SQ:O:a:A:x:X:Y:"
>         ret = -1
>         seed = 2710999419
>         rfd = 4
>         debug_save = 0
>         debug_flag = 0
>         dont_fork_cnt = 2
>         n_lst = 0x7ff431ec3718
>         p = 0xffffffff <error: Cannot access memory at address 0xffffffff>
>         st = {st_dev = 2049, st_ino = 1282695, st_nlink = 2, st_mode =
> 16895, st_uid = 0, st_gid = 0, __pad0 = 0, st_rdev = 0, st_size =
> 4096, st_blksize = 4096, st_blocks = 8, st_atim = {tv_sec =
> 1552931523, tv_nsec = 918435869}, 
>           st_mtim = {tv_sec = 1552994274, tv_nsec = 814438687},
> st_ctim = {tv_sec = 1552994274, tv_nsec = 814438687}, __glibc_reserved
> = {0, 0, 0}}
>         __func__ = "main"
>
>
>
> We are not sure on how to keep debugging this issue which is causing
> some serious troubles in our environment, any help is appreciated
>
> Here you can find the dumps (beware, 12M download but 1.5G once
> uncompressed):
>
> https://drive.google.com/file/d/1U0LZedyona8jJZHq6HaoqXpfnU83p6Sp/view?usp=sharing
>
> Due thanks to Giacomo Vacca who helped us to dig into this issue.
>
> Thanks
>
> _______________________________________________
> Kamailio (SER) - Users Mailing List
> sr-users at lists.kamailio.org
> https://lists.kamailio.org/cgi-bin/mailman/listinfo/sr-users

-- 
Daniel-Constantin Mierla -- www.asipto.com
www.twitter.com/miconda -- www.linkedin.com/in/miconda
Kamailio World Conference - May 6-8, 2019 -- www.kamailioworld.com
Kamailio Advanced Training - Mar 25-27, 2019, in Washington, DC, USA -- www.asipto.com

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.kamailio.org/pipermail/sr-users/attachments/20190319/663e3b84/attachment.html>


More information about the sr-users mailing list