Hello,
can you get the output of the next gdb commands:
- first core file
frame 2
p *p_cell
list
- second core file
frame 8
p *p_cell
list
Is there any way that I can get access to the system with the core files
via ssh? It would help speeding up investigating the core files. If yes,
I can provide my ssh key.
Cheers,
Daniel
On 19.03.19 16:12, Enrico Bandiera wrote:
Hi guys, recently we had some issues with kamailio in
production, a
proxy at the edge of the infrastructure was crashing seemingly
randomly, it happened a couple of times in a timespan of 4-5 days
until a point where we found it in an endless loop, unable to process
SIP packets anymore and printing in loop messages like these:
50(60) CRITICAL: <core> [core/mem/q_malloc.c:512]: qm_free(): BUG:
freeing already freed pointer (0x7ff40a359450), called from core:
core/usr_avp.c: destroy_avp_list_unsafe(626), first free core:
core/usr_avp.c: destroy_avp_list_unsafe(626) - ignoring
50(60) CRITICAL: <core> [core/mem/q_malloc.c:512]: qm_free(): BUG:
freeing already freed pointer (0x7ff409c4fa58), called from core:
core/usr_avp.c: destroy_avp_list_unsafe(626), first free core:
core/usr_avp.c: destroy_avp_list_unsafe(626) - ignoring
We managed to replicate the crash and the loop in a dev environment
with a lot of difficulties. It seems not related to the peak load, but
to the growing number of processed calls
The kamailio version is:
version: kamailio 5.2.2 (x86_64/linux)
flags: STATS: Off, USE_TCP, USE_TLS, USE_SCTP, TLS_HOOKS,
USE_RAW_SOCKS, DISABLE_NAGLE, USE_MCAST, DNS_IP_HACK, SHM_MEM,
SHM_MMAP, PKG_MALLOC, Q_MALLOC, F_MALLOC, TLSF_MALLOC, DBG_SR_MEMORY,
USE_FUTEX, FAST_LOCK-ADAPTIVE_WAIT, USE_DNS_CACHE, USE_DNS_FAILOVER,
USE_NAPTR, USE_DST_BLACKLIST, HAVE_RESOLV_RES
ADAPTIVE_WAIT_LOOPS=1024, MAX_RECV_BUFFER_SIZE 262144 MAX_URI_SIZE
1024, BUF_SIZE 65535, DEFAULT PKG_SIZE 8MB
poll method support: poll, epoll_lt, epoll_et, sigio_rt, select.
id: unknown
compiled with gcc 5.3.1
Every time this happened it was the timer pid that was crashing or
looping, we managed to get 2 core dumps of the crash and 1 of the loop
(got it after killing the timer with SIGQUIT).
The timer process crashes after attempting to free memory that it's
not correctly managed. In particular we saw that it happens inside
destroy_avp_list_unsafe() called by the tm timer, when uri_avps_from
points to an invalid memory address.
We don't use dialog, uac, http_async_client, but we do use topos.
In the crash dumps it seems that there's something wrong inside
free_cell_helper(), when trying to clean the dead_cell something goes
wrong while freeing uri_avps_from, here's the crash log:
61(76) CRITICAL: <core> [core/pass_fd.c:277]: receive_fd(): EOF on 66
0(15) ALERT: <core> [main.c:740]: handle_sigs(): child process 65
exited by a signal 11
0(15) ALERT: <core> [main.c:743]: handle_sigs(): core was generated
0(15) INFO: <core> [main.c:766]: handle_sigs(): terminating due to
SIGCHLD
<childs getting the SIGTERM signal>
0(15) CRITICAL: <core> [core/mem/q_malloc.c:512]: qm_free(): BUG:
freeing already freed pointer (0x7f4d60a2e168), called from htable:
ht_api.c: ht_cell_free(217), first free core: core/usr_avp.c:
destroy_avp_list_unsafe(626) - ignoring
0(15) INFO: <core> [core/sctp_core.c:53]: sctp_core_destroy(): SCTP
API not initialized
Before the crash we have no warnings on errors about shared memory
allocation
backtrace:
(gdb) bt full
#0 0x00000000006371d7 in destroy_avp_list_unsafe
(list=0x7f34f7a411b0) at core/usr_avp.c:625
avp = 0x2
foo = 0x2
__func__ = "destroy_avp_list_unsafe"
#1 0x00007f351c6909ab in free_cell_helper (dead_cell=0x7f34f7a41018,
silent=0, fname=0x7f351c796a02 "timer.c", fline=689) at h_table.c:260
b = 0x0
i = 1
rpl = 0x0
tt = 0x0
foo = 0x11c68daf2
cbs = 0x0
cbs_tmp = 0x7f34f80c3790
__func__ = "free_cell_helper"
#2 0x00007f351c7291e6 in wait_handler (ti=1984410416,
wait_tl=0x7f34f7a410a0, data=0x7f34f7a41018) at timer.c:689
p_cell = 0x7f34f7a41018
ret = 1
unlinked = 0
rcount = 1
__func__ = "wait_handler"
#3 0x00000000004d0f56 in timer_list_expire (t=1984410416,
h=0x7f34f6be7850, slow_l=0x7f34f6bea368, slow_mark=2702) at
core/timer.c:874
tl = 0x7f34f7a410a0
ret = 0
#4 0x00000000004d13bd in timer_handler () at core/timer.c:939
saved_ticks = 1984410416
run_slow_timer = 0
i = 654
__func__ = "timer_handler"
#5 0x00000000004d1845 in timer_main () at core/timer.c:978
No locals.
#6 0x0000000000425cc8 in main_loop () at main.c:1727
i = 8
pid = 0
si = 0x0
si_desc = "udp receiver child=7 sock=127.0.0.1:5060
<http://127.0.0.1:5060>\000\000\000\000q\000\000\000@6y\000\000\000\000\000\000\b\206|\210X\322:\032\301\255\000\000\000\000\000\000\000\000
\000\000\000\000\000\000\000\002\000\000\000\000\063\275{\000\000\000\000\000>",
'\000' <repeats 15 times>,
"\320\220\377\226\374\177\000\000\351\022a\000\000\000\000"
nrprocs = 8
woneinit = 1
__func__ = "main_loop"
#7 0x000000000042ca76 in main (argc=10, argv=0x7ffc96ff9398) at
main.c:2696
cfg_stream = 0x25ca010
c = -1
r = 0
tmp = 0x7ffc96ffa4c9 ""
tmp_len = -1761635728
port = 32764
proto = -1761635632
options = 0x76c598
":f:cm:M:dVIhEeb:l:L:n:vKrRDTN:W:w:t:u:g:P:G:SQ:O:a:A:x:X:Y:"
ret = -1
seed = 1492539644
rfd = 4
debug_save = 0
debug_flag = 0
dont_fork_cnt = 2
n_lst = 0x7f351f6f6718
p = 0xffffffff <error: Cannot access memory at address 0xffffffff>
st = {st_dev = 2049, st_ino = 1282695, st_nlink = 2, st_mode =
16895, st_uid = 0, st_gid = 0, __pad0 = 0, st_rdev = 0, st_size =
4096, st_blksize = 4096, st_blocks = 8, st_atim = {tv_sec =
1552931478, tv_nsec = 972761643},
st_mtim = {tv_sec = 1552931478, tv_nsec = 972761643},
st_ctim = {tv_sec = 1552931478, tv_nsec = 972761643}, __glibc_reserved
= {0, 0, 0}}
__func__ = "main"
In the loop dump we have a different situation, there's a loop in the
uri_avps_from structure as you can see from the next pointers:
(gdb) p *dead_cell->uri_avps_from
$3 = {id = 66, flags = 275, next = 0x7ff40a359450, d = {p =
0x7ff40a9ad540, l = 140686126667072, data = "@\325\232\n\364\177\000"}}
(gdb) p *dead_cell->uri_avps_from.next
$4 = {id = 66, flags = 275, next = 0x7ff409c4fa58, d = {p =
0x7ff40a359480, l = 140686120031360, data = "\200\224\065\n\364\177\000"}}
(gdb) p *dead_cell->uri_avps_from.next.next
$5 = {id = 66, flags = 275, next = 0x7ff40a359450, d = {p =
0x7ff409c4fa88, l = 140686112651912, data = "\210\372\304\t\364\177\000"}}
backtrace:
(gdb) bt full
#0 0x00007ff43152e67e in _IO_default_xsputn () from
/lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.
#1 0x00007ff43150150b in vfprintf () from /lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.
#2 0x00007ff431502ef1 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.
#3 0x00007ff43150032d in vfprintf () from /lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.
#4 0x00007ff4315087f7 in fprintf () from /lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.
#5 0x0000000000686e43 in qm_free (qmp=0x7ff409336000,
p=0x7ff409c4fa58, file=0x7e0a75 "core: core/usr_avp.c", func=0x7e2d70
<__func__.8439> "destroy_avp_list_unsafe", line=626, mname=0x7e0a70
"core") at core/mem/q_malloc.c:512
__llevel = -3
qm = 0x7ff409336000
f = 0x7ff409c4fa20
size = 72
next = 0x0
prev = 0x0
__func__ = "qm_free"
#6 0x0000000000637209 in destroy_avp_list_unsafe
(list=0x7ff409e1b6c8) at core/usr_avp.c:626
avp = 0x7ff40a359450
foo = 0x7ff409c4fa58
__func__ = "destroy_avp_list_unsafe"
#7 0x00007ff42ee5d9ab in free_cell_helper (dead_cell=0x7ff409e1b530,
silent=0, fname=0x7ff42ef63a02 "timer.c", fline=689) at h_table.c:260
b = 0x0
i = 1
rpl = 0x0
tt = 0x0
foo = 0x12ee5aaf2
cbs = 0x0
cbs_tmp = 0x7ff409c5d420
__func__ = "free_cell_helper"
#8 0x00007ff42eef61e6 in wait_handler (ti=89094309,
wait_tl=0x7ff409e1b5b8, data=0x7ff409e1b530) at timer.c:689
p_cell = 0x7ff409e1b530
ret = 1
unlinked = 0
rcount = 1
__func__ = "wait_handler"
#9 0x00000000004d0f56 in timer_list_expire (t=89094309,
h=0x7ff4093b4850, slow_l=0x7ff4093b8058, slow_mark=5981) at
core/timer.c:874
tl = 0x7ff409e1b5b8
ret = 0
#10 0x00000000004d13bd in timer_handler () at core/timer.c:939
saved_ticks = 89094309
run_slow_timer = 0
i = 861
__func__ = "timer_handler"
#11 0x00000000004d1845 in timer_main () at core/timer.c:978
No locals.
#12 0x0000000000425cc8 in main_loop () at main.c:1727
i = 8
pid = 0
si = 0x0
si_desc = "udp receiver child=7 sock=127.0.0.1:5060
<http://127.0.0.1:5060>\000\000\000\000q\000\000\000@6y\000\000\000\000\000\000\032
t\220\260\242\024\032\301\255\000\000\000\000\000\000\000\000
\000\000\000\000\000\000\000\002\000\000\000\000\063\275{\000\000\000\000\000>",
'\000' <repeats 15 times>,
"\220\306\336\071\377\177\000\000\351\022a\000\000\000\000"
nrprocs = 8
woneinit = 1
__func__ = "main_loop"
#13 0x000000000042ca76 in main (argc=10, argv=0x7fff39dec958) at
main.c:2696
cfg_stream = 0x2292010
c = -1
r = 0
tmp = 0x7fff39dee4c9 ""
tmp_len = 970901552
port = 32767
proto = 970901648
options = 0x76c598
":f:cm:M:dVIhEeb:l:L:n:vKrRDTN:W:w:t:u:g:P:G:SQ:O:a:A:x:X:Y:"
ret = -1
seed = 2710999419
rfd = 4
debug_save = 0
debug_flag = 0
dont_fork_cnt = 2
n_lst = 0x7ff431ec3718
p = 0xffffffff <error: Cannot access memory at address 0xffffffff>
st = {st_dev = 2049, st_ino = 1282695, st_nlink = 2, st_mode =
16895, st_uid = 0, st_gid = 0, __pad0 = 0, st_rdev = 0, st_size =
4096, st_blksize = 4096, st_blocks = 8, st_atim = {tv_sec =
1552931523, tv_nsec = 918435869},
st_mtim = {tv_sec = 1552994274, tv_nsec = 814438687},
st_ctim = {tv_sec = 1552994274, tv_nsec = 814438687}, __glibc_reserved
= {0, 0, 0}}
__func__ = "main"
We are not sure on how to keep debugging this issue which is causing
some serious troubles in our environment, any help is appreciated
Here you can find the dumps (beware, 12M download but 1.5G once
uncompressed):
https://drive.google.com/file/d/1U0LZedyona8jJZHq6HaoqXpfnU83p6Sp/view?usp=…
Due thanks to Giacomo Vacca who helped us to dig into this issue.
Thanks
_______________________________________________
Kamailio (SER) - Users Mailing List
sr-users(a)lists.kamailio.org
https://lists.kamailio.org/cgi-bin/mailman/listinfo/sr-users