Kamailio 5.2 crashed when it received a 480 reply to an INVITE. Below is the backtrace from the core file.
The crash happens in t_reply.c on the last line of this block:
uac = &t->uac[branch];
LM_DBG("org. status uas=%d, uac[%d]=%d local=%d is_invite=%d)\n",
        t->uas.status, branch, uac->last_received,
        is_local(t), is_invite(t));
last_uac_status = uac->last_received;
Earlier it was checked that the transaction was found. Its uac[0] seems to be broken.
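For context, here is a minimal sketch of the kind of defensive bounds check that would avoid dereferencing a bogus branch entry. The struct and field names below are simplified stand-ins, not the actual tm code:

```c
#include <stddef.h>
#include <assert.h>

#define MAX_BRANCHES 12   /* stand-in for tm's branch limit */

struct uac_entry {
    int last_received;    /* stand-in for the real per-branch state */
};

struct cell {
    int nr_of_outgoings;
    struct uac_entry uac[MAX_BRANCHES];
};

/* Return last_received for a branch, or -1 when t or branch is invalid. */
static int safe_last_uac_status(const struct cell *t, int branch)
{
    if (t == NULL || branch < 0 || branch >= MAX_BRANCHES
            || branch >= t->nr_of_outgoings)
        return -1;
    return t->uac[branch].last_received;
}
```

Such a check would only paper over the symptom here, since branch was 0 and the transaction had been looked up already; it is shown only to illustrate the access pattern.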
-- Juha
-----------------------------------------
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f1073e234c3 in reply_received (p_msg=0x7f1076b605f0) at t_reply.c:2240
2240    t_reply.c: No such file or directory.
(gdb) bt full
#0  0x00007f1073e234c3 in reply_received (p_msg=0x7f1076b605f0) at t_reply.c:2240
        msg_status = 480
        last_uac_status = 1590315756
        ack = 0x50550c4 <error: Cannot access memory at address 0x50550c4>
        ack_len = 4
        branch = 0
        reply_status = 29
        onreply_route = 9941216
        cancel_data = {cancel_bitmap = 0, reason = {cause = 0, u = {text = {s = 0x0, len = 1590087991}, e2e_cancel = 0x0, packed_hdrs = {s = 0x0, len = 1590087991}}}}
        uac = 0x0
        t = 0x7f105dfe6480
        lack_dst = {send_sock = 0x555b5f02720f <buf+431>, to = {s = {sa_family = 29127, sa_data = "XXX"}, sin = {sin_family = 29127, sin_port = 24322, sin_addr = {s_addr = 21851}, sin_zero = "XXX"}, sin6 = {sin6_family = 29127, sin6_port = 24322, sin6_flowinfo = 21851, sin6_addr = {__in6_u = {__u6_addr8 = "XXX", __u6_addr16 = {XXX, XXX, XXX, XXX, XXX, XXX, XXX, XXX}, __u6_addr32 = {XXX, XXX, XXX, XXX}}}, sin6_scope_id = 1980563656}}, id = 32528, proto = 112 'p', send_flags = {f = 30268, blst_imask = 32528}}
        backup_user_from = 0x0
        backup_user_to = 0xXXX <qm_info+46>
        backup_domain_from = 0xXXX
        backup_domain_to = 0xXXX
        backup_uri_from = 0x0
        backup_uri_to = 0xXXX
        backup_xavps = 0x45ed834e3
        replies_locked = 1
        branch_ret = 1593995512
        prev_branch = 21851
        blst_503_timeout = 340003632
        hf = 0x7f1076490810
        onsend_params = {req = 0x7f10763c4898, rpl = 0x7f10763c4888, param = 0x97b5f0, code = 10751248, flags = 0, branch = 0, t_rbuf = 0xaf95c0, dst = 0x7f1076db4fc0 <__syslog>, send_buf = {s = 0x555b5ed834e3 "INFO", len = 134217728}}
        ctx = {rec_lev = 1593995791, run_flags = 21851, last_retcode = 1593995708, jmp_env = {{__jmpbuf = {48, 139708676767760, 93849330384899, -7479270984431321856, 93850924380609, 139708690288576, 93850921612515, 134217728}, __mask_was_saved = 12582912, __saved_mask = {__val = {6, 140720648489936, 139708687844848, 140720648490064, 93850920720905, 93850924380373, 139708676767760, 140720648489904, 139708469727337, 139708679781296, 139708687844848, 139708684105760, 140720648490560, 5888963087, 93849330384896, 11507136}}}}}
        bctx = 0x7f10760d0010
        keng = 0x0
        __func__ = "reply_received"
#1  0x0000555b5eadf4dc in do_forward_reply (msg=0x7f1076b605f0, mode=0) at core/forward.c:747
        new_buf = 0x0
        dst = {send_sock = 0x0, to = {s = {sa_family = 0, sa_data = '\000' <repeats 13 times>}, sin = {sin_family = 0, sin_port = 0, sin_addr = {s_addr = 0}, sin_zero = "XXX"}, sin6 = {sin6_family = 0, sin6_port = 0, sin6_flowinfo = 0, sin6_addr = {__in6_u = {__u6_addr8 = '\000' <repeats 15 times>, __u6_addr16 = {0, 0, 0, 0, 0, 0, 0, 0}, __u6_addr32 = {0, 0, 0, 0}}}, sin6_scope_id = 0}}, id = 0, proto = 0 '\000', send_flags = {f = 0, blst_imask = 0}}
        new_len = 0
        r = 1
        ip = {af = XXX, len = 32528, u = {addrl = {XXX, 95}, addr32 = {XXX, XXX, XXX, 0}, addr16 = {XXX, XXX, XXX, XXX, XXX, XXX, XXX, XXX}, addr = "XXX"}}
        s = 0x7ffc14440c68 ""
        len = 32764
        __func__ = "do_forward_reply"
#2  0x0000555b5eae12f9 in forward_reply (msg=0x7f1076b605f0) at core/forward.c:852
No locals.
#3  0x0000555b5eb5b679 in receive_msg (
    buf=0x555b5f027060 <buf> "SIP/2.0 480 Request Terminated\r\nVia: SIP/2.0/UDP XXX;branch=z9hG4bKe951.40cf95b28fe54d0cbda88a8fa4c91d48.0\r\nVia: SIP/2.0/UDP XXX:5060;branch=z9hG4bK04B95fa49ac99a7fa91\r\nTo: <sip:XXX"..., len=431,
    rcv_info=0x7ffc14440ff0) at core/receive.c:433
        msg = 0x7f1076b605f0
        ctx = {rec_lev = 0, run_flags = 0, last_retcode = 1, jmp_env = {{__jmpbuf = {139708690288576, 9004276570109933907, 93850921612515, 134217728, 12582912, 6, 9004276570114128211, 3007006209029601619}, __mask_was_saved = 0, __saved_mask = {__val = {0, 0, 0, 1, 139708266465728, 0, 0, 4634971920, 139708266465728, 140720648490768, 93850918093314, 120, 93850918093450, 139708680838560, 139708680838560, 140720648490832}}}}}
        bctx = 0x0
        ret = 1
        stats_on = 0
        tvb = {tv_sec = 0, tv_usec = 0}
        tve = {tv_sec = 0, tv_usec = 0}
        tz = {tz_minuteswest = 0, tz_dsttime = 0}
        diff = 0
        inb = {s = 0x555b5f027060 <buf> "SIP/2.0 480 Request Terminated\r\nVia: SIP/2.0/UDP XXX;branch=z9hG4bKe951.40cf95b28fe54d0cbda88a8fa4c91d48.0\r\nVia: SIP/2.0/UDP XXX:5060;branch=z9hG4bK04B95fa49ac99a7fa91\r\nTo: <sip:XXX"..., len = 431}
        netinfo = {data = {s = 0x0, len = 0}, rcv = 0x0, dst = 0x0}
        keng = 0x0
        evp = {data = 0x7ffc14440df0, rcv = 0x7ffc14440ff0, dst = 0x0}
        cidlockidx = 0
        cidlockset = 0
        errsipmsg = 0
        __func__ = "receive_msg"
#4  0x0000555b5ea30dc4 in udp_rcv_loop () at core/udp_server.c:541
        len = 431
        buf = "SIP/2.0 480 Request Terminated\r\nVia: SIP/2.0/UDP XXX;branch=z9hG4bKe951.40cf95b28fe54d0cbda88a8fa4c91d48.0\r\nVia: SIP/2.0/UDP XXX:5060;branch=z9hG4bK04B95fa49ac99a7fa91\r\nTo: <sip:XXX"...
        tmp = 0x8000000 <error: Cannot access memory at address 0x8000000>
        from = 0x7f10764b1da0
        fromlen = 16
        ri = {src_ip = {af = 2, len = 4, u = {addrl = {XXX, XXX}, addr32 = {XXX, XXX, XXX, XXX}, addr16 = {XXX, XXX, XXX, XXX, XXX, XXX, XXX, XXX}, addr = "XXX"}}, dst_ip = {af = 2, len = 4, u = {addrl = {XXX, 0}, addr32 = {XXX, 0, 0, 0}, addr16 = {XXX, XXX, 0, 0, 0, 0, 0, 0}, addr = "XXX", '\000' <repeats 11 times>}}, src_port = 5060, dst_port = 5060, proto_reserved1 = 0, proto_reserved2 = 0, src_su = {s = {sa_family = 2, sa_data = "XXX"}, sin = {sin_family = 2, sin_port = 50195, sin_addr = {s_addr = XXX}, sin_zero = "\000\000\000\000\000\000\000"}, sin6 = {sin6_family = 2, sin6_port = 50195, sin6_flowinfo = 1345864889, sin6_addr = {__in6_u = {__u6_addr8 = '\000' <repeats 15 times>, __u6_addr16 = {0, 0, 0, 0, 0, 0, 0, 0}, __u6_addr32 = {0, 0, 0, 0}}}, sin6_scope_id = 0}}, bind_address = 0xXXX, proto = 1 '\001'}
        evp = {data = 0x0, rcv = 0x0, dst = 0x0}
        printbuf = "XXX"...
        i = 1981052368
        j = 5
        l = 0
        __func__ = "udp_rcv_loop"
#5  0x0000555b5e9c8e32 in main_loop () at main.c:1645
        i = 4
        pid = 0
        si = 0x7f1076130940
        si_desc = "udp receiver child=4 sock=XXX:5060XXX"
        nrprocs = 8
        woneinit = 1
        __func__ = "main_loop"
#6  0x0000555b5e9d0fdd in main (argc=17, argv=0x7ffc14441698) at main.c:2675
        cfg_stream = 0x555b5fe5c010
        c = -1
        r = 0
        tmp = 0x7ffc14442f30 ""
        tmp_len = 340006256
        port = 32764
        proto = 340006352
        options = 0x555b5ed33020 ":f:cm:M:dVIhEeb:l:L:n:vKrRDTN:W:w:t:u:g:P:G:SQ:O:a:A:x:X:Y:"
        ret = -1
        seed = 1181662442
        rfd = 4
        debug_save = 0
        debug_flag = 0
        dont_fork_cnt = 0
        n_lst = 0x0
        p = 0xffffffff <error: Cannot access memory at address 0xffffffff>
        st = {st_dev = 19, st_ino = 17502, st_nlink = 2, st_mode = 16832, st_uid = 115, st_gid = 123, __pad0 = 0, st_rdev = 0, st_size = 40, st_blksize = 4096, st_blocks = 0, st_atim = {tv_sec = 1547850959, tv_nsec = 183989794}, st_mtim = {tv_sec = 1547851014, tv_nsec = 719730801}, st_ctim = {tv_sec = 1547851014, tv_nsec = 955611149}, __glibc_reserved = {0, 0, 0}}
        __func__ = "main"
Juha Heinanen writes:
Kamailio 5.2 crashed when it received 480 reply to INVITE. Below is backtrace from the core file.
The crash happens in t_reply.c on the last line of this block:
uac = &t->uac[branch];
LM_DBG("org. status uas=%d, uac[%d]=%d local=%d is_invite=%d)\n",
        t->uas.status, branch, uac->last_received,
        is_local(t), is_invite(t));
last_uac_status = uac->last_received;
Earlier it was checked that the transaction was found. Its uac[0] seems to be broken.
The pcap shows that Kamailio receives two "180 Ringing" replies from the UAS 30 seconds apart, after which fr_inv_timeout hits. At that point Kamailio sends "408 Request Timeout" to the UAC and gets back an ACK. Then 45 seconds later, Kamailio gets one more "180 Ringing" reply from the UAS, followed by several 480 replies. One of them crashes Kamailio.
-- juha
Juha Heinanen writes:
pcap shows that kamailio receives from the uas two "180 ringing" replies 30 secs apart after which fr_inv_timeout hits. at that point kamailio sends "408 request timeout" to uac and gets back ack. then 45 secs later, kamailio gets one more "180 ringing" reply from the uas followed by several 480 replies. one of them crashes kamailio.
The pcap also shows that when fr_inv_timeout fired and Kamailio sent the 408 to the UAC, it did not send a CANCEL to the UAS. It sent the 408 to the UAC at 17:44:44 and the CANCEL to the UAS at 18:00:23, i.e. about 15 minutes later!
-- juha
Hello,
I'll look into it. Is there a possibility that some operation could have taken very long, for example writing an accounting database record?
Cheers, Daniel
On 05.02.19 11:00, Juha Heinanen wrote:
Juha Heinanen writes:
pcap shows that kamailio receives from the uas two "180 ringing" replies 30 secs apart after which fr_inv_timeout hits. at that point kamailio sends "408 request timeout" to uac and gets back ack. then 45 secs later, kamailio gets one more "180 ringing" reply from the uas followed by several 480 replies. one of them crashes kamailio.
pcap also shows that when fr_inv_timeout fired and kamailio send 408 to uac, it did not send cancel to uas. it send 408 to uac at 17:44:44 and cancel to uas at 18:00:23, i.e. about 15 minutes later!
-- juha
Kamailio (SER) - Users Mailing List sr-users@lists.kamailio.org https://lists.kamailio.org/cgi-bin/mailman/listinfo/sr-users
Daniel-Constantin Mierla writes:
I'll look into it. Was there a possibility that some operation could have taken very long, for example writing accounting database record?
Thanks. I don't think that this is a db-related issue. Here is a summary:
- K receives the INVITE at 17:43:28 and forwards it over UDP to the UAS
- the UAS immediately responds with 183 followed by 180, which K forwards to the UAC
- at 17:44:29 the UAS responds again with 180 and K forwards it to the UAC
- at 17:44:44 K's INVITE timeout timer fires and it sends "408 Request Timeout" to the UAC, which responds with an ACK
- at that point K should also have sent a CANCEL to the UAS, but it didn't
- after that K goes into a deep sleep, i.e., it does not react to any new incoming requests from anybody, nor to several "480 Request Terminated" replies from the UAS
- about 15 minutes later, at 18:00:23, K wakes up and sends two CANCEL requests to the UAS
- the UAS replies to the CANCELs with "481 Transaction Does Not Exist"
- then K sends several ACKs to the UAS and crashes
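For reference, the fr_inv_timeout mentioned in the timeline is governed by tm's timer parameters. A kamailio.cfg fragment showing where they are tuned (the values shown are the documented Kamailio 5.x defaults, in milliseconds; the actual deployment may differ):

```cfg
loadmodule "tm.so"

# Final-response timers, in milliseconds
modparam("tm", "fr_timer", 30000)       # timeout when no provisional reply arrives
modparam("tm", "fr_inv_timer", 120000)  # max wait for a final reply to an INVITE
```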
So it seems that the crash is a side effect of K not sending the CANCEL to the UAS immediately. Instead it goes to sleep and doesn't feel well after waking up 15 minutes later.
The real question is why K didn't send the CANCEL immediately and went to sleep instead. How can I find out why it happened?
-- Juha
It is rather strange that it got "frozen" for 15 minutes. Was there high CPU load during that interval?
Typically, if there is a deadlock that blocks processing, then it does not recover until restart. But in this case there was no restart, because that would have destroyed the active transactions and no CANCEL would have been issued.
Is the system a virtual machine, container or bare metal?
I encountered in the past that some VMs got frozen due to the VM manager doing some backup or similar activities ...
Cheers, Daniel
On 05.02.19 21:33, Juha Heinanen wrote:
Daniel-Constantin Mierla writes:
I'll look into it. Was there a possibility that some operation could have taken very long, for example writing accounting database record?
Thanks. I don't think that this is db related issue. Here is a summary:
- K receives INVITE at 17:43:28 and forwards it over udp to uas
- uas immediately responds with 183 followed by 180, which k forwards to UAC
- at 17:44:29 uas responds again with 180 and K forwards it to UAC
- at 17:44:44 K's invite timeout timer fires and it sends "408 request timeout" to uac, which responds with ack
- at that point K should also have send CANCEL to uas, but it didn't
- after that k goes to deep sleep, i.e., it does not react to any new incoming requests from anybody nor to several "480 request terminated" replies from the uas
- about 15 minutes later at 18:00:23, K wakes up and sends two cancel requests to uas
- uas replies to cancels with "481 transaction does not exist"
- then K sends several ACKs to the uas and crashes
So it seems that crash is side effect of K not sending cancel to the uas immediate. Instead it goes to sleep and doesn't feel well after waking up 15 minutes later.
The real questions are why didn't K send the CANCEL immediately and went to sleep instead. How to find out why it happened?
-- Juha
Daniel-Constantin Mierla writes:
It is rather strange that it got like "frozen" for 15 minutes. Was there a high CPU during that interval?
CPU was not high and top showed nothing unusual during the freeze. Also, there was nothing special in the MySQL logs.
Typically, if there is a deadlock that block processing, then it does not recover until restart. But in this case is no restart, because that destroys active trasactions and no CANCEL would have been issued.
Is the system a virtual machine, container or bare metal?
It is virtual machine.
I encountered in the past that some VMs got fronzen due to VM manager doing some backup or similar activities ...
That may very well be the case here too. I'll keep on investigating.
It would have been nice, though, if K had not crashed after it unfroze.
-- Juha
Juha Heinanen writes:
Thanks. I don't think that this is db related issue. Here is a summary:
- K receives INVITE at 17:43:28 and forwards it over udp to uas
- uas immediately responds with 183 followed by 180, which k forwards to UAC
- at 17:44:29 uas responds again with 180 and K forwards it to UAC
- at 17:44:44 K's invite timeout timer fires and it sends "408 request timeout" to uac, which responds with ack
- at that point K should also have send CANCEL to uas, but it didn't
- after that k goes to deep sleep, i.e., it does not react to any new incoming requests from anybody nor to several "480 request terminated" replies from the uas
- about 15 minutes later at 18:00:23, K wakes up and sends two cancel requests to uas
- uas replies to cancels with "481 transaction does not exist"
- then K sends several ACKs to the uas and crashes
I misunderstood the pcap that I got. It only contained packets of the call that caused the INVITE timeout to fire and for which the CANCEL was not sent. So K did properly process other requests during the 15-minute period.
In syslog there is this just before the crash:
Feb 1 16:00:23 /usr/bin/sip-proxy[1348]: CRITICAL: <core> [core/pass_fd.c:277]: receive_fd(): EOF on 18
and before that during the same second:
Feb 1 16:00:23 /usr/bin/sip-proxy[1327]: INFO: db_mysql [km_dbase.c:85]: db_mysql_submit_query(): driver error on ping: Lost connection to MySQL server during query
So the crash may have something to do with db access, but why was the CANCEL not sent 15 minutes earlier? There are no MySQL-related messages in syslog at that time.
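As a side note, if the lost MySQL connection turns out to be relevant, db_mysql's keepalive and reconnect behaviour can be made explicit in kamailio.cfg. A fragment with illustrative values (parameter names per the db_mysql module documentation):

```cfg
loadmodule "db_mysql.so"

modparam("db_mysql", "ping_interval", 60)    # ping idle connections every 60 seconds
modparam("db_mysql", "auto_reconnect", 1)    # reconnect dropped connections automatically
modparam("db_mysql", "timeout_interval", 2)  # connect/read/write timeout, seconds
```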
-- Juha
On 06.02.19 09:51, Juha Heinanen wrote:
Juha Heinanen writes:
Thanks. I don't think that this is db related issue. Here is a summary:
- K receives INVITE at 17:43:28 and forwards it over udp to uas
- uas immediately responds with 183 followed by 180, which k forwards to UAC
- at 17:44:29 uas responds again with 180 and K forwards it to UAC
- at 17:44:44 K's invite timeout timer fires and it sends "408 request timeout" to uac, which responds with ack
- at that point K should also have send CANCEL to uas, but it didn't
- after that k goes to deep sleep, i.e., it does not react to any new incoming requests from anybody nor to several "480 request terminated" replies from the uas
- about 15 minutes later at 18:00:23, K wakes up and sends two cancel requests to uas
- uas replies to cancels with "481 transaction does not exist"
- then K sends several ACKs to the uas and crashes
I misunderstood the pcap that I got. It only contained packets of the call that caused invite timeout to fire and where CANCEL was not sent. So K did properly process other requests during the 15 minute period.
In syslog there is this just before the crash:
Feb 1 16:00:23 /usr/bin/sip-proxy[1348]: CRITICAL: <core> [core/pass_fd.c:277]: receive_fd(): EOF on 18
and before that during the same second:
Feb 1 16:00:23 /usr/bin/sip-proxy[1327]: INFO: db_mysql [km_dbase.c:85]: db_mysql_submit_query(): driver error on ping: Lost connection to MySQL server during query
So the crash may have something to do with db access, but why was cancel not sent 15 minutes ago? There is no mysql related messages in syslog at that time.
OK. There should be no crash no matter what caused the delay. I wanted to sort out the freezing of the entire Kamailio for 15 min, because it is rather unusual for that to happen to all processes and then recover.
I will look more at the tm for this delay/blocking of a single process.
Cheers, Daniel
Daniel-Constantin Mierla writes:
OK. There should be no crash no matter what caused the delay. I wanted to sort out the freezing of entire kamailio for 15min, because that is rather unusual to happen for all processes and then recover.
I will look more at the tm for this delay/blocking of a single process.
The INVITE was routed by process 1310, and after that the same process was used to route other requests before K, 15 minutes later, sent the two CANCELs to the UAS. Unfortunately the syslog does not show the process number that sent the CANCELs.
-- Juha
I got the full syslog and pcap, and based on those K indeed totally froze for about 14 minutes. Below is an updated summary:
- K receives the INVITE at 17:43:28 and forwards it over UDP to the UAS
- the UAS immediately responds with 183 followed by 180, which K forwards to the UAC
- at 17:44:29 the UAS responds again with 180 and K forwards it to the UAC
- at 17:44:44 K's INVITE timeout timer fires and it sends "408 Request Timeout" to the UAC, which responds with an ACK
- at that point K should also have sent a CANCEL to the UAS, but it didn't
- K keeps on handling requests normally until 17:46:39
- after that K freezes totally (no messages to syslog, although the pcap shows that K receives requests over UDP)
- at 16:00:23, K un-freezes and (according to the pcap) sends two CANCEL requests to the UAS
- K also sends out several replies and requests that are related to requests it had received before the freeze (and in most cases gets back "481 Transaction Does Not Exist" replies)
- after that (during the same second) K starts to process new incoming requests normally
- during the same second after the un-freeze, these errors are produced to syslog:
Feb 1 16:00:23 /usr/bin/sip-proxy[1327]: INFO: db_mysql [km_dbase.c:85]: db_mysql_submit_query(): driver error on ping: Lost connection to MySQL server during query
Feb 1 16:00:23 /usr/bin/sip-proxy[1313]: INFO: tm [t_reply.c:480]: _reply_light(): can't generate 408 reply when a final 487 was sent out
Feb 1 16:00:23 /usr/bin/sip-proxy[1313]: ERROR: sl [sl.c:277]: send_reply(): failed to reply stateful (tm)
Feb 1 16:00:23 /usr/bin/sip-proxy[1348]: CRITICAL: <core> [core/pass_fd.c:277]: receive_fd(): EOF on 18
- finally K crashes:
Feb 1 16:00:24 /usr/bin/sip-proxy[1309]: ALERT: <core> [main.c:755]: handle_sigs(): child process 1314 exited by a signal 11
-- Juha
OK. I happen to be traveling for a few days; once I am back, I will dig into it and see how to fix this long-delayed processing so it does not crash.
Cheers, Daniel
On 06.02.19 22:04, Juha Heinanen wrote:
I got the full syslog and pcap and based on those K indeed totally froze for about 14 minutes. Below is updated summary:
- K receives INVITE at 17:43:28 and forwards it over udp to uas
- uas immediately responds with 183 followed by 180, which k forwards to UAC
- at 17:44:29 uas responds again with 180 and K forwards it to UAC
- at 17:44:44 K's invite timeout timer fires and it sends "408 request timeout" to uac, which responds with ack
- at that point K should also have send CANCEL to uas, but it didn't
- K keeps on handling requests normally until 17:46:39
- after that K freezes totally (no messages to syslog although pcap shows that K receives requests over udp)
- at 16:00:23, K un-freezes and (according to pcap) sends two cancel requests to the uas
- K also sends out several replies and request that are related to requests that it had received before the freeze (and in most cases gets back "481 transaction does not exist" replies)
- after that (during the same second) K starts to process normally new incoming requests
- during the same second after the un-freeze, these errors are produced to syslog:
Feb 1 16:00:23 /usr/bin/sip-proxy[1327]: INFO: db_mysql [km_dbase.c:85]: db_mysql_submit_query(): driver error on ping: Lost connection to MySQL server during query
Feb 1 16:00:23 /usr/bin/sip-proxy[1313]: INFO: tm [t_reply.c:480]: _reply_light(): can't generate 408 reply when a final 487 was sent out
Feb 1 16:00:23 /usr/bin/sip-proxy[1313]: ERROR: sl [sl.c:277]: send_reply(): failed to reply stateful (tm)
Feb 1 16:00:23 /usr/bin/sip-proxy[1348]: CRITICAL: <core> [core/pass_fd.c:277]: receive_fd(): EOF on 18
- finally K crashes:
Feb 1 16:00:24 /usr/bin/sip-proxy[1309]: ALERT: <core> [main.c:755]: handle_sigs(): child process 1314 exited by a signal 11
-- Juha
Daniel-Constantin Mierla writes:
OK. It happens to be traveling for few days, once I am back, I will dig into it and see how to fix this long delayed processing so it does not crash.
Thanks for looking into this. Perhaps the freeze is due to some locking issue, like the one where K didn't do anything for a long time when it was started with an unreachable rtpengine.
-- Juha
Hello,
replying to the initial message to have the backtrace easy to look at...
The info locals in frame 0 show:
uac = 0x0
However, that is set a few lines above as:
uac=&t->uac[branch];
The address of a variable (or of a field in a structure) cannot be null. So something happened with the stack. Did the OS keep running smoothly after this issue?
uac is a local variable, so it is allocated on the stack of the respective process. Given the sequence of the C code, there is no way for uac to be overwritten after it was set. If the transaction pointer were invalid, then the crash should have happened at the line:
uac=&t->uac[branch];
So at this moment, either the core file was somehow corrupted/not properly dumped, or the kernel process supervisor did something wrong on resume after the freeze.
There are no safety checks that can be added. Maybe you can try to reproduce it and see if a new core file gives a different backtrace.
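The reasoning above can be illustrated: the compiler computes the address of a struct member as the base pointer plus a fixed offset, without reading memory, so &t->uac[branch] can only be null if t itself already held garbage. A standalone sketch with a simplified struct (not the real tm cell):

```c
#include <stddef.h>
#include <stdint.h>
#include <assert.h>

struct uac_entry { int last_received; };

struct cell {
    int status;
    struct uac_entry uac[4];
};

/* Mirrors what the compiler does for &t->uac[branch]:
 *   (char *)t + offsetof(struct cell, uac) + branch * sizeof(struct uac_entry)
 * No memory is dereferenced here; a bad t faults only later, when
 * uac->last_received reads through the bogus address. */
uintptr_t member_address(struct cell *t, int branch)
{
    return (uintptr_t)t + offsetof(struct cell, uac)
           + (size_t)branch * sizeof(struct uac_entry);
}
```

With a valid t and branch 0 the result is t plus a small nonzero offset, which is why a literal 0x0 in uac, as seen in frame #0, points at stack corruption or a bad core dump rather than a bad transaction pointer.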
Cheers, Daniel
On 05.02.19 10:08, Juha Heinanen wrote:
Kamailio 5.2 crashed when it received 480 reply to INVITE. Below is backtrace from the core file.
The crash happens in t_reply.c on the last line of this block:
uac = &t->uac[branch];
LM_DBG("org. status uas=%d, uac[%d]=%d local=%d is_invite=%d)\n",
        t->uas.status, branch, uac->last_received,
        is_local(t), is_invite(t));
last_uac_status = uac->last_received;
Earlier it was checked that the transaction was found. Its uac[0] seems to be broken.
-- Juha
Daniel-Constantin Mierla writes:
replying to the initial message to make the backtrace easy to look at...
The info locals in frame 0 show:
uac = 0x0
However, that is set few lines above as:
uac=&t->uac[branch];
The address of a variable (or of a field in a structure) cannot be null, so something happened with the stack. Did the OS keep running smoothly after this issue?
The OS is running smoothly all the time. It is possible to ssh to the host, take packet traces, etc., during a Kamailio freeze. This freezing/recovery happens at a frequency of about once every few days.
So at this moment, either the core file was somehow corrupted/not properly dumped, or the kernel process supervisor did something wrong on resume after the freeze.
As I mentioned, the OS is running normally during Kamailio freeze.
There are no safety checks that can be added. Maybe you can try to reproduce and see if the new corefile gives a different backtrace.
The crash has happened only once, after the freeze. I'll let you know if another crash happens and a dump becomes available.
Could this freezing be caused by the same kind of thing as the rtpengine non-availability freeze that you fixed in master? If I remember correctly, the fix was not backported to stable.
-- Juha
On 13.02.19 22:24, Juha Heinanen wrote:
Could this freezing be caused by same kind of thing as the rtpengine non-availability freeze that you fixed in master? If I remember correctly, the fix was not backported to stable.
That patch was backported (at least to branch 5.2).
However, that one had an invalid access to the transaction pointer. In this case, accessing the transaction is OK: a variable is set to the address of one of its fields, but then the variable becomes NULL, which is not possible unless the stack got corrupted/overwritten. However, from the moment the variable is set until the moment of the crash, there are no other set or copy operations, therefore nothing that I can relate to Kamailio's C code. Maybe it happens another time and the backtrace will be different.
Cheers, Daniel
Daniel-Constantin Mierla writes:
[...] the variable becomes NULL, which is not possible unless the stack got corrupted/overwritten. [...]
I agree, Juha
Today there was another report with a backtrace like the one you reported, so I am trying to figure out if there are any common aspects between the two systems.
Can you tell me what Linux distro and version were running in your case? Also, can you give the output of:
uname -a
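For reference, uname -a prints the kernel name, hostname, kernel release and version, and machine architecture on one line; the sample output below is illustrative only, not from the affected host:

```shell
uname -a
# e.g.: Linux sip1 4.9.0-8-amd64 #1 SMP Debian 4.9.130-2 (2018-10-27) x86_64 GNU/Linux
```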
Cheers, Daniel
On 14.02.19 08:54, Juha Heinanen wrote:
Daniel-Constantin Mierla writes:
However, that had invalid access to transaction pointer. In this case, accessing transaction is ok, by setting a variable to the address of one of its fields, but then the variable becomes NULL, which is not possible unless the stack got corrupted/overwritten. However, from the moment of setting the variable till the moment of the crash, there are no other set or copy operations, therefore nothing that I can relate to kamailio's c code. Maybe it happens another time and the backtrace will be different.
I agree, Juha
Short update: I think I found the issue and tried to come up with a solution in the commit:
- https://github.com/kamailio/kamailio/commit/814d5cc1f4f5b1e4b95737108dffc1e7...
The tests that reproduced the crash rather quickly before the commit (done by Yufei Tao) have now been running fine for a long time.
I will backport to the stable branches, and I am considering doing a new release from the 5.2 branch, likely next week.
Cheers, Daniel
On 26.02.19 02:58, Juha Heinanen wrote:
Juha Heinanen writes:
It was Debian Stretch installed on (if I remember correctly) a Xen virtual machine. I'll check to be sure.
Yes, it was a VM on XenServer.