Running `Kamailio 4.4.7:97f308` with heavy `app_perl` usage, `usrloc` (`db_mode` 3) and not much else.
After an upgrade from 4.1, getting periodic death like this:
```
Sep 25 20:54:28 switch /sbin/kamailio[29771]: CRITICAL: <core> [pass_fd.c:277]: receive_fd(): EOF on 9
```
Also happens on 4.2.x.
It happens every 30-31 minutes, more or less on the dot, which suggests that some sort of background operation elsewhere on this operator's system is causing it, but I haven't been able to find one.
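(For what it's worth, `/proc` will tell you what that "fd 9" actually is on the reporting process, if the descriptor is still open when you look; a quick sketch, with the PID taken from the log line above:)

```
# Sketch: inspect file descriptor 9 of the process that logged the EOF.
# PID comes from the syslog line; this only works while the fd is still open.
ls -l /proc/29771/fd/9
lsof -p 29771 -a -d 9
```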
Anyway, the PID is that of the TCP main marshalling process, i.e.
```
# kamcmd ps | grep -i tcp
15716 tcp receiver (generic) child=0
15717 tcp receiver (generic) child=1
15718 tcp receiver (generic) child=2
15719 tcp receiver (generic) child=3
15720 tcp receiver (generic) child=4
15721 tcp receiver (generic) child=5
15722 tcp receiver (generic) child=6
15723 tcp receiver (generic) child=7
15724 tcp receiver (generic) child=8
15725 tcp receiver (generic) child=9
15726 tcp receiver (generic) child=10
15727 tcp receiver (generic) child=11
15728 tcp receiver (generic) child=12
15729 tcp receiver (generic) child=13
15730 tcp receiver (generic) child=14
15731 tcp receiver (generic) child=15
15732 tcp main process
```
In this case, that would be `15732`.
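For reference, a quick way to pull that PID out in a shell, assuming the `kamcmd ps` output format shown above (PID in the first column):

```
# Sketch: extract the PID of the TCP main process from kamcmd ps output.
kamcmd ps | awk '/tcp main process/ { print $1 }'
```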
I assume this is because one of the TCP receiver processes dies, but I haven't been able to find any evidence of that. This is a high-volume system, so I can't reduce the worker thread pool too much, but I tried reducing the number of child processes per listener from 16 to 3, and attaching GDB to each one. They all die normally upon receipt of `SIGTERM`:
```
Program received signal SIGTERM, Terminated.
0x00002b1517a436f3 in __epoll_wait_nocancel () from /lib64/libc.so.6
```
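For anyone retracing this, the per-worker attach is nothing exotic; a sketch, with `<pid>` standing in for each receiver PID from the `kamcmd ps` listing above, one debugger per worker, left running until the signal arrives:

```
# Sketch: attach to one TCP receiver and let it run until it dies;
# repeat for each worker PID (manageable once children are reduced to 3).
gdb -q -p <pid> -ex continue
```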
Yet, it is the TCP distributor thread that shows the EOF in `receive_fd()`.
Because it's not a crash per se, I don't have a core dump or a way of grabbing the state of the program at the exact moment of the crash. All the processes seem to exit normally.
I have read some past issues that mention this, but their ultimate causes don't seem to be relevant here (e.g. no `dialog` usage). Moreover, the commits made to address this issue in other forms are present in the latest 4.4.x.
For reasons related to the high traffic volume, running with a higher debug verbosity level, or trying other fairly obvious ideas (e.g. running without forking), isn't practical at all.
Any suggestions welcome!
Do you also get a second error message from the handle_io function? Is there an errno error printed?
No, and no.
I attached an `strace` to the main TCP receiver process, and saw this preceding the crash:
```
recvmsg(10, {msg_name(0)=NULL, msg_iov(1)=[{"", 16}], msg_controllen=0, msg_flags=0}, MSG_DONTWAIT) = 0
```
This then engenders the following sequence of events:
```
sendto(5, "<26>Sep 26 13:09:01 /sbin/kamail"..., 99, MSG_NOSIGNAL, NULL, 0) = 99
epoll_ctl(63, EPOLL_CTL_DEL, 10, {EPOLLWRNORM|EPOLLHUP|EPOLLRDHUP|EPOLLET|0x39e9800, {u32=32767, u64=32767}}) = 0
epoll_wait(63, 2b9481eca8c8, 1006, 5000) = -1 EINTR (Interrupted system call)
--- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=628, si_uid=106} ---
exit_group(0) = ?
+++ exited with 0 +++
```
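For the record, an attach along these lines is enough to capture that kind of trace; a sketch, not necessarily the exact invocation used here (`-f` follows children, `-tt` adds timestamps):

```
# Sketch: trace the TCP main process (PID from kamcmd ps), follow any
# children it spawns, timestamp every syscall, and log to a file.
strace -f -tt -o /tmp/tcp_main.trace -p 15732
```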
I'm still not sure what is raising `SIGTERM` exactly, but it thickens the plot insofar as the TCP receiver process appears to be dying because of a `SIGTERM` sent from elsewhere, rather than because of the 0 return value of `recvmsg()` per se. I assume this is related to a different child process dying -- I think Kamailio kills all the other children upon receipt of a `SIGCHLD` from one of the workers, right? -- but I haven't been able to get to the bottom of which child process is dying and why.
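One way to confirm that assumption empirically, on a test instance rather than in production, would be to take down a single worker by hand and watch whether the main process reacts by tearing down the rest; a sketch (worker PID from `kamcmd ps`, log path an assumption):

```
# Sketch: terminate one worker, then watch the parent's reaction in syslog.
kill -TERM 15717                    # any single worker PID from "kamcmd ps"
tail -f /var/log/messages | grep kamailio   # log path is an assumption
```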
Yeah, so this issue is mis-titled. It looks like every EOF is followed by:
```
Sep 26 13:09:01 softswitch1b /sbin/kamailio[628]: ALERT: <core> [main.c:740]: handle_sigs(): child process 666 exited normally, status=255
```
And that is one of the TCP workers. So, for some reason they just keep exiting, every X minutes, but not as the result of a crash.
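(Side note: status=255 is just the 8-bit truncation of an exit(-1)-style return code, i.e. an explicit error exit rather than a signal; the same value shows up in octal as 0377. A quick illustration:)

```
# exit(-1) becomes 255 once truncated to the 8 bits of an exit status
perl -e 'exit(-1)'; echo $?    # prints 255
```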
The plot thickens. It looks like the problematic message actually came in on one of the UDP receivers.
Yep, it's definitely occurring in one of the UDP receivers. The problem is that all I get out of GDB is a normal exit:
```
(gdb) Continuing.
Program exited with code 0377.
```
And because it's detached, I don't know exactly what the context is and so it's hard to track down where the exit is happening and set a breakpoint.
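One way around the detached-exit problem, for anyone retracing this, is to stay attached and break on the termination paths themselves, so a backtrace is available at the moment the process goes away; a sketch of the gdb side, using standard libc/syscall breakpoints rather than anything Kamailio-specific:

```
# Sketch: attach to the suspect UDP receiver and stop on the usual
# "normal exit" paths, then grab a backtrace when one of them fires.
gdb -q -p 15614
(gdb) break exit
(gdb) break _exit
(gdb) break abort
(gdb) catch syscall exit_group
(gdb) continue
# ...once a breakpoint or catchpoint triggers:
(gdb) bt
```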
I did get this sequence of events on the last crash:
```
Sep 26 14:01:30 switch /sbin/kamailio[15614]: ERROR: <core> [receive.c:173]: receive_msg(): core parsing of SIP message failed (x.x.x.x:1111/1)
Sep 26 14:01:33 switch /sbin/kamailio[15614]: ERROR: tm [t_reply.c:533]: _reply_light(): ERROR: _reply_light: can't generate 487 reply when a final 603 was sent out
Sep 26 14:01:35 switch /sbin/kamailio[15613]: ALERT: <core> [main.c:740]: handle_sigs(): child process 15614 exited normally, status=255
```
And that was a UDP receiver:
```
15614 udp receiver child=0 sock=y.y.y.y:5060
```
I assume that the child just aborted itself, via abort(). This is used in some places when Kamailio detects a critical condition from which it can't safely continue processing.
There are aborts like this in the TCP core path as well; that was my first guess. They also exist in tm, though many of them are behind #defines. Do you have EXTRA_DEBUG activated as a compile-time define, by chance?
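(If it helps to check: whether the binary was built with it should show up in the compile-time flag list the binary prints about itself; a sketch, assuming EXTRA_DEBUG appears in the `flags:` line of `kamailio -v` when enabled:)

```
# Sketch: look for EXTRA_DEBUG among the running binary's compile-time flags.
# Assumes the define is included in the "flags:" line when it was enabled.
kamailio -v | grep -i extra_debug
```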
This has evolved into a goose chase leading into `app_perl`, I think, as the exit seems to be occurring during an interpreter reload, possibly upon reaching `reset_cycles`. Stand by for further info.
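(For anyone following along: the value driving that reload cadence can at least be checked in the config; a sketch, with the config path as an assumption:)

```
# Sketch: see how reset_cycles is configured for app_perl.
# The config path is an assumption; adjust for the local layout.
grep -n 'reset_cycles' /etc/kamailio/kamailio.cfg
```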
Closed #2075.
I think this has evolved into an issue that is quite clearly not a Kamailio issue, so I am going to close it.
Ok, good to know :)