Running `Kamailio 4.4.7:97f308` with heavy `app_perl` usage, `usrloc` (`db_mode` 3) and not much else.
After an upgrade from 4.1, Kamailio periodically shuts down with this message:
```
Sep 25 20:54:28 switch /sbin/kamailio[29771]: CRITICAL: <core> [pass_fd.c:277]: receive_fd(): EOF on 9
```
Also happens on 4.2.x.
It happens almost exactly every 30-31 minutes, which suggests that some periodic background operation elsewhere on this operator's system is triggering it, but I haven't been able to find one.
Anyway, the PID that logs the message is that of the TCP main (marshalling) process, i.e.
```
# kamcmd ps | grep -i tcp
15716 tcp receiver (generic) child=0
15717 tcp receiver (generic) child=1
15718 tcp receiver (generic) child=2
15719 tcp receiver (generic) child=3
15720 tcp receiver (generic) child=4
15721 tcp receiver (generic) child=5
15722 tcp receiver (generic) child=6
15723 tcp receiver (generic) child=7
15724 tcp receiver (generic) child=8
15725 tcp receiver (generic) child=9
15726 tcp receiver (generic) child=10
15727 tcp receiver (generic) child=11
15728 tcp receiver (generic) child=12
15729 tcp receiver (generic) child=13
15730 tcp receiver (generic) child=14
15731 tcp receiver (generic) child=15
15732 tcp main process
```
In this case, that would be `15732`.
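To script this lookup (for example, to attach a debugger to the right process automatically), a minimal Python sketch follows. It assumes the `kamcmd ps` output has the shape shown above, with the PID first and the description after it; the function name `find_tcp_main_pid` is made up for illustration.

```python
from typing import Optional


def find_tcp_main_pid(ps_output: str) -> Optional[int]:
    """Return the PID of the 'tcp main process' line, or None if absent.

    Assumes each line looks like '<pid> <description>', as in the
    `kamcmd ps` listing quoted in this post.
    """
    for line in ps_output.splitlines():
        parts = line.split(None, 1)  # PID, then the rest of the line
        if len(parts) != 2 or not parts[0].isdigit():
            continue  # skip blank lines, prompts, or anything unexpected
        pid, description = parts
        if description.strip().lower() == "tcp main process":
            return int(pid)
    return None


if __name__ == "__main__":
    # Last two lines of the listing quoted in this post.
    sample = (
        "15731 tcp receiver (generic) child=15\n"
        "15732 tcp main process\n"
    )
    print(find_tcp_main_pid(sample))
```

In practice the input would come from running `kamcmd ps` and capturing its output, e.g. via `subprocess.run(["kamcmd", "ps"], capture_output=True, text=True).stdout`.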
I assume this is because one of the TCP receiver processes dies, but I haven't been able to find any evidence of that. This is a high-volume system, so I can't reduce the worker process pool too much, but I tried reducing the number of child processes per listener from 16 to 3 and attaching GDB to each one. They all exit normally upon receipt of `SIGTERM`:
```
Program received signal SIGTERM, Terminated.
0x00002b1517a436f3 in __epoll_wait_nocancel () from /lib64/libc.so.6
```
Yet it is the TCP main process that shows the EOF in `receive_fd()`.
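For context on what an EOF means at the socket level: a read on a Unix stream socket returns zero bytes once the peer end has been closed. The sketch below reproduces that in isolation with Python's `socket.socketpair()`. It is a generic illustration, not Kamailio's code, and the function name is made up. The assumption is that `receive_fd()` reports EOF in the same situation, i.e. when its `recvmsg()` call returns 0.

```python
import socket


def read_after_peer_close():
    """Close one end of a Unix socket pair, then read from the other end.

    Returns the (data, ancillary_data) pair from recvmsg(). With the peer
    closed, recvmsg() does not block and yields no data and no passed
    descriptors, which is what an EOF looks like to the reader.
    """
    parent_end, child_end = socket.socketpair()  # AF_UNIX stream pair on Unix
    try:
        # Simulate the process holding the other end going away.
        child_end.close()
        # Leave room for one passed file descriptor in the ancillary buffer.
        data, ancdata, _flags, _addr = parent_end.recvmsg(
            16, socket.CMSG_SPACE(4)
        )
        return data, ancdata
    finally:
        parent_end.close()


if __name__ == "__main__":
    data, ancdata = read_after_peer_close()
    print("bytes read:", len(data), "ancillary items:", len(ancdata))
```

If that assumption holds, the message says only that the other end of descriptor 9 was closed. It does not say which process closed it or why.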
Because it's not a crash per se, I don't have a core dump or a way of grabbing the state of the program at the exact moment of the failure. All the processes seem to exit normally.
I have read some past issues that mention this, but their ultimate causes don't seem to be relevant here (e.g. no `dialog` usage). Moreover, the commits made to address this issue in other forms are present in the latest 4.4.x.
Because of the high traffic volume, raising the debug verbosity level and other fairly obvious ideas (e.g. running without forking) aren't practical at all.
Any suggestions welcome!