Running `Kamailio 4.4.7:97f308` with heavy `app_perl` usage, `usrloc` (`db_mode` 3) and not much else.
After an upgrade from 4.1, getting periodic death like this:
```
Sep 25 20:54:28 switch /sbin/kamailio[29771]: CRITICAL: <core> [pass_fd.c:277]: receive_fd(): EOF on 9
```
Also happens on 4.2.x.
It happens every 30-31 minutes, more or less on the dot, which suggests that some sort of background operation elsewhere on this operator's system is causing it, but I haven't been able to find one.
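(For what it's worth, `/proc` will tell you what that "fd 9" actually is on the reporting process, if the descriptor is still open when you look; a quick sketch, with the PID taken from the log line above:)

```
# Sketch: inspect file descriptor 9 of the process that logged the EOF.
# PID comes from the syslog line; this only works while the fd is still open.
ls -l /proc/29771/fd/9
lsof -p 29771 -a -d 9
```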
Anyway, the PID is that of the TCP main marshalling process, i.e.
```
# kamcmd ps | grep -i tcp
15716 tcp receiver (generic) child=0
15717 tcp receiver (generic) child=1
15718 tcp receiver (generic) child=2
15719 tcp receiver (generic) child=3
15720 tcp receiver (generic) child=4
15721 tcp receiver (generic) child=5
15722 tcp receiver (generic) child=6
15723 tcp receiver (generic) child=7
15724 tcp receiver (generic) child=8
15725 tcp receiver (generic) child=9
15726 tcp receiver (generic) child=10
15727 tcp receiver (generic) child=11
15728 tcp receiver (generic) child=12
15729 tcp receiver (generic) child=13
15730 tcp receiver (generic) child=14
15731 tcp receiver (generic) child=15
15732 tcp main process
```
In this case, that would be `15732`.
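For reference, a quick way to pull that PID out in a shell, assuming the `kamcmd ps` output format shown above (PID in the first column):

```
# Sketch: extract the PID of the TCP main process from kamcmd ps output.
kamcmd ps | awk '/tcp main process/ { print $1 }'
```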
I assume this is because one of the TCP receiver processes dies, but I haven't been able to find any evidence of that. This is a high-volume system, so I can't reduce the worker thread pool too much, but I tried reducing the number of child processes per listener from 16 to 3, and attaching GDB to each one. They all die normally upon receipt of `SIGTERM`:
```
Program received signal SIGTERM, Terminated.
0x00002b1517a436f3 in __epoll_wait_nocancel () from /lib64/libc.so.6
```
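For anyone retracing this, the per-worker attach is nothing exotic; a sketch, with `<pid>` standing in for each receiver PID from the `kamcmd ps` listing above, one debugger per worker, left running until the signal arrives:

```
# Sketch: attach to one TCP receiver and let it run until it dies;
# repeat for each worker PID (manageable once children are reduced to 3).
gdb -q -p <pid> -ex continue
```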
Yet, it is the TCP distributor thread that shows the EOF in `receive_fd()`.
Because it's not a crash per se, I don't have a core dump or a way of grabbing the state of the program at the exact moment of the crash. All the processes seem to exit normally.
I have read some past issues that mention this, but their ultimate causes don't seem to be relevant here (e.g. no `dialog` usage). Moreover, the commits made to address this issue in other forms are present in the latest 4.4.x.
For reasons related to the high traffic volume, running with a higher debug verbosity level, or trying other fairly obvious ideas (e.g. running without forking), isn't practical at all.
Any suggestions welcome!
Do you also get a second error message from the handle_io function? Is there an errno error printed?
No, and no.
I attached an `strace` to the main TCP receiver process, and saw this preceding the crash:
```
recvmsg(10, {msg_name(0)=NULL, msg_iov(1)=[{"", 16}], msg_controllen=0, msg_flags=0}, MSG_DONTWAIT) = 0
```
This then engenders the following sequence of events:
```
sendto(5, "<26>Sep 26 13:09:01 /sbin/kamail"..., 99, MSG_NOSIGNAL, NULL, 0) = 99
epoll_ctl(63, EPOLL_CTL_DEL, 10, {EPOLLWRNORM|EPOLLHUP|EPOLLRDHUP|EPOLLET|0x39e9800, {u32=32767, u64=32767}}) = 0
epoll_wait(63, 2b9481eca8c8, 1006, 5000) = -1 EINTR (Interrupted system call)
--- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=628, si_uid=106} ---
exit_group(0) = ?
+++ exited with 0 +++
```
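For the record, an attach along these lines is enough to capture that kind of trace; a sketch, not necessarily the exact invocation used here (`-f` follows children, `-tt` adds timestamps):

```
# Sketch: trace the TCP main process (PID from kamcmd ps), follow any
# children it spawns, timestamp every syscall, and log to a file.
strace -f -tt -o /tmp/tcp_main.trace -p 15732
```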
I'm still not sure what is raising `SIGTERM` exactly, but it thickens the plot insofar as the TCP receiver process appears to be dying because of a `SIGTERM` sent from elsewhere, rather than because of the 0 return value of `recvmsg()` per se. I assume this is related to a different child process dying -- I think Kamailio kills all the other children upon receipt of a `SIGCHLD` from one of the workers, right? -- but I haven't been able to get to the bottom of which child process is dying and why.
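One way to confirm that assumption empirically, on a test instance rather than in production, would be to take down a single worker by hand and watch whether the main process reacts by tearing down the rest; a sketch (worker PID from `kamcmd ps`, log path an assumption):

```
# Sketch: terminate one worker, then watch the parent's reaction in syslog.
kill -TERM 15717                    # any single worker PID from "kamcmd ps"
tail -f /var/log/messages | grep kamailio   # log path is an assumption
```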
Yeah, so this issue is mis-titled. It looks like every EOF is followed by:
```
Sep 26 13:09:01 softswitch1b /sbin/kamailio[628]: ALERT: <core> [main.c:740]: handle_sigs(): child process 666 exited normally, status=255
```
And that is one of the TCP workers. So, for some reason they just keep exiting, every X minutes, but not as the result of a crash.
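(Side note: status=255 is just the 8-bit truncation of an exit(-1)-style return code, i.e. an explicit error exit rather than a signal; the same value shows up in octal as 0377. A quick illustration:)

```
# exit(-1) becomes 255 once truncated to the 8 bits of an exit status
perl -e 'exit(-1)'; echo $?    # prints 255
```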
The plot thickens. It looks like the problematic message actually came in on one of the UDP receivers.
Yep, it's definitely occurring in one of the UDP receivers. The problem is that all I get out of GDB is a normal exit:
```
(gdb) Continuing.
Program exited with code 0377.
```
And because it's detached, I don't know exactly what the context is and so it's hard to track down where the exit is happening and set a breakpoint.
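One way around the detached-exit problem, for anyone retracing this, is to stay attached and break on the termination paths themselves, so a backtrace is available at the moment the process goes away; a sketch of the gdb side, using standard libc/syscall breakpoints rather than anything Kamailio-specific:

```
# Sketch: attach to the suspect UDP receiver and stop on the usual
# "normal exit" paths, then grab a backtrace when one of them fires.
gdb -q -p 15614
(gdb) break exit
(gdb) break _exit
(gdb) break abort
(gdb) catch syscall exit_group
(gdb) continue
# ...once a breakpoint or catchpoint triggers:
(gdb) bt
```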
I did get this sequence of events on the last crash:
```
Sep 26 14:01:30 switch /sbin/kamailio[15614]: ERROR: <core> [receive.c:173]: receive_msg(): core parsing of SIP message failed (x.x.x.x:1111/1)
Sep 26 14:01:33 switch /sbin/kamailio[15614]: ERROR: tm [t_reply.c:533]: _reply_light(): ERROR: _reply_light: can't generate 487 reply when a final 603 was sent out
Sep 26 14:01:35 switch /sbin/kamailio[15613]: ALERT: <core> [main.c:740]: handle_sigs(): child process 15614 exited normally, status=255
```
And that was a UDP receiver:
```
15614 udp receiver child=0 sock=y.y.y.y:5060
```
I assume that the child just aborted itself, via abort(). This is used in some places when Kamailio detects a critical condition from which it can't safely continue processing.
There are aborts like this in the TCP core path as well; that was my first guess. They also exist in tm, though many of them are behind #defines. Do you have EXTRA_DEBUG activated as a compile-time define, by chance?
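(If it helps to check: whether the binary was built with it should show up in the compile-time flag list the binary prints about itself; a sketch, assuming EXTRA_DEBUG appears in the `flags:` line of `kamailio -v` when enabled:)

```
# Sketch: look for EXTRA_DEBUG among the running binary's compile-time flags.
# Assumes the define is included in the "flags:" line when it was enabled.
kamailio -v | grep -i extra_debug
```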
This has evolved into a goose chase leading into `app_perl`, I think, as the exit seems to be occurring during an interpreter reload, possibly upon reaching `reset_cycles`. Stand by for further info.
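(For anyone following along: the value driving that reload cadence can at least be checked in the config; a sketch, with the config path as an assumption:)

```
# Sketch: see how reset_cycles is configured for app_perl.
# The config path is an assumption; adjust for the local layout.
grep -n 'reset_cycles' /etc/kamailio/kamailio.cfg
```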
Closed #2075.
I think this has evolved into an issue that is quite clearly not a Kamailio issue, so I am going to close it.
Ok, good to know :)