Running Kamailio 4.4.7:97f308 with heavy app_perl usage, usrloc (db_mode 3) and not much else.

After an upgrade from 4.1, Kamailio periodically dies with a log message like this:

Sep 25 20:54:28 switch /sbin/kamailio[29771]: CRITICAL: <core> [pass_fd.c:277]: receive_fd(): EOF on 9

Also happens on 4.2.x.

It happens at a very regular interval of about 30-31 minutes, which suggests that some periodic background operation elsewhere on this operator's system is triggering it, but I haven't been able to find it.
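To pin down the interval more precisely, something like the following could be run over the syslog file. It is only a rough sketch: the function name `eof_intervals` and the `year` parameter are made up for illustration, and it assumes the log lines look like the one quoted above (classic syslog timestamps without a year, English month abbreviations).

```python
import re
from datetime import datetime

# Matches syslog lines such as:
#   Sep 25 20:54:28 switch /sbin/kamailio[29771]: CRITICAL: ... receive_fd(): EOF on 9
EOF_RE = re.compile(r"^(\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s.*receive_fd\(\): EOF on")


def eof_intervals(lines, year=2000):
    """Return the gaps, in minutes, between consecutive receive_fd() EOF lines.

    Syslog timestamps carry no year, so one is assumed (hypothetical
    parameter); it only matters if the log spans a year boundary.
    """
    stamps = []
    for line in lines:
        match = EOF_RE.match(line)
        if match:
            # Collapse the double space syslog uses for single-digit days.
            stamp = " ".join(match.group(1).split())
            stamps.append(
                datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
            )
    # Pair each timestamp with its successor and convert the gap to minutes.
    return [(b - a).total_seconds() / 60 for a, b in zip(stamps, stamps[1:])]
```

Passing an open log file (or any iterable of lines) to `eof_intervals()` gives a list of gaps in minutes, which can then be compared against cron schedules, logrotate runs, or monitoring intervals on the host.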

The PID in the log message is that of the TCP main process, i.e. the last entry here:

# kamcmd ps | grep -i tcp
15716	tcp receiver (generic) child=0
15717	tcp receiver (generic) child=1
15718	tcp receiver (generic) child=2
15719	tcp receiver (generic) child=3
15720	tcp receiver (generic) child=4
15721	tcp receiver (generic) child=5
15722	tcp receiver (generic) child=6
15723	tcp receiver (generic) child=7
15724	tcp receiver (generic) child=8
15725	tcp receiver (generic) child=9
15726	tcp receiver (generic) child=10
15727	tcp receiver (generic) child=11
15728	tcp receiver (generic) child=12
15729	tcp receiver (generic) child=13
15730	tcp receiver (generic) child=14
15731	tcp receiver (generic) child=15
15732	tcp main process

In this case, that would be 15732.

I assume this is because one of the TCP receiver processes dies, but I haven't been able to find any evidence of that. This is a high-volume system, so I can't reduce the worker process pool too much, but I tried reducing the number of child processes per listener from 16 to 3 and attaching GDB to each one. They all exit normally upon receipt of SIGTERM:

Program received signal SIGTERM, Terminated.
0x00002b1517a436f3 in __epoll_wait_nocancel () from /lib64/libc.so.6

Yet, it is the TCP main process that shows the EOF in receive_fd().
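For anyone unfamiliar with what that message generally means: as far as I understand it, an EOF while waiting for a passed file descriptor just means the other end of the Unix socket was closed, which is also what happens when the process holding it exits. Here is a minimal sketch of that behaviour in Python (3.9+, Unix only). It is not Kamailio's code; the `receive_fd` name here is just borrowed for the illustration.

```python
import os
import socket


def receive_fd(sock):
    """Receive one descriptor passed over a Unix socket.

    Returns the descriptor, -1 if data arrived without a descriptor,
    or None on EOF (the peer closed its end of the socket).
    """
    data, fds, _flags, _addr = socket.recv_fds(sock, 1, 1)
    if not data:
        return None  # zero-length read: peer end is closed
    return fds[0] if fds else -1


# Stand-ins for the two ends of the socket shared by two processes.
main_end, receiver_end = socket.socketpair()

# Normal case: the peer passes a descriptor along with one byte of data.
read_end, write_end = os.pipe()
socket.send_fds(receiver_end, [b"F"], [read_end])
got = receive_fd(main_end)

# Failure case: the peer end goes away (here closed explicitly, standing
# in for the owning process exiting), so the next receive sees EOF.
receiver_end.close()
eof = receive_fd(main_end)

print("descriptor received:", got is not None and got >= 0)
print("EOF after peer closed:", eof is None)
```

So the EOF by itself only says that the peer end went away; it does not say why or in which order the processes exited.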

Because it's not a crash per se, I don't have a core dump or a way of grabbing the state of the program at the exact moment of the crash. All the processes seem to exit normally.

I have read some past issues that mention this, but their ultimate causes don't seem to be relevant here (e.g. no dialog usage). Moreover, the commits made to address this issue in other forms are present in the latest 4.4.x.

Because of the high traffic volume, running with a higher debug verbosity level or trying other fairly obvious ideas (e.g. running without forking) isn't practical.

Any suggestions welcome!

