lunn left a comment (kamailio/kamailio#4185)
I've been helping Mathias with this problem. I've not understood why it is deadlocking, but I have found something along the way.
Some background: https://docs.openssl.org/1.1.1/man7/RAND_DRBG/ documents the "deterministic random bit generator" (DRBG). This is however from version 1.1.1 of openssl, not version 3, which no longer includes this documentation. The basics still seem valid, though.
The DRBG makes use of stacked random number generators. The parent generator is connected to the entropy source, so it is seeded from entropy. The child generators pull their seeds from the parent generator. Seeding happens once when a generator is created, and is then repeated after a time limit, or when sufficient bytes have been taken out of the generator.
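The reseed rule described above can be sketched roughly as follows. This is a toy model, not openssl code: the struct and function names (`toy_reseed_policy`, `toy_generator`, `toy_needs_reseed`) and the limits are made up for illustration, and the output budget follows the description above rather than openssl's actual counters.

```
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical limits after which a generator must pull a fresh seed. */
struct toy_reseed_policy {
    size_t max_output;  /* output budget between reseeds */
    time_t max_age;     /* seconds allowed between reseeds */
};

/* Hypothetical per-generator bookkeeping since the last reseed. */
struct toy_generator {
    size_t output_since_reseed;
    time_t last_reseed;
};

/* True once either the output budget or the time limit is used up. */
static bool toy_needs_reseed(const struct toy_generator *g,
                             const struct toy_reseed_policy *p,
                             time_t now)
{
    if (g->output_since_reseed >= p->max_output)
        return true;
    if (now - g->last_reseed >= p->max_age)
        return true;
    return false;
}
```

With this rule a child only goes back to the parent occasionally, which is why reseeding is normally expected to be infrequent.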
The documentation indicates the child generators are expected to be per thread, and so can be accessed without locking. The parent generator is however accessed by multiple children, so does perform locking, and it is explicitly documented as being thread-safe.
Kamailio however does not use a thread model, but a process model with shared memory. As a result, there is a lot of fun and games to make openssl work correctly in a model it is not intended for.
Openssl is set up in the first process. This causes the parent generator and one child generator to be created. Since the openssl memory allocation functions have been replaced with kamailio versions, these generators end up in the shared memory. The address of the child generator is stored into a thread local key by openssl.
The worker processes are then forked off. They then go and overwrite the thread local key of the child generator, setting it back to 0. As soon as there is need for the child generator, openssl will create a new one for the worker process. Since the collection of thread local keys is per process, each worker process gets its own child generator.
The child generators however share the parent generator, which is in the shared memory which all processes have access to. The locking used on the parent looks at first glance to work happily for both threads and processes using shared memory. The pthread library uses atomic operations to try to do as much as it can in userspace. I've not seen anything which indicates user space atomic operations are not valid on shared memory. When the locks need to block, they call into the kernel on a futex. The futex man page also indicates this is valid, so long as you are not using a FUTEX_PRIVATE_FLAG operation.
So the basic scheme looks O.K.
What I did notice however is that our deadlock happens when the parent generator is reseeding, all the child generators are also reseeding, and all the child generator processes are trying to reseed the parent generator. Why are they trying to reseed the parent?
```
fork_id = openssl_get_fork_id();

if (drbg->fork_id != fork_id) {
    drbg->fork_id = fork_id;
    reseed_required = 1;
}
```
There is additional documentation for `drbg->fork_id`:
```
/*
 * Stores the return value of openssl_get_fork_id() as of when we last
 * reseeded. The DRBG reseeds automatically whenever drbg->fork_id !=
 * openssl_get_fork_id(). Used to provide fork-safety and reseed this
 * DRBG in the child process.
 */
int fork_id;
```
and `openssl_get_fork_id()` is:
```
int openssl_get_fork_id(void)
{
    return getpid();
}
```
Since kamailio is using a process model, not a thread model, each process has its own pid. The primary (the parent generator above) is shared, so its `fork_id` holds the pid of whichever process last reseeded it. With 8 processes running in parallel on a loaded system, it is very likely that the calling pid differs from the stored one every time there is a request for the shared primary to generate random data, and so reseeding is happening pretty much every time, rather than infrequently.
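The effect can be shown with a toy model of the fork check quoted above. This is an illustration only: `toy_drbg`, `toy_generate` and `count_reseeds` are hypothetical names, the pid is passed in instead of calling `getpid()`, and the strict round-robin ordering of callers is an assumption that represents the worst case.

```
/* Hypothetical stand-in for the shared primary: only the field
 * relevant to the fork check is modelled. */
struct toy_drbg {
    int fork_id;       /* pid recorded at the last reseed */
    unsigned reseeds;  /* how many times the fork check fired */
};

/* Mirrors the quoted check, with the caller's pid passed in explicitly. */
static void toy_generate(struct toy_drbg *drbg, int caller_pid)
{
    if (drbg->fork_id != caller_pid) {
        drbg->fork_id = caller_pid;
        drbg->reseeds++;
    }
}

/* Issue `calls` requests, rotating round-robin over `nprocs` pretend
 * pids, and return how many of them triggered a reseed. */
static unsigned count_reseeds(int nprocs, unsigned calls)
{
    struct toy_drbg drbg = { .fork_id = 0, .reseeds = 0 };

    for (unsigned i = 0; i < calls; i++)
        toy_generate(&drbg, 1000 + (int)(i % (unsigned)nprocs));
    return drbg.reseeds;
}
```

In this model `count_reseeds(1, 1000)` returns 1, because only the first call sees a different pid, while `count_reseeds(8, 1000)` returns 1000, because every call sees a pid other than the one stored by the previous caller. Real scheduling will be less regular than round-robin, but with 8 busy workers it will be much closer to the second case than the first.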
As a quick test, I hacked out this fork_id check, so that the primary did not reseed so often. Our test, which normally deadlocks within a handful of seconds, ran for a handful of hours without deadlocking. The deadlock is probably still there, but we less frequently get into a situation where it could happen.
I've not traced where the primary is getting its entropy from. If it is system entropy, constant reseeding is probably not good for the system as a whole. Other random number generators on the machine might be producing less-random numbers as a result? This would be my primary concern with the way openssl is being used.