schneuwlym created an issue (kamailio/kamailio#4185)
### Description
This issue is a follow up of the discussion in the [mailing list](https://lists.kamailio.org/mailman3/hyperkitty/list/sr-users@lists.kamailio....)
We are using Kamailio 5.8.4 in a yocto scarthgap based image. We noticed on multiple nodes, that suddenly phones lost their registrations and the log showed messages like `send_fd(): sendmsg failed sending 55 on 52: Too many references: cannot splice (109)`.
After debugging we guess that the `send_fd` messages are just a symptom, not the cause. Generating a backtrace using `kamctl trap` shows that all TCP receiver are waiting for a `pthread_rwlock_...` lock.
Our deployments, where we randomly see the issue, are very small; 10-30 registrations.
I've tested kamailio version 5.8.4, 5.8.5, 5.8.6 and 6.0.1 and I was able to reproduce the issue with all versions.
I've also tested version 5.5.4, but with an older yocto (dunfell) which used openssl 1.1.1w, **which did not block!**
Yocto Scarthgap uses OpenSSL 3.2.3 and we also updated to version 3.4.1, which didn't show any difference.
### Troubleshooting
#### Reproduction
Meanwhile we found a way to reproduce the issue within seconds. It requires a bit of brute force, but it locks kamailio reliable.
We are generating 3 parallel SIP REGISTER streams with sipexer to 3 different accounts. Following an example of one stream: ``` watch -n 0.1 "./sipexer -register -vl 3 -co -com -ex 3600 -fu "2" -au "2" -ap "test" -tk <path to key>.key -tc <path to certificate>.cer -ti -sd -ru sip:<domain> -cu sip:2@<domain> tls:<fqdn>:5061;sleep 0.1;./sipexer -register -vl 3 -co -com -ex 0 -fu "2" -au "2" -ap "test" -tk <path to key>.key -tc <path to cert>.cer -ti -sd -ru sip:<domain> -cu sip:2@<domain> tls:<fqdn>:5061" ```
Our dialplan is quite complex, but I was also able to reproduce the issue with the default [dialplan](https://raw.githubusercontent.com/kamailio/kamailio/refs/heads/5.8/etc/kamai...) with the following changes: ``` 169a170,171
#!define WITH_TLS
205a208,219
alias=udp:<domain>:5060 alias=udp:127.0.0.1:5060 alias=udp:10.0.6.1:5060 alias=tls:<domain> alias=tls:10.0.6.1:5061 alias=tls:10.165.42.76:5061 alias=tls:<domain>:5061 alias=udp:<domain> listen=udp:127.0.0.1:5060 listen=tls:10.0.6.1:5061 listen=udp:10.0.6.1:5060 listen=tls:10.165.42.76:5061
498c512 < modparam("tls", "config", "/usr/local/etc/kamailio/tls.cfg") ---
modparam("tls", "config", "/etc/kamailio/kamailio-tls.cfg")
```
``` # cat /etc/kamailio/kamailio-tls.cfg [server:default] method = TLSv1+ verify_certificate = yes require_certificate = yes private_key = "<path to key>.key" certificate = "<path to cert>.crt" ca_list = "<path to ca>.pem"
[client:default] method = TLSv1+ verify_certificate = yes require_certificate = yes private_key = "<path to key>.key" certificate = "<path to cert>.crt" ca_list = "<path to ca>.pem"
```
#### Debugging Data
Following the backtrace of one tcp listener process. The full backtrace can be found [here](https://github.com/user-attachments/files/19430313/gdb_kamailio_20250321_154...)
``` ---start 4608 ----------------------------------------------------- [Thread debugging using libthread_db enabled] Using host libthread_db library "/usr/lib/libthread_db.so.1". __futex_abstimed_wait_common (futex_word=futex_word@entry=0x7fc855758a0c, expected=expected@entry=3, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=<optimized out>, cancel=cancel@entry=false) at futex-internal.c:103
warning: 103 futex-internal.c: No such file or directory #0 __futex_abstimed_wait_common (futex_word=futex_word@entry=0x7fc855758a0c, expected=expected@entry=3, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=<optimized out>, cancel=cancel@entry=false) at futex-internal.c:103 err = -512 clockbit = 256 op = <optimized out> #1 0x00007fc85a59429c in __GI___futex_abstimed_wait64 (futex_word=futex_word@entry=0x7fc855758a0c, expected=expected@entry=3, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=<optimized out>) at futex-internal.c:128 No locals. #2 0x00007fc85a59d830 in __pthread_rwlock_wrlock_full64 (abstime=0x0, clockid=0, rwlock=0x7fc855758a00) at /usr/src/debug/glibc/2.39+git/nptl/pthread_rwlock_common.c:730 private = <optimized out> wf = <optimized out> err = <optimized out> prefer_writer = false may_share_futex_used_flag = true wpf = <optimized out> ready = <optimized out> r = <optimized out> may_share_futex_used_flag = <optimized out> r = <optimized out> done = <optimized out> wpf = <optimized out> ready = <optimized out> __value = <optimized out> prefer_writer = <optimized out> private = <optimized out> wf = <optimized out> err = <optimized out> w = <optimized out> w = <optimized out> private = <optimized out> err = <optimized out> w = <optimized out> wf = <optimized out> wf = <optimized out> __value = <optimized out> #3 ___pthread_rwlock_wrlock (rwlock=0x7fc855758a00) at pthread_rwlock_wrlock.c:26 result = <optimized out> #4 0x00007fc859820b49 in CRYPTO_THREAD_write_lock (lock=<optimized out>) at /usr/src/debug/openssl/3.2.3/crypto/threads_pthread.c:108 No locals. #5 0x00007fc8599bd5da in ossl_prov_drbg_generate (drbg=0x7fc855710040, out=0x7fc856471980 "", outlen=48, strength=256, prediction_resistance=0, adin=0x7ffeec8b67e0 "@", adinlen=8) at /usr/src/debug/openssl/3.2.3/providers/implementations/rands/drbg.c:645 fork_id = <optimized out> reseed_required = 0 ret = 0 __func__ = "ossl_prov_drbg_generate" err = <optimized out> now = <optimized out> #6 0x00007fc8599bd9ca in ossl_drbg_get_seed (vdrbg=<optimized out>, pout=0x7ffeec8b6880, entropy=<optimized out>, min_len=<optimized out>, max_len=<optimized out>, prediction_resistance=0, adin=0x7ffeec8b6838 "\220'\037V\310\177", adin_len=8) at /usr/src/debug/openssl/3.2.3/providers/implementations/rands/drbg.c:175 drbg = 0x7fc855710040 bytes_needed = 48 buffer = 0x7fc856471980 "" __func__ = "ossl_drbg_get_seed" #7 0x00007fc8599bc907 in get_entropy (drbg=<optimized out>, drbg@entry=0x7fc8561f2790, pout=pout@entry=0x7ffeec8b6880, entropy=<optimized out>, min_len=48, max_len=48, prediction_resistance=prediction_resistance@entry=0) at /usr/src/debug/openssl/3.2.3/providers/implementations/rands/drbg.c:245 bytes = <optimized out> p_str = 256 __func__ = "get_entropy" #8 0x00007fc8599bd26c in ossl_prov_drbg_reseed_unlocked (drbg=drbg@entry=0x7fc8561f2790, prediction_resistance=prediction_resistance@entry=0, ent=ent@entry=0x0, ent_len=ent_len@entry=0, adin=adin@entry=0x0, adinlen=adinlen@entry=0) at /usr/src/debug/openssl/3.2.3/providers/implementations/rands/drbg.c:573 entropy = 0x0 entropylen = 0 __func__ = "ossl_prov_drbg_reseed_unlocked" #9 0x00007fc8599bd68e in ossl_prov_drbg_generate (adinlen=0, adin=0x0, prediction_resistance=0, strength=<optimized out>, outlen=32, out=0x7fc8564717c8 "", drbg=0x7fc8561f2790) at /usr/src/debug/openssl/3.2.3/providers/implementations/rands/drbg.c:697 fork_id = <optimized out> reseed_required = <optimized out> ret = <optimized out> fork_id = <optimized out> reseed_required = <optimized out> ret = <optimized out> __func__ = "ossl_prov_drbg_generate" err = <optimized out> now = <optimized out> #10 ossl_prov_drbg_generate (drbg=0x7fc8561f2790, out=0x7fc8564717c8 "", outlen=32, strength=<optimized out>, prediction_resistance=0, adin=0x0, adinlen=0) at /usr/src/debug/openssl/3.2.3/providers/implementations/rands/drbg.c:634 fork_id = <optimized out> reseed_required = 0 ret = 0 __func__ = "ossl_prov_drbg_generate" err = <optimized out> now = <optimized out> #11 0x00007fc8597e75c6 in evp_rand_generate_locked (ctx=ctx@entry=0x7fc8561f26f0, out=out@entry=0x7fc8564717c8 "", outlen=outlen@entry=32, strength=strength@entry=0, prediction_resistance=<optimized out>, prediction_resistance@entry=0, addin=0x0, addin_len=0) at /usr/src/debug/openssl/3.2.3/crypto/evp/evp_rand.c:570 chunk = 32 max_request = 65536 params = {{key = 0x7fc8599e9521 "max_request", data_type = 2, data = 0x7ffeec8b6968, data_size = 8, return_size = 8}, {key = 0x0, data_type = 0, data = 0x0, data_size = 0, return_size = 0}} __func__ = "evp_rand_generate_locked" #12 0x00007fc8597e840b in EVP_RAND_generate (ctx=0x7fc8561f26f0, out=out@entry=0x7fc8564717c8 "", outlen=outlen@entry=32, strength=strength@entry=0, prediction_resistance=prediction_resistance@entry=0, addin=addin@entry=0x0, addin_len=0) at /usr/src/debug/openssl/3.2.3/crypto/evp/evp_rand.c:592 res = <optimized out> #13 0x00007fc8598f9282 in RAND_bytes_ex (ctx=warning: could not convert 'ossl_lib_ctx_st' from the host encoding (ANSI_X3.4-1968) to UTF-32. This normally should not happen, please file a bug report. 0x0, buf=0x7fc8564717c8 "", num=32, strength=0) at /usr/src/debug/openssl/3.2.3/crypto/rand/rand_lib.c:380 rand = <optimized out> meth = <optimized out> __func__ = "RAND_bytes_ex" #14 0x00007fc859bdb62c in def_generate_session_id (ssl=ssl@entry=0x7fc856436e50, id=id@entry=0x7fc8564717c8 "", id_len=id_len@entry=0x7ffeec8b6ae4) at /usr/src/debug/openssl/3.2.3/ssl/ssl_sess.c:326 retry = 0 #15 0x00007fc859bdc132 in ssl_generate_session_id (s=s@entry=0x7fc856436e50, ss=ss@entry=0x7fc856471570) at /usr/src/debug/openssl/3.2.3/ssl/ssl_sess.c:409 tmp = 32 cb = 0x7fc859bdb600 <def_generate_session_id> ssl = <optimized out> __func__ = "ssl_generate_session_id" #16 0x00007fc859bdc31e in ssl_get_new_session (s=s@entry=0x7fc856436e50, session=session@entry=1) at /usr/src/debug/openssl/3.2.3/ssl/ssl_sess.c:464 ss = 0x7fc856471570 __func__ = "ssl_get_new_session" #17 0x00007fc859c3f169 in tls_early_post_process_client_hello (s=0x7fc856436e50) at /usr/src/debug/openssl/3.2.3/ssl/statem/statem_srvr.c:1898 i = <optimized out> loop = <optimized out> sctx = 0x7fc855d24280 j = <optimized out> dgrd = DOWNGRADE_NONE protverr = <optimized out> c = <optimized out> clienthello = 0x7fc856470a90 al = 80 id = <optimized out> comp = 0x0 ciphers = 0x7fc856471330 scsvs = 0x7fc8564713c0 ssl = 0x7fc856436e50 j = <optimized out> i = <optimized out> al = <optimized out> protverr = <optimized out> loop = <optimized out> id = <optimized out> comp = <optimized out> c = <optimized out> ciphers = <optimized out> scsvs = <optimized out> clienthello = <optimized out> dgrd = <optimized out> sctx = <optimized out> ssl = <optimized out> __func__ = "tls_early_post_process_client_hello" err = <optimized out> cipher = <optimized out> trc_out = <optimized out> pos = <optimized out> pref_cipher = <optimized out> master_key_length = <optimized out> m = <optimized out> comp_id = <optimized out> k = <optimized out> m = <optimized out> nn = <optimized out> v = <optimized out> done = <optimized out> o = <optimized out> #18 tls_post_process_client_hello (s=0x7fc856436e50, wst=<optimized out>) at /usr/src/debug/openssl/3.2.3/ssl/statem/statem_srvr.c:2295 rv = <optimized out> cipher = <optimized out> ssl = 0x7fc856436e50 err = <optimized out> __func__ = "tls_post_process_client_hello" #19 0x00007fc859c2bc1b in read_state_machine (s=0x7fc856436e50) at /usr/src/debug/openssl/3.2.3/ssl/statem/statem.c:712 mt = 1 pkt = {curr = 0x7fc856466125 "", remaining = 0} max_message_size = 0x7fc859c3e3e0 <ossl_statem_server_max_message_size> ret = <optimized out> process_message = 0x7fc859c42400 <ossl_statem_server_process_message> post_process_message = 0x7fc859c414f0 <ossl_statem_server_post_process_message> st = 0x7fc856436ee0 len = 273 transition = 0x7fc859c3d330 <ossl_statem_server_read_transition> cb = 0x7fc859c91d90 <sr_ssl_ctx_info_callback> ssl = 0x7fc856436e50 st = <optimized out> ret = <optimized out> mt = <optimized out> len = <optimized out> transition = <optimized out> pkt = <optimized out> process_message = <optimized out> post_process_message = <optimized out> max_message_size = <optimized out> cb = <optimized out> ssl = <optimized out> __func__ = "read_state_machine" #20 state_machine (s=0x7fc856436e50, server=1) at /usr/src/debug/openssl/3.2.3/ssl/statem/statem.c:478 buf = 0x0 cb = 0x7fc859c91d90 <sr_ssl_ctx_info_callback> st = 0x7fc856436ee0 ret = -1 ssret = <optimized out> ssl = 0x7fc856436e50 __func__ = "state_machine" #21 0x00007fc859cbf036 in tls_accept (c=c@entry=0x7fc856461be0, error=error@entry=0x7ffeec8b6dd8) at tls_server.c:457 ret = <optimized out> ssl = 0x7fc856436e50 cert = <optimized out> tls_c = 0x7fc8564171b0 tls_log = <optimized out> __func__ = "tls_accept" ```
### Additional Information
* **Kamailio Version** - output of `kamailio -v`
``` version: kamailio 5.8.4 (x86_64/linux) f1d83d flags: USE_TCP, USE_TLS, USE_SCTP, TLS_HOOKS, USE_RAW_SOCKS, DISABLE_NAGLE, USE_MCAST, DNS_IP_HACK, SHM_MMAP, PKG_MALLOC, MEM_JOIN_FREE, Q_MALLOC, F_MALLOC, TLSF_MALLOC, DBG_SR_MEMORY, USE_FUTEX, FAST_LOCK-ADAPTIVE_WAIT, USE_DNS_CACHE, USE_DNS_FAILOVER, USE_NAPTR, USE_DST_BLOCKLIST, HAVE_RESOLV_RES ADAPTIVE_WAIT_LOOPS 1024, MAX_RECV_BUFFER_SIZE 262144, MAX_SEND_BUFFER_SIZE 262144, MAX_URI_SIZE 1024, BUF_SIZE 65535, DEFAULT PKG_SIZE 8MB poll method support: poll, epoll_lt, epoll_et, sigio_rt, select. id: f1d83d compiled on 10:25:58 Nov 12 2024 with x86_64-ruag-linux-gcc 13.3.0
```
* **Operating System**:
It is a yocto scarthgap based image (Version 5.0.6 LTS)
``` Linux ttel 6.6.54-yocto-standard #1 SMP PREEMPT_DYNAMIC Tue Oct 8 03:03:40 UTC 2024 x86_64 GNU/Linux ```
miconda left a comment (kamailio/kamailio#4185)
Is this parameter set to 2?
- https://www.kamailio.org/wikidocs/cookbooks/5.8.x/core/#tls_threads_mode
schneuwlym left a comment (kamailio/kamailio#4185)
Is this parameter set to 2?
I tried 0-3 but it blocks in all cases... I also played with https://www.kamailio.org/docs/modules/stable/modules/tls.html#tls.p.init_mod... but also didn't find a working setup.
miconda left a comment (kamailio/kamailio#4185)
Can you try K 6.0.1? Be sure it is with `tls_threads_mode=2` (or not set, the default)
schneuwlym left a comment (kamailio/kamailio#4185)
Can you try K 6.0.1? Be sure it is with `tls_threads_mode=2` (or not set, the default)
I can give another try, but while tinkering I already tried it without success. But I don't remember whether I used `tls_threads_mode=2`. I can try again later.
At the moment some other tinkering is going on. We just tried to disable RW locks and now the test is running since minutes where it failed before after a few seconds ``` diff --git a/crypto/threads_pthread.c b/crypto/threads_pthread.c index 59ddcdbff832..9f61ddc820b5 100644 --- a/crypto/threads_pthread.c +++ b/crypto/threads_pthread.c @@ -38,9 +38,12 @@
# include <assert.h>
+/* HACK: Disable the use of RW locks */ +/* # ifdef PTHREAD_RWLOCK_INITIALIZER # define USE_RWLOCK # endif +*/
CRYPTO_RWLOCK *CRYPTO_THREAD_lock_new(void) {
```
henningw left a comment (kamailio/kamailio#4185)
Just to add some context, the above change is in OpenSSL. It would be great if you can also try with an unmodified OpenSSL with tls_thread_mode = 2. We had no reports of issues after the introduction of this parameter with OpenSSL 3.x, and many people are also using it on high load environments. It might be specific to some patches or the exact configuration that the embedded yocto linux is using, as most people are using Debian based OS or something of the CentOS family.
schneuwlym left a comment (kamailio/kamailio#4185)
We are using the official openssl recipe from yocto (https://git.yoctoproject.org/poky/tree/meta/recipes-connectivity/openssl?h=s...) version 3.2.3 and it seems quite clean. There are not many patches and nothing which pops in my eyes.
It would be great if you can also try with an unmodified OpenSSL with tls_thread_mode = 2
I can give a try, but as I said, it looks quite clean already and tls_thread_mode didn't work for us.
as most people are using Debian based OS or something of the CentOS family
I don't know the CentOS familiy, but my Ubuntu 24.04 (LTS) uses OpenSSL 3.0.X and Debian Bookworm uses 3.0.X as well. It seems as the RWLOCKs were introduced in 3.2.X.
I'm also surprised that we are the only one who see the issue, but something is odd and we don't know yet where or what...
henningw left a comment (kamailio/kamailio#4185)
It might be something related to OpenSSL 3.2.x, which would be funny again. Ubuntu 25.04 comes with OpenSSL 3.4 it seems, but this is of course not a LTS release so it will not used a lot. This might be another interesting thing to try.
lunn left a comment (kamailio/kamailio#4185)
I've been helping Mathias with this problem. I've not understood why it is deadlocking, but i have found something along the way.
Some background. https://docs.openssl.org/1.1.1/man7/RAND_DRBG/ documents the "deterministic random bit generator". This is however from version 1.1.1 of openssl, not version 3. Version 3 does not include this documentation any more. However, the basics still seem valid.
drbg makes use of stacked random number generators. The parent generator is connected to the entropy source. Thus it is seeded from entropy. The child generators pull seeds from the parent generator. Seeding happens once when the generator is created, and is then repeated after a time limit, or when sufficient bytes have been taken out of the generator.
The documentation indicates the child generators are expected to be per thread, and so can be accessed without locking. The parent generator is however accessed by multiple children, so does perform locking, and it is explicitly documented as being thread-safe.
Kamailio however does not use a thread model, but a process model with shared memory. As a result, there is a lot of fun and games to make openssl work correctly in a model it is not intended for.
Openssl is setup in the first process. This causes the parent generator and one child generator to be created. Since the openssl memory allocation functions have been replaced with kamailio versions, these generators end up in the shared memory. The address of the child generator is stored into a thread local key by openssl.
The worker processes are then forked off. They then go and overwrite the thread local key of the child generator, setting it back to 0. As soon as there is need for the child generator, openssl will create a new one for the worker process. Since the collection of thread local keys are per process, each worker process gets its own child generator.
The child generators however share the parent generator, which is in the shared memory which all processes have access to. The locking used on the parent looks at first glance to work happily for both threads and processes using shared memory. The pthread library uses atomic operations to try to do as much as it can in userspace. I've not seen anything which indicates user space atomic operations are not valid on shared memory. When the locks need to block, they call into the kernel on a futex. The futex man page also indicates this is valid, so long as you are not using a FUTEX_PRIVATE_FLAG operation.
So the basic scheme looks O.K.
What i did notice however is that our deadlock happens when the parent generator is reseeding. And all child generators are also reseeding. And all child generators processes are trying to reseed the parent generator. Why are they trying to reseed the parent?
``` fork_id = openssl_get_fork_id();
if (drbg->fork_id != fork_id) { drbg->fork_id = fork_id; reseed_required = 1; } ``` There is additional documentation for drgb->fork_id: ``` /* * Stores the return value of openssl_get_fork_id() as of when we last * reseeded. The DRBG reseeds automatically whenever drbg->fork_id != * openssl_get_fork_id(). Used to provide fork-safety and reseed this * DRBG in the child process. */ int fork_id; ```
and `openssl_get_fork_id()` is:
``` int openssl_get_fork_id(void) { return getpid(); } ```
Since kamailio is using a process model, not a thread model, each process has its own pid. So with 8 processes running in parallel, and the system is loaded, it is very likely that the pid is different every time there is a request for the shared primary to generate random data, and so reseeding is happening pretty much every time, rather than infrequently.
As a quick test, i hacked out this fork_id check, so that the primary did not reseed so often. Our test which deadlocks within a handful of seconds ran for a handful of hours without deadlocking. The deadlock is probably still there, but we less frequently get into a situation where the deadlock could happen.
I've not traced where the primary is getting its entropy from. If it is system entropy, that is probably not good for the system as a whole. Other random number generators on the machine might be producing less random numbers? This would be my primary concern with the way openssl is being used.