schneuwlym created an issue (kamailio/kamailio#4185)
### Description
This issue is a follow up of the discussion in the [mailing list](https://lists.kamailio.org/mailman3/hyperkitty/list/sr-users@lists.kamailio....)
We are using Kamailio 5.8.4 in a yocto scarthgap based image. We noticed on multiple nodes, that suddenly phones lost their registrations and the log showed messages like `send_fd(): sendmsg failed sending 55 on 52: Too many references: cannot splice (109)`.
After debugging we guess that the `send_fd` messages are just a symptom, not the cause. Generating a backtrace using `kamctl trap` shows that all TCP receiver are waiting for a `pthread_rwlock_...` lock.
Our deployments, where we randomly see the issue, are very small; 10-30 registrations.
I've tested kamailio version 5.8.4, 5.8.5, 5.8.6 and 6.0.1 and I was able to reproduce the issue with all versions.
I've also tested version 5.5.4, but with an older yocto (dunfell) which used openssl 1.1.1w, **which did not block!**
Yocto Scarthgap uses OpenSSL 3.2.3 and we also updated to version 3.4.1, which didn't show any difference.
### Troubleshooting
#### Reproduction
Meanwhile we found a way to reproduce the issue within seconds. It requires a bit of brute force, but it locks kamailio reliable.
We are generating 3 parallel SIP REGISTER streams with sipexer to 3 different accounts. Following an example of one stream: ``` watch -n 0.1 "./sipexer -register -vl 3 -co -com -ex 3600 -fu "2" -au "2" -ap "test" -tk <path to key>.key -tc <path to certificate>.cer -ti -sd -ru sip:<domain> -cu sip:2@<domain> tls:<fqdn>:5061;sleep 0.1;./sipexer -register -vl 3 -co -com -ex 0 -fu "2" -au "2" -ap "test" -tk <path to key>.key -tc <path to cert>.cer -ti -sd -ru sip:<domain> -cu sip:2@<domain> tls:<fqdn>:5061" ```
Our dialplan is quite complex, but I was also able to reproduce the issue with the default [dialplan](https://raw.githubusercontent.com/kamailio/kamailio/refs/heads/5.8/etc/kamai...) with the following changes: ``` 169a170,171
#!define WITH_TLS
205a208,219
alias=udp:<domain>:5060 alias=udp:127.0.0.1:5060 alias=udp:10.0.6.1:5060 alias=tls:<domain> alias=tls:10.0.6.1:5061 alias=tls:10.165.42.76:5061 alias=tls:<domain>:5061 alias=udp:<domain> listen=udp:127.0.0.1:5060 listen=tls:10.0.6.1:5061 listen=udp:10.0.6.1:5060 listen=tls:10.165.42.76:5061
498c512 < modparam("tls", "config", "/usr/local/etc/kamailio/tls.cfg") ---
modparam("tls", "config", "/etc/kamailio/kamailio-tls.cfg")
```
``` # cat /etc/kamailio/kamailio-tls.cfg [server:default] method = TLSv1+ verify_certificate = yes require_certificate = yes private_key = "<path to key>.key" certificate = "<path to cert>.crt" ca_list = "<path to ca>.pem"
[client:default] method = TLSv1+ verify_certificate = yes require_certificate = yes private_key = "<path to key>.key" certificate = "<path to cert>.crt" ca_list = "<path to ca>.pem"
```
#### Debugging Data
Following the backtrace of one tcp listener process. The full backtrace can be found [here](https://github.com/user-attachments/files/19430313/gdb_kamailio_20250321_154...)
``` ---start 4608 ----------------------------------------------------- [Thread debugging using libthread_db enabled] Using host libthread_db library "/usr/lib/libthread_db.so.1". __futex_abstimed_wait_common (futex_word=futex_word@entry=0x7fc855758a0c, expected=expected@entry=3, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=<optimized out>, cancel=cancel@entry=false) at futex-internal.c:103
warning: 103 futex-internal.c: No such file or directory #0 __futex_abstimed_wait_common (futex_word=futex_word@entry=0x7fc855758a0c, expected=expected@entry=3, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=<optimized out>, cancel=cancel@entry=false) at futex-internal.c:103 err = -512 clockbit = 256 op = <optimized out> #1 0x00007fc85a59429c in __GI___futex_abstimed_wait64 (futex_word=futex_word@entry=0x7fc855758a0c, expected=expected@entry=3, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=<optimized out>) at futex-internal.c:128 No locals. #2 0x00007fc85a59d830 in __pthread_rwlock_wrlock_full64 (abstime=0x0, clockid=0, rwlock=0x7fc855758a00) at /usr/src/debug/glibc/2.39+git/nptl/pthread_rwlock_common.c:730 private = <optimized out> wf = <optimized out> err = <optimized out> prefer_writer = false may_share_futex_used_flag = true wpf = <optimized out> ready = <optimized out> r = <optimized out> may_share_futex_used_flag = <optimized out> r = <optimized out> done = <optimized out> wpf = <optimized out> ready = <optimized out> __value = <optimized out> prefer_writer = <optimized out> private = <optimized out> wf = <optimized out> err = <optimized out> w = <optimized out> w = <optimized out> private = <optimized out> err = <optimized out> w = <optimized out> wf = <optimized out> wf = <optimized out> __value = <optimized out> #3 ___pthread_rwlock_wrlock (rwlock=0x7fc855758a00) at pthread_rwlock_wrlock.c:26 result = <optimized out> #4 0x00007fc859820b49 in CRYPTO_THREAD_write_lock (lock=<optimized out>) at /usr/src/debug/openssl/3.2.3/crypto/threads_pthread.c:108 No locals. #5 0x00007fc8599bd5da in ossl_prov_drbg_generate (drbg=0x7fc855710040, out=0x7fc856471980 "", outlen=48, strength=256, prediction_resistance=0, adin=0x7ffeec8b67e0 "@", adinlen=8) at /usr/src/debug/openssl/3.2.3/providers/implementations/rands/drbg.c:645 fork_id = <optimized out> reseed_required = 0 ret = 0 __func__ = "ossl_prov_drbg_generate" err = <optimized out> now = <optimized out> #6 0x00007fc8599bd9ca in ossl_drbg_get_seed (vdrbg=<optimized out>, pout=0x7ffeec8b6880, entropy=<optimized out>, min_len=<optimized out>, max_len=<optimized out>, prediction_resistance=0, adin=0x7ffeec8b6838 "\220'\037V\310\177", adin_len=8) at /usr/src/debug/openssl/3.2.3/providers/implementations/rands/drbg.c:175 drbg = 0x7fc855710040 bytes_needed = 48 buffer = 0x7fc856471980 "" __func__ = "ossl_drbg_get_seed" #7 0x00007fc8599bc907 in get_entropy (drbg=<optimized out>, drbg@entry=0x7fc8561f2790, pout=pout@entry=0x7ffeec8b6880, entropy=<optimized out>, min_len=48, max_len=48, prediction_resistance=prediction_resistance@entry=0) at /usr/src/debug/openssl/3.2.3/providers/implementations/rands/drbg.c:245 bytes = <optimized out> p_str = 256 __func__ = "get_entropy" #8 0x00007fc8599bd26c in ossl_prov_drbg_reseed_unlocked (drbg=drbg@entry=0x7fc8561f2790, prediction_resistance=prediction_resistance@entry=0, ent=ent@entry=0x0, ent_len=ent_len@entry=0, adin=adin@entry=0x0, adinlen=adinlen@entry=0) at /usr/src/debug/openssl/3.2.3/providers/implementations/rands/drbg.c:573 entropy = 0x0 entropylen = 0 __func__ = "ossl_prov_drbg_reseed_unlocked" #9 0x00007fc8599bd68e in ossl_prov_drbg_generate (adinlen=0, adin=0x0, prediction_resistance=0, strength=<optimized out>, outlen=32, out=0x7fc8564717c8 "", drbg=0x7fc8561f2790) at /usr/src/debug/openssl/3.2.3/providers/implementations/rands/drbg.c:697 fork_id = <optimized out> reseed_required = <optimized out> ret = <optimized out> fork_id = <optimized out> reseed_required = <optimized out> ret = <optimized out> __func__ = "ossl_prov_drbg_generate" err = <optimized out> now = <optimized out> #10 ossl_prov_drbg_generate (drbg=0x7fc8561f2790, out=0x7fc8564717c8 "", outlen=32, strength=<optimized out>, prediction_resistance=0, adin=0x0, adinlen=0) at /usr/src/debug/openssl/3.2.3/providers/implementations/rands/drbg.c:634 fork_id = <optimized out> reseed_required = 0 ret = 0 __func__ = "ossl_prov_drbg_generate" err = <optimized out> now = <optimized out> #11 0x00007fc8597e75c6 in evp_rand_generate_locked (ctx=ctx@entry=0x7fc8561f26f0, out=out@entry=0x7fc8564717c8 "", outlen=outlen@entry=32, strength=strength@entry=0, prediction_resistance=<optimized out>, prediction_resistance@entry=0, addin=0x0, addin_len=0) at /usr/src/debug/openssl/3.2.3/crypto/evp/evp_rand.c:570 chunk = 32 max_request = 65536 params = {{key = 0x7fc8599e9521 "max_request", data_type = 2, data = 0x7ffeec8b6968, data_size = 8, return_size = 8}, {key = 0x0, data_type = 0, data = 0x0, data_size = 0, return_size = 0}} __func__ = "evp_rand_generate_locked" #12 0x00007fc8597e840b in EVP_RAND_generate (ctx=0x7fc8561f26f0, out=out@entry=0x7fc8564717c8 "", outlen=outlen@entry=32, strength=strength@entry=0, prediction_resistance=prediction_resistance@entry=0, addin=addin@entry=0x0, addin_len=0) at /usr/src/debug/openssl/3.2.3/crypto/evp/evp_rand.c:592 res = <optimized out> #13 0x00007fc8598f9282 in RAND_bytes_ex (ctx=warning: could not convert 'ossl_lib_ctx_st' from the host encoding (ANSI_X3.4-1968) to UTF-32. This normally should not happen, please file a bug report. 0x0, buf=0x7fc8564717c8 "", num=32, strength=0) at /usr/src/debug/openssl/3.2.3/crypto/rand/rand_lib.c:380 rand = <optimized out> meth = <optimized out> __func__ = "RAND_bytes_ex" #14 0x00007fc859bdb62c in def_generate_session_id (ssl=ssl@entry=0x7fc856436e50, id=id@entry=0x7fc8564717c8 "", id_len=id_len@entry=0x7ffeec8b6ae4) at /usr/src/debug/openssl/3.2.3/ssl/ssl_sess.c:326 retry = 0 #15 0x00007fc859bdc132 in ssl_generate_session_id (s=s@entry=0x7fc856436e50, ss=ss@entry=0x7fc856471570) at /usr/src/debug/openssl/3.2.3/ssl/ssl_sess.c:409 tmp = 32 cb = 0x7fc859bdb600 <def_generate_session_id> ssl = <optimized out> __func__ = "ssl_generate_session_id" #16 0x00007fc859bdc31e in ssl_get_new_session (s=s@entry=0x7fc856436e50, session=session@entry=1) at /usr/src/debug/openssl/3.2.3/ssl/ssl_sess.c:464 ss = 0x7fc856471570 __func__ = "ssl_get_new_session" #17 0x00007fc859c3f169 in tls_early_post_process_client_hello (s=0x7fc856436e50) at /usr/src/debug/openssl/3.2.3/ssl/statem/statem_srvr.c:1898 i = <optimized out> loop = <optimized out> sctx = 0x7fc855d24280 j = <optimized out> dgrd = DOWNGRADE_NONE protverr = <optimized out> c = <optimized out> clienthello = 0x7fc856470a90 al = 80 id = <optimized out> comp = 0x0 ciphers = 0x7fc856471330 scsvs = 0x7fc8564713c0 ssl = 0x7fc856436e50 j = <optimized out> i = <optimized out> al = <optimized out> protverr = <optimized out> loop = <optimized out> id = <optimized out> comp = <optimized out> c = <optimized out> ciphers = <optimized out> scsvs = <optimized out> clienthello = <optimized out> dgrd = <optimized out> sctx = <optimized out> ssl = <optimized out> __func__ = "tls_early_post_process_client_hello" err = <optimized out> cipher = <optimized out> trc_out = <optimized out> pos = <optimized out> pref_cipher = <optimized out> master_key_length = <optimized out> m = <optimized out> comp_id = <optimized out> k = <optimized out> m = <optimized out> nn = <optimized out> v = <optimized out> done = <optimized out> o = <optimized out> #18 tls_post_process_client_hello (s=0x7fc856436e50, wst=<optimized out>) at /usr/src/debug/openssl/3.2.3/ssl/statem/statem_srvr.c:2295 rv = <optimized out> cipher = <optimized out> ssl = 0x7fc856436e50 err = <optimized out> __func__ = "tls_post_process_client_hello" #19 0x00007fc859c2bc1b in read_state_machine (s=0x7fc856436e50) at /usr/src/debug/openssl/3.2.3/ssl/statem/statem.c:712 mt = 1 pkt = {curr = 0x7fc856466125 "", remaining = 0} max_message_size = 0x7fc859c3e3e0 <ossl_statem_server_max_message_size> ret = <optimized out> process_message = 0x7fc859c42400 <ossl_statem_server_process_message> post_process_message = 0x7fc859c414f0 <ossl_statem_server_post_process_message> st = 0x7fc856436ee0 len = 273 transition = 0x7fc859c3d330 <ossl_statem_server_read_transition> cb = 0x7fc859c91d90 <sr_ssl_ctx_info_callback> ssl = 0x7fc856436e50 st = <optimized out> ret = <optimized out> mt = <optimized out> len = <optimized out> transition = <optimized out> pkt = <optimized out> process_message = <optimized out> post_process_message = <optimized out> max_message_size = <optimized out> cb = <optimized out> ssl = <optimized out> __func__ = "read_state_machine" #20 state_machine (s=0x7fc856436e50, server=1) at /usr/src/debug/openssl/3.2.3/ssl/statem/statem.c:478 buf = 0x0 cb = 0x7fc859c91d90 <sr_ssl_ctx_info_callback> st = 0x7fc856436ee0 ret = -1 ssret = <optimized out> ssl = 0x7fc856436e50 __func__ = "state_machine" #21 0x00007fc859cbf036 in tls_accept (c=c@entry=0x7fc856461be0, error=error@entry=0x7ffeec8b6dd8) at tls_server.c:457 ret = <optimized out> ssl = 0x7fc856436e50 cert = <optimized out> tls_c = 0x7fc8564171b0 tls_log = <optimized out> __func__ = "tls_accept" ```
### Additional Information
* **Kamailio Version** - output of `kamailio -v`
``` version: kamailio 5.8.4 (x86_64/linux) f1d83d flags: USE_TCP, USE_TLS, USE_SCTP, TLS_HOOKS, USE_RAW_SOCKS, DISABLE_NAGLE, USE_MCAST, DNS_IP_HACK, SHM_MMAP, PKG_MALLOC, MEM_JOIN_FREE, Q_MALLOC, F_MALLOC, TLSF_MALLOC, DBG_SR_MEMORY, USE_FUTEX, FAST_LOCK-ADAPTIVE_WAIT, USE_DNS_CACHE, USE_DNS_FAILOVER, USE_NAPTR, USE_DST_BLOCKLIST, HAVE_RESOLV_RES ADAPTIVE_WAIT_LOOPS 1024, MAX_RECV_BUFFER_SIZE 262144, MAX_SEND_BUFFER_SIZE 262144, MAX_URI_SIZE 1024, BUF_SIZE 65535, DEFAULT PKG_SIZE 8MB poll method support: poll, epoll_lt, epoll_et, sigio_rt, select. id: f1d83d compiled on 10:25:58 Nov 12 2024 with x86_64-ruag-linux-gcc 13.3.0
```
* **Operating System**:
It is a yocto scarthgap based image (Version 5.0.6 LTS)
``` Linux ttel 6.6.54-yocto-standard #1 SMP PREEMPT_DYNAMIC Tue Oct 8 03:03:40 UTC 2024 x86_64 GNU/Linux ```
miconda left a comment (kamailio/kamailio#4185)
Is this parameter set to 2?
- https://www.kamailio.org/wikidocs/cookbooks/5.8.x/core/#tls_threads_mode
schneuwlym left a comment (kamailio/kamailio#4185)
Is this parameter set to 2?
I tried 0-3 but it blocks in all cases... I also played with https://www.kamailio.org/docs/modules/stable/modules/tls.html#tls.p.init_mod... but also didn't find a working setup.
miconda left a comment (kamailio/kamailio#4185)
Can you try K 6.0.1? Be sure it is with `tls_threads_mode=2` (or not set, the default)
schneuwlym left a comment (kamailio/kamailio#4185)
Can you try K 6.0.1? Be sure it is with `tls_threads_mode=2` (or not set, the default)
I can give another try, but while tinkering I already tried it without success. But I don't remember whether I used `tls_threads_mode=2`. I can try again later.
At the moment some other tinkering is going on. We just tried to disable RW locks and now the test is running since minutes where it failed before after a few seconds ``` diff --git a/crypto/threads_pthread.c b/crypto/threads_pthread.c index 59ddcdbff832..9f61ddc820b5 100644 --- a/crypto/threads_pthread.c +++ b/crypto/threads_pthread.c @@ -38,9 +38,12 @@
# include <assert.h>
+/* HACK: Disable the use of RW locks */ +/* # ifdef PTHREAD_RWLOCK_INITIALIZER # define USE_RWLOCK # endif +*/
CRYPTO_RWLOCK *CRYPTO_THREAD_lock_new(void) {
```
henningw left a comment (kamailio/kamailio#4185)
Just to add some context, the above change is in OpenSSL. It would be great if you can also try with an unmodified OpenSSL with tls_thread_mode = 2. We had no reports of issues after the introduction of this parameter with OpenSSL 3.x, and many people are also using it on high load environments. It might be specific to some patches or the exact configuration that the embedded yocto linux is using, as most people are using Debian based OS or something of the CentOS family.
schneuwlym left a comment (kamailio/kamailio#4185)
We are using the official openssl recipe from yocto (https://git.yoctoproject.org/poky/tree/meta/recipes-connectivity/openssl?h=s...) version 3.2.3 and it seems quite clean. There are not many patches and nothing which pops in my eyes.
It would be great if you can also try with an unmodified OpenSSL with tls_thread_mode = 2
I can give a try, but as I said, it looks quite clean already and tls_thread_mode didn't work for us.
as most people are using Debian based OS or something of the CentOS family
I don't know the CentOS familiy, but my Ubuntu 24.04 (LTS) uses OpenSSL 3.0.X and Debian Bookworm uses 3.0.X as well. It seems as the RWLOCKs were introduced in 3.2.X.
I'm also surprised that we are the only one who see the issue, but something is odd and we don't know yet where or what...
henningw left a comment (kamailio/kamailio#4185)
It might be something related to OpenSSL 3.2.x, which would be funny again. Ubuntu 25.04 comes with OpenSSL 3.4 it seems, but this is of course not a LTS release so it will not used a lot. This might be another interesting thing to try.