<!-- Kamailio Project uses GitHub Issues only for bugs in the code or feature requests. Please use this template only for bug reports.
If you have questions about using Kamailio or related to its configuration file, ask on sr-users mailing list:
* http://lists.kamailio.org/cgi-bin/mailman/listinfo/sr-users
If you have questions about developing extensions to Kamailio or its existing C code, ask on sr-dev mailing list:
* http://lists.kamailio.org/cgi-bin/mailman/listinfo/sr-dev
Please try to fill this template as much as possible for any issue. It helps the developers to troubleshoot the issue.
If there is no content to be filled in a section, the entire section can be removed.
You can delete the comments from the template sections when filling.
You can delete next line and everything above before submitting (it is a comment). -->
### Description
Similar to [issue #1886](https://github.com/kamailio/kamailio/issues/1886), we have Kamailio receiving WebRTC connections from clients and we see many errors related to connections being forcibly closed. We see two cases where this occurs, the former of which occurs much more often. ``` Dec 29 01:12:43.909934 tlx-dal-ecv2-staging kamailio_edge[14813]: WARNING: websocket [ws_frame.c:810]: ws_keepalive(): forcibly closing connection Dec 29 01:12:43.910179 tlx-dal-ecv2-staging kamailio_edge[14813]: ERROR: websocket [ws_conn.c:375]: wsconn_close_now(): getting TCP/TLS connection while trying to close. src_ip:src_port=x.x.x.x:49319, dst_port=443 ``` ``` Dec 29 00:53:28.110949 tlx-dal-ecv2-staging kamailio_edge[14817]: WARNING: websocket [ws_frame.c:227]: encode_and_send_ws_frame(): TCP/TLS connection get failed Dec 29 00:53:28.111187 tlx-dal-ecv2-staging kamailio_edge[14817]: ERROR: websocket [ws_frame.c:761]: ping_pong(): sending keepalive. src_ip:src_port=x.x.x.x:46712, dst_port=443 ```
### Troubleshooting There seems to be a common occurrence where the WebSocket keepalive will not receive a pong and close the websocket connection. While closing, we see that the underlying TCP connection has already been closed. While that may be harmless, I am concerned there is a race condition that could lead to other problems.
### Additional Information
* **Kamailio Version** - output of `kamailio -v`
``` version: kamailio 5.3.1 (x86_64/linux) d68f5c flags: USE_TCP, USE_TLS, USE_SCTP, TLS_HOOKS, USE_RAW_SOCKS, DISABLE_NAGLE, USE_MCAST, DNS_IP_HACK, SHM_MMAP, PKG_MALLOC, Q_MALLOC, F_MALLOC, TLSF_MALLOC, DBG_SR_MEMORY, USE_FUTEX, FAST_LOCK-ADAPTIVE_WAIT, USE_DNS_CACHE, USE_DNS_FAILOVER, USE_NAPTR, USE_DST_BLACKLIST, HAVE_RESOLV_RES, TLS_PTHREAD_MUTEX_SHARED ADAPTIVE_WAIT_LOOPS 1024, MAX_RECV_BUFFER_SIZE 262144, MAX_URI_SIZE 1024, BUF_SIZE 65535, DEFAULT PKG_SIZE 8MB poll method support: poll, epoll_lt, epoll_et, sigio_rt, select. id: d68f5c compiled on 07:32:16 Dec 29 2021 with gcc 8.3.0 ```
* **Operating System**:
<!-- Details about the operating system, the type: Linux (e.g.,: Debian 8.4, Ubuntu 16.04, CentOS 7.1, ...), MacOS, xBSD, Solaris, ...; Kernel details (output of `lsb_release -a` and `uname -a`) -->
```
lsb_release -a
No LSB modules are available. Distributor ID: Debian Description: Debian GNU/Linux 10 (buster) Release: 10 Codename: buster
uname -a
Linux tlx-dal-ecv2-staging 4.19.0-8-amd64 #1 SMP Debian 4.19.98-1 (2020-01-26) x86_64 GNU/Linux ```
v5.3.x series is out of maintenance, we cannot help troubleshooting older releases. You have to upgrade to latest 5.5.x or at least 5.4.x and see if it is the same behaviour, then attach again the relevant logs and useful information, in order to match with the source code.
Also, Kamailio typically logs error messages on many unexpected transport layer events, but there are cases when these events are happening due to the type of deployment and UAs (e.g., mobile devices loosing often the internet link). One can lower the `debug` parameter value or set per module debug level with debugger module to control the syslog output.
Hi Daniel,
Thank you for the prompt response. We are looking into migrating to 5.5.x shortly.
We did add some additional debugging and got an interesting result. ``` Dec 31 18:29:08.819765 eqx-sjc-ecv3 kamailio_edge[19096]: WARNING: websocket [ws_frame.c:810]: ws_keepalive(): forcibly closing connection
grep "tcp id = 19132" /var/log/kamailio/kamailio_edge.log
Dec 31 18:29:07.815876 eqx-sjc-ecv3 kamailio_edge[19096]: INFO: websocket [ws_conn.c:371]: wsconn_close_now(): wsconn_close_now: wsc = 0xa2ac94e0, tcp con = 0x9edb0e80, tcp id = 19132 Dec 31 18:29:08.819965 eqx-sjc-ecv3 kamailio_edge[19096]: INFO: websocket [ws_conn.c:371]: wsconn_close_now(): wsconn_close_now: wsc = 0xa2ac94e0, tcp con = 0x00000000, tcp id = 19132 Dec 31 18:29:09.824694 eqx-sjc-ecv3 kamailio_edge[19096]: INFO: websocket [ws_conn.c:371]: wsconn_close_now(): wsconn_close_now: wsc = 0xa2ac94e0, tcp con = 0x00000000, tcp id = 19132 Dec 31 18:29:10.833513 eqx-sjc-ecv3 kamailio_edge[19096]: INFO: websocket [ws_conn.c:371]: wsconn_close_now(): wsconn_close_now: wsc = 0xa2ac94e0, tcp con = 0x00000000, tcp id = 19132 Dec 31 18:29:11.838090 eqx-sjc-ecv3 kamailio_edge[19096]: INFO: websocket [ws_conn.c:371]: wsconn_close_now(): wsconn_close_now: wsc = 0xa2ac94e0, tcp con = 0x00000000, tcp id = 19132 Dec 31 18:29:12.841246 eqx-sjc-ecv3 kamailio_edge[19096]: INFO: websocket [ws_conn.c:371]: wsconn_close_now(): wsconn_close_now: wsc = 0xa2ac94e0, tcp con = 0x00000000, tcp id = 19132 Dec 31 18:38:20.299126 eqx-sjc-ecv3 kamailio_edge[19218]: INFO: <core> [core/tcp_main.c:1584]: _tcpconn_detach(): _tcpconn_detach: tcpconn=0x9edb0e80, tcp id = 19132 ```
I did want to get your thoughts on the current implementation of the `ws_keepalive` functionality. ``` if(wsc->state == WS_S_CLOSING || wsc->awaiting_pong) { LM_WARN("forcibly closing connection\n"); wsconn_close_now(wsc); ```
When `wsconn_close_now` is called, the connection state is set to `S_CONN_BAD`. On subsequent `wsconn_close_now` calls, `tcpconn_get` -> `_tcpconn_find` calls will return 0 due to [this line](https://github.com/dialpad/kamailio/blob/master/src/core/tcp_main.c#L1650): `if ((id==c->id)&&(c->state!=S_CONN_BAD)) {`.
Ie. `_tcpconn_find` skips over connections in the `S_CONN_BAD` state.
Why is `wsconn_close_now` being called multiple times? It looks like `ws_keepalive` isn't removing the websocket connection from the list of open connections. Instead, it's [de-referencing an extra time](https://github.com/kamailio/kamailio/blob/master/src/modules/websocket/ws_fr...). ``` void ws_keepalive(unsigned int ticks, void *param) { ... while(list_head[i].id!=-1) { //For each websocket connection, add a refcount wsc = wsconn_get(list_head[i].id); if(wsc && wsc->last_used < check_time) { if(wsc->state == WS_S_CLOSING || wsc->awaiting_pong) { LM_WARN("forcibly closing connection\n"); //Remove extra refcount? wsconn_close_now(wsc); } else if (ws_keepalive_mechanism == KEEPALIVE_MECHANISM_CONCHECK) { tcp_connection_t *con = tcpconn_get(wsc->id, 0, 0, 0, 0); if(con==NULL) { LM_INFO("tcp connection has been lost\n"); wsc->state = WS_S_CLOSING; } else { tcpconn_put(con); } } else { int opcode = (ws_keepalive_mechanism == KEEPALIVE_MECHANISM_PING) ? OPCODE_PING : OPCODE_PONG; ping_pong(wsc, opcode); } } if(wsc) { //Remove refcount wsconn_put_id(list_head[i].id); } i++;
}
wsconn_put_list_ids(list_head); } ```
What about modifying the keepalive to do something like this instead? ``` for(i = 0; list_head[i].id!=-1; i++) { wsc = wsconn_get(list_head[i].id); if(wsc->state == WS_S_CLOSING) { /* Skip closing connections */ continue; } else if (wsc->awaiting_pong) { LM_WARN("forcibly closing connection, no pong received\n"); wsconn_close_now(wsc); continue; } else if ... ```
On skipping `wsconn_close_now` in the `WS_S_CLOSING` case, should it really be the responsibility of the keepalive to close the connection in that case?
I am not sure about the reason of `forcibly closing connection` also for `WS_S_CLOSING`, maybe the developer had in mind some cases at that time -- tracking back, it looks like being from 2012, from the early time of that module being developed.
You can make PR with the changes you propose and we can review and merge. If later proves to have side effects (e.g., delay cleaning up the connection), we can revert or add a modparam to control the behaviour.
@maxdialpad: do still plan to make a PR for the changes you mentioned in comment above?
@miconda it's on our roadmap but not prioritized right now. My estimate is we'll get to this by late April.
We have also seen these warnings. Is there a way to identify which WebRTC user was disconnected?
@davidvoisonics You can try of course relate the IP information to an user, but might be not that easy in some cases. @maxdialpad Any news regarding the PR for the discussed changes from your side?
For some reason ours doesn't include the ip:port information, it just repeats over and over:
Oct 31 16:18:17 server /sbin/kamailio[7769]: WARNING: websocket [ws_frame.c:813]: ws_keepalive(): forcibly closing connection Oct 31 16:18:17 server /sbin/kamailio[7769]: ERROR: websocket [ws_conn.c:375]: wsconn_close_now(): getting TCP/TLS connection
I'm sorry to disappoint - we haven't prioritized this as we've mostly gotten away from these errors. For us, the main source of problems was actually a very slow `sht_rm_name_re` in our configs. We store connection information in an htable and this regex call was causing the WebSocket module to slow down while iterating over each connection. We ended up replacing the regex call with explicit `sht_rm` calls in. our config which solved the problem for us. We do still encounter these WebSocket errors occasionally, and I do believe the pseudocode posted above is still valid.
I'm sorry we haven't take the time to try that out yet.
Thanks for the feedback. Maybe this helps already a bit in getting more information about the error.
This maybe related to https://github.com/kamailio/kamailio/issues/3278
This issue is stale because it has been open 6 weeks with no activity. Remove stale label or comment or this will be closed in 2 weeks.
Closed #2990 as not planned.