### Description
We had a segfault in Kamailio v5.3.1 installed on Debian 9.x 64-bit. It occurred while Kamailio was shutting down and, at the same time, our script tried to get metrics using the kamcmd utility.
### Troubleshooting
No troubleshooting was done, since it happened on a production server. We simply restarted the server.
#### Reproduction
The problem happens periodically on production servers at runtime. Kamailio crashes when one of our scripts tries to get statistics about the websocket and tls modules using kamcmd. As far as I can see in the core dump, shared memory had already been freed when rpc_mod_print was called in the child process.
#### Debugging Data
```
GNU gdb (Debian 7.12-6) 7.12.0.20161007-git
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
http://www.gnu.org/software/gdb/bugs/.
Find the GDB manual and other documentation resources online at:
http://www.gnu.org/software/gdb/documentation/.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /var/lib/ums/sbin/kamailio...done.
[New LWP 17075]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/var/lib/ums/sbin/kamailio -m 2048 -M 12 -P /var/run/kamailio/kamailio.pid -f /'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 __strcmp_sse2_unaligned () at ../sysdeps/x86_64/multiarch/strcmp-sse2-unaligned.S:32
#0 __strcmp_sse2_unaligned () at ../sysdeps/x86_64/multiarch/strcmp-sse2-unaligned.S:32
#1 0x00007fcae1ef62f5 in rpc_mod_print (rpc=0x7fcae180b540 <binrpc_callbacks>, ctx=0x7ffe7a6f7808, mname=0x1edd0c9 "websocket", stats=0x1ef8310, flag=2) at mod_stats.c:117
#2 0x00007fcae1ef60eb in rpc_mod_print_one (rpc=0x7fcae180b540 <binrpc_callbacks>, ctx=0x7ffe7a6f7808, mname=0x1edd0c9 "websocket", pkg_stats=0x1ef69d0, shm_stats=0x1ef8310, flag=2) at mod_stats.c:159
#3 0x00007fcae1ef5ee1 in rpc_mod_mem_stats_mode (rpc=0x7fcae180b540 <binrpc_callbacks>, ctx=0x7ffe7a6f7808, fmode=0) at mod_stats.c:239
#4 0x00007fcae1ef584f in rpc_mod_mem_stats (rpc=0x7fcae180b540 <binrpc_callbacks>, ctx=0x7ffe7a6f7808) at mod_stats.c:251
#5 0x00007fcae15dac80 in process_rpc_req (buf=0x1edd0b4 "\241\003\035\020\333Qg\221\nmod.stats", size=36, bytes_needed=0x7ffe7a6f7c50, sh=0x7ffe7a6f7bc0, saved_state=0x1eed0b8) at binrpc_run.c:678
#6 0x00007fcae15c872f in handle_stream_read (s_c=0x1edd080, idx=-1) at io_listener.c:511
#7 0x00007fcae15c4121 in handle_io (fm=0x7fcb65600cb0, events=1, idx=-1) at io_listener.c:706
#8 0x00007fcae15c293a in io_wait_loop_epoll (h=0x7fcae180b348 <io_h>, t=10, repeat=0) at ./../../core/io_wait.h:1062
#9 0x00007fcae15b662c in io_listen_loop (fd_no=2, cs_lst=0x1df1940) at io_listener.c:281
#10 0x00007fcae15ec72c in mod_child (rank=0) at ctl.c:338
#11 0x0000000000638c14 in init_mod_child (m=0x7fcb6547f4b0, rank=0) at core/sr_module.c:780
#12 0x000000000063862d in init_mod_child (m=0x7fcb6547fb78, rank=0) at core/sr_module.c:776
#13 0x000000000063862d in init_mod_child (m=0x7fcb65480018, rank=0) at core/sr_module.c:776
#14 0x000000000063862d in init_mod_child (m=0x7fcb65480528, rank=0) at core/sr_module.c:776
#15 0x000000000063862d in init_mod_child (m=0x7fcb654809c8, rank=0) at core/sr_module.c:776
#16 0x000000000063862d in init_mod_child (m=0x7fcb65481140, rank=0) at core/sr_module.c:776
#17 0x000000000063862d in init_mod_child (m=0x7fcb654817b0, rank=0) at core/sr_module.c:776
#18 0x000000000063862d in init_mod_child (m=0x7fcb65481c38, rank=0) at core/sr_module.c:776
#19 0x00000000006385b2 in init_child (rank=0) at core/sr_module.c:825
#20 0x000000000043140c in main_loop () at main.c:1753
#21 0x000000000043df6f in main (argc=9, argv=0x7ffe7a6fbf88) at main.c:2802
```
#### Log Messages
No useful logs are available.
#### SIP Traffic
No SIP traffic available.
### Possible Solutions
<!-- If you found a solution or workaround for the issue, describe it. Ideally, provide a pull request with a fix. -->
### Additional Information
* **This is the sequence of commands that our Python script runs every 10 seconds:**
```
/var/lib/ums/sbin/kamcmd -s tcp:localhost:2048 stats.get_statistics websocket:
/var/lib/ums/sbin/kamcmd -s tcp:localhost:2048 stats.get_statistics tcp:
/var/lib/ums/sbin/kamcmd -s tcp:localhost:2048 stats.get_statistics shmem:
/var/lib/ums/sbin/kamcmd -s tcp:localhost:2048 mod.stats tls pkg
/var/lib/ums/sbin/kamcmd -s tcp:localhost:2048 mod.stats tls shm
/var/lib/ums/sbin/kamcmd -s tcp:localhost:2048 mod.stats websocket pkg
/var/lib/ums/sbin/kamcmd -s tcp:localhost:2048 mod.stats websocket shm
/var/lib/ums/sbin/kamcmd -s tcp:localhost:2048 mod.stats core shm
```
* **Kamailio Version** - output of `kamailio -v`
```
version: kamailio 5.3.1 (x86_64/linux) 283e46
flags: USE_TCP, USE_TLS, USE_SCTP, TLS_HOOKS, USE_RAW_SOCKS, DISABLE_NAGLE, USE_MCAST, DNS_IP_HACK, SHM_MMAP, PKG_MALLOC, Q_MALLOC, F_MALLOC, TLSF_MALLOC, DBG_SR_MEMORY, USE_FUTEX, FAST_LOCK-ADAPTIVE_WAIT, USE_DNS_CACHE, USE_DNS_FAILOVER, USE_NAPTR, USE_DST_BLACKLIST, HAVE_RESOLV_RES
ADAPTIVE_WAIT_LOOPS 1024, MAX_RECV_BUFFER_SIZE 262144, MAX_URI_SIZE 1024, BUF_SIZE 65535, DEFAULT PKG_SIZE 8MB
poll method support: poll, epoll_lt, epoll_et, sigio_rt, select.
id: 283e46
compiled on 14:23:37 Jul 28 2020 with clang 9.0
```
* **Operating System**:
```
Linux devhpbx005-1.vx 4.9.0-12-amd64 #1 SMP Debian 4.9.210-1 (2020-01-20) x86_64 GNU/Linux
```
I pushed a few commits to the master branch to try to address this issue. I did not implement the memory mod stats myself, but from what I could see in the code, access to the shm fragments did not seem to be protected against races.
You would need to test with the master branch or apply the patches from the following commits:
* aa458a62f034c2cb57639bdc713ed3c51b0292c7
* 761eb0616fea2a859a2c0abb652b22feb6f59859
* 9645be245f899fa8ae11a6be045d2ef83fd66bf5
Thanks Daniel, I'll be testing it as soon as I deal with my current tasks.
Closing with an additional comment: `mod.stats shm` locks shared memory access, and it is a rather lengthy process on a loaded server with a lot of shm usage (it needs to group all allocated chunks while holding the shm lock). Running it often is therefore not recommended at all; it is better to avoid it. Instead, get the overall shm stats, and if you discover a potential leak or unexpectedly large shm use, then leverage `mod.stats shm` to find the source of the behaviour.
The commits are in the master branch. I am not sure yet whether they should be backported: these commands seem to have been designed for troubleshooting (run them when out of memory to get the reports, then restart) rather than for continuous runtime monitoring. The above commits try to fix that, but due to the lack of extensive testing there can be side effects, so I am not rushing to backport changes related to memory manager interaction. For now, just avoid using these rpc commands.
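For anyone adapting a monitoring script to the advice above, here is a minimal Python sketch of the "overall stats first, drill down only when needed" approach. It is not part of Kamailio. The kamcmd path and ctl address are taken from the report, while the 0.8 threshold, the helper names, and the assumed `name = value` line format of the kamcmd output are illustrative assumptions to verify against your own installation.

```python
import subprocess

KAMCMD = "/var/lib/ums/sbin/kamcmd"   # path taken from the report
CTL_ADDR = "tcp:localhost:2048"       # ctl address taken from the report


def parse_stats(text):
    """Parse 'group:name = value' lines into a dict of ints.

    The line format is an assumption about kamcmd output; lines that do
    not match it (braces, blanks, non-numeric values) are skipped.
    """
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.partition("=")
        if not sep:
            continue
        name, value = name.strip(), value.strip()
        if name and value.isdigit():
            stats[name] = int(value)
    return stats


def shm_usage_ratio(stats):
    """Return real_used_size / total_size, or None if a counter is missing."""
    used = stats.get("shmem:real_used_size")
    total = stats.get("shmem:total_size")
    if used is None or not total:
        return None
    return used / total


def run_kamcmd(*args, timeout=5):
    """Run one kamcmd command and return its stdout as text."""
    result = subprocess.run(
        [KAMCMD, "-s", CTL_ADDR, *args],
        capture_output=True, text=True, timeout=timeout, check=True,
    )
    return result.stdout


def check_once(threshold=0.8, modules=("tls", "websocket", "core")):
    """Poll the cheap overall counters; run mod.stats only above threshold."""
    overall = parse_stats(run_kamcmd("stats.get_statistics", "shmem:"))
    ratio = shm_usage_ratio(overall)
    if ratio is None or ratio < threshold:
        return {}  # nothing suspicious, so skip the expensive per-module report
    return {m: run_kamcmd("mod.stats", m, "shm") for m in modules}
```

With this shape, the frequent poll only touches `stats.get_statistics shmem:`, and the lock-holding `mod.stats ... shm` calls run only when the usage ratio suggests something worth investigating.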
Closed #2460.
Hi Daniel,
I've performed some load tests. The load peaked at about 400 calls per second. In parallel, I ran a script that issued the commands below every 200 milliseconds:
```
/var/lib/ums/sbin/kamcmd -s udp:localhost:2048 stats.get_statistics websocket:
/var/lib/ums/sbin/kamcmd -s udp:localhost:2048 stats.get_statistics tcp:
/var/lib/ums/sbin/kamcmd -s udp:localhost:2048 stats.get_statistics shmem:
/var/lib/ums/sbin/kamcmd -s udp:localhost:2048 mod.stats tls pkg
/var/lib/ums/sbin/kamcmd -s udp:localhost:2048 mod.stats tls shm
/var/lib/ums/sbin/kamcmd -s udp:localhost:2048 mod.stats websocket pkg
/var/lib/ums/sbin/kamcmd -s udp:localhost:2048 mod.stats websocket shm
/var/lib/ums/sbin/kamcmd -s udp:localhost:2048 mod.stats core shm
```
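For reference, a polling loop of this kind could look roughly like the Python sketch below. It is not the actual test script. The command list and the 200 ms interval come from this comment, while the function names, the `runner` parameter (used here so the loop can be exercised without a running Kamailio), and the failure counting are hypothetical.

```python
import subprocess
import time

KAMCMD = "/var/lib/ums/sbin/kamcmd"  # path taken from the commands above

# RPC calls issued on every polling round, in the order listed above.
RPC_CALLS = [
    ("stats.get_statistics", "websocket:"),
    ("stats.get_statistics", "tcp:"),
    ("stats.get_statistics", "shmem:"),
    ("mod.stats", "tls", "pkg"),
    ("mod.stats", "tls", "shm"),
    ("mod.stats", "websocket", "pkg"),
    ("mod.stats", "websocket", "shm"),
    ("mod.stats", "core", "shm"),
]


def build_commands(ctl_addr):
    """Return the argv list for each kamcmd invocation."""
    return [[KAMCMD, "-s", ctl_addr, *call] for call in RPC_CALLS]


def poll(ctl_addr, interval=0.2, rounds=1, runner=subprocess.run):
    """Run all commands `rounds` times, sleeping `interval` seconds between rounds.

    Returns the number of failed invocations (non-zero exit, timeout,
    or kamcmd not found), so a caller can report them.
    """
    failures = 0
    for _ in range(rounds):
        for cmd in build_commands(ctl_addr):
            try:
                runner(cmd, capture_output=True, text=True, timeout=5, check=True)
            except (subprocess.SubprocessError, OSError):
                failures += 1
        time.sleep(interval)
    return failures
```

A call such as `poll("udp:localhost:2048", interval=0.2, rounds=1000)` would approximate the schedule described here, although the effective period is longer than 200 ms because the commands themselves take time to run.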
This is a graph of the load:

The tests finished successfully; no failures occurred.
Andrey
OK, so it is no longer crashing. The overloading risks were pointed out earlier: it is like locking a database table from an external source while Kamailio needs to access it.
Thanks, I will take into account your comments.