Hi all
I have a setup of several proxies behind a load balancer, and all of them produce several coredumps every day. I've tried versions 5.3, 5.5 and 5.8, and all of them crash.
I didn't have the dbgsym packages installed, but now I've built a 5.8.4 version on one of them and can run gdb on the coredumps. I can't see anything in particular, and I'm wondering if I'm doing something wrong.
I'm decompressing the coredumps with lz4cat. Sometimes the system generates 3 coredumps at the same time. I've tried running gdb on one of them and executing "bt full".
I usually see "Cannot access memory" in some of them, but in others I don't see anything relevant.
I'm attaching one coredump. I don't even know if I'm doing it properly. Could you please guide me on how to debug what's going on?
thanks
Hello,
is the attached backtrace from 5.8.4?
If you get many core dump files at the same time, attach the full backtrace for each of them: usually one reveals the reason for the crash and the others are just side effects, but all of them need to be looked at to see which one matters for troubleshooting.
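For reference, a typical way to pull a full backtrace out of each compressed core is something like this (the file names and binary path are just examples, adjust them to your setup):

    lz4cat /var/crash/core.kamailio.1234.lz4 > /tmp/core.1234
    gdb -batch -ex 'bt full' /usr/sbin/kamailio /tmp/core.1234 > bt.1234.txt 2>&1

gdb must be given the exact kamailio binary that produced the core, built with debug symbols, otherwise many frames show up as '??' or 'No symbol table info available'.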
Some other details would be useful:
- what is the operating system you run?
- is it a dedicated server, or some virtualization system (docker/kubernetes, virtual machine, ...)?
- is it under high load when it happens, or some resources not available (e.g., database backend)?
- can you list the modules that are loaded in kamailio config? Any with custom code, or all from stock kamailio repo?
Cheers, Daniel
On Tue, 28 Jan 2025 11:33:19 +0100, Daniel-Constantin Mierla miconda@gmail.com wrote:
> Hello,
Hi Daniel
> is the attached backtrace from 5.8.4?
Yes, it's 5.8.4 with some Sipwise patches; I built the current master of their repo.
version: kamailio 5.8.4 (x86_64/linux) 23a581-dirty
> If you get many core dump files at the same time, attach the full backtrace for each of them: usually one reveals the reason for the crash and the others are just side effects, but all of them need to be looked at to see which one matters for troubleshooting.
Ok. I'm attaching the traces from two crashes, three coredumps each.
> Some other details would be useful:
> - what is the operating system you run?
It's Debian 10.
> - is it a dedicated server, or some virtualization system (docker/kubernetes, virtual machine, ...)?
All servers are bare metal
> - is it under high load when it happens, or some resources not available (e.g., database backend)?
No, it happens under both high and low load (night and day). The ones attached happened during the night, under low load.
> - can you list the modules that are loaded in kamailio config? Any with custom code, or all from stock kamailio repo?
I tested with three versions of Kamailio, but they are the Sipwise versions. I know they push changes upstream, but it won't be 100% stock Kamailio.
There are quite a lot of modules, really:
loadmodule "db_mysql.so" loadmodule "db_redis.so" loadmodule "auth.so" loadmodule "auth_db.so" loadmodule "tm.so" loadmodule "tmx.so" loadmodule "sl.so" loadmodule "rr.so" loadmodule "pv.so" loadmodule "maxfwd.so" loadmodule "usrloc.so" loadmodule "registrar.so" loadmodule "textops.so" loadmodule "uri_db.so" loadmodule "siputils.so" loadmodule "utils.so" loadmodule "xlog.so" loadmodule "sanity.so" loadmodule "acc.so" loadmodule "nathelper.so" loadmodule "rtpengine.so" loadmodule "domain.so" loadmodule "ctl.so" loadmodule "xmlrpc.so" loadmodule "cfg_rpc.so" loadmodule "cfgutils.so" loadmodule "avpops.so" loadmodule "sqlops.so" loadmodule "uac.so" loadmodule "kex.so" loadmodule "lcr.so" loadmodule "dispatcher.so" loadmodule "permissions.so" loadmodule "uac_redirect.so" loadmodule "dialplan.so" loadmodule "speeddial.so" loadmodule "dialog.so" loadmodule "tmrec.so" loadmodule "diversion.so" loadmodule "corex.so" loadmodule "textopsx.so" loadmodule "sdpops.so" loadmodule "htable.so" loadmodule "jansson.so" loadmodule "pv_headers.so" loadmodule "secsipid.so" loadmodule "jsonrpcs.so" loadmodule "app_lua.so"
The cause of the crash is revealed by the backtrace:
#0  0x00007fbd7e07b8eb in raise () from /lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.
#1  0x00007fbd7e066535 in abort () from /lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.
#2  0x000055fff797d98f in qm_debug_check_frag (qm=qm@entry=0x7fbb6e9b9000,
    f=f@entry=0x7fbb6ee08940, file=file@entry=0x7fbd6eeb34f5 "permissions: hash.c",
    line=line@entry=402, eline=eline@entry=603,
    efile=0x55fff7ac8d48 "core/mem/q_malloc.c") at core/mem/q_malloc.c:132
        p = <optimized out>
        __func__ = "qm_debug_check_frag"
That points to the abort at src/core/mem/q_malloc.c:132, namely:
if(f->check != ST_CHECK_PATTERN) {
    LM_CRIT("BUG: qm: fragm. %p (address %p) "
            "beginning overwritten (%lx)! Memory allocator was called "
            "from %s:%u. Fragment marked by %s:%lu. Exec from %s:%u.\n",
            f, (char *)f + sizeof(struct qm_frag), f->check, file, line,
            f->file, f->line, efile, eline);
    qm_status(qm);
    abort();
};
That means there is a buffer overflow, or something is writing to a wrong address.
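As a minimal standalone sketch of that failure mode (this is illustrative C, not Kamailio code; the struct layout and sizes are invented for the example): each fragment carries a canary word in its header, and a write past the end of the previous allocation tramples it, so the next consistency check aborts, just like qm_debug_check_frag does above.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define ST_CHECK_PATTERN 0xf0f0f0f0UL   /* same idea as q_malloc's pattern */

    struct frag {                  /* toy fragment: header canary + payload */
        unsigned long check;       /* verified on every alloc/free */
        char data[16];
    };

    int main(void) {
        struct frag *f = calloc(2, sizeof(struct frag));
        if (!f)
            return 1;
        struct frag *next = f + 1;          /* the adjacent fragment in memory */
        f->check = next->check = ST_CHECK_PATTERN;

        /* the bug: 20 bytes written into a 16-byte buffer,
         * the last 4 bytes land on next->check */
        memset(f->data, 'X', 20);

        if (next->check != ST_CHECK_PATTERN) {
            fprintf(stderr, "beginning overwritten (%lx)!\n", next->check);
            abort();                        /* what qm_debug_check_frag does */
        }
        free(f);
        return 0;
    }

One thing to keep in mind when reading the real backtrace: "permissions: hash.c:402" in frame #2 is where the allocator was called when the corruption was detected; the code that actually did the stray write can be anywhere that shares the memory pool.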
I would suggest that you review the additional patches you apply on top of stock Kamailio. No similar crash has been reported to the project, and such a problem would not take long to show up in deployments, so there is a high chance the fault comes from those additional patches.
Cheers, Daniel
On Wed, 29 Jan 2025 21:27:13 +0100, Daniel-Constantin Mierla miconda@gmail.com wrote:
> That means there is a buffer overflow, or something is writing to a wrong address.
> I would suggest that you review the additional patches you apply on top of stock Kamailio. No similar crash has been reported to the project, and such a problem would not take long to show up in deployments, so there is a high chance the fault comes from those additional patches.
Hi Daniel
Sorry for replying so late. I've been working on this, and testing stability takes some days for every change.
First, I realized that there was heavy memory pressure on the systems from other processes, and I took care of that. It helped a bit, but still not enough.
I also had a bottleneck in a shared redis-server where the dialog module was storing dialogs and dialog profiles. Removing that shared redis-server and dropping the dialog profiles has helped a lot with overall system stability: from several crashes per day to almost zero.
I'm not sure the additional patches are the cause here. Anyway, I think that any new coredumps, if they show up, will be more reliable, because there is less external noise and the systems run more smoothly.
I'll continue testing and will let you know.
thanks,
Jon