[OpenSER-Users] OpenSER Randomly crashes.

Thu Mar 6 11:14:30 CET 2008

Hi Sergio,

Please upload your report to the tracker (bug section).

Regards,
Bogdan

PS: please enable the memory debug support (DBG_QM_MALLOC) and run 
OpenSER with it enabled; it might provide more information when it crashes.

Sergio Gutierrez wrote:
> Hi Henning.
>
> I apologize in advance for the long post.
>
> OpenSER has still been crashing randomly these days.
>
> Using GDB on one of the generated core files, I found something curious:
>
> #0  0x000bcfbc in fm_malloc (qm=0x185320, size=24, file=0xfedbac10 "res.c",
>     func=0xfedbac70 "db_mysql_get_columns", line=62) at mem/f_malloc.c:267
> #1  0xfedb74b0 in db_mysql_get_columns (_h=0x1cbf68, _r=0x24dde8) at res.c:62
> #2  0xfedb79f0 in db_mysql_convert_result (_h=0x1cbf68, _r=0x24dde8) at res.c:167
> #3  0xfedb28c4 in db_mysql_store_result (_h=0x1cbf68, _r=0xffbff830) at dbase.c:209
> #4  0xfedb40e8 in db_mysql_raw_query (_h=0x1cbf68,
>     _s=0xff07e668 "select received, contact, socket, cflags, path from location where expires > '2008-03-04 13:37:51' and cflags & 64 = 64 and id % 1 = 0", _r=0xffbff830) at dbase.c:447
> #5  0xff053260 in get_all_db_ucontacts (buf=0x1ceec0, len=320054, flags=64, part_idx=0, part_max=1)
>     at dlist.c:128
> #6  0xff0528c8 in get_all_ucontacts (buf=0x1ceec0, len=320058, flags=64, part_idx=0, part_max=1) at dlist.c:356
> #7  0xfee57c6c in pingClients (ticks=60, param=0x0) at functions.h:60
> #8  0x000aa430 in timer_ticker (timer_list=0x163c00) at timer.c:275
> #9  0x000aa180 in run_timer_process (tpl=0x1c5808, do_jiffies=1) at timer.c:357
> #10 0x000aa6fc in start_timer_processes () at timer.c:386
> #11 0x00036788 in main_loop () at main.c:873
> #12 0x0003a0c4 in main (argc=1137536, argv=0x155f1c) at main.c:1372
>
> Inspecting frame 0 in detail, in particular the qm variable:
>
> (gdb) print qm
> $1 = (struct fm_block *) 0x185320
>
>
> This is the fm_block structure defined in mem/f_malloc.h.
>
> (gdb) frame 0
> #0  0x000bcfbc in fm_malloc (qm=0x185320, size=24, file=0xfedbac10 "res.c",
>     func=0xfedbac70 "db_mysql_get_columns", line=62) at mem/f_malloc.c:267
> 267                             if ((*f)->size>=size) goto found;
> (gdb) list
> 262             /*search for a suitable free frag*/
> 263
> 264             for(hash=GET_HASH(size);hash<F_HASH_SIZE;hash++){
> 265                     f=&(qm->free_hash[hash].first);
> 266                     for(;(*f); f=&((*f)->u.nxt_free))
> 267                             if ((*f)->size>=size) goto found;
> 268                     /* try in a bigger bucket */
> 269             }
> 270             /* not found, bad! */
> 271             return 0;
>
>
> When I print the qm->free_hash array, I find that it is mostly empty. 
> In my core file, hash has a value of 3, and printing that position 
> gives the following:
>
> (gdb) print qm->free_hash[hash]
> $1 = {first = 0x69703a31, no = 1}
> (gdb) print qm->free_hash
> $2 = {{first = 0x0, no = 0}, {first = 0x0, no = 0}, {first = 0x0, no = 0}, {first = 0x69703a31, no = 1}, {
>     first = 0x0, no = 0}, {first = 0x0, no = 0}, {first = 0x0, no = 0}, {first = 0x0, no = 0}, {first = 0x0,
>     no = 0}, {first = 0x0, no = 0}, {first = 0x24dd68, no = 4641}, {first = 0x0, no = 0} <repeats 21 times>, {
>     first = 0x1ced90, no = 1}, {first = 0x0, no = 0} <repeats 679 times>, {first = 0x1cef40, no = 1}, {
>     first = 0x0, no = 0} <repeats 1337 times>, {first = 0x1cef40, no = 1}, {first = 0x0, no = 0}, {
>     first = 0x24de38, no = 1}, {first = 0x0, no = 0} <repeats 11 times>, {first = 0x21d100, no = 1}, {
>     first = 0x0, no = 0}, {first = 0x0, no = 0}}
> (gdb) print qm->free_hash.no
> $3 = 0
> (gdb) print qm->free_hash[hash].first
> $4 = (struct fm_frag *) 0x69703a31
> (gdb) x/s 0x69703a31
> 0x69703a31:      <Address 0x69703a31 out of bounds>
>
>
> So the crash happened because the list of free memory fragments 
> referenced an invalid one.
>
> I have two questions:
>
> 1. Is it normal that the free_hash array in fm_block has some positions 
> pointing to invalid locations?
> 2. I could see that the fm_frag_lnk struct has a member called no. In 
> the printed output it is 0 for most entries of the array and 1 for 
> some, including the one that causes the crash. Would it not be 
> possible to use that member for a check before trying the allocation? 
> What exactly does the no member mean? I also see that for some 
> entries it has a value higher than 1.
>
> Thanks in advance for any help, and again, I apologize for the long post.
>
> Best regards.
>
> Sergio Gutierrez
>
>
> On Thu, Feb 28, 2008 at 11:49 AM, Sergio Gutierrez <saguti at gmail.com> 
> wrote:
>
>     Hi Henning.
>
>     Thanks a lot for your answer.
>
>     Currently, the machine does not report any hardware problem;
>     Solaris 10 has a service called Fault Manager, which is running on
>     my machine, and it has not reported any error or problem related
>     to it.
>
>     At the moment, I am testing an OpenSER installation compiled using
>     an optimized version of GCC released by Sun for use on SPARC
>     systems; this release is based on gcc 4, and so far OpenSER
>     has been running for almost 18 hours without a crash.
>
>     I will inspect the core file again, and I will be posting what I find.
>
>     Best regards, and thanks again.
>
>     Sergio Gutierrez.
>
>
>
>
>     On Thu, Feb 28, 2008 at 5:19 AM, Henning Westerholt
>     <henning.westerholt at 1und1.de> wrote:
>
>         On Thursday 28 February 2008, Sergio Gutierrez wrote:
>         > My OpenSER 1.3 installation running on Solaris Sparc is facing random and
>         > unexpected crashes, apparently related to the timer process.
>         >
>         > The last core presents the following backtrace:
>         >
>         > #0  0xfe977a04 in get_expired_dlgs (time=4233810208) at dlg_timer.c:194
>         > #1  0xfe977540 in dlg_timer_routine (ticks=7980, attr=0x0) at dlg_timer.c:210
>         > #2  0x000a839c in timer_ticker (timer_list=0x15ec00) at timer.c:275
>         > #3  0x000a80ec in run_timer_process (tpl=0x1b8088, do_jiffies=1) at timer.c:357
>         > #4  0x000a8668 in start_timer_processes () at timer.c:386
>         > #5  0x00035ea8 in main_loop () at main.c:873
>         > #6  0x000397c4 in main (argc=-4195024, argv=0x150e9c) at main.c:1372
>         >
>         >
>         > Thanks in advance for any hint you can give me.
>
>         Hi Sergio,
>
>         Signal 10 is SIGBUS on Solaris. It can be caused by an
>         invalid address alignment, a segmentation fault on a
>         physical address, or an object hardware error (Wikipedia).
>
>         The first crashes were both caused by get_all_ucontacts,
>         triggered by a timer. This crash is now in another timer,
>         the deletion of expired dialogs, which is strange. Is this
>         machine otherwise stable, and when (with which OpenSER
>         release) did these crashes start?
>
>         Have you already used the debugger to inspect the data
>         structures in the code of the get_expired_dlgs function?
>         Perhaps there is something wrong in there.
>
>         Cheers,
>
>         Henning
>
>
>