[OpenSER-Users] OpenSER Randomly crashes.

Sergio Gutierrez saguti at gmail.com
Wed Mar 5 03:28:05 CET 2008


Hi Henning.

I apologize in advance for the long post.

OpenSER is still crashing randomly these days.

Using GDB on one of the generated core files, I found something curious:

#0  0x000bcfbc in fm_malloc (qm=0x185320, size=24, file=0xfedbac10 "res.c",
    func=0xfedbac70 "db_mysql_get_columns", line=62) at mem/f_malloc.c:267
#1  0xfedb74b0 in db_mysql_get_columns (_h=0x1cbf68, _r=0x24dde8) at res.c:62
#2  0xfedb79f0 in db_mysql_convert_result (_h=0x1cbf68, _r=0x24dde8) at res.c:167
#3  0xfedb28c4 in db_mysql_store_result (_h=0x1cbf68, _r=0xffbff830) at dbase.c:209
#4  0xfedb40e8 in db_mysql_raw_query (_h=0x1cbf68,
    _s=0xff07e668 "select received, contact, socket, cflags, path from location where expires > '2008-03-04 13:37:51' and cflags & 64 = 64 and id % 1 = 0", _r=0xffbff830) at dbase.c:447
#5  0xff053260 in get_all_db_ucontacts (buf=0x1ceec0, len=320054, flags=64, part_idx=0, part_max=1) at dlist.c:128
#6  0xff0528c8 in get_all_ucontacts (buf=0x1ceec0, len=320058, flags=64, part_idx=0, part_max=1) at dlist.c:356
#7  0xfee57c6c in pingClients (ticks=60, param=0x0) at functions.h:60
#8  0x000aa430 in timer_ticker (timer_list=0x163c00) at timer.c:275
#9  0x000aa180 in run_timer_process (tpl=0x1c5808, do_jiffies=1) at timer.c:357
#10 0x000aa6fc in start_timer_processes () at timer.c:386
#11 0x00036788 in main_loop () at main.c:873
#12 0x0003a0c4 in main (argc=1137536, argv=0x155f1c) at main.c:1372

Inspecting frame 0 in detail, in particular the qm variable:

(gdb) print qm
$1 = (struct fm_block *) 0x185320


qm is a pointer to the fm_block structure defined in mem/f_malloc.h.

(gdb) frame 0
#0  0x000bcfbc in fm_malloc (qm=0x185320, size=24, file=0xfedbac10 "res.c",
    func=0xfedbac70 "db_mysql_get_columns", line=62) at mem/f_malloc.c:267
267                             if ((*f)->size>=size) goto found;
(gdb) list
262             /*search for a suitable free frag*/
263
264             for(hash=GET_HASH(size);hash<F_HASH_SIZE;hash++){
265                     f=&(qm->free_hash[hash].first);
266                     for(;(*f); f=&((*f)->u.nxt_free))
267                             if ((*f)->size>=size) goto found;
268                     /* try in a bigger bucket */
269             }
270             /* not found, bad! */
271             return 0;
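
As a side note on why hash starts at 3 for a 24-byte request: the sketch below is only my simplified reading of how GET_HASH maps a size to a bucket. The constants are illustrative assumptions (they may not match this exact build), but with an 8-byte ROUNDTO a 24-byte request would land in bucket 3, which is what I see in the core.

#include <stdio.h>

/* Illustrative constants -- the real definitions live in mem/f_malloc.h
 * and may differ in this particular build. */
#define ROUNDTO          8UL                       /* allocation granularity (assumed) */
#define OPTIMIZE_FACTOR  14UL
#define OPTIMIZE         (1UL << OPTIMIZE_FACTOR)  /* linear/logarithmic threshold */

/* Small requests map linearly, one bucket per ROUNDTO step; larger
 * requests fall into logarithmic buckets past the linear region. */
unsigned long get_hash(unsigned long size)
{
    if (size <= OPTIMIZE)
        return size / ROUNDTO;

    unsigned long idx = 0;                 /* floor(log2(size)) */
    for (unsigned long s = size; s > 1; s >>= 1)
        idx++;
    return OPTIMIZE / ROUNDTO + idx - OPTIMIZE_FACTOR + 1;
}

int main(void)
{
    printf("bucket for a 24-byte request: %lu\n", get_hash(24));  /* prints 3 */
    return 0;
}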


If I print the qm->free_hash array, I find that it is mostly empty. In the
particular case of my core file, hash has a value of 3; when I print that
position, I get the following:

(gdb) print qm->free_hash[hash]
$1 = {first = 0x69703a31, no = 1}
(gdb) print qm->free_hash
$2 = {{first = 0x0, no = 0}, {first = 0x0, no = 0}, {first = 0x0, no = 0},
  {first = 0x69703a31, no = 1}, {first = 0x0, no = 0}, {first = 0x0, no = 0},
  {first = 0x0, no = 0}, {first = 0x0, no = 0}, {first = 0x0, no = 0},
  {first = 0x0, no = 0}, {first = 0x24dd68, no = 4641},
  {first = 0x0, no = 0} <repeats 21 times>, {first = 0x1ced90, no = 1},
  {first = 0x0, no = 0} <repeats 679 times>, {first = 0x1cef40, no = 1},
  {first = 0x0, no = 0} <repeats 1337 times>, {first = 0x1cef40, no = 1},
  {first = 0x0, no = 0}, {first = 0x24de38, no = 1},
  {first = 0x0, no = 0} <repeats 11 times>, {first = 0x21d100, no = 1},
  {first = 0x0, no = 0}, {first = 0x0, no = 0}}
(gdb) print qm->free_hash.no
$3 = 0
(gdb) print qm->free_hash[hash].first
$4 = (struct fm_frag *) 0x69703a31
(gdb) x/s 0x69703a31
0x69703a31:      <Address 0x69703a31 out of bounds>


So the error happened because an invalid fragment was referenced from the free
list: the allocator followed a pointer that does not refer to a valid address at all.
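
As an aside, the bogus first pointer itself looks like plain ASCII when its bytes are read in memory order (SPARC is big-endian): 0x69 0x70 0x3a 0x31 is "ip:1", which makes me suspect that some string, perhaps part of a SIP URI, was written over the free-list link. A tiny sketch to decode it, with the value simply pasted in from the gdb output above:

#include <stdio.h>

int main(void)
{
    /* The bogus free_hash[3].first value from the core file */
    unsigned long bad = 0x69703a31UL;

    /* Print the bytes most significant first, which is their
     * in-memory order on big-endian SPARC. */
    for (int shift = 24; shift >= 0; shift -= 8) {
        unsigned char c = (unsigned char)((bad >> shift) & 0xff);
        putchar((c >= 0x20 && c < 0x7f) ? c : '.');
    }
    putchar('\n');    /* prints "ip:1" */
    return 0;
}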

I have two questions:

1. Is it normal for the free_hash array in fm_block to have some positions
pointing to invalid locations?
2. I can see that the fm_frag_lnk struct has a member called no which, in the
output above, is 0 for most entries of the array and 1 for some of them,
including the one that causes the crash. Would it not be possible to use that
member for a check before attempting the allocation (see the sketch below)?
What exactly does the no member mean? I also see that for some entries it has
a value higher than 1.
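
To make question 2 more concrete, here is a rough sketch of the kind of defensive check I have in mind. It is only an illustration, not a tested patch: the struct definitions are simplified stand-ins for the real ones in mem/f_malloc.h (I am assuming fm_block keeps first_frag/last_frag pool boundaries, as in the 1.3 sources), and the check validates each link against the pool bounds and alignment before following it. The no counter could additionally be used to bound how many links are walked.

#include <stdio.h>
#include <stdint.h>

/* Simplified stand-ins for the real fm_frag / fm_block from
 * mem/f_malloc.h -- names and layout are illustrative only. */
struct fm_frag {
    unsigned long size;
    union { struct fm_frag *nxt_free; } u;
};

struct fm_frag_lnk {
    struct fm_frag *first;
    unsigned long no;            /* number of fragments in this bucket */
};

#define F_HASH_SIZE 2067         /* illustrative; the real value comes from mem/f_malloc.h */

struct fm_block {
    struct fm_frag *first_frag;  /* start of the managed pool (assumed member) */
    struct fm_frag *last_frag;   /* end of the managed pool (assumed member) */
    struct fm_frag_lnk free_hash[F_HASH_SIZE];
};

/* Reject pointers that cannot possibly be fragments inside the pool:
 * out of bounds or not aligned for a struct fm_frag. */
int frag_is_sane(const struct fm_block *qm, const struct fm_frag *f)
{
    if (f < qm->first_frag || f > qm->last_frag)
        return 0;
    if (((uintptr_t)f) % sizeof(void *) != 0)
        return 0;
    return 1;
}

/* Guarded version of the bucket walk quoted from fm_malloc():
 * bail out and log instead of dereferencing a corrupted link. */
struct fm_frag *find_free(struct fm_block *qm, unsigned long size,
                          unsigned int start_hash)
{
    for (unsigned int hash = start_hash; hash < F_HASH_SIZE; hash++) {
        struct fm_frag **f = &qm->free_hash[hash].first;
        for (; *f; f = &((*f)->u.nxt_free)) {
            if (!frag_is_sane(qm, *f)) {
                fprintf(stderr, "BUG: corrupted free list in bucket %u (ptr=%p)\n",
                        hash, (void *)*f);
                return NULL;     /* refuse to follow a bad pointer */
            }
            if ((*f)->size >= size)
                return *f;       /* suitable free fragment found */
        }
    }
    return NULL;                 /* nothing suitable found */
}

Of course, such a check would not fix the corruption itself; it would only turn a crash into a logged error.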

Thanks in advance for any help, and again, I apologize for the long post.

Best regards.

Sergio Gutierrez


On Thu, Feb 28, 2008 at 11:49 AM, Sergio Gutierrez <saguti at gmail.com> wrote:

> Hi Henning.
>
> Thanks a lot for your answer.
>
> Currently, the machine does not report any hardware problems; Solaris 10
> has a service called Fault Manager, which is running on my machine and has
> not reported any related errors.
>
> At the moment, I am testing an OpenSER installation compiled with an
> optimized version of GCC released by Sun for SPARC systems; this release is
> based on GCC 4, and so far OpenSER has been running for almost 18 hours
> without a crash.
>
> I will inspect the core file again and post what I find.
>
> Best regards, and thanks again.
>
> Sergio Gutierrez.
>
>
>
>
> On Thu, Feb 28, 2008 at 5:19 AM, Henning Westerholt <
> henning.westerholt at 1und1.de> wrote:
>
> > On Thursday 28 February 2008, Sergio Gutierrez wrote:
> > > My OpenSER 1.3 installation running on Solaris SPARC is facing random and
> > > unexpected crashes, apparently related to the timer process.
> > >
> > > The last core shows the following backtrace:
> > >
> > > #0  0xfe977a04 in get_expired_dlgs (time=4233810208) at dlg_timer.c:194
> > > #1  0xfe977540 in dlg_timer_routine (ticks=7980, attr=0x0) at dlg_timer.c:210
> > > #2  0x000a839c in timer_ticker (timer_list=0x15ec00) at timer.c:275
> > > #3  0x000a80ec in run_timer_process (tpl=0x1b8088, do_jiffies=1) at timer.c:357
> > > #4  0x000a8668 in start_timer_processes () at timer.c:386
> > > #5  0x00035ea8 in main_loop () at main.c:873
> > > #6  0x000397c4 in main (argc=-4195024, argv=0x150e9c) at main.c:1372
> > >
> > >
> > > Thanks in advance for any hint you can give me.
> >
> > Hi Sergio,
> >
> > Signal 10 is SIGBUS on Solaris. This can be caused by an invalid address
> > alignment, a segmentation fault on a physical address, or an object
> > hardware error (Wikipedia).
> >
> > The first crashes were both caused by get_all_ucontacts, triggered by a
> > timer. This crash is now in another timer, the deletion of expired dialogs,
> > which is strange. Is this machine otherwise stable? With which OpenSER
> > release did these crashes start?
> >
> > Have you already inspected the data structures in the get_expired_dlgs
> > function with the debugger? Perhaps there is something wrong in there.
> >
> > Cheers,
> >
> > Henning
> >
>
>