Hi Henning.

I apologize in advance for the long post.

OpenSER has still been crashing randomly over the past few days.

Using GDB on one of the generated core files, I found something curious:

#0 0x000bcfbc in fm_malloc (qm=0x185320, size=24, file=0xfedbac10 "res.c",
    func=0xfedbac70 "db_mysql_get_columns", line=62) at mem/f_malloc.c:267
#1 0xfedb74b0 in db_mysql_get_columns (_h=0x1cbf68, _r=0x24dde8) at res.c:62
#2 0xfedb79f0 in db_mysql_convert_result (_h=0x1cbf68, _r=0x24dde8) at res.c:167
#3 0xfedb28c4 in db_mysql_store_result (_h=0x1cbf68, _r=0xffbff830) at dbase.c:209
#4 0xfedb40e8 in db_mysql_raw_query (_h=0x1cbf68,
    _s=0xff07e668 "select received, contact, socket, cflags, path from location where expires > '2008-03-04 13:37:51' and cflags & 64 = 64 and id % 1 = 0", _r=0xffbff830) at dbase.c:447
#5 0xff053260 in get_all_db_ucontacts (buf=0x1ceec0, len=320054, flags=64, part_idx=0, part_max=1)
    at dlist.c:128
#6 0xff0528c8 in get_all_ucontacts (buf=0x1ceec0, len=320058, flags=64, part_idx=0, part_max=1) at dlist.c:356
#7 0xfee57c6c in pingClients (ticks=60, param=0x0) at functions.h:60
#8 0x000aa430 in timer_ticker (timer_list=0x163c00) at timer.c:275
#9 0x000aa180 in run_timer_process (tpl=0x1c5808, do_jiffies=1) at timer.c:357
#10 0x000aa6fc in start_timer_processes () at timer.c:386
#11 0x00036788 in main_loop () at main.c:873
#12 0x0003a0c4 in main (argc=1137536, argv=0x155f1c) at main.c:1372

Inspecting frame 0 in detail, in particular the qm variable:

(gdb) print qm
$1 = (struct fm_block *) 0x185320

This is the fm_block structure defined in mem/f_malloc.h.

(gdb) frame 0
#0 0x000bcfbc in fm_malloc (qm=0x185320, size=24, file=0xfedbac10 "res.c",
    func=0xfedbac70 "db_mysql_get_columns", line=62) at mem/f_malloc.c:267
267                             if ((*f)->size>=size) goto found;
(gdb) list
262             /*search for a suitable free frag*/
263
264             for(hash=GET_HASH(size);hash<F_HASH_SIZE;hash++){
265                     f=&(qm->free_hash[hash].first);
266                     for(;(*f); f=&((*f)->u.nxt_free))
267                             if ((*f)->size>=size) goto found;
268                     /* try in a bigger bucket */
269             }
270             /* not found, bad! */
271             return 0;

When I print the qm->free_hash array, I find that it is mostly empty. In my core file, hash has a value of 3, and printing that position gives the following:

(gdb) print qm->free_hash[hash]
$1 = {first = 0x69703a31, no = 1}
(gdb) print qm->free_hash
$2 = {{first = 0x0, no = 0}, {first = 0x0, no = 0}, {first = 0x0, no = 0}, {first = 0x69703a31, no = 1}, {
    first = 0x0, no = 0}, {first = 0x0, no = 0}, {first = 0x0, no = 0}, {first = 0x0, no = 0}, {first = 0x0,
    no = 0}, {first = 0x0, no = 0}, {first = 0x24dd68, no = 4641}, {first = 0x0, no = 0} <repeats 21 times>, {
    first = 0x1ced90, no = 1}, {first = 0x0, no = 0} <repeats 679 times>, {first = 0x1cef40, no = 1}, {
    first = 0x0, no = 0} <repeats 1337 times>, {first = 0x1cef40, no = 1}, {first = 0x0, no = 0}, {
    first = 0x24de38, no = 1}, {first = 0x0, no = 0} <repeats 11 times>, {first = 0x21d100, no = 1}, {
    first = 0x0, no = 0}, {first = 0x0, no = 0}}
(gdb) print qm->free_hash.no
$3 = 0
(gdb) print qm->free_hash[hash].first
$4 = (struct fm_frag *) 0x69703a31
(gdb) x/s 0x69703a31
0x69703a31:      <Address 0x69703a31 out of bounds>

So the crash happened because the list of free memory fragments referenced an invalid one: the head of bucket 3 points to an address outside the process's memory.
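
One detail that may be worth checking: the bogus pointer value looks like text rather than an address. The following is only a small sketch with hypothetical helper names (addr_to_ascii and is_aligned are not part of OpenSER). It splits a 32-bit value into bytes in big-endian order, which is the in-memory order on SPARC, and tests its alignment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not OpenSER code: write the four bytes of addr,
 * most significant first (big-endian memory order), into out as a
 * NUL-terminated string. Non-printable bytes are shown as '.'. */
void addr_to_ascii(uint32_t addr, char out[5])
{
    assert(out != NULL);
    for (int i = 0; i < 4; i++) {
        unsigned char c = (unsigned char)(addr >> (24 - 8 * i));
        out[i] = (c >= 0x20 && c < 0x7f) ? (char)c : '.';
    }
    out[4] = '\0';
}

/* Returns 1 if addr is a multiple of align (align must be non-zero). */
int is_aligned(uint32_t addr, uint32_t align)
{
    assert(align != 0);
    return addr % align == 0;
}
```

Called with 0x69703a31, addr_to_ascii yields "ip:1", and is_aligned(0x69703a31, 4) returns 0. A possible reading, which I have not verified, is that string data (for example part of a SIP URI) was written over the list head. The odd address would also be consistent with an alignment fault when it is dereferenced.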

I have two questions:

1. Is it normal for the free_hash array in fm_block to have entries pointing to invalid locations?
2. The fm_frag_lnk struct has a member called no. In the output above it is 0 for most entries of the array and 1 for some, including the one that causes the crash. Would it be possible to check that member before attempting the allocation? And what exactly does the no member mean, given that for some entries it is greater than 1?
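
To make the second question concrete, here is a rough sketch of the kind of check I have in mind. It uses a simplified toy model, not the real definitions from mem/f_malloc.h: the toy_* types, TOY_HASH_SIZE, the pool_start/pool_end fields and both functions are hypothetical. It also assumes that no counts the fragments linked in a bucket, which the printed values suggest but which I have not confirmed.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_HASH_SIZE 8              /* hypothetical, not F_HASH_SIZE */

struct toy_frag {                    /* simplified stand-in for fm_frag */
    unsigned long size;
    struct toy_frag *nxt_free;
};

struct toy_frag_lnk {                /* simplified stand-in for fm_frag_lnk */
    struct toy_frag *first;
    unsigned long no;                /* assumed: fragments in this bucket */
};

struct toy_block {                   /* simplified stand-in for fm_block */
    unsigned char *pool_start;       /* hypothetical pool bounds */
    unsigned char *pool_end;
    struct toy_frag_lnk free_hash[TOY_HASH_SIZE];
};

/* Returns 1 if p is suitably aligned and a whole fragment header
 * starting at p fits inside the pool, 0 otherwise. Never dereferences p. */
int frag_ptr_plausible(const struct toy_block *b, const struct toy_frag *p)
{
    uintptr_t a  = (uintptr_t)p;
    uintptr_t lo = (uintptr_t)b->pool_start;
    uintptr_t hi = (uintptr_t)b->pool_end;

    if (a % _Alignof(struct toy_frag) != 0) return 0;
    if (a < lo || a >= hi) return 0;
    if (hi - a < sizeof(struct toy_frag)) return 0;
    return 1;
}

/* Walks one bucket, validating every pointer before following it.
 * Returns 1 only if all pointers are plausible and the chain length
 * equals the bucket's no counter. */
int bucket_is_sane(const struct toy_block *b, int hash)
{
    if (hash < 0 || hash >= TOY_HASH_SIZE) return 0;

    const struct toy_frag *f = b->free_hash[hash].first;
    unsigned long seen = 0;

    while (f != NULL) {
        if (!frag_ptr_plausible(b, f)) return 0;
        if (++seen > b->free_hash[hash].no) return 0;  /* longer than claimed */
        f = f->nxt_free;
    }
    return seen == b->free_hash[hash].no;
}
```

Note that testing no alone would not have caught this crash: the bad entry has no = 1, which looks legitimate, while first is garbage. A check of this kind would only detect the corruption earlier. It would not explain what overwrote the entry.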

Thanks in advance for any help, and again, I apologize for the long post.

Best regards.

Sergio Gutierrez

On Thu, Feb 28, 2008 at 11:49 AM, Sergio Gutierrez <saguti@gmail.com> wrote:

Hi Henning.

Thanks a lot for your answer.

Currently, the machine does not report any hardware problem; Solaris 10 has a service called Fault Manager, which is running on my machine, and it has not reported any error or problem related to it.

At the moment, I am testing an OpenSER installation compiled with an optimized version of GCC released by Sun for SPARC systems. This release is based on GCC 4, and so far OpenSER has been running for almost 18 hours without a crash.

I will inspect the core file again, and I will be posting what I find.

Best regards, and thanks again.

Sergio Gutierrez.

On Thu, Feb 28, 2008 at 5:19 AM, Henning Westerholt <henning.westerholt@1und1.de> wrote:

On Thursday 28 February 2008, Sergio Gutierrez wrote:
> My OpenSER 1.3 installation running on Solaris SPARC is facing random and
> unexpected crashes, apparently related to the timer process.
>
> The last core presents the following backtrace
>
> #0 0xfe977a04 in get_expired_dlgs (time=4233810208) at dlg_timer.c:194
> #1 0xfe977540 in dlg_timer_routine (ticks=7980, attr=0x0) at dlg_timer.c:210
> #2 0x000a839c in timer_ticker (timer_list=0x15ec00) at timer.c:275
> #3 0x000a80ec in run_timer_process (tpl=0x1b8088, do_jiffies=1) at timer.c:357
> #4 0x000a8668 in start_timer_processes () at timer.c:386
> #5 0x00035ea8 in main_loop () at main.c:873
> #6 0x000397c4 in main (argc=-4195024, argv=0x150e9c) at main.c:1372
>
>
> Thanks in advance for any hint you can give me.

Hi Sergio,

Signal 10 is SIGBUS on Solaris. It can be caused by an invalid address
alignment, an access to a nonexistent physical address, or an object-specific
hardware error (Wikipedia).
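
If it helps to tell these cases apart on a live process, a minimal sketch along the following lines could report the POSIX si_code delivered with the signal. The function names are made up for illustration and nothing like this exists in OpenSER as far as I know. Whether Solaris reports additional platform-specific codes is not covered here.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Map the POSIX SIGBUS si_code values to a short description. */
const char *sigbus_reason(int si_code)
{
    switch (si_code) {
    case BUS_ADRALN: return "invalid address alignment";
    case BUS_ADRERR: return "nonexistent physical address";
    case BUS_OBJERR: return "object-specific hardware error";
    default:         return "unknown or non-POSIX code";
    }
}

/* Hypothetical handler: print the reason to stderr using only write(),
 * then return. */
void report_sigbus(int sig, siginfo_t *info, void *ctx)
{
    (void)sig;
    (void)ctx;
    const char *msg = sigbus_reason(info->si_code);
    size_t len = 0;
    while (msg[len] != '\0') len++;          /* manual length count */
    ssize_t rc = write(STDERR_FILENO, msg, len);
    rc = write(STDERR_FILENO, "\n", 1);
    (void)rc;
}

/* Install the handler. SA_RESETHAND restores the default action on
 * entry, so a retried faulting instruction should still terminate the
 * process in the usual way (normally with a core dump).
 * Returns 0 on success, -1 on failure. */
int install_sigbus_reporter(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = report_sigbus;
    sa.sa_flags = SA_SIGINFO | SA_RESETHAND;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGBUS, &sa, NULL);
}
```

Calling install_sigbus_reporter() once at startup would be enough for a quick experiment. An "invalid address alignment" report would point more towards a corrupted pointer than towards hardware.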

The first two crashes were both caused by get_all_ucontacts, triggered by a
timer. This crash is now in another timer, the deletion of expired dialogs,
which is strange. Is this machine otherwise stable? When (with which OpenSER
release) did these crashes start?

Have you already inspected the data structures used in the get_expired_dlgs
function with the debugger? Perhaps there is something wrong in there.

Cheers,

Henning