Hi Henning.<br><br>I apologize in advance for the long post.<br><br>OpenSER is still crashing randomly.<br><br>Using GDB on one of the generated core files, I found something curious:<br><br>#0 0x000bcfbc in fm_malloc (qm=0x185320, size=24, file=0xfedbac10 "res.c",<br>
func=0xfedbac70 "db_mysql_get_columns", line=62) at mem/f_malloc.c:267<br>#1 0xfedb74b0 in db_mysql_get_columns (_h=0x1cbf68, _r=0x24dde8) at res.c:62<br>#2 0xfedb79f0 in db_mysql_convert_result (_h=0x1cbf68, _r=0x24dde8) at res.c:167<br>
#3 0xfedb28c4 in db_mysql_store_result (_h=0x1cbf68, _r=0xffbff830) at dbase.c:209<br>#4 0xfedb40e8 in db_mysql_raw_query (_h=0x1cbf68,<br> _s=0xff07e668 "select received, contact, socket, cflags, path from location where expires > '2008-03-04 13:37:51' and cflags & 64 = 64 and id % 1 = 0", _r=0xffbff830) at dbase.c:447<br>
#5 0xff053260 in get_all_db_ucontacts (buf=0x1ceec0, len=320054, flags=64, part_idx=0, part_max=1)<br> at dlist.c:128<br>#6 0xff0528c8 in get_all_ucontacts (buf=0x1ceec0, len=320058, flags=64, part_idx=0, part_max=1) at dlist.c:356<br>
#7 0xfee57c6c in pingClients (ticks=60, param=0x0) at functions.h:60<br>#8 0x000aa430 in timer_ticker (timer_list=0x163c00) at timer.c:275<br>#9 0x000aa180 in run_timer_process (tpl=0x1c5808, do_jiffies=1) at timer.c:357<br>
#10 0x000aa6fc in start_timer_processes () at timer.c:386<br>#11 0x00036788 in main_loop () at main.c:873<br>#12 0x0003a0c4 in main (argc=1137536, argv=0x155f1c) at main.c:1372<br><br>Inspecting frame 0 in detail, in particular the qm variable:<br>
<br>(gdb) print qm<br>$1 = (struct fm_block *) 0x185320<br><br><br>This is a pointer to the fm_block structure defined in mem/f_malloc.h.<br><br>(gdb) frame 0<br>#0 0x000bcfbc in fm_malloc (qm=0x185320, size=24, file=0xfedbac10 "res.c",<br>
func=0xfedbac70 "db_mysql_get_columns", line=62) at mem/f_malloc.c:267<br>267 if ((*f)->size>=size) goto found;<br>(gdb) list<br>262 /*search for a suitable free frag*/<br>
263<br>264 for(hash=GET_HASH(size);hash<F_HASH_SIZE;hash++){<br>265 f=&(qm->free_hash[hash].first);<br>266 for(;(*f); f=&((*f)->u.nxt_free))<br>267 if ((*f)->size>=size) goto found;<br>
268 /* try in a bigger bucket */<br>269 }<br>270 /* not found, bad! */<br>271 return 0;<br><br><br>Printing the qm->free_hash array shows that it is mostly empty. In this core file, hash has the value 3, and printing that entry gives the following:<br>
<br>(gdb) print qm->free_hash[hash]<br>$1 = {first = 0x69703a31, no = 1}<br>(gdb) print qm->free_hash<br>$2 = {{first = 0x0, no = 0}, {first = 0x0, no = 0}, {first = 0x0, no = 0}, {first = 0x69703a31, no = 1}, {<br>
first = 0x0, no = 0}, {first = 0x0, no = 0}, {first = 0x0, no = 0}, {first = 0x0, no = 0}, {first = 0x0,<br> no = 0}, {first = 0x0, no = 0}, {first = 0x24dd68, no = 4641}, {first = 0x0, no = 0} <repeats 21 times>, {<br>
first = 0x1ced90, no = 1}, {first = 0x0, no = 0} <repeats 679 times>, {first = 0x1cef40, no = 1}, {<br> first = 0x0, no = 0} <repeats 1337 times>, {first = 0x1cef40, no = 1}, {first = 0x0, no = 0}, {<br>
first = 0x24de38, no = 1}, {first = 0x0, no = 0} <repeats 11 times>, {first = 0x21d100, no = 1}, {<br> first = 0x0, no = 0}, {first = 0x0, no = 0}}<br>(gdb) print qm->free_hash.no<br>$3 = 0<br>(gdb) print qm->free_hash[hash].first<br>
$4 = (struct fm_frag *) 0x69703a31<br>(gdb) x/s 0x69703a31<br>0x69703a31: <Address 0x69703a31 out of bounds><br><br><br>So the crash happened because the free list for that bucket starts at an invalid fragment address, which fm_malloc then dereferences at line 267.<br>
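A quick way to look closer at the suspicious value 0x69703a31 is to read it as raw bytes and to check its alignment. The sketch below is only an illustration, not part of OpenSER; the helper names decode_be32 and is_aligned are made up for this example. It decodes the value most significant byte first, which is the in-memory byte order on big-endian SPARC, and tests whether the address is a multiple of four.

```c
#include <assert.h>
#include <ctype.h>
#include <stdint.h>

/* Hypothetical helper: write the four bytes of v into out, most significant
 * byte first (the in-memory order on big-endian SPARC). Bytes that are not
 * printable ASCII are shown as '.'. out must hold at least 5 chars. */
static void decode_be32(uint32_t v, char out[5])
{
    for (int i = 0; i < 4; i++) {
        unsigned char c = (unsigned char)(v >> (24 - 8 * i));
        out[i] = isprint(c) ? (char)c : '.';
    }
    out[4] = '\0';
}

/* Hypothetical helper: return 1 if addr is a multiple of align.
 * align must be a non-zero power of two. */
static int is_aligned(uint32_t addr, uint32_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (addr & (align - 1)) == 0;
}
```

Calling decode_be32(0x69703a31, buf) yields the text "ip:1", and is_aligned(0x69703a31, 4) returns 0. A list head that reads as printable text may mean that string data was written over it, and an odd address is a plausible trigger for a bus error when a word-sized field is read through it on SPARC. Both points are hypotheses to verify in the core file, not a confirmed diagnosis.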
<br>I have two questions:<br><br>1. Is it normal for the free_hash array in fm_block to have entries that point to invalid locations?<br>2. The fm_frag_lnk struct has a member called no. In the output above it is 0 for most entries of the array and 1 for some, including the one that causes the crash. Would it be possible to use that member for a check before attempting the allocation? And what exactly does the no member mean? For some entries it has a value higher than 1.<br>
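To make the second question concrete, here is a minimal sketch of what a defensive bucket walk could look like. It uses simplified, hypothetical stand-ins (struct frag, struct frag_lnk, struct pool) rather than the real definitions in mem/f_malloc.h, and it assumes that no counts the fragments linked in a bucket, which the dump suggests but does not prove. Note that in the dump the bad entry has no = 1, so a check on no alone would not have caught it; the sketch therefore also checks that each pointer lies inside the memory pool and is properly aligned before dereferencing it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified, hypothetical stand-ins for the structures in mem/f_malloc.h. */
struct frag {
    unsigned long size;      /* usable size of this free fragment */
    struct frag *nxt_free;   /* next free fragment in the same bucket */
};

struct frag_lnk {
    struct frag *first;      /* head of the bucket's free list */
    unsigned long no;        /* ASSUMED: number of fragments in the bucket */
};

struct pool {
    char *start;             /* first byte of the managed memory */
    char *end;               /* one past the last byte */
};

/* Return 1 if f lies fully inside the pool and is suitably aligned. */
static int frag_ptr_ok(const struct pool *p, const struct frag *f)
{
    uintptr_t a = (uintptr_t)f;
    return a >= (uintptr_t)p->start
        && a + sizeof(struct frag) <= (uintptr_t)p->end
        && a % _Alignof(struct frag) == 0;
}

/* Walk one bucket and return the first fragment with size >= wanted.
 * Returns NULL if nothing fits. If a pointer fails validation, or the walk
 * visits more fragments than b->no claims, sets *corrupt and returns NULL. */
static struct frag *find_in_bucket(const struct pool *p,
                                   const struct frag_lnk *b,
                                   unsigned long wanted, int *corrupt)
{
    unsigned long seen = 0;

    assert(p != NULL && b != NULL && corrupt != NULL);
    *corrupt = 0;
    for (struct frag *f = b->first; f != NULL; f = f->nxt_free) {
        if (!frag_ptr_ok(p, f) || ++seen > b->no) {
            *corrupt = 1;
            return NULL;
        }
        if (f->size >= wanted)
            return f;
    }
    return NULL;
}
```

Such a check would only turn the crash into a detectable error; it would not explain how the list head was overwritten in the first place, and it adds cost to every allocation.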
<br>Thanks in advance for any help, and again, I apologize for the long post.<br><br>Best regards.<br><br>Sergio Gutierrez<br><br><br><div class="gmail_quote">On Thu, Feb 28, 2008 at 11:49 AM, Sergio Gutierrez <<a href="mailto:saguti@gmail.com">saguti@gmail.com</a>> wrote:<br>
<blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">Hi Henning.<br><br>Thanks a lot for your answer.<br><br>Currently, the machine does not report any hardware problem; Solaris 10 has a service called Fault Manager, which is running on my machine, and it has not reported any error or problem related to it.<br>
<br>At this moment, I am testing an OpenSER installation compiled with an optimized version of GCC released by Sun for SPARC systems; this release is based on GCC 4, and so far OpenSER has been running for almost 18 hours without a crash.<br>
<br>I will inspect the core file again, and I will be posting what I find.<br><br>Best regards, and thanks again.<br><font color="#888888"><br>Sergio Gutierrez.</font><div><div></div><div class="Wj3C7c"><br><br><br><br><div class="gmail_quote">
On Thu, Feb 28, 2008 at 5:19 AM, Henning Westerholt <<a href="mailto:henning.westerholt@1und1.de" target="_blank">henning.westerholt@1und1.de</a>> wrote:<br>
<blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"><div>On Thursday 28 February 2008, Sergio Gutierrez wrote:<br>
> My OpenSER 1.3 installation running on Solaris SPARC is facing random and<br>
> unexpected crashes, apparently related to the timer process.<br>
><br>
> The last core presents the following backtrace<br>
><br>
> #0 0xfe977a04 in get_expired_dlgs (time=4233810208) at dlg_timer.c:194<br>
> #1 0xfe977540 in dlg_timer_routine (ticks=7980, attr=0x0) at dlg_timer.c:210<br>
> #2 0x000a839c in timer_ticker (timer_list=0x15ec00) at timer.c:275<br>
> #3 0x000a80ec in run_timer_process (tpl=0x1b8088, do_jiffies=1) at timer.c:357<br>
> #4 0x000a8668 in start_timer_processes () at timer.c:386<br>
> #5 0x00035ea8 in main_loop () at main.c:873<br>
> #6 0x000397c4 in main (argc=-4195024, argv=0x150e9c) at main.c:1372<br>
><br>
><br>
> Thanks in advance for any hint you can give me.<br>
<br>
</div>Hi Sergio,<br>
<br>
Signal 10 is SIGBUS on Solaris. It can be caused by an invalid address<br>
alignment, a nonexistent physical address, or an object-specific hardware<br>
error (Wikipedia).<br>
<br>
The first crashes were both caused by get_all_ucontacts, triggered by a<br>
timer. This crash comes from another timer, the deletion of expired dialogs,<br>
which is strange. Is this machine otherwise stable, and since when (which<br>
OpenSER release) have these crashes been occurring?<br>
<br>
Have you already inspected with the debugger the data structures used in the<br>
get_expired_dlgs function? Perhaps something is wrong<br>
there.<br>
<br>
Cheers,<br>
<font color="#888888"><br>
Henning<br>
</font></blockquote></div><br>
</div></div></blockquote></div><br>