[sr-dev] Fwd: crash & 100% CPU usage problem with kamailio 3.1.2

Andrei Pelinescu-Onciul andrei at iptel.org
Tue Apr 12 17:48:29 CEST 2011



Is this really kamailio 3.1.2? There are some line numbers that do not
match.

How was it compiled? Could you send me the output of 
kamailio -V
(or ser -V, whichever name you use)

Could you describe a bit the stress tests? Do you use an open source
tool (if so which, I would like to try to reproduce it) or
something proprietary?
If the later, do you open lots of connections all the time, or only a
few connections with heavy traffic?
How many cores has the machine on which you run kamailio?

Do you still have the coredumps? (I would be interested in some variable
values if you do).

At first sight it could either be some kind of race (that's why I asked
about the number of cores) or something completely unrelated to the tcp
code, that overwrites something in a memory area that happens to belong
to a tcp connections (especially the double free hints towards this).


Andrei


On Apr 01, 2011 at 09:32, Zunnun <zunnun at gmail.com> wrote:
> ---------- Forwarded message ----------
> From: Zunnun <zunnun at gmail.com>
> Date: Wed, Mar 30, 2011 at 10:51 AM
> Subject: crash & 100% CPU usage problem with kamailio 3.1.2
> To: sr-dev at lists.sip-router.org
> 
> 
> *kamailio 3.1.2 issues*
> 
> *Problem 1:*
> 
> Running heavy stress (for few hours), we have seen 100 % CPU usage
> Reason: the linked list is circular. next pointer points itself & the loop
> never breaks.
> 
> file: tcp_main.c
> 
> function:
> 
> inline static int _tcpconn_add_alias_unsafe(struct tcp_connection* c, int
> port,struct ip_addr* l_ip, int l_port,int flags)
> 
> for (a=tcpconn_aliases_hash[hash], nxt=0; a; a=nxt){
>                         nxt=a->next;
> 
> here a->next points to a & loop never breaks
> 
> 
> 
> 
> *Problem 2: *
> kamailio process terminates (heavy stress for over 24 hours)
> 
> Reason: it calls abort()
> 
> file: tcp_main.c
> function
> 
> inline static int tcpconn_chld_put(struct tcp_connection* tcpconn)
> {
>         if (unlikely(atomic_dec_and_test(&tcpconn->refcnt))){
>                 DBG("tcpconn_chld_put: destroying connection %p (%d, %d) "
>                                 "flags %04x\n", tcpconn, tcpconn->id,
>                                 tcpconn->s, tcpconn->flags);
>                 /* sanity checks */
>                 membar_read_atomic_op(); /* make sure we see the current
> flags */
>                 if (unlikely(!(tcpconn->flags & F_CONN_FD_CLOSED) ||
>                         (tcpconn->flags &
>                                 (F_CONN_HASHED|F_CONN_MAIN_TIMER|
>                                  F_CONN_READ_W|F_CONN_WRITE_W)) )){
>                         LOG(L_CRIT, "BUG: tcpconn_chld_put: %p bad flags =
> %0x\n",
>                                         tcpconn, tcpconn->flags);
>                         abort(); //CALLS abort
>                 }
>                 _tcpconn_free(tcpconn); /* destroys also the wbuf_q if still
> present*/
>                 return 1;
>         }
>         return 0;
> }
> 
> 
> *Problem 3: *
> kamailio crashed (heavy stress, seen it twice after 4 days 8 hours)
> Reason: Circular link list is bad, prev pointer is NULL & kamailio access it
> 
> #0  local_timer_list_expire (lt=0x82eea0, saved_ticks=1295476481) at
> local_timer.c:221
> 221                     _timer_rm_list(tl); /* detach */
> (gdb) bt
> #0  local_timer_list_expire (lt=0x82eea0, saved_ticks=1295476481) at
> local_timer.c:221
> #1  local_timer_expire (lt=0x82eea0, saved_ticks=1295476481) at
> local_timer.c:250
> #2  local_timer_run (lt=0x82eea0, saved_ticks=1295476481) at
> local_timer.c:274
> #3  0x0000000000510c3e in tcp_timer_run () at tcp_main.c:4384
> #4  tcp_main_loop () at tcp_main.c:4564
> #5  0x0000000000469eba in main_loop () at main.c:1641
> #6  0x000000000046c04f in main (argc=<value optimized out>,
> argv=0x7fff3d28a3c8) at main.c:2398
> 
> (gdb) print tl
> $1 = <value optimized out>
> (gdb) print h
> $2 = (struct timer_head *) 0x855eb8
> (gdb) print *h
> $3 = {next = 0x0, prev = 0x2acbac5398b8}
> (gdb) print *h->prev
> $4 = {next = 0x0, prev = 0x855eb8, expire = 1295476481, initial_timeout =
> 1920, data = 0x2acbac5397d0, f = 0x4f8310 <tcpconn_main_timeout>, flags =
> 512, slow_idx = 0}
> (gdb)
> 
> once prev pointer was NULL & next crash next pointer was NULL
> 
> *Problem 4:*
> kamailio process terminated (heavy stress, found it twice, after 16 hours)
> Reason: it calls abort()
> 
> file : mem/q_malloc.c
> 
> function
> 
> void qm_free(struct qm_block* qm, void* p)
> 
> partial code:
> 
> #ifdef DBG_QM_MALLOC
>         qm_debug_frag(qm, f);
>         if (f->u.is_free){
>                 LOG(L_CRIT, "BUG: qm_free: freeing already freed pointer,"
>                                 " first free: %s: %s(%ld) - aborting\n",
>                                 f->file, f->func, f->line);
>                 abort(); //CALLS ABORT
> 
>         }
>         MDBG("qm_free: freeing frag. %p alloc'ed from %s: %s(%ld)\n",
>                         f, f->file, f->func, f->line);
> #endif
> 
> *problem 5*:
> infinite loop - log file is full of these messages 100% CPU at that time
> /kamailio[21562]: : <core> [io_wait.h:617]: BUG: io_watch_del: invalid fd
> -1, not in [0, 2)
> //kamailio[21562]: : <core> [tcp_read.c:1218]: ERROR: tcpconn_receive:
> handle_io: io_watch_del failed for 0x2acbac5397d0

> _______________________________________________
> sr-dev mailing list
> sr-dev at lists.sip-router.org
> http://lists.sip-router.org/cgi-bin/mailman/listinfo/sr-dev




More information about the sr-dev mailing list