crash (segfault) on Kamailio master branch

List overview All Threads
Download

newer

older

git:master:98c71597: core - test:...

git:master:8ff19535:...

Jason Penton

5 Oct 2015 5 Oct '15

6:29 a.m.

Hey guys,

Not sure if there have been any changes but I have an interesting problem here when using TCP:

*The BT is as follows:*

#0 local_timer_list_expire (h=0xa0f128 <tcp_reader_ltimer+178664>, t=723807134, l=0x9e3740 <tcp_reader_ltimer>) at local_timer.c:198 #1 local_timer_expire (t=723807134, h=<optimized out>) at local_timer.c:227 #2 local_timer_run (lt=lt@entry=0x9e3740 <tcp_reader_ltimer>, saved_ticks=723807150) at local_timer.c:250 #3 0x00000000005d8417 in tcp_reader_timer_run () at tcp_read.c:1682 #4 tcp_receive_loop (unix_sock=<optimized out>) at tcp_read.c:1734 #5 0x00000000005c81c8 in tcp_init_children () at tcp_main.c:4788 #6 0x00000000004a9da3 in main_loop () at main.c:1664 #7 0x000000000042411e in main (argc=<optimized out>, argv=<optimized out>) at main.c:2566

This seems to be related to clearing timers for TCP connections. The crash is related to the following code:

*_timer_rm_list(tl)*

where it does a null pointer deref on tl->next and tl->prev, which, according to the bt, are null (see below).

*(gdb) print *tl* $14 = {next = 0x0, prev = 0x0, expire = 723807134, initial_timeout = 32, data = 0x7fbbb05aa628, f = 0x5d02f0 <tcpconn_read_timeout>, flags = 512, slow_idx = 0}

Any ideas?

Cheers Jason

Attachments:

attachment.html (text/html — 1.7 KB)

Show replies by date

Daniel-Constantin Mierla

5 Oct 5 Oct

1:42 p.m.

Hello,

the only recent change in the tcp core was resetting a flag to avoid infinite log messages (until connection was closed) via commit d36734d658cd0bcfc8357c7e85ca32da0612aaee .

Does it happen often, or you got it just once so far?

Cheers, Daniel

On 05/10/15 08:29, Jason Penton wrote:

...

Hey guys,

Not sure if there have been any changes but I have an interesting problem here when using TCP:

*The BT is as follows:*

#0 local_timer_list_expire (h=0xa0f128 <tcp_reader_ltimer+178664>, t=723807134, l=0x9e3740 <tcp_reader_ltimer>) at local_timer.c:198 #1 local_timer_expire (t=723807134, h=<optimized out>) at local_timer.c:227 #2 local_timer_run (lt=lt@entry=0x9e3740 <tcp_reader_ltimer>, saved_ticks=723807150) at local_timer.c:250 #3 0x00000000005d8417 in tcp_reader_timer_run () at tcp_read.c:1682 #4 tcp_receive_loop (unix_sock=<optimized out>) at tcp_read.c:1734 #5 0x00000000005c81c8 in tcp_init_children () at tcp_main.c:4788 #6 0x00000000004a9da3 in main_loop () at main.c:1664 #7 0x000000000042411e in main (argc=<optimized out>, argv=<optimized out>) at main.c:2566

This seems to be related to clearing timers for TCP connections. The crash is related to the following code:

*_timer_rm_list(tl)*

where it does a null pointer deref on tl->next and tl->prev, which, according to the bt, are null (see below).

*(gdb) print *tl* $14 = {next = 0x0, prev = 0x0, expire = 723807134, initial_timeout = 32, data = 0x7fbbb05aa628, f = 0x5d02f0 <tcpconn_read_timeout>, flags = 512, slow_idx = 0}

Any ideas?

Cheers Jason

sr-dev mailing list sr-dev@lists.sip-router.org http://lists.sip-router.org/cgi-bin/mailman/listinfo/sr-dev

-- Daniel-Constantin Mierla http://twitter.com/#!/miconda - http://www.linkedin.com/in/miconda Book: SIP Routing With Kamailio - http://www.asipto.com Kamailio Advanced Training, Sep 28-30, 2015, in Berlin - http://asipto.com/u/kat

Jason Penton

1:47 p.m.

Hey Daniel,

Got it once so far... I'll take a look at that commit and see what it did - but very strange indeed.

Cheers Jason

On Mon, 5 Oct 2015 at 15:42 Daniel-Constantin Mierla miconda@gmail.com wrote:

...

Hello,

the only recent change in the tcp core was resetting a flag to avoid infinite log messages (until connection was closed) via commit d36734d658cd0bcfc8357c7e85ca32da0612aaee .

Does it happen often, or you got it just once so far?

Cheers, Daniel

On 05/10/15 08:29, Jason Penton wrote:

Hey guys,

Not sure if there have been any changes but I have an interesting problem here when using TCP:

*The BT is as follows:*

#0 local_timer_list_expire (h=0xa0f128 <tcp_reader_ltimer+178664>, t=723807134, l=0x9e3740 <tcp_reader_ltimer>) at local_timer.c:198 #1 local_timer_expire (t=723807134, h=<optimized out>) at local_timer.c:227 #2 local_timer_run (lt=lt@entry=0x9e3740 <tcp_reader_ltimer>, saved_ticks=723807150) at local_timer.c:250 #3 0x00000000005d8417 in tcp_reader_timer_run () at tcp_read.c:1682 #4 tcp_receive_loop (unix_sock=<optimized out>) at tcp_read.c:1734 #5 0x00000000005c81c8 in tcp_init_children () at tcp_main.c:4788 #6 0x00000000004a9da3 in main_loop () at main.c:1664 #7 0x000000000042411e in main (argc=<optimized out>, argv=<optimized out>) at main.c:2566

This seems to be related to clearing timers for TCP connections. The crash is related to the following code:

*_timer_rm_list(tl)*

where it does a null pointer deref on tl->next and tl->prev, which, according to the bt, are null (see below).

*(gdb) print *tl* $14 = {next = 0x0, prev = 0x0, expire = 723807134, initial_timeout = 32, data = 0x7fbbb05aa628, f = 0x5d02f0 <tcpconn_read_timeout>, flags = 512, slow_idx = 0}

Any ideas?

Cheers Jason

sr-dev mailing listsr-dev@lists.sip-router.orghttp://lists.sip-router.org/cgi-bin/mailman/listinfo/sr-dev

-- Daniel-Constantin Mierlahttp://twitter.com/#!/miconda - http://www.linkedin.com/in/miconda Book: SIP Routing With Kamailio - http://www.asipto.com Kamailio Advanced Training, Sep 28-30, 2015, in Berlin - http://asipto.com/u/kat

sr-dev mailing list sr-dev@lists.sip-router.org http://lists.sip-router.org/cgi-bin/mailman/listinfo/sr-dev

Daniel-Constantin Mierla

1:54 p.m.

It could be some other reason, I will investigate as well. Using any modules on top of tcp (like tcpops, tls or websocket)?

Cheers, Daniel

On 05/10/15 15:47, Jason Penton wrote:

...

Hey Daniel,

Got it once so far... I'll take a look at that commit and see what it did - but very strange indeed.

Cheers Jason

On Mon, 5 Oct 2015 at 15:42 Daniel-Constantin Mierla <miconda@gmail.com mailto:miconda@gmail.com> wrote:

Hello,

the only recent change in the tcp core was resetting a flag to
avoid infinite log messages (until connection was closed) via
commit d36734d658cd0bcfc8357c7e85ca32da0612aaee .

Does it happen often, or you got it just once so far?

Cheers,
Daniel


On 05/10/15 08:29, Jason Penton wrote:

...

Hey guys, 

Not sure if there have been any changes but I have an interesting
problem here when using TCP:

*The BT is as follows:*

#0  local_timer_list_expire (h=0xa0f128
<tcp_reader_ltimer+178664>, t=723807134, l=0x9e3740
<tcp_reader_ltimer>) at local_timer.c:198
#1  local_timer_expire (t=723807134, h=<optimized out>) at
local_timer.c:227
#2  local_timer_run (lt=lt@entry=0x9e3740 <tcp_reader_ltimer>,
saved_ticks=723807150) at local_timer.c:250
#3  0x00000000005d8417 in tcp_reader_timer_run () at tcp_read.c:1682
#4  tcp_receive_loop (unix_sock=<optimized out>) at tcp_read.c:1734
#5  0x00000000005c81c8 in tcp_init_children () at tcp_main.c:4788
#6  0x00000000004a9da3 in main_loop () at main.c:1664
#7  0x000000000042411e in main (argc=<optimized out>,
argv=<optimized out>) at main.c:2566

This seems to be related to clearing timers for TCP connections.
The crash is related to the following code:

*_timer_rm_list(tl)*

where it does a null pointer deref on tl->next and tl->prev,
which, according to the bt, are null (see below).

*(gdb) print *tl*
$14 = {next = 0x0, prev = 0x0, expire = 723807134,
initial_timeout = 32, data = 0x7fbbb05aa628, f = 0x5d02f0
<tcpconn_read_timeout>, flags = 512, slow_idx = 0}

Any ideas?

Cheers
Jason






_______________________________________________
sr-dev mailing list
sr-dev@lists.sip-router.org <mailto:sr-dev@lists.sip-router.org>
http://lists.sip-router.org/cgi-bin/mailman/listinfo/sr-dev

-- 
Daniel-Constantin Mierla
http://twitter.com/#!/miconda <http://twitter.com/#%21/miconda> - http://www.linkedin.com/in/miconda
Book: SIP Routing With Kamailio - http://www.asipto.com
Kamailio Advanced Training, Sep 28-30, 2015, in Berlin - http://asipto.com/u/kat

_______________________________________________
sr-dev mailing list
sr-dev@lists.sip-router.org <mailto:sr-dev@lists.sip-router.org>
http://lists.sip-router.org/cgi-bin/mailman/listinfo/sr-dev

Jason Penton

1:56 p.m.

Hey Daniel

Using TLS only. not using tcpops nor websocket

On Mon, 5 Oct 2015 at 15:54 Daniel-Constantin Mierla miconda@gmail.com wrote:

...

It could be some other reason, I will investigate as well. Using any modules on top of tcp (like tcpops, tls or websocket)?

Cheers, Daniel

On 05/10/15 15:47, Jason Penton wrote:

Hey Daniel,

Got it once so far... I'll take a look at that commit and see what it did

but very strange indeed.

Cheers Jason

On Mon, 5 Oct 2015 at 15:42 Daniel-Constantin Mierla miconda@gmail.com wrote:

...
Hello,

the only recent change in the tcp core was resetting a flag to avoid infinite log messages (until connection was closed) via commit d36734d658cd0bcfc8357c7e85ca32da0612aaee .

Does it happen often, or you got it just once so far?

Cheers, Daniel

On 05/10/15 08:29, Jason Penton wrote:

Hey guys,

Not sure if there have been any changes but I have an interesting problem here when using TCP:

*The BT is as follows:*

#0 local_timer_list_expire (h=0xa0f128 <tcp_reader_ltimer+178664>, t=723807134, l=0x9e3740 <tcp_reader_ltimer>) at local_timer.c:198 #1 local_timer_expire (t=723807134, h=<optimized out>) at local_timer.c:227 #2 local_timer_run (lt=lt@entry=0x9e3740 <tcp_reader_ltimer>, saved_ticks=723807150) at local_timer.c:250 #3 0x00000000005d8417 in tcp_reader_timer_run () at tcp_read.c:1682 #4 tcp_receive_loop (unix_sock=<optimized out>) at tcp_read.c:1734 #5 0x00000000005c81c8 in tcp_init_children () at tcp_main.c:4788 #6 0x00000000004a9da3 in main_loop () at main.c:1664 #7 0x000000000042411e in main (argc=<optimized out>, argv=<optimized out>) at main.c:2566

This seems to be related to clearing timers for TCP connections. The crash is related to the following code:

*_timer_rm_list(tl)*

where it does a null pointer deref on tl->next and tl->prev, which, according to the bt, are null (see below).

*(gdb) print *tl* $14 = {next = 0x0, prev = 0x0, expire = 723807134, initial_timeout = 32, data = 0x7fbbb05aa628, f = 0x5d02f0 <tcpconn_read_timeout>, flags = 512, slow_idx = 0}

Any ideas?

Cheers Jason

sr-dev mailing listsr-dev@lists.sip-router.orghttp://lists.sip-router.org/cgi-bin/mailman/listinfo/sr-dev

-- Daniel-Constantin Mierlahttp://twitter.com/#!/miconda - http://www.linkedin.com/in/miconda Book: SIP Routing With Kamailio - http://www.asipto.com Kamailio Advanced Training, Sep 28-30, 2015, in Berlin - http://asipto.com/u/kat

sr-dev mailing list sr-dev@lists.sip-router.org http://lists.sip-router.org/cgi-bin/mailman/listinfo/sr-dev

-- Daniel-Constantin Mierlahttp://twitter.com/#!/miconda - http://www.linkedin.com/in/miconda Book: SIP Routing With Kamailio - http://www.asipto.com Kamailio Advanced Training, Sep 28-30, 2015, in Berlin - http://asipto.com/u/kat

Jason Penton

6 Oct 6 Oct

1:41 p.m.

I can confirm this is not related to TLS as I have recreated using TCP only. Here is another BT (it seems the entire tcpconn structure is corrupt)....

*(gdb) bt* #0 local_timer_del (h=h@entry=0x99f3a0 <tcp_main_ltimer>, tl=tl@entry=0x7ff3c82ce910) at local_timer.c:178 #1 0x00000000005ad47c in handle_tcpconn_ev (tcpconn=0x7ff3c82ce828, ev=<optimized out>, fd_i=-1) at tcp_main.c:4139 #2 0x00000000005c3e57 in io_wait_loop_epoll (h=<optimized out>, t=<optimized out>, repeat=<optimized out>) at io_wait.h:1061 #3 tcp_main_loop () at tcp_main.c:4481 #4 0x00000000004ab79f in main_loop () at main.c:1672 #5 0x000000000042411e in main (argc=<optimized out>, argv=<optimized out>) at main.c:2566 (gdb) frame 1 #1 0x00000000005ad47c in handle_tcpconn_ev (tcpconn=0x7ff3c82ce828, ev=<optimized out>, fd_i=-1) at tcp_main.c:4139 4139 tcp_main.c: No such file or directory.

*(gdb) frame 0* *(gdb) p *tl* $3 = {next = 0x616a1a160109010d, prev = 0x746e65702e6e6f73, expire = 1933602415, initial_timeout = 1701603693, data = 0x6d6f632e736d6f63, f = 0x9060d3022018230, flags = 34346, slow_idx = 34376} Looks like there is corruption....

*(gdb) frame 1* *(gdb) print *tcpconn* $11 = {s = 77, fd = -1, write_lock = {val = 14}, id = 16254464, reader_pid = -2110739197, rcv = {src_ip = {af = 60854530, len = 33685762, u = {addrl = {9667549471699894785, 363385298974377544}, addr32 = { 221250049, 2250901766, 234325576, 84607233}, addr16 = {513, 3376, 2310, 34346, 34376, 3575, 257, 1291}, addr = "\001\002\060\r\006\t*\206H\206\367\r\001\001\v\005"}}, dst_ip = {af = 825700352, len = 101265419, u = {addrl = {5140016830441870595, 312159073630948145}, addr32 = {100947203, 1196753427, 154143537, 72680198}, addr16 = {21763, 1540, 531, 18261, 2865, 2352, 774, 1109}, addr = "\003U\004\006\023\002UG1\v0\t\006\003U\004"}}, src_port = 3080, dst_port = 21762, proto_reserved1 = 806039879, proto_reserved2 = 1426261513, src_su = {s = {sa_family = 1796, sa_data = "\f\002UG1\016\060\f\006\003U\004\n\f"}, sin = {sin_family = 1796, sin_port = 524, sin_addr = {s_addr = 238110549}, sin_zero = "0\f\006\003U\004\n\f"}, sin6 = {sin6_family = 1796, sin6_port = 524, sin6_flowinfo = 238110549, sin6_addr = {__in6_u = {__u6_addr8 = "0\f\006\003U\004\n\f\005Smile0\036", __u6_addr16 = {3120, 774, 1109, 3082, 21253, 26989, 25964, 7728}, __u6_addr32 = { 50727984, 201983061, 1768772357, 506488172}}}, sin6_scope_id = 892407063}}, bind_address = 0x3135393034323830, proto = 51 '3'}, req = {next = 0x3331353930333238, buf = 0x9300b317c305a34 <error: Cannot access memory at address 0x9300b317c305a34>, start = 0x5502130604550306 <error: Cannot access memory at address 0x5502130604550306>, pos = 0x55030609300b3147 <error: Cannot access memory at address 0x55030609300b3147>, parsed = 0xe314755020c0804 <error: Cannot access memory at address 0xe314755020c0804>, body = 0xc0a045503060c30 <error: Cannot access memory at address 0xc0a045503060c30>, b_size = 1768772357, content_len = 623994220, chunk_size = 50733872, flags = 1109, bytes_to_go = 1935896604, error = 1680762467, state = 1663989349}, refcnt = {val = 1818848627}, type = 1836016485, flags = 12155, send_flags = {f = 99 'c', blst_imask = 111 'o'}, state = 808005997, extra_data = 0xf78648862a090627, timer = {next = 0x616a1a160109010d, prev = 0x746e65702e6e6f73, expire = 1933602415, initial_timeout = 1701603693, data = 0x6d6f632e736d6f63, f = 0x9060d3022018230, flags = 34346, slow_idx = 34376}, timeout = 50332929, lifetime = 983426, id_hash = 167871024, id_next = 0xab4698aaf4cd000, id_prev = 0x82f3bd212ad477e1, c_next = 0xfebfadf2e80d476b, c_prev = 0x542ad26c6fff5d01, con_aliases = {{ parent = 0x7ee7daa501f3ec2, next = 0x2564cfdb796faf0b, prev = 0x67075f982b68c22d, port = 26158, hash = 52278}, {parent = 0xa61ccf520173dbcf, next = 0xedfd914bd9290055, prev = 0x4c82e1be15c8e6bd, port = 56482, hash = 35158}, {parent = 0x1dd62e187ab759f8, next = 0x5b9483e3bd60806b, prev = 0x3c0e501ef81ca353, port = 42794, hash = 37520}, {parent = 0xb6c3f82065f2d125, next = 0xa78be2be546e1fb7, prev = 0x7a2f32985edb7215, port = 16490, hash = 49867}, {parent = 0x12bd4ff964ba4650, next = 0x7f071da522484c27, prev = 0x949f60e68638512f, port = 26169, hash = 41924}, {parent = 0x60a5d1aac5486cb5, next = 0xfa82965a57bf2ff6, prev = 0x7121974cf7f338d5, port = 64152, hash = 31481}, {parent = 0xbe418ec01629988f, next = 0x57a216f873989cb5, prev = 0x8dceabdc4ec9c171, port = 55564, hash = 38148}, { parent = 0x7ba30100010302ef, next = 0x1d55030609307930, prev = 0x62c300030020413, port = 24585, hash = 18566}, {parent = 0x704f1d161f040d01, next = 0x6547204c53536e65, prev = 0x206465746172656e, port = 25923, hash = 29810}, {parent = 0x5503061d30657461, next = 0xb4f5140416040e1d, prev = 0xb4484ad4fb6fc4a5, port = 45712, hash = 7028}, {parent = 0x1d5503061f30883e, next = 0xf714801630180423, prev = 0xf2ba7396fd05170d, port = 24267, hash = 6103}, {parent = 0x2a09060d30f186cf, next = 0xb01010df7864886, prev = 0x8f00010182030005, port = 51556, hash = 28499}}, aliases = 672183753, wbuf_q = { first = 0x5b7bf0e590790cf6, last = 0x67d25b5f6e53d679, wr_timeout = 1518084272, queued = 3564629620, offset = 996062876, last_used = 950886522}}

On Mon, 5 Oct 2015 at 15:56 Jason Penton jason.penton@gmail.com wrote:

...

Hey Daniel

Using TLS only. not using tcpops nor websocket

On Mon, 5 Oct 2015 at 15:54 Daniel-Constantin Mierla miconda@gmail.com wrote:

...
It could be some other reason, I will investigate as well. Using any modules on top of tcp (like tcpops, tls or websocket)?

Cheers, Daniel

On 05/10/15 15:47, Jason Penton wrote:

Hey Daniel,

Got it once so far... I'll take a look at that commit and see what it did

but very strange indeed.

Cheers Jason

On Mon, 5 Oct 2015 at 15:42 Daniel-Constantin Mierla miconda@gmail.com wrote:

...
Hello,

the only recent change in the tcp core was resetting a flag to avoid infinite log messages (until connection was closed) via commit d36734d658cd0bcfc8357c7e85ca32da0612aaee .

Does it happen often, or you got it just once so far?

Cheers, Daniel

On 05/10/15 08:29, Jason Penton wrote:

Hey guys,

Not sure if there have been any changes but I have an interesting problem here when using TCP:

*The BT is as follows:*

#0 local_timer_list_expire (h=0xa0f128 <tcp_reader_ltimer+178664>, t=723807134, l=0x9e3740 <tcp_reader_ltimer>) at local_timer.c:198 #1 local_timer_expire (t=723807134, h=<optimized out>) at local_timer.c:227 #2 local_timer_run (lt=lt@entry=0x9e3740 <tcp_reader_ltimer>, saved_ticks=723807150) at local_timer.c:250 #3 0x00000000005d8417 in tcp_reader_timer_run () at tcp_read.c:1682 #4 tcp_receive_loop (unix_sock=<optimized out>) at tcp_read.c:1734 #5 0x00000000005c81c8 in tcp_init_children () at tcp_main.c:4788 #6 0x00000000004a9da3 in main_loop () at main.c:1664 #7 0x000000000042411e in main (argc=<optimized out>, argv=<optimized out>) at main.c:2566

This seems to be related to clearing timers for TCP connections. The crash is related to the following code:

*_timer_rm_list(tl)*

where it does a null pointer deref on tl->next and tl->prev, which, according to the bt, are null (see below).

*(gdb) print *tl* $14 = {next = 0x0, prev = 0x0, expire = 723807134, initial_timeout = 32, data = 0x7fbbb05aa628, f = 0x5d02f0 <tcpconn_read_timeout>, flags = 512, slow_idx = 0}

Any ideas?

Cheers Jason

sr-dev mailing listsr-dev@lists.sip-router.orghttp://lists.sip-router.org/cgi-bin/mailman/listinfo/sr-dev

-- Daniel-Constantin Mierlahttp://twitter.com/#!/miconda - http://www.linkedin.com/in/miconda Book: SIP Routing With Kamailio - http://www.asipto.com Kamailio Advanced Training, Sep 28-30, 2015, in Berlin - http://asipto.com/u/kat

sr-dev mailing list sr-dev@lists.sip-router.org http://lists.sip-router.org/cgi-bin/mailman/listinfo/sr-dev

-- Daniel-Constantin Mierlahttp://twitter.com/#!/miconda - http://www.linkedin.com/in/miconda Book: SIP Routing With Kamailio - http://www.asipto.com Kamailio Advanced Training, Sep 28-30, 2015, in Berlin - http://asipto.com/u/kat

Daniel-Constantin Mierla

1:50 p.m.

Can you try to run with '-x qm' -- maybe there is a double free or a buffer overflow, q_malloc is more suitable for debugging such cases.

Also, can you give some details on reproducing it? Are you stress testing it?

Daniel

On 06/10/15 15:41, Jason Penton wrote:

...

I can confirm this is not related to TLS as I have recreated using TCP only. Here is another BT (it seems the entire tcpconn structure is corrupt)....

*(gdb) bt* #0 local_timer_del (h=h@entry=0x99f3a0 <tcp_main_ltimer>, tl=tl@entry=0x7ff3c82ce910) at local_timer.c:178 #1 0x00000000005ad47c in handle_tcpconn_ev (tcpconn=0x7ff3c82ce828, ev=<optimized out>, fd_i=-1) at tcp_main.c:4139 #2 0x00000000005c3e57 in io_wait_loop_epoll (h=<optimized out>, t=<optimized out>, repeat=<optimized out>) at io_wait.h:1061 #3 tcp_main_loop () at tcp_main.c:4481 #4 0x00000000004ab79f in main_loop () at main.c:1672 #5 0x000000000042411e in main (argc=<optimized out>, argv=<optimized out>) at main.c:2566 (gdb) frame 1 #1 0x00000000005ad47c in handle_tcpconn_ev (tcpconn=0x7ff3c82ce828, ev=<optimized out>, fd_i=-1) at tcp_main.c:4139 4139tcp_main.c: No such file or directory.

*(gdb) frame 0* *(gdb) p *tl* $3 = {next = 0x616a1a160109010d, prev = 0x746e65702e6e6f73, expire = 1933602415, initial_timeout = 1701603693, data = 0x6d6f632e736d6f63, f = 0x9060d3022018230, flags = 34346, slow_idx = 34376} Looks like there is corruption....

*(gdb) frame 1* *(gdb) print *tcpconn* $11 = {s = 77, fd = -1, write_lock = {val = 14}, id = 16254464, reader_pid = -2110739197, rcv = {src_ip = {af = 60854530, len = 33685762, u = {addrl = {9667549471699894785, 363385298974377544}, addr32 = { 221250049, 2250901766, 234325576, 84607233}, addr16 = {513, 3376, 2310, 34346, 34376, 3575, 257, 1291}, addr = "\001\002\060\r\006\t*\206H\206\367\r\001\001\v\005"}}, dst_ip = {af = 825700352, len = 101265419, u = {addrl = {5140016830441870595, 312159073630948145}, addr32 = {100947203, 1196753427, 154143537, 72680198}, addr16 = {21763, 1540, 531, 18261, 2865, 2352, 774, 1109}, addr = "\003U\004\006\023\002UG1\v0\t\006\003U\004"}}, src_port = 3080, dst_port = 21762, proto_reserved1 = 806039879, proto_reserved2 = 1426261513, src_su = {s = {sa_family = 1796, sa_data = "\f\002UG1\016\060\f\006\003U\004\n\f"}, sin = {sin_family = 1796, sin_port = 524, sin_addr = {s_addr = 238110549}, sin_zero = "0\f\006\003U\004\n\f"}, sin6 = {sin6_family = 1796, sin6_port = 524, sin6_flowinfo = 238110549, sin6_addr = {__in6_u = {__u6_addr8 = "0\f\006\003U\004\n\f\005Smile0\036", __u6_addr16 = {3120, 774, 1109, 3082, 21253, 26989, 25964, 7728}, __u6_addr32 = { 50727984, 201983061, 1768772357, 506488172}}}, sin6_scope_id = 892407063}}, bind_address = 0x3135393034323830, proto = 51 '3'}, req = {next = 0x3331353930333238, buf = 0x9300b317c305a34 <error: Cannot access memory at address 0x9300b317c305a34>, start = 0x5502130604550306 <error: Cannot access memory at address 0x5502130604550306>, pos = 0x55030609300b3147 <error: Cannot access memory at address 0x55030609300b3147>, parsed = 0xe314755020c0804 <error: Cannot access memory at address 0xe314755020c0804>, body = 0xc0a045503060c30 <error: Cannot access memory at address 0xc0a045503060c30>, b_size = 1768772357, content_len = 623994220, chunk_size = 50733872, flags = 1109, bytes_to_go = 1935896604, error = 1680762467, state = 1663989349}, refcnt = {val = 1818848627}, type = 1836016485, flags = 12155, send_flags = {f = 99 'c', blst_imask = 111 'o'}, state = 808005997, extra_data = 0xf78648862a090627, timer = {next = 0x616a1a160109010d, prev = 0x746e65702e6e6f73, expire = 1933602415, initial_timeout = 1701603693, data = 0x6d6f632e736d6f63, f = 0x9060d3022018230, flags = 34346, slow_idx = 34376}, timeout = 50332929, lifetime = 983426, id_hash = 167871024, id_next = 0xab4698aaf4cd000, id_prev = 0x82f3bd212ad477e1, c_next = 0xfebfadf2e80d476b, c_prev = 0x542ad26c6fff5d01, con_aliases = {{ parent = 0x7ee7daa501f3ec2, next = 0x2564cfdb796faf0b, prev = 0x67075f982b68c22d, port = 26158, hash = 52278}, {parent = 0xa61ccf520173dbcf, next = 0xedfd914bd9290055, prev = 0x4c82e1be15c8e6bd, port = 56482, hash = 35158}, {parent = 0x1dd62e187ab759f8, next = 0x5b9483e3bd60806b, prev = 0x3c0e501ef81ca353, port = 42794, hash = 37520}, {parent = 0xb6c3f82065f2d125, next = 0xa78be2be546e1fb7, prev = 0x7a2f32985edb7215, port = 16490, hash = 49867}, {parent = 0x12bd4ff964ba4650, next = 0x7f071da522484c27, prev = 0x949f60e68638512f, port = 26169, hash = 41924}, {parent = 0x60a5d1aac5486cb5, next = 0xfa82965a57bf2ff6, prev = 0x7121974cf7f338d5, port = 64152, hash = 31481}, {parent = 0xbe418ec01629988f, next = 0x57a216f873989cb5, prev = 0x8dceabdc4ec9c171, port = 55564, hash = 38148}, { parent = 0x7ba30100010302ef, next = 0x1d55030609307930, prev = 0x62c300030020413, port = 24585, hash = 18566}, {parent = 0x704f1d161f040d01, next = 0x6547204c53536e65, prev = 0x206465746172656e, port = 25923, hash = 29810}, {parent = 0x5503061d30657461, next = 0xb4f5140416040e1d, prev = 0xb4484ad4fb6fc4a5, port = 45712, hash = 7028}, {parent = 0x1d5503061f30883e, next = 0xf714801630180423, prev = 0xf2ba7396fd05170d, port = 24267, hash = 6103}, {parent = 0x2a09060d30f186cf, next = 0xb01010df7864886, prev = 0x8f00010182030005, port = 51556, hash = 28499}}, aliases = 672183753, wbuf_q = { first = 0x5b7bf0e590790cf6, last = 0x67d25b5f6e53d679, wr_timeout = 1518084272, queued = 3564629620, offset = 996062876, last_used = 950886522}}

On Mon, 5 Oct 2015 at 15:56 Jason Penton <jason.penton@gmail.com mailto:jason.penton@gmail.com> wrote:
Hey Daniel

Using TLS only. not using tcpops nor websocket

On Mon, 5 Oct 2015 at 15:54 Daniel-Constantin Mierla
<miconda@gmail.com <mailto:miconda@gmail.com>> wrote:

    It could be some other reason, I will investigate as well.
    Using any modules on top of tcp (like tcpops, tls or websocket)?

    Cheers,
    Daniel


    On 05/10/15 15:47, Jason Penton wrote:
...
    Hey Daniel,

    Got it once so far... I'll take a look at that commit and see
    what it did - but very strange indeed.

    Cheers
    Jason

    On Mon, 5 Oct 2015 at 15:42 Daniel-Constantin Mierla
    <miconda@gmail.com <mailto:miconda@gmail.com>> wrote:

        Hello,

        the only recent change in the tcp core was resetting a
        flag to avoid infinite log messages (until connection was
        closed) via commit d36734d658cd0bcfc8357c7e85ca32da0612aaee .

        Does it happen often, or you got it just once so far?

        Cheers,
        Daniel


        On 05/10/15 08:29, Jason Penton wrote:
...
        Hey guys, 

        Not sure if there have been any changes but I have an
        interesting problem here when using TCP:

        *The BT is as follows:*

        #0  local_timer_list_expire (h=0xa0f128
        <tcp_reader_ltimer+178664>, t=723807134, l=0x9e3740
        <tcp_reader_ltimer>) at local_timer.c:198
        #1  local_timer_expire (t=723807134, h=<optimized out>)
        at local_timer.c:227
        #2  local_timer_run (lt=lt@entry=0x9e3740
        <tcp_reader_ltimer>, saved_ticks=723807150) at
        local_timer.c:250
        #3  0x00000000005d8417 in tcp_reader_timer_run () at
        tcp_read.c:1682
        #4  tcp_receive_loop (unix_sock=<optimized out>) at
        tcp_read.c:1734
        #5  0x00000000005c81c8 in tcp_init_children () at
        tcp_main.c:4788
        #6  0x00000000004a9da3 in main_loop () at main.c:1664
        #7  0x000000000042411e in main (argc=<optimized out>,
        argv=<optimized out>) at main.c:2566

        This seems to be related to clearing timers for TCP
        connections. The crash is related to the following code:

        *_timer_rm_list(tl)*

        where it does a null pointer deref on tl->next and
        tl->prev, which, according to the bt, are null (see below).

        *(gdb) print *tl*
        $14 = {next = 0x0, prev = 0x0, expire = 723807134,
        initial_timeout = 32, data = 0x7fbbb05aa628, f =
        0x5d02f0 <tcpconn_read_timeout>, flags = 512, slow_idx = 0}

        Any ideas?

        Cheers
        Jason






        _______________________________________________
        sr-dev mailing list
        sr-dev@lists.sip-router.org
        <mailto:sr-dev@lists.sip-router.org>
        http://lists.sip-router.org/cgi-bin/mailman/listinfo/sr-dev
        -- 
        Daniel-Constantin Mierla
        http://twitter.com/#!/miconda
        <http://twitter.com/#%21/miconda> - http://www.linkedin.com/in/miconda
        Book: SIP Routing With Kamailio - http://www.asipto.com
        Kamailio Advanced Training, Sep 28-30, 2015, in Berlin - http://asipto.com/u/kat

        _______________________________________________
        sr-dev mailing list
        sr-dev@lists.sip-router.org
        <mailto:sr-dev@lists.sip-router.org>
        http://lists.sip-router.org/cgi-bin/mailman/listinfo/sr-dev
    -- 
    Daniel-Constantin Mierla
    http://twitter.com/#!/miconda <http://twitter.com/#%21/miconda> - http://www.linkedin.com/in/miconda
    Book: SIP Routing With Kamailio - http://www.asipto.com
    Kamailio Advanced Training, Sep 28-30, 2015, in Berlin - http://asipto.com/u/kat

-- Daniel-Constantin Mierla http://twitter.com/#!/miconda - http://www.linkedin.com/in/miconda Book: SIP Routing With Kamailio - http://www.asipto.com

Jason Penton

1:57 p.m.

Ok cool Daniel, will do.

No, not stress testing at all.... I can't reproduce it on request - but let me add the mem debugging and see how we go...

Cheers Jason

On Tue, 6 Oct 2015 at 15:50 Daniel-Constantin Mierla miconda@gmail.com wrote:

...

Can you try to run with '-x qm' -- maybe there is a double free or a buffer overflow, q_malloc is more suitable for debugging such cases.

Also, can you give some details on reproducing it? Are you stress testing it?

Daniel

On 06/10/15 15:41, Jason Penton wrote:

I can confirm this is not related to TLS as I have recreated using TCP only. Here is another BT (it seems the entire tcpconn structure is corrupt)....

*(gdb) bt* #0 local_timer_del (h=h@entry=0x99f3a0 <tcp_main_ltimer>, tl=tl@entry=0x7ff3c82ce910) at local_timer.c:178 #1 0x00000000005ad47c in handle_tcpconn_ev (tcpconn=0x7ff3c82ce828, ev=<optimized out>, fd_i=-1) at tcp_main.c:4139 #2 0x00000000005c3e57 in io_wait_loop_epoll (h=<optimized out>, t=<optimized out>, repeat=<optimized out>) at io_wait.h:1061 #3 tcp_main_loop () at tcp_main.c:4481 #4 0x00000000004ab79f in main_loop () at main.c:1672 #5 0x000000000042411e in main (argc=<optimized out>, argv=<optimized out>) at main.c:2566 (gdb) frame 1 #1 0x00000000005ad47c in handle_tcpconn_ev (tcpconn=0x7ff3c82ce828, ev=<optimized out>, fd_i=-1) at tcp_main.c:4139 4139 tcp_main.c: No such file or directory.

*(gdb) frame 0* *(gdb) p *tl* $3 = {next = 0x616a1a160109010d, prev = 0x746e65702e6e6f73, expire = 1933602415, initial_timeout = 1701603693, data = 0x6d6f632e736d6f63, f = 0x9060d3022018230, flags = 34346, slow_idx = 34376} Looks like there is corruption....

*(gdb) frame 1* *(gdb) print *tcpconn* $11 = {s = 77, fd = -1, write_lock = {val = 14}, id = 16254464, reader_pid = -2110739197, rcv = {src_ip = {af = 60854530, len = 33685762, u = {addrl = {9667549471699894785, 363385298974377544}, addr32 = { 221250049, 2250901766, 234325576, 84607233}, addr16 = {513, 3376, 2310, 34346, 34376, 3575, 257, 1291}, addr = "\001\002\060\r\006\t*\206H\206\367\r\001\001\v\005"}}, dst_ip = {af = 825700352, len = 101265419, u = {addrl = {5140016830441870595, 312159073630948145}, addr32 = {100947203, 1196753427, 154143537, 72680198}, addr16 = {21763, 1540, 531, 18261, 2865, 2352, 774, 1109}, addr = "\003U\004\006\023\002UG1\v0\t\006\003U\004"}}, src_port = 3080, dst_port = 21762, proto_reserved1 = 806039879, proto_reserved2 = 1426261513, src_su = {s = {sa_family = 1796, sa_data = "\f\002UG1\016\060\f\006\003U\004\n\f"}, sin = {sin_family = 1796, sin_port = 524, sin_addr = {s_addr = 238110549}, sin_zero = "0\f\006\003U\004\n\f"}, sin6 = {sin6_family = 1796, sin6_port = 524, sin6_flowinfo = 238110549, sin6_addr = {__in6_u = {__u6_addr8 = "0\f\006\003U\004\n\f\005Smile0\036", __u6_addr16 = {3120, 774, 1109, 3082, 21253, 26989, 25964, 7728}, __u6_addr32 = { 50727984, 201983061, 1768772357, 506488172}}}, sin6_scope_id = 892407063}}, bind_address = 0x3135393034323830, proto = 51 '3'}, req = {next = 0x3331353930333238, buf = 0x9300b317c305a34 <error: Cannot access memory at address 0x9300b317c305a34>, start = 0x5502130604550306 <error: Cannot access memory at address 0x5502130604550306>, pos = 0x55030609300b3147 <error: Cannot access memory at address 0x55030609300b3147>, parsed = 0xe314755020c0804 <error: Cannot access memory at address 0xe314755020c0804>, body = 0xc0a045503060c30 <error: Cannot access memory at address 0xc0a045503060c30>, b_size = 1768772357, content_len = 623994220, chunk_size = 50733872, flags = 1109, bytes_to_go = 1935896604, error = 1680762467, state = 1663989349}, refcnt = {val = 1818848627}, type = 1836016485, flags = 12155, send_flags = {f = 99 'c', blst_imask = 111 'o'}, state = 808005997, extra_data = 0xf78648862a090627, timer = {next = 0x616a1a160109010d, prev = 0x746e65702e6e6f73, expire = 1933602415, initial_timeout = 1701603693, data = 0x6d6f632e736d6f63, f = 0x9060d3022018230, flags = 34346, slow_idx = 34376}, timeout = 50332929, lifetime = 983426, id_hash = 167871024, id_next = 0xab4698aaf4cd000, id_prev = 0x82f3bd212ad477e1, c_next = 0xfebfadf2e80d476b, c_prev = 0x542ad26c6fff5d01, con_aliases = {{ parent = 0x7ee7daa501f3ec2, next = 0x2564cfdb796faf0b, prev = 0x67075f982b68c22d, port = 26158, hash = 52278}, {parent = 0xa61ccf520173dbcf, next = 0xedfd914bd9290055, prev = 0x4c82e1be15c8e6bd, port = 56482, hash = 35158}, {parent = 0x1dd62e187ab759f8, next = 0x5b9483e3bd60806b, prev = 0x3c0e501ef81ca353, port = 42794, hash = 37520}, {parent = 0xb6c3f82065f2d125, next = 0xa78be2be546e1fb7, prev = 0x7a2f32985edb7215, port = 16490, hash = 49867}, {parent = 0x12bd4ff964ba4650, next = 0x7f071da522484c27, prev = 0x949f60e68638512f, port = 26169, hash = 41924}, {parent = 0x60a5d1aac5486cb5, next = 0xfa82965a57bf2ff6, prev = 0x7121974cf7f338d5, port = 64152, hash = 31481}, {parent = 0xbe418ec01629988f, next = 0x57a216f873989cb5, prev = 0x8dceabdc4ec9c171, port = 55564, hash = 38148}, { parent = 0x7ba30100010302ef, next = 0x1d55030609307930, prev = 0x62c300030020413, port = 24585, hash = 18566}, {parent = 0x704f1d161f040d01, next = 0x6547204c53536e65, prev = 0x206465746172656e, port = 25923, hash = 29810}, {parent = 0x5503061d30657461, next = 0xb4f5140416040e1d, prev = 0xb4484ad4fb6fc4a5, port = 45712, hash = 7028}, {parent = 0x1d5503061f30883e, next = 0xf714801630180423, prev = 0xf2ba7396fd05170d, port = 24267, hash = 6103}, {parent = 0x2a09060d30f186cf, next = 0xb01010df7864886, prev = 0x8f00010182030005, port = 51556, hash = 28499}}, aliases = 672183753, wbuf_q = { first = 0x5b7bf0e590790cf6, last = 0x67d25b5f6e53d679, wr_timeout = 1518084272, queued = 3564629620, offset = 996062876, last_used = 950886522}}

On Mon, 5 Oct 2015 at 15:56 Jason Penton jason.penton@gmail.com wrote:

...
Hey Daniel

Using TLS only. not using tcpops nor websocket

On Mon, 5 Oct 2015 at 15:54 Daniel-Constantin Mierla miconda@gmail.com wrote:

...
It could be some other reason, I will investigate as well. Using any modules on top of tcp (like tcpops, tls or websocket)?

Cheers, Daniel

On 05/10/15 15:47, Jason Penton wrote:

Hey Daniel,

Got it once so far... I'll take a look at that commit and see what it did - but very strange indeed.

Cheers Jason

On Mon, 5 Oct 2015 at 15:42 Daniel-Constantin Mierla miconda@gmail.com wrote:

...
Hello,

the only recent change in the tcp core was resetting a flag to avoid infinite log messages (until connection was closed) via commit d36734d658cd0bcfc8357c7e85ca32da0612aaee .

Does it happen often, or you got it just once so far?

Cheers, Daniel

On 05/10/15 08:29, Jason Penton wrote:

Hey guys,

Not sure if there have been any changes but I have an interesting problem here when using TCP:

*The BT is as follows:*

#0 local_timer_list_expire (h=0xa0f128 <tcp_reader_ltimer+178664>, t=723807134, l=0x9e3740 <tcp_reader_ltimer>) at local_timer.c:198 #1 local_timer_expire (t=723807134, h=<optimized out>) at local_timer.c:227 #2 local_timer_run (lt=lt@entry=0x9e3740 <tcp_reader_ltimer>, saved_ticks=723807150) at local_timer.c:250 #3 0x00000000005d8417 in tcp_reader_timer_run () at tcp_read.c:1682 #4 tcp_receive_loop (unix_sock=<optimized out>) at tcp_read.c:1734 #5 0x00000000005c81c8 in tcp_init_children () at tcp_main.c:4788 #6 0x00000000004a9da3 in main_loop () at main.c:1664 #7 0x000000000042411e in main (argc=<optimized out>, argv=<optimized out>) at main.c:2566

This seems to be related to clearing timers for TCP connections. The crash is related to the following code:

*_timer_rm_list(tl)*

where it does a null pointer deref on tl->next and tl->prev, which, according to the bt, are null (see below).

*(gdb) print *tl* $14 = {next = 0x0, prev = 0x0, expire = 723807134, initial_timeout = 32, data = 0x7fbbb05aa628, f = 0x5d02f0 <tcpconn_read_timeout>, flags = 512, slow_idx = 0}

Any ideas?

Cheers Jason

sr-dev mailing listsr-dev@lists.sip-router.orghttp://lists.sip-router.org/cgi-bin/mailman/listinfo/sr-dev

-- Daniel-Constantin Mierlahttp://twitter.com/#!/miconda - http://www.linkedin.com/in/miconda Book: SIP Routing With Kamailio - http://www.asipto.com Kamailio Advanced Training, Sep 28-30, 2015, in Berlin - http://asipto.com/u/kat

sr-dev mailing list sr-dev@lists.sip-router.org http://lists.sip-router.org/cgi-bin/mailman/listinfo/sr-dev

-- Daniel-Constantin Mierlahttp://twitter.com/#!/miconda - http://www.linkedin.com/in/miconda Book: SIP Routing With Kamailio - http://www.asipto.com Kamailio Advanced Training, Sep 28-30, 2015, in Berlin - http://asipto.com/u/kat

-- Daniel-Constantin Mierlahttp://twitter.com/#!/miconda - http://www.linkedin.com/in/miconda Book: SIP Routing With Kamailio - http://www.asipto.com

Camille Oudot

26 Nov 26 Nov

6:05 p.m.

Le Tue, 6 Oct 2015 13:57:18 +0000, Jason Penton jason.penton@gmail.com a écrit :

...

No, not stress testing at all.... I can't reproduce it on request - but let me add the mem debugging and see how we go...

Hi Jason,

I'm digging up this message to know if you were using the usrloc module and its close_expire_tcp when this crash occurred, as a recent issue found in this functionality (memory leak) might have prevented some expired elements to be removed from the TCP connection list.

-- Camille

Jason Penton

6:07 p.m.

Hey Camille,

No, this was using the IMS usrloc modules. We found some memory corruption which we fixed.

Cheers Jason On 26 Nov 2015 8:05 pm, "Camille Oudot" camille.oudot@orange.com wrote:

...

Le Tue, 6 Oct 2015 13:57:18 +0000, Jason Penton jason.penton@gmail.com a écrit :

...
No, not stress testing at all.... I can't reproduce it on request - but let me add the mem debugging and see how we go...

Hi Jason,

I'm digging up this message to know if you were using the usrloc module and its close_expire_tcp when this crash occurred, as a recent issue found in this functionality (memory leak) might have prevented some expired elements to be removed from the TCP connection list.

-- Camille

3534

Age (days ago)

3586

Last active (days ago)

sr-dev@lists.kamailio.org

9 comments

3 participants

tags (0)

participants (3)

Camille Oudot
Daniel-Constantin Mierla
Jason Penton