Some screenshots to illustrate what I mean:

image

Same graph, with a Y-axe limited to 500:

image

Same as above, but shortening the time frame and with Y-limit of 100:

image

I see a type of pattern going further back in time is the following:

And, looking in the logs, at the ~time the spikes happen, I can see this in the logs:

root@kamailio2:~# cat /var/log/kamailio/kamailio.log-20180712 | grep -v NOTICE | grep -v "core/tcp_"
...
Jul 12 02:54:24 kamailio2 sbc[27795]: WARNING: dialog [dlg_handlers.c:1596]: dlg_ontimeout(): timeout for dlg with CallID 'orcr9cRwLT' and tags 'AV-f9MK~v' 'tetjZeg24aZeS'
Jul 12 02:54:45 kamailio2 sbc[27795]: WARNING: dialog [dlg_handlers.c:1596]: dlg_ontimeout(): timeout for dlg with CallID 'IRO6Y5Rw53' and tags 'cIaxfaScL' '9jBBFcc1NaUNQ'
Jul 12 02:54:48 kamailio2 sbc[27795]: WARNING: dialog [dlg_handlers.c:1596]: dlg_ontimeout(): timeout for dlg with CallID 'fw3hjesWFN' and tags '-3xLiat-a' 't1r1KrNrH0BcQ'
Jul 12 02:54:50 kamailio2 sbc[27795]: WARNING: dialog [dlg_handlers.c:1596]: dlg_ontimeout(): timeout for dlg with CallID 'w1rHrdpK2f' and tags 'HoJgX22ya' '4XmtD99FgyyDa'
Jul 12 02:54:58 kamailio2 sbc[27795]: WARNING: dialog [dlg_handlers.c:1596]: dlg_ontimeout(): timeout for dlg with CallID 'mrnlQ8OYuX' and tags 'yum6HTlKK' '766FZ05HaQXmj'
Jul 12 02:55:02 kamailio2 sbc[27795]: WARNING: dialog [dlg_handlers.c:1596]: dlg_ontimeout(): timeout for dlg with CallID 'lrbwFVn7Q8' and tags 'xOubxsLmV' '77Q419N7eHy6N'
Jul 12 02:55:03 kamailio2 sbc[27795]: WARNING: dialog [dlg_handlers.c:1596]: dlg_ontimeout(): timeout for dlg with CallID '14yaUN~xGG' and tags '6nSXFwBSb' 'mp1U2N57Ncg3p'
Jul 12 02:55:05 kamailio2 sbc[27795]: WARNING: dialog [dlg_handlers.c:1596]: dlg_ontimeout(): timeout for dlg with CallID 'hgI~lJdBJ3' and tags 'FzXjp4-YJ' 'a2jt4Hrv1H0cN'
Jul 12 02:55:09 kamailio2 sbc[27795]: WARNING: dialog [dlg_handlers.c:1596]: dlg_ontimeout(): timeout for dlg with CallID 'oBof16H0~A' and tags 'ZepyE21EK' '3ZXHF271aU7pH'
Jul 12 02:55:13 kamailio2 sbc[27795]: WARNING: dialog [dlg_handlers.c:1596]: dlg_ontimeout(): timeout for dlg with CallID 'XIvos7QACD' and tags 'UPsNWynuM' 'X3jF108jcX89N'
Jul 12 02:55:13 kamailio2 sbc[27795]: WARNING: dialog [dlg_handlers.c:1596]: dlg_ontimeout(): timeout for dlg with CallID 'ygaFfVdLn0' and tags '3Qbx8dkvK' '0yUyZrX83aSpa'
Jul 12 02:55:18 kamailio2 sbc[27795]: WARNING: dialog [dlg_handlers.c:1596]: dlg_ontimeout(): timeout for dlg with CallID 'm9G~N0LU9d' and tags 'wQ~cwqKHN' 'cNp0aHaS0XD4F'
Jul 12 02:55:20 kamailio2 sbc[27795]: WARNING: dialog [dlg_handlers.c:1596]: dlg_ontimeout(): timeout for dlg with CallID 'OUaCSKP2PZ' and tags 'W2-o~Knwc' 'epjHQN6yQj91e'
Jul 12 02:55:22 kamailio2 sbc[27795]: WARNING: dialog [dlg_handlers.c:1596]: dlg_ontimeout(): timeout for dlg with CallID 'wUtsFit9wE' and tags 'CmPHrLXm9' 'KKpSm5Fv718mD'
Jul 12 02:55:24 kamailio2 sbc[27795]: WARNING: dialog [dlg_handlers.c:1596]: dlg_ontimeout(): timeout for dlg with CallID 'i5FZlDb2Jt' and tags 'c7AO4enu~' '423vDyyBrU4cQ'
Jul 12 02:55:26 kamailio2 sbc[27795]: WARNING: dialog [dlg_handlers.c:1596]: dlg_ontimeout(): timeout for dlg with CallID 'XLKRf6e5Xq' and tags 'rSTom2iS3' '020pQS21HyQUK'
Jul 12 02:55:29 kamailio2 sbc[27795]: WARNING: dialog [dlg_handlers.c:1596]: dlg_ontimeout(): timeout for dlg with CallID 'nfBiavwuHu' and tags 'P3cHBXiea' 'NQ2ZBN8BZNSBF'
Jul 12 02:55:31 kamailio2 sbc[27795]: WARNING: dialog [dlg_handlers.c:1596]: dlg_ontimeout(): timeout for dlg with CallID '699iA2AdSO' and tags 'eZSvyxAYC' 'tj76025mK04Qe'
Jul 12 02:55:31 kamailio2 sbc[27795]: WARNING: dialog [dlg_handlers.c:1596]: dlg_ontimeout(): timeout for dlg with CallID 'XUuh-7CmGt' and tags 'GyhSzl-c7' 'r0mNXc4DSerjQ'
Jul 12 02:55:33 kamailio2 sbc[27795]: WARNING: dialog [dlg_handlers.c:1596]: dlg_ontimeout(): timeout for dlg with CallID 'oMau5CTVYF' and tags '42SPhgBwh' 'XQeBjgrpNUFpp'
Jul 12 02:55:34 kamailio2 sbc[27795]: WARNING: dialog [dlg_handlers.c:1596]: dlg_ontimeout(): timeout for dlg with CallID 'l0nKr0oj5c' and tags '2ZzNMM0b8' 'B6p7a3U8H0c9j'
Jul 12 02:55:35 kamailio2 sbc[27795]: WARNING: dialog [dlg_handlers.c:1596]: dlg_ontimeout(): timeout for dlg with CallID 'Kl49oLgGHV' and tags 'E5E5P9rFX' 'NB9ZFyjFN2cvD'
Jul 12 02:55:35 kamailio2 sbc[27795]: WARNING: dialog [dlg_handlers.c:1596]: dlg_ontimeout(): timeout for dlg with CallID 'i2lEbuiAFm' and tags 'TDExnc7Bx' 'Urv4SBvp5gyQQ'
Jul 12 02:55:39 kamailio2 sbc[27795]: WARNING: dialog [dlg_handlers.c:1596]: dlg_ontimeout(): timeout for dlg with CallID 'S5k252ptQ7' and tags 'VfPxDw0Bw' '7p13arg6BmDZD'
Jul 12 02:55:39 kamailio2 sbc[27795]: WARNING: dialog [dlg_handlers.c:1596]: dlg_ontimeout(): timeout for dlg with CallID 'fjqooHvA~I' and tags 'NbHBVvU9w' 'XZc6SDF87vB1B'
Jul 12 02:55:40 kamailio2 sbc[27795]: WARNING: dialog [dlg_handlers.c:1596]: dlg_ontimeout(): timeout for dlg with CallID '7b~UWpjiY2' and tags 'eVOoK-UjU' 'QXUHKmmpFmS1m'
Jul 12 02:55:41 kamailio2 sbc[27795]: WARNING: dialog [dlg_handlers.c:1596]: dlg_ontimeout(): timeout for dlg with CallID 'eOSkqIsjd0' and tags '8chfVD~av' 'Zv170QZ4SmQ2N'
Jul 12 02:55:42 kamailio2 sbc[27795]: WARNING: dialog [dlg_handlers.c:1596]: dlg_ontimeout(): timeout for dlg with CallID 'FgI7PLr3km' and tags 'HomQTapK0' 'HvevN50X0NjSr'
Jul 12 02:55:42 kamailio2 sbc[27795]: WARNING: dialog [dlg_handlers.c:1596]: dlg_ontimeout(): timeout for dlg with CallID '9tKCteAnPq' and tags '9QSRb9F3r' '0yyS6HUX3QB2r'
Jul 12 02:55:44 kamailio2 sbc[27795]: WARNING: dialog [dlg_handlers.c:1596]: dlg_ontimeout(): timeout for dlg with CallID 'w0mRMEBDPQ' and tags 'mqY5bERQ1' 'yQXyarSrcyymg'
Jul 12 02:55:46 kamailio2 sbc[27795]: WARNING: dialog [dlg_handlers.c:1596]: dlg_ontimeout(): timeout for dlg with CallID 'AP1gTeQl6U' and tags 'YUz~HwPjE' 'eX0mZc7eUm37a'
Jul 12 02:55:46 kamailio2 sbc[27795]: WARNING: dialog [dlg_handlers.c:1596]: dlg_ontimeout(): timeout for dlg with CallID 'lhxI8T1rXr' and tags 'yb92ex~AK' 'SyZHS3m48eN8F'
Jul 12 02:55:49 kamailio2 sbc[27795]: WARNING: dialog [dlg_handlers.c:1596]: dlg_ontimeout(): timeout for dlg with CallID '2RqFuZyUmj' and tags 'owLepp0dU' '7v1cjy0N0BDej'
Jul 12 02:55:49 kamailio2 sbc[27795]: WARNING: dialog [dlg_handlers.c:1596]: dlg_ontimeout(): timeout for dlg with CallID 'DkuPmsRFZz' and tags '~HNqKE7E2' 'Z90vN6SXFDvUD'
Jul 12 02:55:50 kamailio2 sbc[27795]: WARNING: dialog [dlg_handlers.c:1596]: dlg_ontimeout(): timeout for dlg with CallID 'Zdz48Xh~co' and tags 'hQ4p6Ad8a' 'F6pjtetXrg7er'
Jul 12 02:55:50 kamailio2 sbc[27795]: WARNING: dialog [dlg_handlers.c:1596]: dlg_ontimeout(): timeout for dlg with CallID 'CNoQG-88vC' and tags 'Y-2W95k9F' '0eZg37BpN419c'
Jul 12 02:55:52 kamailio2 sbc[27795]: WARNING: dialog [dlg_handlers.c:1596]: dlg_ontimeout(): timeout for dlg with CallID 'IVtYwdhMVV' and tags 'tpw7O0Hxz' 'p0UrDgSFvyFya'
...
root@kamailio2:~#

That seemed weird, all from the same process (27795) and almost at the same time?

So I wanted to check randomly for one of the calls, if I take for example the first one (orcr9cRwLT) I can't find anything in the logs:

root@kamailio2:~# grep orcr9cRwLT /var/log/kamailio/kamailio.log-20180712
Jul 12 02:54:24 kamailio2 sbc[27795]: WARNING: dialog [dlg_handlers.c:1596]: dlg_ontimeout(): timeout for dlg with CallID 'orcr9cRwLT' and tags 'AV-f9MK~v' 'tetjZeg24aZeS'
root@kamailio2:~#

The interesting thing is:

root@kamailio1:~# grep orcr9cRwLT /var/log/kamailio/kamailio.log-20180712
Jul 11 14:54:24 kamailio1 sbc[29467]: NOTICE: <script>: New request - M=INVITE F=sip:5cfafa09ac2a45428cff23271e15079b@my.domain.com T=sip:inbound@sip.mydomain.com R=sip:inbound@sip.mydomain.com:50814 IP=99.203.1.225:31849 UA='HIDDEN_USER_AGENT' ID=orcr9cRwLT
Jul 11 14:54:24 kamailio1 sbc[29467]: NOTICE: <script>: TLS information: SNI=sip.mydomain.com Ver=TLSv1.2 Cipher=AES256-GCM-SHA384 Bits=256 - M=INVITE R=sip:inbound@sip.mydomain.com:50814 ID=orcr9cRwLT
Jul 11 14:54:24 kamailio1 sbc[29467]: NOTICE: <script>: NAT detected - M=INVITE R=sip:inbound@sip.mydomain.com:50814 ID=orcr9cRwLT
Jul 11 14:54:24 kamailio1 sbc[29467]: NOTICE: <script>: Setting updated contact - M=INVITE ID=orcr9cRwLT
Jul 11 14:54:24 kamailio1 sbc[29467]: NOTICE: <script>: Authentication successful - M=INVITE R=sip:inbound@sip.mydomain.com:50814 ID=orcr9cRwLT
Jul 11 14:54:24 kamailio1 sbc[29467]: NOTICE: <script>: Selected dispatcher node: sip:A.B.C.D:5060 - M=INVITE R=sip:inbound@A.B.C.D:5060 ID=orcr9cRwLT
Jul 11 14:54:24 kamailio1 sbc[29467]: NOTICE: <script>: Forwarding request - M=INVITE R=sip:inbound@A.B.C.D:5060 ID=orcr9cRwLT
Jul 11 14:54:24 kamailio1 sbc[29436]: NOTICE: <script>: Reply - M=INVITE S=100 trying -- your call is important to us ID=orcr9cRwLT
Jul 11 14:54:24 kamailio1 sbc[29430]: NOTICE: <script>: Reply - M=INVITE S=200 OK ID=orcr9cRwLT
Jul 11 14:54:24 kamailio1 sbc[29433]: NOTICE: <script>: Reply - M=INVITE S=200 OK ID=orcr9cRwLT
Jul 11 14:54:24 kamailio1 sbc[29467]: NOTICE: <script>: Setting updated contact - M=ACK ID=orcr9cRwLT
Jul 11 14:54:24 kamailio1 sbc[29467]: NOTICE: <script>: In-dialog request from App - M=ACK R=sip:inbound@E.F.G.H:6061;transport=udp ID=orcr9cRwLT
Jul 11 14:54:24 kamailio1 sbc[29467]: NOTICE: <script>: Forwarding request - M=ACK R=sip:inbound@E.F.G.H:6061;transport=udp ID=orcr9cRwLT
Jul 11 14:54:24 kamailio1 sbc[29467]: NOTICE: <script>: Setting updated contact - M=ACK ID=orcr9cRwLT
Jul 11 14:54:24 kamailio1 sbc[29467]: NOTICE: <script>: In-dialog request from App - M=ACK R=sip:inbound@E.F.G.H:6061;transport=udp ID=orcr9cRwLT
Jul 11 14:54:24 kamailio1 sbc[29467]: NOTICE: <script>: Forwarding request - M=ACK R=sip:inbound@E.F.G.H:6061;transport=udp ID=orcr9cRwLT
root@kamailio1:~#

So the call, in fact was started through kamailio1 (this is a cluster of 2 nodes, kamailio1 and kamailio2, and they replicate dialog information using KDMQ).

The error msgs in kamailio2 appear exactly 12hr after the call started in kamailio1.

I'm starting to wonder if this can have anything to do with KDMQ replication?

When we moved from v5.0.2 to v5.1.4, we also enabled KMDQ for dialog module (which in v5.0.2 we were not using).

I hope this info helps narrow down the issue.

My next steps unless you guys give me other suggestions:

a) Enable debug logs and wait for issue to reproduce, maybe we see something there.
b) Disable KDMQ replication on dialog module and see if the issue is solved or continues.

I don't like either of them, but I rather start with the most helpful of the two.


You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub, or mute the thread.