### Description
Since we upgraded from v5.0.2 to v5.1.4, we have been seeing spikes in our monitoring platform for the current active and early dialog counts.
Every 10s we gather metrics using Kamailio's HTTP server:
```
# Statistics endpoint
if ($hu =~ "^/statistics") {
    jsonrpc_exec('{"jsonrpc": "2.0","method": "stats.get_statistics","params": ["all"],"id": 1}');
    xhttp_reply("200", "OK", "application/json", "$jsonrpl(body)");
    exit;
}
```
stats.get_statistics returns a lot of metrics; the problematic ones we have found are:
``` "dialog.active_dialogs": "84", "dialog.early_dialogs": "16", ```
The issue is that, at random, instead of reporting the real active/early dialog counts, it returns **9223372036854776000**.
Example:
``` ... "dialog.active_dialogs": "77", "dialog.early_dialogs": "9223372036854776000", ...
... "dialog.active_dialogs": "9223372036854776000", "dialog.early_dialogs": "20", ...
... "dialog.active_dialogs": "83", "dialog.early_dialogs": "18", ...
```
### Troubleshooting
I added logging in Kamailio to print the value for active dialogs, just to make sure it wasn't a problem further down in the pipeline, and I could see that same number there, so it's definitely coming from Kamailio. Also, this did not happen on v5.0.2 installed from the deb repo.
#### Reproduction
I cannot reproduce it on purpose, but I see it happen several times a day.
#### Debugging
Please let me know what I can provide to troubleshoot this; I don't know where to start. I'm concerned that activating debug logs in prod and waiting for this to happen might overload the system. Are there any alternatives?
### Additional Information
* **Kamailio Version** - output of `kamailio -v`
```
# kamailio -v
version: kamailio 5.1.4 (x86_64/linux)
flags: STATS: Off, USE_TCP, USE_TLS, USE_SCTP, TLS_HOOKS, DISABLE_NAGLE, USE_MCAST, DNS_IP_HACK, SHM_MEM, SHM_MMAP, PKG_MALLOC, Q_MALLOC, F_MALLOC, TLSF_MALLOC, DBG_SR_MEMORY, USE_FUTEX, FAST_LOCK-ADAPTIVE_WAIT, USE_DNS_CACHE, USE_DNS_FAILOVER, USE_NAPTR, USE_DST_BLACKLIST, HAVE_RESOLV_RES
ADAPTIVE_WAIT_LOOPS=1024, MAX_RECV_BUFFER_SIZE 262144, MAX_LISTEN 16, MAX_URI_SIZE 1024, BUF_SIZE 65535, DEFAULT PKG_SIZE 8MB
poll method support: poll, epoll_lt, epoll_et, sigio_rt, select.
id: unknown
compiled with gcc 6.3.0
```
* **Operating System**:
```
OS: Debian stretch 9.4
Kernel: Linux kamailio 4.9.0-6-amd64 #1 SMP Debian 4.9.88-1+deb9u1 (2018-05-07) x86_64 GNU/Linux
```
I updated the issue as I think this might not be related to the KEX module but rather to the DIALOG module. As I'm not sure, I'm not specifying it in the title.
Hi Joel, there was recently a small fix for initializing the counter from the database. But that fix is not in 5.1.4 anyway, since the release was created on June 5 and the fix was merged after that.
Could you try to isolate where the regression was introduced by building different versions, doing a checkout somewhere in the middle:
```
git log --pretty=oneline 5.0.2...5.1.4
```
To clarify: some wrong value is returned, but afterwards it comes back to good values? Is it always the same wrong value?
@jchavanton: I will try to have a look
@miconda: it reports 9223372036854776000 and after X time it goes back to good values (where X can sometimes be the very next request, sometimes a couple of minutes, but I haven't seen longer than that). But to your point: yes, it goes back to good values by itself.
Some screenshots to illustrate what I mean:

Same graph, with the Y-axis limited to 500:

Same as above, but with a shorter time frame and the Y-axis limited to 100:

Going further back in time, the pattern I see is the following:
- all is working
- something happens (*see note below*)
- the reported number of dialogs goes to 0 or close to 0 for a bit
- the dialogs report the 9223372036854776000 value
- things start working again?
And, looking at the logs around the time the spikes happen, I can see this:
```
root@kamailio2:~# cat /var/log/kamailio/kamailio.log-20180712 | grep -v NOTICE | grep -v "core/tcp_"
...
Jul 12 02:54:24 kamailio2 sbc[27795]: WARNING: dialog [dlg_handlers.c:1596]: dlg_ontimeout(): timeout for dlg with CallID 'orcr9cRwLT' and tags 'AV-f9MK~v' 'tetjZeg24aZeS'
Jul 12 02:54:45 kamailio2 sbc[27795]: WARNING: dialog [dlg_handlers.c:1596]: dlg_ontimeout(): timeout for dlg with CallID 'IRO6Y5Rw53' and tags 'cIaxfaScL' '9jBBFcc1NaUNQ'
Jul 12 02:54:48 kamailio2 sbc[27795]: WARNING: dialog [dlg_handlers.c:1596]: dlg_ontimeout(): timeout for dlg with CallID 'fw3hjesWFN' and tags '-3xLiat-a' 't1r1KrNrH0BcQ'
Jul 12 02:54:50 kamailio2 sbc[27795]: WARNING: dialog [dlg_handlers.c:1596]: dlg_ontimeout(): timeout for dlg with CallID 'w1rHrdpK2f' and tags 'HoJgX22ya' '4XmtD99FgyyDa'
Jul 12 02:54:58 kamailio2 sbc[27795]: WARNING: dialog [dlg_handlers.c:1596]: dlg_ontimeout(): timeout for dlg with CallID 'mrnlQ8OYuX' and tags 'yum6HTlKK' '766FZ05HaQXmj'
Jul 12 02:55:02 kamailio2 sbc[27795]: WARNING: dialog [dlg_handlers.c:1596]: dlg_ontimeout(): timeout for dlg with CallID 'lrbwFVn7Q8' and tags 'xOubxsLmV' '77Q419N7eHy6N'
Jul 12 02:55:03 kamailio2 sbc[27795]: WARNING: dialog [dlg_handlers.c:1596]: dlg_ontimeout(): timeout for dlg with CallID '14yaUN~xGG' and tags '6nSXFwBSb' 'mp1U2N57Ncg3p'
Jul 12 02:55:05 kamailio2 sbc[27795]: WARNING: dialog [dlg_handlers.c:1596]: dlg_ontimeout(): timeout for dlg with CallID 'hgI~lJdBJ3' and tags 'FzXjp4-YJ' 'a2jt4Hrv1H0cN'
Jul 12 02:55:09 kamailio2 sbc[27795]: WARNING: dialog [dlg_handlers.c:1596]: dlg_ontimeout(): timeout for dlg with CallID 'oBof16H0~A' and tags 'ZepyE21EK' '3ZXHF271aU7pH'
Jul 12 02:55:13 kamailio2 sbc[27795]: WARNING: dialog [dlg_handlers.c:1596]: dlg_ontimeout(): timeout for dlg with CallID 'XIvos7QACD' and tags 'UPsNWynuM' 'X3jF108jcX89N'
Jul 12 02:55:13 kamailio2 sbc[27795]: WARNING: dialog [dlg_handlers.c:1596]: dlg_ontimeout(): timeout for dlg with CallID 'ygaFfVdLn0' and tags '3Qbx8dkvK' '0yUyZrX83aSpa'
Jul 12 02:55:18 kamailio2 sbc[27795]: WARNING: dialog [dlg_handlers.c:1596]: dlg_ontimeout(): timeout for dlg with CallID 'm9G~N0LU9d' and tags 'wQ~cwqKHN' 'cNp0aHaS0XD4F'
Jul 12 02:55:20 kamailio2 sbc[27795]: WARNING: dialog [dlg_handlers.c:1596]: dlg_ontimeout(): timeout for dlg with CallID 'OUaCSKP2PZ' and tags 'W2-o~Knwc' 'epjHQN6yQj91e'
Jul 12 02:55:22 kamailio2 sbc[27795]: WARNING: dialog [dlg_handlers.c:1596]: dlg_ontimeout(): timeout for dlg with CallID 'wUtsFit9wE' and tags 'CmPHrLXm9' 'KKpSm5Fv718mD'
Jul 12 02:55:24 kamailio2 sbc[27795]: WARNING: dialog [dlg_handlers.c:1596]: dlg_ontimeout(): timeout for dlg with CallID 'i5FZlDb2Jt' and tags 'c7AO4enu~' '423vDyyBrU4cQ'
Jul 12 02:55:26 kamailio2 sbc[27795]: WARNING: dialog [dlg_handlers.c:1596]: dlg_ontimeout(): timeout for dlg with CallID 'XLKRf6e5Xq' and tags 'rSTom2iS3' '020pQS21HyQUK'
Jul 12 02:55:29 kamailio2 sbc[27795]: WARNING: dialog [dlg_handlers.c:1596]: dlg_ontimeout(): timeout for dlg with CallID 'nfBiavwuHu' and tags 'P3cHBXiea' 'NQ2ZBN8BZNSBF'
Jul 12 02:55:31 kamailio2 sbc[27795]: WARNING: dialog [dlg_handlers.c:1596]: dlg_ontimeout(): timeout for dlg with CallID '699iA2AdSO' and tags 'eZSvyxAYC' 'tj76025mK04Qe'
Jul 12 02:55:31 kamailio2 sbc[27795]: WARNING: dialog [dlg_handlers.c:1596]: dlg_ontimeout(): timeout for dlg with CallID 'XUuh-7CmGt' and tags 'GyhSzl-c7' 'r0mNXc4DSerjQ'
Jul 12 02:55:33 kamailio2 sbc[27795]: WARNING: dialog [dlg_handlers.c:1596]: dlg_ontimeout(): timeout for dlg with CallID 'oMau5CTVYF' and tags '42SPhgBwh' 'XQeBjgrpNUFpp'
Jul 12 02:55:34 kamailio2 sbc[27795]: WARNING: dialog [dlg_handlers.c:1596]: dlg_ontimeout(): timeout for dlg with CallID 'l0nKr0oj5c' and tags '2ZzNMM0b8' 'B6p7a3U8H0c9j'
Jul 12 02:55:35 kamailio2 sbc[27795]: WARNING: dialog [dlg_handlers.c:1596]: dlg_ontimeout(): timeout for dlg with CallID 'Kl49oLgGHV' and tags 'E5E5P9rFX' 'NB9ZFyjFN2cvD'
Jul 12 02:55:35 kamailio2 sbc[27795]: WARNING: dialog [dlg_handlers.c:1596]: dlg_ontimeout(): timeout for dlg with CallID 'i2lEbuiAFm' and tags 'TDExnc7Bx' 'Urv4SBvp5gyQQ'
Jul 12 02:55:39 kamailio2 sbc[27795]: WARNING: dialog [dlg_handlers.c:1596]: dlg_ontimeout(): timeout for dlg with CallID 'S5k252ptQ7' and tags 'VfPxDw0Bw' '7p13arg6BmDZD'
Jul 12 02:55:39 kamailio2 sbc[27795]: WARNING: dialog [dlg_handlers.c:1596]: dlg_ontimeout(): timeout for dlg with CallID 'fjqooHvA~I' and tags 'NbHBVvU9w' 'XZc6SDF87vB1B'
Jul 12 02:55:40 kamailio2 sbc[27795]: WARNING: dialog [dlg_handlers.c:1596]: dlg_ontimeout(): timeout for dlg with CallID '7b~UWpjiY2' and tags 'eVOoK-UjU' 'QXUHKmmpFmS1m'
Jul 12 02:55:41 kamailio2 sbc[27795]: WARNING: dialog [dlg_handlers.c:1596]: dlg_ontimeout(): timeout for dlg with CallID 'eOSkqIsjd0' and tags '8chfVD~av' 'Zv170QZ4SmQ2N'
Jul 12 02:55:42 kamailio2 sbc[27795]: WARNING: dialog [dlg_handlers.c:1596]: dlg_ontimeout(): timeout for dlg with CallID 'FgI7PLr3km' and tags 'HomQTapK0' 'HvevN50X0NjSr'
Jul 12 02:55:42 kamailio2 sbc[27795]: WARNING: dialog [dlg_handlers.c:1596]: dlg_ontimeout(): timeout for dlg with CallID '9tKCteAnPq' and tags '9QSRb9F3r' '0yyS6HUX3QB2r'
Jul 12 02:55:44 kamailio2 sbc[27795]: WARNING: dialog [dlg_handlers.c:1596]: dlg_ontimeout(): timeout for dlg with CallID 'w0mRMEBDPQ' and tags 'mqY5bERQ1' 'yQXyarSrcyymg'
Jul 12 02:55:46 kamailio2 sbc[27795]: WARNING: dialog [dlg_handlers.c:1596]: dlg_ontimeout(): timeout for dlg with CallID 'AP1gTeQl6U' and tags 'YUz~HwPjE' 'eX0mZc7eUm37a'
Jul 12 02:55:46 kamailio2 sbc[27795]: WARNING: dialog [dlg_handlers.c:1596]: dlg_ontimeout(): timeout for dlg with CallID 'lhxI8T1rXr' and tags 'yb92ex~AK' 'SyZHS3m48eN8F'
Jul 12 02:55:49 kamailio2 sbc[27795]: WARNING: dialog [dlg_handlers.c:1596]: dlg_ontimeout(): timeout for dlg with CallID '2RqFuZyUmj' and tags 'owLepp0dU' '7v1cjy0N0BDej'
Jul 12 02:55:49 kamailio2 sbc[27795]: WARNING: dialog [dlg_handlers.c:1596]: dlg_ontimeout(): timeout for dlg with CallID 'DkuPmsRFZz' and tags '~HNqKE7E2' 'Z90vN6SXFDvUD'
Jul 12 02:55:50 kamailio2 sbc[27795]: WARNING: dialog [dlg_handlers.c:1596]: dlg_ontimeout(): timeout for dlg with CallID 'Zdz48Xh~co' and tags 'hQ4p6Ad8a' 'F6pjtetXrg7er'
Jul 12 02:55:50 kamailio2 sbc[27795]: WARNING: dialog [dlg_handlers.c:1596]: dlg_ontimeout(): timeout for dlg with CallID 'CNoQG-88vC' and tags 'Y-2W95k9F' '0eZg37BpN419c'
Jul 12 02:55:52 kamailio2 sbc[27795]: WARNING: dialog [dlg_handlers.c:1596]: dlg_ontimeout(): timeout for dlg with CallID 'IVtYwdhMVV' and tags 'tpw7O0Hxz' 'p0UrDgSFvyFya'
...
root@kamailio2:~#
```
That seemed weird: all from the same process (`27795`) and almost at the same time?
So I picked one of the calls at random to check; taking the first one (`orcr9cRwLT`) as an example, I can't find anything else about it in the logs:
```
root@kamailio2:~# grep orcr9cRwLT /var/log/kamailio/kamailio.log-20180712
Jul 12 02:54:24 kamailio2 sbc[27795]: WARNING: dialog [dlg_handlers.c:1596]: dlg_ontimeout(): timeout for dlg with CallID 'orcr9cRwLT' and tags 'AV-f9MK~v' 'tetjZeg24aZeS'
root@kamailio2:~#
```
The interesting thing is:
```
root@kamailio1:~# grep orcr9cRwLT /var/log/kamailio/kamailio.log-20180712
Jul 11 14:54:24 kamailio1 sbc[29467]: NOTICE: <script>: New request - M=INVITE F=sip:5cfafa09ac2a45428cff23271e15079b@my.domain.com T=sip:inbound@sip.mydomain.com R=sip:inbound@sip.mydomain.com:50814 IP=99.203.1.225:31849 UA='HIDDEN_USER_AGENT' ID=orcr9cRwLT
Jul 11 14:54:24 kamailio1 sbc[29467]: NOTICE: <script>: TLS information: SNI=sip.mydomain.com Ver=TLSv1.2 Cipher=AES256-GCM-SHA384 Bits=256 - M=INVITE R=sip:inbound@sip.mydomain.com:50814 ID=orcr9cRwLT
Jul 11 14:54:24 kamailio1 sbc[29467]: NOTICE: <script>: NAT detected - M=INVITE R=sip:inbound@sip.mydomain.com:50814 ID=orcr9cRwLT
Jul 11 14:54:24 kamailio1 sbc[29467]: NOTICE: <script>: Setting updated contact - M=INVITE ID=orcr9cRwLT
Jul 11 14:54:24 kamailio1 sbc[29467]: NOTICE: <script>: Authentication successful - M=INVITE R=sip:inbound@sip.mydomain.com:50814 ID=orcr9cRwLT
Jul 11 14:54:24 kamailio1 sbc[29467]: NOTICE: <script>: Selected dispatcher node: sip:A.B.C.D:5060 - M=INVITE R=sip:inbound@A.B.C.D:5060 ID=orcr9cRwLT
Jul 11 14:54:24 kamailio1 sbc[29467]: NOTICE: <script>: Forwarding request - M=INVITE R=sip:inbound@A.B.C.D:5060 ID=orcr9cRwLT
Jul 11 14:54:24 kamailio1 sbc[29436]: NOTICE: <script>: Reply - M=INVITE S=100 trying -- your call is important to us ID=orcr9cRwLT
Jul 11 14:54:24 kamailio1 sbc[29430]: NOTICE: <script>: Reply - M=INVITE S=200 OK ID=orcr9cRwLT
Jul 11 14:54:24 kamailio1 sbc[29433]: NOTICE: <script>: Reply - M=INVITE S=200 OK ID=orcr9cRwLT
Jul 11 14:54:24 kamailio1 sbc[29467]: NOTICE: <script>: Setting updated contact - M=ACK ID=orcr9cRwLT
Jul 11 14:54:24 kamailio1 sbc[29467]: NOTICE: <script>: In-dialog request from App - M=ACK R=sip:inbound@E.F.G.H:6061;transport=udp ID=orcr9cRwLT
Jul 11 14:54:24 kamailio1 sbc[29467]: NOTICE: <script>: Forwarding request - M=ACK R=sip:inbound@E.F.G.H:6061;transport=udp ID=orcr9cRwLT
Jul 11 14:54:24 kamailio1 sbc[29467]: NOTICE: <script>: Setting updated contact - M=ACK ID=orcr9cRwLT
Jul 11 14:54:24 kamailio1 sbc[29467]: NOTICE: <script>: In-dialog request from App - M=ACK R=sip:inbound@E.F.G.H:6061;transport=udp ID=orcr9cRwLT
Jul 11 14:54:24 kamailio1 sbc[29467]: NOTICE: <script>: Forwarding request - M=ACK R=sip:inbound@E.F.G.H:6061;transport=udp ID=orcr9cRwLT
root@kamailio1:~#
```
So the call was, in fact, started through `kamailio1` (this is a cluster of 2 nodes, `kamailio1` and `kamailio2`, and they replicate dialog information using KDMQ).
The error messages on `kamailio2` appear exactly 12 hours after the call started on `kamailio1`.
I'm starting to wonder if this can have anything to do with KDMQ replication?
When we moved from v5.0.2 to v5.1.4, we also enabled KDMQ for the dialog module (which we were not using in v5.0.2).
I hope this info helps narrow down the issue.
My next steps unless you guys give me other suggestions:
a) Enable debug logs and wait for the issue to reproduce; maybe we see something there.
b) Disable KDMQ replication on the dialog module and see if the issue is solved or continues.
I don't like either of them, but I'd rather start with the more helpful of the two.
We have another cluster of two Kamailio nodes for internal traffic; those are also on v5.1.4 but don't have KDMQ enabled yet, and their graphs look perfect. So I'm really starting to think this could be DMQ related.
It may be that the value is decremented more times than there are active calls and "goes below 0"; but the counter is an 'unsigned long' value, so instead of going below 0 it jumps to the maximum value. It may be related to DMQ replication: based on feedback so far, the stats for active dialogs were quite accurate on an instance without DMQ, so this path is worth checking.
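A minimal illustration of the wrap-around in plain C (just the arithmetic, not the Kamailio code itself):

```c
#include <stdio.h>
#include <limits.h>

int main(void) {
    /* the statistics counter is unsigned, so decrementing it past zero
     * does not give a negative number, it wraps around to the maximum value */
    unsigned long active_dialogs = 0;

    active_dialogs--;   /* one decrement too many, e.g. a duplicated "dialog ended" event */

    printf("%lu\n", active_dialogs);              /* 18446744073709551615 on 64-bit Linux */
    printf("%d\n", active_dialogs == ULONG_MAX);  /* 1 */
    return 0;
}
```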
As an alternative to get the stats of active dialogs, I added an rpc command that scans the internal dialog hash table:
* https://github.com/kamailio/kamailio/commit/ebb149066690f7d96f45e1639e0c5ca9...
It is slightly slower than the stats using the core counters, given the real-time scan, but it should be fast enough not to affect performance.
It can be used to compare the results and to detect when an anomaly in the counter-based stats happens.
I'm building v5.1.4 packages with that commit cherry-picked. I will deploy them as soon as I have run a few tests in the lab. I have noticed that the issue happens only at low traffic, so it matches your theory of the value going "below 0"...
I was thinking of running a script every second on both nodes, gathering:
```
kamctl rpc dlg.stats_active
kamctl rpc dlg.list | grep h_entry | sort | wc -l
kamctl rpc stats.get_statistics all | grep -e "dialog:active" -e "dialog:early" | tr -d " " | tr -d "\n"
```
I will also enable debug logs on the `kamailio2` node (it has slightly less traffic than `kamailio1`); maybe that helps too.
Thank you! I'll report back results tomorrow.
Hi Daniel,
On my initial test in production, I can already see discrepancies:
```
root@kamailio2:~# kamctl rpc dlg.stats_active
{
  "jsonrpc": "2.0",
  "result": {
    "starting": 4,
    "connecting": 55,
    "answering": 2,
    "ongoing": 253,
    "all": 314
  },
  "id": 23360
}
root@kamailio2:~# kamctl rpc stats.get_statistics all | grep dialog
"dialog:active_dialogs = 18446744073709551604",
"dialog:early_dialogs = 15",
"dialog:expired_dialogs = 0",
"dialog:failed_dialogs = 120",
"dialog:processed_dialogs = 340",
root@kamailio2:~#
```
Now we can go past 9223372036854776000 ??? I don't understand that.
That was right after restarting (green active dialogs, yellow early dialogs):

Here are multiple runs of stats.get_statistics:
```
root@kamailio2:~# kamctl rpc stats.get_statistics all | grep dialog
"dialog:active_dialogs = 18446744073709551614",
"dialog:early_dialogs = 8",
"dialog:expired_dialogs = 0",
"dialog:failed_dialogs = 315",
"dialog:processed_dialogs = 863",
root@kamailio2:~# kamctl rpc stats.get_statistics all | grep dialog
"dialog:active_dialogs = 18446744073709551614",
"dialog:early_dialogs = 8",
"dialog:expired_dialogs = 0",
"dialog:failed_dialogs = 315",
"dialog:processed_dialogs = 864",
root@kamailio2:~# kamctl rpc stats.get_statistics all | grep dialog
"dialog:active_dialogs = 18446744073709551614",
"dialog:early_dialogs = 8",
"dialog:expired_dialogs = 0",
"dialog:failed_dialogs = 315",
"dialog:processed_dialogs = 864",
root@kamailio2:~# kamctl rpc stats.get_statistics all | grep dialog
"dialog:active_dialogs = 18446744073709551613",
"dialog:early_dialogs = 10",
"dialog:expired_dialogs = 0",
"dialog:failed_dialogs = 320",
"dialog:processed_dialogs = 866",
root@kamailio2:~# kamctl rpc stats.get_statistics all | grep dialog
"dialog:active_dialogs = 0",
"dialog:early_dialogs = 8",
"dialog:expired_dialogs = 0",
"dialog:failed_dialogs = 322",
"dialog:processed_dialogs = 870",
root@kamailio2:~# kamctl rpc stats.get_statistics all | grep dialog
"dialog:active_dialogs = 1",
"dialog:early_dialogs = 9",
"dialog:expired_dialogs = 0",
"dialog:failed_dialogs = 323",
"dialog:processed_dialogs = 873",
root@kamailio2:~# kamctl rpc stats.get_statistics all | grep dialog
"dialog:active_dialogs = 18446744073709551615",
"dialog:early_dialogs = 9",
"dialog:expired_dialogs = 0",
"dialog:failed_dialogs = 324",
"dialog:processed_dialogs = 875",
root@kamailio2:~# kamctl rpc stats.get_statistics all | grep dialog
"dialog:active_dialogs = 18446744073709551615",
"dialog:early_dialogs = 10",
"dialog:expired_dialogs = 0",
"dialog:failed_dialogs = 325",
"dialog:processed_dialogs = 876",
root@kamailio2:~#
```
I tried running a script to gather metrics:
```
root@kamailio2:~# cat dialogs.sh
#!/bin/bash

while true; do
    DIALOGS=`kamctl rpc dlg.list | grep h_entry | sort | wc -l`
    METRICS=`kamctl rpc stats.get_statistics all | grep -e "dialog:active" -e "dialog:early" | tr -d " " | tr -d "\n"`
    NEWMETRICS=`kamctl rpc dlg.stats_active | jq -c .result`
    echo "`date +'%Y-%m-%d %H:%M:%S.%N'` - dlg.list:$DIALOGS - stats.get_statistics:$METRICS - dlg.stats_active:$NEWMETRICS"
    sleep 1
done
root@kamailio2:~#
```
But after a little while, Kamailio "starts slowing down" until it dies (I had to completely reboot the server).
```
root@kamailio2:~# ./dialogs.sh
2018-07-13 10:57:14.204158716 - dlg.list:419 - stats.get_statistics:"dialog:active_dialogs=13","dialog:early_dialogs=12", - dlg.stats_active:{"starting":5,"connecting":62,"answering":7,"ongoing":288,"all":362}
2018-07-13 10:57:15.437165579 - dlg.list:422 - stats.get_statistics:"dialog:active_dialogs=15","dialog:early_dialogs=11", - dlg.stats_active:{"starting":4,"connecting":59,"answering":10,"ongoing":289,"all":362}
2018-07-13 10:57:16.678432892 - dlg.list:423 - stats.get_statistics:"dialog:active_dialogs=14","dialog:early_dialogs=9", - dlg.stats_active:{"starting":2,"connecting":58,"answering":9,"ongoing":289,"all":358}
2018-07-13 10:57:17.915458776 - dlg.list:422 - stats.get_statistics:"dialog:active_dialogs=13","dialog:early_dialogs=9", - dlg.stats_active:{"starting":4,"connecting":60,"answering":7,"ongoing":289,"all":360}
2018-07-13 10:57:19.149467641 - dlg.list:420 - stats.get_statistics:"dialog:active_dialogs=13","dialog:early_dialogs=8", - dlg.stats_active:{"starting":3,"connecting":60,"answering":5,"ongoing":290,"all":358}
2018-07-13 10:57:20.386457513 - dlg.list:415 - stats.get_statistics:"dialog:active_dialogs=14","dialog:early_dialogs=7", - dlg.stats_active:{"starting":4,"connecting":59,"answering":5,"ongoing":290,"all":358}
2018-07-13 10:57:21.627241266 - dlg.list:412 - stats.get_statistics:"dialog:active_dialogs=15","dialog:early_dialogs=7", - dlg.stats_active:{"starting":3,"connecting":60,"answering":6,"ongoing":292,"all":361}
2018-07-13 10:57:22.867957903 - dlg.list:415 - stats.get_statistics:"dialog:active_dialogs=14","dialog:early_dialogs=9", - dlg.stats_active:{"starting":2,"connecting":64,"answering":5,"ongoing":291,"all":362}
2018-07-13 10:57:24.099604068 - dlg.list:418 - stats.get_statistics:"dialog:active_dialogs=13","dialog:early_dialogs=10", - dlg.stats_active:{"starting":2,"connecting":65,"answering":3,"ongoing":294,"all":364}
2018-07-13 10:57:33.049706118 - dlg.list:411 - stats.get_statistics:"dialog:active_dialogs=8","dialog:early_dialogs=6", - dlg.stats_active:{"starting":4,"connecting":62,"answering":5,"ongoing":287,"all":358}
2018-07-13 10:57:34.290398926 - dlg.list:418 - stats.get_statistics:"dialog:active_dialogs=8","dialog:early_dialogs=6", - dlg.stats_active:{"starting":9,"connecting":62,"answering":5,"ongoing":287,"all":363}
2018-07-13 10:57:35.531092078 - dlg.list:419 - stats.get_statistics:"dialog:active_dialogs=8","dialog:early_dialogs=6", - dlg.stats_active:{"starting":9,"connecting":62,"answering":5,"ongoing":287,"all":363}
2018-07-13 10:57:36.762143019 - dlg.list:419 - stats.get_statistics:"dialog:active_dialogs=8","dialog:early_dialogs=6", - dlg.stats_active:{"starting":9,"connecting":62,"answering":5,"ongoing":287,"all":363}
2018-07-13 10:57:38.006519396 - dlg.list:420 - stats.get_statistics:"dialog:active_dialogs=8","dialog:early_dialogs=6", - dlg.stats_active:{"starting":10,"connecting":62,"answering":5,"ongoing":287,"all":364}
2018-07-13 10:57:39.254513350 - dlg.list:420 - stats.get_statistics:"dialog:active_dialogs=8","dialog:early_dialogs=6", - dlg.stats_active:{"starting":10,"connecting":62,"answering":5,"ongoing":287,"all":364}
2018-07-13 10:57:55.779414400 - dlg.list:431 - stats.get_statistics:"dialog:active_dialogs=2","dialog:early_dialogs=6", - dlg.stats_active:{"starting":8,"connecting":72,"answering":10,"ongoing":272,"all":362}
^C
root@kamailio2:~#
```
As you can see, it sometimes starts taking >10s to reply, which I believe means it is dying, leaving a core dump and restarting:
```
root@kamailio2:/var/tmp# ls -lrt
total 47781984
-rw------- 1 root kamailio 8607031296 Jul 13 10:59 core.kamailio.24376.13cn4.1531504766
-rw------- 1 root kamailio 2128039936 Jul 13 11:01 core.kamailio.28422.13cn4.1531504860
-rw------- 1 root kamailio 2081087488 Jul 13 11:01 core.kamailio.28420.13cn4.1531504860
-rw------- 1 root kamailio 2136645632 Jul 13 11:01 core.kamailio.28421.13cn4.1531504860
-rw------- 1 root kamailio 2174545920 Jul 13 11:01 core.kamailio.28423.13cn4.1531504860
-rw------- 1 root kamailio 5265956864 Jul 13 11:07 core.kamailio.1335.13cn4.1531505030
-rw------- 1 root kamailio 5309550592 Jul 13 11:07 core.kamailio.1339.13cn4.1531505030
-rw------- 1 root kamailio 5294923776 Jul 13 11:07 core.kamailio.1336.13cn4.1531505030
-rw------- 1 root kamailio 5329113088 Jul 13 11:07 core.kamailio.1334.13cn4.1531505030
-rw------- 1 root kamailio 5338894336 Jul 13 11:07 core.kamailio.1338.13cn4.1531505030
-rw------- 1 root kamailio 5262925824 Jul 13 11:07 core.kamailio.1337.13cn4.1531505030
root@kamailio2:/var/tmp#
```
I'm grabbing the `bt full` and `info locals` from all of them and I'll add them here shortly. I will have to roll back until you have time to check all this.
Hi Daniel,
This is what I can get:
```
root@kamailio2:/var/tmp# gdb /usr/sbin/kamailio -ex "bt full" --batch core.kamailio.1334.13cn4.1531505030 > core.kamailio.1334.13cn4.1531505030-bt_full.txt
66 ../../core/parser/../mem/../atomic/atomic_common.h: No such file or directory.
root@kamailio2:/var/tmp# gdb /usr/sbin/kamailio -ex "bt full" --batch core.kamailio.1335.13cn4.1531505030 > core.kamailio.1335.13cn4.1531505030-bt_full.txt
66 ../../core/parser/../mem/../atomic/atomic_common.h: No such file or directory.
root@kamailio2:/var/tmp# gdb /usr/sbin/kamailio -ex "bt full" --batch core.kamailio.1336.13cn4.1531505030 > core.kamailio.1336.13cn4.1531505030-bt_full.txt
66 ../../core/parser/../mem/../atomic/atomic_common.h: No such file or directory.
root@kamailio2:/var/tmp# gdb /usr/sbin/kamailio -ex "bt full" --batch core.kamailio.1337.13cn4.1531505030 > core.kamailio.1337.13cn4.1531505030-bt_full.txt
66 ../../core/parser/../mem/../atomic/atomic_common.h: No such file or directory.
root@kamailio2:/var/tmp# gdb /usr/sbin/kamailio -ex "bt full" --batch core.kamailio.1338.13cn4.1531505030 > core.kamailio.1338.13cn4.1531505030-bt_full.txt
66 ../../core/parser/../mem/../atomic/atomic_common.h: No such file or directory.
root@kamailio2:/var/tmp# gdb /usr/sbin/kamailio -ex "bt full" --batch core.kamailio.1339.13cn4.1531505030 > core.kamailio.1339.13cn4.1531505030-bt_full.txt
66 ../../core/parser/../mem/../atomic/atomic_common.h: No such file or directory.
root@kamailio2:/var/tmp# gdb /usr/sbin/kamailio -ex "bt full" --batch core.kamailio.24376.13cn4.1531504766 > ocre.kamailio.24376.13cn4.1531504766-bt_full.txt
51 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
root@kamailio2:/var/tmp# gdb /usr/sbin/kamailio -ex "bt full" --batch core.kamailio.28420.13cn4.1531504860 > ocre.kamailio.28420.13cn4.1531504860-bt_full.txt
BFD: Warning: /var/tmp/core.kamailio.28420.13cn4.1531504860 is truncated: expected core file size >= 8607031296, found: 2081087488.
Cannot access memory at address 0x7fc79b2dc148
Cannot access memory at address 0x7fc79b2dc140
Failed to read a valid object file image from memory.
root@kamailio2:/var/tmp# gdb /usr/sbin/kamailio -ex "bt full" --batch core.kamailio.28421.13cn4.1531504860 > ocre.kamailio.28421.13cn4.1531504860-bt_full.txt
BFD: Warning: /var/tmp/core.kamailio.28421.13cn4.1531504860 is truncated: expected core file size >= 8607031296, found: 2136645632.
Cannot access memory at address 0x7fc79b2dc148
Cannot access memory at address 0x7fc79b2dc140
Failed to read a valid object file image from memory.
root@kamailio2:/var/tmp# gdb /usr/sbin/kamailio -ex "bt full" --batch core.kamailio.28422.13cn4.1531504860 > ocre.kamailio.28422.13cn4.1531504860-bt_full.txt
BFD: Warning: /var/tmp/core.kamailio.28422.13cn4.1531504860 is truncated: expected core file size >= 8607031296, found: 2128039936.
Cannot access memory at address 0x7fc79b2dc148
Cannot access memory at address 0x7fc79b2dc140
Failed to read a valid object file image from memory.
root@kamailio2:/var/tmp# gdb /usr/sbin/kamailio -ex "bt full" --batch core.kamailio.28423.13cn4.1531504860 > ocre.kamailio.28423.13cn4.1531504860-bt_full.txt
BFD: Warning: /var/tmp/core.kamailio.28423.13cn4.1531504860 is truncated: expected core file size >= 8607031296, found: 2174545920.
Cannot access memory at address 0x7fc79b2dc148
Cannot access memory at address 0x7fc79b2dc140
Failed to read a valid object file image from memory.
root@kamailio2:/var/tmp#
```
[core.kamailio.1334.13cn4.1531505030-bt_full.txt](https://github.com/kamailio/kamailio/files/2193740/core.kamailio.1334.13cn4....) [core.kamailio.1335.13cn4.1531505030-bt_full.txt](https://github.com/kamailio/kamailio/files/2193741/core.kamailio.1335.13cn4....) [core.kamailio.1336.13cn4.1531505030-bt_full.txt](https://github.com/kamailio/kamailio/files/2193742/core.kamailio.1336.13cn4....) [core.kamailio.1337.13cn4.1531505030-bt_full.txt](https://github.com/kamailio/kamailio/files/2193743/core.kamailio.1337.13cn4....) [core.kamailio.1338.13cn4.1531505030-bt_full.txt](https://github.com/kamailio/kamailio/files/2193744/core.kamailio.1338.13cn4....) [core.kamailio.1339.13cn4.1531505030-bt_full.txt](https://github.com/kamailio/kamailio/files/2193745/core.kamailio.1339.13cn4....) [ocre.kamailio.24376.13cn4.1531504766-bt_full.txt](https://github.com/kamailio/kamailio/files/2193746/ocre.kamailio.24376.13cn4...) [ocre.kamailio.28420.13cn4.1531504860-bt_full.txt](https://github.com/kamailio/kamailio/files/2193747/ocre.kamailio.28420.13cn4...) [ocre.kamailio.28421.13cn4.1531504860-bt_full.txt](https://github.com/kamailio/kamailio/files/2193748/ocre.kamailio.28421.13cn4...) [ocre.kamailio.28422.13cn4.1531504860-bt_full.txt](https://github.com/kamailio/kamailio/files/2193749/ocre.kamailio.28422.13cn4...) [ocre.kamailio.28423.13cn4.1531504860-bt_full.txt](https://github.com/kamailio/kamailio/files/2193750/ocre.kamailio.28423.13cn4...)
I have no idea how to read backtraces, but I can tell you in advance there are plenty of `*dmq*` references in there...
Let me know what you think and what next steps I can take to help troubleshoot this.
I don't know if it helps, but I have one server constantly reporting incorrectly:
```
root@kamailio1:~# kamctl rpc stats.get_statistics all | grep -e "dialog:active"
"dialog:active_dialogs = 18446744073709551568",
root@kamailio1:~# kamctl rpc stats.get_statistics all | grep -e "dialog:active"
"dialog:active_dialogs = 18446744073709551567",
root@kamailio1:~# kamctl rpc stats.get_statistics all | grep -e "dialog:active"
"dialog:active_dialogs = 18446744073709551567",
root@kamailio1:~# kamctl rpc stats.get_statistics all | grep -e "dialog:active"
"dialog:active_dialogs = 18446744073709551566",
root@kamailio1:~# kamctl rpc stats.get_statistics all | grep -e "dialog:active"
"dialog:active_dialogs = 18446744073709551566",
root@kamailio1:~# kamctl rpc stats.get_statistics all | grep -e "dialog:active"
"dialog:active_dialogs = 18446744073709551566",
root@kamailio1:~# kamctl rpc stats.get_statistics all | grep -e "dialog:active"
"dialog:active_dialogs = 18446744073709551566",
root@kamailio1:~# kamctl rpc stats.get_statistics all | grep -e "dialog:active"
"dialog:active_dialogs = 18446744073709551564",
root@kamailio1:~# kamctl rpc stats.get_statistics all | grep -e "dialog:active"
"dialog:active_dialogs = 18446744073709551564",
root@kamailio1:~# kamctl rpc stats.get_statistics all | grep -e "dialog:active"
"dialog:active_dialogs = 18446744073709551563",
root@kamailio1:~# kamctl rpc stats.get_statistics all | grep -e "dialog:active"
"dialog:active_dialogs = 18446744073709551562",
root@kamailio1:~# kamctl rpc stats.get_statistics all | grep -e "dialog:active"
"dialog:active_dialogs = 18446744073709551565",
root@kamailio1:~# kamctl rpc stats.get_statistics all | grep -e "dialog:active"
"dialog:active_dialogs = 18446744073709551567",
root@kamailio1:~# kamctl rpc stats.get_statistics all | grep -e "dialog:active"
"dialog:active_dialogs = 18446744073709551568",
root@kamailio1:~# kamctl rpc stats.get_statistics all | grep -e "dialog:active"
"dialog:active_dialogs = 18446744073709551569",
root@kamailio1:~# kamctl rpc stats.get_statistics all | grep -e "dialog:active"
"dialog:active_dialogs = 18446744073709551569",
root@kamailio1:~# kamctl rpc stats.get_statistics all | grep -e "dialog:active"
"dialog:active_dialogs = 18446744073709551570",
root@kamailio1:~# kamctl rpc stats.get_statistics all | grep -e "dialog:active"
"dialog:active_dialogs = 18446744073709551570",
root@kamailio1:~# kamctl rpc stats.get_statistics all | grep -e "dialog:active"
"dialog:active_dialogs = 18446744073709551570",
root@kamailio1:~# kamctl rpc stats.get_statistics all | grep -e "dialog:active"
"dialog:active_dialogs = 18446744073709551569",
root@kamailio1:~# kamctl rpc stats.get_statistics all | grep -e "dialog:active"
"dialog:active_dialogs = 18446744073709551569",
root@kamailio1:~# kamctl rpc stats.get_statistics all | grep -e "dialog:active"
"dialog:active_dialogs = 18446744073709551570",
root@kamailio1:~# kamctl rpc stats.get_statistics all | grep -e "dialog:active"
"dialog:active_dialogs = 18446744073709551571",
root@kamailio1:~# kamctl rpc stats.get_statistics all | grep -e "dialog:active"
"dialog:active_dialogs = 18446744073709551572",
root@kamailio1:~#
```
The values are close to ULONG_MAX for 64 bit, so likely some dialog-end events are counted more than once (probably due to DMQ replication) and the value goes below 0, turning it into the maximum value.
For reference, ranges for number types: * https://en.wikibooks.org/wiki/C_Programming/limits.h
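As a concrete check: 18446744073709551568 (one of the values above) equals 2^64 - 48, i.e. 47 below ULONG_MAX, which is exactly what an unsigned counter would hold after being decremented 48 more times than it was incremented.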
In the past days I was rather busy with other stuff; I will try to get back to this and analyze the code asap.
Thanks! Let me know if you need any more info
Could it have something to do with "old-non-expired-non-removed" dialogs?
So today I applied the latest @charlesrchance patch, and what I normally do (so I don't lose dialogs) is:
1- Restart one node
2- Wait for the DMQ dialog replication sync
3- Restart the other node
Right after restarting both nodes in that sequence, I could already see values close to ULONG_MAX, so I ran a `watch` every second and I could see the values move around.
I decided to compare the dialogs to see if I could find anything interesting. What I found is that there were a lot of dialogs with an `init_ts` from around the 11th and 13th of July; today being the 28th, those dialogs should not be there.
So I wonder if those "bad" dialogs could be the cause of the counters going below 0.
Maybe some timer to remove those dialogs never fired? And, because of the way I perform restarts so as not to lose dialogs, they were constantly being replicated back and forth...
That said, I did a full restart (both nodes at the same time, losing dialogs) and since then the metrics are looking good.
I hope this info helps.
More info:
On another cluster with the same setup, during this troubleshooting I disabled DMQ and enabled MySQL for dialog replication; I also left one node out of rotation to observe the replication behavior.
Well, with 0 traffic I can see this:
```
root@sbc01:~# kamctl rpc dlg.stats_active
{
  "jsonrpc": "2.0",
  "result": {
    "starting": 0,
    "connecting": 0,
    "answering": 0,
    "ongoing": 0,
    "all": 0
  },
  "id": 2634
}
root@sbc01:~#
```
Which is correct, but...:
```
root@sbc01:~# kamctl rpc stats.fetch all | grep dialog
"dialog.active_dialogs": "38",
"dialog.early_dialogs": "0",
"dialog.expired_dialogs": "38",
"dialog.failed_dialogs": "0",
"dialog.processed_dialogs": "1",
root@sbc01:~#
```
I wonder if it's correct that those 38 `expired` dialogs still count towards the `active` counter. Could that have something to do with it? Maybe those non-existent expired dialogs, because they are still somewhere in the "active" counter, get replicated only when DMQ is enabled?
So far it is clear to me that:
1- There is definitely a scenario where the counters go below 0.
2- It only happens (so far) when DMQ is enabled.
3- In my initial look, I found logs showing that dialogs created on node1 were timed out on node2 (see previous posts).
4- Now I see a discrepancy between active and expired dialogs on a node with 0 active dialogs.
I feel we are slowly narrowing down what the problem could be, but I still don't have a clear picture of what causes what.
Having made some tests around this, whilst I have not yet been able to reproduce the negative counter issue, I do think there needs to be some further thought around dialog replication.
First thing to note - stats are not affected by replicated dialogs, so I don't think DMQ is _directly_ responsible for the negative counting. However - and this may be _indirectly_ related - if a node is restarted, any dialogs owned by it at the time will be forever 'stuck'. This is owing to the fact that in its current implementation, the dialog owner is responsible for triggering update/removal across the rest of the cluster. If the owner no longer exists - or has been restarted and has no idea that it was once the owner of some/all of the dialogs it receives in its initial re-sync - then this link is broken permanently. It is further compounded by the fact that these orphaned dialogs never (in my tests, anyway) time out.
I need to spend some more time on the DMQ side, since this is the first time I have looked at it properly. In the meantime, @joelsdc, regarding your issue:
1. Do you have database enabled alongside DMQ replication or was it only for testing? I suspect this is where the recent 38 'expired' dialogs came from - conversely, the earlier 'bad' dialogs you mentioned were likely a result of the owning node having been restarted (these would not have been included in the 'expired' counter).
2. Are you expecting the stats counters (the original ones, not the new 'dlg.stats_active') to reflect all dialogs across the cluster or just those handled directly by the local instance?
Hi @charlesrchance, thanks for the time to look into this.
What you say totally matches the "weird" behaviors I was seeing. You really gave me a lot of clarity about what is going on behind the scenes.
To answer your questions:
1. Yes, I have the database enabled, but my idea was to move away from it for everything I can handle using DMQ. This also matches the chronology of events: I have been restarting nodes here and there, so I wasn't "losing" dialogs but making them "orphaned" without knowing it.
2. So there are (at least) a couple of different options here:
   2a) The stat counters include all dialogs (local + replicated), which in reality is the true value; from Kamailio's perspective those dialogs are there and they are active, aren't they?
   2b) The stat counters include only local dialogs, so if you want the full dialog count you have to add up the values from all nodes.
Now there is also `dlg.stats_active`, which gives you the current dialogs (I'm not sure whether it also shows replicated ones), so maybe we can benefit from both options: a counter that includes both local and replicated dialogs, and a different counter (`dlg.stats_active`?) that has only the current local count of dialogs.
Since I fully restarted both SBCs (which, in fact, cleared the orphaned dialogs), I haven't seen a spike (negative counter).
At this point I don't know what the best option is, but I do understand the problem perfectly now.
@miconda what do you think about all this?
The RPC command `dlg.stats_active` walks the internal hash table of the dialog module and counts the dialogs that are in an active state, so whatever is there, local or replicated dialogs.
Counters are exposed to inconsistency if incrementing/decrementing is triggered by events that can occur more than once, so maybe the statistic related to active calls should be changed to a callback that counts in a similar manner to the RPC command. There are other statistics that have a callback function behind them instead of a counter.
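A rough sketch of the difference, in plain C with made-up structures rather than the actual dialog module types:

```c
#include <stddef.h>

/* Event-driven counter: every "dialog confirmed/ended" event adjusts it.
 * If the same end event is processed twice (e.g. once locally and once via
 * replication), the counter drifts and can wrap below zero. */
static unsigned long active_dialogs;

void on_dialog_confirmed(void) { active_dialogs++; }
void on_dialog_ended(void)     { active_dialogs--; }

/* Callback-style statistic: the value is computed at read time by walking
 * the authoritative hash table, so a duplicated event cannot make it drift.
 * dlg_cell_t/dlg_entry_t and their fields are illustrative only. */
typedef struct dlg_cell  { int state; struct dlg_cell *next; } dlg_cell_t;
typedef struct dlg_entry { dlg_cell_t *first; } dlg_entry_t;

unsigned long count_active_dialogs(dlg_entry_t *table, size_t size, int confirmed_state)
{
    unsigned long n = 0;
    for (size_t i = 0; i < size; i++)
        for (dlg_cell_t *c = table[i].first; c != NULL; c = c->next)
            if (c->state == confirmed_state)
                n++;
    return n;
}

int main(void)
{
    dlg_cell_t c2 = { 4, NULL };            /* state 4 stands in for "confirmed" here */
    dlg_cell_t c1 = { 2, &c2 };
    dlg_entry_t table[1] = { { &c1 } };
    return (int)count_active_dialogs(table, 1, 4);  /* exits with status 1 */
}
```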
I pushed a patch (see above) to update the stats for active and early dialogs on DMQ operations. Can you give it a try and see if all is OK with it?
Hi @miconda, I can absolutely test this.
Does your patch apply to the return values of `dlg.stats_active` or to the dialog entries in `stats.get_statistics`? Or maybe to both? :)
This is only for the statistics.
The rpc command was not touched.
Hi @miconda, sorry for the delay, it has taken me a while to be able to test this.
I can confirm so far that `dlg.stats_active` works correctly; it's in sync between both nodes all the time, so any changes to the dialogs are reflected there instantly on both nodes.
But, something must be wrong with `stats.get_statistics`:
Active K node:
```
root@sbc01:/etc/kamailio# kamcmd stats.get_statistics all | grep dialog
dialog:active_dialogs = 30
dialog:early_dialogs = 2
dialog:expired_dialogs = 14
dialog:failed_dialogs = 12176
dialog:processed_dialogs = 44907
root@sbc01:/etc/kamailio#
```
Backup K node (newly restarted):
```
root@sbc02:/etc/kamailio# kamcmd stats.get_statistics all | grep dialog
dialog:active_dialogs = 27
dialog:early_dialogs = 0
dialog:expired_dialogs = 0
dialog:failed_dialogs = 0
dialog:processed_dialogs = 0
root@sbc02:/etc/kamailio#
```
That `27` seems to me like an "initial sync" from the peer node, but the number won't change after that (which makes sense, as this node is a backup and doesn't handle any traffic unless there is a failover).
Does this make sense?
Thanks!
Hi guys, `dlg.stats_active` is definitely working nicely, but `stats.get_statistics` still gets out of sync with its peer node:

So I see the situation like this:
- `dlg.stats_active` works. Thanks! <--- I'm using this for now.
- `stats.get_statistics` has 2 issues:
  1- The counter at some point still goes below 0 and jumps to the ULONG_MAX value, as described above.
  2- On startup it counts the active dialogs (not sure if via DMQ or the database) and that number stays static. (NOTE: this is on the backup node, which has 0 traffic and therefore should say 0.) <--- I still have to confirm whether this happens because of DMQ or the dialog modparam DB_MODE=3.
I'll update here later on.
BTW, @henningw:
If I build Kamailio myself, I have the commit id in `kamailio -V`.
But with the nightly deb packages it's not available:
DEB install:
```
root@sbc01:~# kamailio -V
version: kamailio 5.2.0-pre1 (x86_64/linux)
flags: STATS: Off, USE_TCP, USE_TLS, USE_SCTP, TLS_HOOKS, USE_RAW_SOCKS, DISABLE_NAGLE, USE_MCAST, DNS_IP_HACK, SHM_MEM, SHM_MMAP, PKG_MALLOC, Q_MALLOC, F_MALLOC, TLSF_MALLOC, DBG_SR_MEMORY, USE_FUTEX, FAST_LOCK-ADAPTIVE_WAIT, USE_DNS_CACHE, USE_DNS_FAILOVER, USE_NAPTR, USE_DST_BLACKLIST, HAVE_RESOLV_RES
ADAPTIVE_WAIT_LOOPS=1024, MAX_RECV_BUFFER_SIZE 262144 MAX_URI_SIZE 1024, BUF_SIZE 65535, DEFAULT PKG_SIZE 8MB
poll method support: poll, epoll_lt, epoll_et, sigio_rt, select.
id: unknown
compiled with gcc 6.3.0
root@sbc01:~#
```
Source install:
```
root@sbc02:~# kamailio -V
version: kamailio 5.2.0-pre1 (x86_64/linux) 2ecf60-dirty
flags: STATS: Off, USE_TCP, USE_TLS, USE_SCTP, TLS_HOOKS, USE_RAW_SOCKS, DISABLE_NAGLE, USE_MCAST, DNS_IP_HACK, SHM_MEM, SHM_MMAP, PKG_MALLOC, Q_MALLOC, F_MALLOC, TLSF_MALLOC, DBG_SR_MEMORY, USE_FUTEX, FAST_LOCK-ADAPTIVE_WAIT, USE_DNS_CACHE, USE_DNS_FAILOVER, USE_NAPTR, USE_DST_BLACKLIST, HAVE_RESOLV_RES
ADAPTIVE_WAIT_LOOPS=1024, MAX_RECV_BUFFER_SIZE 262144 MAX_URI_SIZE 1024, BUF_SIZE 65535, DEFAULT PKG_SIZE 8MB
poll method support: poll, epoll_lt, epoll_et, sigio_rt, select.
id: 2ecf60 -dirty
compiled with gcc 6.3.0
root@sbc02:~#
```
I guess I wasn't going crazy after all! :D
I just verified that removing DB_URL and DB_MODE from the dialog modparams results in the same behavior. In order to keep the thread in sync with the topic, I have created #1692 to track the metric not being updated. As for *this* issue, we can either close it or continue troubleshooting why the dialog counter from `stats.get_statistics` goes below 0.
@miconda what do you think?
Closing, with the RPC command `dlg.stats_active` as the solution.
Closed #1591.