Having run some tests around this, I have not yet been able to reproduce the negative counter issue, but I do think dialog replication needs some further thought.

First thing to note: stats are not affected by replicated dialogs, so I don't think DMQ is directly responsible for the negative counting. However, and this may be indirectly related, if a node is restarted, any dialogs it owned at the time will be 'stuck' forever. This is because, in the current implementation, the dialog owner is responsible for triggering update/removal across the rest of the cluster. If the owner no longer exists, or has been restarted and has no idea that it once owned some or all of the dialogs it receives in its initial re-sync, then this link is broken permanently. The problem is compounded by the fact that these orphaned dialogs never time out (in my tests, anyway).
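To make the failure mode concrete, here is a toy Python model of the behaviour described above. It is only a sketch of my understanding, not the module's actual code: the `Node` class, its `owned`/`replicas` sets and the `restart` re-sync are all hypothetical names invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical cluster node; not a real Kamailio or DMQ structure."""
    name: str
    owned: set = field(default_factory=set)      # dialogs this node created
    replicas: set = field(default_factory=set)   # dialogs learned from peers

    def start_dialog(self, dlg_id, peers):
        # The node handling the call becomes the owner and replicates it.
        self.owned.add(dlg_id)
        for peer in peers:
            peer.replicas.add(dlg_id)

    def end_dialog(self, dlg_id, peers):
        # Assumption: only the owner triggers removal across the cluster.
        if dlg_id not in self.owned:
            return False
        self.owned.discard(dlg_id)
        for peer in peers:
            peer.replicas.discard(dlg_id)
        return True

    def restart(self, peer):
        # Local state is lost. The re-sync brings the dialogs back, but only
        # as replicas: the node no longer knows it once owned any of them.
        self.owned = set()
        self.replicas = set(peer.owned | peer.replicas)


a, b = Node("a"), Node("b")
a.start_dialog("d1", [b])
a.restart(b)
removed = a.end_dialog("d1", [b])   # False: nobody owns d1 any more
orphaned = "d1" in b.replicas       # True: d1 is stuck on b
```

In this model nothing can ever remove `d1` from `b` once `a` has restarted, which matches what I see in testing. A per-dialog timeout on replicas would be one way out, but that is a design question rather than something this sketch settles.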

I need to spend some more time on the DMQ side, since this is the first time I have looked at it properly. In the meantime, @joelsdc, regarding your issue:

  1. Do you have the database enabled alongside DMQ replication, or was that only for testing? I suspect this is where the recent 38 'expired' dialogs came from. The earlier 'bad' dialogs you mentioned, by contrast, were likely a result of the owning node having been restarted (these would not have been included in the 'expired' counter).

  2. Are you expecting the stats counters (the original ones, not the new 'dlg.stats_active') to reflect all dialogs across the cluster or just those handled directly by the local instance?

