#### Pre-Submission Checklist - [X] Commit message has the format required by CONTRIBUTING guide - [X] Commits are split per component (core, individual modules, libs, utils, ...) - [X] Each component has a single commit (if not, squash them into one commit) - [X] No commits to README files for modules (changes must be done to docbook files in `doc/` subfolder, the README file is autogenerated)
#### Type Of Change - [X] Small bug fix (non-breaking change which fixes an issue) - [ ] New feature (non-breaking change which adds new functionality) - [ ] Breaking change (fix or feature that would change existing functionality)
#### Checklist: - [X] PR should be backported to stable branches - [X] Tested changes locally - [ ] Related to issue #XXXX (replace XXXX with an open issue number)
#### Description I've recently being experiencing a loop in nodes removal/addition leading to "ghost nodes". Suppose to have three servers A,B,C. Server C goes down not cleanly, so DMQ doesn't notify the other nodes. Server A is the first to send its ping, with a nodelist including node C. After fr_timer, the transaction for the message to node C times out and the node is removed from node A nodelist. Then node B sends its ping with a nodelist including node C (still alive for A), node A sees node C as a new node and adds it back to its nodelist. Now node B reaching fr_timer timeout removes node C, until next node's A ping, and so on. This does not occur if the delta between node A and node B pings is less than fr_timer. What I propose here is that, upon a failed ping, the failing node is put in disabled state and we wait a 2nd failed ping before removing it from the nodelist. This should prevent dead nodes to come back. You can view, comment on, or merge this pull request online at:
https://github.com/kamailio/kamailio/pull/1840
-- Commit Summary --
* dmq: wait for a 2nd failed ping before deleting a node
-- File Changes --
M src/modules/dmq/notification_peer.c (23)
-- Patch Links --
https://github.com/kamailio/kamailio/pull/1840.patch https://github.com/kamailio/kamailio/pull/1840.diff
Looks good - thanks!
Would it be sensible to add a module parameter to configure the number of failed pings before a node is considered dead (defaulting to 1 to preserve current behaviour)?
Definitely :) Should I merge and backport this one?
I am against backporting because it changes current behaviour and is not strictly a bug fix.
For master, I would still prefer the option to preserve current behaviour (with modparam) in case it is essential for some users to fail quickly.
I discovered the problem exactly because I wanted to fail quickly, having a global fr_timer of 5 s. IMHO the patch as it is doesn't affect the capability to fail quickly since a node is taken off the nodelist for message broadcasting immediately, except for the ping. I would say that it affects the recovery speed, since a two pings cycle is needed before a node is added back in case of a temporary failure. A modparam can be added to enable this, along with the caveat in the doc that a race condition can happen due to the combination of fr_timer and dmq ping interval.
Okay - can be merged in that case.
Thanks again!
Mergin, thank you :)
Merged #1840 into master.