<!-- Kamailio Pull Request Template -->
<!-- IMPORTANT:
  - for detailed contributing guidelines, read:
    https://github.com/kamailio/kamailio/blob/master/.github/CONTRIBUTING.md
  - pull requests must be done to master branch, unless they are backports
    of fixes from master branch to a stable branch
  - backports to stable branches must be done with 'git cherry-pick -x ...'
  - code is contributed under BSD for core and main components (tm, sl, auth, tls)
  - code is contributed under GPLv2 or a compatible license for the other components
  - GPL code is contributed with the OpenSSL licensing exception
-->
#### Pre-Submission Checklist
<!-- Go over all points below, and after creating the PR, tick all the checkboxes that apply -->
<!-- All points should be verified, otherwise, read the CONTRIBUTING guidelines from above -->
<!-- If you're unsure about any of these, don't hesitate to ask on the sr-dev mailing list -->
- [x] Commit message has the format required by CONTRIBUTING guide
- [x] Commits are split per component (core, individual modules, libs, utils, ...)
- [x] Each component has a single commit (if not, squash them into one commit)
- [x] No commits to README files for modules (changes must be done to docbook files in `doc/` subfolder, the README file is autogenerated)
#### Type Of Change
- [ ] Small bug fix (non-breaking change which fixes an issue)
- [x] New feature (non-breaking change which adds new functionality)
- [ ] Breaking change (fix or feature that would change existing functionality)
#### Checklist:
<!-- Go over all points below, and after creating the PR, tick the checkboxes that apply -->
- [ ] PR should be backported to stable branches
- [x] Tested changes locally
- [x] Related to issue #2361
#### Description
When a destination is selected using one of the hash algorithms (over Call-ID, From, etc.) and that destination is not available, the next one is selected. This causes the following problem: all messages intended for the unavailable destination are sent to the same single destination. That one destination receives double load while the rest continue with the same.
Solution: to distribute messages better, while still guaranteeing the deterministic distribution of the hash (same hash, same destination), a re-hash of the hash is made.
As the rehash could select the same destination or another failed one, it can rehash several times. The number of rehashes is controlled by:
```
modparam("dispatcher", "ds_rehash_max", -1)
```
which can take the following values (a sketch of the selection loop follows the list):
- `-1`: the total count of destinations is used
- `0`: disabled; reverts to the original implementation
- a positive number: the maximum number of rehashes
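To make the idea concrete, here is a minimal self-contained C sketch of the proposed selection loop. The destination array, the down flags, and the `rehash()` mix are illustrative stand-ins for the module's destination set, `ds_skip_dst()`, and string hash; only the `ds_rehash_max` semantics come from the proposal, so this is a sketch of the concept, not the patch itself.

```c
#include <stdio.h>

#define NDST 5

static int dst_down[NDST] = {0, 0, 1, 0, 0}; /* destination 2 is down */

/* stand-in for whatever hash the module applies to the previous hash value */
static unsigned int rehash(unsigned int h)
{
    h ^= h >> 16;
    h *= 0x45d9f3bu;
    h ^= h >> 16;
    return h;
}

static int select_dst(unsigned int hash, int ds_rehash_max)
{
    /* -1: rehash up to the destination count; 0: disabled */
    int tries = (ds_rehash_max < 0) ? NDST : ds_rehash_max;
    unsigned int h = hash;

    while (tries-- > 0 && dst_down[h % NDST])
        h = rehash(h); /* re-hash the previous hash value */

    /* if h % NDST is still down, the caller falls back to the
     * original sequential "next destination" behaviour */
    return (int)(h % NDST);
}

int main(void)
{
    unsigned int ci;
    for (ci = 0; ci < 8; ci++)
        printf("hash %u -> dst %d\n", ci, select_dst(ci, -1));
    return 0;
}
```

With `ds_rehash_max = 0` the loop never runs and the selection is exactly the old one, which matches the backward-compatibility claim made in this thread.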
You can view, comment on, or merge this pull request online at:
https://github.com/kamailio/kamailio/pull/2363
-- Commit Summary --
* dispatcher : better distribution when using hash and destination is not available
-- File Changes --
M src/modules/dispatcher/dispatch.c (49)
M src/modules/dispatcher/dispatcher.c (2)
-- Patch Links --
https://github.com/kamailio/kamailio/pull/2363.patch
https://github.com/kamailio/kamailio/pull/2363.diff
OK, guys,
It seems I did the pull request right this time. Now it contains only my changes, as I verified my forked repo was in full sync with kamailio:master.
Could one of the maintainers take a look at this pull request, please? It'd be great if it could make it into release 5.4.0.
It's a very small change, just about 50 lines, and a very simple concept: just a rehash of the hash, a limited number of times, when destinations are not available.
Best regards,
Luis Rojas (Kalki70 at github)
@Kalki70 pushed 1 commit.
fdd623a50fedd4312f8324cf8796ef00eed12af5 dispatcher: Fixes rehash of hash when ds_rehash_max is 1.
@Kalki70 pushed 1 commit.
ce177c134f00cd1509313b50163849f736b6e0a9 dispatcher: FIX. Previous implementation of rehash did not consider the case when the "use_default" flag is active.
When I realized how the "use_default" flag is used, I had trouble fixing the rehash for that case.
According to the documentation:
[use_default](https://kamailio.org/docs/modules/4.3.x/modules/dispatcher.html#dispatcher.p...)
_3.10. use_default (int) If the parameter is set to 1, the last address in destination set is used as a final option to send the request to._
But it's not the last address, it's the first one. After dealing with my implementation I tried the official 5.3.3 and found the same behavior: it is the first address that is skipped and reserved as the default, in case no other destination is available.
It seems that for some reason the list of addresses is loaded in reverse order.
My implementation does not touch that, so it is still the first one.
First, the variables must be declared at the beginning of the functions/blocks, to be consistent with the rest of the code.
Then, the new code does the re-hashing over an integer converted to a string, and then re-hashes the returned index (again converted to a string). I do not like this approach. Why not just take the hashid modulo the number of active destinations and then select the destination address by iterating over the list of active destinations, instead of picking directly by index?
Anyhow, I think the existing algorithms should stay as they are now; you can add a new algorithm hashing the value in a variable (like algorithm 7) with the re-hashing you want. Algorithm 7 can substitute all the other hashing algorithms by setting the appropriate value in the hashing variable (e.g., $var(dshash) = $ci for algorithm 0).
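(For illustration, here is a self-contained C sketch of the iterate-over-active-destinations idea suggested above. The toy arrays stand in for the module's destination set; all names are illustrative, not Kamailio's actual structures.)

```c
#include <stdio.h>

#define NDST 5

static int dst_down[NDST] = {0, 0, 1, 0, 0}; /* destination 2 is down */

/* hashid modulo the number of ACTIVE destinations, then walk the set
 * counting only active entries -- no second hashing involved */
static int select_active(unsigned int hashid)
{
    int active = 0, i, k;

    for (i = 0; i < NDST; i++)
        if (!dst_down[i])
            active++;
    if (active == 0)
        return -1; /* no destination available */

    k = (int)(hashid % (unsigned int)active);
    for (i = 0; i < NDST; i++) {
        if (dst_down[i])
            continue;
        if (k-- == 0)
            return i;
    }
    return -1; /* not reached */
}

int main(void)
{
    unsigned int ci;
    for (ci = 0; ci < 8; ci++)
        printf("hash %u -> dst %d\n", ci, select_active(ci));
    return 0;
}
```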
My comments are in between, marked [Kalki70]:
> First, the variables must be declared at the beginning of the functions/blocks, to be consistent with the rest of the code.
[Kalki70] I did not know you had this convention. I am more used to the "declare the variable when you need it" approach, as it makes the code clearer (a matter of opinion). I will change that.
> Then, the new code does the re-hashing over an integer converted to a string, and then re-hashes the returned index (again converted to a string). I do not like this approach. Why not just take the hashid modulo the number of active destinations and then select the destination address by iterating over the list of active destinations, instead of picking directly by index?
[Kalki70] I agree that it is not nice. I am new to the code and couldn't find a way to do a native hash over an integer; it seemed that everything ends up hashing a string. If you can point me to an appropriate function to hash an integer, I can change it. The code already does hashid modulo and then iterates over a list of destinations; that is what I am trying to fix.
> Anyhow, I think the existing algorithms should stay as they are now; you can add a new algorithm hashing the value in a variable (like algorithm 7) with the re-hashing you want. Algorithm 7 can substitute all the other hashing algorithms by setting the appropriate value in the hashing variable (e.g., $var(dshash) = $ci for algorithm 0).
[Kalki70]: Here I disagree. I am not proposing a new algorithm; I am trying to FIX the current implementation of the hashing algorithms, as I consider them faulty and useless in a real production environment. Every destination must run at less than 50% capacity to be able to absorb the full load of a faulty destination. That means duplicating resources. Even worse: you may have 50 destinations, but if two consecutive entries in dispatcher.list fail, the next one will receive triple load, its own plus 100% of the load of each of the two failed destinations. How do you protect against that scenario? Impossible in a real solution. It would have been better (not so bad, but not really "better") to use an "M+N" approach: hash over M destinations and, if a destination is down, use one of the N spares instead of overloading the currently active ones.
I open the debate here. I think the current behavior of the hash algorithms, when a destination is down, is faulty. I asked about this on the Kamailio users list but received no answer. I also reported a strange behavior when entries in dispatcher.list are duplicated:
https://lists.kamailio.org/pipermail/sr-users/2020-June/109518.html
I have a question about the reverse order of destinations. As you can see in issue #2361, when a destination fails the code selects the "next" one, but that is really the previous one in the dispatcher.list file. Also, if we use the "use_default" flag, the last destination should not be used unless all other destinations fail. Again, if you try it, you see it is not the last destination but the first.
It seems that somehow the list of destinations in the file is loaded in reverse, so iterating to the next actually goes to the previous one. The last one is actually the first.
Best regards,
Luis Rojas (user Kalki70).
It is admirable that you are convinced this is a faulty implementation that nobody observed since August 2004, when the module was added and the hashing over Call-ID was implemented, and that you are now going to fix it for everyone.
Well, actually that is the wanted behaviour of the respective hashing algorithms: go to the next one if the selected destination is unavailable. It may not meet your expectation; therefore you can add a new hashing algorithm, as I proposed. Otherwise, your fix breaks existing deployments for others. It is one of the benefits of open source to be able to accommodate most of the needs of its contributors.
Normally, a gateway/destination server should only be down temporarily; if it is shut down for maintenance, it should be removed from the list. Also, you should dispatch the initial requests stateless and rely on record-routing for requests within the dialog, because if you have destination servers going up and down during a call, you cannot ensure the BYE goes to the same server as the INVITE if the number of active servers is different.
Regarding the order of the addresses in the list: to ensure the one you want, you have to set the priority field. While with a text file you can keep an order in the file, with a database it is harder to predict the order of the records in a select query without an explicit order-by.
My fix does not break anything, as the re-hashing can be disabled by config. Backward compatibility is a must, always, in my opinion. I did make a mistake, though: the default value, if the parameter is not set, should be 0 ("disabled"). That is something to modify.
Maybe in August 2004 servers and resources were much cheaper. Nowadays most telco companies can't afford to have double the amount of resources just in case Kamailio decides to double the load on one of the destinations. I can be sarcastic too, and I don't see how that helps the debate over this issue.
Regarding relying on Record-Route to ensure delivery to the same destination: maybe you are not aware of all SIP scenarios. For instance, SIP CANCEL must follow the same path as the original INVITE, and Record-Route is of no use there. The same applies to the ACK for non-2xx replies.
Maybe the documentation should indicate that, if the **optional** priority field is not present, entries in the dispatcher.list file are considered in reverse order. That is hard to know in advance.
I didn't want to be sarcastic at all; I just admired how you put your perspective that you are fixing a bug -- you are right and the others are wrong. And yes, I am sure I am not aware of all SIP scenarios; every day I am surprised by the new issues I have to troubleshoot in SIP traffic. That's why I didn't list all of them, just the case of BYE, but you actually prove my point with CANCEL and ACK -- if the servers active at CANCEL or ACK time are not the same as for the INVITE, it is hard to ensure the same server is selected. A solution here could be to do some stickiness routing using htable.
I also agree that there can be different ways to achieve a new feature, but, imo, it is clearer to have a new algorithm that covers all your needs instead of changing all the other hashing algorithms, which are used by many others as they are. This brings more flexibility as well: one can use the existing algorithms and the new one in the same config at the same time, not depending on a modparam.
The same approach was taken when the relative weight distribution algorithm was added, alongside the weight distribution algorithm existing at that time.
You are more than welcome to propose a PR to improve the documentation based on your findings and the discussions on community forums/portals -- such contributions are very much appreciated!
I never said there was a bug. I don't think it is a bug. For me a bug is something wrong in the code, so that the application does not behave as expected. In this case the code is right: it does what it is expected to do. The problem is that I think the expected behavior is a bad behavior. Maybe the original developer did not think carefully about this; it is of very little use in real-life scenarios unless all destinations have plenty of free resources, so they can process double or triple load. Maybe he would say "yeah, it's a good improvement! Let's do it in a backward compatible way!"
Please take a look at issue #2361. What if, instead of calls, those were CAPS, so every destination processes 1000 CAPS? What if those nodes can't process 2000 CAPS, for instance because they are only capable of 1500 CAPS? When one destination fails, the next one will receive 2000 CAPS, but it can't handle that. Currently Kamailio can't solve that scenario. Wouldn't it be better if every destination now processed around 1200 CAPS (for example, with six destinations, the failed node's 1000 CAPS spread over the remaining five adds 200 each)?
That was the point I wanted to discuss: a better distribution. If everything is normal, every destination receives more or less the same number of calls (it's just a hash, not a fair distribution). If some destinations fail, the calls they were supposed to receive are better distributed over the remaining destinations, while still respecting the hash: same hash, same destination.
If the active servers change between the INVITE and other messages, I don't think there is a stateless way to solve it. The current implementation also fails. First, from what I saw in the code, the list contains all destinations, not just the active ones. Once the hashid is obtained, modulo the total number of destinations is applied, and then the state of the destination is checked. If it is not active, the next one is selected; if the next is active, the INVITE is sent to it.
What if, when the CANCEL is received, the original destination is active again? The hashid is obtained, modulo the total is applied, the server is active, and the CANCEL is sent to the wrong server. As I said, I don't think there is a fully stateless way to solve it. Kamailio would have to memorize the original destination, or something complex, beyond the scope of a simple hash.
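(For illustration, a self-contained toy of the mismatch just described, with the hash-modulo-total-then-next selection modeled from the description above; all names are illustrative, not the module's code.)

```c
#include <stdio.h>

#define NDST 3

/* the selection as described above: hashid modulo the total number
 * of destinations, then sequentially "next" while the pick is down */
static int select_current(unsigned int hashid, const int *down)
{
    int i = (int)(hashid % NDST), tries;
    for (tries = 0; tries < NDST && down[i]; tries++)
        i = (i + 1) % NDST;
    return i;
}

int main(void)
{
    unsigned int hashid = 2;         /* same Call-ID hash for both messages */
    int at_invite[NDST] = {0, 0, 1}; /* dst 2 down at INVITE time */
    int at_cancel[NDST] = {0, 0, 0}; /* dst 2 back up at CANCEL time */

    printf("INVITE -> dst %d\n", select_current(hashid, at_invite)); /* dst 0 */
    printf("CANCEL -> dst %d\n", select_current(hashid, at_cancel)); /* dst 2: wrong server */
    return 0;
}
```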
I do care about continuing service. OK, if a server goes down and later comes back up, it will cause some calls to fail, with the current implementation and also with mine. But service should continue. It would be a high-availability platform that can continue providing service to new calls while failing on transient ones, and with a better distribution.
I did not want to add a new feature. I wanted to improve an existing one. The hash algorithms can be improved, as any part of the code may eventually be improved.
_Now, if I wanted to implement it your way, as a new algorithm, I have no idea how I could get, inside the function ds_manage_routes(), the value to be used as the hash. You said, for instance, $var(dshash) = $ci. How can I get the content of dshash inside that function?_
You said "I am trying to FIX the current implementation", which implies there is a problem, a bug. Initial developer maybe thought about more that you think, just had to implement based on the requirements at that moment and they may still be valid for specific use cases even after 16 years, given that the code was changed. Anyhow, let's not go around the words and focus on the technical parts.
I didn't say that what you propose as a new distribution logic is not useful for some cases; I proposed to add a new algorithm inspired by algorithm 7. The module was introduced before variables existed; once they were available, algorithm 7 could offer the same as hashing over the Call-ID or other attributes of the SIP message. There are clear benefits to this approach, some of which I listed in previous comments.
If you look at the code where algorithm 7 (DS_ALG_HASHPV) is handled, you can see how the value of the PV specified by the modparam hash_pvar is retrieved. If you run into trouble sorting it out, let us know what you coded (post a diff) and we will try to assist.
@Kalki70 pushed 1 commit.
33bdabc77905b6b056022fa7eec560c0163f09ef Default mode is fully compatible with previous implementation. Default mode "Does nothing".
After careful consideration, I have decided not to implement it as a new algorithm. My proposal is an enhancement of the current hash algorithms; it is an extension. Without config changes it behaves like it always has since 2004. **It does not break anything**.
However, if the user wants, he can activate the extension via modparam, so the rehash of the hash can happen. As we know, the rehash could land on the same original failed server, so a loop of rehashing may happen; it is up to the user to decide how many rehash attempts to allow. Anyway, if after the retries still no available server is found, the algorithm stops and falls back to the original, sequential implementation. Fully backward compatible.
I expected this place to be an interesting space for exchanging ideas and debate, with several people weighing in. For instance, mister @sergey-safarov, who seemed to like my proposal in issue #2361.
Now, if mister "god" Daniel Constantin Mierla decides to just reject this proposal, because he simply did not like it, or because it's June, we will just happily use my patched version of Kamailio at my work, and we can simply forget the subject.
This is one of the benefits of open source.
@Kalki70 First of all, thanks for the analysis (especially the one done in #2361). We are a group of developers in this project and are certainly open to discussions and feedback focused on the topic.
I agree that the observed behaviour might in some cases not be the one you want, and another option would be beneficial. Many years ago this was one of the reasons to make it more explicit, e.g. with backup/fallback options in carrierroute. But there might be scenarios where you don't want this rehashing done (e.g. if you use a stable partitioning of users to user location DBs). In such scenarios the simple replacement (as currently done) is better than this rehashing.
Your extension to the module is welcome; we usually accept most of the pull requests in this project. I think all that Daniel said is that it should go in as a dedicated algorithm and not as a configuration parameter. We already have quite a lot of configuration parameters in this (and other) modules, so a dedicated algorithm might be easier for people to understand in the end.
@Kalki70 - I think you started crossing the common-sense line with your last comment, making a personal attack. As I said, my non-technical remark was more from an admiration point of view -- bugs found in code older than 15 years somehow deserve a "highlighting prize", but that is not the case here: you wrote that you FIX the current implementation, yet it proves to be a new set of distribution algorithms. Everything else was technical review and comments about the code changes. If you want the kind of discussion with personal attacks, Kamailio is not the project for that approach.
The way I proposed to add the feature has many benefits:
* a cleaner enhancement, the existing algorithms not being affected at all
* the ability to use it at the same time as the other algorithms -- while you may want your distribution for transcoding servers, registrations may be better handled with an old-style algorithm, to know better where a user ended up registered during a server maintenance time frame
* backward compatibility
* it follows the same approach as the weight and rweight algorithms
Yours keeps backward compatibility, nothing more, while adding to the complexity of deciding how to set the modparams.
Besides the above aspects, I also had specific code review remarks: the declaration of variables at the beginning of blocks, and the not-so-good second hashing, which is done over an integer printed as a string. On the second one: the purpose of hashing is to get an integer from a string; if you already have the integer, converting it to a string in order to hash it again is not really a good solution. Even from a distribution point of view, if you have 3 server addresses in the set, then you will always hash either "0", "1" or "2". Hashing 3 string values will always return the same 3 integers (hash ids). I haven't spent much time analyzing the code, but practically, if server 2 is down, then all traffic sent to it will go to the server corresponding to the hash id computed from "1", so you still send to another **single** server. The second hashing has to be done over something that can ensure a better distribution.
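(To make this concern concrete: hashing a small fixed set of index strings can only ever yield a small fixed set of hash ids, as the toy below shows, with djb2 standing in for the module's string hash. Note that the author disputes further down that the patch actually hashes the indexes.)

```c
#include <stdio.h>

/* djb2, an illustrative stand-in for the module's string hash */
static unsigned long djb2(const char *s)
{
    unsigned long h = 5381;
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h;
}

int main(void)
{
    /* with 3 destinations, re-hashing the stringified index gives each
     * index exactly one fixed result -- so all traffic of a down server
     * would collapse onto one single destination (possibly even itself) */
    const char *idx[3] = {"0", "1", "2"};
    int i;
    for (i = 0; i < 3; i++)
        printf("hash(\"%s\") %% 3 = %lu\n", idx[i], djb2(idx[i]) % 3);
    return 0;
}
```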
As for code issues in the patch, one seems to be a potential infinite loop in this line:
```
while (( maxRehash > 0 ) && ds_skip_dst(idx->dlist[fullHash % idx->nr].flags ) );
```
The code indentation also does not follow the style in the file. Those were what I could spot quickly so far.
There are several misunderstandings of the code.
As I mentioned, I didn't find a function to natively hash an integer, so I used a string. I also asked whether there was one I could use, but received no answer. I guess std::hash is out of the question.
It is wrong to assume the hash is done over "1", "2", "3", etc. Those are the indexes of the servers. I am not hashing the indexes, but the original hashes.
Also, my fault: maxRehash has a very misleading name. It was born as a "max" but is then used as a counter. If Daniel had read two lines above:
```
maxRehash--;
} while (( maxRehash > 0 ) && ds_skip_dst(idx->dlist[fullHash % listSize].flags ) );
```
It is impossible to have an infinite loop there.
I don't like it as a new algorithm. And it makes no sense to use it with the other algorithms; it only makes sense with the hash algorithms.
Forget the request. I withdraw it. I will use a patched version at my work.
Closed #2363.
Hello @henningw,
Thank you for your comment. I do really appreciate it.
I must disagree with this:
> But there might be scenarios where you don't want this rehashing done
YES, and that is why my proposal does not change the current behavior. With no changes in the config file, nothing changes. But, if the user wants, he can activate my extension.