Hi Greger!
Greger V. Teigre wrote: ...
Agree. We use RADIUS-based authentication and authorization with distributed RADIUS servers. Only usrloc is stored in mysql (we use
I want to ask about your radius experiences. We (www.at43.at) are also using radius authentication. All the radius requests are sent to a local radius proxy which forwards the request to the radius server of the participating groups (universities, schools ...).
If one of the remote radius servers is down, we have problems with ser. Ser's threads are busy waiting for the radius authorization responses, and ser slows down. The clients then start to retransmit their REGISTER messages, and ser gets busier and busier until all threads are busy with authentication requests. Thus, the complete service goes down even if only one of the radius servers is down.
We have reduced the proxy load by replying "100...trying" to all REGISTER requests, which reduces retransmissions in case of slow authentication. We also tried to tweak the radius retransmission and timeout settings but could not find a satisfying solution yet.
Do you also have problems in your distributed radius setup? Maybe you could post a little about your experience with distributed radius.
All other radius users are also welcome to post their radius experiences.
regards, klaus
PS: I hope Maxim's patch for stateful authentication is going into 0.9.0
Hi Klaus,
Just a quick response to what you describe below: We have a different scenario based on three facts:
- We have complete control and monitoring of all participating RADIUS servers
- Each ser has a RADIUS server on the local LAN where the server center is managed as a whole (i.e. individual components should not be unavailable)
- We do not tolerate RADIUS downtime at all. Our 24x7 operations center will immediately respond and correct the situation
Thus, we have never experienced the scenario below. However, if something happens, it is actually more likely that we start to NAK all requests as a default. This of course causes the clients to re-register, but ser does not slow down. As you proxy the requests, you probably have a re-send from the RADIUS proxy to the other servers as well, in addition to ser's resend. This adds up, as ser will send a new request before the proxy has finished its resends. You could probably turn off the re-send on your RADIUS proxy completely and rely only on ser's resend. It depends on the network in between and the level of monitoring you have on all the servers. If you have complete control and tight monitoring, you can probably turn off resend and set very low time-outs. Thus, when a server is down, ser will NAK and the client will retry later (which is probably as good as anything, because something is probably seriously wrong and retrying every 4 seconds won't help...)
Hope this made sense. g-)
Klaus Darilion wrote:
We have disabled retransmissions at the radius proxy. In radiusclient.conf we have:
  radius_timeout 3
  radius_retries 1
Now, our setup works, but it's not a fine working solution. The problem is that an ongoing radius request will block a thread completely. Thus, having lots of clients (lots of REGISTERs) and a slow radius backend is like a DoS attack.
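A rough back-of-the-envelope sketch of why a slow backend behaves like a DoS, assuming a fixed pool of workers that each block for the full duration of one RADIUS request. The function name and the numbers (8 workers, 20 ms healthy response time) are illustrative assumptions, not values from this setup:

```python
def max_register_rate(workers: int, blocked_seconds: float) -> float:
    """Upper bound on REGISTERs per second when each request blocks one worker."""
    if workers <= 0 or blocked_seconds <= 0:
        raise ValueError("workers and blocked_seconds must be positive")
    return workers / blocked_seconds


# Hypothetical numbers: 8 workers; a healthy backend answering in 20 ms
# versus an unreachable backend where each request waits out a 3 s timeout.
healthy = max_register_rate(8, 0.02)
degraded = max_register_rate(8, 3.0)
print(f"healthy: {healthy:.0f} req/s, degraded: {degraded:.2f} req/s")
```

Once the incoming REGISTER rate (including client retransmissions) exceeds the degraded bound, every worker is waiting on the backend and nothing else gets processed.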
regards, klaus
Yes, I understand your problem. Handling RADIUS retries demands a server design made for it. I don't know if it is allowed, but wouldn't it be better to reduce timeout to 1 (or 2) and retries to 0? I mean, if you don't get a response within one second (dependent on your network setup), why wait or retry? I have never really understood the wait and retry of RADIUS, we tend to failover to secondary or tertiary RADIUS as fast as possible. The only point I see of using long waits and maybe 1 retry is if you are running auths across an (unstable) Internet connection. I guess it's part of the legacy. g-)
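The "short timeout, no retry, fail over quickly" idea could be sketched roughly as follows. This is only an illustration: the `query` callable, its signature, and the server names are hypothetical and do not correspond to any real RADIUS client API.

```python
from typing import Callable, Optional, Sequence

# Hypothetical signature: returns True (accept) or False (reject),
# or raises TimeoutError when the server does not answer in time.
AuthQuery = Callable[[str, float], bool]


def authenticate(servers: Sequence[str], query: AuthQuery,
                 timeout_s: float = 1.0) -> Optional[bool]:
    """Try each server once, in order; None means no server answered."""
    for server in servers:
        try:
            return query(server, timeout_s)
        except TimeoutError:
            continue  # no retry on the same server: fail over immediately
    return None  # caller should reject (NAK) and let the client retry later


def fake_query(server: str, timeout_s: float) -> bool:
    """Stand-in backend: the primary never answers, the secondary accepts."""
    if server == "radius1.example":
        raise TimeoutError(server)
    return True


result = authenticate(["radius1.example", "radius2.example"], fake_query)
print(result)  # True
```

With this shape, the worst-case blocking time per request is the timeout multiplied by the number of configured servers, rather than timeout multiplied by retries on a single dead server.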
Greger V. Teigre wrote:
Yes, I understand your problem. Handling RADIUS retries demands a server design made for it. I don't know if it is allowed, but wouldn't it be better to reduce timeout to 1 (or 2) and retries to 0?
The problem with short timeouts is faster retransmission: an already overloaded radius server gets even more overloaded.
Also, radiusclient.conf does not allow:
  radius_retries 0
Maybe they mean "total requests" and not "retransmissions".
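To make the two possible readings of radius_retries concrete, here is a small sketch of the worst-case time one worker could block on an unanswered request. It assumes a fixed per-attempt timeout with no backoff; which interpretation the radiusclient library actually uses is not confirmed here, and the function name is made up for illustration.

```python
def worst_case_wait(timeout_s: int, retries: int, retries_means_total: bool) -> int:
    """Seconds one worker may block on a single unanswered RADIUS request."""
    # Either "retries" counts every attempt, or it counts only the re-sends
    # that follow the initial attempt.
    attempts = retries if retries_means_total else retries + 1
    return timeout_s * attempts


# Values quoted earlier in this thread: radius_timeout 3, radius_retries 1
print(worst_case_wait(3, 1, retries_means_total=True))   # 3
print(worst_case_wait(3, 1, retries_means_total=False))  # 6
```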
I think the problem is not only radius related, but related to remote authentication in general. What about mysql authentication: does ser cache the password or does it query the database for each REGISTER? If there is no caching, then there would also be a problem if the mysql database is on a remote site and the query takes some time.
Thus, a good starting point would be transaction stateful REGISTER handling in ser to avoid increasing load on a slow radius server.
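As a minimal sketch of what transaction-stateful handling buys here: only the first copy of a REGISTER is allowed to trigger a backend query, and retransmissions that arrive while it is still pending are absorbed. The class name and the (Call-ID, CSeq) key are simplifying assumptions; real SIP transaction matching is more involved, and a multi-process server would need shared state and locking.

```python
from typing import Set, Tuple

TransactionKey = Tuple[str, int]  # (Call-ID, CSeq number); a simplification


class RegisterGate:
    """Lets only the first copy of a pending REGISTER reach the backend."""

    def __init__(self) -> None:
        self._in_flight: Set[TransactionKey] = set()

    def admit(self, key: TransactionKey) -> bool:
        """Return True if this request should trigger a backend query."""
        if key in self._in_flight:
            return False  # retransmission of a pending request: absorb it
        self._in_flight.add(key)
        return True

    def finish(self, key: TransactionKey) -> None:
        """Call once the backend has answered or timed out."""
        self._in_flight.discard(key)


gate = RegisterGate()
key = ("abc123@client.example", 1)
first = gate.admit(key)           # original REGISTER
retransmit = gate.admit(key)      # client retransmission while pending
gate.finish(key)                  # backend answered
after_finish = gate.admit(key)    # a later request with the same key
print(first, retransmit, after_finish)  # True False True
```

The point of the design is that a slow backend then sees at most one query per pending registration, no matter how often the client retransmits.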
regards, klaus
I'm not sure if I understand. A short timeout will only make ser timeout faster and without a retransmission configured, ser should respond with a NAK to the UA. How fast the UA tries again is dependent on the UA implementation.
Anyway, I believe the expiry of the nonce (default 300 seconds) controls when a new auth is done, although I have never verified it (never had registration intervals below 300 seconds). I don't know the security impact of increasing the expiry time. I think ser caches the nonce in mysql, so unless the nonce is in memory, it will be loaded to check against it. If it has expired, a new auth attempt will be made. I'm not sure about the ser behavior for a timeout with mysql.
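The expiry rule described here can be sketched as a simple freshness check. This assumes the server can recover the time at which a nonce was issued; how ser actually stores or encodes that is not verified here, and the names below are illustrative only.

```python
NONCE_EXPIRE_S = 300  # the default expiry mentioned above


def nonce_is_fresh(issued_at: float, now: float,
                   expire_s: int = NONCE_EXPIRE_S) -> bool:
    """True while credentials computed with this nonce may still be accepted."""
    return 0 <= now - issued_at < expire_s


def needs_new_challenge(issued_at: float, now: float) -> bool:
    """An expired nonce means the client must be challenged again."""
    return not nonce_is_fresh(issued_at, now)


# A client re-registering every 600 s always presents an expired nonce,
# while one re-registering every 120 s does not.
print(needs_new_challenge(issued_at=0.0, now=600.0))  # True
print(needs_new_challenge(issued_at=0.0, now=120.0))  # False
```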
BTW, makes me recall another thing we have seen: Some UAs actually do two auths against the DB every time a registration arrives. Once for the first INVITE (which receives an "auth required") and then another time with a new nonce. I think it has something to do with the UA including the old credentials in the first INVITE even though the nonce has expired and an auth must be done to verify that the credentials are incorrect. Have you seen this behavior? g-)
Yes, I've seen this once but can't remember which client it was. IMO it is a good idea to include the credentials (from the last nonce) in all requests. If the nonce is still valid, this avoids the second request with the credentials. On the other hand, it increases traffic on the authentication servers. Don't know what's better :-)
regards, klaus