[Serusers] ENUM timeout to +33 - parallel resolvers, tuneable timeout

Wed Mar 16 21:03:43 CET 2005

At 5:56 PM +0100 on 3/16/05, Adrian Georgescu wrote:
>Hello Juha,
>
>The ENUM prefix for France +33 (3.3.e164.arpa) does not work; it is 
>delegated but the servers do not answer and the resolver times out 
>in 12-20 seconds. If you prefer to do ENUM look-ups before going to 
>normal PSTN this affects SIP call flows.
>
>Do you have any idea how to work around this problem? Besides fixing 
>the DNS server side :)
>
>Regards,
>Adrian

[this post became somewhat non-specific to SER as I wrote it - 
apologies.  However, it is directly relevant to most of the community 
so I will post it.]

One method would be to write custom resolver code for ENUM lookups 
that has shorter timeouts for specific zones.

Another "feature" of such a program could be to do parallel 
resolution in an even more highly modified resolver, when there are 
several ENUM root zones that a particular host may care about.  The 
requirement for rapid timeout would also apply in these 
multi-threaded lookups, since post-dial delay (PDD) is the most 
important thing to worry about when completing calls (at least 
currently, IMHO.)  Perhaps being able to give each zone a preference 
and timeout would be interesting, too... let's expand on this.

There should in theory only be one "root" for ENUM (e164.arpa.) or at 
worst just one ENUM zone lookup per SIP routing engine, but in 
practice it seems like there are more and more root zones springing 
up for various administrative, technical, and political reasons and 
it is becoming possible (probable?) that each server should look 
through more than one zone during an ENUM query cycle.  I'd love to 
be able to use those zones all at once, but cascading through 
multiple instances of setting the domain_suffix and then doing the 
lookup is impractical - if there are failures or slowness, then 
post-dial-delay becomes unacceptable.

So, there are two features that I'd love to see built into a generic 
resolver: the ability to hand-tune failure intervals, and the ability 
to parallelize lookups into multiple top-level zones for the same 
query (and to weight the answers if we get multiple replies.)

This seems like it might be useful for the entire VoIP community, and 
not just for SER users.  Michael Haberler at nic.at had someone who 
was interested in writing this code a while back, but recently 
(yesterday) he said that the project didn't get wings.  I've got $150 
to donate towards anyone who comes up with something remotely 
resembling a generic resolver that handles parallel queries in the 
format I describe below:

Let's take a hypothetical config file for such a resolver daemon. 
This resolver assumes that it will be handed a non-qualified lookup 
like "2.1.2.1.5.5.5.2.1.2.1" without a top-level domain attached. 
The hacked resolver will then scan through the multiple top-level 
zones and try to get a match in a parallelized and possibly cascading 
fashion.
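
For reference, the unqualified name above is just the digits of the 
E.164 number reversed and separated by dots.  A minimal Python sketch 
of that step follows; the function names are mine, for illustration 
only, and are not taken from any existing resolver.

```python
def enum_labels(e164_number: str) -> str:
    """Return the reversed, dot-separated digits of an E.164 number.

    No zone suffix is appended; choosing the zone(s) is left to the
    resolver described in this post.
    """
    digits = [ch for ch in e164_number if "0" <= ch <= "9"]
    if not digits:
        raise ValueError("no digits in number: %r" % e164_number)
    return ".".join(reversed(digits))


def enum_domain(e164_number: str, suffix: str) -> str:
    """Fully qualified ENUM name under one candidate root zone."""
    return enum_labels(e164_number) + "." + suffix.strip(".") + "."


print(enum_labels("+1-212-555-1212"))
print(enum_domain("+1-212-555-1212", "e164.arpa"))
```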

# Which domains do we want to rip apart and handle in specific ways?
#  We first look at the IP address of the device that is sending
#  the query.  If it matches one or more lines, then see if it
#  matches the top-level zone that is being requested.  If there
#  is a match, and a "permit", then strip off the zone from the
#  query, and hand off to the groups specified in the list of
#  groups after the "permit" keyword, in the order the groups
#  are listed.
#
# Wildcards can be used for domains or IP addresses.
#
# Rules are interpreted in order of entry, and the first match
#  ends the lookup process.
#
# If no group(s) is specified, then hand off resolution to the
#  default-forwarder resolver(s).
#
# host [ip address] [zone suffix] [permit|deny] [group group ...]
#
host 10.*.*.* e164.arpa permit 1 2
host 10.*.*.* * permit
host *.*.*.* * deny
#
#
#
# Any lookup that doesn't match a "host" line with a group list
#  above gets pushed out to a list of normal DNS resolvers.  The
#  replies from those resolvers are simply forwarded back
#  through this hacked resolver to the querying device.  In
#  this example, if 10.10.10.88 asked for the A record for
#  "foo.com", then that query would be permitted and handed
#  off to the default forwarders for resolution. (note: the
#  default-forwarders aren't parallelized, though I suppose
#  they could be, but that would perhaps create unnecessary
#  DNS traffic for "non-critical" lookups.)
#
# default-forwarder [ip address] [port]
#
default-forwarder 192.148.33.13 53
default-forwarder 205.11.29.2   53
#
#
#
#
# Group 1
#
# Group 1 is for my internal e164 zones, which have their own
#  resolvers and speed assumptions.
#
#  forwarder [group] [weight] [ip address] [port]
#
forwarder 1 1 10.10.10.4 53
forwarder 1 1 10.10.22.9 53
#
# zone [group] [zone] [weight] [max ms wait]
#
zone 1 e164.mycompany.com       1 50
zone 1 e164.myothercompany.com  1 70
#
#
#
# Group 2
#
# If all the resolvers in group 1 don't come back with any valid
#  answers after 70ms, then we move on to group 2, which is external
#  zones and "outside" resolvers...
#
#  forwarder [group] [weight] [ip address] [port]
#
forwarder 2 1 4.33.12.94  53
forwarder 2 1 12.39.113.5 53
#
# zone [group] [zone] [weight] [max ms wait]
#
zone 2 e164.arpa 1 100
zone 2 e164.info 2 170
zone 2 e164.org  3 200
#
# end
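
To make the file format concrete, here is a rough Python sketch of a 
parser for the four line types documented in the comments above 
(host, default-forwarder, forwarder, zone).  The Config class and 
parse_config function are hypothetical names of my own; nothing like 
this exists yet, which is the point of this post.

```python
from dataclasses import dataclass, field


@dataclass
class Config:
    hosts: list = field(default_factory=list)               # (ip_pat, zone_pat, action, groups)
    default_forwarders: list = field(default_factory=list)  # (ip, port)
    forwarders: dict = field(default_factory=dict)          # group -> [(weight, ip, port)]
    zones: dict = field(default_factory=dict)               # group -> [(zone, weight, max_wait_ms)]


def parse_config(text: str) -> Config:
    cfg = Config()
    for lineno, raw in enumerate(text.splitlines(), 1):
        line = raw.split("#", 1)[0].strip()   # drop comments and blank lines
        if not line:
            continue
        kw, *args = line.split()
        try:
            if kw == "host":
                ip_pat, zone_pat, action, *groups = args
                if action not in ("permit", "deny"):
                    raise ValueError("action must be permit or deny")
                cfg.hosts.append((ip_pat, zone_pat, action, [int(g) for g in groups]))
            elif kw == "default-forwarder":
                ip, port = args
                cfg.default_forwarders.append((ip, int(port)))
            elif kw == "forwarder":
                group, weight, ip, port = args
                cfg.forwarders.setdefault(int(group), []).append((int(weight), ip, int(port)))
            elif kw == "zone":
                group, zone, weight, wait_ms = args
                cfg.zones.setdefault(int(group), []).append((zone, int(weight), int(wait_ms)))
            else:
                raise ValueError("unknown keyword %r" % kw)
        except ValueError as exc:
            # Wrong field counts and non-numeric values both land here.
            raise ValueError("line %d: %s" % (lineno, exc)) from None
    return cfg


SAMPLE = """
host 10.*.*.* e164.arpa permit 1 2
host *.*.*.* * deny
default-forwarder 192.148.33.13 53
forwarder 1 1 10.10.10.4 53
zone 1 e164.mycompany.com 1 50
zone 2 e164.arpa 1 100
"""
cfg = parse_config(SAMPLE)
print(cfg.zones)
```

Lines whose field count does not match the documented format are 
rejected with the offending line number, which seems like the safer 
default for a daemon that sits in the call path.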

   If we get a lookup from a host 10.10.10.44 for 
2.1.2.1.5.5.5.2.1.2.1.e164.arpa, here's what happens:

   The system determines that 10.10.10.44 is a permitted host. 
Additionally, the zone "e164.arpa" is one of our trigger suffixes. 
The system permits the lookup, and strips off "e164.arpa" from the 
lookup, and then hands the lookup to group #1's rules, and then (if 
no answer) to group #2's rules.
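
The host-matching step just described could look roughly like this in 
Python.  This is only a sketch under my own assumptions: rules are 
held as tuples in file order, wildcards are matched with the standard 
fnmatch module, and the suffix is stripped only when the rule names 
at least one group (otherwise the untouched name goes to the 
default forwarders).  The route_query name is hypothetical.

```python
from fnmatch import fnmatchcase

# (ip_pattern, zone_pattern, action, groups), in config-file order
RULES = [
    ("10.*.*.*", "e164.arpa", "permit", [1, 2]),
    ("10.*.*.*", "*", "permit", []),
    ("*.*.*.*", "*", "deny", []),
]


def route_query(src_ip, qname, rules=RULES):
    """Return (action, name_to_resolve, groups) for the first matching rule."""
    name = qname.rstrip(".").lower()
    for ip_pat, zone_pat, action, groups in rules:
        if not fnmatchcase(src_ip, ip_pat):
            continue                      # source address does not match
        if zone_pat == "*":
            return action, name, groups   # wildcard zone: nothing to strip
        suffix = "." + zone_pat.lower()
        if name.endswith(suffix):
            # Strip the trigger suffix only if groups will re-append their own.
            stripped = name[: -len(suffix)] if (action == "permit" and groups) else name
            return action, stripped, groups
    return "deny", name, []               # no rule matched at all


print(route_query("10.10.10.44", "2.1.2.1.5.5.5.2.1.2.1.e164.arpa"))
print(route_query("10.10.10.88", "foo.com"))
```

An empty group list in the result means "hand the query to the 
default forwarders unchanged".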

   First, we look up the number in our "internal" DNS trees, which 
we've set up in group #1.  There are two zones (representing perhaps 
subsidiary companies) that we resolve for locally, so we should look 
up the numbers in those servers first.  We have a very short 
resolution time (70ms maximum) since those servers are local and have 
small zones.  We have two resolvers for our internal trees, and we 
send the query to each resolver.  We have no preference between the 
two zones, and whoever answers first gets the call (thus, the weights 
of the two zones are identical.)  If the lookup is successful here, we 
stop and don't proceed to group #2.

   If there was no answer from our queries in group #1, we go to Group 
#2.  This is where we look up the external ENUM queries.  We have a 
few external zones in which we're going to look up the number.  If we 
get an answer from more than one of these zones, we indicate that we 
prefer e164.arpa first, e164.info second, and e164.org third as far 
as what answer we actually forward back to the entity that requested 
the ENUM lookup.  Even though our original query came in with 
"e164.arpa." as the suffix, we stripped that off - we'll do 
e164.arpa. lookups as part of the whole set of other possible roots - 
it's not "special", nor does the system treat it any differently than 
any other possible root.

   In our example, we believe that e164.arpa will return an answer to us 
in 100ms or less, and we will ignore any answers that come in after 
that interval (though, usefully they may be cached in our upstream 
DNS server even if we ignore the answer.)  We think that answers from 
e164.info will come to us in under 170ms, and e164.org in under 200ms.
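
   The group cascade and per-zone timeouts described above can be 
sketched with Python's asyncio.  This is an illustration, not a 
working resolver: "lookup" is a hypothetical coroutine supplied by 
the caller that takes a fully qualified name and returns a record or 
None, and it stands in for whatever asynchronous DNS library is 
actually used.  Zones within a group are queried at the same time, 
each with its own deadline; lower weight means more preferred, and 
ties go to whoever answers first.

```python
import asyncio


async def query_group(name, zones, lookup):
    """Query every zone of one group in parallel.

    zones:  [(zone, weight, max_wait_ms)], zone names unique in the group
    lookup: async callable (fqdn) -> record or None   (hypothetical)
    Returns (zone, record) or None if the group produced no answer.
    """
    async def one(zone, weight, wait_ms):
        try:
            rec = await asyncio.wait_for(lookup(name + "." + zone), wait_ms / 1000)
        except asyncio.TimeoutError:
            rec = None                    # too slow: treated as no answer
        return zone, weight, rec

    tasks = [asyncio.ensure_future(one(*z)) for z in zones]
    pending = {zone: weight for zone, weight, _ in zones}
    best = None                           # (weight, zone, record)
    try:
        for fut in asyncio.as_completed(tasks):
            zone, weight, rec = await fut
            del pending[zone]
            if rec is not None and (best is None or weight < best[0]):
                best = (weight, zone, rec)
            # Stop once no outstanding zone could still beat the best answer.
            if best and all(w >= best[0] for w in pending.values()):
                return best[1], best[2]
        return None
    finally:
        for t in tasks:
            t.cancel()                    # drop lookups we no longer need


async def resolve(name, groups, lookup):
    """Try each group in order; the first group with an answer wins."""
    for zones in groups:
        hit = await query_group(name, zones, lookup)
        if hit is not None:
            return hit
    return None
```

   Usage would be something like 
asyncio.run(resolve("2.1.2.1.5.5.5.2.1.2.1", [group1, group2], lookup)) 
with group1 and group2 built from the "zone" lines of the config.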

   The "forwarder" lines indicate which DNS forwarders we're going to 
use for resolution (and their port numbers.)  Queries would get split 
to each forwarder with the same weight at the same time, thus giving 
some redundancy if one forwarder should go down or become slow, at 
the cost of more DNS traffic to the authoritative servers.  It would 
be rare for forwarders to have different weights, but that may be 
desired in certain circumstances.

Comment: "Why don't you just have a universal timeout?  Won't the 
system always wait for the longest interval?"  Answer: No.  Using the 
above example file, think about the case when e164.org answers in 40 
milliseconds, but e164.arpa doesn't answer within 100 milliseconds 
(and e164.info has already come back empty).  We will use the e164.org 
answer at the 100ms mark rather than waiting for the longest possible 
time (200ms) which is in the config file, saving ourselves 100ms off 
the PDD.
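
The timing arithmetic can be checked with a small Python model.  It 
assumes one particular reading of the rules: an answer is forwarded 
as soon as every more-preferred zone has either answered or hit its 
own deadline.  Under that reading the 100ms figure holds when 
e164.info has already returned "no record"; a completely silent 
e164.info would hold the decision until its 170ms deadline.  The 
decide function and its argument shapes are hypothetical.

```python
def decide(zones, replies):
    """Pick the answer to forward and the time (ms) at which it is known.

    zones:   {zone: (weight, max_wait_ms)}, lower weight = preferred
    replies: {zone: (arrival_ms, has_record)}; silent zones are absent
    """
    settled = {}  # zone -> (time its outcome is known, usable answer?)
    for zone, (weight, max_wait) in zones.items():
        reply = replies.get(zone)
        if reply is not None and reply[0] <= max_wait:
            settled[zone] = (reply[0], reply[1])
        else:
            settled[zone] = (max_wait, False)   # late or missing: ignored

    hits = [z for z, (_, ok) in settled.items() if ok]
    if not hits:
        return None, max(t for t, _ in settled.values())

    # Best weight wins; among equal weights, the earliest answer wins.
    winner = min(hits, key=lambda z: (zones[z][0], settled[z][0]))
    # Wait only for zones that could still have beaten the winner.
    blockers = [settled[z][0] for z in zones if zones[z][0] < zones[winner][0]]
    return winner, max([settled[winner][0]] + blockers)


GROUP2 = {"e164.arpa": (1, 100), "e164.info": (2, 170), "e164.org": (3, 200)}

# e164.org answers at 40ms, e164.info says "no record" at 60ms, e164.arpa is silent
print(decide(GROUP2, {"e164.org": (40, True), "e164.info": (60, False)}))
# same, but e164.info is silent as well
print(decide(GROUP2, {"e164.org": (40, True)}))
```

The first call yields e164.org at 100ms and the second e164.org at 
170ms; in neither case do we wait out the full 200ms.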

   This code would work with almost any ENUM compatible system, since 
it will strip off the trailing suffix in a configurable way and then 
force the lookup to be run through several other top level zones of 
the administrator's choosing.  Cisco, Snom, or any SBC could use this 
without modification, as long as it supports ENUM.  It would be 
important not to make this system a forwarder for other resolvers; 
that would probably get ugly.

   Note that Asterisk has the ability to recursively query many zones 
automatically, but I don't think it's a parallel lookup and there are 
no timeouts other than standard DNS timeouts (which really makes ENUM 
unusable in a public network, from my experience.)  The same 
functionality can be done in SER, but it involves multiple switching 
of the domain_suffix parameters and then doing iterative enum_query 
calls.  This doesn't fix the case of very long lookup delays if there 
are DNS failures or slowness.  Both systems lack fine-grained 
control of lookup timeouts and alternate zone use, so I think some 
external program is required that solves this problem for all 
ENUM-capable systems.

Interesting links:
   http://www.corpit.ru/mjt/udns.html
   http://www.chiark.greenend.org.uk/~ian/adns/
   http://daniel.haxx.se/projects/c-ares/

Last note: Yes, this is all a hack.  ENUM is a hack.  But a lack of 
options marvelously clears the mind.  This is one of those things 
that has to be built before ENUM becomes actually useful in anything 
other than the most tightly controlled private networks.  A 
BSD-licensed version of this hypothetical code would do wonders to 
help VoIP interconnectivity lurch forward.  (Better yet would be a 
real routing protocol, but I'll conserve my wishes to that which 
might actually happen.)

JT