Caching DNS with BIND9 on SmartOS

I ran into a very strange issue recently: DNS requests to my domain controllers from our distributed analytics were failing intermittently with NXDOMAIN.

Emergent infrastructure failures, particularly intermittent ones, are frustrating to diagnose. In this case my domain controllers appeared to respond intermittently with NXDOMAIN to DNS lookups for my database clusters. Combine that with automatic exception recovery and connection pooling and you have a recipe for a flood of alerts going to on-call.

Hunting through the Web for answers was predictably an amusing way to spend hours exposing ourselves to unrelated advice and outright misconceptions.  What we did determine was that under extreme UDP/TCP load a Windows server may refuse to accept further communications.  Our hypothesis was that the UDP packets for the DNS lookups were being lost, but we still don't know for sure.

It turns out that SmartOS containers, where our distributed analytics are running, do not do DNS caching. As a result every lookup was being sent directly to the domain controllers, even when an earlier answer for the same name was still within its TTL.

That's where the solution became evident: we needed to deploy DNS caching on the SmartOS containers. Initially we looked at dnsmasq, which is what Ubuntu uses, but came around to BIND9 for the stability and simplicity it provides.

The configuration of BIND9 for this turned out to be incredibly simple.

Listing of /opt/local/etc/named.conf:

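options {
  // Accept queries on the loopback interface only
  listen-on port 53 { 127.0.0.1; };
  // Upstream servers for anything not already cached
  // (these two are the OpenDNS resolvers; substitute your own)
  forwarders {
    208.67.222.222;
    208.67.220.220;
  };
};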

Enable the service:

svcadm enable pkgsrc/bind

Check that it is working:

dig @127.0.0.1 joyent.com
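
A successful answer from dig shows that named is responding, but not that it is caching. One rough way to check is to resolve the same name twice and compare the "Query time" that dig reports: the second lookup should normally come back from the cache much faster, and the TTL in the answer should count down rather than reset. The helper names below (query_ms, check_cache) are made up for this sketch and are not part of BIND or SmartOS.

```shell
#!/bin/sh
# Sketch only: query_ms and check_cache are hypothetical helper names.

# Extract the "Query time" value (in milliseconds) from dig output on stdin.
# dig prints a line of the form ";; Query time: 23 msec".
query_ms() {
  awk '/^;; Query time:/ { print $4 }'
}

# Resolve the same name twice through the local cache and print both timings.
check_cache() {
  name="$1"
  first=$(dig @127.0.0.1 "$name" | query_ms)
  second=$(dig @127.0.0.1 "$name" | query_ms)
  echo "first=${first}ms second=${second}ms"
}
```

After sourcing the file, run check_cache joyent.com. Timings depend on the network, so treat the numbers as a hint rather than proof.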

Switch over /etc/resolv.conf:

search yourdomain.yourtld

nameserver 127.0.0.1

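As simple as that. Really. It says to listen only on loopback and to forward requests it cannot answer from its cache to the configured forwarders, which in my deployment are the domain controllers (the addresses in the listing above are the OpenDNS resolvers; substitute your own).

By default BIND9 allows queries and caches the answers, so that really is all there is to it. Caching cut the number of DNS requests reaching the domain controllers by an order of magnitude.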

Taking it Further

If you are still allowing DNS requests from your networks to arbitrary public DNS servers, you are at unnecessary risk: cache poisoning, loss of DNS auditing, DNS snooping, and botnet C2 (Command & Control) traffic. More information is available in US-CERT's Controlling Outbound DNS Access (TA15-240A) alert.

It turns out these risks are easily mitigated by expanding the caching setup above into a network-available DNS cache. Once you have this in place you can lock down UDP/53 and TCP/53 traffic leaving your network so that it is permitted only from one or a pair of these trusted caches, which gives you substantially restricted and monitored DNS access with minimal effort. Here is a slightly modified version of the above, adapted to use OpenDNS and allow local network machines to query. Dropping the listen-on line makes named listen on all interfaces, and BIND9's default recursion policy (localhost and localnets) lets machines on directly attached networks use it:

options {
  forwarders {
    208.67.222.222;
    208.67.220.220;
  };
};
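
If you would rather not lean on the defaults, the same policy can be spelled out explicitly. The following is only a sketch: the ACL name "trusted" and the 10.0.0.0/8 range are placeholders I have assumed for illustration, so replace them with your own internal networks.

```
// Sketch only: "trusted" and 10.0.0.0/8 are assumed placeholders
acl "trusted" {
  127.0.0.0/8;
  10.0.0.0/8;
};

options {
  // Listen on every interface rather than loopback only
  listen-on port 53 { any; };
  // Answer and recurse only for clients in the ACL
  allow-query { trusted; };
  allow-recursion { trusted; };
  forwarders {
    208.67.222.222;
    208.67.220.220;
  };
};
```

Stating the ACL explicitly makes it obvious who may use the cache, which helps when the cache sits on a network with more than one attached subnet.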