It's Always DNS: The February 2018 Issue
There’s this “It’s Always DNS” meme that often goes round the Interwebs, somewhat bolstered by the Twitter outages that occurred (DNS, of course) in late 2016. I haven’t had that many DNS problems in my line of work, but the other day I saw something that made me jump right on the “It’s always DNS” bandwagon.
So I was setting up a test instance of an internal application at work, as one often does. In recent times, Docker and its associated tools have made this a breeze. (However, they can sometimes suddenly move beneath your feet in ways you don’t expect. More on this in a later post.) The application was set to authenticate over LDAP.
I found that the initial login and page load were taking on the order of 5 to 7 seconds! This Netscapesque login time was absolutely not acceptable, so I went spelunking. After setting up logging to measure how long it took to render the page (helped along by this excellent post), it was clear that the problem was not with the page, but with how long it took to authenticate the user against the LDAP server.
Bingo! So I tried authenticating directly with the LDAP server using ldapsearch, and saw a minute delay. Here, a coworker dropped in and reminded me that DNS was likely to be an issue, so I dug a bit deeper, and played around with DNS servers.
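One way to separate name resolution from the rest of the LDAP round trip is to time the lookup on its own. The following is only a minimal sketch of that idea in Python, not what I actually ran. The function name time_lookup and its resolver parameter are made up for illustration; the default resolver is the standard library's socket.getaddrinfo.

```python
import socket
import time


def time_lookup(host, resolver=socket.getaddrinfo):
    """Return the number of seconds spent resolving `host`.

    `resolver` is injectable (a hypothetical convenience) so the timing
    logic can be exercised without touching the network.
    """
    start = time.perf_counter()
    try:
        resolver(host, None)
    except OSError:
        # A failed lookup is still worth timing: a slow failure is
        # exactly the symptom being hunted here.
        pass
    return time.perf_counter() - start


# Example call (the hostname is a placeholder, not the real server):
#   print(f"{time_lookup('ldap.example.internal'):.2f}s")
```

If the number printed is close to a multiple of the resolver timeout, the nameserver list is a good place to look next.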
It turned out that my hastily set up VM, which was hosting the Docker containers for the application, had its primary DNS server set incorrectly! So /etc/resolv.conf looked something like this:
nameserver a.b.c.d
nameserver x.y.z.w
Because the primary nameserver was wrong, the auth module I was using would try to resolve the LDAP server's name via that nameserver, wait for the query to time out, and then fall back to the secondary before finally authenticating. (man resolv.conf showed the default timeout as 5s, which tallied quite well with the observation.) Simply removing the offending a.b.c.d fixed the problem and made the authentication and subsequent page load much snappier.
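The arithmetic behind that fallback can be sketched in a few lines of Python. This is a deliberately simplified model, assuming the resolver walks the nameserver list in order and waits one full timeout on each dead entry during its first pass; it ignores retries and the rotate option. The helper names and the addresses (taken from the documentation ranges) are placeholders, not the values from my VM.

```python
def nameservers(resolv_conf_text):
    """Return nameserver addresses in the order they are listed."""
    servers = []
    for line in resolv_conf_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers


def delay_before_answer(servers, unreachable, timeout=5):
    """Seconds spent waiting on dead servers before a live one is reached.

    Models only the first pass through the list. Returns None when no
    listed server is reachable.
    """
    waited = 0
    for server in servers:
        if server not in unreachable:
            return waited
        waited += timeout
    return None


# Placeholder addresses standing in for a.b.c.d and x.y.z.w.
conf = """\
nameserver 192.0.2.1
nameserver 198.51.100.1
"""
servers = nameservers(conf)
print(delay_before_answer(servers, unreachable={"192.0.2.1"}))      # 5
print(delay_before_answer(servers[1:], unreachable={"192.0.2.1"}))  # 0
```

With the dead entry first, every fresh lookup pays the 5s timeout; with it removed, the wait disappears, which matches what I saw.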
Lessons learnt:
Don’t set up staging areas in a hurry; try to keep them as well-defined and reproducible as possible. Abstractions like Docker do not save you from differing network conditions.
It’s always a good idea to track metrics such as page load times. This was an old-ish application, but taking the time to instrument it made the troubleshooting much easier.
The timeout, rotate and attempts options in resolv.conf are all worth reading about, and quite important.
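To see what those three options are currently set to, one could parse the options line of a resolv.conf. The sketch below does that in Python. The defaults of 5 seconds and 2 attempts are the ones I recall from the glibc man page, so treat them as an assumption and check your own system; the function name resolver_options and the sample address are illustrative only.

```python
# Assumed glibc defaults; verify against `man resolv.conf` on your system.
DEFAULTS = {"timeout": 5, "attempts": 2, "rotate": False}


def resolver_options(resolv_conf_text):
    """Return the effective timeout, attempts and rotate settings."""
    opts = dict(DEFAULTS)
    for line in resolv_conf_text.splitlines():
        fields = line.split()
        if not fields or fields[0] != "options":
            continue
        for item in fields[1:]:
            # Options look like "timeout:1" or a bare flag such as "rotate".
            name, _, value = item.partition(":")
            if name in ("timeout", "attempts") and value.isdigit():
                opts[name] = int(value)
            elif name == "rotate":
                opts["rotate"] = True
    return opts


sample = "nameserver 192.0.2.1\noptions timeout:1 attempts:3 rotate\n"
print(resolver_options(sample))
# {'timeout': 1, 'attempts': 3, 'rotate': True}
```

A shorter timeout would have hidden my misconfiguration rather than fixed it, so it is worth knowing what each option does before reaching for it.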
❧ Please send me your suggestions, comments, etc. at comments@mandarg.com