Apple goes down for 12 hours because of a DNS error
Before I discuss what happened to Apple, I think a quick refresher about DNS is in order for those who are not familiar…
When you access a website using a domain name, such as “www.digitalruby.com”, there is a mechanism in place to find the computer (the web server) that stores the information you are asking for. This system is called DNS (Domain Name System), or as my 6-year-old daughter has learned it, the Internet phone book. Think of DNS as a gigantic list of domain names and web server IP addresses, i.e. people's names and their phone numbers.
For example, for www.digitalruby.com you would look in this phone book and find this entry:
www.digitalruby.com -> 220.127.116.11 (5 minutes).
Those 4 numbers separated by dots are called the IP address, and they tell your computer where to find the computer that is running my website. The thousands of routers all over the world do not need the phone book themselves; they use the IP address to pass your request along a chain of routers, computers and switches until it eventually arrives at my server. My server takes your request, finds the data, and sends it back to your computer in a similar fashion.
The 5 minutes in parentheses tells your computer that it should not ask for the IP address again until 5 minutes have passed. This is called caching, and the 5 minutes is the entry's TTL (time to live). Caching reduces the Internet traffic caused by computers repeatedly asking for the IP address of the same domain name.
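Here is a rough sketch of that caching idea in Python. It is not how any real resolver is written; the class name `TtlCache`, the fake phone book, the example address 192.0.2.10 and the hand-cranked clock are all made up for illustration, so you can watch the 5 minutes expire without waiting.

```python
import time


class TtlCache:
    """Tiny DNS-style cache: remembers an answer until its TTL expires."""

    def __init__(self, resolver, clock=time.monotonic):
        self._resolver = resolver  # function: name -> (ip, ttl_seconds)
        self._clock = clock        # injectable so the demo can fake time
        self._entries = {}         # name -> (ip, expires_at)

    def lookup(self, name):
        now = self._clock()
        cached = self._entries.get(name)
        if cached is not None and now < cached[1]:
            return cached[0]       # still fresh: do not ask again
        ip, ttl = self._resolver(name)
        self._entries[name] = (ip, now + ttl)
        return ip


# Hypothetical phone book: one entry with a 5 minute (300 second) TTL.
PHONE_BOOK = {"www.example.com": ("192.0.2.10", 300)}
calls = []


def fake_resolver(name):
    calls.append(name)             # count how often the phone book is consulted
    return PHONE_BOOK[name]


class FakeClock:
    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now


clock = FakeClock()
cache = TtlCache(fake_resolver, clock)

cache.lookup("www.example.com")    # first request: has to ask
clock.now = 299
cache.lookup("www.example.com")    # 4:59 later: answered from the cache
clock.now = 300
cache.lookup("www.example.com")    # 5:00 later: entry expired, asks again
print(len(calls))                  # 2
```

Three requests, but the phone book was only consulted twice. The flip side is that a wrong answer gets remembered for just as long as a right one.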
Isn’t technology neat?
Ok, back to Apple. On March 11, 2015, Apple had a twelve-hour outage that brought down the App Store and iTunes. Think of that: Apple did not make any money from app downloads, in-app purchases or iTunes media for half a day. That’s certainly a multi-million dollar mistake given that millions of purchases are made per day. Apple blamed an internal DNS error, which most likely means somebody entered something wrong in the Internet phone book. A common value people use for DNS caching is 12 hours. This means that computers will not look in the Internet phone book again until 12 hours have passed, which is about how long Apple was down.
Apple is a giant company with lots of new employees, so it’s amazing that it doesn’t have more issues like this. Many of the most experienced employees are probably off in architecture roles or starting their own companies, so it’s a challenge to keep a company running smoothly. Technology is incredibly hard to do right and takes years of practice and effort, and when you have “newbies” coming in, or even veterans who forget something, it’s easy to have a big problem.
The moral of the story? Leave your DNS caching time (the TTL, or time to live) at 5 minutes and sleep easier at night knowing that if there is a problem, it can be fixed everywhere about 5 minutes after you correct it instead of 12 hours or more. When you know you are going to be moving your web server, lower your TTL to the smallest value possible ahead of time, at least one old TTL before the move so the old value has expired from caches, so that downtime is even smaller.
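To put numbers on that, here is a back-of-the-envelope calculation in Python. The function name and the 10 minutes it takes to notice and correct the bad entry are assumptions I picked for illustration, not figures from Apple. The reasoning is that the unluckiest computer caches the bad entry right before you fix it, and then holds on to it for one full TTL.

```python
def worst_case_outage_minutes(ttl_minutes, minutes_to_fix):
    """Upper bound on how long a computer can keep using a bad DNS entry.

    The unluckiest computer caches the bad entry just before the fix,
    then keeps it for one full TTL.
    """
    return minutes_to_fix + ttl_minutes


MINUTES_TO_FIX = 10  # assumed time to notice and correct the entry

print(worst_case_outage_minutes(5, MINUTES_TO_FIX))        # 15
print(worst_case_outage_minutes(12 * 60, MINUTES_TO_FIX))  # 730
```

With a 5 minute TTL, everyone is back within about a quarter of an hour. With a 12 hour TTL, the same 10 minute mistake can linger for more than 12 hours.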
Thanks for reading, have a wonderful day!