On October 28th, 2016, five out of seven nameservers for the `.io` top-level domain (TLD) stopped working. In this article, we’ll talk briefly about the `.io` TLD failure, then talk about how it affected Cloud Foundry (CF) as well as how we could potentially mitigate these failures in the future.
Five of the seven nameservers for the `.io` TLD stopped responding. This made many `.io` domains inaccessible for a few hours.
Pivotal Web Services (PWS) relies on the `.io` TLD for a few things, and the PWS CloudOps team noticed a few ways that users of PWS could have been affected.
`run.pivotal.io`) were possibly affected.
`api.run.pivotal.io`. This meant that the CF CLI was potentially unusable while the `.io` TLD was having issues.
`*.cfapps.io` domain names received “Cannot resolve host”-type errors.
`.io` domain in its connection string (p-mysql, for example), app owners would see connection errors.
Register applications to multiple domains with different TLDs. For example, register your application with both `example.io` and `example.com`. For TLS to work with this strategy, you will need to buy a certificate for each domain.
By default, when you run `cf push` you get `myapp` running at `myapp.example.io`.
To mitigate against TLD failures:

    # Application name: myapp
    # Apps Domain: example.io
    # Private Domain: example.com
    cf create-route my-space example.com --hostname myapp
    cf push
    cf map-route myapp example.com --hostname myapp
    # Here myapp would be running at myapp.example.io and myapp.example.com
At the Cloud Foundry level, configure the platform to use multiple TLDs with an extra shared domain and follow the steps above.
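The application-level steps above can be scripted when an app needs routes on more than two domains. The following is a minimal sketch, not part of the original procedure: `map_app_to_domains` and the `CF_BIN` variable are hypothetical names introduced here. It assumes the `cf` CLI is already logged in and targeting the right space, and that each domain already exists (a platform operator could add a shared one with `cf create-shared-domain`).

```shell
#!/bin/sh
# Hypothetical helper (not part of the cf CLI): map one app to the same
# hostname on several domains so that it stays reachable if one TLD fails.
# CF_BIN is an assumed override hook; set CF_BIN=echo to print the commands
# instead of running them.
CF_BIN="${CF_BIN:-cf}"

map_app_to_domains() {
  app="$1"
  shift
  for domain in "$@"; do
    # Same form as the map-route call above:
    #   cf map-route APP DOMAIN --hostname HOST
    "$CF_BIN" map-route "$app" "$domain" --hostname "$app" || return 1
  done
}
```

For example, `CF_BIN=echo map_app_to_domains myapp example.io example.com` prints the two `map-route` commands without contacting the platform, which is a cheap way to review them before running them for real.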
Assuming your CF installation uses wildcard DNS entries for your system and application domains, there are different ways to mitigate customer impact of a partial TLD nameserver failure on different IaaSes.
Increase the Time To Live (TTL) value for your DNS entries that have an A record with the public IP address for the CF domain. Increasing the TTL increases the time until cache invalidation on your [DNS caching servers](https://www.digitalocean.com/community/tutorials/a-comparison-of-dns-server-types-how-to-choose-the-right-dns-configuration#caching-dns-server). A time of four to six hours should work. The risk of this approach is that if you change your load balancer, a new IP address will be assigned, and the change will take approximately the same four to six hours to propagate worldwide. In a production system, this rarely happens.
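One way to confirm that a raised TTL is actually being served is to inspect the answer section that `dig` prints. The sketch below is an illustration only: `ttl_at_least` is a hypothetical helper, the 14400-second threshold is simply the four-hour lower bound mentioned above, and the hostnames in the usage comment are placeholders. It assumes `dig` and `awk` are available.

```shell
#!/bin/sh
# Hypothetical helper: reads `dig +noall +answer` output on stdin and succeeds
# only if at least one A record is present and every A record has a TTL of at
# least the given number of seconds.
# Answer lines have the form: NAME TTL CLASS TYPE RDATA
ttl_at_least() {
  awk -v min="$1" '
    $3 == "IN" && $4 == "A" { seen++; if ($2 + 0 < min + 0) low++ }
    END { if (seen == 0 || low > 0) exit 1 }
  '
}

# Usage (placeholder names; query an authoritative nameserver, because a
# caching resolver reports the remaining TTL rather than the configured one):
#   dig +noall +answer @ns1.example.com myapp.example.com A | ttl_at_least 14400
```

The function exits non-zero when no A record is found, so an empty or failed lookup is not mistaken for a passing check.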
Since Amazon Elastic Load Balancers (ELBs) don’t provide an IP address for your DNS entry, you need to [create an ALIAS record](http://docs.aws.amazon.com/Route53/latest/DeveloperGuide/resource-record-sets-choosing-alias-non-alias.html) for the DNS entry of your system and app domains. AWS doesn’t guarantee that the IP address of a load balancer will remain the same over that load balancer’s lifetime.
Internally, a load balancer points to one or more A records with a TTL of 60 seconds, so you cannot safely mitigate a partial TLD nameserver outage.
Here is an unsafe way to do this: replace your `ALIAS` record with `A` records that resolve from the load balancer’s `CNAME`, and give them the higher TTLs described above (example: my-custom-load-balancer.amazonaws-312312.com -> 126.96.36.199). If AWS changes the IP addresses, your application will experience longer downtime rather than intermittent failures. The safest option is to not do anything and wait for the issue to resolve itself.
Note: Some DNS caching servers do not honor TTLs.
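If you do pin `A` records as in the unsafe workaround above, it helps to detect when they drift from the addresses the load balancer currently uses. The following is a minimal sketch of such a check, not something the original procedure prescribes: `same_ip_set` is a hypothetical helper, and the hostnames in the usage comment are placeholders.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when two newline-separated IP lists contain
# the same addresses, ignoring order and duplicates.
same_ip_set() {
  [ "$(printf '%s\n' "$1" | sort -u)" = "$(printf '%s\n' "$2" | sort -u)" ]
}

# Usage (placeholder hostnames; assumes `dig +short` prints only IP addresses
# for both names):
#   lb_ips="$(dig +short my-load-balancer.example.net A)"
#   pinned_ips="$(dig +short myapp.example.com A)"
#   same_ip_set "$lb_ips" "$pinned_ips" || echo "pinned A records are stale" >&2
```

Running a check like this on a schedule shortens the window in which stale pinned records go unnoticed, but it does not remove the underlying risk.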
DNS is a highly available, heavily cached system, and top-level domain failures are rare. Additionally, most cases of DNS server failure do not affect every request. For more information about the Domain Name System, refer to RFC 1034 and RFC 1035.