Understanding How the Internet Works from Google's Downtime

Translator's note: CloudFlare, mentioned in this article, is a content delivery network (CDN) service company founded in 2009 by three developers from Project Honey Pot. In October 2011, the Wall Street Journal named it the most innovative network technology company.

Today, Google's services suffered a brief outage that lasted about 27 minutes and affected Internet users in some regions. The reason it happened takes us into a deep, dark corner of the Internet. I am a network engineer at CloudFlare, and I played a part in helping Google recover from this outage. Here is what happened.

At about 6:24 pm Pacific Standard Time on November 5, 2012 (2:24 am UTC on November 6, 2012), CloudFlare's employees noticed that Google's services were down. We use Google's email and other services, so when they stop working, the office notices quickly. I work on the network engineering team, so I immediately got online to find out what was going on: was it a local problem or a global one?

Problem investigation

I quickly realized that we could not reach any of Google's services, not even 8.8.8.8, Google's public DNS server, so I started by tracing DNS.

$ dig +trace google.com

Below is the reply I got when I queried the name servers for google.com:

google.com.    172800    IN    NS    ns2.google.com.
google.com.    172800    IN    NS    ns1.google.com.
google.com.    172800    IN    NS    ns3.google.com.
google.com.    172800    IN    NS    ns4.google.com.
;; Received 164 bytes from 192.12.94.30#53(e.gtld-servers.net) in 152 ms
;; connection timed out; no servers could be reached

The fact that no server could be reached showed that something was wrong. Specifically, it meant that none of Google's DNS servers were reachable from our office.

I then started probing the network path itself to see whether that was where the problem lay.

PING 216.239.32.10 (216.239.32.10): 56 data bytes
Request timeout for icmp_seq 0
92 bytes from 1-1-15.edge2-eqx-sin.moratelindo.co.id (202.43.176.217): Time to live exceeded

Something here was strange. Normally, we should not see the name of an Indonesian network service provider (Moratel) on the path to Google. I immediately logged in to one of CloudFlare's routers to see what was happening. Meanwhile, reports on Twitter from other parts of the world showed that we were not the only ones having problems.

Internet routing

To understand what went wrong, you need to know some basics of how the Internet works. The Internet is made up of many networks, called "Autonomous Systems" (AS). Each network has a unique number that identifies it, known as its AS number. CloudFlare's AS number is 13335; Google's is 15169. The networks are connected to one another by a technology called the Border Gateway Protocol (BGP). BGP is often called the glue of the Internet: through it, networks announce which IP addresses belong to them and establish the routes from one autonomous network to another. An Internet "route" is exactly what the word suggests: a path from an IP address in one autonomous network to an IP address in another autonomous network.
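The idea that each network announces which addresses belong to it can be sketched as a small lookup table. This is only an illustration, not how a real router stores routes: the two Google prefixes and AS 15169 come from the router output quoted later in this article, the `192.0.2.0/24` prefix and AS 64500 are reserved documentation values used here as a hypothetical third network, and `ANNOUNCEMENTS` and `origin_for` are made-up names.

```python
import ipaddress

# Prefix -> AS number that announces it.
ANNOUNCEMENTS = {
    "8.8.8.0/24": 15169,       # Google public DNS (from the router output)
    "216.239.34.0/24": 15169,  # Google name server range (from the router output)
    "192.0.2.0/24": 64500,     # hypothetical network, documentation values
}


def origin_for(ip, announcements=ANNOUNCEMENTS):
    """Return the AS announcing the most specific prefix that contains `ip`.

    Returns None when no announced prefix covers the address.
    """
    address = ipaddress.ip_address(ip)
    best = None  # (network, asn) of the longest matching prefix so far
    for prefix, asn in announcements.items():
        network = ipaddress.ip_network(prefix)
        if address in network:
            if best is None or network.prefixlen > best[0].prefixlen:
                best = (network, asn)
    return best[1] if best else None


google_dns_origin = origin_for("8.8.8.8")
unknown_origin = origin_for("10.0.0.1")
```

Under these assumptions, `google_dns_origin` is 15169 and `unknown_origin` is `None`, because nothing in the table covers that address.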

BGP is built on a system of mutual trust. Each network trusts what the other networks tell it about which IP addresses belong to which network. When you send a packet or make a request across the network, your network service provider consults its upstream providers or peers to find the shortest route from itself to the destination network.

Unfortunately, if a network announces that an IP address or network is inside it when that is not true, and its upstream or peer networks trust the announcement, packets end up misrouted and lost. That is what happened here.
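A toy model makes the trust problem concrete. The sketch below is a heavy simplification: a real BGP router weighs several attributes (local preference, for example) before AS path length, and the `Router` class is invented for illustration. AS numbers 64500 to 64502 are reserved documentation values standing in for hypothetical networks; 13335, 15169, and the 8.8.8.0/24 prefix come from this article.

```python
class Router:
    """A router that believes every announcement it hears."""

    def __init__(self, asn):
        self.asn = asn
        self.table = {}  # prefix -> AS path (list of AS numbers, origin last)

    def receive(self, prefix, as_path):
        """Accept an announcement without verifying who owns the prefix.

        The only rule is "shorter AS path wins", so nothing stops a
        neighbour from claiming a prefix that is not its own.
        """
        current = self.table.get(prefix)
        if current is None or len(as_path) < len(current):
            self.table[prefix] = list(as_path)

    def next_as(self, prefix):
        """Return the neighbouring AS that traffic for `prefix` is handed to."""
        return self.table[prefix][0]


office = Router(13335)

# Legitimate route: two hypothetical transit networks, then Google.
office.receive("8.8.8.0/24", [64500, 64501, 15169])
legit_next = office.next_as("8.8.8.0/24")

# A hypothetical neighbour wrongly claims the prefix as its own.
office.receive("8.8.8.0/24", [64502])
bogus_next = office.next_as("8.8.8.0/24")
```

After the second announcement the router hands Google-bound traffic to AS 64502, which cannot deliver it. In today's incident the bad path still ended at Google's AS, as the router output below shows, but the underlying weakness is the same: the announcement was accepted on trust.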

I looked at the BGP routes for one of Google's IP addresses, and the route pointed through Moratel (AS 23947), an Indonesian network service provider. Our office is in California, not far from Google's data centers, so our packets should never pass through Indonesia. Most likely, Moratel was announcing a route it should not have. The BGP route I saw was:

[email protected]> show route 216.239.34.10
inet.0: 422168 destinations, 422168 routes (422154 active, 0 holddown, 14 hidden)
+ = Active Route, - = Last Active, * = Both
216.239.34.0/24    *[BGP/170] 00:15:47, MED 18, localpref 100
                      AS path: 4436 3491 23947 15169 I
                    > to 69.22.153.1 via ge-1/0/9.0

I looked at other routes, such as the one for Google's public DNS, and they too had been hijacked onto the same (incorrect) path:

[email protected]> show route 8.8.8.8
inet.0: 422196 destinations, 422196 routes (422182 active, 0 holddown, 14 hidden)
+ = Active Route, - = Last Active, * = Both
8.8.8.0/24         *[BGP/170] 00:27:02, MED 18, localpref 100
                      AS path: 4436 3491 23947 15169 I
                    > to 69.22.153.1 via ge-1/0/9.0
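The AS path in both outputs is what gives the problem away, and reading it can be sketched in a few lines. The helper names below are hypothetical; the path string and the AS numbers for Google (15169) and Moratel (23947) are taken from the output above. The trailing `I` is the route's origin code, not an AS number, so the parser skips it.

```python
def parse_as_path(text):
    """Turn 'AS path' text such as '4436 3491 23947 15169 I' into a list of ints."""
    return [int(token) for token in text.split() if token.isdigit()]


def origin_as(as_path):
    """The last AS in the path is the one that originated the route."""
    return as_path[-1]


def upstream_of_origin(as_path):
    """The AS that hands traffic to the origin, or None for a one-hop path."""
    return as_path[-2] if len(as_path) >= 2 else None


GOOGLE = 15169
MORATEL = 23947

path = parse_as_path("4436 3491 23947 15169 I")
still_google = origin_as(path) == GOOGLE
via_moratel = upstream_of_origin(path) == MORATEL
```

Both flags come out true: the route still claims Google as its origin, but traffic reaches it only by way of Moratel, which is why packets from California were heading toward Indonesia.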

Route leaks

Problems like this are known in the industry as "route leaks": routes that have "leaked" beyond where they should normally go. This is not unprecedented. Google has suffered a similar outage before. On that occasion, Pakistan was reportedly trying to block a video on YouTube, and the Pakistani ISP null-routed the YouTube website's IP addresses. Unfortunately, that route leaked externally. Pakistan Telecom's upstream provider, PCCW, trusted what Pakistan Telecom sent it and passed the route on to the entire Internet. The incident left YouTube unreachable for about 2 hours.

What happened today was similar. Someone at Moratel probably "fat-fingered" a configuration and announced the wrong Internet routes. PCCW, Moratel's upstream provider, trusted Moratel and accepted the routes it was passed. Soon the bad routes had spread across the Internet. Given the trust model built into BGP, this looks less like malicious behavior and more like a misconfiguration or mistake.

The fix

The solution was to get Moratel to stop announcing the wrong routes. For a network engineer, especially one working at a large network company like CloudFlare, a big part of the job is keeping in touch with network engineers around the world. Once I had worked out the problem, I contacted a colleague at Moratel and told him what was happening. He was able to fix the problem at about 6:50 pm Pacific Standard Time (2:50 am UTC). Three minutes later, routing returned to normal and Google's services were working again.

Looking at network maps, I estimate that 3-5% of all Internet users were affected by this outage. The hardest hit was Hong Kong, because PCCW is headquartered there. If you could not reach Google's services at the time, you now know why.

Building a better Internet

I am telling this story to show how the Internet is built on a mechanism of mutual trust. Today's incident shows that even if you are a company as big as Google, outside factors beyond your control can affect your users and keep them from reaching you. That is why it is essential to have a network engineering team that monitors routes and manages your connectivity with the rest of the world. CloudFlare's daily work is making sure our customers get the best routes. We keep watch over the websites on our network to make sure they are served as fast as possible. Today's events were just a small slice of what we do.