Why did Google’s ‘incorrect settings’ cause a serious network failure in Japan?


Google Inc has issued an apology for the large-scale communication failure that occurred around noon on Aug 25 in Japan, saying that its incorrect settings caused the failure.

“Due to incorrect network settings, a failure that makes it difficult to access Internet services occurred,” the company said. “We apologize for the inconvenience and concern.”

From the beginning, many experts suspected that a large amount of route information sent from Google had triggered the failure. That turned out to be true.

NTT Communications Corp and KDDI Corp, along with the companies and individuals using their communication services, were especially hard hit. The failure affected Internet connections, Internet-related services, financial transactions, and payment services such as Mobile Suica.

However, Google did not clarify whether the “incorrect network settings” resulted from human error or from defects in software or devices. It also said, “the information was updated within eight minutes.”

So, only eight minutes of “incorrect network settings” had a great impact on Japan’s communication infrastructure. Why did this happen? Could it happen again?

Google owns data centers around the world and runs a gigantic network connecting them. Companies and communication carriers that operate such large-scale networks exchange “route information” so that traffic can pass between them. For this purpose, “BGP (Border Gateway Protocol)” is used. The Internet exists because of this interconnection of large-scale networks.

In the accident, the route information was wrong, and communication routes changed for some services.

“A larger number of route changes occurred at other carriers connected to OCN (an Internet service run by NTT Communications),” NTT Communications said in an interview with ITPro.

Specifically, it seems that Google was peering (exchanging network route information on an equal footing) with OCN and transmitted a large amount of incorrect route information at around 12:22 p.m. on Aug 25. The amount of data seems to have been larger than all the route information (full route) of the Internet.

In general, the full route contains about 650,000 routes. The number of routes erroneously transmitted this time was “70,000 larger” or “100,000 larger” than that, according to two different carriers.

How did they determine that Google sent the incorrect route information? The route information exchanged via BGP includes the AS number allocated to Google (AS15169).
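As a rough illustration of that kind of attribution, the sketch below scans a handful of BGP updates for AS15169 in the AS path. Only the AS number comes from the report. The update records, the prefixes and the other AS numbers (taken from ranges reserved for documentation), and the function names are all hypothetical.

```python
GOOGLE_ASN = 15169  # AS number cited in the report


def involves_as(as_path, asn):
    """Return True if the given AS number appears anywhere in the AS path."""
    return asn in as_path


def routes_via(updates, asn):
    """Collect the prefixes whose AS path contains the given AS number."""
    return [u["prefix"] for u in updates if involves_as(u["as_path"], asn)]


# Hypothetical updates; prefixes and neighbor AS numbers are documentation values.
sample_updates = [
    {"prefix": "192.0.2.0/24", "as_path": [64500, 15169, 64501]},
    {"prefix": "198.51.100.0/24", "as_path": [64500, 64502]},
    {"prefix": "203.0.113.0/24", "as_path": [15169]},
]

flagged = routes_via(sample_updates, GOOGLE_ASN)
```

In this made-up sample, the first and third prefixes are flagged because AS15169 appears in their paths. Real analysis would work from routers' BGP logs or public route collectors rather than a hand-written list.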

Why did it become impossible to connect to the Internet?

The large-scale connection failure is thought to have occurred because the route information was both voluminous and erroneous. A large amount of route information takes up too much of the memory of the communication devices (routers) that exchange information via BGP. The routers of some carriers hung, making it impossible to transfer data.

Also, the route information erroneously transmitted by Google may have contained a large amount of incorrect route information related to NTT Communications, according to a major Internet service provider (ISP).

NTT Communications is a so-called “Tier 1 provider” and owns broadband IP backbones on a global scale. There are about 10 Tier 1 providers in the world, and they hold route information that enables routing to anywhere on the Internet. When the routing information for NTT Communications changes, it affects many ISPs.

Changing routes does not mean complete disconnection from the Internet. But, due to circuitous routes, the responses of services and applications become slow, making users feel “disconnected.”

In fact, from 12:22 to 12:45 on August 25, the transmission of the large amount of incorrect route information seemingly disconnected the communication route between OCN and Google. It is highly likely that Internet accesses related to this peering were affected.

Google claimed that it updated the information within eight minutes. It takes some time for correct route information to propagate, but that alone is unlikely to cause a problem lasting as long as the communication failure on August 25. So, it can be deduced that the failure occurred because of the heavy load on routers.

“I guess that some carriers took time in restoring their routers because bugs were caused by the heavily occupied memory,” a major ISP said.

Malicious hijacking can happen

There have been earlier cases in which incorrect route information was transmitted due to operational mistakes, so this is not the first time.

ISPs are making efforts to (1) filter route information that is too detailed and can affect the operation of routers and (2) check the validity of route information by using a service called “IRR” (Internet Routing Registry).
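The two checks can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not a description of how any carrier configures its routers: the /24 cutoff for IPv4, the in-memory dictionary standing in for an IRR, the documentation prefixes and AS numbers, and every function name are hypothetical.

```python
import ipaddress

# Assumption: IPv4 prefixes longer than /24 count as "too detailed".
MAX_PREFIX_LEN = 24

# Hypothetical stand-in for an IRR: registered block -> AS allowed to originate it.
REGISTRY = {
    ipaddress.ip_network("192.0.2.0/24"): 64500,
    ipaddress.ip_network("198.51.100.0/24"): 64501,
}


def too_specific(prefix):
    """Check (1): reject prefixes more detailed than the assumed cutoff."""
    return prefix.prefixlen > MAX_PREFIX_LEN


def registered_origin(prefix, origin_asn):
    """Check (2): the prefix lies inside a block registered to origin_asn."""
    return any(
        prefix.subnet_of(block) and asn == origin_asn
        for block, asn in REGISTRY.items()
    )


def accept(prefix_text, origin_asn):
    """Accept a route only if it passes both checks."""
    prefix = ipaddress.ip_network(prefix_text)
    return not too_specific(prefix) and registered_origin(prefix, origin_asn)
```

For example, `accept("192.0.2.0/24", 64500)` passes both checks, while `accept("192.0.2.0/25", 64500)` fails the length check and `accept("198.51.100.0/24", 64500)` fails the registry check because the block is registered to a different AS.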

However, in the latest case, route information that covers a wide area was erroneously transmitted. And measures such as the IRR service did not prevent the large-scale failure.

Furthermore, malicious “route hijacking” can occur. For that reason, ICT-ISAC, JPNIC (Japan Network Information Center) and others, chiefly major ISPs, have been working to operate a monitoring system called keiro bugyo (route magistrate) with the aim of detecting route hijacking. Preventing a recurrence is a major issue for carriers.