Connectivity issue on Sunday, August 30, 2020
On Sunday, August 30, 2020, one of the world’s largest connectivity providers (CenturyLink/Level3) suffered a major outage of its communications network. This outage affected not only CenturyLink/Level3 customers but also other services and providers throughout the Internet.
While we are still waiting for a report from CenturyLink/Level3 specifying the origin of the outage, how it was remedied, and what measures are being taken to prevent a similar situation from occurring again, we would like to share our own report on how the incident was handled at Clouding.io.
Background
At Clouding.io we work with multiple connectivity providers to ensure the highest availability and access speed to the servers hosted on Clouding.io from any geographical location.
We add or remove providers from our connectivity “pool” from time to time depending on their quality and reliability.
CenturyLink/Level3 was added to our platform approximately two years ago because, among other things, it provides excellent connectivity to all Movistar customers in Spain.
During the last two years, and until this incident, it has been the best provider in our connectivity “pool” and the one that currently carries the most traffic to and from Clouding.io.
Incident of August 30th, 2020
At 12:18 PM CEST – local time in Spain – our monitoring systems detected the first connectivity issues. Our monitoring system works both from Clouding.io to the Internet – checking the connectivity from our platform to the Internet – and from the Internet to Clouding.io.
The external monitoring of the Clouding.io platform’s connectivity is done through a monitoring provider, which checks connectivity once per second from multiple geographical locations.
The external monitoring system began reporting intermittent connectivity problems at 12:18:31 PM CEST, which cleared after 2 or 3 minutes.
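As an illustration of this kind of external check, the sketch below probes a single endpoint once per second and alerts when the TCP handshake fails. It is only a minimal sketch: the target address, timeout, and alerting are hypothetical placeholders, not our monitoring provider’s actual implementation.

```python
"""Minimal sketch of an external connectivity probe.
The endpoint, interval and threshold are hypothetical placeholders."""
import socket
import time

TARGET = ("203.0.113.10", 443)   # hypothetical public endpoint (documentation address)
TIMEOUT = 2.0                    # seconds before a probe is considered failed
INTERVAL = 1.0                   # probe once per second, as described above

def probe(target, timeout):
    """Return the round-trip time of a TCP handshake, or None on failure."""
    start = time.monotonic()
    try:
        with socket.create_connection(target, timeout=timeout):
            return time.monotonic() - start
    except OSError:
        return None

if __name__ == "__main__":
    while True:
        rtt = probe(TARGET, TIMEOUT)
        if rtt is None:
            print(f"{time.strftime('%H:%M:%S')} ALERT: no connectivity to {TARGET[0]}")
        else:
            print(f"{time.strftime('%H:%M:%S')} OK: handshake in {rtt * 1000:.1f} ms")
        time.sleep(INTERVAL)
```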
Upon receiving the first alert, one of our technicians reviewed the traffic flow to the various providers and detected that no traffic was being sent to or received from CenturyLink/Level3, even though the BGP session was still active.
This type of situation should not occur: in the event of a provider failure, the expected behavior is that the BGP session drops and the provider is automatically deactivated.
What happened instead is often called “BGP blackholing”, a situation in which a provider stops routing traffic correctly without ever dropping the BGP session.
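As a rough illustration of the anomaly our technician saw, the snippet below flags a provider whose BGP session reports as established while its traffic counters stay essentially flat. The per-provider statistics and the threshold are hypothetical; in practice these values would come from the edge routers’ telemetry, not from a hard-coded dictionary.

```python
"""Sketch of spotting possible BGP blackholing: session up, traffic flat.
The data below is a hypothetical snapshot, not real router telemetry."""

def looks_blackholed(established: bool, rx_bytes: int, tx_bytes: int,
                     floor_bytes: int = 1_000_000) -> bool:
    """Flag a provider whose session is up but that is moving almost no traffic.
    floor_bytes is an arbitrary illustrative threshold."""
    return established and rx_bytes < floor_bytes and tx_bytes < floor_bytes

# Hypothetical per-provider counters over the last minute
providers = {
    "centurylink_level3": {"established": True, "rx": 1_200, "tx": 900},
    "provider_b": {"established": True, "rx": 5_400_000_000, "tx": 4_100_000_000},
}

for name, s in providers.items():
    if looks_blackholed(s["established"], s["rx"], s["tx"]):
        print(f"ALERT: {name} session is up but carrying no traffic (possible blackholing)")
```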
In these exceptional situations, our protocol is very straightforward: we manually deactivate the BGP session with the affected provider and contact them to find out what has happened. Once the provider confirms to us that the situation has been resolved, a maintenance window is scheduled to reactivate connectivity with the provider.
As a precaution, connectivity is usually restored during low-load hours, typically after midnight, to avoid connectivity problems in case the provider has not properly resolved the incident.
Clouding.io’s pool of connectivity providers is sized in such a way that even with only one active provider, we have more than enough bandwidth to avoid any bandwidth bottleneck.
Why was the incident not resolved at that point?
Well, to a certain extent this is something we are still asking ourselves. We are waiting for a detailed report from CenturyLink/Level3, but the Clouding.io team has been able to gather a lot of information.
According to the information we have been able to gather, CenturyLink/Level3 continued to announce our IPs to the rest of the Internet, even though we had manually closed the BGP session with them.
This is something that should never happen, as a provider should only announce our addresses to other providers for as long as we are announcing them to it.
Once we manually closed the BGP session from Clouding.io to CenturyLink/Level3, CenturyLink/Level3 should have stopped announcing Clouding.io’s IP addresses to the rest of the Internet, so that traffic would be redirected through our other providers.
However, CenturyLink/Level3 is a Tier 1 provider, one of the largest connectivity providers, with extensive worldwide presence. Because it continued to announce our IP addresses to the rest of the Internet, some of the inbound traffic to Clouding.io was not redirected to the other providers we work with and kept trying to go through the inoperative CenturyLink/Level3 network.
What did we do from Clouding.io to solve the problem?
The first step was to identify the origin of the problem, for which we had to contact other connectivity providers to see what route advertisements they were receiving.
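For readers curious how this can also be checked from the outside, the sketch below queries the public RIPEstat looking-glass data endpoint to list the AS paths that route collectors currently see towards a prefix. The prefix is a placeholder and the response fields shown are our assumption about the API’s shape, so treat this as an illustration rather than the exact tooling we used.

```python
"""Sketch of checking which upstreams are still seen announcing a prefix,
using the public RIPEstat looking-glass endpoint. Placeholder prefix; the
response fields used below are assumptions and may differ in practice."""
import json
import urllib.parse
import urllib.request

PREFIX = "203.0.113.0/24"  # placeholder, not a real Clouding.io prefix
URL = ("https://stat.ripe.net/data/looking-glass/data.json?resource="
       + urllib.parse.quote(PREFIX, safe=""))

with urllib.request.urlopen(URL, timeout=10) as resp:
    data = json.load(resp)

# Print the AS paths observed by the route collectors, to see whether the
# failed provider's ASN still appears in them.
for rrc in data.get("data", {}).get("rrcs", []):
    for peer in rrc.get("peers", []):
        print(rrc.get("location"), peer.get("as_path"))
```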
At the same time, we opened incidents with all our other connectivity providers, to ensure that they were aware of the situation and that they also took mitigating measures.
Once we detected that many connectivity providers were still receiving announcements of our addresses from CenturyLink/Level3, we began to make changes in the way we announced our IP addresses to our other connectivity providers.
The BGP protocol usually chooses the most specific announcement. That is, if, for example, one announcement covers a group of 2048 IPs (a /21) and another a group of 1024 IPs (a /22), routers will give preference, except in some cases where the traffic path is forced, to the more specific announcement, i.e. the 1024 IPs.
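A minimal worked example of that preference is sketched below; the prefixes and next hops are placeholders chosen for illustration, not Clouding.io’s real address space.

```python
"""Worked example of longest-prefix match: a destination covered by both a
/21 (2048 addresses) and a /22 (1024 addresses) is forwarded along the more
specific /22. Prefixes and paths are placeholders."""
import ipaddress

# Two overlapping announcements for the same address space:
routes = [
    (ipaddress.ip_network("10.10.0.0/21"), "stale path via the failed provider"),  # 2048 IPs
    (ipaddress.ip_network("10.10.0.0/22"), "working path via another provider"),   # 1024 IPs
]

destination = ipaddress.ip_address("10.10.1.25")

# A router keeps only the routes covering the destination and picks the one
# with the longest prefix, i.e. the most specific announcement.
candidates = [(net, path) for net, path in routes if destination in net]
best_net, best_path = max(candidates, key=lambda r: r[0].prefixlen)
print(f"{destination} -> {best_path} ({best_net})")
```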
Therefore, the first mitigating measure was to change all of our IP advertisements to more specific ones, to supersede the ones that CenturyLink/Level3 was still announcing. This procedure is complex, since it involves reconfiguring our Edge Routers and in some cases must be coordinated with the different providers, since they must also make changes to their announcement filters.
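The core of that measure is splitting an aggregate range into more specific sub-prefixes so that they win route selection over the stale aggregate still being leaked. A minimal sketch of that first step is below; the /21 is a placeholder, and generating the list is only part of the work, since the edge routers and the providers’ filters still have to be updated.

```python
"""Sketch of generating more-specific announcements from an aggregate prefix.
The aggregate below is a placeholder, not one of Clouding.io's real ranges."""
import ipaddress

# Aggregate prefix still being (wrongly) announced by the failed provider.
aggregate = ipaddress.ip_network("10.10.0.0/21")

# Announce the same address space as /24s: each /24 is more specific than the
# /21, so the rest of the Internet prefers these routes via our other providers.
more_specific = list(aggregate.subnets(new_prefix=24))

for prefix in more_specific:
    # In practice each prefix would be added to the edge routers' BGP export policy.
    print(f"announce {prefix}")
```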
This strategy allowed us to recover connectivity to most of the Internet, although it was not a complete solution, as some connectivity providers outside Clouding.io were still trying to send us traffic via CenturyLink/Level3.
What was the exact timeline of the incident?
It is quite difficult to identify the exact timeline, as this incident affected a multitude of Internet providers and services, including the external monitoring system that we used, but according to the information we were able to gather, the approximate timeline was:
12:18 PM CEST – The first connectivity issues were detected.
12:21 PM CEST – We manually deactivated the connectivity with CenturyLink/Level3.
12:45 PM CEST (approx.) – We started to publish new BGP announcements, progressively recovering more connectivity.
12:30 to 2:30 PM CEST – Several providers worldwide manually disabled their BGP sessions with CenturyLink/Level3 to remove the false announcements, a process that helped us recover even more connectivity.
4:00 PM CEST (approx.) – CenturyLink/Level3 managed to remove the “false” announcements it was making and 100% of connectivity was recovered.
Conclusion
Due to the nature of the Internet, a large network made up of other interconnected networks, there will always be areas that are beyond our full control, even though our philosophy has always been, and will continue to be, to work with the best providers to minimize the risk of service interruptions.
This has been one of the biggest Internet failures worldwide in recent times, and unfortunately there is no simple way to build an automated system capable of reacting to this type of situation.
In any case, we believe that an incident in the service should always lead to improvements, either in our systems or in our recovery protocols. Therefore, we have started working on a faster and easier way to change the IP range announcements we make to the Internet. This way, in the rare case that such a situation occurs again, we will be able to react faster and recover the highest percentage of the service in a shorter period of time.
Links to related news
https://blog.cloudflare.com/analysis-of-todays-centurylink-level-3-outage/
https://edition.cnn.com/2020/08/30/tech/internet-outage-cloudflare/index.html