What exactly is BGP, the protocol at the centre of Facebook’s outage?

I’m guessing a bunch of folks might be curious about BGP. So here’s a 🧵
Computers talk to each other using what’s known as IP addresses. They can look like 192.168.1.11 (IPv4) or fd14:9deb:3174::ce12 (IPv6). It’s like your home’s postal address.
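To make the two address formats concrete, here is a small sketch using Python’s standard `ipaddress` module. The two addresses are the ones from the example above; the `192.168.1.0/24` network is just an assumed home network for illustration.

```python
import ipaddress

# The two example addresses from the thread.
v4 = ipaddress.ip_address("192.168.1.11")
v6 = ipaddress.ip_address("fd14:9deb:3174::ce12")

# Each address knows which IP version it belongs to.
print(v4.version, v6.version)

# Both examples happen to come from ranges reserved for private networks,
# so they are not reachable from the public Internet.
print(v4.is_private, v6.is_private)

# An address belongs to a network (a contiguous block of addresses),
# much like a house belongs to a postal area.
home_network = ipaddress.ip_network("192.168.1.0/24")  # assumed for illustration
print(v4 in home_network)
```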
When you send someone a letter via your local post office (PO), for some of the PIN codes your PO “knows” which PO to send it to directly; for the rest, it “forwards” the letter to a General Post Office (GPO).
Similarly, when a machine sends a packet to some IP, it “forwards” it to the local “gateway” (typically your ISP), which then “routes” it either to a “known” network or to some “bigger” network.
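A rough sketch of that “known network, else forward to something bigger” decision: routers pick the most specific matching entry in their routing table (longest prefix match) and fall back to a default route. The table entries and labels below are made up for illustration.

```python
import ipaddress

# Hypothetical routing table: prefix -> where to send matching packets.
# 0.0.0.0/0 matches every IPv4 address and acts as the default route.
ROUTES = {
    "192.168.1.0/24": "local network",
    "10.0.0.0/8": "known neighbour network",
    "0.0.0.0/0": "default gateway (ISP)",
}


def next_hop(destination, routes=ROUTES):
    """Return the route label for the most specific prefix containing destination."""
    addr = ipaddress.ip_address(destination)
    matches = [
        ipaddress.ip_network(prefix)
        for prefix in routes
        if addr in ipaddress.ip_network(prefix)
    ]
    # The longest prefix (largest prefixlen) is the most specific match.
    best = max(matches, key=lambda network: network.prefixlen)
    return routes[str(best)]


print(next_hop("192.168.1.11"))  # matches the /24 as well as the default route
print(next_hop("8.8.8.8"))       # only the default route matches
```

Real routers do this lookup in specialised data structures and hardware; the linear scan here is only meant to show the rule.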
The way a network (e.g. Airtel) “knows” another network (e.g. FB) is by a process called “peering” (almost always a BGP peering). BGP is the protocol which allows one network to say to the other “hey, send all packets for such & such addresses to me”.
So when FB wants its IPs to be reachable by any ISP world over, it “peers” with a bunch of ISPs, which would’ve “peer”-ed with a bunch of other ISPs and so on, so that it is reachable eventually by some path.
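Here is a toy model of how an announcement might spread from peer to peer until everyone has some path to the origin. The network names and peering sessions are hypothetical, and real BGP picks paths using policies and many attributes; this sketch simply floods the announcement breadth-first and keeps the first path each network hears.

```python
from collections import deque

# Hypothetical peering sessions: network -> networks it peers with.
PEERINGS = {
    "FB": ["ISP-A", "ISP-B"],
    "ISP-A": ["FB", "ISP-C"],
    "ISP-B": ["FB", "ISP-C"],
    "ISP-C": ["ISP-A", "ISP-B", "ISP-D"],
    "ISP-D": ["ISP-C"],
}


def propagate(origin, peerings):
    """Flood an announcement from origin; return {network: path back to origin}."""
    paths = {origin: [origin]}
    queue = deque([origin])
    while queue:
        current = queue.popleft()
        for peer in peerings.get(current, []):
            if peer not in paths:
                # The peer prepends itself to the path it heard, much like
                # a BGP speaker adds itself to the path it passes on.
                paths[peer] = [peer] + paths[current]
                queue.append(peer)
    return paths


paths = propagate("FB", PEERINGS)
for network, path in paths.items():
    print(network, "reaches FB via", " -> ".join(path))
```

Note that ISP-D never peers with FB directly, yet it still ends up with a path through ISP-C.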
This process of BGP peering itself is done over the same TCP/IP network protocol, so if the underlying network is affected for any reason (bad configuration, cable cut, network loops), BGP connections would break as well.
What happens when a BGP connection breaks? Well, eventually the “peering” router of the ISP will treat this as “this path is no longer valid, so I’m gonna stop announcing this IP address range to my peers too”. This is called a “route withdrawal”.
This withdrawal then “propagates” across different routers across the Internet (yes). Approx. 850k IPv4 routes currently exist in the “full” global IPv4 routing table.
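A simplified sketch of how a withdrawal might ripple outwards: a network withdraws the route from its own peers only once it has lost every source it learned that route from. The topology is hypothetical, and real BGP convergence is considerably messier than this.

```python
from collections import deque

# Hypothetical state for one FB prefix: network -> peers it learned the route from.
LEARNED_FROM = {
    "ISP-A": {"FB"},
    "ISP-B": {"FB"},
    "ISP-C": {"ISP-A", "ISP-B"},
    "ISP-D": {"ISP-C"},
}


def propagate_withdrawal(lost_source, learned_from):
    """Return the networks that lose the route, in order, once lost_source stops announcing it."""
    remaining = {net: set(sources) for net, sources in learned_from.items()}
    withdrawn = []
    queue = deque([lost_source])
    while queue:
        source = queue.popleft()
        for net, sources in remaining.items():
            if source in sources:
                sources.discard(source)
                if not sources:
                    # No path left, so this network withdraws the route too.
                    withdrawn.append(net)
                    queue.append(net)
    return withdrawn


# FB's own sessions drop: every network eventually loses the route.
print(propagate_withdrawal("FB", LEARNED_FROM))

# Only ISP-A drops: ISP-C still has ISP-B, so nothing else is withdrawn.
print(propagate_withdrawal("ISP-A", LEARNED_FROM))
```

The second call shows why redundant peering matters: losing one path is harmless as long as another source remains.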
What likely happened here (speculative zone now) is that some internal network configuration issue broke the BGP connections that FB uses to peer with others, leading to large scale route updates & thus routing failures.
The moment that happened, traffic would’ve tanked with massive alerts going out to their NOC engineers, unless of course their monitoring infrastructure uses the same network infrastructure which failed.
DNS failures (DNS is what tells a computer the IP address of a domain like fb.com) would’ve started soon after, though not immediately, as DNS servers the world over would’ve kept results cached for some time.
Eventually the recursive resolvers (those DNS servers provided by your ISP, or 1.1.1.1 or 8.8.8.8) would also not be able to reach FB’s DNS servers (the authoritative source), so FB’s DNS resolution itself would’ve started failing (ppl assumed it to be a DNS issue due to this).
I’m guessing many an action item would emerge from the RCA here, but I’m sure engineers of startups would find a bunch of lessons in here. After all, we build upon the shoulders of giants in this community.
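The delay between the routes vanishing and DNS visibly failing can be sketched with a toy caching resolver. Everything here is an assumption for illustration: the class names, the 300 second TTL, and the address (192.0.2.10 comes from a range reserved for documentation, not FB’s real address). Time is passed in explicitly so the behaviour is easy to follow.

```python
class AuthoritativeServer:
    """Stand-in for FB's own DNS servers (hypothetical)."""

    def __init__(self):
        self.reachable = True

    def lookup(self, name):
        if not self.reachable:
            raise ConnectionError("no route to authoritative server")
        return "192.0.2.10", 300  # (placeholder address, assumed TTL in seconds)


class CachingResolver:
    """Stand-in for a recursive resolver that caches answers until the TTL expires."""

    def __init__(self, authoritative_lookup):
        self._lookup = authoritative_lookup
        self._cache = {}  # name -> (address, expiry time)

    def resolve(self, name, now):
        cached = self._cache.get(name)
        if cached is not None and now < cached[1]:
            return cached[0]  # still fresh: answer without asking upstream
        address, ttl = self._lookup(name)  # raises if upstream is unreachable
        self._cache[name] = (address, now + ttl)
        return address


auth = AuthoritativeServer()
resolver = CachingResolver(auth.lookup)

print(resolver.resolve("fb.com", now=0))    # fetched from the authoritative server
auth.reachable = False                      # the routes get withdrawn
print(resolver.resolve("fb.com", now=100))  # still answered from the cache
try:
    resolver.resolve("fb.com", now=301)     # cache expired, upstream unreachable
except ConnectionError as error:
    print("resolution failed:", error)
```

Until the cached entry expires, users of this resolver would notice nothing from DNS; after that, lookups fail even though the real fault sits at the routing layer.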
Hoping this helped those less familiar with this space to get an idea of what BGP is, even if my guess at the underlying issue proves to be wrong.


We unrolled this Twitter thread 🧵 for you.

