Elon Musk And >1000 Poorly Batched RPCs
My article about whether Elon Musk fired people using lines of code as his sole metric (summary: probably not) is easily the most popular post on this blog, but it was also criticized as “navel-gazing” by one commenter — he said that only 10 people actually know what Elon Musk did, I am not one of them, and all of this speculation is pointless. So, fair warning: If that article was speculation, this is going to be a lot more than that. The topic at the center of the debate is why Twitter is, or was, so slow. The previous article’s topic could be answered with a simple “yes” or “no.” This one is complicated.
The reason I find this worthy of discussion is that if you strip away all the politics, along with your negative or positive opinion of Elon Musk, at the center is a pretty interesting technical conversation. The problem is that Twitter’s infrastructure is complex, so a simple question like “Why Is Twitter Slow?” probably does not have a straightforward answer. It would be as straightforward as assessing whether Elon Musk’s subsequent statements are true and, if so, what was done that successfully achieved this in just ten days.
To keep it centered, I am going to ground the conversation in statements made by people with deep knowledge of Twitter’s infrastructure.
And, to the commenter who said I am going to have to do better than just dropping links and filling in the gaps with rambling…
Starting Point: Yao Yue
Yao Yue is a valid starting point because she actually knows what she is talking about. She worked at Twitter for more than 12 years. She was the main contributor behind Pelikan, their unified cache backend, and she was explicitly mentioned in this Twitter blog post by the VP of infrastructure as their expert on the Twitter cache before the events of the last few weeks, when both of them stopped working at Twitter. A more detailed look at Pelikan is provided here. A tech talk on scaling Redis at Twitter is here.
Yao Yue writes: The slow part (last hop) is a single API request plus CDN, and the backend in general is not the slow part regardless of batching.
There was a derivative conversation that occurred after this one. Someone wrote that Elon Musk was wrong about why Twitter was slow, but that it was still slow. Yao Yue responded that the team responsible for monitoring client-side performance was entirely laid off, leading to this Tweet:
Sam Pullara, who had worked at Twitter some time ago, weighed in. His argument was that the reason Twitter is slow is because they did away with server-side rendering.
This prompted its own slew of arguments, which you are welcome to read, including one between Sam Pullara and someone named Cullen about Pullara’s original work rendering Twitter in 2012 using a framework he wrote, and Cullen’s retort that this is no longer relevant.
The most famous argument is between Elon Musk and an Android developer named Eric, because it appears he was fired for having the argument.
Next Point: Microservices
Yeah, I know…this hasn’t really answered anything so far. Twitter is super slow in many countries because of the number of poorly batched services? Is it because the number is more than 1000, or because they are poorly batched? What does any of this mean?
This video has one of the most succinct descriptions of microservices I have found:
- Every app function is its own service
- Own container
- Communicate via APIs
Arguments for or against microservices can get murky, but this does not seem to be what Elon Musk is arguing.
Elon Musk stated that more than 80% (good math on my part, right?) of Twitter microservices were bloatware, and this is where things start to get a little confusing. If you follow a few news websites, it seems what happened next was Elon Musk accidentally broke two-factor authentication.
From Yahoo! News:
Musk tweeted on Monday that he was “turning off the ‘microservices’ bloatware.” He said that less than 20% of the company’s microservices were needed for Twitter to work.
Once it’s behind an API, it’s easy to just set it and forget it. The reality is, I see companies with thousands of microservices when they probably should have had five. It can definitely be overdone, but a spectrum is the way I think of it…
These sorts of architecture decisions are hard to undo at a moment’s notice. They tend to be pretty deep issues. Part of the benefit of microservices is things are separated physically, often running on different servers and separated by an API. Those APIs definitely have performance implications.
To undo it really requires deep thinking on, “Does that mean we shut off entire services, because we didn’t need them after all?”
They were, it appears, correct to have concerns. It’s unclear what of this “bloatware” was altered during the day, but users began to report that, after logging out, they were unable to log back in again.
The problem affected users who have set up two-factor authentication, an extra step at login to make your account more secure. Users (who are also greeted with the two-factor authentication check when changing other Twitter settings) reported that they were not sent a code, making them unable to log in.
Oana Petrache, if you are reading this…yes, I just listed a bunch of articles, and now I am going to fill in the gaps with ramblings.
InformationWeek is, I think, the best source here because it has nuance to it. Maybe those services were unnecessary. It is hard to say. According to all three, it is likely that whatever effort Elon Musk kicked off resulted in cutting something that was important.
Did this eventually lead to a series of events that culminated in Twitter working much better?
I have no idea.
Filling In The Gaps With Ramblings
The Write API accepts a new Tweet, then uses a fanout service. Hashes are kept for Tweets, which are Redis instances. One thing system design interview courses get right about Twitter is that it is very read-heavy, and this was very much a factor in its design. Writes are slow. Reads are fast.
That paragraph really does not answer the question of why Twitter did >1000 poorly batched RPCs to render a home timeline, but at least I got something out of this.
I tried pitching this topic to anyone who would listen, and no one really got into it. There just was not enough information. Like the InformationWeek article said, could there have been too many services? Well sure. Maybe there were 1000 and there only needed to be five.
Twitter is very, very open in ways many tech companies are not, but it is not completely open source, either. To answer a lot of these things, we would need the source code, and we would need a team of very smart engineers responsible for building it.
Elon Musk has that, so maybe I will just leave it to him and pivot this blog back to side projects again.