Twitter has just made its recommendation algorithm open source. You can find it on GitHub if you wish to take a closer look for yourself.
We will start by describing its architecture at a high level, then slowly go down the rabbit hole.
Take a seat, relax and enjoy the breakdown.
Architecture
The architecture can be abstracted to 3 high-level actions:
First, it sources tweets that might be relevant to you (Candidate Source).
These tweets are then ranked in chunks based on how likely they are to actually interest you. The ranking comes from a ~48M-parameter neural network that is continuously trained on Tweet interactions (Heavy Ranker).
The third and last step applies heuristics and filters based on specific preferences, such as NSFW content, blocked or muted accounts, and tweets that you’ve already seen (Heuristics & Filtering).
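To make the three actions concrete, here is a minimal sketch of how they could be chained together. It is not Twitter's code: the `Tweet` class, the `build_timeline` function and the three callbacks are hypothetical names invented for this illustration, and the real stages are far more involved.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Tweet:
    # Hypothetical, minimal tweet record used only for this sketch.
    tweet_id: int
    author: str
    text: str


def build_timeline(
    pool: Iterable[Tweet],
    is_candidate: Callable[[Tweet], bool],
    score: Callable[[Tweet], float],
    is_allowed: Callable[[Tweet], bool],
) -> List[Tweet]:
    # 1. Candidate Source: keep only tweets that might be relevant.
    candidates = [t for t in pool if is_candidate(t)]
    # 2. Heavy Ranker: order candidates from highest to lowest score.
    ranked = sorted(candidates, key=score, reverse=True)
    # 3. Heuristics & Filtering: drop tweets the viewer should not see.
    return [t for t in ranked if is_allowed(t)]


pool = [
    Tweet(1, "alice", "new compiler release"),
    Tweet(2, "bob", "spam spam spam"),
    Tweet(3, "carol", "compiler internals thread"),
]
timeline = build_timeline(
    pool,
    is_candidate=lambda t: "compiler" in t.text or t.author == "bob",
    score=lambda t: len(t.text),          # stand-in for the neural network
    is_allowed=lambda t: t.author != "bob",  # stand-in for a block list
)
print([t.tweet_id for t in timeline])  # [3, 1]
```

The stand-in score (text length) only exists to make the example runnable; the point is the order of the three stages.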
Candidate Source
This stage attempts to extract the best 1,500 Tweets from a pool of hundreds of millions of tweets.
Candidates are collected from two distinct pools: In-Network, which is composed of people you follow, and Out-Of-Network, which is composed of people you don’t follow.
In-Network is the largest candidate source. It ranks Tweets from the people you follow by relevance, using a logistic regression model. It also relies on Real Graph, a model trained to predict the likelihood of engagement between two users. The higher the Real Graph score between you and the author of a Tweet, the more of their tweets will be included.
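As a rough illustration of what "predicting the likelihood of engagement between two users" can look like with logistic regression, here is a minimal sketch. The feature names, weights and bias are all made up for the example; the article does not say which features or parameters the real model uses.

```python
import math
from typing import Dict

# Hypothetical weights; a real model would learn these from interaction data.
WEIGHTS = {"replies_last_month": 0.8, "likes_last_month": 0.3}
BIAS = -2.0


def engagement_probability(features: Dict[str, float]) -> float:
    # Logistic regression: a weighted sum squashed into the (0, 1) range.
    z = BIAS + sum(w * features.get(name, 0.0) for name, w in WEIGHTS.items())
    return 1.0 / (1.0 + math.exp(-z))


close_friend = {"replies_last_month": 3, "likes_last_month": 5}
distant_account = {"replies_last_month": 0, "likes_last_month": 1}

# The account you interact with more gets the higher predicted probability,
# so more of its tweets would make it into the candidate set.
print(engagement_probability(close_friend) > engagement_probability(distant_account))  # True
```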
Out-Of-Network uses two approaches to identify which tweets to collect: Social Graph and Embedding Space.
The Social Graph is responsible for 15% of the gathered tweets. These come from people who overlap with what you like or who you interact with. For example, if you interact with Joe and Joe interacts with Pepe, Pepe is very likely to be given more weight when looking at your social graph. This graph is very limited, as it will never surface anyone who is not connected to your actual circle.
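The Joe and Pepe example can be sketched as a two-hop walk over an interaction graph. This is only an illustration of the idea: the function name, the dictionary layout and the tie-breaking rule are assumptions, not how Twitter stores or traverses its graph.

```python
from collections import Counter
from typing import Dict, List, Set


def second_degree_authors(viewer: str, interactions: Dict[str, Set[str]]) -> List[str]:
    """Rank accounts two hops away by how many of the viewer's contacts reach them."""
    direct = interactions.get(viewer, set())
    counts: Counter = Counter()
    for contact in direct:
        for other in interactions.get(contact, set()):
            # Skip the viewer and accounts the viewer already interacts with.
            if other != viewer and other not in direct:
                counts[other] += 1
    # Most shared connections first; names break ties so the order is stable.
    return [name for name, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))]


interactions = {
    "me": {"joe", "ana"},
    "joe": {"pepe", "me"},
    "ana": {"pepe", "luis"},
    "zoe": {"max"},  # not connected to "me" in any way
}
print(second_degree_authors("me", interactions))  # ['pepe', 'luis']
```

Note that "max" never shows up, however popular that account might be, which is exactly the limitation described above.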
This is where the Embedding Space comes in. It is responsible for the remaining 35% of the Out-Of-Network selection. It works based on categories, and it helps Twitter expand your feed with content that you might not have seen before but that you’ll probably like. As an example, if a tweet in the tech space is getting very good engagement and your profile suggests you would like it, Twitter will select it as a candidate even though it is not part of your Social Graph.
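Putting the percentages next to the 1,500-tweet target gives a feel for the size of each bucket. The sketch below assumes that In-Network fills the 50% not covered by the Social Graph (15%) and the Embedding Space (35%), and that the shares apply to the full candidate set; the source names are hypothetical labels.

```python
from typing import Dict

# Shares in whole percent. The 50% for In-Network is inferred, not stated.
SHARES_PERCENT = {"in_network": 50, "social_graph": 15, "embedding_space": 35}


def candidate_quotas(total: int = 1500) -> Dict[str, int]:
    # Integer arithmetic avoids floating-point rounding surprises.
    return {source: total * pct // 100 for source, pct in SHARES_PERCENT.items()}


print(candidate_quotas())
# {'in_network': 750, 'social_graph': 225, 'embedding_space': 525}
```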
Heavy Ranker
Now it is time to rank these tweets. At this stage, all candidates are treated equally, regardless of which candidate source they originated from.
The neural network used for this task is continuously trained on Tweet interactions to optimize for positive engagement (e.g. Likes, Retweets, and Replies). This ranking mechanism takes into account thousands of features and outputs ten labels to give each Tweet a score, where each label represents the probability of an engagement. The Tweets are then ranked by these scores.
Each label is a multiplier, which means that a tweet that scored poorly on the first labels still has a chance to come out on top thanks to one of the last ones. The score is based on likelihood of interest, and no label can take points away from it. Being a Twitter Blue subscriber is one of the labels used in this process, as it increases the likelihood that the account is run by a human and not by a bot.
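One way to read "each label is a multiplier" is that every predicted probability is scaled by a non-negative weight and the results are added up. That reading is an assumption, and so are the three labels and the weights below; the article mentions ten labels and gives no numbers.

```python
from typing import Dict

# Hypothetical, non-negative weights: no label can lower the score.
WEIGHTS = {"like": 1.0, "retweet": 2.0, "reply": 3.0}


def tweet_score(probabilities: Dict[str, float]) -> float:
    # Each label contributes (weight * predicted probability) to the total.
    return sum(w * probabilities.get(label, 0.0) for label, w in WEIGHTS.items())


tweet_a = {"like": 0.30, "retweet": 0.02, "reply": 0.01}
tweet_b = {"like": 0.05, "retweet": 0.03, "reply": 0.12}

# tweet_b looks weak on likes, yet its reply probability carries it to the top.
ranked = sorted([("a", tweet_a), ("b", tweet_b)], key=lambda kv: tweet_score(kv[1]), reverse=True)
print([name for name, _ in ranked])  # ['b', 'a']
```

With these numbers tweet A scores about 0.37 and tweet B about 0.47, which illustrates how a late label can change the final order.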
Heuristics and Filtering
This last cleanup is a mixture of heuristics, filtering and product features that aims to create a balanced and diverse feed. Citing directly from the Twitter engineering blog post, some of these are:
Visibility Filtering: Filter out Tweets based on their content and your preferences. For instance, remove Tweets from accounts you block or mute.
Author Diversity: Avoid too many consecutive Tweets from a single author.
Content Balance: Ensure we are delivering a fair balance of In-Network and Out-of-Network Tweets.
Feedback-based Fatigue: Lower the score of certain Tweets if the viewer has provided negative feedback around it.
Social Proof: Exclude Out-of-Network Tweets without a second degree connection to the Tweet as a quality safeguard. In other words, ensure someone you follow engaged with the Tweet or follows the Tweet’s author.
Conversations: Provide more context to a Reply by threading it together with the original Tweet.
Edited Tweets: Determine if the Tweets currently on a device are stale, and send instructions to replace them with the edited versions.
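Two of the rules above, Visibility Filtering and Author Diversity, are simple enough to sketch. The code below is a toy version under stated assumptions: tweets are plain `(author, text)` tuples, a run of at most two tweets per author is allowed, and extra tweets are dropped outright, whereas a real system might demote or reorder them instead.

```python
from typing import List, Set, Tuple

Tweet = Tuple[str, str]  # (author, text), a hypothetical minimal shape


def visibility_filter(tweets: List[Tweet], blocked_or_muted: Set[str]) -> List[Tweet]:
    # Remove tweets from accounts the viewer has blocked or muted.
    return [t for t in tweets if t[0] not in blocked_or_muted]


def author_diversity(tweets: List[Tweet], max_run: int = 2) -> List[Tweet]:
    # Keep at most `max_run` consecutive tweets from the same author.
    result: List[Tweet] = []
    run = 0
    for tweet in tweets:
        same_author = bool(result) and result[-1][0] == tweet[0]
        run = run + 1 if same_author else 1
        if run <= max_run:
            result.append(tweet)
    return result


ranked = [
    ("ana", "a1"), ("ana", "a2"), ("ana", "a3"),
    ("bob", "b1"), ("eve", "e1"), ("ana", "a4"),
]
visible = visibility_filter(ranked, blocked_or_muted={"eve"})
diverse = author_diversity(visible)
print([text for _, text in diverse])  # ['a1', 'a2', 'b1', 'a4']
```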
Mixing and Serving
At this point, before serving the feed to you, Twitter simply mixes the final selection of tweets with other content such as ads and follow suggestions.
The whole pipeline runs roughly 5 billion times per day and completes in under 1.5 seconds on average.
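Those two figures are easier to appreciate once converted to a per-second rate. The back-of-the-envelope calculation below assumes the load is spread evenly over the day, which real traffic is not, so treat the results as orders of magnitude only.

```python
RUNS_PER_DAY = 5_000_000_000
SECONDS_PER_DAY = 24 * 60 * 60
MAX_LATENCY_SECONDS = 1.5

runs_per_second = RUNS_PER_DAY / SECONDS_PER_DAY
# Little's law: requests in flight = arrival rate * time spent in the system.
in_flight_upper_bound = runs_per_second * MAX_LATENCY_SECONDS

print(round(runs_per_second))        # 57870
print(round(in_flight_upper_bound))  # 86806
```

In other words, under these assumptions the system builds close to 58,000 timelines every second, with up to roughly 87,000 being computed at any given moment.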
Conclusions
In conclusion, the Twitter algorithm may seem like a mysterious and complex system, but conceptually it is not far from what you would expect it to be. By collecting tweets and data from the user, the algorithm can calculate which posts are most likely to pique their interest. This allows users to have a more personalized experience on the platform and helps Twitter to deliver more relevant content. While there may be some debate about the algorithm's impact on the visibility of certain tweets and accounts, overall, it is an important tool that helps Twitter to provide a better user experience. By staying informed about how the algorithm works and how it affects your tweets, you can make the most of your Twitter presence and engage with your audience more effectively.