How Hacker News / Reddit Ranking Algorithms Work

I am reading this blog post to inform how I implement ranking for comments and annotations.

Date Created:
1 40

References



Notes


How Hacker News Ranking Algorithm Works

In this post I'll try to explain how Hacker News ranking algorithm works and how you can reuse it in your own applications. It's a very simple ranking algorithm and works surprisingly well when you want to highlight hot or new stuff.

The ranking performed by Hacker News (which is written in Arc, a Lisp dialect coded by Paul Graham) looks like this:

Gravity and time have a significant impact on the score of an item. Generally these things hold true:

  • the score decreases at increases, meaning that older items get lower and lower scores
  • the score decreases much faster for older items if gravity is increased

The plot below shows the scoring over time. The score decreases a lot as time goes by.

Hacker News Scoring Behavior Over Time

The gravity behavior can be seen below. The score decreases a lot faster the larger the gravity is.

Gravity's Effect on HackerNews Score


How Reddit Ranking Algorithm Works

This post examines how comment rankings work. Reddit's algorithms are fairly simple to understand and to implement.

Story Ranking

Reddit used to be open sourced and the code is freely available. Reddit was implemented in Python. The sorting algorithms are implemented in Pyrex, a language to write Python C extensions. The default story algorithm called hot ranking is implemented like this:

#Rewritten code from /r2/r2/lib/db/_sorts.pyx

from datetime import datetime, timedelta
from math import log

epoch = datetime(1970, 1, 1)

def epoch_seconds(date):
"""Returns the number of seconds from the epoch to date."""
td = date - epoch
return td.days * 86400 + td.seconds + (float(td.microseconds) / 1000000)

def score(ups, downs):
return ups - downs

def hot(ups, downs, date):
"""The hot formula. Should match the equivalent function in postgres."""
s = score(ups, downs)
order = log(max(abs(s), 1), 10)
sign = 1 if s > 0 else -1 if s < 0 else 0
seconds = epoch_seconds(date) - 1134028003
return round(sign * order + seconds / 45000, 7)

The following things can be said about submission time related to story ranking:

  • Submission time has a big impact on the ranking and the algorithm will rank newer stories higher than older
  • The score won't decrease as time goes by, but newer stories will get a higher score than older. This is a different approach than the Hacker News's algorithm which decreases the score as time goes by

The visualization to the right of the score for a story that has the same amount of u[votes and downvotes but different submission time.

Reddit's hot ranking uses the logarithm function to weight the first votes higher than the rest. Generally this applied:

  • The first 10 upvotes have the same weight as the next 100 upvotes which have the same wight as the next 1000, etc...

With Log Scale

Without log scale

Effects of Downvotes

Reddit is one of the few sites that has downvotes. A story's score is defined to be upvotes - downvotes

The meaning of this can be visualized like this:

This has a big impact for stories that get a lot of upvotes and downvotes(controversial), as they will get a lower ranking that stories that just get upvotes.

Conclusion of Story Ranking

  • Submission time is a very important parameter - newer stories will rank higher than old
  • The first 10 upvotes count as high as the next 100.
  • Controversial stories that get similar amounts of upvotes and downvotes will get a low ranking compared to stores that mainly get upvotes

Reddit Comment Ranking

Randall Munroe of xhcd is the idea guy behind Reddit's best ranking. He has written a great blog post about it:

reddit's new comment sorting system

This blog post goes into the Best ranking, as opposed to Top, hot, controversial, and old. It affects only comments, not stories. This has become the default way that comments are sorted.

  • The algorithm is detailed here by Evan Miller. The article explains why each of the common systems for ranking things based on votes is flawed in a different way, and if you go there you'll see an explanation of the algorithm plus some examples of the flawed voting systems.

comments posted early are most likely to stay at the top if they are moderately good because:

The reason for this bias is that once a comment gets a few early upvotes, it's moved to the top. The higher something is listed, the more likely it is to be read (and voted on), and the more votes the comment gets. It's a feedback loop that cements the comment's position, and a comment posted an hour later has little chance of overtaking it -- even if people reading it are upvoting it at a much higher rate.

The solution to this problem is the Best ranking. When a few people have voted on a comment, you get a rough idea of its quality. The more people who vote on it, the better an idea you get of where it should ultimately end up. With this algorithm, you quantify exactly how much you can tell about a comment from a limited number of votes.

The algorithm treats the vote count as a statistical sampling of a hypothetical full vote by everyone, much as in an opinion poll. It uses this to calculate the 95% confidence score for the comment. That is, it gives the comment a provisional ranking that it is 95% sure it will get to. The more votes, the closer the 95% confidence score gets to the actual score. One of the best things about the Best rating system is that, if the algorithm is wrong to put a comment at the top, it will quickly get ore data, since the comment with less data is near the top - and when it gets that data, it will quickly correct the comment's position. The bottom line is that this system means good comments will jump quickly to the top and stay there, and bad comments will hover near the bottom.

How Not To Sort By Average Rating

Problem: You are a web programmer. You have users. Your users rate stuff on your site. You want to put the highest-rated stuff at the top and lowest-rated at the bottom. You need some sort of “score” to sort by.

We need to balance the proportion of positive ratings with the uncertainty of a small number of observations. Fortunately, the math for this was worked out in 1927 by Edwin B. Wilson. What we want to ask is: Given the ratings I have, there is a 95% chance that the real fraction of positive ratings is at least what? Wilson gives the answer. Considering only positive and negative ratings, the lower bound on the proportion of positive ratings is given by:

Use minus where it says plus or minus to calculate the lower bound. Here is the observed fraction of positive ratings, is the quantile of the standard normal distribution, and is the total number of ratings.

The confidence sort algorithm implemented in python:

#Rewritten code from /r2/r2/lib/db/_sorts.pyx

from math import sqrt

def _confidence(ups, downs):
n = ups + downs

if n == 0:
return 0

z = 1.0 #1.0 = 85%, 1.6 = 95%
phat = float(ups) / n
return sqrt(phat+z*z/(2*n)-z*((phat*(1-phat)+z*z/(4*n))/n))/(1+z*z/n)

def confidence(ups, downs):
if ups + downs == 0:
return 0
else:
return _confidence(ups, downs)

The great thing about confidence sort is that submission time is irrelevant. Comments are ranked by confidence and by data sampling - i.e., the more votes a comment gets the more accurate its score will become. Visualization of comments:

Wilson's score has applications outside of ranking:

  1. Detect spam/abuse: What percentage of people who see this item will mark it as spam?
  2. Create a best of list: What percentage of people who see this item will mark it as best of?
  3. Create a most emailed list: What percentage of people which see this page will click Email?
Given how powerful and simple this is, it's amazing that most sites today use the naive ways to rank their content. This includes billion dollar companies like Amazon.com, which define Average rating = (Positive ratings) / (Total ratings).

You can read more about how comments are sorted in this blog post.

User Comments

Insert Math Markup

ESC
About Inserting Math Content
Display Style:

Embed News Content

ESC
About Embedding News Content

Embed Youtube Video

ESC
Embedding Youtube Videos

Embed TikTok Video

ESC
Embedding TikTok Videos

Embed X Post

ESC
Embedding X Posts

Embed Instagram Post

ESC
Embedding Instagram Posts

Insert Details Element

ESC

Example Output:

Summary Title
You will be able to insert content here after confirming the title of the <details> element.

Insert Table

ESC
Customization
Align:
Preview:

Insert Horizontal Rule

#000000

Preview:


View Content At Different Sizes

ESC

Edit Style of Block Nodes

ESC

Edit the background color, default text color, margin, padding, and border of block nodes. Editable block nodes include paragraphs, headers, and lists.

#ffffff
#000000

Edit Selected Cells

Change the background color, vertical align, and borders of the cells in the current selection.

#ffffff
Vertical Align:
Border
#000000
Border Style:

Edit Table

ESC
Customization:
Align:

Upload Files

ESC

Upload a .lexical file. If the file type matches the type of the current editor, then a preview will be shown below the file input.

Upload Jupyter Notebook

ESC

Upload a Jupyter notebook and embed the resulting HTML in the text editor.

Insert Custom HTML

ESC

Edit Image

ESC
#ffffff

Insert Columns Layout

ESC
Column Type:

Select Code Language

ESC
Select Coding Language

Upload Previous Version of Editor State

ESC