How Hacker News / Reddit Ranking Algorithms Work

I am reading this blog post to inform how I implement ranking for comments and annotations.

Date Created:

1 73

Notes

How Hacker News Ranking Algorithm Works

In this post I'll try to explain how Hacker News ranking algorithm works and how you can reuse it in your own applications. It's a very simple ranking algorithm and works surprisingly well when you want to highlight hot or new stuff.

The ranking performed by Hacker News (which is written in Arc, a Lisp dialect coded by Paul Graham) looks like this:

Gravity and time have a significant impact on the score of an item. Generally these things hold true:

the score decreases at increases, meaning that older items get lower and lower scores
the score decreases much faster for older items if gravity is increased

The plot below shows the scoring over time. The score decreases a lot as time goes by.

Hacker News Scoring Behavior Over Time

The gravity behavior can be seen below. The score decreases a lot faster the larger the gravity is.

Gravity's Effect on HackerNews Score

How Reddit Ranking Algorithm Works

This post examines how comment rankings work. Reddit's algorithms are fairly simple to understand and to implement.

Story Ranking

Reddit used to be open sourced and the code is freely available. Reddit was implemented in Python. The sorting algorithms are implemented in Pyrex, a language to write Python C extensions. The default story algorithm called hot ranking is implemented like this:

#Rewritten code from /r2/r2/lib/db/_sorts.pyx

from datetime import datetime, timedelta
from math import log

epoch = datetime(1970, 1, 1)

def epoch_seconds(date):
    """Returns the number of seconds from the epoch to date."""
    td = date - epoch
    return td.days * 86400 + td.seconds + (float(td.microseconds) / 1000000)

def score(ups, downs):
    return ups - downs

def hot(ups, downs, date):
    """The hot formula. Should match the equivalent function in postgres."""
    s = score(ups, downs)
    order = log(max(abs(s), 1), 10)
    sign = 1 if s > 0 else -1 if s < 0 else 0
    seconds = epoch_seconds(date) - 1134028003
    return round(sign * order + seconds / 45000, 7)

The following things can be said about submission time related to story ranking:

Submission time has a big impact on the ranking and the algorithm will rank newer stories higher than older
The score won't decrease as time goes by, but newer stories will get a higher score than older. This is a different approach than the Hacker News's algorithm which decreases the score as time goes by

The visualization to the right of the score for a story that has the same amount of u[votes and downvotes but different submission time.

Reddit's hot ranking uses the logarithm function to weight the first votes higher than the rest. Generally this applied:

The first 10 upvotes have the same weight as the next 100 upvotes which have the same wight as the next 1000, etc...

With Log Scale

Without log scale

Effects of Downvotes

Reddit is one of the few sites that has downvotes. A story's score is defined to be upvotes - downvotes

The meaning of this can be visualized like this:

This has a big impact for stories that get a lot of upvotes and downvotes(controversial), as they will get a lower ranking that stories that just get upvotes.

Conclusion of Story Ranking

Submission time is a very important parameter - newer stories will rank higher than old
The first 10 upvotes count as high as the next 100.
Controversial stories that get similar amounts of upvotes and downvotes will get a low ranking compared to stores that mainly get upvotes

Reddit Comment Ranking

Randall Munroe of xhcd is the idea guy behind Reddit's best ranking. He has written a great blog post about it:

Reddit Comment Ranking Blog Post

reddit's new comment sorting system

This blog post goes into the Best ranking, as opposed to Top, hot, controversial, and old. It affects only comments, not stories. This has become the default way that comments are sorted.

The algorithm is detailed here by Evan Miller. The article explains why each of the common systems for ranking things based on votes is flawed in a different way, and if you go there you'll see an explanation of the algorithm plus some examples of the flawed voting systems.

comments posted early are most likely to stay at the top if they are moderately good because:

The reason for this bias is that once a comment gets a few early upvotes, it's moved to the top. The higher something is listed, the more likely it is to be read (and voted on), and the more votes the comment gets. It's a feedback loop that cements the comment's position, and a comment posted an hour later has little chance of overtaking it -- even if people reading it are upvoting it at a much higher rate.

The solution to this problem is the Best ranking. When a few people have voted on a comment, you get a rough idea of its quality. The more people who vote on it, the better an idea you get of where it should ultimately end up. With this algorithm, you quantify exactly how much you can tell about a comment from a limited number of votes.

The algorithm treats the vote count as a statistical sampling of a hypothetical full vote by everyone, much as in an opinion poll. It uses this to calculate the 95% confidence score for the comment. That is, it gives the comment a provisional ranking that it is 95% sure it will get to. The more votes, the closer the 95% confidence score gets to the actual score. One of the best things about the Best rating system is that, if the algorithm is wrong to put a comment at the top, it will quickly get ore data, since the comment with less data is near the top - and when it gets that data, it will quickly correct the comment's position. The bottom line is that this system means good comments will jump quickly to the top and stay there, and bad comments will hover near the bottom.

How Not To Sort By Average Rating

Problem: You are a web programmer. You have users. Your users rate stuff on your site. You want to put the highest-rated stuff at the top and lowest-rated at the bottom. You need some sort of “score” to sort by.

We need to balance the proportion of positive ratings with the uncertainty of a small number of observations. Fortunately, the math for this was worked out in 1927 by Edwin B. Wilson. What we want to ask is: Given the ratings I have, there is a 95% chance that the real fraction of positive ratings is at least what? Wilson gives the answer. Considering only positive and negative ratings, the lower bound on the proportion of positive ratings is given by:

Use minus where it says plus or minus to calculate the lower bound. Here is the observed fraction of positive ratings, is the quantile of the standard normal distribution, and is the total number of ratings.

The confidence sort algorithm implemented in python:

#Rewritten code from /r2/r2/lib/db/_sorts.pyx

from math import sqrt

def _confidence(ups, downs):
    n = ups + downs

    if n == 0:
        return 0

    z = 1.0 #1.0 = 85%, 1.6 = 95%
    phat = float(ups) / n
    return sqrt(phat+z*z/(2*n)-z*((phat*(1-phat)+z*z/(4*n))/n))/(1+z*z/n)

def confidence(ups, downs):
    if ups + downs == 0:
        return 0
    else:
        return _confidence(ups, downs)

The great thing about confidence sort is that submission time is irrelevant. Comments are ranked by confidence and by data sampling - i.e., the more votes a comment gets the more accurate its score will become. Visualization of comments:

Wilson's score has applications outside of ranking:

Detect spam/abuse: What percentage of people who see this item will mark it as spam?
Create a best of list: What percentage of people who see this item will mark it as best of?
Create a most emailed list: What percentage of people which see this page will click Email?

Given how powerful and simple this is, it's amazing that most sites today use the naive ways to rank their content. This includes billion dollar companies like Amazon.com, which define Average rating = (Positive ratings) / (Total ratings).