Hello, I am an Engineering Manager at Facebook with 13+ years in Ad Technology, Natural Language Processing and Data mining. (Learn More)
by Pravin Paratey

Building a url shortener in 10 mins

This article is a 10-minute guide to building your own url shortener. A url shortener is a service that lets you shorten a long url ex. http://pravin.paratey.com/posts/building-url-shortener to http://goo.gl/q3jEH. Some url shorteners that you may be familiar with are

  1. Goo.gl
  2. Bit.ly
  3. TinyUrl
  4. TinyCC

A shortened url has two parts to it, the domain name (ex. goo.gl) and the url string (ex. q3jEH). In this article we will concentrate on building the second part of the url. Then, we will build a working url shortener in Node.js.

Some mathematics: Set theory

The task is to create an injective, non-surjective function from the domain of full urls to shortened urls. An additional constraint is to minimize the length of the shortened url. As shown in the figure, this class of functions has the properties,

Injective functions

  1. An element in the domain (bubble on the left) maps to only one element in the co-domain (bubble on the right).
  2. There may be elements in the co-domain that do not map to any elements in the domain.

Note: The domain is the list of urls that we have shortened till date. This is NOT the list of ALL the urls on the internet. The elements in this domain will grow and more and more urls are shortened using our service.

The challenge is coming up with a scheme to ensure that this property is maintained. There are a few hashing algorithms which ensure a one-to-one and onto property – among them are the MD5 and SHA series. Do note that MD5 and SHA-1 have a very tiny probability of collisions.

Take a look at the following python code,

>>> import hashlib
>>> url = 'http://paratey.com'
>>> hashlib.md5(url).hexdigest() # 32 chars
'2325ff901c7e76ddc0d78731c11e8152'
>>> hashlib.sha1(url).hexdigest() # 40 chars
'd01b36a809ab5b254377ed29635d3923026f21cf'
>>> hashlib.sha256(url).hexdigest() # 64 chars
'bd5c0e507f9c1f6477a5ccdb591e22ccfb7b610f746dd8a8a5f963bb5b72401a'

These hashing functions are meant to work on arbitrarily long strings (well, long enough for practical purposes) and therefore map to longer hashes to maintain uniqueness.

A look at the url string

A look at RFC 1738 tells you that the valid characters in a url can be,

lowalpha       = "a" | "b" | "c" | "d" | "e" | "f" | "g" | "h" |
                 "i" | "j" | "k" | "l" | "m" | "n" | "o" | "p" |
                 "q" | "r" | "s" | "t" | "u" | "v" | "w" | "x" |
                 "y" | "z"
hialpha        = "A" | "B" | "C" | "D" | "E" | "F" | "G" | "H" |
                 "I" | "J" | "K" | "L" | "M" | "N" | "O" | "P" |
                 "Q" | "R" | "S" | "T" | "U" | "V" | "W" | "X" |
                 "Y" | "Z"
digit          = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" |
                 "8" | "9"
safe           = "$" | "-" | "_" | "." | "+"
extra          = "!" | "*" | "'" | "(" | ")" | ","
national       = "{" | "}" | "|" | "\" | "^" | "~" | "[" | "]" | "`"
punctuation    = "<" | ">" | "#" | "%" | <">
reserved       = ";" | "/" | "?" | ":" | "@" | "&" | "="

Which means in addition to the alphanumeric characters, we have 33 characters (safe, extra, national, punctuation, reserved) to work with, making it a total of 95 characters. If we use all of these characters in a url, a 5 character url string will be able to represent 95**5 or over 7.7 billion unique urls. We will be using the alphanumeric characters only in this tutorial as it is easier for most people to work with these characters. Since there are 62 alphanumeric characters, we can shorten 916 million urls with a 5 character string (62**5).

We can spend time coming up with an awesome function. Or, we could cheat.

Cheating

We are going to use a global auto-increment number to reduce the length of the generated string. This means, the first url shortened will have the “short” url as 00001, the second, 00002 and so on.

Implementation

Take a look at the nodejs implementation below. You can download the code here

/* Node.js Application */
var http = require('http');
var parse = require('url').parse;

var SERVER = '127.0.0.1';
var PORT = 8080;

var id = 0; /* You can start from a non-zero seed */
var url_to_index = new Array();
var short_to_url = new Array();

/* Randomize CHARS if you dont want people to guess the next url generated */
var CHARS = '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHUJKLMNOPQRSTUVWXYZ';

function num_to_base62(n) {
    if(n > 62) {
        return num_to_base62(Math.floor(n / 62)) + CHARS[n % 62];
    } else {
        return CHARS[n];
    }
}

var srv = http.createServer(function(req, res) {
    var param = parse(req.url, true);

    if (param.pathname == '/add') {
        if (param.query.url != undefined) { /* We have a url to add */
            /* Check whether the url has been added before */
            short_url = url_to_index[param.query.url];
            if (short_url == undefined) { /* Nope, not added */
                short_url = num_to_base62(id);
                while (short_url.length < 5) { /* Add padding */
                    short_url = CHARS[0] + short_url;
                }
                url_to_index[param.query.url] = short_url;
                short_to_url[short_url] = param.query.url;
                id++;
            }
            res.writeHead(200, {'Content-Type': 'text/html'});
            /* You may want to drop PORT in production environments */
            short_url_string = 'http://' + SERVER + ':' + PORT +
                               '/' + short_url;
            res.end('Your short url is: <a href="' + short_url_string +
                '">' + short_url_string + '</a>');
        }
    } else { /* Redirect user to the right place */
        long_url = short_to_url[param.pathname.substring(1)];
        if (long_url != undefined) {
            res.writeHead(302, {'Location': long_url});
            res.end();
        } else {
            res.writeHead(404, {'Content-Type': 'text/plain'});
            res.end('404 - Requested url not found');
        }
    }
}).listen(PORT, SERVER);

console.log('Server running at http://' + SERVER + ':' + PORT + '/');

Run this with node urlshortener.js. To add a url, use http://127.0.0.1:8080/add?url=TARGET_URL. For example,

Assumptions

  1. An in-memory javascript array is used here. In production, you will want to store the mappings in a database.

Resources