Hello, I am an Engineering Manager at Facebook with 13+ years in Ad Technology, Natural Language Processing and Data mining. (Learn More)
by Pravin Paratey

Binary heaps & Priority Queues

Today, I’m going to talk about Priority queues. I’ll begin by introducing the binary heap data structure and then move on to how they can be used as a priority queue. Then I will describe one application where you might use them. Or just for fun, I might do everything I described in the reverse order.

If you’ve written a web crawler, your architecture probably looks similar to the figure below.

Typical crawler architecture

  1. You begin with a list of Seed URLs(1) which are passed to the URL Queue(2).
  2. A group of Crawlers(3) listening on the queue then get a subset of the urls to crawl. Depending on your requirements, you may return subsets such that all urls in a subset are from the same domain, from distinct domains or are quite random.
  3. Each crawler will then fetch the page and pass it to the Extractor(5) which would extract the required data and outlinks. The data can be batched to the database(6) and the extracted outlinks pushed into the queue.
  4. The process continues until the Url Queue(2) is empty.

And the binary heap?

Sometimes, you might want the URL queue to function like a priority queue and not a FIFO queue. For instance, suppose you wanted to crawl pages by the number of times that url was linked to (until now), and not in any random order. You could do this by implementing a priority queue! To understand a priority queue, we’ll have to understand a binary heap first.

Binary heaps

Binary heaps come in two forms: Max-heaps and Min-heaps. A max-heap is a binary tree with the property, “All child nodes of a node are smaller than or equal to the parent node”. A min-heap has the opposite property.

We’re going to implement a Max-heap today. Min-heaps are similar. All you need to do is change the inequality signs in the code. The figure below shows what a Max-heap looks like,

Max heap

I’m going to use Python in addition to pseudo-code to illustrate everything. Lets create a simple class and we’ll add to it as we move along.

#!/usr/bin/env python
#
# Pravin Paratey (February 22, 2011)
# Code released under BSD license

class MaxHeap:
    """ This class illustrates a Max-Heap and its associated functions """

    def __init__(self):
        """
        We initialize the heap with some values
                     15
                   __/\__
                  /      \
                 13       9
                _/\_      /\
               /    \    /  \
              5     12  8    7
             / \    /\  |
            4   0  6  2 1
        """
        self.heap = [15, 13, 9, 5, 12, 8, 7, 4, 0, 6, 2, 1]

if __name__ == '__main__':
    m = MaxHeap()

Representing a heap as an array

Since a heap is a lot like a balanced binary tree, it can very easily be stored as an array. This is quite powerful as accessing nodes simply means accessing elements in an array. This greatly simplifies the operations,

  • Left-Child(i): return 2 × i
  • Right-Child(i): return (2 × i + 1)
  • Get-Parent(i): return math.floor(i/2)

Max heap as array

...
    def parent(self, index):
        """
        Parent will be at math.floor(index/2). Since integer division
        simulates the floor function, we don't explicitly use it
        """
        return index / 2

    def left_child(self, index):
        """ 1 is added because array begins at index 0 """
        return 2 * index + 1

    def right_child(self, index):
        return 2 * index + 2

if __name__ == '__main__':
    m = MaxHeap()

Heapify

The first function we’ll look at is Heapify(). This function is responsible for maintaining the heap property of a heap. The Heapify() function assumes that the node index passed to it violates the max-heap property, i.e. it may be smaller than its children. It also assumes that the other nodes satisfy the max-heap property.

Heapify() works by swapping the current node with the largest of its children, then it does a recursive call on the swapped node. The easiest way of explaining this would be to look at the set of diagrams below,

Illustrating Heapify 1

In the figure above, 3 violates the max-heap property. Heapify works under the assumption that the rest of nodes satisfy the max-heap property.

Illustrating Heapify 2

Node 3 is compared with its children - 25 and 20 and the larger node 25 is swapped with node 3.

Illustrating Heapify 3

This process is repeated and node 9 is swapped with node 3 - at which point the tree satisfies the max-heap. This can be written in Python as,

...
    def max_heapify(self, index):
        """
        Responsible for maintaining the heap property of the heap.
        This function assumes that the subtree located at left and right
        child satisfies the max-heap property. But the tree at index
        (current node) does not. O(log n)
        """
        left_index = self.left_child(index)
        right_index = self.right_child(index)

        largest = index
        if left_index < len(self.heap) and self.heap[left_index] > self.heap[index]:
            largest = left_index
        if right_index < len(self.heap) and self.heap[right_index] > self.heap[largest]:
            largest = right_index

        if largest != index:
            self.heap[index], self.heap[largest] = self.heap[largest], self.heap[index]
            self.max_heapify(largest)

This function has a time-complexity of O(log(n))

BuildHeap

This function builds a max-heap from an unordered array. The function begins with the lowest non-leaf nodes and calls Heapify() on them.

...
    def build_max_heap(self):
        """
        Responsible for building the heap bottom up. It starts with the lowest non-leaf nodes
        and calls heapify on them. This function is useful for initialising a heap with an
        unordered array. O(n)
        """
        for i in xrange(len(self.heap)/2, -1, -1):
            self.max_heapify(i)

This is useful if you want to start with an unordered list or numbers (or urls) and want to construct a heap. The time-complexity of this function is O(n).

Heapsort

Lets digress a bit and look at the heap-sort algorithm which has a time complexity of O(nlog(n)). Heapsort can be implemented as follows,

...
    def heap_sort(self):
        """ The heap-sort algorithm with a time complexity O(n log n) """
        self.build_max_heap()
        output = []
        for i in xrange(len(self.heap)-1, 0, -1):
            self.heap[0], self.heap[i] = self.heap[i], self.heap[0]
            output.append(self.heap.pop())
            self.max_heapify(0)
        output.append(self.heap.pop())
        self.heap = output

Priority Queue

Adding an element

Adding an element is quite easy. All you have to do is appending the new element at the end of the array and then work our way up, swapping the current node with its parent (if smaller), until you reach the root or the parent is greater than the current node.

...
    def propagate_up(self, index):
        """ Compares index with parent and swaps node if larger O(log(n)) """
        while index != 0 and self.heap[self.parent(index)] < self.heap[index]:
            self.heap[index], self.heap[self.parent(index)] = 
                self.heap[self.parent(index)], self.heap[index]
            index = self.parent(index)

    def add(self, key):
        """ Adds an element in the heap O(ln(n)) """
        self.heap.append(key)
        self.propagate_up(len(self.heap) - 1) # Index value is 1 less than length

The figures below illustrate the propagate_up() function:

Illustrating Propagate up 1

In the figure above, 40 is a new node that has been added to the heap. Clearly, it violates the heap property.

Illustrating Propagate up 2

40 is compared with its parent 20 and since 40 is greater, the nodes are swapped.

Illustrating Propagate up 3

This process continues until 40 cannot be swapped with its parent (50). At this point you see that the heap property is satisfied for all its nodes.

Removing an element

Removing an element is actually the ExtractMax() function which removes the element with the maximum value (i.e. the root node). This is our priority queue!

...
    def extract_max(self):
        """
        Part of the Priority Queue, extracts the element on the top of the heap
        and then re-heapifies. O(log n)
        """
        max = self.heap[0]
        data = self.heap.pop()
        if len(self.heap) > 0:
            self.heap[0] = data
            self.max_heapify(0)
        return max

Updating an element

Updating an element is done by first locating the element and incrementing it – and like add() – and working our way up to root.

...
    def increment(self, key, value):
        """ Increments key by the input value. O(log n) """
        for i in xrange(len(self.heap)):
            if self.heap[i] == key:
                self.heap[i] += value
                self.propagate_up(i)
                break

In closing, I would like to say that while the above code was quite illustrative, in real life, you will be working with a more involved data structure than an integer at each node. It is quite easy to extend this to work with (occurrence, url) pairs. To see some example code of how this can be done, use the links below.

Source Code