by Pravin Paratey

Summarisation Demo

↑ Genghis Khan ↑ Platypus ↑ United Kingdom ↑ Albert Einstein ↑ Vincent Van Gogh ↑ The Gray Wolf ↑ Doctor Who

Result

How it works

Let me illustrate the summarisation process for the paragraph,

The IPhone is a smartphone developed by Apple Inc. An iPhone can function as a video camera, a camera phone, a portable media player, and an Internet client with email and web browsing capabilities, can send texts and receive visual voicemail, and has both Wi-Fi and 3G connectivity. The user interface is built around the device’s multi-touch screen, including a virtual keyboard rather than a physical one.

1. Understanding the sentence

Disambiguation

In the first stage, the input text is pre-processed to assign a meaning to each word. Words with multiple meanings - like the word “Apple” - which could stand for Apple, the fruit or Apple Inc., the company are disambiguated depending on the context of the sentence it appears in and the context of the input text.

In the adjacent figure, the word IPhone has been used to disambiguate the Apple.

2. Identify the important concepts

Once words have been assigned a meaning, important concepts are identified. A concept is a collection of words which define an idea. For example, the concept IPhone will contain words like (iPhone, Jobs, mobile, communication, A4, cell, 3G, …)

The IPhone is a smartphone developed by Apple Inc.

3. Find the sentences with the maximum coverage and minimum overlap

S1 S2 S3 S4 S5

Next, the sentences are by concept coverage. Then sets are formed such that the members of the set fully cover the set of concepts extracted in step 2. For the example in the adjoining figure, the sets {(S3, a)}, {(S5, b), (S1, c)}, {(S1, c), (S4, d), (S2, e)} will be created. The letters [a, e] denote a confidence factor which is used to determine which set to output.

by Pravin Paratey

Going Text-mode

For many years now, I have been using the console to do most of my tasks. In this post, I will enumerate all the tools and applications I use to make my life easier on the terminal.

Terminal Emulator

The most important application, when going text-mode is the terminal emulator. This is where I spend most of my time and choosing the right one makes the experience so much smoother.

On any platform, I look for

  1. Split Windows - This is massively important to me as I work on multiple remote servers and I need to see activity on all of them simultaneously. Using keyboard shortcuts lets me navigate to the pane I am interested in, maximise it, work and then restore it back to its original size.
  2. Keyboard shortcuts - Without keyboard shortcuts to navigate between panes and maximise them, any terminal emulator is, quite frankly, useless.
  3. Transparency support - Another important factor. Most people don’t realise that setting the transparency to something between 80-90% means not only can you work comfortably in the current window, but you can see what’s happening under it as well! I usually have 2 iTerms/Terminators on top of each other. The one in the background has its panes constantly spewing out status updates, while the one on top is where I code.
  4. 256 colours - Mesa like pretty colours. You can enable 256 colour support in vim by adding set t_Co=256 to your .vimrc.

On a Mac

When I am on a Mac, nothing quite beats iTerm. Although the terminal application in Mac OS Lion has improved a lot since Snow Leopard, it’s still nowhere close to iTerm.

iTerm

On Linux

I use Terminator which can be installed by issuing a sudo apt-get install terminator on Debian-derived systems. It has most of the features of iTerm, including tabbed and split-screen windows and customizable keyboard shortcuts.

On Windows

When I am on Windows, I use a combination of Cygwin and Putty (with Paegent). The thing I miss the most is the ability to split terminals like iTerm or Terminator. I get around it by tiling windows but it’s still not the same thing.

Editor

Vim wins :-).

iTerm

Web Browser

I use either w3m or lynx although I am a little partial to the latter as I started using it in 1997.

w3m Web browser

Email client

While mutt is more powerful, the simplicity of alpine and the fact that it is easy to set up multiple Gmail accounts without jumping though the hoops of fetchmail and sendmail means I use alpine more often. I do miss mutt though and sometimes sneak in some time with it.

iTerm

Notes

Use ssh-agent to save your private keys

ssh-agent lets you save your password once per session. This is really handy if you have to constantly ssh into multiple servers or pull/push git/hg repositories. To use it, first run

1
2
3
4
pravin@pravin-pc:~$ ssh-agent 
SSH_AUTH_SOCK=/tmp/ssh-bLAqHxf21857/agent.21857; export SSH_AUTH_SOCK;
SSH_AGENT_PID=21858; export SSH_AGENT_PID;
echo Agent pid 21858;

This done, add your ssh keys by running this command once for each key

1
2
3
pravin@pravin-pc:~$ ssh-add 
Enter passphrase for /home/pravin/.ssh/id_rsa: 
Identity added: /home/pravin/.ssh/id_rsa (/home/pravin/.ssh/id_rsa)

.

by Pravin Paratey

Setting up Replica sets in MongoDB

Are you ready to set up replica sets in under 5 minutes? Set your stopwatches. Ready, set, go!

Minute 0: Overview of the servers

We are going to use replica sets consisting of 3 nodes which looks something like this,

replica set with 3 nodes

Minute 1: Downloading and installing MongoDB

The latest instructions to download and install Mongo can be found here. For me, running on a CentOS machine, all I had to do was add the 10gen repository, followed by a sudo yum install mongo-10gen mongo-10gen-server. To add the 10gen repository, adding the following lines to /etc/yum.repos.d/10gen.repo

[10gen]
name=10gen Repository
baseurl=http://downloads-distro.mongodb.org/repo/redhat/os/x86_64
gpgcheck=0
$ sudo yum install mongo-10gen mongo-10gen-server

Do this on all your servers.

Minute 2: Editing the configuration file

Next, edit the configuration file which is present at /etc/mongodb.conf and add the two lines to it

bind_ip = 192.168.1.1 # Public or Internal IP of the server. This will change with each server.
replSet = london # This string should be the same on all 3 servers

Again, this should be done on all servers.

Minute 3: Starting Mongo

Start Mongo on all servers by running

$ sudo service mongod start

On Debian systems, you will have to run the mongodb service instead,

$ sudo service mongodb start

Minute 4: Adding nodes to the primary

Pick any machine and enter the Mongo CLI by typing mongo IP.Of.Server.One. This will be your primary machine (until the cluster decides to elect another as its primary).

Next, run rs.initiate() on the machine. This will initialise the node to use replica sets. Then run rs.add("IP.Of.Server.Two") and rs.add("IP.Of.Server.Three") to add the other two nodes to the replica set.

$ mongo IP.Of.Server.One
> rs.initiate();
> rs.add("IP.Of.Server.Two");
> rs.add("IP.Of.Server.Three");
> rs.status();

That’s it! We’re done. That was easy, wasn’t it?

Removing a member of the replica set

Removing a node is as easy as typing

> rs.remove("IP.Of.Server.To.Remove:port")
by Pravin Paratey

Super quick Find & Replace

Today, we are going to explore how to perform a fast search/replace over a file when the number of strings to search and replace is large. This means, you will be given a set of source and target words like in the table below,

Source(Find) Target (Replace)
motives oranges
whale watermelon
idea bulk
himself itself

And an input file like,

Chief among these motives was the overwhelming idea of the great whale himself. Such a portentous and mysterious monster roused all my curiosity.

The task is to find the best algorithm to replace all the source words in the input text with the target words. For the above input the output will be,

Chief among these oranges was the overwhelming bulk of the great watermelon itself. Such a portentous and mysterious monster roused all my curiosity.

As always, you can get the complete source code and the test files for this article from the sidebar.

Naive solution

The easiest way of implementing this would be to search the text for every word in the source list and replace it by the word in the target list. Example code for this would look like,

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
#!/usr/bin/env python

import sys, time, hashlib

class Naive:
    """ Naive implementation of find/replace for a massive word list. """
    def __init__(self, filename):
        self._words = {}
        fp = open(filename)
        for line in fp:
            word, replacement = line.strip().split()
            self._words[word] = replacement
        fp.close()

    def replace_all(self, filename):
        fp = open(filename)
        text = fp.read()
        fp.close()

        for word in self._words.keys():
            text = text.replace(word + ' ', self._words[word] + ' ')
        return text

if __name__ == '__main__':
    if len(sys.argv) < 2:
        print "Usage: python %s <find-replace-list> <file1> <file2>" % sys.argv[0]
        sys.exit(1)

    n = Naive(sys.argv[1])
    for filename in sys.argv[2:]:
        start = time.time()
        print filename, hashlib.md5(n.replace_all(filename)).hexdigest(), time.time() - start
        #print n.replace_all(filename)

Using Regular expressions

If the words to find are single alphanumeric words, you can use a regular expression (or even a string.split()) to find all words and replace them in one pass. This is much faster if the wordlist » input text.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
#!/usr/bin/env python

import sys, re, time, hashlib

class Regex:
    """ Regex implementation of find/replace for a massive word list. """
    def __init__(self, filename):
        self._words = {}
        fp = open(filename)
        for line in fp:
            word, replacement = line.strip().split()
            self._words[word] = replacement
        fp.close()

    def replace_func(self, matchObj):
        key = matchObj.group(0)
        if self._words.has_key(key):
            return self._words[key]
        else:
            return key

    def replace_all(self, filename):
        fp = open(filename)
        text = fp.read()
        fp.close()
        return re.sub("[a-zA-Z]+", self.replace_func, text)
        

if __name__ == '__main__':
    if len(sys.argv) < 2:
        print "Usage: python %s <find-replace-list> <file1> <file2>" % sys.argv[0]
        sys.exit(1)

    r = Regex(sys.argv[1])
    for filename in sys.argv[2:]:
        start = time.time()
        print filename, hashlib.md5(r.replace_all(filename)).hexdigest(), time.time() - start
        #print r.replace_all(filename)

{: class="prettyprint linenums:true”}

Using a Trie

A trie is a data structure specifically tuned to this task. A trie is an ordered tree that looks like so,

Example Trie{:class="pure-img”}

In the above figure, the words “apple”, “ape” and “zoo” have been shown.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
#!/usr/bin/env python

import sys, time, hashlib

class Trie:
    """ 
    Trie class

    We use a tuple (string, dict) to represent a node. That is, the words 

    1. ant -> spider (replace every occurrence of ant with spider)
    2. any -> every

    will be represented as,

    (None, {
        a: (None, {
            n: (None, {
                t: ('spider', {}),
                y: ('every', {})
            }),
        })
    })
    """
    def __init__(self, dict_file):
        self._trie = (None, {})
        fp = open(dict_file)
        for line in fp:
            words = line.strip().split()
            self.add(words[0], words[1])
        fp.close()
    
    def add(self, text, replacement, node=None):
        """ Adds word/replacement pair """
        if node == None: node = self._trie

        new_replacement, children, c = None, node[1], text[0]

        # New id is the id for leaf nodes, None otherwise
        if len(text) == 1: new_replacement = replacement

        if c not in children:
            children[c] = (new_replacement, {})
        elif len(text) == 1: # Leaf node
            (old_replacement, old_node) = children[c]
            children[c] = (new_replacement, old_node)
        if len(text) > 1:
            self.add(text[1:], replacement, node[1][c])

    
    def replace_all(self, text):
        output = []
        max_len, start = len(text), 0

        while start < max_len:
            (replacement, cur_node) = self._trie
            end = start
            while end < max_len and cur_node.has_key(text[end]):
                (replacement, cur_node) = cur_node[text[end]]
                end += 1
            
            # If we are not at the beginning of the string, append prev char (whitespace)
            if start != 0: output.append(text[start-1])

            # Replacement found AND end not exceeded max-len AND the word has ended
            if replacement != None and end <= max_len and (end == max_len or not text[end].isalpha()): 
                output.append(replacement)
            else:
                output.append(text[start:end])
            
            start = end + 1
        if end != max_len: output.append(text[end])
        return ''.join(output)

if __name__ == '__main__':
    if len(sys.argv) < 2:
        print "Usage: python %s <find-replace-list> <file1> <file2>" % sys.argv[0]
        sys.exit(1)

    t = Trie(sys.argv[1])
    for filename in sys.argv[2:]:
        start = time.time()
        fp = open(filename)
        data = fp.read()
        fp.close()
        print filename, hashlib.md5(t.replace_all(data)).hexdigest(), time.time() - start
        #print t.replace_all(data)

{: class="prettyprint linenums:true”}

Results

# words % found Naive (seconds) Regex (seconds) Trie (seconds)
10 10% 0.000076 0.000054 0.000190
50% 0.000050 0.000068 0.000159
100% 0.000049 0.000065 0.000211
——— ——— —————– —————– —————-
100 10% 0.000338 0.000680 0.000987
50% 0.000397 0.000287 0.000878
100% 0.000389 0.000268 0.000635
——— ——— —————– —————– —————-
1000 10% 0.020482 0.003205 0.008794
50% 0.020797 0.001782 0.007492
100% 0.022443 0.001600 0.006307
——— ——— —————– —————– —————-
10,000 10% 1.992886 0.017261 0.085929
50% 2.053409 0.015645 0.082155
100% 2.163350 0.017048 0.069294
——— ——— —————– —————– —————-
100,000 10% 213.5203 0.193326 0.910784
50% 226.2446 0.182818 0.843763
100% 254.2690 0.198825 0.755635

Huh? Isn’t the Trie supposed to be faster?

Yes, a similar implementation of Naive, Regex and Trie in a language like C/C++ will always have the Trie outperforming the other two implementations. However, in Python, the overheads of implementing the Trie using a dictionary and a tuple make it rather slow compared to the Regex implementation (which has been written and optimised in C).

If I do get around to it, I will re-implement the above code in C++.

Note: Generating a fair input file

I have generated test files to test the performance of our algorithms against the following dimensions,

  1. Size of (find, replace) pairs.
  2. Size of the input file.
  3. Percentage occurrence - This is the percentage of times a word from the “find” set occurs in the input file. We will be testing for < 10%, 50%, > 90% occurrence.

Resources

Download the source code for this article (including the fair input file generator)

by Pravin Paratey

Set data structures & Implementation

Sets are an important mathematical concept. A set can be defined as a collection of distinct elements. In mathematical literature, curly braces are used to denote a set. So, a set S containing the elements 1, 2 and 3 is denoted by S = {1, 2, 3}.

Set illustration

A set supports the following basic operations,

  1. Union - The union of two sets is a set that contains elements from both sets. For instance, given,
    • S1 = {1, 2, 3} and
    • S2 = {3, 4, 5}
    • S1 ∪ 2 = {1, 2, 3, 4, 5}
  2. Intersection - The intersection of two sets is a set that contains elements common to both sets. For instance, given,
    • S1 = {1, 2, 3} and
    • S2 = {3, 4, 5}
    • S1 ∩ 2 = {3}
  3. Difference - The difference of two sets is a set that contains elements from the first set with the elements common to both sets removed. Given,
    • S1 = {1, 2, 3} and
    • S2 = {3, 4, 5}
    • S1 − 2 = {1, 2}

We also have the following additional operations,

  1. IsMember - Tells you if an element is a member of the set. It is denoted by the ∈ symbol. So,
    • 1 ∈ {1, 2, 3} but
    • 4 ∉ {1, 2, 3}

Implementing sets

Let’s look at the different ways we can implement the Set data structure and analyse each implementation.

Bit-vectors

Bit vectors are an array of bits (x1, x2, … xn) such that xi = 1 if i is a member of the set. For example,

S1 = {1, 2, 3} will be represented as 11100 while
S2 = {3, 4, 5} will be represented as 00111.

Advantages

  1. IsMember(), Add() and Delete() are O(1) operations.
  2. Union(), Intersection() and Difference() are at most O(N) where N is the size of the universe.
  3. For non-sparse sets, bit-vectors can represent the set quite tightly.
  4. Bit operations are typically faster.

Disadvantages

  1. Takes too much memory for large universes.
  2. Needs a mapping function for non-integer members.

Linked-list

A linked list implementation keeps a sorted list of elements for every set. This means,

S1 = {1, 2, 3} will be stored as 1 -> 2 -> 3 while
S2 = {3, 4, 5} will be represented as 3 -> 4 -> 5.

Properties

  1. IsMember() is at most O(n) operation where n is the size of the set.
  2. Union(), Intersection(), Difference(), Add() and Delete() are take at most O(mn) where m is the size of the first set and n is the size of the second set.

Advantages

  • For very large universes, this outperforms a Bit-vector implementation.

Disadvantages

  • Typically the slowest implementation compared to the rest.

Hashtables

A set can also be implemented as a hashmap. The time complexity of the various functions is dependent on the particular implementation of a hashtable.

Properties

  1. IsMember() is typically a O(1) operation.
  2. Add() and Difference() are typically O(n) where n is the size of the second set.
  3. Union() and Intersection() are typically O(mn) where m is the size of the first set and n is the size of the second one.

Ordered trees

Sets can also be implemented as ordered trees. Priority Queues are an example of an ordered tree. A Binary search tree implementation of a set would give you the following properties,

Properties

  1. IsMember() is an O(log n) operation.
  2. Add(), Difference(), Union() and Intersection() are O(log n).

Applications

  1. Information retrieval - Inverted index.
  2. AND & OR queries.
by Pravin Paratey

Manage your wireless with nmcli

nmcli is a command line tool for controlling NetworkManager and reporting on its status. It is meant to be used as an alternative to nm-applet for headless servers and by power users.

Listing

To list all wireless networks in range,

$ nmcli -p dev wifi list

===================================================================
                         WiFi scan list
===================================================================
SSID          BSSID               MODE             FREQ      
-------------------------------------------------------------------
'Wireless-1'  B8:E6:25:37:13:XX   Infrastructure   2462 MHz  
'Wireless-2'  02:03:D8:08:A3:XX   Infrastructure   2412 MHz  

Note: Some of the columns have been stripped from the above table.

To list all the known wireless networks (networks that you have previously connected to), run

$ nmcli con

NAME            UUID                                   TYPE              TIME
Wireless-1      28d6c265-xxxx-4e83-907f-ecb5ab3ac37c   802-11-wireless   Thu 
Wired-Network   30d29da3-xxxx-4ea2-94ff-0edac8954ff7   802-3-ethernet    Sun 
Wireless-2      89f31b44-xxxx-4b7d-abb1-8242a1fa7040   802-11-wireless   Thu 
Wireless-3      6adcb4e8-xxxx-4e88-bf50-872d9e5eb1f3   802-11-wireless   Fri 
Wireless-4      8c4fc701-xxxx-472e-aecc-40131c0d8d31   802-11-wireless   Fri 

Connecting to your network

Once you locate the network that you want to connect to, copy its BSSID and run the command,

$ nmcli -p dev wifi connect "00:03:D8:0A:11:XX"

For a known network, you can use the UUID to connect using,

$ nmcli con up uuid 28d6c265-xxxx-4e83-907f-ecb5ab3ac37c

Deleting a known network

To delete a known network,

$ nmcli con delete uuid "30d29da3-xxxx-4ea2-94ff-0edac8954ff7"

by Pravin Paratey

Binary heaps & Priority Queues

Today, I’m going to talk about Priority queues. I’ll begin by introducing the binary heap data structure and then move on to how they can be used as a priority queue. Then I will describe one application where you might use them. Or just for fun, I might do everything I described in the reverse order.

If you’ve written a web crawler, your architecture probably looks similar to the figure below.

Typical crawler architecture

  1. You begin with a list of Seed URLs(1) which are passed to the URL Queue(2).
  2. A group of Crawlers(3) listening on the queue then get a subset of the urls to crawl. Depending on your requirements, you may return subsets such that all urls in a subset are from the same domain, from distinct domains or are quite random.
  3. Each crawler will then fetch the page and pass it to the Extractor(5) which would extract the required data and outlinks. The data can be batched to the database(6) and the extracted outlinks pushed into the queue.
  4. The process continues until the Url Queue(2) is empty.

And the binary heap?

Sometimes, you might want the URL queue to function like a priority queue and not a FIFO queue. For instance, suppose you wanted to crawl pages by the number of times that url was linked to (until now), and not in any random order. You could do this by implementing a priority queue! To understand a priority queue, we’ll have to understand a binary heap first.

Binary heaps

Binary heaps come in two forms: Max-heaps and Min-heaps. A max-heap is a binary tree with the property, “All child nodes of a node are smaller than or equal to the parent node”. A min-heap has the opposite property.

We’re going to implement a Max-heap today. Min-heaps are similar. All you need to do is change the inequality signs in the code. The figure below shows what a Max-heap looks like,

Max heap

I’m going to use Python in addition to pseudo-code to illustrate everything. Lets create a simple class and we’ll add to it as we move along.

#!/usr/bin/env python
#
# Pravin Paratey (February 22, 2011)
# Code released under BSD license

class MaxHeap:
    """ This class illustrates a Max-Heap and its associated functions """

    def __init__(self):
        """
        We initialize the heap with some values
                     15
                   __/\__
                  /      \
                 13       9
                _/\_      /\
               /    \    /  \
              5     12  8    7
             / \    /\  |
            4   0  6  2 1
        """
        self.heap = [15, 13, 9, 5, 12, 8, 7, 4, 0, 6, 2, 1]

if __name__ == '__main__':
    m = MaxHeap()

Representing a heap as an array

Since a heap is a lot like a balanced binary tree, it can very easily be stored as an array. This is quite powerful as accessing nodes simply means accessing elements in an array. This greatly simplifies the operations,

  • Left-Child(i): return 2 × i
  • Right-Child(i): return (2 × i + 1)
  • Get-Parent(i): return math.floor(i/2)

Max heap as array

...
    def parent(self, index):
        """
        Parent will be at math.floor(index/2). Since integer division
        simulates the floor function, we don't explicitly use it
        """
        return index / 2

    def left_child(self, index):
        """ 1 is added because array begins at index 0 """
        return 2 * index + 1

    def right_child(self, index):
        return 2 * index + 2

if __name__ == '__main__':
    m = MaxHeap()

Heapify

The first function we’ll look at is Heapify(). This function is responsible for maintaining the heap property of a heap. The Heapify() function assumes that the node index passed to it violates the max-heap property, i.e. it may be smaller than its children. It also assumes that the other nodes satisfy the max-heap property.

Heapify() works by swapping the current node with the largest of its children, then it does a recursive call on the swapped node. The easiest way of explaining this would be to look at the set of diagrams below,

Illustrating Heapify 1

In the figure above, 3 violates the max-heap property. Heapify works under the assumption that the rest of nodes satisfy the max-heap property.

Illustrating Heapify 2

Node 3 is compared with its children - 25 and 20 and the larger node 25 is swapped with node 3.

Illustrating Heapify 3

This process is repeated and node 9 is swapped with node 3 - at which point the tree satisfies the max-heap. This can be written in Python as,

...
    def max_heapify(self, index):
        """
        Responsible for maintaining the heap property of the heap.
        This function assumes that the subtree located at left and right
        child satisfies the max-heap property. But the tree at index
        (current node) does not. O(log n)
        """
        left_index = self.left_child(index)
        right_index = self.right_child(index)

        largest = index
        if left_index < len(self.heap) and self.heap[left_index] > self.heap[index]:
            largest = left_index
        if right_index < len(self.heap) and self.heap[right_index] > self.heap[largest]:
            largest = right_index

        if largest != index:
            self.heap[index], self.heap[largest] = self.heap[largest], self.heap[index]
            self.max_heapify(largest)

This function has a time-complexity of O(log(n))

BuildHeap

This function builds a max-heap from an unordered array. The function begins with the lowest non-leaf nodes and calls Heapify() on them.

...
    def build_max_heap(self):
        """
        Responsible for building the heap bottom up. It starts with the lowest non-leaf nodes
        and calls heapify on them. This function is useful for initialising a heap with an
        unordered array. O(n)
        """
        for i in xrange(len(self.heap)/2, -1, -1):
            self.max_heapify(i)

This is useful if you want to start with an unordered list or numbers (or urls) and want to construct a heap. The time-complexity of this function is O(n).

Heapsort

Lets digress a bit and look at the heap-sort algorithm which has a time complexity of O(nlog(n)). Heapsort can be implemented as follows,

...
    def heap_sort(self):
        """ The heap-sort algorithm with a time complexity O(n log n) """
        self.build_max_heap()
        output = []
        for i in xrange(len(self.heap)-1, 0, -1):
            self.heap[0], self.heap[i] = self.heap[i], self.heap[0]
            output.append(self.heap.pop())
            self.max_heapify(0)
        output.append(self.heap.pop())
        self.heap = output

Priority Queue

Adding an element

Adding an element is quite easy. All you have to do is appending the new element at the end of the array and then work our way up, swapping the current node with its parent (if smaller), until you reach the root or the parent is greater than the current node.

...
    def propagate_up(self, index):
        """ Compares index with parent and swaps node if larger O(log(n)) """
        while index != 0 and self.heap[self.parent(index)] < self.heap[index]:
            self.heap[index], self.heap[self.parent(index)] = 
                self.heap[self.parent(index)], self.heap[index]
            index = self.parent(index)

    def add(self, key):
        """ Adds an element in the heap O(ln(n)) """
        self.heap.append(key)
        self.propagate_up(len(self.heap) - 1) # Index value is 1 less than length

The figures below illustrate the propagate_up() function:

Illustrating Propagate up 1

In the figure above, 40 is a new node that has been added to the heap. Clearly, it violates the heap property.

Illustrating Propagate up 2

40 is compared with its parent 20 and since 40 is greater, the nodes are swapped.

Illustrating Propagate up 3

This process continues until 40 cannot be swapped with its parent (50). At this point you see that the heap property is satisfied for all its nodes.

Removing an element

Removing an element is actually the ExtractMax() function which removes the element with the maximum value (i.e. the root node). This is our priority queue!

...
    def extract_max(self):
        """
        Part of the Priority Queue, extracts the element on the top of the heap
        and then re-heapifies. O(log n)
        """
        max = self.heap[0]
        data = self.heap.pop()
        if len(self.heap) > 0:
            self.heap[0] = data
            self.max_heapify(0)
        return max

Updating an element

Updating an element is done by first locating the element and incrementing it – and like add() – and working our way up to root.

...
    def increment(self, key, value):
        """ Increments key by the input value. O(log n) """
        for i in xrange(len(self.heap)):
            if self.heap[i] == key:
                self.heap[i] += value
                self.propagate_up(i)
                break

In closing, I would like to say that while the above code was quite illustrative, in real life, you will be working with a more involved data structure than an integer at each node. It is quite easy to extend this to work with (occurrence, url) pairs. To see some example code of how this can be done, use the links below.

Source Code

by Pravin Paratey

Natural language Processing Resources

nlp

This page contains a list of resources for people interested in NLP. If you would like to add to this page, please email me.

Papers

Sentiment Analysis

  1. Annotating Expressions of Opinions and Emotions in Language _by Janyce Wiebe, Theresa Wilson & Claire Cardie._
    This paper describes a corpus annotation project to study issues in the manual annotation of opinions, emotions, sentiments, speculations, evaluations and other private states in language. The resulting corpus annotation scheme is described, as well as examples of its use.

  2. The Importance of Neutral Examples for Learning Sentiment by Moshe Koppel and Jonathan Schler.
    Most research on learning to identify sentiment ignores “neutral” examples, learning only from examples of significant (positive or negative) polarity. This paper shows that neutral examples can be used to increase the accuracy even further.

Software

Multi-functional

  1. Natural Language Toolkit - NLTK provides a various python modules to perform a range of text processing ranging from PoS-tagging and chunking to classifying and clustering.

Part-of-speech Taggers

  1. Yoshimasa Tsuruoka POS Tagger - Widely considered as the fastest, it boasts of a tagging speed of 2400 tokens/second and an accuracy of 97.10% (on the WSJ corpus).

  2. Senna - Senna is a C library which lets you do POS-tagging, Chunking, Named entity recognition and Semantic role labeling.

  3. Stanford PoS Tagger - The most famous PoS tagger out there.

  4. Malt Tagger

Corpus

  1. Wordnets - A Wordnet is a graph of words which let you extract relations like meronymy, hyponym, antonyms, etc between words.

  2. Geo-tagged Microblog corpus - This page provides a link to a dataset containing a sample of geo-tagged microblog data, for use in academic research. The dataset is described in this paper.

  3. Climate dataset - This page contains datasets related to climate. It contains emission numbers for various categories ranging from Aircraft emissions to the breed of a dog.

  4. Treebanks - A Treebank is an annotated corpus. They are used to test or train custom parsers.

by Pravin Paratey

Recursive functions, Stack overflows and Python

Today I was working on a problem that involved a Depth-first search on a graph. The graph had a cycle and I hit the maximum recursion depth for Python (which is 1000).

RuntimeError: maximum recursion depth exceeded

Although I fixed the problem (the graph wasn’t supposed to have cycles), it got me thinking on how Python handles recursion.

Recursion

In short, a recursive function is one which calls itself. A set of functions are recursive if they form a cycle:

Graphical representation of recursion

Head-recursion and Tail-recursion

A recursive function can either be head-recursive or tail-recursive. A function is tail-recursive if the very last thing it does is make a recursive call. Everything else is head-recursive.

A tail-recursive function is of particular interest as it can be transformed by a smart compiler into an iterative loop i.e. a tail-recursive function can be transformed to use constant stack space. This means you don’t have a limit to number of recursive calls you can make.

Unfortunately Python does not automatically optimize tail-recursive calls. Reading what Guido van Rossum had to say on this subject got me thinking even further. Does Python really give me enough power to rewrite recursive functions in a more “Pythonic” way?

Example

Lets look at an example. The following code is a recursive function that prints the factorial of a number:

1
2
3
4
5
6
7
8
9
#!/usr/bin/env python

def factorial(num):
    if num == 0 or num == 1:
        return 1
    else:
        return num * factorial(num - 1)

print factorial(1000)

While factorial(10) prints 3628800, factorial(1000) results in,

  File "factorial.py", line 7, in factorial1
    return num * factorial1(num - 1)
  File "factorial.py", line 7, in factorial1
    return num * factorial1(num - 1)
  File "factorial.py", line 7, in factorial1
    return num * factorial1(num - 1)
  File "factorial.py", line 7, in factorial1
    return num * factorial1(num - 1)
  File "factorial.py", line 7, in factorial1
    return num * factorial1(num - 1)
  File "factorial.py", line 7, in factorial1
    return num * factorial1(num - 1)
  File "factorial.py", line 7, in factorial1
    return num * factorial1(num - 1)
  File "factorial.py", line 7, in factorial1
    return num * factorial1(num - 1)
  File "factorial.py", line 7, in factorial1
    return num * factorial1(num - 1)
  File "factorial.py", line 7, in factorial1
    return num * factorial1(num - 1)
RuntimeError: maximum recursion depth exceeded

factorial() is head-recursive because the result of factorial(num - 1) is multiplied by num before returning. This means that the state of this function has to be saved to the stack before its child is called.

Using decorators

You can use decorators to eliminate tail recursion.

Using reduce

I re-wrote the factorial function using reduce and lambda functions.

1
2
3
4
5
6
#!/usr/bin/env python

def factorial(num):
    return reduce(lambda x, y: x*y, xrange(1,num), 1)

print factorial(1000)

Although not all functions can be re-written this way, Python does have a few tricks up its sleeve. (I know it ends quite abruptly. I kind of lost my train of thought)

by Pravin Paratey

Lesson 5 - Adding standard GUI elements

In this lesson, we’ll learn how to add standard GUI elements like the statusbar and the toolbar. We’ll learn about the Common Controls library - which let us add GUI elements like the RichEdit box, the Treeview control, the Listbox/Listview controls and more! Click here to view the full list.

Introducing Common Controls

Windows includes a library called the Common Controls library which lets you add many Windows GUI elements. These elements aren’t part of the core Windows API but are available through a dll - Comctl32.dll.

Before using any UI element from Comctl32.dll, we must load the dll. This can be conveniently done by calling the InitCommonControlsEx() function in MainWindow.cpp:

10
11
12
13
14
15
#include <windows.h>
#include <windowsx.h>
#define _WIN32_IE 0x0500 // To support INITCOMMONCONTROLSEX
#include <commctrl.h>
#include "MainWindow.h"
#include "AboutDialog.h"

First, we include the Commctrl.h, and then add the following code to MainWindow.cpp

 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
bool MainWindow::Run(int nCmdShow)
{
    if(!RegisterClassEx(&m_wndClass))
        return false;

    // Initialize Common controls
    INITCOMMONCONTROLSEX icx;
    // Ensure common control DLL is loaded
    icx.dwSize = sizeof(INITCOMMONCONTROLSEX);
    icx.dwICC = ICC_BAR_CLASSES; // Specify BAR classes
    InitCommonControlsEx(&icx); // Load the common control DLL

    m_hwnd = CreateWindowEx(

Before you hit build, you must add comctl32.lib to your linker path. This is needed to resolve function references as Comctl32.dll is an external library. To do this, go to Project > Build options, select the Linker settings tab, click the Add button and enter libcomctl32.a as illustrated in the figure below.

Adding comctl32.lib

Adding a statusbar

Adding a statusbar is extremely easy. Commctrl.h contains the function CreateStatusWindow() to create a statusbar. So where do we call this function? In lesson 3, we learnt about the Windows Message Queue and how Windows uses it to notify our application of events.

WM_CREATE

We are going to look at two messages that Windows passes to our application. The first is the WM_CREATE. This message is posted to the Message Queue of a window after it has been successfully created. This is where you would want to add any UI elements that you want to create.

27
28
29
// MainWindow.h
    HWND m_hwnd;
    static HWND m_hStatusbar;
19
20
21
// MainWindow.cpp
HINSTANCE MainWindow::m_hInstance = NULL;
HWND MainWindow::m_hStatusbar;

Lets create a separate function to draw the statusbar,

125
126
127
128
129
130
131
132
133
134
135
// Creates the toolbars and statusbar
// Parameters:
//  cs - Contains initialization parameters
// Returns:
//  void
bool MainWindow::OnCreate(HWND hwnd, LPCREATESTRUCT lpcs)
{
    // Create Statusbar
    MainWindow::m_hStatusbar = CreateStatusWindow(WS_CHILD|WS_VISIBLE, "Ready", hwnd, IDC_STATUSBAR);
    return true;
}

Define IDC_STATUSBAR as 102 in resource.h

WM_SIZE

Try resizing the window now. Did you notice that the statusbar does not resize but retains its original size and position?

Breaks on resize

This can be fixed by handing the WM_SIZE message. WM_SIZE is passed to your application whenever the user resizes a window, minimizes or maximizes it.

54
55
56
57
58
59
60
    switch (msg)
    {
    case WM_SIZE:
        // Resize the statusbar;
        SendMessage(MainWindow::m_hStatusbar,msg,wParam,lParam);
        break;
    case WM_DESTROY:

The statusbar is also a Window which can send and receive messages. We could’ve overridden its message handling proc had we wanted to add custom functionality. For now, we just pass the WM_SIZE message and its parameters to the Statusbar. The statusbar is smart enough to know how to resize itself within its parent container.

50
51
52
53
54
55
56
57
LRESULT CALLBACK MainWindow::MainWndProc (HWND hwnd, UINT msg, WPARAM wParam, LPARAM lParam)
{
    switch (msg)
    {
    case WM_SIZE:
        // Resize the statusbar
        SendMessage(MainWindow::m_hStatusbar,msg,wParam,lParam);
        break;

Adding toolbars

We are going to add two toolbars to our application. A standard toolbar containing buttons like [New, Open, Save] and a paint toolbar which will contain buttons like [Draw Line, Draw Square, etc]. The first thing to do is add the HWND variables (remember a toolbar is just a specialised Window!) to MainWindow.cpp and MainWindow.h,

19
20
21
22
HINSTANCE MainWindow::m_hInstance = NULL;
HWND MainWindow::m_hStatusbar = NULL;
HWND MainWindow::m_hMainToolbar = NULL;
HWND MainWindow::m_hPaintToolbar = NULL;
28
29
30
31
    static HWND m_hStatusbar;
    static HWND m_hMainToolbar;
    static HWND m_hPaintToolbar;
    static char m_szClassName[];

Lets head to our OnCreate method and add code to create a toolbar. This involves,

  1. Creating a toolbar window and specifying it’s style.
  2. Creating an Imagelist to hold the images visible on the toolbar buttons.
  3. Creating a TBBUTTON structure which specify the buttons and their properties.
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
bool MainWindow::OnCreate(HWND hwnd, LPCREATESTRUCT lpcs)
{
    const int numbuttons1 = 4;
    // Create Main Toolbar
    MainWindow::m_hMainToolbar = CreateWindowEx(
                                    0, TOOLBARCLASSNAME, NULL,
                                    WS_CHILD | TBSTYLE_FLAT,
                                    0, 0, 0, 0,
                                    hwnd, NULL, MainWindow::m_hInstance, NULL);

    HIMAGELIST hImageList1 = ImageList_Create(
            16, 16,                 // 16x16 button size
            ILC_COLOR16 | ILC_MASK, // ILC_MASK ensures transparent background
            numbuttons1, 0); 
    // Set the image list.
    SendMessage(MainWindow::m_hMainToolbar, TB_SETIMAGELIST, (WPARAM)0,
        (LPARAM)hImageList1);

    // Load the button images.
    SendMessage(MainWindow::m_hMainToolbar, TB_LOADIMAGES, 
        (WPARAM)IDB_STD_SMALL_COLOR, (LPARAM)HINST_COMMCTRL);

    TBBUTTON tbButtons1[numbuttons1] = {
        {MAKELONG(STD_FILENEW, 0), IDM_FILE_NEW, TBSTATE_ENABLED,
            BTNS_AUTOSIZE, {0}, 0, 0},
        {MAKELONG(STD_FILEOPEN, 0), IDM_FILE_OPEN, TBSTATE_ENABLED,
            BTNS_AUTOSIZE, {0}, 0, 0},
        {MAKELONG(STD_FILESAVE, 0), IDM_FILE_SAVE, TBSTATE_ENABLED,
            BTNS_AUTOSIZE, {0}, 0, 0},
        {MAKELONG(STD_FILESAVE, 0), 0, TBSTATE_ENABLED,
            TBSTYLE_SEP, {0}, 0, 0}
    };

    // Add buttons
    SendMessage(MainWindow::m_hMainToolbar, TB_BUTTONSTRUCTSIZE, (WPARAM)sizeof(TBBUTTON), 0);
    SendMessage(MainWindow::m_hMainToolbar, TB_ADDBUTTONS, (WPARAM)numbuttons1, (LPARAM)&tbButtons1);

    // Show toolbar
    SendMessage(MainWindow::m_hMainToolbar, TB_AUTOSIZE, 0, 0);
    ShowWindow(MainWindow::m_hMainToolbar, TRUE);

Toolbars

Next, we’ll take a look at internationalization and how you can write applications that can be easily ported to other languages.