Hello, I am an Engineering Manager at Facebook with 13+ years in Ad Technology, Natural Language Processing and Data mining. (Learn More)
by Pravin Paratey

Super quick Find & Replace

Today, we are going to explore how to perform a fast search/replace over a file when the number of strings to search and replace is large. This means, you will be given a set of source and target words like in the table below,

Source(Find)Target (Replace)
motivesoranges
whalewatermelon
ideabulk
himselfitself

And an input file like,

Chief among these motives was the overwhelming idea of the great whale himself. Such a portentous and mysterious monster roused all my curiosity.

The task is to find the best algorithm to replace all the source words in the input text with the target words. For the above input the output will be,

Chief among these oranges was the overwhelming bulk of the great watermelon itself. Such a portentous and mysterious monster roused all my curiosity.

As always, you can get the complete source code and the test files for this article from the sidebar.

Naive solution

The easiest way of implementing this would be to search the text for every word in the source list and replace it by the word in the target list. Example code for this would look like,

#!/usr/bin/env python

import sys, time, hashlib

class Naive:
    """ Naive implementation of find/replace for a massive word list. """
    def __init__(self, filename):
        self._words = {}
        fp = open(filename)
        for line in fp:
            word, replacement = line.strip().split()
            self._words[word] = replacement
        fp.close()

    def replace_all(self, filename):
        fp = open(filename)
        text = fp.read()
        fp.close()

        for word in self._words.keys():
            text = text.replace(word + ' ', self._words[word] + ' ')
        return text

if __name__ == '__main__':
    if len(sys.argv) < 2:
        print "Usage: python %s <find-replace-list> <file1> <file2>" % sys.argv[0]
        sys.exit(1)

    n = Naive(sys.argv[1])
    for filename in sys.argv[2:]:
        start = time.time()
        print filename, hashlib.md5(n.replace_all(filename)).hexdigest(), time.time() - start
        #print n.replace_all(filename)

Using Regular expressions

If the words to find are single alphanumeric words, you can use a regular expression (or even a string.split()) to find all words and replace them in one pass. This is much faster if the wordlist » input text.

#!/usr/bin/env python

import sys, re, time, hashlib

class Regex:
    """ Regex implementation of find/replace for a massive word list. """
    def __init__(self, filename):
        self._words = {}
        fp = open(filename)
        for line in fp:
            word, replacement = line.strip().split()
            self._words[word] = replacement
        fp.close()

    def replace_func(self, matchObj):
        key = matchObj.group(0)
        if self._words.has_key(key):
            return self._words[key]
        else:
            return key

    def replace_all(self, filename):
        fp = open(filename)
        text = fp.read()
        fp.close()
        return re.sub("[a-zA-Z]+", self.replace_func, text)
        

if __name__ == '__main__':
    if len(sys.argv) < 2:
        print "Usage: python %s <find-replace-list> <file1> <file2>" % sys.argv[0]
        sys.exit(1)

    r = Regex(sys.argv[1])
    for filename in sys.argv[2:]:
        start = time.time()
        print filename, hashlib.md5(r.replace_all(filename)).hexdigest(), time.time() - start
        #print r.replace_all(filename)

Using a Trie

A trie is a data structure specifically tuned to this task. A trie is an ordered tree that looks like so,

Example Trie

In the above figure, the words “apple”, “ape” and “zoo” have been shown.

#!/usr/bin/env python

import sys, time, hashlib

class Trie:
    """ 
    Trie class

    We use a tuple (string, dict) to represent a node. That is, the words 

    1. ant -> spider (replace every occurrence of ant with spider)
    2. any -> every

    will be represented as,

    (None, {
        a: (None, {
            n: (None, {
                t: ('spider', {}),
                y: ('every', {})
            }),
        })
    })
    """
    def __init__(self, dict_file):
        self._trie = (None, {})
        fp = open(dict_file)
        for line in fp:
            words = line.strip().split()
            self.add(words[0], words[1])
        fp.close()
    
    def add(self, text, replacement, node=None):
        """ Adds word/replacement pair """
        if node == None: node = self._trie

        new_replacement, children, c = None, node[1], text[0]

        # New id is the id for leaf nodes, None otherwise
        if len(text) == 1: new_replacement = replacement

        if c not in children:
            children[c] = (new_replacement, {})
        elif len(text) == 1: # Leaf node
            (old_replacement, old_node) = children[c]
            children[c] = (new_replacement, old_node)
        if len(text) > 1:
            self.add(text[1:], replacement, node[1][c])

    
    def replace_all(self, text):
        output = []
        max_len, start = len(text), 0

        while start < max_len:
            (replacement, cur_node) = self._trie
            end = start
            while end < max_len and cur_node.has_key(text[end]):
                (replacement, cur_node) = cur_node[text[end]]
                end += 1
            
            # If we are not at the beginning of the string, append prev char (whitespace)
            if start != 0: output.append(text[start-1])

            # Replacement found AND end not exceeded max-len AND the word has ended
            if replacement != None and end <= max_len and (end == max_len or not text[end].isalpha()): 
                output.append(replacement)
            else:
                output.append(text[start:end])
            
            start = end + 1
        if end != max_len: output.append(text[end])
        return ''.join(output)

if __name__ == '__main__':
    if len(sys.argv) < 2:
        print "Usage: python %s <find-replace-list> <file1> <file2>" % sys.argv[0]
        sys.exit(1)

    t = Trie(sys.argv[1])
    for filename in sys.argv[2:]:
        start = time.time()
        fp = open(filename)
        data = fp.read()
        fp.close()
        print filename, hashlib.md5(t.replace_all(data)).hexdigest(), time.time() - start
        #print t.replace_all(data)

Results

# words% foundNaive (seconds)Regex (seconds)Trie (seconds)
1010%0.0000760.0000540.000190
 50%0.0000500.0000680.000159
 100%0.0000490.0000650.000211
10010%0.0003380.0006800.000987
 50%0.0003970.0002870.000878
 100%0.0003890.0002680.000635
100010%0.0204820.0032050.008794
 50%0.0207970.0017820.007492
 100%0.0224430.0016000.006307
10,00010%1.9928860.0172610.085929
 50%2.0534090.0156450.082155
 100%2.1633500.0170480.069294
100,00010%213.52030.1933260.910784
 50%226.24460.1828180.843763
 100%254.26900.1988250.755635

Huh? Isn’t the Trie supposed to be faster?

Yes, a similar implementation of Naive, Regex and Trie in a language like C/C++ will always have the Trie outperforming the other two implementations. However, in Python, the overheads of implementing the Trie using a dictionary and a tuple make it rather slow compared to the Regex implementation (which has been written and optimised in C).

If I do get around to it, I will re-implement the above code in C++.

Note: Generating a fair input file

I have generated test files to test the performance of our algorithms against the following dimensions,

  1. Size of (find, replace) pairs.
  2. Size of the input file.
  3. Percentage occurrence - This is the percentage of times a word from the “find” set occurs in the input file. We will be testing for < 10%, 50%, > 90% occurrence.

Resources

Download the source code for this article (including the fair input file generator)