Hello, I am an Engineering Manager at Facebook with 13+ years in Ad Technology, Natural Language Processing and Data mining. (Learn More)
by Pravin Paratey

Script to generate URS from Wikipedia

A persons’ URS is a phrase that could be used instead of his/her usual name in all circumstances, which makes it absolutely clear who he/she is. A good URS for a person should meet the following criteria:

  • Everyone familiar with the person will confidently recognise him/her from the URS.
  • There is no possibility that the URS could also describe anyone other than the person.
  • Even someone who isn’t familiar with the person will have some understanding of who he/she is from the URS.
#!/usr/bin/python
""" 
Script to generate URS from the starting paragraph of Wikipedia 
articles about persons.

by Pravin Paratey (pravinp -at- gmail.com)

Current Implementation:
----------------------
1. Extract first sentence
2. Clean wiki markup
3. Observing given data, and the data on wikipedia, shows that there 
   is a pattern that is followed while writing wikipedia entries for
   persons. Replacing (was/is)(an/a/the/) with (/the) does the trick
4. Output sentence formed

Ideally:
--------
Ideally, the piece of code should identify the following concepts:
1. Name of person
2. Time period
3. Son/Daughter/Father/Mother of (in case of famous personality)
4. Renowned for

How do we go about it?
1 and 2 - straight forward. Wikipedia gives cues through its markup
3 - straight forward. String matching using "son of", "daughter of", etc
4 - will need to match against a database.

For 3, we only keep the "son of", "daughter of", "X of Y" if Y is a prominent
person. An easy way of doing this is using incoming links on wikipedia OR
to search for X and Y individually on google and noting the number of results.
"""

import re, sys, codecs

def cleanUri(m):
    """ Cleans Uri wiki markup """
    word = m.group(1)
    if '|' in word: word = word.split('|')[1]
    return word.strip()

def dotRemove(m):
    """ Replaces . by # inside tags """
    return m.group(0).replace('.', '#')

def cleanMarkup(text):
    """ Removes
    1. wiki markup
    2. sanitize html entities 
    3. comments """
    #text = re.sub(r"\[\[[\w\s\-,]+\|(\w+)\]\]", r"\1", text)
    text = re.sub(r"\[\[(.*?)\]\]", cleanUri, text)
    text = re.sub(r"\{\{.*?\}\}", r"", text)
    text = re.sub(r".*?<\/ref>", r"", text)
    text = re.sub(r"---.*?---", r"", text)
    text = re.sub(r"\[.*?\]", r"", text)
    text = text.replace("'''", "").replace("''", "'")
    text = text.replace("[[", "").replace("]]", "")
    text = text.replace("–", "-").replace("&", "&")
    return text

def getFirstSentence(text):
    """ Returns the text until first instance of '.'
    It also makes sure that the '.' isn't part of a wiki link
    or name"""
    tmp = re.sub(r"\[\[.*?\]\]", dotRemove, text)
    tmp = re.sub(r"\[.*?\]", dotRemove, tmp)
    tmp = re.sub(r".*?<\/ref>", dotRemove, tmp)
    tmp = re.sub(r"---.*?---", dotRemove, tmp)
    tmp = re.sub(r"'''.*?'''", dotRemove, tmp)
    tmp = re.sub(r"''.*?''", dotRemove, tmp)
    index = tmp.find('.')
    
    if index == -1: 
        return text
    else:
        return text[:index]

def makeArticle(m):
    """ Changes a, an to the when appropriate """
    retval = ', the'
    if len(m.group(2)) == 0:
        retval = ' '
    return retval
    
def extractURS(text):
    """ The function to call. Returns the URS """
    text = getFirstSentence(text)
    text = cleanMarkup(text)
    text = re.sub(r",?\s+(was|is)\s+(an|the|a|)", makeArticle, text)
    return text

if __name__ == '__main__':
    #fp = open(sys.argv[1])
    fp = codecs.open("input.txt", "r", "utf-8")
    fp2 = codecs.open("output.txt", "w", "utf-8")
    fp2.write(codecs.BOM_UTF8.decode("utf-8")), # Add BOM for UTF-8
    for line in fp:
        line = line.rstrip()
        if len(line) == 0 or line.startswith("#"): # For debugging
            continue
        urs = extractURS(line)
        fp2.write(urs + '\r\n')
    fp.close()
    fp2.close()


## Example Inputs and Outputs
These are inputs from Wikipedia (Click on the article and then Edit). Ex [Lala Lajpat Rai](http://en.wikipedia.org/w/index.php?title=Lala_Lajpat_Rai&action=edit). The above script outputs the URS.

**Example Input:** '''B. S. Johnson ''' (Bryan Stanley Johnson) ([[5 February]],[[1933]] - [[13 November]],[[1973]]) was an English experimental novelist, poet, literary critic and film-maker.
**Script Output:** B. S. Johnson  (Bryan Stanley Johnson) (5 February,1933 - 13 November,1973), the English experimental novelist, poet, literary critic and film-maker

## How are URS used?
URS can be directly substituted in a sentence containing that persons' name. (Hover over Bhagat Singh to see this URS.
ex. Bhagat Singh was executed by the British in 1931.
This way, a person who had no idea who Bhagat Singh was, now has more context about the person.