Hello, I am an Engineering Manager at Facebook with 13+ years in Ad Technology, Natural Language Processing and Data Mining.
by Pravin Paratey

Natural Language Processing Resources

This page contains a list of resources for people interested in NLP. If you would like to add to this page, please email me.

Papers

Sentiment Analysis

  1. Annotating Expressions of Opinions and Emotions in Language by Janyce Wiebe, Theresa Wilson & Claire Cardie.
    This paper describes a corpus annotation project to study issues in the manual annotation of opinions, emotions, sentiments, speculations, evaluations and other private states in language. The resulting corpus annotation scheme is described, as well as examples of its use.

  2. The Importance of Neutral Examples for Learning Sentiment by Moshe Koppel and Jonathan Schler.
    Most research on learning to identify sentiment ignores “neutral” examples, learning only from examples of significant (positive or negative) polarity. This paper shows that learning from neutral examples as well can improve classification accuracy.

Software

Multi-functional

  1. Natural Language Toolkit - NLTK provides various Python modules for text processing tasks ranging from PoS tagging and chunking to classification and clustering.

Part-of-speech Taggers

  1. Yoshimasa Tsuruoka POS Tagger - Widely considered the fastest, it boasts a tagging speed of 2400 tokens/second and an accuracy of 97.10% (on the WSJ corpus).

  2. Senna - Senna is a C library which lets you do PoS tagging, chunking, named entity recognition and semantic role labeling.

  3. Stanford PoS Tagger - The most famous PoS tagger out there.

  4. Malt Tagger
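
Taggers are usually compared on figures like the ones quoted above: accuracy against a hand-tagged corpus and throughput in tokens per second. The following is a minimal sketch of how such figures could be measured. The `toy_tagger` function and the four-token `GOLD` sample are invented for illustration and do not correspond to any of the taggers listed; a real evaluation would plug in an actual tagger and a corpus such as the WSJ.

```python
import time

# Hypothetical gold-standard sample: (token, tag) pairs invented for illustration.
GOLD = [("the", "DT"), ("dog", "NN"), ("barks", "VBZ"), ("loudly", "RB")]


def toy_tagger(tokens):
    """Stand-in tagger (not a real one): looks words up in a tiny lexicon
    and falls back to NN for anything unknown."""
    lexicon = {"the": "DT", "barks": "VBZ"}
    return [lexicon.get(token, "NN") for token in tokens]


def evaluate(tagger, gold):
    """Return (accuracy, tokens_per_second) for a tagger on gold (token, tag) pairs."""
    tokens = [token for token, _ in gold]

    # Time only the tagging call itself.
    start = time.perf_counter()
    predicted = tagger(tokens)
    elapsed = time.perf_counter() - start

    # Accuracy is the fraction of tokens whose predicted tag matches the gold tag.
    correct = sum(1 for (_, gold_tag), tag in zip(gold, predicted) if tag == gold_tag)
    accuracy = correct / len(gold)

    # Guard against a zero reading from the clock on very small inputs.
    speed = len(tokens) / elapsed if elapsed > 0 else float("inf")
    return accuracy, speed


accuracy, speed = evaluate(toy_tagger, GOLD)
print(f"accuracy: {accuracy:.2%}")  # accuracy: 75.00%
print(f"speed: {speed:.0f} tokens/second")
```

On a sample this small the speed figure is meaningless; it only becomes comparable across taggers when measured on the same hardware over a large corpus.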

Corpora

  1. Wordnets - A Wordnet is a graph of words which lets you extract relations such as meronymy, hyponymy and antonymy between words.

  2. Geo-tagged Microblog corpus - This page provides a link to a dataset containing a sample of geo-tagged microblog data, for use in academic research. The dataset is described in this paper.

  3. Climate dataset - This page contains datasets related to climate. It contains emission numbers for various categories ranging from aircraft emissions to the breed of a dog.

  4. Treebanks - A Treebank is a syntactically annotated (parsed) corpus. Treebanks are used to train or test custom parsers.
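
To make the Wordnet entry above more concrete, here is a small sketch of the underlying idea: words as nodes, named relations as labeled edges. The `ToyWordnet` class and its handful of words are hypothetical and written only for illustration; this is not the API of any real Wordnet package, which would also distinguish word senses.

```python
from collections import defaultdict


class ToyWordnet:
    """A tiny labeled graph: nodes are words, edges are named relations."""

    def __init__(self):
        # relation name -> source word -> set of related words
        self._edges = defaultdict(lambda: defaultdict(set))

    def add(self, relation, source, target):
        """Add a directed edge source -> target under the given relation."""
        self._edges[relation][source].add(target)

    def related(self, relation, word):
        """Return the words directly linked to `word` by `relation`."""
        return sorted(self._edges.get(relation, {}).get(word, set()))

    def closure(self, relation, word):
        """Follow one relation transitively, e.g. all hyponyms of 'animal'."""
        found, stack = set(), [word]
        while stack:
            for neighbour in self.related(relation, stack.pop()):
                if neighbour not in found:
                    found.add(neighbour)
                    stack.append(neighbour)
        return sorted(found)


wn = ToyWordnet()
wn.add("hyponym", "animal", "canine")  # a canine is a kind of animal
wn.add("hyponym", "canine", "dog")     # a dog is a kind of canine
wn.add("meronym", "car", "wheel")      # a wheel is a part of a car
wn.add("antonym", "hot", "cold")       # antonymy is symmetric, so add both directions
wn.add("antonym", "cold", "hot")

print(wn.related("meronym", "car"))     # ['wheel']
print(wn.related("antonym", "cold"))    # ['hot']
print(wn.closure("hyponym", "animal"))  # ['canine', 'dog']
```

Storing one edge set per relation keeps direct lookups cheap and lets the same traversal serve any relation that is meaningful to follow transitively.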