Hello, I am an Engineering Manager at Facebook with 13+ years in Ad Technology, Natural Language Processing and Data Mining.
by Pravin Paratey

Summarisation Demo

How it works

Let me illustrate the summarisation process using the following paragraph:

The iPhone is a smartphone developed by Apple Inc. An iPhone can function as a video camera, a camera phone, a portable media player, and an Internet client with email and web browsing capabilities, can send texts and receive visual voicemail, and has both Wi-Fi and 3G connectivity. The user interface is built around the device’s multi-touch screen, including a virtual keyboard rather than a physical one.

1. Understand the sentence

Disambiguation

In the first stage, the input text is pre-processed to assign a meaning to each word. Words with multiple meanings, such as “Apple”, which could stand for the fruit or for Apple Inc., the company, are disambiguated using both the context of the sentence they appear in and the context of the input text as a whole.

In the adjacent figure, the word “iPhone” has been used to disambiguate “Apple”.
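
As a rough illustration of context-based disambiguation, here is a minimal sketch using a simplified Lesk-style overlap count. This is not necessarily the method the demo uses. The `SENSES` table, its cue words, and the double weighting for sentence-level matches are all assumptions made up for this example.

```python
import re

# Hypothetical sense inventory: each sense maps to a few cue words.
# A real system would draw these from a lexical resource or knowledge base.
SENSES = {
    "apple": {
        "Apple (fruit)": {"fruit", "tree", "eat", "juice", "orchard"},
        "Apple Inc. (company)": {"iphone", "smartphone", "mac", "computer", "device"},
    },
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def disambiguate(word, sentence, document=""):
    """Pick the sense whose cue words overlap most with the context.

    Matches in the sentence count double so that local context outweighs
    the rest of the input text (an arbitrary weighting for this sketch).
    Returns None when the word has no entry in SENSES.
    """
    senses = SENSES.get(word.lower())
    if not senses:
        return None
    sentence_tokens = set(tokenize(sentence))
    document_tokens = set(tokenize(document))

    def score(cues):
        return 2 * len(cues & sentence_tokens) + len(cues & document_tokens)

    return max(senses, key=lambda sense: score(senses[sense]))


sentence = "The iPhone is a smartphone developed by Apple Inc."
print(disambiguate("Apple", sentence))
```

Here the words “iPhone” and “smartphone” in the sentence match the cue words of the company sense, so that sense wins over the fruit.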

2. Identify the important concepts

Once words have been assigned a meaning, important concepts are identified. A concept is a collection of words which together define an idea. For example, the concept iPhone will contain words like (iPhone, Jobs, mobile, communication, A4, cell, 3G, …).
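
To make the idea of a concept as a collection of words more concrete, here is a small sketch that counts how many words in a text belong to each concept. The iPhone word list is taken from the example above. The `Camera` and `Internet` concepts, the `CONCEPTS` table, and the `identify_concepts` function are hypothetical, and a simple hit count is only one possible measure of importance.

```python
import re
from collections import Counter

# Concept lexicon: concept name -> words that evoke it.
# Only the iPhone entry comes from the text; the others are made up.
CONCEPTS = {
    "iPhone": {"iphone", "jobs", "mobile", "communication", "a4", "cell", "3g"},
    "Camera": {"camera", "video", "photo"},
    "Internet": {"internet", "email", "web", "browsing", "wi-fi"},
}


def identify_concepts(text, min_hits=1):
    """Count the tokens that belong to each concept.

    Returns (concept, hits) pairs with at least `min_hits` matches,
    strongest first. Hyphenated words such as "wi-fi" stay in one token.
    """
    tokens = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())
    hits = Counter()
    for token in tokens:
        for concept, words in CONCEPTS.items():
            if token in words:
                hits[concept] += 1
    return [(concept, n) for concept, n in hits.most_common() if n >= min_hits]


text = (
    "The iPhone is a smartphone developed by Apple Inc. An iPhone can "
    "function as a video camera, a camera phone, a portable media player, "
    "and an Internet client with email and web browsing capabilities, can "
    "send texts and receive visual voicemail, and has both Wi-Fi and 3G "
    "connectivity."
)
for concept, n in identify_concepts(text):
    print(concept, n)
```

Raising `min_hits` would keep only the concepts that the text mentions repeatedly.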

The iPhone is a smartphone developed by Apple Inc.

3. Find the sentences with the maximum coverage and minimum overlap

[Figure: sentences S1, S2, S3, S4, S5]

Next, the sentences are ordered by concept coverage. Sets of sentences are then formed such that the members of each set together cover all of the concepts extracted in step 2. For the example in the adjoining figure, the sets {(S3, a)}, {(S5, b), (S1, c)}, {(S1, c), (S4, d), (S2, e)} will be created. The letters a to e denote confidence factors, which are used to determine which set to output.
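
The set-forming step can be sketched as a brute-force search for minimal covering sets. The sentence-to-concept mapping below is invented so that it yields the same three sets as the example. How the confidence factors are calculated is not described here, so the sketch substitutes a stand-in rule: prefer the set with the least concept overlap, then the fewest sentences. The names `SENTENCE_CONCEPTS`, `minimal_covers`, `overlap` and `best_cover` are all hypothetical.

```python
from itertools import combinations

# Hypothetical mapping from sentence id to the concepts it mentions.
SENTENCE_CONCEPTS = {
    "S1": {"A", "B"},
    "S2": {"C"},
    "S3": {"A", "B", "C", "D"},
    "S4": {"D"},
    "S5": {"B", "C", "D"},
}


def minimal_covers(sentence_concepts):
    """Return every set of sentences that covers all concepts and has no
    proper subset that also covers them."""
    all_concepts = set().union(*sentence_concepts.values())
    ids = sorted(sentence_concepts)
    covers = []
    # Smaller sets are tried first, so any superset of a known cover
    # can be skipped as non-minimal.
    for size in range(1, len(ids) + 1):
        for combo in combinations(ids, size):
            chosen = set(combo)
            if any(cover <= chosen for cover in covers):
                continue
            covered = set().union(*(sentence_concepts[s] for s in combo))
            if covered == all_concepts:
                covers.append(chosen)
    return covers


def overlap(cover, sentence_concepts):
    """Number of redundant concept mentions within a cover."""
    total = sum(len(sentence_concepts[s]) for s in cover)
    distinct = len(set().union(*(sentence_concepts[s] for s in cover)))
    return total - distinct


def best_cover(sentence_concepts):
    """Stand-in for the confidence factor: least overlap, then fewest sentences."""
    covers = minimal_covers(sentence_concepts)
    return min(covers, key=lambda c: (overlap(c, sentence_concepts), len(c)))


for cover in minimal_covers(SENTENCE_CONCEPTS):
    print(sorted(cover), "overlap:", overlap(cover, SENTENCE_CONCEPTS))
print("chosen:", sorted(best_cover(SENTENCE_CONCEPTS)))
```

Enumerating every combination is exponential in the number of sentences, which is fine for a short paragraph. A longer input would call for a greedy or otherwise approximate set-cover strategy.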