Summrizer: Text summarizer (Implementation) & Context Extractor
Tuesday, October 27, 2015
Recently, I've been working on implementing a text summarization script in Python (previous blog post). I've built a naive implementation of a text summarizer and also a custom Text Context Analyzer, which is essentially a self-customized Part-Of-Speech and Noun Phrase tagger that determines what the content is about, i.e. the important context of the text content.
For all the impatient folks, TL;DR here is the link to the code: https://github.com/vipul-sharma20/summrizer
Please read further for the complete explanation of the implementation.
NOTE: Works only for the English language :)
Implementing Summarizing Script
This summary script works well for news articles and blog posts, and that's the basic motive behind implementing it. It takes the text content, splits it into paragraphs, splits each paragraph into sentences, filters out stopwords, calculates a score (relevance) for each sentence, and, based on the scores assigned to each sentence, displays the most relevant ones depending on how concise we want our summary to be.
Splitting the content into paragraphs and then into sentences is easier than the rest of the tasks, so it can be skipped here. Before implementing the scoring algorithm, I filtered out the stopwords. Stopwords are the most commonly used words in any language. For example, in English we have words like this, that, he, she, I, etc. These are among the most frequently used words in the English language and may have no significance in deciding the importance of a sentence. Therefore, these stopwords need to be removed from the text content so that the scoring algorithm does not score a sentence based on irrelevant words.
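As a rough sketch of this preprocessing step (the helper names below are my own placeholders, not necessarily the ones in the repository; the sentence tokenizer and stopword list come from NLTK):

import nltk
from nltk.corpus import stopwords

# Requires: nltk.download('punkt') and nltk.download('stopwords')

def split_to_sentences(paragraph):
    # Split a paragraph into sentences using NLTK's Punkt sentence tokenizer.
    return nltk.sent_tokenize(paragraph)

def remove_stopwords(sentence):
    # Keep only alphanumeric, non-stopword tokens for scoring.
    stop = set(stopwords.words('english'))
    words = nltk.word_tokenize(sentence.lower())
    return [w for w in words if w.isalnum() and w not in stop]

text = "The BBC has been testing a new service called SoundIndex. Results can also be limited to just one data source."
for paragraph in (p for p in text.split('\n') if p.strip()):
    for sentence in split_to_sentences(paragraph):
        print(remove_stopwords(sentence))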
Scoring
The scoreSentence() function receives two sentences, finds the intersection between the two, i.e. the words/tokens common to both sentences, and then normalizes the result by the average length of the two sentences.
avg = (len(s1) + len(s2)) / 2.0
score = len(s1.intersection(s2)) / avg
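Put together, a minimal sketch of scoreSentence() might look like the following, assuming each sentence has already been reduced to a set of non-stopword tokens (that representation is my assumption, not necessarily the repository's exact one):

def score_sentence(s1, s2):
    # s1, s2: sets of non-stopword tokens from two sentences.
    if not s1 or not s2:
        return 0.0
    avg = (len(s1) + len(s2)) / 2.0
    # Normalize the size of the intersection by the average sentence length.
    return len(s1.intersection(s2)) / avg

a = {'bbc', 'testing', 'new', 'service', 'soundindex'}
b = {'soundindex', 'lets', 'users', 'sort', 'popular', 'tracks'}
print(score_sentence(a, b))  # 1 common token / 5.5 average length = 0.1818...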
The most important and interesting part is: how do we make use of this scoring algorithm? Here, I've created an all-pairs score graph of sentences, i.e. a fully connected, weighted graph which contains the scores between all pairs of sentences in a paragraph. The function sentenceGraph() performs this task.
Suppose scoreGraph is the obtained weighted graph. Then scoreGraph[0][5] will contain the score between sentence no. 1 and sentence no. 6, and similarly there will be a separate intersection score for every pair. Therefore, if there are 6 sentences in a paragraph, we will have a 6x6 matrix as the score graph.
The scoreGraph consists of pairwise scores. So, to calculate the individual score of each sentence, we sum up all the intersections of a particular sentence with the other sentences in the paragraph and store the result in a dictionary with the sentence as the key and the calculated score as the value. The function build() performs this task.
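Here is a rough sketch of both steps, reusing score_sentence() from above; the names mirror the post's sentenceGraph() and build(), but the exact signatures are my assumptions:

def sentence_graph(token_sets):
    # token_sets: one set of non-stopword tokens per sentence in the paragraph.
    n = len(token_sets)
    graph = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                graph[i][j] = score_sentence(token_sets[i], token_sets[j])
    return graph

def build(sentences, graph):
    # Sum each sentence's pairwise scores against every other sentence.
    return {sentence: sum(graph[i]) for i, sentence in enumerate(sentences)}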
Summary
To build the summary from the final score dictionary, we pick the top-scoring sentences, choosing as many as we need depending on how concise we want the summary to be.
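For instance, a minimal selection step could look like this (the 50% default ratio is an arbitrary choice of mine, not something fixed by the script):

def get_summary(sentences, scores, ratio=0.5):
    # Keep the top-scoring fraction of sentences, preserving their original order.
    keep = max(1, int(len(sentences) * ratio))
    best = set(sorted(sentences, key=lambda s: scores[s], reverse=True)[:keep])
    return ' '.join(s for s in sentences if s in best)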
Complete code of the summarizing script: getSummary.py
Example
I've tested the scoring algorithm on a paragraph of an article from TechCrunch:
"The BBC has been testing a new service called SoundIndex, which lists the top 1,000 artists based on discussions crawled from Bebo, Last.fm, Google Groups, iTunes, MySpace and YouTube. The top five bands according to SoundIndex right now are Coldplay, Rihanna, The Ting Tings, Duffy and Mariah Carey , but the index is refreshed every six hours. SoundIndex also lets users sort by popular tracks, search by artist, or create customized charts based on music preferences or filters by age range, sex or location. Results can also be limited to just one data source (such as Last.fm)."
Result
The BBC has been testing a new service called SoundIndex, which lists the top 1,000 artists based on discussions crawled from Bebo, Last.fm, Google Groups, iTunes, MySpace and YouTube : 0.338329361595
The top five bands according to SoundIndex right now are Coldplay, Rihanna, The Ting Tings, Duffy and Mariah Carey , but the index is refreshed every six hours. : 0.286057692308
SoundIndex also lets users sort by popular tracks, search by artist, or create customized charts based on music preferences or filters by age range, sex or location. : 0.285784751456
Results can also be limited to just one data source (such as Last.fm). : 0.237041838857
As per the context of the news, it is evident that the first two sentences are the most relevant part of the paragraph and hence have higher scores than the rest of the sentences.
Speaking of Coldplay, I highly recommend: Coldplay - Fix You (Live 2012 from Paris) :D
Comparison
I've tried various text-compacting and text-summarizing websites and used the above paragraph to test their performance. Here are the results:
- http://autosummarizer.com/index.php summarized it to "Sound Index also lets users sort by popular tracks, search by artist, or create customized charts based on music preferences or filters by age range, sex or location."
- http://freesummarizer.com summarized it to "SoundIndex also lets users sort by popular tracks, search by artist, or create customized charts based on music preferences or filters by age range, sex or location."
- http://smmry.com did nothing but convert the paragraph into sentences and display them :|
- http://textcompactor.com summarized it to the following when I set the summary limit to 50%: "The BBC has been testing a new service called SoundIndex, which lists the top 1,000 artists based on discussions crawled from Bebo, Last.fm, Google Groups, iTunes, MySpace and YouTube. The top five bands according to SoundIndex right now are Coldplay, Rihanna, The Ting Tings, Duffy and Mariah Carey , but the index is refreshed every six hours."
http://textcompactor.com produced the same result as my script when used with a 50% compaction limit :D The others were pretty disappointing.
Try copy-pasting the paragraph used in the example to verify the results.
Context Extractor
The summarizing script, as explained above, works on top of a scoring algorithm. One might need to extract only the context, or the main topics, from a sentence so as to know what the text content is about. This gives a very abstract idea of the content we might be dealing with.
The phrase structure of a sentence in English is of the form:
S -> NP VP
The above rule means that a sentence (S) consists of a Noun Phrase (NP) and a Verb Phrase (VP). We can further define grammar for a Noun Phrase, but let's not get into that :)
A Verb Phrase defines the action performed on or by the object, whereas a Noun Phrase functions as the subject or object of a verb in a sentence. Therefore, NPs can be used to extract the important topics from the sentences.
I've used the Brown Corpus from the Natural Language Toolkit (NLTK) for Part-Of-Speech (POS) tagging of the sentences and defined a custom Context Free Grammar (CFG) for extracting NPs.
"The Brown Corpus was the first million-word electronic corpus of English, created in 1961 at Brown University. This corpus contains text from 500 sources, and the sources have been categorized by genre, such as news, editorial, and so on."
See more at: NLTK-Brown Corpus
"A part-of-speech tagger, or POS-tagger, processes a sequence of words, and attaches a part of speech tag to each word."
See more at: NLTK - Using a Tagger
>>> text = word_tokenize("And now for something completely different")
>>> nltk.pos_tag(text)
[('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'), ('completely', 'RB'), ('different', 'JJ')]
In my context extractor script, I've used unigram as well as bigram POS tagging. A unigram tagger is based on a simple statistical algorithm: for every token/word, assign the tag that is most likely for that token, as determined by a lookup in the training data. The drawback of unigram tagging is that we can only tag a token with its "most likely" tag in isolation from the larger context of the text.
Therefore, for better results, we use an n-gram tagger, whose context is the current token along with the POS tags of the preceding n-1 tokens. The problem with n-gram taggers is the sparse data problem, which is quite inherent in NLP.
"As n gets larger, the specificity of the contexts increases, as does the chance that the data we wish to tag contains contexts that were not present in the training data."
Even for n=2, i.e. in the case of a bigram tagger, we can face this sparse data problem. Therefore, to avoid it, I've initially used a Bigram Tagger; if it fails to tag some tokens, it backs off to a Unigram Tagger, and if even the Unigram Tagger fails to tag the tokens, it backs off to a RegEx Tagger which has some naive rules for tagging nouns, adjectives, cardinal numbers, determiners, etc.
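That backoff chain can be built with NLTK roughly as follows; the regex rules and the choice of the Brown news category for training are my own illustrative choices, not necessarily the ones in the actual script:

import nltk
from nltk.corpus import brown

# Naive last-resort rules (Brown tagset): gerunds, past tense, numbers, articles, adjectives, default noun.
patterns = [
    (r'.*ing$', 'VBG'),
    (r'.*ed$', 'VBD'),
    (r'^-?[0-9]+(\.[0-9]+)?$', 'CD'),
    (r'^(the|a|an)$', 'AT'),
    (r'.*able$', 'JJ'),
    (r'.*', 'NN'),
]

train_sents = brown.tagged_sents(categories='news')
regexp_tagger = nltk.RegexpTagger(patterns)
unigram_tagger = nltk.UnigramTagger(train_sents, backoff=regexp_tagger)
bigram_tagger = nltk.BigramTagger(train_sents, backoff=unigram_tagger)

print(bigram_tagger.tag(nltk.word_tokenize("The BBC has been testing a new service")))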
I've also defined a custom CFG (Context Free Grammar) to extract Noun Phrases from the POS-tagged list of tokens.
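The actual grammar lives in context.py; as a stand-in illustration, NLTK's RegexpParser accepts a simple noun-phrase chunk rule over POS tags like the textbook pattern below (this is not the script's exact grammar):

import nltk

# NP chunk rule: optional article/determiner, any number of adjectives, one or more nouns.
grammar = "NP: {<DT|AT>?<JJ.*>*<NN.*>+}"
chunker = nltk.RegexpParser(grammar)

tagged = [('a', 'AT'), ('new', 'JJ'), ('service', 'NN'), ('called', 'VBN'), ('SoundIndex', 'NN')]
tree = chunker.parse(tagged)
for subtree in tree.subtrees(filter=lambda t: t.label() == 'NP'):
    print(' '.join(word for word, tag in subtree.leaves()))
# Prints: "a new service" and "SoundIndex"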
(I can discuss how the custom CFG works if someone is interested :) ! )
Here is the code which performs this task: context.py
Example
I've used the same content as in the summarizer script as the test example for the context extraction script:
"The BBC has been testing a new service called SoundIndex, which lists the top 1,000 artists based on discussions crawled from Bebo, Last.fm, Google Groups, iTunes, MySpace and YouTube. The top five bands according to SoundIndex right now are Coldplay, Rihanna, The Ting Tings, Duffy and Mariah Carey , but the index is refreshed every six hours. SoundIndex also lets users sort by popular tracks, search by artist, or create customized charts based on music preferences or filters by age range, sex or location. Results can also be limited to just one data source (such as Last.fm)."
Result
['BBC', 'new service', 'SoundIndex', 'Bebo', 'Last.fm', 'Google Groups', 'MySpace', 'YouTube', 'SoundIndex', 'Coldplay', 'Rihanna', 'Ting Tings', 'Duffy', 'Mariah Carey', 'SoundIndex', 'lets users sort', 'popular tracks', 'music preferences', 'age range', 'data source', 'Last.fm']
This is the list of topics discussed in the test paragraph, and it looks good :D NLP can never yield 100% accurate results; all we can do is train on a data set, so in this case some undesired results may arise.
Please suggest some improvements :) I would love to hear your views :D