Data Extraction Methods for Natural Language Processing

Ashwini Ashtekar
May 31, 2021


Data Extraction methods

Natural Language Processing (NLP) is a set of techniques that enables computers to interpret and work with human language.

In this article, I am going to explain the basic operations/methods we can perform on raw text to extract information, using the Natural Language Toolkit (NLTK) library.

NLTK is a Python text processing library widely used in natural language processing.
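Before running the examples, NLTK has to be installed along with a few of its data resources. A minimal setup sketch (the resource names below are my assumptions based on the methods used later in this article):

# Install the libraries first, e.g. pip install nltk scikit-learn
import nltk

# Download the resources the examples below rely on
nltk.download('punkt')                       # sentence and word tokenizers
nltk.download('stopwords')                   # stopword lists
nltk.download('wordnet')                     # lemmatizer dictionary
nltk.download('averaged_perceptron_tagger')  # POS tagger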

Methods :

1. Sentence Segmentation :

This technique splits text into individual sentences. We'll see an example of how to implement sentence segmentation using NLTK.

Example :

Text Source : http://imoviequotes.com/famous-and-romantic-romeo-and-juliet-quotes-of-1996-film.html

import nltk

text = "A glooming peace this morning with it brings; The sun for sorrow will not show his head. Go hence, to have more talk of these sad things. Some shall be pardoned, and some punished; For never was a story of more woe Then this of Juliet and her Romeo."

# Split the text into sentences
sentences = nltk.sent_tokenize(text)
for sentence in sentences:
    print(sentence)
    print()

Output :

A glooming peace this morning with it brings; The sun for sorrow will not show his head. Go hence, to have more talk of these sad things.

Some shall be pardoned, and some punished; For never was a story of more woe Then this of Juliet and her Romeo.

2. Word Tokenization :

This technique provides the ability to divide a sentence into words.

Let’s see how we can apply word tokenization on our text.

Example :

from nltk.tokenize import word_tokenize
print(word_tokenize(text))

Output :

['A', 'glooming', 'peace', 'this', 'morning', 'with', 'it', 'brings', ';', 'The', 'sun', 'for', 'sorrow', 'will', 'not', 'show', 'his', 'head.Go', 'hence', ',', 'to', 'have', 'more', 'talk', 'of', 'these', 'sad', 'things', '.', 'Some', 'shall', 'be', 'pardoned', ',', 'and', 'some', 'punished', ';', 'For', 'never', 'was', 'a', 'story', 'of', 'more', 'woe', 'Then', 'this', 'of', 'Juliet', 'and', 'her', 'Romeo', '.']

3. Remove Punctuation :

This step removes punctuation from the text. The pre-initialized string string.punctuation contains the full set of punctuation characters:

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

Let's see how to implement it, here using a regular-expression tokenizer that keeps only word characters :

# \w+ matches runs of word characters, so punctuation tokens are dropped
tokenizer = nltk.RegexpTokenizer(r"\w+")
new_words = tokenizer.tokenize(text)
print(new_words)

Output :

['A', 'glooming', 'peace', 'this', 'morning', 'with', 'it', 'brings', 'The', 'sun', 'for', 'sorrow', 'will', 'not', 'show', 'his', 'head', 'Go', 'hence', 'to', 'have', 'more', 'talk', 'of', 'these', 'sad', 'things', 'Some', 'shall', 'be', 'pardoned', 'and', 'some', 'punished', 'For', 'never', 'was', 'a', 'story', 'of', 'more', 'woe', 'Then', 'this', 'of', 'Juliet', 'and', 'her', 'Romeo']
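As an alternative sketch (my own variation, not the approach used above), the word_tokenize output can be filtered against string.punctuation directly:

import string
from nltk.tokenize import word_tokenize

# Drop tokens that are single punctuation characters
punct = set(string.punctuation)
tokens = word_tokenize(text)
no_punct = [token for token in tokens if token not in punct]
print(no_punct)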

4. Remove Stopwords :

Stopwords are words such as "the", "I", and "me" that do not add much meaning to a sentence. They can usually be removed without losing the meaning of the sentence.

By using the following code, you can load the set of English stopwords :

import nltk
from nltk.corpus import stopwords
print(stopwords.words('english'))

Set of stopwords :

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]

Let’s see a simple example with the above text we used:

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Tokenize the text, then keep only the words that are not stopwords
words = word_tokenize(text)
removed_sw = [word for word in words if word not in stopwords.words('english')]
print(removed_sw)

Output :

['A', 'glooming', 'peace', 'morning', 'brings', ';', 'The', 'sun', 'sorrow', 'show', 'head.Go', 'hence', ',', 'talk', 'sad', 'things', '.', 'Some', 'shall', 'pardoned', ',', 'punished', ';', 'For', 'never', 'story', 'woe', 'Then', 'Juliet', 'Romeo', '.']

5. Stemming :

Stemming is the process of reducing a word to its root form, even if the result is not a valid word in English/natural language.

Implementation :

from nltk.stem import PorterStemmer

porter = PorterStemmer()
words = ["trouble", "troubling", "troubled"]
for word in words:
    print(word, " : ", porter.stem(word))

Output :

trouble : troubl 
troubling : troubl
troubled : troubl

6. Lemmatization :

Lemmatization is similar to stemming, but while reducing the word it ensures that the resulting root word (lemma) is a valid word in the natural language.

Implementation :

from nltk.stem import WordNetLemmatizer

# Requires the 'wordnet' resource (nltk.download('wordnet'))
lemma = WordNetLemmatizer()
print("photos :", lemma.lemmatize("photos"))
# v denotes verb in parts of speech
print("running :", lemma.lemmatize("running", pos ="v"))
# a denotes adjective in parts of speech
print("better :", lemma.lemmatize("better", pos ="a"))

Output :

photos : photo
running : run
better : good
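Note that the pos argument matters: if it is omitted, WordNetLemmatizer assumes every word is a noun and leaves verb or adjective forms unchanged. A quick check:

# Without a pos tag the lemmatizer treats the word as a noun
print(lemma.lemmatize("running"))            # running
print(lemma.lemmatize("running", pos ="v"))  # run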

7. Named Entity Recognition (NER) :

Named Entity Recognition is an information extraction method that identifies the named entities in a given text and classifies them into predefined categories such as people, locations, values, and organizations.

Implementation:

import nltk

text = "A glooming peace this morning with it brings the sun for sorrow will not show his head"

# Tokenize the sentence and tag each token with its part of speech
wt = nltk.word_tokenize(text)
pos = nltk.pos_tag(wt)
pos

Output :

[('A', 'DT'),
('glooming', 'VBG'),
('peace', 'NN'),
('this', 'DT'),
('morning', 'NN'),
('with', 'IN'),
('it', 'PRP'),
('brings', 'VBZ'),
('the', 'DT'),
('sun', 'NN'),
('for', 'IN'),
('sorrow', 'NN'),
('will', 'MD'),
('not', 'RB'),
('show', 'VB'),
('his', 'PRP$'),
('head', 'NN')]

The following are common part-of-speech (POS) tag categories (a sketch of the actual named-entity step follows this list) :

CC — coordinating conjunction
CD — cardinal digit
DT — determiner
JJS — adjective, superlative ‘biggest’
PRP$ — possessive pronoun my, his, hers
EX — existential there (ex : “there is” — think of it like “there exists”)
FW — foreign word
IN — preposition/subordinating conjunction
JJ — adjective ‘big’
JJR — adjective, comparative ‘bigger’
NNPS — proper noun, plural ‘Indians’
LS — list marker
MD — modal could, will
NN — noun, singular
NNS — noun plural
NNP — proper noun, singular ‘John’
RB — adverb very, silently,
RBR — adverb, comparative better
PDT — predeterminer ‘all the kids’
POS — possessive ending parent’s
PRP — personal pronoun I, he, she
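The tags above are only the part-of-speech step. To actually label named entities, NLTK's ne_chunk can be applied on top of the POS-tagged tokens. Here is a minimal sketch; the example sentence is my own (the Shakespeare line contains no obvious named entities), and it assumes the maxent_ne_chunker and words resources have been downloaded:

import nltk
# nltk.download('maxent_ne_chunker')
# nltk.download('words')

sample = "Juliet met Romeo in Verona."   # hypothetical example sentence
tokens = nltk.word_tokenize(sample)
tagged = nltk.pos_tag(tokens)

# ne_chunk groups the tagged tokens into named-entity subtrees (PERSON, GPE, ...)
tree = nltk.ne_chunk(tagged)
for subtree in tree:
    if hasattr(subtree, 'label'):
        entity = " ".join(token for token, tag in subtree.leaves())
        print(subtree.label(), ":", entity)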

8. Bag of Words : Using vectorization

Bag of Words is a data extraction technique that represents a sentence by the occurrence counts of the words within it, because computers handle numbers far better than raw text.

Vectorization is the process of converting text into such a numerical representation.

Example :

from sklearn.feature_extraction.text import CountVectorizer

sentence = ['A glooming peace this morning with it brings;The sun for sorrow will not show his head.Go hence, to have more talk of these sad things']

# Learn the vocabulary and count each word's occurrences in the sentence
num_count = CountVectorizer()
print(num_count.fit_transform(sentence).todense())

Output :

[[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]]
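Each 1 above is the count of one vocabulary word. To see which column belongs to which word, the fitted vectorizer's vocabulary can be inspected (a small follow-up sketch; note that CountVectorizer lowercases the text and drops single-character tokens such as 'A' by default):

# The words of the learned vocabulary, in column order
print(num_count.get_feature_names_out())  # use get_feature_names() on older scikit-learn versions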

9. Sentiment Analysis :

Sentiment Analysis is an NLP technique used to classify text into predefined sentiment categories such as positive, negative, and neutral.

Example :

The movie is excellent - Positive sentiment category
The product quality is too bad - Negative sentiment category
There is a coffee on your table - Neutral sentiment category
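NLTK ships a rule-based sentiment analyzer (VADER) that can serve as a quick sketch of this idea. It assumes the vader_lexicon resource is downloaded, and the thresholds on the compound score below are a common convention, not a fixed rule:

from nltk.sentiment.vader import SentimentIntensityAnalyzer
# nltk.download('vader_lexicon')  # required once

sia = SentimentIntensityAnalyzer()
examples = ["The movie is excellent",
            "The product quality is too bad",
            "There is a coffee on your table"]

for sentence in examples:
    # compound ranges from -1 (most negative) to +1 (most positive)
    score = sia.polarity_scores(sentence)['compound']
    if score >= 0.05:
        label = "Positive"
    elif score <= -0.05:
        label = "Negative"
    else:
        label = "Neutral"
    print(sentence, "->", label)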

Conclusion :

This article covered how to extract information from raw text and prepare it for machine learning algorithms using NLTK. I hope you enjoyed learning about these methods.
