
Pre-Processing Text in Python

Pema Grg
Published in EKbana · 5 min read · Nov 1, 2018

So you are planning to do research in a text field but are not sure how to start?

Well, why not start with pre-processing? It is an essential first step in any text research, and it's easy! Cleaning the text gets you quality output by removing irrelevant content, normalizing word forms, and so on.

In this article, we will be covering:

1. Converting text to lowercase

2. Contraction

3. Sentence tokenize

4. Word tokenize

5. Spell Check

6. Lemmatize

7. Stemming

8. Remove Tags

9. Remove numbers

10. Remove punctuation

11. Remove stopwords

Let’s START!

Pre-requisites:

Install Python

Install NLTK (pip install nltk)

pip install autocorrect

Done with the installations? Okay, let's start coding!

Convert text to lower case:

Converting text to lower case, as in turning "Hello" or "HELLO" into "hello".

import nltk
from nltk.tokenize import word_tokenize

def to_lower(text):
    """
    Convert text to lower case, e.g. "Hello" or "HELLO" becomes "hello".
    """
    return ' '.join([w.lower() for w in word_tokenize(text)])

text = """Harry Potter is the most miserable, lonely boy you can imagine. He's shunned by his relatives, the Dursleys, who have raised him since he was an infant. He's forced to live in the cupboard under the stairs, forced to wear his cousin Dudley's hand-me-down clothes, and forced to go to his neighbour's house when the rest of the family is doing something fun. Yes, he's just about as miserable as you can get."""
print(to_lower(text))
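If you don't need token-level control, Python's built-in str.lower() lowercases the whole string in one call and preserves the original spacing and punctuation:

print(text.lower())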

Remove Tags

Removing HTML tags such as "<head><body>" from the text using a regex.

import re

text = """<head><body>hello world!</body></head>"""
cleaned_text = re.sub('<[^<]+?>', '', text)
print(cleaned_text)
[OUTPUT]:
hello world!
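A regex works for simple snippets like this, but it can trip on messy HTML. Here is a sketch of a more robust alternative using the standard library's html.parser (the TextExtractor class name is just illustrative):

from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    # collect only the text that appears between tags
    def __init__(self):
        super().__init__()
        self.parts = []
    def handle_data(self, data):
        self.parts.append(data)

extractor = TextExtractor()
extractor.feed("<head><body>hello world!</body></head>")
print(''.join(extractor.parts))
[OUTPUT]:
hello world!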

Remove Numbers

Removing numbers from the text, like "1, 2, 3, 4, 5…". We usually remove numbers when doing text clustering or keyphrase extraction, since numbers rarely contribute to the main words. To remove them, check each character with .isdigit() (or .isnumeric()).

text = "There was 200 people standing right next to me at 2pm."
output = ''.join(c for c in text if not c.isdigit())
print(output)
[OUTPUT]:
There was  people standing right next to me at pm.
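Character-level deletion leaves stray spaces behind (note the double space above). A token-level sketch that drops whole numeric tokens instead, using .isnumeric() and a simple whitespace split:

text = "There was 200 people standing right next to me at 2pm."
# "200" is purely numeric and gets dropped; "2pm" is mixed and survives
output = ' '.join(w for w in text.split() if not w.isnumeric())
print(output)
[OUTPUT]:
There was people standing right next to me at 2pm.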

Remove punctuation

Removing punctuation from the text, like ".?!", as well as symbols like "@#$".

from string import punctuation

def strip_punctuation(s):
    return ''.join(c for c in s if c not in punctuation)

text = "Hello! how are you doing?"
print(strip_punctuation(text))
[OUTPUT]:
Hello how are you doing
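The same can be done in a single call with str.translate, which is typically faster on long texts:

from string import punctuation

text = "Hello! how are you doing?"
print(text.translate(str.maketrans('', '', punctuation)))
[OUTPUT]:
Hello how are you doing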

Lemmatize

Lemmatization is similar to stemming, but it brings context to the words: it links words with similar meanings to one word by performing a morphological analysis. In short, lemmatize the text to get each word's root form, e.g. "functions" and "functionality" become "function".

import nltk
from nltk.stem import WordNetLemmatizer

wordnet_lemmatizer = WordNetLemmatizer()
text = "the functions of this fan is great"
word_tokens = nltk.word_tokenize(text)
lemmatized_word = [wordnet_lemmatizer.lemmatize(word) for word in word_tokens]
print(lemmatized_word)
[OUTPUT]: ['the', 'function', 'of', 'this', 'fan', 'is', 'great']
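Note that WordNetLemmatizer treats every word as a noun unless you pass a part-of-speech tag, which is why "is" stays unchanged above. A quick sketch of the pos argument:

print(wordnet_lemmatizer.lemmatize('is'))           # 'is' (treated as a noun)
print(wordnet_lemmatizer.lemmatize('is', pos='v'))  # 'be' (treated as a verb)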

Stemming

Stemming and lemmatization are both forms of text normalization (sometimes called word normalization). A stemmer does not keep a lookup table of actual word stems; it applies algorithmic rules to generate them, using those rules to decide whether it is wise to strip a suffix.

import nltk
from nltk.stem import SnowballStemmer

# SnowballStemmer is based on the Porter stemming algorithm
snowball_stemmer = SnowballStemmer('english')
text = "This is a Demo Text for NLP using NLTK. Full form of NLTK is Natural Language Toolkit"
word_tokens = nltk.word_tokenize(text)
stemmed_word = [snowball_stemmer.stem(word) for word in word_tokens]
print(stemmed_word)
[OUTPUT]: ['this', 'is', 'a', 'demo', 'text', 'for', 'nlp', 'use', 'nltk', '.', 'full', 'form', 'of', 'nltk', 'is', 'natur', 'languag', 'toolkit']
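NLTK also ships the original PorterStemmer; Snowball (sometimes called Porter2) is a refinement of it, and the two occasionally disagree on suffixes:

from nltk.stem import PorterStemmer

porter = PorterStemmer()
print(porter.stem('fairly'))            # fairli
print(snowball_stemmer.stem('fairly'))  # fair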

Word tokenize

Tokenize words to get the tokens of the text, i.e. break each sentence into a list of words.

import nltk

text = "This is a Demo Text for NLP using NLTK. Full form of NLTK is Natural Language Toolkit"
word_tokens = nltk.word_tokenize(text)
print(word_tokens)
[OUTPUT]: ['This', 'is', 'a', 'Demo', 'Text', 'for', 'NLP', 'using', 'NLTK', '.', 'Full', 'form', 'of', 'NLTK', 'is', 'Natural', 'Language', 'Toolkit']

Sentence tokenize

If there is more than one sentence, split the text, i.e. break it into a list of sentences.

import nltk

text = "This is a Demo Text for NLP using NLTK. Full form of NLTK is Natural Language Toolkit"
sent_token = nltk.sent_tokenize(text)
print(sent_token)
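With NLTK's default punkt tokenizer, the expected output is:

[OUTPUT]: ['This is a Demo Text for NLP using NLTK.', 'Full form of NLTK is Natural Language Toolkit']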

Stop words removal

Remove irrelevant words using NLTK's stop word list, e.g. "is", "the", "a", as they don't carry useful information.

import nltk
from nltk.corpus import stopwords

stopword = stopwords.words('english')
text = "This is a Demo Text for NLP using NLTK. Full form of NLTK is Natural Language Toolkit"
word_tokens = nltk.word_tokenize(text)
removing_stopwords = [word for word in word_tokens if word not in stopword]
print(removing_stopwords)
[OUTPUT]: ['This', 'Demo', 'Text', 'NLP', 'using', 'NLTK', '.', 'Full', 'form', 'NLTK', 'Natural', 'Language', 'Toolkit']
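Notice that 'This' survives above because NLTK's stop word list is all lowercase. Comparing on the lowercased token fixes that:

removing_stopwords = [word for word in word_tokens if word.lower() not in stopword]
print(removing_stopwords)
[OUTPUT]: ['Demo', 'Text', 'NLP', 'using', 'NLTK', '.', 'Full', 'form', 'NLTK', 'Natural', 'Language', 'Toolkit']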

Contraction

Expanding contractions turns shortened forms into their full word forms, e.g. "ain't" becomes "am not". A contractions file has been created in my GitHub, which we import and use here.

# coding: utf-8
import re
import nltk
from contractions import contractions_dict

def expand_contractions(text, contractions_dict):
    contractions_pattern = re.compile('({})'.format('|'.join(contractions_dict.keys())),
                                      flags=re.IGNORECASE | re.DOTALL)

    def expand_match(contraction):
        match = contraction.group(0)
        first_char = match[0]
        expanded_contraction = contractions_dict.get(match) \
            if contractions_dict.get(match) \
            else contractions_dict.get(match.lower())
        # keep the original casing of the first character
        expanded_contraction = first_char + expanded_contraction[1:]
        return expanded_contraction

    expanded_text = contractions_pattern.sub(expand_match, text)
    expanded_text = re.sub("'", "", expanded_text)
    return expanded_text

def main():
    text = """I ain't going there. You'll have to go alone."""

    text = expand_contractions(text, contractions_dict)
    sentences = nltk.sent_tokenize(text)
    tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]

    print(tokenized_sentences)

if __name__ == '__main__':
    main()
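If you don't want to depend on the GitHub file, a minimal inline dictionary (hypothetical, covering only this example) shows the same mechanics:

# hypothetical stand-in for the full contractions file
contractions_dict = {"ain't": "am not", "you'll": "you will"}
print(expand_contractions("I ain't going there. You'll have to go alone.", contractions_dict))
[OUTPUT]:
I am not going there. You will have to go alone.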

Spell Check

Correct misspelled words, like "wrld" to "world".

import nltk
from autocorrect import spell

text = "This is a wrld of hope"
spells = [spell(w) for w in nltk.word_tokenize(text)]
print(spells)
[OUTPUT]: ['This', 'is', 'a', 'world', 'of', 'hope']
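Note that newer releases of autocorrect replaced the spell function with a Speller class; if the import above fails, the equivalent is:

from autocorrect import Speller

spell = Speller(lang='en')
spells = [spell(w) for w in nltk.word_tokenize(text)]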

PS:

You can get all the above code from GitHub.

Once you are done with Pre-processing, you can then move to NER, clustering, word count, sentiment analysis, etc.

Let me give you a small demo of word count, which helps us get the main words from a document. Once you are done with pre-processing, you are left with a clean list of words, i.e. without stopwords, numbers, punctuation, etc. Now you can count the number of times each word is repeated, which is known as its word frequency, by iterating through the pre-processed list:

word_count = {}
for word in pre_process_text:
    if word in word_count:
        word_count[word] += 1
    else:
        word_count[word] = 1

# Print in sorted order
for w in sorted(word_count, key=word_count.get, reverse=True):
    print(w, word_count[w])
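The standard library's collections.Counter does the same bookkeeping in two lines (pre_process_text being your cleaned word list from above):

from collections import Counter

word_count = Counter(pre_process_text)
for w, count in word_count.most_common():
    print(w, count)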

You then get a list of words with their frequencies, which you can analyze to identify the main words of the document, i.e. what the document talks about, based on the top-frequency words. Easy, right?

All the best with your research!

Currently an NLP Engineer @EKbana (Nepal) | previously worked @Awesummly (Bangalore) | internship @Meltwater, Bangalore | LinkedIn: https://www.linkedin.com/in/pemagrg/