spaCy for Beginners: NLP
spaCy is an open-source library for advanced Natural Language Processing, written in Python and Cython and published under the MIT license.
Today we'll look at how to get started with NLP using spaCy. Before starting, make sure you have Python and spaCy installed on your system.
To install spaCy and the English model:
$ pip install -U spacy
$ python -m spacy download en_core_web_sm
(Older tutorials use "python -m spacy download en"; the "en" shortcut has been removed in spaCy v3, so download and load the model by its full name, en_core_web_sm.)
In spaCy, the "nlp" object is used to create documents, access linguistic annotations, and reach other NLP properties.
1. IMPORT SPACY
Load the small English pipeline, en_core_web_sm (older spaCy versions loaded it via the "en" shortcut).
import spacy
nlp = spacy.load("en_core_web_sm")
2. WORD TOKENIZE
Tokenize words to get the tokens of the text, i.e. break the sentences into individual words.
import spacy
nlp = spacy.load("en_core_web_sm")
text = """Most of the outlay will be at home. No surprise there, either. While Samsung has expanded overseas, South Korea is still host to most of its factories and research engineers. """
doc = nlp(text)
words = [token.text for token in doc]
print(words)
[OUTPUT]:
['Most', 'of', 'the', 'outlay', 'will', 'be', 'at', 'home', '.', 'No', 'surprise', 'there', ',', 'either', '.', 'While', 'Samsung', 'has', 'expanded', 'overseas', ',', 'South', 'Korea', 'is', 'still', 'host', 'to', 'most', 'of', 'its', 'factories', 'and', 'research', 'engineers', '.']
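Tokenization alone does not need a trained model. As a minimal sketch (assuming spaCy v3's blank-pipeline API), you can build a tokenizer-only pipeline with spacy.blank and still read lexical attributes on each token:

```python
import spacy

# A blank English pipeline contains only the tokenizer,
# so no model download is required for this step.
nlp = spacy.blank("en")

doc = nlp("Samsung has expanded overseas.")
for token in doc:
    # Lexical attributes work without a trained model.
    print(token.text, token.is_alpha, token.is_punct)
```

This is handy when you only need word splitting and don't want the overhead of loading en_core_web_sm.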
3. SENTENCE TOKENIZE
Tokenize sentences when there is more than one sentence, i.e. break the text into a list of sentences.
import spacy
nlp = spacy.load("en_core_web_sm")
text = """Natural Language Toolkit, or more commonly NLTK, is a suite of libraries and programs for symbolic and statistical natural language processing (NLP) for English written in the Python programming language. It was developed by Steven Bird and Edward Loper in the Department of Computer and Information Science at the University of Pennsylvania."""
doc = nlp(text)
print(list(doc.sents))
[OUTPUT]:
[Natural Language Toolkit, or more commonly NLTK, is a suite of libraries and programs for symbolic and statistical natural language processing (NLP) for English written in the Python programming language.,
It was developed by Steven Bird and Edward Loper in the Department of Computer and Information Science at the University of Pennsylvania.]
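If you only need sentence splitting, spaCy also ships a rule-based "sentencizer" component that splits on punctuation and works without a trained model. A small sketch (assuming spaCy v3's string-based add_pipe API):

```python
import spacy

# Blank pipeline plus the rule-based sentence splitter;
# no trained model needed.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

doc = nlp("NLTK is a suite of libraries. It was developed at Penn.")
sentences = [sent.text for sent in doc.sents]
print(sentences)
```

The trained model's dependency parser gives better sentence boundaries on messy text, but the sentencizer is much faster.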
4. STOP WORDS REMOVAL
Remove irrelevant words such as "is", "the", and "a" from the sentences using spaCy's built-in stop-word list, since they carry little information.
import spacy
nlp = spacy.load("en_core_web_sm")
text = """Most of the outlay will be at home. No surprise there, either. While Samsung has expanded overseas, South Korea is still host to most of its factories and research engineers. """
doc = nlp(text)
# remove stop words and punctuation
words = [token.text for token in doc if not token.is_stop and not token.is_punct]
print(words)
[OUTPUT]:
['Most', 'outlay', 'home', 'No', 'surprise', 'While', 'Samsung', 'expanded', 'overseas', 'South', 'Korea', 'host', 'factories', 'research', 'engineers']
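The stop-word list itself is exposed as a plain Python set, so you can inspect or reuse it without loading a pipeline. A quick sketch:

```python
# spaCy's English stop words live in spacy.lang.en.stop_words
# as an ordinary set of lowercase strings.
from spacy.lang.en.stop_words import STOP_WORDS

print("the" in STOP_WORDS)
print("Samsung" in STOP_WORDS)
print(len(STOP_WORDS))  # size varies between spaCy versions
```

Because it is a normal set, you can extend it (STOP_WORDS.add("etc")) to filter domain-specific filler words.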
5. LEMMATIZE
Lemmatize the text to get the root form of each word, e.g. "functions" and "functionality" both reduce to "function".
import spacy
nlp = spacy.load("en_core_web_sm")
text = """While Samsung has expanded overseas, South Korea is still host to most of its factories and research engineers. """
doc = nlp(text)
for token in doc:
    print(token, token.lemma_)
[OUTPUT]:
While while
Samsung samsung
has have
expanded expand
overseas overseas
, ,
South south
Korea korea
is be
still still
host host
to to
most most
of of
its -PRON-
factories factory
and and
research research
engineers engineer
. .
(Note: older spaCy models lemmatize pronouns such as "its" to the placeholder -PRON-, as shown above; current models return the pronoun text itself.)
6. GET WORD FREQUENCY
Count word occurrences using collections.Counter. Word frequency helps us judge how important a word is in a document by showing how often it is used.
import spacy
from collections import Counter
nlp = spacy.load("en_core_web_sm")
text = """Most of the outlay will be at home. No surprise there, either. While Samsung has expanded overseas, South Korea is still host to most of its factories and research engineers. """
doc = nlp(text)
# remove stop words and punctuation
words = [token.text for token in doc if not token.is_stop and not token.is_punct]
word_freq = Counter(words)
common_words = word_freq.most_common(5)
print(common_words)
[OUTPUT]:
[('factories', 1), ('engineers', 1), ('No', 1), ('Most', 1), ('research', 1)]
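Counter itself is plain standard-library Python and works on any iterable of strings. A tiny sketch with a made-up sentence (a whitespace split stands in for the spaCy token list above) shows how most_common orders results:

```python
from collections import Counter

# str.split is a simplified stand-in for spaCy tokenization here.
words = "the cat sat on the mat the cat".split()
word_freq = Counter(words)
print(word_freq.most_common(2))  # → [('the', 3), ('cat', 2)]
```

In the article's example every remaining word appears once, which is why most_common(5) returns an essentially arbitrary five words all with count 1.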
7. POS TAGS
Part-of-speech (POS) tags tell us the grammatical role of each word, i.e. whether it is a noun, adjective, verb, etc.
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("""Natural Language Toolkit, or more commonly NLTK.""")
for w in doc:
    print(w, w.pos_)
[OUTPUT]:
Natural PROPN
Language PROPN
Toolkit PROPN
, PUNCT
or CCONJ
more ADJ
commonly ADV
NLTK NOUN
. PUNCT
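If a tag like CCONJ or PROPN is unfamiliar, spacy.explain maps it to a human-readable description; this lookup needs only the spacy package, not a downloaded model:

```python
import spacy

# spacy.explain looks up a tag in spaCy's built-in glossary
# and returns its description (or None for unknown tags).
for tag in ["PROPN", "CCONJ", "ADV", "PUNCT"]:
    print(tag, spacy.explain(tag))
```

The same function also explains dependency labels and entity types (e.g. "GPE").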
8. NER
NER (Named Entity Recognition) is the process of extracting named entities, such as organizations, people, and locations, from text.
import spacy
nlp = spacy.load("en_core_web_sm")
text = """Most of the outlay will be at home. No surprise there, either. While Samsung has expanded overseas, South Korea is still host to most of its factories and research engineers. """
doc = nlp(text)
labels = set(w.label_ for w in doc.ents)
for label in labels:
    entities = list(set(e.text for e in doc.ents if e.label_ == label))
    print(label, entities)
[OUTPUT]:
ORG ['Samsung']
GPE ['South Korea']
Voila! Now you know the basics of NLP 👌
You can now try some mini projects like:
- Extracting keywords of documents, articles.
- Generating part of speech for phrases.
- Getting the top used words among all documents.
You can also check: NLP for beginners using NLTK
Github Link for more codes: https://github.com/pemagrg1/SPACY-for-Beginners