NLTK or SPACY?

Getting started with NLP but unsure whether to begin with NLTK or spaCy? Read on for a comparison of the two.

Pema Grg
EKbana


SpaCy is an open-source software library for advanced Natural Language Processing, written in the programming languages Python and Cython. The library is published under the MIT license.

Natural Language Toolkit, or more commonly NLTK, is a suite of libraries and programs for symbolic and statistical natural language processing (NLP) for English written in the Python programming language. It was developed by Steven Bird and Edward Loper in the Department of Computer and Information Science at the University of Pennsylvania.

An analysis by Analytics Vidhya compares the two libraries on three points: feature availability, speed of the key functionalities (tokenizer, tagging, parsing), and accuracy (entity extraction).

INSTALL:

Spacy:

pip install spacy
python -m spacy download en_core_web_sm

NLTK:

pip install nltk

COMPARISON BETWEEN SPACY AND NLTK

  1. IMPORT
[SPACY]
import spacy
nlp = spacy.load("en_core_web_sm")  # the older "en" shortcut only works on spaCy 2.x
[NLTK]
import nltk

2. WORD TOKENIZE

text = """Most of the outlay will be at home. No surprise there, either. While Samsung has expanded overseas, South Korea is still host to most of its factories and research engineers. """
[SPACY OUTPUT]:
['Most', 'of', 'the', 'outlay', 'will', 'be', 'at', 'home', '.', 'No', 'surprise', 'there', ',', 'either', '.', 'While', 'Samsung', 'has', 'expanded', 'overseas', ',', 'South', 'Korea', 'is', 'still', 'host', 'to', 'most', 'of', 'its', 'factories', 'and', 'research', 'engineers', '.']
[NLTK OUTPUT]:
['Most', 'of', 'the', 'outlay', 'will', 'be', 'at', 'home', '.', 'No', 'surprise', 'there', ',', 'either', '.', 'While', 'Samsung', 'has', 'expanded', 'overseas', ',', 'South', 'Korea', 'is', 'still', 'host', 'to', 'most', 'of', 'its', 'factories', 'and', 'research', 'engineers', '.']

3. SENTENCE TOKENIZE

text = """Natural Language Toolkit, or more commonly NLTK, is a suite of libraries and programs for symbolic and statistical natural language processing (NLP) for English written in the Python programming language. It was developed by Steven Bird and Edward Loper in the Department of Computer and Information Science at the University of Pennsylvania."""
[SPACY OUTPUT]:
[Natural Language Toolkit, or more commonly NLTK, is a suite of libraries and programs for symbolic and statistical natural language processing (NLP) for English written in the Python programming language.,
It was developed by Steven Bird and Edward Loper in the Department of Computer and Information Science at the University of Pennsylvania.]
[NLTK OUTPUT]:
['Natural Language Toolkit, or more commonly NLTK, is a suite of libraries and programs for symbolic and statistical natural language processing (NLP) for English written in the Python programming language.',
'It was developed by Steven Bird and Edward Loper in the Department of Computer and Information Science at the University of Pennsylvania.']

4. STOP WORDS REMOVAL

text = """Most of the outlay will be at home. No surprise there, either. While Samsung has expanded overseas, South Korea is still host to most of its factories and research engineers. """
[SPACY OUTPUT]:
['Most', 'outlay', 'home', 'No', 'surprise', 'While', 'Samsung', 'expanded', 'overseas', 'South', 'Korea', 'host', 'factories', 'research', 'engineers']
[NLTK OUTPUT]:
['Most', 'outlay', 'home', '.', 'No', 'surprise', ',', 'either', '.', 'While', 'Samsung', 'expanded', 'overseas', ',', 'South', 'Korea', 'still', 'host', 'factories', 'research', 'engineers', '.']

5. LEMMA

text = """While Samsung has expanded overseas, South Korea is still host to most of its factories and research engineers. """
[SPACY OUTPUT]:
While while
Samsung samsung
has have
expanded expand
overseas overseas
, ,
South south
Korea korea
is be
still still
host host
to to
most most
of of
its -PRON-
factories factory
and and
research research
engineers engineer
. .
[NLTK OUTPUT]:
['While', 'Samsung', 'ha', 'expanded', 'overseas', ',', 'South', 'Korea', 'is', 'still', 'host', 'to', 'most', 'of', 'it', 'factory', 'and', 'research', 'engineer', '.']

7. WORD FREQUENCY

text = """Most of the outlay will be at home. No surprise there, either. While Samsung has expanded overseas, South Korea is still host to most of its factories and research engineers. """
[SPACY OUTPUT]:
[('factories', 1), ('engineers', 1), ('No', 1), ('Most', 1), ('research', 1)]
[NLTK OUTPUT]:
[('factories', 1), ('still', 1), ('engineers', 1)]
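
Counting works the same way whichever library produced the tokens. A minimal sketch using the standard library's `collections.Counter` on the stop-word-filtered spaCy tokens from section 4 (NLTK's `FreqDist` is a `Counter` subclass, so `most_common` behaves the same there). This is an assumed approach, not necessarily the one used for the outputs above.

```python
from collections import Counter

# Tokens copied from the spaCy stop-word output in section 4.
tokens = [
    "Most", "outlay", "home", "No", "surprise", "While", "Samsung",
    "expanded", "overseas", "South", "Korea", "host", "factories",
    "research", "engineers",
]

word_freq = Counter(tokens)
top_five = word_freq.most_common(5)

print(top_five)
```

Every word in this short text occurs once, so which five come back is just a matter of ordering; on a longer text the list becomes more informative.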

8. POS TAGS

text = """Natural Language Toolkit, or more commonly NLTK."""
[SPACY OUTPUT]:
Natural PROPN
Language PROPN
Toolkit PROPN
, PUNCT
or CCONJ
more ADJ
commonly ADV
NLTK NOUN
. PUNCT
[NLTK OUTPUT]:
[('Natural', 'JJ'),
('Language', 'NNP'),
('Toolkit', 'NNP'),
(',', ','),
('or', 'CC'),
('more', 'JJR'),
('commonly', 'RB'),
('NLTK', 'NNP'),
('.', '.')]

9. NER

text = """Most of the outlay will be at home. No surprise there, either. While Samsung has expanded overseas, South Korea is still host to most of its factories and research engineers. """
[SPACY OUTPUT]:
ORG ['Samsung ']
GPE ['South Korea ']
[NLTK OUTPUT]:
['Samsung', 'South Korea']

PS: To see the code, check: NLTK and SPACY

Currently an NLP Engineer @EKbana (Nepal) | previously worked @Awesummly (Bangalore) | internship @Meltwater, Bangalore | LinkedIn: https://www.linkedin.com/in/pemagrg/