NLTK or SPACY?
Getting started with NLP but unsure whether to begin with NLTK or spaCy? Read on for a side-by-side comparison of the two.
SpaCy is an open-source software library for advanced Natural Language Processing, written in the programming languages Python and Cython. The library is published under the MIT license.
Natural Language Toolkit, or more commonly NLTK, is a suite of libraries and programs for symbolic and statistical natural language processing (NLP) for English written in the Python programming language. It was developed by Steven Bird and Edward Loper in the Department of Computer and Information Science at the University of Pennsylvania.
According to an analysis by Analytics Vidhya, the two libraries compare as follows:
[Figure: Feature availability]
[Figure: Speed of key functionalities (tokenizer, tagging, parsing)]
[Figure: Accuracy of entity extraction]
INSTALL:
Spacy:
pip install spacy
NLTK:
pip install nltk
COMPARISON BETWEEN SPACY AND NLTK
1. IMPORT
[SPACY]
import spacy
nlp = spacy.load("en")  # spaCy 3.x removed the "en" shortcut; use spacy.load("en_core_web_sm") there
[NLTK]
import nltk
2. WORD TOKENIZE
text = """Most of the outlay will be at home. No surprise there, either. While Samsung has expanded overseas, South Korea is still host to most of its factories and research engineers. """
[SPACY OUTPUT]:
['Most', 'of', 'the', 'outlay', 'will', 'be', 'at', 'home', '.', 'No', 'surprise', 'there', ',', 'either', '.', 'While', 'Samsung', 'has', 'expanded', 'overseas', ',', 'South', 'Korea', 'is', 'still', 'host', 'to', 'most', 'of', 'its', 'factories', 'and', 'research', 'engineers', '.']
[NLTK OUTPUT]:
['Most', 'of', 'the', 'outlay', 'will', 'be', 'at', 'home', '.', 'No', 'surprise', 'there', ',', 'either', '.', 'While', 'Samsung', 'has', 'expanded', 'overseas', ',', 'South', 'Korea', 'is', 'still', 'host', 'to', 'most', 'of', 'its', 'factories', 'and', 'research', 'engineers', '.']
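The token lists above can be reproduced with a few lines. This is a minimal sketch rather than the exact code behind the comparison: it assumes spaCy and NLTK are installed, uses spacy.blank("en") (tokenizer only, no model download) instead of a loaded model, and swaps in NLTK's regex-based wordpunct_tokenize for the more common word_tokenize, which needs the Punkt data package. The text is shortened to the first two sentences.

```python
import spacy
from nltk.tokenize import wordpunct_tokenize

text = "Most of the outlay will be at home. No surprise there, either."

# spaCy: a blank English pipeline contains only the rule-based tokenizer.
nlp = spacy.blank("en")
spacy_tokens = [token.text for token in nlp(text)]

# NLTK: wordpunct_tokenize splits on the regex \w+|[^\w\s]+ and needs no data files.
nltk_tokens = wordpunct_tokenize(text)

print(spacy_tokens)
print(nltk_tokens)
```

On plain text like this the two agree. They diverge on contractions: spaCy splits "don't" into "do" and "n't", while wordpunct_tokenize yields "don", "'" and "t".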
3. SENTENCE TOKENIZE
text = """Natural Language Toolkit, or more commonly NLTK, is a suite of libraries and programs for symbolic and statistical natural language processing (NLP) for English written in the Python programming language. It was developed by Steven Bird and Edward Loper in the Department of Computer and Information Science at the University of Pennsylvania."""
[SPACY OUTPUT]:
[Natural Language Toolkit, or more commonly NLTK, is a suite of libraries and programs for symbolic and statistical natural language processing (NLP) for English written in the Python programming language.,
It was developed by Steven Bird and Edward Loper in the Department of Computer and Information Science at the University of Pennsylvania.]
[NLTK OUTPUT]:
['Natural Language Toolkit, or more commonly NLTK, is a suite of libraries and programs for symbolic and statistical natural language processing (NLP) for English written in the Python programming language.',
'It was developed by Steven Bird and Edward Loper in the Department of Computer and Information Science at the University of Pennsylvania.']
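Sentence splitting in spaCy can be sketched without downloading a model by adding the rule-based sentencizer to a blank pipeline. This assumes spaCy 3.x (the string-name form of add_pipe) and is not necessarily how the output above was produced; a loaded model would instead take sentence boundaries from its parser. On the NLTK side, the usual call is sent_tokenize(text), which requires the Punkt data package.

```python
import spacy

text = (
    "Natural Language Toolkit, or more commonly NLTK, is a suite of libraries "
    "and programs for symbolic and statistical natural language processing (NLP) "
    "for English written in the Python programming language. "
    "It was developed by Steven Bird and Edward Loper in the Department of "
    "Computer and Information Science at the University of Pennsylvania."
)

# Blank pipeline plus the punctuation-based sentence splitter (spaCy 3.x API).
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

# doc.sents yields Span objects; .text gives the sentence as a string.
sentences = [sent.text for sent in nlp(text).sents]
for sentence in sentences:
    print(sentence)
```

Note that spaCy returns Span objects (hence the unquoted output above), while NLTK returns plain strings.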
4. STOP WORDS REMOVAL
text = """Most of the outlay will be at home. No surprise there, either. While Samsung has expanded overseas, South Korea is still host to most of its factories and research engineers. """
[SPACY OUTPUT]:
['Most', 'outlay', 'home', 'No', 'surprise', 'While', 'Samsung', 'expanded', 'overseas', 'South', 'Korea', 'host', 'factories', 'research', 'engineers']
[NLTK OUTPUT]:
['Most', 'outlay', 'home', '.', 'No', 'surprise', ',', 'either', '.', 'While', 'Samsung', 'expanded', 'overseas', ',', 'South', 'Korea', 'still', 'host', 'factories', 'research', 'engineers', '.']
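The difference between the two outputs comes down to which stop list is used and whether punctuation is filtered as well. Here is a minimal spaCy sketch, assuming a blank English pipeline (the is_stop and is_punct flags are lexical attributes, so no model is needed). The exact words removed depend on the spaCy version, so your list may differ slightly from the one above. The NLTK equivalent would filter against stopwords.words("english") from nltk.corpus, which needs the stopwords corpus downloaded.

```python
import spacy

text = ("While Samsung has expanded overseas, South Korea is still host "
        "to most of its factories and research engineers.")

nlp = spacy.blank("en")
doc = nlp(text)

# Drop both stop words and punctuation.
kept = [t.text for t in doc if not t.is_stop and not t.is_punct]

# Drop stop words only: punctuation survives, as in the NLTK output above.
kept_with_punct = [t.text for t in doc if not t.is_stop]

print(kept)
print(kept_with_punct)
```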
5. LEMMATIZATION
text = """While Samsung has expanded overseas, South Korea is still host to most of its factories and research engineers. """
[SPACY OUTPUT]:
While while
Samsung samsung
has have
expanded expand
overseas overseas
, ,
South south
Korea korea
is be
still still
host host
to to
most most
of of
its -PRON-
factories factory
and and
research research
engineers engineer
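. .
[NLTK OUTPUT]:
['While', 'Samsung', 'ha', 'expanded', 'overseas', ',', 'South', 'Korea', 'is', 'still', 'host', 'to', 'most', 'of', 'it', 'factory', 'and', 'research', 'engineer', '.']
Note: the 'ha' and 'it' in the NLTK output are consistent with WordNetLemmatizer being called without a part-of-speech argument, in which case every token is treated as a noun and a trailing 's' is stripped as if it were a plural. The -PRON- placeholder in the spaCy output is how spaCy 2.x lemmatizes pronouns.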
7. WORD FREQUENCY
text = """Most of the outlay will be at home. No surprise there, either. While Samsung has expanded overseas, South Korea is still host to most of its factories and research engineers. """
[SPACY OUTPUT]:
[('factories', 1), ('engineers', 1), ('No', 1), ('Most', 1), ('research', 1)]
[NLTK OUTPUT]:
[('factories', 1), ('still', 1), ('engineers', 1)]
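Counting word frequencies is a one-liner once the text is tokenized. The sketch below uses NLTK's FreqDist (a subclass of collections.Counter) on tokens from wordpunct_tokenize. Both choices are assumptions for illustration, picked because they need no data downloads. Unlike the outputs above, stop words are left in, so "of" shows up twice.

```python
from nltk.probability import FreqDist
from nltk.tokenize import wordpunct_tokenize

text = ("Most of the outlay will be at home. No surprise there, either. "
        "While Samsung has expanded overseas, South Korea is still host "
        "to most of its factories and research engineers.")

# Keep alphabetic tokens only, so punctuation is not counted.
words = [w for w in wordpunct_tokenize(text) if w.isalpha()]

# FreqDist maps each word to the number of times it occurs (case-sensitive).
freq = FreqDist(words)
print(freq.most_common(3))
```

With spaCy tokens the same count can be done with collections.Counter over [token.text for token in doc].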
8. POS TAGS
text = """Natural Language Toolkit, or more commonly NLTK."""
[SPACY OUTPUT]:
Natural PROPN
Language PROPN
Toolkit PROPN
, PUNCT
or CCONJ
more ADJ
commonly ADV
NLTK NOUN
. PUNCT
[NLTK OUTPUT]:
[('Natural', 'JJ'),
('Language', 'NNP'),
('Toolkit', 'NNP'),
(',', ','),
('or', 'CC'),
('more', 'JJR'),
('commonly', 'RB'),
('NLTK', 'NNP'),
('.', '.')]
9. NER
text = """Most of the outlay will be at home. No surprise there, either. While Samsung has expanded overseas, South Korea is still host to most of its factories and research engineers. """
[SPACY OUTPUT]:
ORG ['Samsung ']
GPE ['South Korea ']
[NLTK OUTPUT]:
['Samsung', 'South Korea']
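Lemmas, POS tags and named entities all come from a single pass through a trained spaCy pipeline, so sections 5, 8 and 9 can be sketched together. This is an illustration under stated assumptions, not the original code: it assumes spaCy 3.x with the en_core_web_sm model installed, and analyze is a hypothetical helper name. If the model is missing, the sketch prints the install command instead of failing. The NLTK counterparts are nltk.pos_tag and nltk.ne_chunk, which need their own data packages.

```python
import spacy

def analyze(text, model="en_core_web_sm"):
    """Hypothetical helper: return (token, lemma, POS) rows and (entity, label) pairs."""
    nlp = spacy.load(model)  # raises OSError if the model is not installed
    doc = nlp(text)
    rows = [(t.text, t.lemma_, t.pos_) for t in doc]
    entities = [(ent.text, ent.label_) for ent in doc.ents]
    return rows, entities

text = ("While Samsung has expanded overseas, South Korea is still host "
        "to most of its factories and research engineers.")

try:
    rows, entities = analyze(text)
except OSError:
    rows, entities = [], []
    print("Model missing; install it with: python -m spacy download en_core_web_sm")

for token, lemma, pos in rows:
    print(token, lemma, pos)
for ent_text, label in entities:
    print(label, ent_text)
```

Tags and entity labels depend on the model version, so the results may not match the outputs above exactly.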