NLTK (Natural Language Toolkit) is a foundational library in Python for Natural Language Processing (NLP), which is the field of AI focused on enabling computers to understand, interpret, and generate human language. NLTK is often called the "Swiss Army knife" of NLP in Python because it provides a vast suite of tools and resources for a wide range of text processing tasks.
While more modern libraries like spaCy and Hugging Face Transformers are often used for production-level applications, NLTK remains an invaluable tool for learning, teaching, and research in NLP because it makes the fundamental concepts of text processing very explicit and accessible.
Key Concepts
- Corpora (singular: Corpus): These are large, structured sets of text. NLTK provides easy access to dozens of corpora, such as collections of classic literature, movie reviews, and chat logs, which are essential for training and testing NLP models (a short loading example follows this list).
- Tokenization: The process of breaking down a piece of text into smaller units, called tokens. These tokens are typically words or sentences.
- Stop Words: These are common words (like "the", "a", "in", "is") that are often removed from text before processing because they usually don't carry significant meaning.
- Stemming & Lemmatization: Techniques used to reduce words to their root or base form.
- Stemming is a crude, rule-based process that chops off the ends of words (e.g., "running" -> "run", though the irregular "ran" stays "ran"). It's fast but can be inaccurate.
- Lemmatization is a more sophisticated process that uses a dictionary to find the root form of a word, known as the lemma (e.g., "running" -> "run", "ran" -> "run"). It's more accurate but slower.
- Part-of-Speech (POS) Tagging: The process of marking up a word in a text as corresponding to a particular part of speech (e.g., noun, verb, adjective).
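For example, here is a minimal sketch of loading one of NLTK's bundled corpora (it assumes the gutenberg data package is available; the download line fetches it on first run):
import nltk
nltk.download('gutenberg')
from nltk.corpus import gutenberg
# List the texts bundled in the Gutenberg corpus
print(gutenberg.fileids())
# Read one text as a list of word tokens
emma = gutenberg.words('austen-emma.txt')
print(len(emma), emma[:10])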
Code Examples
To run these examples, you first need to install NLTK: pip install nltk
After installation, you need to download the necessary NLTK data packages by running the following commands in Python. This only needs to be done once.
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
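Note: on newer NLTK releases, some of these resources were repackaged under new names; if you hit a LookupError in the examples below, also try:
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger_eng')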
1. Tokenization: Splitting Text into Words and Sentences
This is often the very first step in any NLP pipeline.
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
# Sample text
text = "Hello everyone. Welcome to the world of Natural Language Processing. NLTK makes it easy to get started."
# Tokenize the text into sentences
sentences = sent_tokenize(text)
print("--- Sentences ---")
print(sentences)
# Tokenize the text into words
words = word_tokenize(text)
print("\n--- Words ---")
print(words)
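# sent_tokenize should return the three sentences as separate strings, while
# word_tokenize splits punctuation into its own tokens, e.g.:
# ['Hello', 'everyone', '.', 'Welcome', 'to', ...]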
2. Stop Word Removal
Removing common, non-informative words helps to focus on the words that carry the most meaning.
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
text = "this is a sample sentence, showing off the stop words filtration."
# Get the set of English stop words
stop_words = set(stopwords.words('english'))
# Tokenize the sentence
words = word_tokenize(text)
# Filter out the stop words
filtered_sentence = [w for w in words if w.lower() not in stop_words]
print(f"Original Words: {words}")
print(f"Filtered Words: {filtered_sentence}")
3. Stemming
Stemming reduces words to their root form using a crude heuristic process. Notice how "studies" and "studying" both become "studi."
import nltk
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
# Create a stemmer object
stemmer = PorterStemmer()
words = ["program", "programs", "programmer", "programming", "programmers"]
stemmed_words = [stemmer.stem(w) for w in words]
print(f"Original Words: {words}")
print(f"Stemmed Words: {stemmed_words}")
# Example with a sentence
sentence = "He studies and is studying all the time."
tokenized_sentence = word_tokenize(sentence)
stemmed_sentence = [stemmer.stem(w) for w in tokenized_sentence]
print(f"\nStemmed Sentence: {' '.join(stemmed_sentence)}")
4. Lemmatization
Lemmatization is a more sophisticated alternative to stemming: it uses a vocabulary and morphological analysis to return the dictionary form of a word (the lemma).
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
# Create a lemmatizer object
lemmatizer = WordNetLemmatizer()
words = ["rocks", "corpora", "better", "running"]
lemmatized_words = [lemmatizer.lemmatize(w) for w in words]
print(f"Original Words: {words}")
print(f"Lemmatized Words: {lemmatized_words}")
# By default, lemmatize() treats every word as a noun, which is why 'better'
# and 'running' come back unchanged above. Supplying the part of speech helps:
# 'v' stands for verb and 'a' for adjective.
print(f"\n'running' as a verb: {lemmatizer.lemmatize('running', pos='v')}")
print(f"'better' as an adjective: {lemmatizer.lemmatize('better', pos='a')}")
5. Part-of-Speech (POS) Tagging
This example identifies the grammatical part of speech for each word in a sentence.
import nltk
from nltk.tokenize import word_tokenize
text = "The quick brown fox jumps over the lazy dog."
# Tokenize the text
words = word_tokenize(text)
# Perform POS tagging
tagged_words = nltk.pos_tag(words)
print("--- Part-of-Speech Tags ---")
print(tagged_words)
# The tags are abbreviations, e.g.:
# DT: Determiner ('The')
# JJ: Adjective ('quick', 'brown', 'lazy')
# NN: Noun, singular ('fox', 'dog')
# VBZ: Verb, 3rd person singular present ('jumps')
# IN: Preposition ('over')
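These tags also pair naturally with the lemmatizer from example 4. The sketch below shows one common way to combine them (the get_wordnet_pos helper is our own illustration, not an NLTK function): it maps Penn Treebank tag prefixes to the WordNet POS constants that lemmatize() expects.
import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
# Map a Penn Treebank tag to the corresponding WordNet POS constant
def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN  # default, matching lemmatize()'s own default
lemmatizer = WordNetLemmatizer()
sentence = "The striped bats were hanging on their feet."
tagged = nltk.pos_tag(word_tokenize(sentence))
lemmas = [lemmatizer.lemmatize(word, get_wordnet_pos(tag)) for word, tag in tagged]
# Expected output along the lines of: The striped bat be hang on their foot .
print(' '.join(lemmas))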