What is an N-Gram?
An N-Gram is a contiguous sequence of n items from a given sample of text or speech. In the context of text analysis, these items can be words, characters, or symbols. N-Grams are fundamental in building models for language processing tasks such as text classification, sentiment analysis, language modeling, and more.
The size of the N-Gram—represented by 'n'—determines its context:
- Unigram (n=1): A single word.
- Bigram (n=2): A sequence of two words.
- Trigram (n=3): A sequence of three words.
- And so on...
Example of N-Grams
Consider the sentence: "I love Natural Language Processing."
- Unigrams: ["I", "love", "Natural", "Language", "Processing"]
- Bigrams: ["I love", "love Natural", "Natural Language", "Language Processing"]
- Trigrams: ["I love Natural", "love Natural Language", "Natural Language Processing"]
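Before reaching for a library, it is worth seeing that an N-Gram extractor is just a sliding window over a token list. The following minimal sketch reproduces the lists above; the helper name extract_ngrams is our own choice, not part of any library.

def extract_ngrams(tokens, n):
    # Slide a window of size n across the token list and join each window
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["I", "love", "Natural", "Language", "Processing"]
print(extract_ngrams(tokens, 2))
# ['I love', 'love Natural', 'Natural Language', 'Language Processing']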
Why Use N-Grams?
N-Grams are widely used in text analytics because they help to capture the local structure and semantics of language. Here are some of the common applications:
- Text Classification: Assigning a category to a text using N-Gram counts as features.
- Language Modeling: Predicting the next word in a sequence from the N-Grams observed so far (see the sketch after this list).
- Spelling Correction: Ranking candidate corrections by how probable the N-Grams they would form are.
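To make the language-modeling idea concrete, here is a minimal sketch in plain Python (no NLTK needed): it counts which word follows each word in a toy corpus and predicts the most frequent follower. The corpus and the helper name predict_next are illustrative assumptions, not from any library.

from collections import Counter, defaultdict

# Toy corpus (illustrative): count which word follows each word
corpus = "i love nlp and i love python".split()
followers = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    followers[prev][nxt] += 1

def predict_next(word):
    # Most frequent follower; ties break by first occurrence in the corpus
    return followers[word].most_common(1)[0][0]

print(predict_next("love"))  # 'nlp' ('nlp' and 'python' tie; first seen wins)

This is a bigram model in miniature: real language models smooth these counts and back off to shorter N-Grams when a sequence has never been seen.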
Setting Up Your Environment
We will use Python and the NLTK (Natural Language Toolkit) library to work with N-Grams. If you haven't already installed NLTK, you can do so using pip:
pip install nltk
Once NLTK is installed, you can set it up in your Python script as follows:
import nltk
nltk.download('punkt')
This will allow us to use the tokenizer provided by NLTK to break our text into words or sentences before generating N-Grams.
Creating N-Grams with NLTK
Let’s start with generating unigrams, bigrams, and trigrams from a given text. Here’s a step-by-step guide:
Step 1: Tokenization
First, we need to tokenize our text into words:
from nltk import word_tokenize

text = "I love Natural Language Processing."
tokens = word_tokenize(text)
print(tokens)
Output:
['I', 'love', 'Natural', 'Language', 'Processing', '.']
Step 2: Generating N-Grams
NLTK provides a convenient method to generate N-Grams. Let’s create unigrams, bigrams, and trigrams:
from nltk.util import ngrams

# Create N-Grams of different sizes from the token list
unigrams = list(ngrams(tokens, 1))
bigrams = list(ngrams(tokens, 2))
trigrams = list(ngrams(tokens, 3))

print("Unigrams:", unigrams)
print("Bigrams:", bigrams)
print("Trigrams:", trigrams)
Output:
Unigrams: [('I',), ('love',), ('Natural',), ('Language',), ('Processing',), ('.',)]
Bigrams: [('I', 'love'), ('love', 'Natural'), ('Natural', 'Language'), ('Language', 'Processing'), ('Processing', '.')]
Trigrams: [('I', 'love', 'Natural'), ('love', 'Natural', 'Language'), ('Natural', 'Language', 'Processing'), ('Language', 'Processing', '.')]
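For language modeling it is often useful to mark sentence boundaries so that the first and last words also appear in full-width N-Grams. NLTK's ngrams accepts padding arguments for this; a short sketch, assuming a recent NLTK version where these keyword arguments are available:

# Pad with boundary markers so edge tokens appear in bigrams too
padded_bigrams = list(ngrams(tokens, 2,
                             pad_left=True, pad_right=True,
                             left_pad_symbol='<s>', right_pad_symbol='</s>'))
print(padded_bigrams[0], padded_bigrams[-1])
# ('<s>', 'I') ('.', '</s>')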
N-Gram Frequency Distribution
One of the powerful applications of N-Grams is to analyze how frequently different N-Grams occur in a body of text. Here’s how you can create a frequency distribution of bigrams:
from nltk import FreqDist

# Build a frequency distribution over the bigrams generated above
bigrams_freq = FreqDist(bigrams)

# Print the most common bigrams with their counts
print(bigrams_freq.most_common())
Output Example:
[(('I', 'love'), 1), (('love', 'Natural'), 1), (('Natural', 'Language'), 1), (('Language', 'Processing'), 1), (('Processing', '.'), 1)]
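On our one-sentence example every bigram occurs exactly once, so the distribution is more instructive on a longer text. Here is a minimal sketch with a short made-up snippet (the text is illustrative):

from nltk import FreqDist, word_tokenize
from nltk.util import ngrams

# A slightly longer, repetitive text so some bigrams recur (illustrative)
long_text = "the cat sat on the mat and the cat slept on the mat"
long_bigrams = ngrams(word_tokenize(long_text), 2)
print(FreqDist(long_bigrams).most_common(3))
# [(('the', 'cat'), 2), (('on', 'the'), 2), (('the', 'mat'), 2)]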
Using N-Grams in a Text Analysis Pipeline
N-Grams combine well with other NLP techniques such as feature extraction and machine-learning classifiers. For instance, you can use N-Grams as features to train a classifier that predicts the sentiment or category of a text. Here's a simple illustration of this concept:
Example: Using N-Grams in Sentiment Analysis
- Prepare your dataset: gather a collection of texts with known sentiment labels.
- Generate N-Grams: tokenize each text and generate N-Grams, just as before.
- Feature Extraction: use the frequencies of these N-Grams as features for classification.
- Train your model: fit a suitable classifier such as Naive Bayes or an SVM.
Here's an example of using sklearn to set up a simple Naive Bayes classifier with N-Grams as features:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Example data
data = ["I love this movie", "This movie is bad", "I hate this movie"]
labels = ["positive", "negative", "negative"]

# Create a model: unigram and bigram counts feeding a Naive Bayes classifier
model = make_pipeline(CountVectorizer(ngram_range=(1, 2)), MultinomialNB())

# Train the model
model.fit(data, labels)

# Make a prediction
print(model.predict(["I really love this movie"]))
The output should be ['positive'], since "love" appears only in the positively labeled example. Note the ngram_range=(1, 2) argument: it tells CountVectorizer to extract both unigrams and bigrams as features.
Conclusion
N-Gram models offer a simple but powerful way to capture the local structure of language. From generating word sequences to feeding features into larger NLP tasks, working with N-Grams through Python's NLTK library gives you a solid foundation in Natural Language Processing, and combining them with machine learning, as in the scikit-learn pipeline above, turns that foundation into practical text-analysis applications.