What is an N-Gram?
An N-Gram is a contiguous sequence of n items from a given sample of text or speech. In the context of text analysis, these items can be words, characters, or symbols. N-Grams are fundamental in building models for language processing tasks such as text classification, sentiment analysis, language modeling, and more.
The size of the N-Gram—represented by 'n'—determines its context:
- Unigram (n=1): A single word.
- Bigram (n=2): A sequence of two words.
- Trigram (n=3): A sequence of three words.
- And so on...
Example of N-Grams
Consider the sentence: "I love Natural Language Processing."
- Unigrams: ["I", "love", "Natural", "Language", "Processing"]
- Bigrams: ["I love", "love Natural", "Natural Language", "Language Processing"]
- Trigrams: ["I love Natural", "love Natural Language", "Natural Language Processing"]
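Before reaching for a library, it is worth seeing that an N-Gram extractor is just a sliding window over a token list. The following minimal sketch reproduces the lists above; the helper name extract_ngrams is our own choice, not part of any library.

def extract_ngrams(tokens, n):
    # Slide a window of size n across the token list and join each window
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["I", "love", "Natural", "Language", "Processing"]
print(extract_ngrams(tokens, 2))
# ['I love', 'love Natural', 'Natural Language', 'Language Processing']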
Why Use N-Grams?
N-Grams are widely used in text analytics because they help to capture the local structure and semantics of language. Here are some of the common applications:
- Text Classification: Assigning a category to a text using N-Gram counts as features.
- Language Modeling: Predicting the next word in a sequence from the N-Grams observed so far (see the sketch after this list).
- Spelling Correction: Ranking candidate corrections by how probable the N-Grams they would form are.
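To make the language-modeling idea concrete, here is a minimal sketch in plain Python (no NLTK needed): it counts which word follows each word in a toy corpus and predicts the most frequent follower. The corpus and the helper name predict_next are illustrative assumptions, not from any library.

from collections import Counter, defaultdict

# Toy corpus (illustrative): count which word follows each word
corpus = "i love nlp and i love python".split()
followers = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    followers[prev][nxt] += 1

def predict_next(word):
    # Most frequent follower; ties break by first occurrence in the corpus
    return followers[word].most_common(1)[0][0]

print(predict_next("love"))  # 'nlp' ('nlp' and 'python' tie; first seen wins)

This is a bigram model in miniature: real language models smooth these counts and back off to shorter N-Grams when a sequence has never been seen.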
Setting Up Your Environment
We will use Python and the NLTK (Natural Language Toolkit) library to work with N-Grams. If you haven't already installed NLTK, you can do so using pip:
pip install nltk
Once NLTK is installed, you can set it up in your Python script as follows:
import nltk
nltk.download('punkt')
This will allow us to use the tokenizer provided by NLTK to break our text into words or sentences before generating N-Grams.
Creating N-Grams with NLTK
Let’s start with generating unigrams, bigrams, and trigrams from a given text. Here’s a step-by-step guide:
Step 1: Tokenization
First, we need to tokenize our text into words:
from nltk import word_tokenize

text = "I love Natural Language Processing."
tokens = word_tokenize(text)
print(tokens)
Output:
['I', 'love', 'Natural', 'Language', 'Processing', '.']
Step 2: Generating N-Grams
NLTK provides a convenient method to generate N-Grams. Let’s create unigrams, bigrams, and trigrams:
from nltk.util import ngrams

# Create N-Grams of different sizes from the token list
unigrams = list(ngrams(tokens, 1))
bigrams = list(ngrams(tokens, 2))
trigrams = list(ngrams(tokens, 3))

print("Unigrams:", unigrams)
print("Bigrams:", bigrams)
print("Trigrams:", trigrams)
Output:
Unigrams: [('I',), ('love',), ('Natural',), ('Language',), ('Processing',), ('.',)]
Bigrams: [('I', 'love'), ('love', 'Natural'), ('Natural', 'Language'), ('Language', 'Processing'), ('Processing', '.')]
Trigrams: [('I', 'love', 'Natural'), ('love', 'Natural', 'Language'), ('Natural', 'Language', 'Processing'), ('Language', 'Processing', '.')]
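For language modeling it is often useful to mark sentence boundaries so that the first and last words also appear in full-width N-Grams. NLTK's ngrams accepts padding arguments for this; a short sketch, assuming a recent NLTK version where these keyword arguments are available:

# Pad with boundary markers so edge tokens appear in bigrams too
padded_bigrams = list(ngrams(tokens, 2,
                             pad_left=True, pad_right=True,
                             left_pad_symbol='<s>', right_pad_symbol='</s>'))
print(padded_bigrams[0], padded_bigrams[-1])
# ('<s>', 'I') ('.', '</s>')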
N-Gram Frequency Distribution
One of the powerful applications of N-Grams is to analyze how frequently different N-Grams occur in a body of text. Here’s how you can create a frequency distribution of bigrams:
from nltk import FreqDist

# Build a frequency distribution over the bigrams generated above
bigrams_freq = FreqDist(bigrams)

# Print the most common bigrams with their counts
print(bigrams_freq.most_common())
Output Example:
[(('I', 'love'), 1), (('love', 'Natural'), 1), (('Natural', 'Language'), 1), (('Language', 'Processing'), 1), (('Processing', '.'), 1)]
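On our one-sentence example every bigram occurs exactly once, so the distribution is more instructive on a longer text. Here is a minimal sketch with a short made-up snippet (the text is illustrative):

from nltk import FreqDist, word_tokenize
from nltk.util import ngrams

# A slightly longer, repetitive text so some bigrams recur (illustrative)
long_text = "the cat sat on the mat and the cat slept on the mat"
long_bigrams = ngrams(word_tokenize(long_text), 2)
print(FreqDist(long_bigrams).most_common(3))
# [(('the', 'cat'), 2), (('on', 'the'), 2), (('the', 'mat'), 2)]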
Using N-Grams in a Text Analysis Pipeline
N-Grams combine well with other NLP techniques such as feature extraction and machine-learning classifiers. For instance, you can use N-Grams as features to train a classifier that predicts the sentiment or category of a text. Here's a simple illustration of this concept:
Example: Using N-Grams in Sentiment Analysis
- Prepare your dataset: gather a collection of texts with known sentiment labels.
- Generate N-Grams: tokenize each text and generate N-Grams, just as before.
- Feature Extraction: use the frequencies of these N-Grams as features for classification.
- Train your model: fit a suitable classifier such as Naive Bayes or an SVM.
Here's an example of using sklearn to set up a simple Naive Bayes classifier with N-Grams as features:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Example data
data = ["I love this movie", "This movie is bad", "I hate this movie"]
labels = ["positive", "negative", "negative"]

# Create a model: unigram and bigram counts feeding a Naive Bayes classifier
model = make_pipeline(CountVectorizer(ngram_range=(1, 2)), MultinomialNB())

# Train the model
model.fit(data, labels)

# Make a prediction
print(model.predict(["I really love this movie"]))
The output should be ['positive'], since "love" appears only in the positively labeled example. Note the ngram_range=(1, 2) argument: it tells CountVectorizer to extract both unigrams and bigrams as features.
Conclusion
N-Gram models offer a simple but powerful way to capture the local structure of language. From generating word sequences to feeding features into larger NLP tasks, working with N-Grams through Python's NLTK library gives you a solid foundation in Natural Language Processing, and combining them with machine learning, as in the scikit-learn pipeline above, turns that foundation into practical text-analysis applications.