An N-Gram is a contiguous sequence of n items from a given sample of text or speech. In the context of text analysis, these items can be words, characters, or symbols. N-Grams are fundamental in building models for language processing tasks such as text classification, sentiment analysis, language modeling, and more.
The size of the N-Gram, represented by 'n', determines how much context each sequence captures:

- Unigram (n = 1): a single item, such as one word
- Bigram (n = 2): a pair of consecutive items
- Trigram (n = 3): a triple of consecutive items
Consider the sentence: "I love Natural Language Processing." Its bigrams, for example, are "I love", "love Natural", "Natural Language", and "Language Processing".
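Since N-Grams can also be built over characters rather than words (handy for tasks like spell checking), here is a minimal pure-Python sketch of character-level N-Grams. The `char_ngrams` helper is our own illustration, not an NLTK function:

```python
def char_ngrams(text, n):
    """Return all contiguous character n-grams of `text` as strings."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

# Character bigrams of a single word
print(char_ngrams("love", 2))  # ['lo', 'ov', 've']
```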
N-Grams are widely used in text analytics because they capture the local structure and semantics of language, which makes them useful for the tasks mentioned above: text classification, sentiment analysis, and language modeling.
We will use Python and the NLTK (Natural Language Toolkit) library to work with N-Grams. If you haven't already installed NLTK, you can do so using pip:
pip install nltk
Once NLTK is installed, you can set it up in your Python script as follows:
import nltk
nltk.download('punkt')
This will allow us to use the tokenizer provided by NLTK to break our text into words or sentences before generating N-Grams.
Let’s start with generating unigrams, bigrams, and trigrams from a given text. Here’s a step-by-step guide:
First, we need to tokenize our text into words:
from nltk import word_tokenize

text = "I love Natural Language Processing."
tokens = word_tokenize(text)
print(tokens)
Output:
['I', 'love', 'Natural', 'Language', 'Processing', '.']
NLTK provides a convenient method to generate N-Grams. Let’s create unigrams, bigrams, and trigrams:
from nltk.util import ngrams

# Create N-Grams of sizes 1, 2, and 3
unigrams = list(ngrams(tokens, 1))
bigrams = list(ngrams(tokens, 2))
trigrams = list(ngrams(tokens, 3))

print("Unigrams:", unigrams)
print("Bigrams:", bigrams)
print("Trigrams:", trigrams)
Output:
Unigrams: [('I',), ('love',), ('Natural',), ('Language',), ('Processing',), ('.',)]
Bigrams: [('I', 'love'), ('love', 'Natural'), ('Natural', 'Language'), ('Language', 'Processing'), ('Processing', '.')]
Trigrams: [('I', 'love', 'Natural'), ('love', 'Natural', 'Language'), ('Natural', 'Language', 'Processing'), ('Language', 'Processing', '.')]
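Under the hood, generating N-Grams is essentially sliding a window of size n over the token list. As a sanity check, here is a rough pure-Python equivalent of what `ngrams` produces (a sketch for intuition, not NLTK's actual implementation):

```python
def make_ngrams(tokens, n):
    """Slide a window of size n across tokens, returning tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ['I', 'love', 'Natural', 'Language', 'Processing', '.']
print(make_ngrams(tokens, 2))
# [('I', 'love'), ('love', 'Natural'), ('Natural', 'Language'),
#  ('Language', 'Processing'), ('Processing', '.')]
```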
One of the powerful applications of N-Grams is to analyze how frequently different N-Grams occur in a body of text. Here’s how you can create a frequency distribution of bigrams:
from nltk import FreqDist

# Build a frequency distribution over the bigrams
bigrams_freq = FreqDist(bigrams)

# Print the most common bigrams with their counts
print(bigrams_freq.most_common())
Output Example:
[(('I', 'love'), 1), (('love', 'Natural'), 1), (('Natural', 'Language'), 1), (('Language', 'Processing'), 1), (('Processing', '.'), 1)]
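Bigram frequencies are also the building block of simple language models: given a word, the most frequent bigram starting with it suggests the likeliest next word. The toy corpus below is made up for illustration, and we split on whitespace rather than using NLTK's tokenizer to keep the sketch self-contained:

```python
from collections import Counter, defaultdict

corpus = "I love NLP . I love NLP . I love Python ."
tokens = corpus.split()

# For each word, count the words that follow it
following = defaultdict(Counter)
for prev, nxt in zip(tokens, tokens[1:]):
    following[prev][nxt] += 1

# Most frequent continuation of "love"
print(following["love"].most_common(1))  # [('NLP', 2)]
```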
N-Grams can be effectively combined with other NLP techniques such as Feature Extraction and Machine Learning classifiers. For instance, you could use the N-Grams as features to train a classifier that predicts sentiment or categories of text. Here’s a simple illustration of this concept:
Here's an example of using sklearn to set up a simple Naive Bayes classifier with N-Grams as features:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Example data
data = ["I love this movie", "This movie is bad", "I hate this movie"]
labels = ["positive", "negative", "negative"]

# Create a model: unigram and bigram counts feeding a Naive Bayes classifier
model = make_pipeline(CountVectorizer(ngram_range=(1, 2)), MultinomialNB())

# Train the model
model.fit(data, labels)

# Make a prediction
print(model.predict(["I really love this movie"]))
This prints ['positive'], because n-grams such as "love" and "love this" appear only in the positive training example.
N-Gram models provide a robust framework for understanding language and textual data better. From generating simple word sequences to their applications in complex NLP tasks, mastering N-Grams with Python's NLTK library empowers you to delve deeper into the world of Natural Language Processing. The added ability to use N-Grams in conjunction with machine learning creates endless possibilities for text analysis.