Text Classification Using NLTK in Python

Text classification is a core task in Natural Language Processing (NLP): assigning a piece of text to one of a set of predefined categories. Spam detection in email, sentiment analysis of customer reviews, and sorting news articles by topic are all examples. In this post, we will use the Natural Language Toolkit (NLTK) in Python to build a simple text classification model. Let's get started!

1. Setting Up Your Environment

Before diving into the code, let’s set up our environment. Make sure you have Python and NLTK installed. You can install NLTK using pip:

pip install nltk

Next, download the NLTK resources for this tutorial: the punkt tokenizer models, the stopwords list, and the movie_reviews corpus.

import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('movie_reviews')

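This code downloads the resources we'll use in our text classification task. The stopwords list is not used in the code below, but it is handy if you later want to filter common words out of the features. If word_tokenize later raises a LookupError that mentions punkt_tab, run nltk.download('punkt_tab') as well; newer NLTK releases look for that resource.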

2. Loading the Dataset

For our example, we’ll use the movie reviews dataset included in NLTK. This dataset consists of 2,000 movie reviews categorized as positive or negative. Here’s how to load the dataset:

from nltk.corpus import movie_reviews
import random

# Load the dataset and create a list of tuples with the reviews and their labels
documents = [(list(movie_reviews.words(fileid)), category) 
             for category in movie_reviews.categories() 
             for fileid in movie_reviews.fileids(category)]

# Shuffle so positive and negative reviews are mixed (the corpus lists them by category)
random.shuffle(documents)

# Show the first 10 words and the label of one document
print("Sample document:", documents[0][0][:10], "->", documents[0][1])

This will give you a list of tuples, where each tuple consists of the words in the review and its corresponding label.

3. Feature Extraction

Before we can train a model, we need to convert our text data into features the classifier can use. A common approach is the "bag of words" model, which ignores word order. Here, we pick the most frequent words in the corpus and, for each review, record whether each of those words is present (presence, not a count).


# Get all words from the movie reviews
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())

# Select the top 2000 most frequent words as features
# (most_common sorts by count; keys() alone would not be frequency-ordered)
word_features = [word for word, _ in all_words.most_common(2000)]

# Function to extract features from a given document
def document_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features[word] = (word in document_words)
    return features

# Create a list of featuresets
featuresets = [(document_features(doc), category) for (doc, category) in documents]

In this code, document_features checks whether each of our 2,000 most frequent words is present in the document and returns a dictionary that maps each word to True or False.
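
To make the feature dictionary concrete, here is a toy sketch of the same idea on a three-document corpus. It uses collections.Counter as a stand-in for nltk.FreqDist (which is a Counter subclass), and the names toy_docs, toy_vocab, and toy_features are made up for this illustration.

```python
from collections import Counter

# Hypothetical toy corpus: each document is already a list of lowercase tokens.
toy_docs = [
    ["great", "plot", "great", "cast"],
    ["dull", "plot"],
    ["great", "fun"],
]

# Count every token across the corpus (FreqDist behaves like a Counter here).
counts = Counter(word for doc in toy_docs for word in doc)

# Keep the 2 most frequent words as the vocabulary.
toy_vocab = [word for word, _ in counts.most_common(2)]

def toy_features(document, vocab):
    """Map each vocabulary word to True/False depending on presence."""
    present = set(document)
    return {word: (word in present) for word in vocab}

print(toy_vocab)                                  # ['great', 'plot']
print(toy_features(["dull", "plot"], toy_vocab))  # {'great': False, 'plot': True}
```

Note that "dull" does not appear in the output at all: a word outside the vocabulary is invisible to the classifier, however informative it may be.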

4. Splitting the Data

Now, we split our dataset into a training set and a testing set to evaluate the performance of our model.


# Split the data into training and testing sets
train_set = featuresets[:1600]
test_set = featuresets[1600:]

print("Training set size:", len(train_set))
print("Test set size:", len(test_set))
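
The hard-coded index 1600 only works for a dataset of exactly 2,000 items. As a hedged alternative, the sketch below splits by fraction and shuffles with a fixed seed so the split is repeatable. The helper name split_dataset and its parameters are assumptions for illustration, not part of NLTK.

```python
import random

def split_dataset(items, train_fraction=0.8, seed=42):
    """Shuffle a copy of items with a fixed seed, then cut it in two."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Stand-in for featuresets: 2,000 placeholder items.
toy_items = list(range(2000))
train_part, test_part = split_dataset(toy_items)
print(len(train_part), len(test_part))  # 1600 400
```

With the tutorial's data you would pass featuresets instead of toy_items. Because the helper shuffles a copy, the original list is left untouched.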

5. Training the Classifier

With our training set ready, we can now create and train a Naive Bayes classifier. It is a common baseline for text classification because it is simple, fast to train, and works reasonably well on bag-of-words features.

from nltk import NaiveBayesClassifier

# Train the Naive Bayes classifier
classifier = NaiveBayesClassifier.train(train_set)

# Print the classifier accuracy
accuracy = nltk.classify.accuracy(classifier, test_set)
print("Accuracy of the classifier:", accuracy)

This code trains the Naive Bayes classifier on our training data and evaluates its accuracy on the test set.
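
To see roughly what happens inside a Naive Bayes classifier, here is a minimal pure-Python sketch for Boolean features. It is not NLTK's implementation: the function names train_nb and classify_nb are hypothetical, and it assumes simple add-one smoothing, whereas NLTK's classifier differs in details such as its smoothing estimator.

```python
import math
from collections import Counter, defaultdict

def train_nb(labeled_featuresets):
    """Count labels and (label, feature, value) co-occurrences."""
    label_counts = Counter()
    value_counts = defaultdict(Counter)  # (label, feature) -> Counter of values
    for features, label in labeled_featuresets:
        label_counts[label] += 1
        for name, value in features.items():
            value_counts[(label, name)][value] += 1
    return label_counts, value_counts

def classify_nb(model, features):
    """Return the label with the highest log-probability score."""
    label_counts, value_counts = model
    total = sum(label_counts.values())
    scores = {}
    for label, n_label in label_counts.items():
        score = math.log(n_label / total)  # log prior
        for name, value in features.items():
            seen = value_counts[(label, name)][value]
            # Add-one smoothing over the two possible values (True/False).
            score += math.log((seen + 1) / (n_label + 2))
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical toy training data in the same (features, label) shape as train_set.
toy_train = [
    ({"great": True, "dull": False}, "pos"),
    ({"great": True, "dull": False}, "pos"),
    ({"great": False, "dull": True}, "neg"),
    ({"great": False, "dull": True}, "neg"),
]
model = train_nb(toy_train)
print(classify_nb(model, {"great": True, "dull": False}))  # pos
```

The "naive" part is the inner loop: each feature contributes its own term to the score, as if the words were independent of one another given the label.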

6. Making Predictions

Once the classifier is trained, we can use it to make predictions on new reviews.


# Function to classify a new review
def classify_review(review):
    features = document_features(nltk.word_tokenize(review.lower()))
    return classifier.classify(features)

# Example usage
new_review = "This movie was fantastic! The plot was intriguing and the actors were great."
print("Predicted Classification:", classify_review(new_review))

Pass in a new review and the classifier returns 'pos' or 'neg' based on what it has learned from the training data.
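
A prediction depends only on the words that are in the feature vocabulary; every other word in the review is ignored. The sketch below checks that overlap for a review. It is a rough illustration: simple_tokenize is a regex stand-in for nltk.word_tokenize, and toy_vocabulary is a made-up four-word vocabulary standing in for word_features.

```python
import re

def simple_tokenize(text):
    """Rough stand-in for nltk.word_tokenize: lowercase letters and apostrophes only."""
    return re.findall(r"[a-z']+", text.lower())

def known_words(review, vocabulary):
    """Return the vocabulary words that actually appear in the review."""
    tokens = set(simple_tokenize(review))
    return sorted(tokens & set(vocabulary))

toy_vocabulary = ["fantastic", "plot", "boring", "great"]
review = "This movie was fantastic! The plot was intriguing and the actors were great."
print(known_words(review, toy_vocabulary))  # ['fantastic', 'great', 'plot']
```

If a review shares very few words with the vocabulary, the classifier is deciding mostly on absent words, so its prediction deserves less trust.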

7. Evaluating the Classifier

To gain further insights into how well our classifier is performing, we can inspect the most informative features.


# Show the most informative features
classifier.show_most_informative_features(10)

This prints the 10 features with the highest likelihood ratios between the two labels, that is, the words whose presence or absence most strongly separates positive reviews from negative ones.
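
Accuracy and informative features do not show which kind of mistake the classifier makes. As one possible next step, the sketch below computes precision and recall for a single label from two parallel lists. The function name precision_recall and the toy labels are assumptions for illustration.

```python
from collections import Counter

def precision_recall(gold, predicted, positive="pos"):
    """Compute precision and recall for one label from parallel lists."""
    pairs = Counter(zip(gold, predicted))
    tp = pairs[(positive, positive)]
    fp = sum(n for (g, p), n in pairs.items() if p == positive and g != positive)
    fn = sum(n for (g, p), n in pairs.items() if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical gold labels and predictions for five reviews.
gold = ["pos", "pos", "neg", "neg", "pos"]
predicted = ["pos", "neg", "neg", "pos", "pos"]
precision, recall = precision_recall(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.67 recall=0.67
```

With the tutorial's objects, gold would be the labels in test_set and predicted would be the result of calling classifier.classify on each feature dictionary in test_set.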

Conclusion

By following these steps, we've built a simple yet effective text classification model using NLTK in Python. Text classification has numerous applications, and with the foundational knowledge provided in this tutorial, you can further explore and enhance your models. Remember to experiment with different classifiers and feature extraction techniques to improve the performance of your text classification tasks. Happy coding!
