Text classification is an essential task in Natural Language Processing (NLP) that involves categorizing text into predefined categories. Whether it's spam detection in emails, sentiment analysis from customer reviews, or differentiating between news articles, text classification is invaluable across various domains. In this post, we will leverage the Natural Language Toolkit (NLTK) in Python to build a simple text classification model. Let's get started!
Before diving into the code, let’s set up our environment. Make sure you have Python and NLTK installed. You can install NLTK using pip:
pip install nltk
Next, we'll need to download some necessary NLTK datasets and libraries:
import nltk

nltk.download('punkt')
nltk.download('punkt_tab')  # newer NLTK releases load this for word_tokenize
nltk.download('stopwords')
nltk.download('movie_reviews')
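This code downloads the tokenizer models, a stop word list, and the movie reviews corpus. The basic pipeline below does not use the stop word list, but it is handy if you later want to filter out very common words.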
For our example, we’ll use the movie reviews dataset included in NLTK. This dataset consists of 2,000 movie reviews categorized as positive or negative. Here’s how to load the dataset:
from nltk.corpus import movie_reviews
import random

# Load the dataset and create a list of tuples with the reviews and their labels
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

# Shuffle the documents to ensure randomness
random.shuffle(documents)

print("Sample document:", documents[0])
This will give you a list of tuples, where each tuple consists of the words in the review and its corresponding label.
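Before we can train a model, we need to convert our text data into features the classifier can work with. A common approach is the "bag of words" model, which ignores word order and looks only at which words occur. Here, we take the most frequent words in the corpus as our vocabulary and record, for each review, whether each of those words is present.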
# Get all words from the movie reviews
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())

# Select the top 2000 most frequent words as features
# (most_common sorts by count; keys() is not ordered by frequency)
word_features = [word for word, _ in all_words.most_common(2000)]

# Function to extract features from a given document
def document_features(document):
    document_words = set(document)  # set lookup is faster than scanning a list
    features = {}
    for word in word_features:
        features[word] = (word in document_words)
    return features

# Create a list of featuresets
featuresets = [(document_features(doc), category) for (doc, category) in documents]
In this code, document_features checks whether each of our top 2000 most frequent words is present in the document and builds a features dictionary from the results.
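To make the shape of the feature dictionary concrete, here is a small sketch on a made-up corpus, using only the standard library. The toy documents, the vocabulary size of 3, and the helper name presence_features are illustrative assumptions, not part of the pipeline above.

```python
from collections import Counter

# Toy corpus (hypothetical data, not from movie_reviews): (tokens, label) pairs.
toy_documents = [
    (["great", "plot", "great", "cast"], "pos"),
    (["dull", "plot", "weak", "cast"], "neg"),
    (["great", "fun"], "pos"),
]

# Count every token across the corpus, then keep the 3 most frequent as the vocabulary.
word_counts = Counter(token for tokens, _ in toy_documents for token in tokens)
toy_vocabulary = [word for word, _ in word_counts.most_common(3)]


def presence_features(tokens, vocabulary):
    """Map each vocabulary word to True/False depending on whether it occurs in tokens."""
    token_set = set(tokens)
    return {word: (word in token_set) for word in vocabulary}


toy_features = presence_features(["great", "weak", "ending"], toy_vocabulary)
print(toy_vocabulary)
print(toy_features)
```

Note that "weak" and "ending" leave no trace in the result: any word outside the vocabulary is simply dropped, and repeated words count only once.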
Now, we split our dataset into a training set and a testing set to evaluate the performance of our model.
# Split the data into training and testing sets
train_set = featuresets[:1600]
test_set = featuresets[1600:]

print("Training set size:", len(train_set))
print("Test set size:", len(test_set))
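Because random.shuffle is called without a seed earlier on, the split (and so the accuracy you measure) changes from run to run. If you want repeatable numbers, one option is to shuffle with a fixed seed and check the label balance of each part. The sketch below assumes a hypothetical helper shuffle_and_split, an arbitrary seed of 42, and placeholder data standing in for featuresets.

```python
import random
from collections import Counter


def shuffle_and_split(items, train_fraction=0.8, seed=42):
    """Return (train, test) after shuffling a copy of items with a fixed seed."""
    shuffled = list(items)  # copy, so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Stand-in for featuresets: 10 labelled placeholders (hypothetical data).
toy_featuresets = [({"id": i}, "pos" if i % 2 == 0 else "neg") for i in range(10)]

toy_train, toy_test = shuffle_and_split(toy_featuresets)
print("Train labels:", Counter(label for _, label in toy_train))
print("Test labels:", Counter(label for _, label in toy_test))
```

With 2,000 items and train_fraction=0.8, the cut lands at 1,600, matching the split used above.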
With our training set ready, we can now create and train a Naive Bayes classifier, which is suitable for text classification tasks due to its simplicity and effectiveness.
from nltk import NaiveBayesClassifier

# Train the Naive Bayes classifier
classifier = NaiveBayesClassifier.train(train_set)

# Print the classifier accuracy
accuracy = nltk.classify.accuracy(classifier, test_set)
print("Accuracy of the classifier:", accuracy)
This code trains the Naive Bayes classifier on our training data and evaluates its accuracy on the test set.
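If you are curious what a Naive Bayes classifier does with boolean features, the idea can be sketched in a few lines: each label gets a score of log P(label) plus the sum of log P(feature = value | label), and the highest score wins. This is a simplified illustration, not NLTK's implementation; the function name train_naive_bayes, the add-0.5 smoothing, and the toy examples are assumptions made for the sketch.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labelled_features, smoothing=0.5):
    """Estimate label priors and per-label feature-value counts; return scoring helpers."""
    label_counts = Counter(label for _, label in labelled_features)
    # value_counts[label][feature_name][value] -> number of training examples
    value_counts = defaultdict(lambda: defaultdict(Counter))
    for features, label in labelled_features:
        for name, value in features.items():
            value_counts[label][name][value] += 1

    def log_score(features, label):
        # log P(label) + sum over features of log P(feature = value | label)
        total = sum(label_counts.values())
        score = math.log(label_counts[label] / total)
        for name, value in features.items():
            seen = value_counts[label][name][value]
            # Features take two values (True/False), hence 2 * smoothing below.
            score += math.log((seen + smoothing) / (label_counts[label] + 2 * smoothing))
        return score

    def classify(features):
        return max(label_counts, key=lambda label: log_score(features, label))

    return classify, log_score


# Hypothetical training examples with a two-word vocabulary.
nb_examples = [
    ({"great": True, "dull": False}, "pos"),
    ({"great": True, "dull": False}, "pos"),
    ({"great": False, "dull": True}, "neg"),
    ({"great": False, "dull": True}, "neg"),
]

classify, log_score = train_naive_bayes(nb_examples)
print(classify({"great": True, "dull": False}))
```

The "naive" part is the assumption that features are independent given the label, which is what lets the per-feature terms be added up separately.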
Once the classifier is trained, we can use it to make predictions on new reviews.
# Function to classify a new review
def classify_review(review):
    features = document_features(nltk.word_tokenize(review.lower()))
    return classifier.classify(features)

# Example usage
new_review = "This movie was fantastic! The plot was intriguing and the actors were great."
print("Predicted Classification:", classify_review(new_review))
Pass in a new review and the classifier predicts whether it is positive or negative (the labels in this corpus are 'pos' and 'neg') based on what it has learned from the training data.
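One thing worth keeping in mind is that a new review only influences the prediction through tokens that appear in word_features; everything else is ignored by document_features. The sketch below shows that filtering on a made-up vocabulary. The regex tokenizer simple_tokenize is a rough stand-in for nltk.word_tokenize, and both helper names and the vocabulary are assumptions for illustration.

```python
import re


def simple_tokenize(text):
    """Lowercase and split into word and punctuation tokens (a rough stand-in for a real tokenizer)."""
    return re.findall(r"[a-z0-9']+|[^\sa-z0-9']", text.lower())


def split_by_vocabulary(tokens, vocabulary):
    """Separate tokens the feature extractor can see from those it ignores."""
    vocab = set(vocabulary)
    known = [t for t in tokens if t in vocab]
    unknown = [t for t in tokens if t not in vocab]
    return known, unknown


# Hypothetical vocabulary standing in for word_features.
toy_vocabulary = ["movie", "fantastic", "plot", "boring", "!"]

tokens = simple_tokenize("This movie was fantastic! The plot was intriguing.")
known, unknown = split_by_vocabulary(tokens, toy_vocabulary)
print("Seen by the classifier:", known)
print("Ignored:", unknown)
```

If most of a review ends up in the ignored list, the prediction rests on very little evidence, which is a hint that the vocabulary or the tokenization may need adjusting.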
To see which words the classifier relies on most when making its decisions, we can inspect the most informative features.
# Show the most informative features
classifier.show_most_informative_features(10)
This will display the top 10 features that were most informative in distinguishing positive reviews from negative ones.
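Each row of that listing pairs a feature with a ratio between the two labels, which roughly says how much more likely the feature is under one label than the other. As a back-of-the-envelope sketch, such a ratio can be computed from counts as below. The function name likelihood_ratio, the add-0.5 smoothing, and the counts are assumptions chosen for illustration, so the numbers will not match NLTK's output exactly.

```python
def likelihood_ratio(count_in_a, total_a, count_in_b, total_b, smoothing=0.5):
    """Ratio of smoothed P(feature present | A) to P(feature present | B)."""
    p_a = (count_in_a + smoothing) / (total_a + 2 * smoothing)
    p_b = (count_in_b + smoothing) / (total_b + 2 * smoothing)
    return p_a / p_b


# Hypothetical counts: a word appears in 40 of 800 positive and 4 of 800 negative reviews.
ratio = likelihood_ratio(40, 800, 4, 800)
print(f"pos : neg = {ratio:.1f} : 1.0")
```

A word that is rare overall can still rank highly this way, as long as it shows up far more often under one label than the other.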
By following these steps, we've built a simple yet effective text classification model using NLTK in Python. Text classification has numerous applications, and with the foundational knowledge provided in this tutorial, you can further explore and enhance your models. Remember to experiment with different classifiers and feature extraction techniques to improve the performance of your text classification tasks. Happy coding!