
Text Classification Using NLTK in Python

Generated by ProCodebase AI

22/11/2024

NLTK


Text classification is a core task in Natural Language Processing (NLP): assigning a piece of text to one of a set of predefined categories. Spam detection in email, sentiment analysis of customer reviews, and sorting news articles by topic are all examples. In this post, we will use the Natural Language Toolkit (NLTK) in Python to build a simple text classification model. Let's get started!

1. Setting Up Your Environment

Before diving into the code, let’s set up our environment. Make sure you have Python and NLTK installed. You can install NLTK using pip:

pip install nltk

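Next, we'll need to download the NLTK datasets and tokenizer models used in this post:

import nltk

nltk.download('punkt')          # tokenizer models used by nltk.word_tokenize
nltk.download('punkt_tab')      # newer NLTK releases look for this variant instead of 'punkt'
nltk.download('stopwords')      # common English stop words
nltk.download('movie_reviews')  # the labeled movie review corpus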

This code downloads the necessary components we’ll use in our text classification task.

2. Loading the Dataset

For our example, we’ll use the movie reviews dataset included in NLTK. This dataset consists of 2,000 movie reviews categorized as positive or negative. Here’s how to load the dataset:

from nltk.corpus import movie_reviews
import random

# Load the dataset and create a list of tuples with the reviews and their labels
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

# Shuffle the documents to ensure randomness
random.shuffle(documents)

print("Sample document:", documents[0])

This will give you a list of tuples, where each tuple consists of the words in the review and its corresponding label.
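Because random.shuffle is called without a seed, each run produces a different ordering, and therefore a different train/test split later on. Here is a minimal sketch of a reproducible shuffle. The toy_documents list is a hypothetical stand-in for the real corpus, and the seed value 42 and the shuffled_copy helper are assumptions for illustration, not part of the original walkthrough:

```python
import random

# Hypothetical stand-in for the (word_list, label) tuples built from movie_reviews.
toy_documents = [
    (["a", "great", "film"], "pos"),
    (["dull", "and", "slow"], "neg"),
    (["wonderful", "acting"], "pos"),
    (["a", "boring", "plot"], "neg"),
]

def shuffled_copy(documents, seed=42):
    """Return a shuffled copy of documents; the same seed gives the same order."""
    rng = random.Random(seed)   # private generator, leaves the global state alone
    shuffled = list(documents)  # copy so the caller's list is not modified
    rng.shuffle(shuffled)
    return shuffled

first = shuffled_copy(toy_documents)
second = shuffled_copy(toy_documents)
print(first == second)  # True: identical seed, identical ordering
```

With a fixed seed, accuracy figures from different runs become comparable, which helps when you later change features or classifiers.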

3. Feature Extraction

Before we can train a model, we need to convert our text data into features the classifier can work with. A common approach is the "bag of words" model. Here, we use word frequencies across the whole corpus to choose a vocabulary, and then record, for each review, whether each vocabulary word appears in it.

# Count how often each (lowercased) word occurs across all movie reviews
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())

# Select the 2000 most frequent words as features
# most_common(n) returns (word, count) pairs sorted by descending count
word_features = [word for word, _ in all_words.most_common(2000)]

# Function to extract features from a given document
def document_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features[word] = (word in document_words)
    return features

# Create a list of featuresets
featuresets = [(document_features(doc), category) for (doc, category) in documents]

In this code, document_features checks if each word from our top 2000 most frequent words is present in the document and creates a features dictionary.
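To make the two steps concrete (pick a vocabulary by frequency, then record word presence), here is a small self-contained sketch. It uses collections.Counter, which is the class nltk.FreqDist builds on, so most_common behaves the same way. The toy reviews, the vocabulary size of 2, and the presence_features name are assumptions chosen for illustration:

```python
from collections import Counter

# Hypothetical tokenized reviews standing in for the real corpus.
toy_reviews = [
    ["good", "plot", "good", "cast"],
    ["bad", "plot"],
    ["good", "music"],
]

# Count every word across all reviews.
counts = Counter(word for review in toy_reviews for word in review)

# Keep the 2 most frequent words as the vocabulary.
vocabulary = [word for word, _ in counts.most_common(2)]

def presence_features(document, vocabulary):
    """Map each vocabulary word to True/False depending on whether it occurs."""
    document_words = set(document)
    return {word: (word in document_words) for word in vocabulary}

print(vocabulary)                                     # ['good', 'plot']
print(presence_features(["bad", "plot"], vocabulary))  # {'good': False, 'plot': True}
```

Note that the features are booleans, not counts: a word that appears ten times in a review is treated the same as a word that appears once.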

4. Splitting the Data

Now, we split our dataset into a training set and a testing set to evaluate the performance of our model.

# Split the data into training and testing sets
train_set = featuresets[:1600]
test_set = featuresets[1600:]

print("Training set size:", len(train_set))
print("Test set size:", len(test_set))
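Slicing at a fixed index is only safe because the documents were shuffled first; otherwise one label could dominate the test set. One way to sanity-check a split is to count the labels on each side. The sketch below does that on hypothetical toy data. The split_featuresets and label_counts helpers and the 0.8 fraction are illustrative assumptions that mirror the 1,600/400 split:

```python
from collections import Counter

def split_featuresets(featuresets, train_fraction=0.8):
    """Split a list of (features, label) pairs into train and test portions."""
    cutoff = round(len(featuresets) * train_fraction)
    return featuresets[:cutoff], featuresets[cutoff:]

def label_counts(featuresets):
    """Count how many examples carry each label."""
    return Counter(label for _, label in featuresets)

# Hypothetical feature sets with alternating labels.
toy = [({"w": i % 2 == 0}, "pos" if i % 2 == 0 else "neg") for i in range(10)]

train, test = split_featuresets(toy)
print(len(train), len(test))  # 8 2
print(label_counts(train))
print(label_counts(test))
```

If the counts on either side are heavily skewed toward one label, reshuffle or split each label separately before training.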

5. Training the Classifier

With our training set ready, we can now create and train a Naive Bayes classifier, which is suitable for text classification tasks due to its simplicity and effectiveness.

from nltk import NaiveBayesClassifier

# Train the Naive Bayes classifier
classifier = NaiveBayesClassifier.train(train_set)

# Measure accuracy on the held-out test set
accuracy = nltk.classify.accuracy(classifier, test_set)
print("Accuracy of the classifier:", accuracy)

This code trains the Naive Bayes classifier on our training data and evaluates its accuracy on the test set.
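To see why Naive Bayes counts as a simple method, it helps to look at the computation behind it: for each label, combine the label's prior probability with the probability of each observed feature value given that label, and pick the label with the highest score. The sketch below is a simplified, hand-rolled version with add-one smoothing, written only for illustration. It is not NLTK's implementation, which uses its own probability estimators, and all names and the toy data are hypothetical:

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(featuresets):
    """Count labels and, per label, how often each feature takes each value."""
    label_counts = Counter(label for _, label in featuresets)
    # value_counts[label][feature_name][value] -> number of training examples
    value_counts = defaultdict(lambda: defaultdict(Counter))
    for features, label in featuresets:
        for name, value in features.items():
            value_counts[label][name][value] += 1
    return label_counts, value_counts

def classify(features, label_counts, value_counts):
    """Return the label with the highest log-probability score."""
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        score = math.log(count / total)  # log prior of the label
        for name, value in features.items():
            seen = value_counts[label][name][value]
            # Add-one smoothing over the two possible values (True/False).
            score += math.log((seen + 1) / (count + 2))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

toy_train = [
    ({"great": True, "boring": False}, "pos"),
    ({"great": True, "boring": False}, "pos"),
    ({"great": False, "boring": True}, "neg"),
    ({"great": False, "boring": True}, "neg"),
]

label_counts, value_counts = train_naive_bayes(toy_train)
print(classify({"great": True, "boring": False}, label_counts, value_counts))  # pos
print(classify({"great": False, "boring": True}, label_counts, value_counts))  # neg
```

Working in log space avoids multiplying thousands of small probabilities together, which matters once there are 2000 features per review.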

6. Making Predictions

Once the classifier is trained, we can use it to make predictions on new reviews.

# Function to classify a new review
def classify_review(review):
    features = document_features(nltk.word_tokenize(review.lower()))
    return classifier.classify(features)

# Example usage
new_review = "This movie was fantastic! The plot was intriguing and the actors were great."
print("Predicted Classification:", classify_review(new_review))

Pass a new review to classify_review, and the classifier labels it as positive ('pos') or negative ('neg') based on what it has learned from the training data.
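One consequence of this design is worth seeing directly: the classifier only looks at words in the feature vocabulary, so any other word in a new review has no effect on the prediction. The sketch below shows two different reviews producing identical features. The three-word vocabulary is a hypothetical stand-in for word_features, and simple_tokenize is a rough regex-based stand-in for nltk.word_tokenize so the example runs without any downloads:

```python
import re

# Hypothetical stand-in for the 2000-word word_features list.
vocabulary = ["fantastic", "boring", "plot"]

def simple_tokenize(text):
    """Rough stand-in for nltk.word_tokenize: lowercase alphabetic tokens only."""
    return re.findall(r"[a-z']+", text.lower())

def review_features(review):
    """Build the same kind of presence dictionary that document_features builds."""
    words = set(simple_tokenize(review))
    return {word: (word in words) for word in vocabulary}

a = review_features("The plot was FANTASTIC!")
b = review_features("Fantastic plot, truly mesmerizing cinematography.")

print(a)       # {'fantastic': True, 'boring': False, 'plot': True}
print(a == b)  # True: words outside the vocabulary are ignored
```

This is why lowercasing the input matters, and why a larger or better-chosen vocabulary can change predictions on unusual reviews.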

7. Evaluating the Classifier

To gain further insights into how well our classifier is performing, we can inspect the most informative features.

# Show the most informative features
classifier.show_most_informative_features(10)

This will display the 10 features that were most informative in distinguishing positive reviews from negative ones, each with the ratio of how much more likely it is under one label than the other.
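Overall accuracy can hide the fact that a classifier does much better on one label than the other. A common complement is per-label precision and recall. The sketch below computes them from two plain lists of labels. The helper names and the toy gold and predicted lists are assumptions for illustration:

```python
from collections import Counter

def confusion_counts(gold, predicted):
    """Count (true_label, predicted_label) pairs."""
    return Counter(zip(gold, predicted))

def precision_recall(gold, predicted, label):
    """Return (precision, recall) for one label."""
    pairs = confusion_counts(gold, predicted)
    true_pos = pairs[(label, label)]
    predicted_as_label = sum(c for (g, p), c in pairs.items() if p == label)
    actually_label = sum(c for (g, p), c in pairs.items() if g == label)
    precision = true_pos / predicted_as_label if predicted_as_label else 0.0
    recall = true_pos / actually_label if actually_label else 0.0
    return precision, recall

# Hypothetical true labels and classifier outputs.
gold = ["pos", "pos", "pos", "neg", "neg", "neg"]
predicted = ["pos", "pos", "neg", "neg", "neg", "pos"]

precision, recall = precision_recall(gold, predicted, "pos")
print(round(precision, 3), round(recall, 3))  # 0.667 0.667
```

With the trained classifier from this post, the two lists could be built as gold = [label for _, label in test_set] and predicted = [classifier.classify(features) for features, _ in test_set].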

Conclusion

By following these steps, we've built a simple yet effective text classification model using NLTK in Python. Text classification has numerous applications, and with the foundational knowledge provided in this tutorial, you can further explore and enhance your models. Remember to experiment with different classifiers and feature extraction techniques to improve the performance of your text classification tasks. Happy coding!
