Language modeling is a foundational aspect of natural language processing (NLP), a field that has surged in popularity thanks to advances in AI. Whether you're generating text, building chatbots, or analyzing sentiment, mastering language modeling will greatly enhance your NLP projects. In this blog post, we'll dive deep into advanced language modeling using NLTK (Natural Language Toolkit) in Python, giving you a hands-on approach to building practical applications.
At its core, a language model is a probabilistic model that predicts the next word in a sequence given the words that came before it. It does this by assigning probabilities to word sequences based on the patterns it has seen in the training data, so that likely continuations score higher than unlikely ones.
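To make that concrete, here is a toy sketch of how a bigram model scores a short phrase by chaining conditional probabilities. The numbers are purely illustrative placeholders, not estimated from any real corpus:

# Hypothetical bigram probabilities -- illustrative numbers only
toy_probs = {
    ('the', 'cat'): 0.20,   # P(cat | the)
    ('cat', 'sat'): 0.35,   # P(sat | cat)
    ('sat', 'on'): 0.50,    # P(on | sat)
}

# P("the cat sat on") ~= P(cat | the) * P(sat | cat) * P(on | sat)
phrase_prob = toy_probs[('the', 'cat')] * toy_probs[('cat', 'sat')] * toy_probs[('sat', 'on')]
print(phrase_prob)  # 0.035

We'll build exactly this kind of model, but with probabilities estimated from real text.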
Before we get into advanced language modeling, make sure you have NLTK installed. You can install it via pip:
pip install nltk
You'll also want to download some essential resources.
import nltk

nltk.download('punkt')
nltk.download('gutenberg')
For demonstration, we'll use Shakespeare's Hamlet, included in NLTK's Gutenberg corpus, as our training data. It provides a rich set of language patterns.
from nltk.corpus import gutenberg
from nltk.tokenize import word_tokenize

# Load Shakespeare's text
shakespeare_text = gutenberg.raw('shakespeare-hamlet.txt')

# Tokenize the text into words
tokens = word_tokenize(shakespeare_text.lower())
print(tokens[:20])  # Display the first 20 tokens
Next, we build bigrams (pairs of consecutive tokens) and count how often each one occurs:

from nltk import ngrams
from collections import Counter

# Create bigrams from the token list
bigrams = list(ngrams(tokens, 2))
bigram_freq = Counter(bigrams)
print(bigram_freq.most_common(10))  # Display the 10 most common bigrams
Now that we have bigram counts, we can estimate the probability of each next word given the previous word, i.e. the conditional probability P(next word | previous word).
def bigram_probabilities(bigram_freq):
    # Count how often each word appears as the first word of a bigram
    first_word_counts = Counter()
    for (w1, w2), count in bigram_freq.items():
        first_word_counts[w1] += count

    # Conditional probability: P(w2 | w1) = count(w1, w2) / count(w1)
    bigram_prob = {}
    for (w1, w2), count in bigram_freq.items():
        bigram_prob[(w1, w2)] = count / first_word_counts[w1]
    return bigram_prob

bigram_probabilities_dict = bigram_probabilities(bigram_freq)
print(list(bigram_probabilities_dict.items())[:10])  # Display a sample of the bigram probabilities
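As a quick sanity check, the conditional probabilities of every word observed after a given first word should sum to (essentially) 1. This snippet assumes the word 'to' appears in the tokenized text, which it does for Hamlet:

# Probabilities of every word observed after 'to' should sum to ~1.0
probs_after_to = [prob for (w1, w2), prob in bigram_probabilities_dict.items() if w1 == 'to']
print(sum(probs_after_to))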
With our bigram probabilities defined, we can generate text by selecting words based on the probabilities we calculated.
import random

def generate_text(bigram_prob, start_word, length=20):
    current_word = start_word
    output = [current_word]
    for _ in range(length - 1):
        # Filter bigrams starting with the current word
        possible_next_words = {w2: prob for (w1, w2), prob in bigram_prob.items() if w1 == current_word}
        # If there are no possible next words, stop
        if not possible_next_words:
            break
        # Make a random choice based on weighted probabilities
        next_word = random.choices(list(possible_next_words.keys()),
                                   weights=list(possible_next_words.values()))[0]
        output.append(next_word)
        current_word = next_word
    return ' '.join(output)

print(generate_text(bigram_probabilities_dict, start_word='to', length=20))
While our bigram model is a good start, it struggles with words and word pairs that never appear in the training data, which would otherwise receive a probability of zero. This is where smoothing techniques come in: smoothing assigns small non-zero probabilities to unseen n-grams.
def smoothed_bigram_probabilities(bigram_freq, vocabulary_size):
    # Count how often each word appears as the first word of a bigram
    first_word_counts = Counter()
    for (w1, w2), count in bigram_freq.items():
        first_word_counts[w1] += count

    # Laplace (add-one) smoothing: P(w2 | w1) = (count(w1, w2) + 1) / (count(w1) + V)
    # An unseen bigram (w1, w2) would receive 1 / (count(w1) + V) instead of zero
    smoothed_prob = {}
    for (w1, w2), count in bigram_freq.items():
        smoothed_prob[(w1, w2)] = (count + 1) / (first_word_counts[w1] + vocabulary_size)
    return smoothed_prob

smoothed_bigram_probabilities_dict = smoothed_bigram_probabilities(bigram_freq, len(set(tokens)))
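Because the smoothed dictionary has the same (first word, next word) → probability shape as before, you can pass it straight to the generator we wrote earlier:

print(generate_text(smoothed_bigram_probabilities_dict, start_word='to', length=20))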
Language models can also be extended to tasks like sentiment analysis: by training on annotated datasets, you can predict the sentiment of a piece of text. NLTK provides some built-in sentiment analysis tools, and you can also build a custom model on top of your trained n-grams.
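For instance, here is a minimal sketch using NLTK's bundled VADER analyzer; it assumes you download the vader_lexicon resource first:

import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')  # lexicon used by VADER

sia = SentimentIntensityAnalyzer()
print(sia.polarity_scores("I love this!"))  # returns neg/neu/pos/compound scores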
To create a custom sentiment classifier, you can combine n-gram features with a machine learning classifier such as Naive Bayes:
from nltk.classify import NaiveBayesClassifier
from nltk.classify.util import accuracy

# Function to extract features from bigrams
def bigram_features(words):
    return {bigram: True for bigram in ngrams(words, 2)}

# An example of labeled data
labeled_data = [('I love this!', 'positive'), ('This is awful!', 'negative')]

# Preparing training data
training_data = [(bigram_features(word_tokenize(text.lower())), sentiment)
                 for text, sentiment in labeled_data]

# Training the classifier
classifier = NaiveBayesClassifier.train(training_data)

# Checking the accuracy
print(f'Accuracy: {accuracy(classifier, training_data)}')

# Predicting a new sample
print(classifier.classify(bigram_features(word_tokenize("This product is great!".lower()))))
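With only two training sentences, the accuracy number above is a toy illustration rather than a meaningful evaluation; in practice you would train and test on a much larger annotated dataset (NLTK's movie_reviews corpus is a common starting point).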
By combining advanced language models with machine learning techniques, you can create robust systems capable of understanding and generating human-like text.
While classic n-grams work well for many cases, they can be limited in capturing longer dependencies. Enter neural language models! Leveraging libraries like TensorFlow or PyTorch alongside NLTK opens up new horizons where recurrent neural networks (RNNs), long short-term memory networks (LSTMs), and transformers can be employed to achieve state-of-the-art results in language modeling.
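As a taste of that direction, here is a minimal, hypothetical sketch of an LSTM language model in PyTorch, reusing the NLTK tokens from earlier to build a vocabulary. The class name and hyperparameters are illustrative placeholders, and the training loop is omitted:

import torch.nn as nn

class LSTMLanguageModel(nn.Module):
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)   # (batch, seq_len, embed_dim)
        lstm_out, _ = self.lstm(embedded)      # (batch, seq_len, hidden_dim)
        return self.fc(lstm_out)               # logits over the vocabulary

# Map the NLTK tokens from earlier to integer ids
vocab = {word: idx for idx, word in enumerate(set(tokens))}
model = LSTMLanguageModel(vocab_size=len(vocab))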
For Python enthusiasts passionate about NLP, integrating NLTK with deep learning libraries can take your projects to the next level. Be sure to explore these paths as you venture deeper into the world of language modeling!