The Transformer architecture has revolutionized natural language processing (NLP) and become the foundation for many state-of-the-art models. In this post, we'll break down the key components of Transformers and see how they're implemented in Python.
At its core, a Transformer is designed to process sequential data, such as text. Unlike traditional recurrent neural networks (RNNs), Transformers use a mechanism called "attention" to weigh the importance of different parts of the input sequence when producing an output.
The architecture consists of an encoder and a decoder, each made up of several identical layers. Let's dive into the main components:
Self-attention allows the model to consider the relationships between different words in a sentence. It's called "self" attention because it relates different positions of a single sequence to compute a representation of the same sequence.
Here's a simplified Python implementation of self-attention:
import numpy as np

def self_attention(query, key, value):
    # Compute scaled attention scores
    scores = np.dot(query, key.T) / np.sqrt(key.shape[1])
    # Apply softmax to get attention weights
    weights = np.exp(scores) / np.sum(np.exp(scores), axis=1, keepdims=True)
    # Compute the weighted sum of values
    output = np.dot(weights, value)
    return output

# Example usage
query = np.random.randn(1, 64)        # 1 word, 64-dimensional embedding
key = value = np.random.randn(5, 64)  # 5 words, 64-dimensional embeddings
result = self_attention(query, key, value)
print(result.shape)  # Output: (1, 64)
This example shows how a single word (query) attends to a sequence of words (key/value) to produce an output representation.
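In a full Transformer layer, every position attends to every other position, so the query, key, and value all come from the same sequence (each is additionally passed through its own learned linear projection, which this simplified sketch omits). Reusing the same function with made-up shapes:

# Full self-attention: the whole sequence attends to itself
sequence = np.random.randn(5, 64)  # 5 words, 64-dimensional embeddings
result = self_attention(sequence, sequence, sequence)
print(result.shape)  # Output: (5, 64), one updated vector per word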
Multi-head attention extends the idea of self-attention by allowing the model to jointly attend to information from different representation subspaces. It's like having multiple "attention mechanisms" working in parallel.
Here's a basic implementation:
def multi_head_attention(query, key, value, num_heads=8):
    # The embedding dimension must be divisible by num_heads
    # Split the embeddings into multiple heads
    query_heads = np.split(query, num_heads, axis=1)
    key_heads = np.split(key, num_heads, axis=1)
    value_heads = np.split(value, num_heads, axis=1)
    # Apply self-attention to each head independently
    head_outputs = [self_attention(q, k, v)
                    for q, k, v in zip(query_heads, key_heads, value_heads)]
    # Concatenate the head outputs back into a single matrix
    return np.concatenate(head_outputs, axis=1)

# Example usage
query = np.random.randn(1, 512)        # 1 word, 512-dimensional embedding
key = value = np.random.randn(5, 512)  # 5 words, 512-dimensional embeddings
result = multi_head_attention(query, key, value)
print(result.shape)  # Output: (1, 512)
Since Transformers process all words in parallel, they need a way to understand the order of words in a sequence. This is where positional encoding comes in. It adds position-dependent signals to the input embeddings.
Here's a simple implementation of sinusoidal positional encoding:
def positional_encoding(seq_len, d_model):
    positions = np.arange(seq_len)[:, np.newaxis]  # (seq_len, 1)
    dims = np.arange(d_model)[np.newaxis, :]       # (1, d_model)
    # Each pair of dimensions (2i, 2i + 1) shares the frequency 1 / 10000^(2i / d_model)
    angle_rates = np.power(10000, -(2 * (dims // 2)) / d_model)
    angles = positions * angle_rates               # (seq_len, d_model)
    encodings = np.zeros((seq_len, d_model))
    encodings[:, 0::2] = np.sin(angles[:, 0::2])   # sine on even dimensions
    encodings[:, 1::2] = np.cos(angles[:, 1::2])   # cosine on odd dimensions
    return encodings

# Example usage
seq_len, d_model = 10, 512
pos_encodings = positional_encoding(seq_len, d_model)
print(pos_encodings.shape)  # Output: (10, 512)
These positional encodings are added to the input embeddings before they're fed into the Transformer layers.
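For instance, continuing with the function above (and a made-up token_embeddings array standing in for the output of an embedding lookup), the addition is element-wise:

# Hypothetical token embeddings for a 10-word sentence
token_embeddings = np.random.randn(10, 512)
pos_encodings = positional_encoding(10, 512)

# The Transformer input is simply the sum of the two
transformer_input = token_embeddings + pos_encodings
print(transformer_input.shape)  # Output: (10, 512)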
In practice, these components are combined into encoder and decoder layers, which are then stacked to form the complete Transformer architecture; a rough sketch of a single encoder layer is shown below.
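The sketch only reuses the simplified functions above: it leaves out the learned projection matrices, dropout, and masking of a real implementation, and stands in a bare mean/variance normalization for proper layer normalization.

def feed_forward(x, hidden_dim=2048):
    # Position-wise feed-forward network with random (untrained) weights,
    # purely to illustrate the shape of the computation
    w1 = np.random.randn(x.shape[1], hidden_dim) * 0.01
    w2 = np.random.randn(hidden_dim, x.shape[1]) * 0.01
    return np.maximum(0, x @ w1) @ w2  # ReLU between the two projections

def layer_norm(x, eps=1e-6):
    # Simplified layer normalization over the embedding dimension
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def encoder_layer(x, num_heads=8):
    # Multi-head self-attention sub-layer with a residual connection
    x = layer_norm(x + multi_head_attention(x, x, x, num_heads))
    # Feed-forward sub-layer, again with a residual connection
    x = layer_norm(x + feed_forward(x))
    return x

# Example usage: pass a 5-word sequence through one encoder layer
x = np.random.randn(5, 512)
print(encoder_layer(x).shape)  # Output: (5, 512)

You would rarely write these layers by hand, though. The Hugging Face Transformers library provides high-level abstractions for working with Transformer models: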
from transformers import AutoModel, AutoTokenizer

# Load a pre-trained BERT model and its tokenizer
model_name = "bert-base-uncased"
model = AutoModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Tokenize input text
text = "Understanding Transformers is fascinating!"
inputs = tokenizer(text, return_tensors="pt")

# Get model outputs
outputs = model(**inputs)

# Access the last hidden states
last_hidden_states = outputs.last_hidden_state
print(last_hidden_states.shape)  # (batch_size, sequence_length, hidden_size)
This example demonstrates how to use a pre-trained BERT model (which is based on the Transformer architecture) to process text input.
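From here, one common next step (not part of the snippet above, and just one of several pooling strategies) is to collapse the per-token vectors into a single sentence embedding, for example by mean pooling over the non-padding tokens:

# Mean pooling: average the token vectors, ignoring padded positions
mask = inputs["attention_mask"].unsqueeze(-1).float()  # (1, seq_len, 1)
sentence_embedding = (last_hidden_states * mask).sum(dim=1) / mask.sum(dim=1)
print(sentence_embedding.shape)  # torch.Size([1, 768]) for bert-base-uncased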
By understanding these core components of Transformers, you'll be better equipped to work with and fine-tune models for various NLP tasks using the Hugging Face Transformers library. As you continue exploring, you'll discover the flexibility and power that Transformers bring to modern NLP applications.