The Transformer architecture has revolutionized natural language processing (NLP) and become the foundation for many state-of-the-art models. In this post, we'll break down the key components of Transformers and see how they're implemented in Python.
At its core, a Transformer is designed to process sequential data, such as text. Unlike traditional recurrent neural networks (RNNs), Transformers use a mechanism called "attention" to weigh the importance of different parts of the input sequence when producing an output.
The architecture consists of an encoder and a decoder, each made up of several identical layers. Let's dive into the main components:
Self-attention allows the model to consider the relationships between different words in a sentence. It's called "self" attention because it relates different positions of a single sequence to compute a representation of the same sequence.
Here's a simplified Python implementation of self-attention:
import numpy as np

def self_attention(query, key, value):
    # Compute scaled attention scores
    scores = np.dot(query, key.T) / np.sqrt(key.shape[1])
    # Apply softmax to get attention weights
    weights = np.exp(scores) / np.sum(np.exp(scores), axis=1, keepdims=True)
    # Compute the weighted sum of values
    output = np.dot(weights, value)
    return output

# Example usage
query = np.random.randn(1, 64)        # 1 word, 64-dimensional embedding
key = value = np.random.randn(5, 64)  # 5 words, 64-dimensional embeddings
result = self_attention(query, key, value)
print(result.shape)  # Output: (1, 64)
This example shows how a single word (query) attends to a sequence of words (key/value) to produce an output representation.
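In a full Transformer layer, every position attends to every other position, so the query, key, and value all come from the same sequence (each is additionally passed through its own learned linear projection, which this simplified sketch omits). Reusing the same function with made-up shapes:

# Full self-attention: the whole sequence attends to itself
sequence = np.random.randn(5, 64)  # 5 words, 64-dimensional embeddings
result = self_attention(sequence, sequence, sequence)
print(result.shape)  # Output: (5, 64), one updated vector per word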
Multi-head attention extends the idea of self-attention by allowing the model to jointly attend to information from different representation subspaces. It's like having multiple "attention mechanisms" working in parallel.
Here's a basic implementation:
def multi_head_attention(query, key, value, num_heads=8):
    # The embedding dimension must be divisible by num_heads
    # Split the embeddings into multiple heads
    query_heads = np.split(query, num_heads, axis=1)
    key_heads = np.split(key, num_heads, axis=1)
    value_heads = np.split(value, num_heads, axis=1)
    # Apply self-attention to each head independently
    head_outputs = [self_attention(q, k, v)
                    for q, k, v in zip(query_heads, key_heads, value_heads)]
    # Concatenate the head outputs back into a single matrix
    return np.concatenate(head_outputs, axis=1)

# Example usage
query = np.random.randn(1, 512)        # 1 word, 512-dimensional embedding
key = value = np.random.randn(5, 512)  # 5 words, 512-dimensional embeddings
result = multi_head_attention(query, key, value)
print(result.shape)  # Output: (1, 512)
Since Transformers process all words in parallel, they need a way to understand the order of words in a sequence. This is where positional encoding comes in. It adds position-dependent signals to the input embeddings.
Here's a simple implementation of sinusoidal positional encoding:
def positional_encoding(seq_len, d_model):
    positions = np.arange(seq_len)[:, np.newaxis]  # (seq_len, 1)
    dims = np.arange(d_model)[np.newaxis, :]       # (1, d_model)
    # Each pair of dimensions (2i, 2i + 1) shares the frequency 1 / 10000^(2i / d_model)
    angle_rates = np.power(10000, -(2 * (dims // 2)) / d_model)
    angles = positions * angle_rates               # (seq_len, d_model)
    encodings = np.zeros((seq_len, d_model))
    encodings[:, 0::2] = np.sin(angles[:, 0::2])   # sine on even dimensions
    encodings[:, 1::2] = np.cos(angles[:, 1::2])   # cosine on odd dimensions
    return encodings

# Example usage
seq_len, d_model = 10, 512
pos_encodings = positional_encoding(seq_len, d_model)
print(pos_encodings.shape)  # Output: (10, 512)
These positional encodings are added to the input embeddings before they're fed into the Transformer layers.
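For instance, continuing with the function above (and a made-up token_embeddings array standing in for the output of an embedding lookup), the addition is element-wise:

# Hypothetical token embeddings for a 10-word sentence
token_embeddings = np.random.randn(10, 512)
pos_encodings = positional_encoding(10, 512)

# The Transformer input is simply the sum of the two
transformer_input = token_embeddings + pos_encodings
print(transformer_input.shape)  # Output: (10, 512)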
In practice, these components are combined into encoder and decoder layers, which are then stacked to form the complete Transformer architecture; a rough sketch of a single encoder layer is shown below.
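The sketch only reuses the simplified functions above: it leaves out the learned projection matrices, dropout, and masking of a real implementation, and stands in a bare mean/variance normalization for proper layer normalization.

def feed_forward(x, hidden_dim=2048):
    # Position-wise feed-forward network with random (untrained) weights,
    # purely to illustrate the shape of the computation
    w1 = np.random.randn(x.shape[1], hidden_dim) * 0.01
    w2 = np.random.randn(hidden_dim, x.shape[1]) * 0.01
    return np.maximum(0, x @ w1) @ w2  # ReLU between the two projections

def layer_norm(x, eps=1e-6):
    # Simplified layer normalization over the embedding dimension
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def encoder_layer(x, num_heads=8):
    # Multi-head self-attention sub-layer with a residual connection
    x = layer_norm(x + multi_head_attention(x, x, x, num_heads))
    # Feed-forward sub-layer, again with a residual connection
    x = layer_norm(x + feed_forward(x))
    return x

# Example usage: pass a 5-word sequence through one encoder layer
x = np.random.randn(5, 512)
print(encoder_layer(x).shape)  # Output: (5, 512)

You would rarely write these layers by hand, though. The Hugging Face Transformers library provides high-level abstractions for working with Transformer models: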
from transformers import AutoModel, AutoTokenizer

# Load a pre-trained BERT model and its tokenizer
model_name = "bert-base-uncased"
model = AutoModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Tokenize input text
text = "Understanding Transformers is fascinating!"
inputs = tokenizer(text, return_tensors="pt")

# Get model outputs
outputs = model(**inputs)

# Access the last hidden states
last_hidden_states = outputs.last_hidden_state
print(last_hidden_states.shape)  # (batch_size, sequence_length, hidden_size)
This example demonstrates how to use a pre-trained BERT model (which is based on the Transformer architecture) to process text input.
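From here, one common next step (not part of the snippet above, and just one of several pooling strategies) is to collapse the per-token vectors into a single sentence embedding, for example by mean pooling over the non-padding tokens:

# Mean pooling: average the token vectors, ignoring padded positions
mask = inputs["attention_mask"].unsqueeze(-1).float()  # (1, seq_len, 1)
sentence_embedding = (last_hidden_states * mask).sum(dim=1) / mask.sum(dim=1)
print(sentence_embedding.shape)  # torch.Size([1, 768]) for bert-base-uncased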
By understanding these core components of Transformers, you'll be better equipped to work with and fine-tune models for various NLP tasks using the Hugging Face Transformers library. As you continue exploring, you'll discover the flexibility and power that Transformers bring to modern NLP applications.