"Attention is All You Need" is a seminal paper in the field of natural language processing (NLP) and machine learning, published by Vaswani et al. in 2017. The paper introduces the Transformer model, which relies entirely on a mechanism called "self-attention" to draw global dependencies between input and output. This approach eliminates the need for recurrent and convolutional networks, which were commonly used in sequence modeling tasks.
Key points from the paper:
- Self-Attention Mechanism: The model uses self-attention to weigh the importance of different words in a sentence when encoding a given word. This allows the model to capture long-range dependencies more effectively.
- Positional Encoding: Since the Transformer model does not use recurrent layers, it incorporates positional encodings to retain the order of words in a sequence.
- Scalability: The Transformer model can be scaled up significantly due to its parallelizable architecture, making it efficient to train on large datasets.
- State-of-the-Art Performance: At the time of its publication, the Transformer model achieved state-of-the-art results on several NLP tasks, including translation and text generation.
This paper has had a profound impact on the development of NLP models and has led to further advancements such as the BERT and GPT series.
The very first step in the Transformer architecture is embedding the input sequence.
Embedding is the process of converting discrete tokens (such as words, sub-words, or characters) into continuous, dense vector representations that the model can process. These vectors are called embeddings.
Why Do We Need Embeddings?
Handling Discrete Data: Natural language consists of discrete tokens (e.g., words), which are difficult for neural networks to process directly. Embeddings transform these tokens into a numerical form that captures their semantic meaning.
Dimensionality Reduction: Instead of representing a word as a sparse one-hot vector (which would be very large if the vocabulary is large), embeddings represent each word as a dense vector of fixed size d_model.
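To make the dimensionality point concrete, here is a minimal sketch; the 10,000-word vocabulary, the token index, and the random embedding matrix are hypothetical, with the paper's embedding size of 512:

```python
import numpy as np

# Hypothetical numbers for illustration: a 10,000-word vocabulary
# and the paper's embedding size of 512.
vocab_size = 10_000
d_model = 512

# One-hot representation: a sparse vector as long as the vocabulary,
# with a single 1 at the token's index.
token_id = 42
one_hot = np.zeros(vocab_size)
one_hot[token_id] = 1.0

# Dense embedding: a fixed-size vector regardless of vocabulary size.
embedding_matrix = np.random.randn(vocab_size, d_model)
dense = embedding_matrix[token_id]

print(one_hot.shape)  # (10000,)
print(dense.shape)    # (512,)
```

Note how the one-hot vector grows with the vocabulary, while the dense embedding stays at d_model no matter how many words the vocabulary contains.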
The Transformer uses an embedding matrix to map each token to its corresponding vector. This matrix has dimensions V × d_model.
[V is the vocabulary size and d_model is the dimension of the embedding space.]
The paper does not use pre-trained embeddings like GloVe or Word2Vec. Instead, it uses learned embeddings.
Learned During Training:
• The embeddings are learned from scratch during the training process of the Transformer model.
• The embedding matrix is initialized randomly, and its values are updated as part of the model’s overall training, through backpropagation.
Task-Specific:
• Since the embeddings are learned alongside the other parameters of the model, they are optimized for the specific task that the Transformer is being trained on (e.g., machine translation, text generation).
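A minimal sketch of the lookup-and-update cycle, using NumPy and toy sizes; the dummy gradient of ones stands in for the one that backpropagating a real task loss would supply:

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, d_model = 100, 8                 # toy sizes for illustration
E = rng.normal(size=(vocab_size, d_model))   # randomly initialized embedding matrix
E_before = E.copy()

token_ids = np.array([5, 17, 5])             # a toy input sequence
vectors = E[token_ids]                       # embedding lookup, shape (3, 8)

# One simplified gradient step: in the real model the gradient comes from
# backpropagating the task loss; here a dummy gradient of ones stands in.
lr = 0.1
grad = np.ones_like(vectors)
np.add.at(E, token_ids, -lr * grad)          # only the looked-up rows change
```

Only the rows of the embedding matrix that were actually looked up receive an update, which is why `np.add.at` (an unbuffered indexed update) is used rather than plain fancy-index assignment: token 5 appears twice in the sequence and must accumulate two updates.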
Positional Encoding: Adding Positional Information
After embedding, the Transformer adds positional encodings to the embeddings. This is crucial because, unlike RNNs or CNNs, the Transformer processes all tokens in the sequence simultaneously, and it needs a way to represent the order of the tokens.
Positional Encoding adds information about the position of each token in the sequence to its embedding vector. This is done by adding a positional encoding vector PE_pos to each embedding vector E_token.
• The resulting input to the model is:

Input_pos = E_token + PE_pos

where Input_pos is the position-aware embedding of the token at position pos.
Calculating PE:

The positional encodings are computed with sine and cosine functions of different frequencies:

PE(pos, 2i) = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))

where pos is the token's position in the sequence and i indexes pairs of embedding dimensions.

How It Works:
- Embedding Vector:
• Each token in the input sequence is initially mapped to an embedding vector of size d_model.
• For example, if d_model = 512, each token’s embedding is a 512-dimensional vector.
- Positional Encoding:
• The positional encoding for each token’s position in the sequence is also a vector of size d_model.
• This vector is calculated using the sine and cosine functions described above.
- Addition of Embedding and Positional Encoding:
• The positional encoding vector is added element-wise to the embedding vector for each token.
• Since both the embedding and the positional encoding vectors have the same size, this addition is straightforward.
- Resulting Vector:
• After adding the positional encoding to the embedding, the resulting vector for each token still has a size of d_model.
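The steps above can be sketched in NumPy; the sinusoidal formulas follow the paper, while the sequence length, the random stand-in embeddings, and the function name are illustrative:

```python
import numpy as np

def positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Sinusoidal positional encodings, as described in the paper."""
    pos = np.arange(seq_len)[:, None]                # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]             # (1, d_model/2)
    angles = pos / np.power(10000, 2 * i / d_model)  # (seq_len, d_model/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions: cosine
    return pe

seq_len, d_model = 10, 512
embeddings = np.random.randn(seq_len, d_model)   # stand-in token embeddings
pe = positional_encoding(seq_len, d_model)

# Element-wise addition: both operands have shape (seq_len, d_model),
# so the result keeps size d_model per token.
inputs = embeddings + pe
print(inputs.shape)  # (10, 512)
```

At position 0 the encoding is sin(0) = 0 in the even dimensions and cos(0) = 1 in the odd ones, which is a quick sanity check on the implementation.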
Summary :
Dimension of the Output for Each Token: d_model. Purpose: This dimension ensures that the positional information is integrated with the token’s semantic embedding, allowing the Transformer to process the sequence with both positional and semantic information.
So, after the positional encoding is added, each token in the sequence is represented by a d_model-dimensional vector, which is then used as input to the subsequent layers of the Transformer (such as the multi-head attention and feed-forward networks).