

Embeddings in Transformer architecture

Written by Nidhi Singh

10/09/2024

NLP


"Attention is All You Need" is a seminal paper in the field of natural language processing (NLP) and machine learning, published by Vaswani et al. in 2017. The paper introduces the Transformer model, which relies entirely on a mechanism called "self-attention" to draw global dependencies between input and output. This approach eliminates the need for recurrent and convolutional networks, which were commonly used in sequence modeling tasks.

Key points from the paper:

  1. Self-Attention Mechanism: The model uses self-attention to weigh the importance of different words in a sentence when encoding a given word. This allows the model to capture long-range dependencies more effectively.

  2. Positional Encoding: Since the Transformer model does not use recurrent layers, it incorporates positional encodings to retain the order of words in a sequence.

  3. Scalability: The Transformer model can be scaled up significantly due to its parallelizable architecture, making it efficient to train on large datasets.

  4. State-of-the-Art Performance: At the time of its publication, the Transformer model achieved state-of-the-art results on several NLP tasks, including translation and text generation.

This paper has had a profound impact on the development of NLP models and has led to further advancements such as the BERT and GPT series.


The very first step in the Transformer architecture is embedding the input sequence.

Embedding is the process of converting discrete tokens (such as words, sub-words, or characters) into continuous, dense vector representations that the model can process. These vectors are called embeddings.

Why Do We Need Embeddings?

Handling Discrete Data: Natural language consists of discrete tokens (e.g., words), which are difficult for neural networks to process directly. Embeddings transform these tokens into a numerical form that captures their semantic meaning.

Dimensionality Reduction: Instead of representing a word as a sparse one-hot vector (which would be very large if the vocabulary is large), embeddings represent each word as a dense vector of fixed size d_model.

The Transformer uses an embedding matrix to map each token to its corresponding vector. This matrix E has dimensions V × d_model.

(Here V is the vocabulary size and d_model is the dimension of the embedding space.)
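The lookup described above can be sketched in a few lines of NumPy. The sizes here are hypothetical toy values chosen for illustration; in the real model the matrix entries are learned, not random.

```python
import numpy as np

# Hypothetical toy sizes for illustration.
V, d_model = 10, 4                 # vocabulary size, embedding dimension
rng = np.random.default_rng(0)

# Embedding matrix E (V x d_model): initialized randomly here; in the
# actual Transformer its values are updated by backpropagation.
E = rng.normal(size=(V, d_model))

token_ids = np.array([3, 1, 7])    # a toy input sequence of token indices

# Embedding lookup: each token id selects one row of E.
embeddings = E[token_ids]          # shape (3, d_model)

# The same lookup expressed as a product with sparse one-hot vectors,
# which shows why the dense table is just a compact equivalent.
one_hot = np.eye(V)[token_ids]     # shape (3, V)
assert np.allclose(one_hot @ E, embeddings)
```

The row-indexing form is what embedding layers actually do; the one-hot product is only shown to connect the dense representation back to the sparse one.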

The paper does not use pre-trained embeddings like GloVe or Word2Vec. Instead, it uses learned embeddings.

Learned During Training:

• The embeddings are learned from scratch during the training process of the Transformer model.

• The embedding matrix is initialized randomly, and its values are updated as part of the model’s overall training, through backpropagation.

Task-Specific:

• Since the embeddings are learned alongside the other parameters of the model, they are optimized for the specific task that the Transformer is being trained on (e.g., machine translation, text generation).

Positional Encoding: Adding Positional Information

After embedding, the Transformer adds positional encodings to the embeddings. This is crucial because, unlike RNNs or CNNs, the Transformer processes all tokens in the sequence simultaneously, and it needs a way to represent the order of the tokens.

Positional encoding adds information about the position of each token in the sequence to its embedding vector. This is done by adding a positional encoding vector PE to each embedding vector e_i. The resulting input to the model is:

z_i = e_i + PE[i]

where z_i is the position-aware embedding of the token at position i.

Calculating PE:

PE(pos, 2i) = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
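The two formulas above can be implemented directly. A minimal NumPy sketch (assuming an even d_model, so sine and cosine columns pair up cleanly):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal positional encodings:
    PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    Assumes d_model is even.
    """
    pos = np.arange(max_len)[:, None]        # (max_len, 1) positions
    two_i = np.arange(0, d_model, 2)[None, :]  # even dimension indices 2i
    angle = pos / np.power(10000.0, two_i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)              # even dimensions get sine
    pe[:, 1::2] = np.cos(angle)              # odd dimensions get cosine
    return pe

pe = positional_encoding(max_len=50, d_model=512)
# At pos = 0: sin(0) = 0 in even dimensions, cos(0) = 1 in odd ones.
```

Each position gets a unique pattern of sines and cosines at geometrically spaced frequencies, which is what lets the model distinguish token order.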

How It Works:

  1. Embedding Vector:

    • Each token in the input sequence is initially mapped to an embedding vector of size d_model.

    • For example, if d_model = 512, each token's embedding is a 512-dimensional vector.

  2. Positional Encoding:

    • The positional encoding for each token's position in the sequence is also a vector of size d_model.

    • This vector is calculated using the sine and cosine functions as described earlier.

  3. Addition of Embedding and Positional Encoding:

    • The positional encoding vector is added element-wise to the embedding vector for each token.

    • Since both the embedding and the positional encoding vectors have the same size, this addition is straightforward.

  4. Resulting Vector:

    • After adding the positional encoding to the embedding, the resulting vector for each token still has size d_model.

Summary:

Dimension of the output for each token: d_model. Purpose: this ensures that positional information is integrated with the token's semantic embedding, allowing the Transformer to process the sequence with both positional and semantic information.

So, after the positional encoding is added, each token in the sequence is represented by a d_model-dimensional vector, which is then used as input to the subsequent layers of the Transformer (such as multi-head attention and the feed-forward networks).
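The four steps above can be sketched end to end. Again the sizes are hypothetical toy values; the point is that the element-wise addition leaves the shape unchanged at (sequence length, d_model).

```python
import numpy as np

# Hypothetical toy sizes for illustration.
V, d_model, seq_len = 100, 8, 5
rng = np.random.default_rng(1)

# Step 1: embedding lookup.
E = rng.normal(size=(V, d_model))            # learned embedding matrix
token_ids = rng.integers(0, V, size=seq_len)
e = E[token_ids]                             # (seq_len, d_model)

# Step 2: sinusoidal positional encodings for positions 0..seq_len-1.
pos = np.arange(seq_len)[:, None]
two_i = np.arange(0, d_model, 2)[None, :]
angle = pos / np.power(10000.0, two_i / d_model)
pe = np.zeros((seq_len, d_model))
pe[:, 0::2] = np.sin(angle)
pe[:, 1::2] = np.cos(angle)

# Steps 3-4: element-wise addition; the shape is preserved.
z = e + pe                                   # (seq_len, d_model)
assert z.shape == (seq_len, d_model)
```

The rows of z are the position-aware vectors z_i = e_i + PE[i] that feed into the first attention layer.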
