Deep Learning – IIT Ropar Week 12 Assignment Answers (Jan-Apr 2026)


1. Why does a basic encoder–decoder model struggle with long input sequences?

  • It relies on a single vector to represent the entire input
  • It updates the decoder state too frequently during decoding
  • It removes uncommon words during preprocessing
  • It predicts output tokens without using hidden states
Answer : a

2. What is the primary role of the encoder in this translation system?

  • To generate translated words step by step
  • To map the input sequence into hidden representations
  • To compute attention weights during decoding
  • To produce probability scores for output tokens
Answer : b

3. Why does the decoder generate translations sequentially?

  • Output probabilities cannot be computed in parallel
  • The encoder cannot process multiple words together
  • Attention mechanisms only work sequentially
  • Each output word depends on previous output words
Answer : d
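
The dependence on earlier outputs can be illustrated with a toy greedy decoding loop. This is a minimal sketch, not a real translation model: the `NEXT_TOKEN` table and the `greedy_decode` function are hypothetical stand-ins for a trained decoder, and the toy conditions only on the single previous token, whereas a real decoder also uses its hidden state and the encoder outputs.

```python
# Hypothetical stand-in for a trained decoder: maps the previous token to the next one.
NEXT_TOKEN = {
    "<start>": "ich",
    "ich": "bin",
    "bin": "Student",
    "Student": "<end>",
}

def greedy_decode(max_len=10):
    """Generate tokens one at a time; each step needs the token produced before it."""
    tokens = ["<start>"]
    while len(tokens) < max_len:
        next_token = NEXT_TOKEN[tokens[-1]]  # cannot be computed until tokens[-1] exists
        if next_token == "<end>":
            break
        tokens.append(next_token)
    return tokens[1:]  # drop the <start> marker

translation = greedy_decode()
print(translation)  # ['ich', 'bin', 'Student']
```

Because step t reads the output of step t-1, the loop cannot be run for all positions at once at inference time.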

4. Which component is mainly responsible for losing early input information?

  • The word embedding lookup table
  • The softmax layer used for output prediction
  • The fixed-length context vector from the encoder
  • The optimization algorithm used during training
Answer : c

5. Which architectural change best addresses this limitation?

  • Introducing an attention mechanism in decoding
  • Increasing the size of the output vocabulary
  • Adding more layers to the encoder network
  • Applying stronger regularization during training
Answer : a

6. What is the main purpose of attention in this summarization system?

  • To speed up training by skipping the encoder states
  • To reduce the length of the input article
  • To allow the decoder to focus on relevant encoder states
  • To eliminate the need for an encoder network
Answer : c

7. How are attention weights typically computed?

  • By copying the decoder hidden state directly
  • By averaging encoder hidden states equally
  • By randomly sampling encoder representations
  • By normalizing alignment scores using softmax
Answer : d
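
As a rough illustration of that normalization step, the sketch below applies softmax to a handful of alignment scores. The score values are made up for the example, and how the scores themselves are produced (dot product, additive scoring, and so on) is left out.

```python
import math

def softmax(scores):
    """Normalize raw alignment scores into attention weights."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical alignment scores between one decoder state and four encoder states.
alignment_scores = [2.0, 0.5, -1.0, 0.5]
attention_weights = softmax(alignment_scores)
print(attention_weights)
```

The resulting weights are all positive and sum to 1, and the encoder state with the largest score receives the largest weight.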

8. How is the context vector formed after attention weights are computed?

  • By computing a weighted sum of encoder states
  • By concatenating all decoder hidden states
  • By selecting only the last encoder state
  • By applying normalization to output probabilities
Answer : a
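
The weighted sum can be sketched in a few lines. The encoder states and weights below are illustrative values only, and `context_vector` is a hypothetical helper name; the weights are assumed to have already been normalized so that they sum to 1.

```python
def context_vector(weights, encoder_states):
    """Weighted sum of encoder states; weights are assumed to sum to 1."""
    dim = len(encoder_states[0])
    return [
        sum(w * state[i] for w, state in zip(weights, encoder_states))
        for i in range(dim)
    ]

# Three hypothetical 2-dimensional encoder states and their attention weights.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = [0.5, 0.25, 0.25]

context = context_vector(weights, encoder_states)
print(context)  # [0.75, 0.5]
```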

9. What does a high attention weight on a word indicate?

  • The word appears more frequently in training data
  • The word strongly influences the current output
  • The decoder has finished generating the output
  • The encoder failed to process that particular word
Answer : b

10. Why is softmax used when computing attention weights?

  • To prevent overfitting in the decoder network
  • To reduce the numerical precision during training
  • To limit the number of encoder states used
  • To convert scores into a probability distribution
Answer : d

11. Why is self-attention especially useful for long documents?

  • It automatically reduces the length of very long documents
  • It processes text strictly in a left-to-right sequential manner
  • It allows all words to attend to each other regardless of distance
  • It removes the need for word embeddings in the architecture
Answer : c

12. What is the role of Query and Key vectors in self-attention?

  • They determine how strongly one word should attend to another
  • They store the final predicted output tokens of the sequence
  • They define the size of the vocabulary used by the model
  • They replace positional encodings during sequence processing
Answer : a
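
A small sketch of how a Query is compared against Keys: the dot product of the two vectors gives a raw score, and a larger score means stronger attention once the scores are normalized. The words and the 2-dimensional vectors are invented for illustration; in a trained model they would come from learned projections of the token embeddings.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

# Hypothetical Query for the word currently being processed.
query = [1.0, 0.0]

# Hypothetical Key vectors for the words it could attend to.
keys = {
    "bank": [0.9, 0.1],
    "river": [0.8, 0.3],
    "the": [0.0, 0.2],
}

scores = {word: dot(query, key) for word, key in keys.items()}
print(scores)
```

Here "bank" scores highest and "the" lowest, so after normalization the query word would attend most strongly to "bank".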

13. Why is the dot product scaled by √d_k in attention?

  • To reduce the number of tokens processed in each attention head
  • To prevent large dot products from saturating the softmax
  • To increase the representational depth of the transformer
  • To simplify gradient computation during backpropagation
Answer : b
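
The saturation effect can be seen numerically. In this sketch the raw dot products and the key dimension d_k = 64 are assumed values chosen for illustration; dividing by √d_k shrinks the scores before softmax is applied.

```python
import math

def softmax(scores):
    """Softmax with max-subtraction for numerical stability."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

d_k = 64                       # assumed key dimension
raw_scores = [16.0, 8.0, 0.0]  # hypothetical unscaled dot products

scaled_scores = [s / math.sqrt(d_k) for s in raw_scores]  # [2.0, 1.0, 0.0]

unscaled_weights = softmax(raw_scores)
scaled_weights = softmax(scaled_scores)

print(unscaled_weights)  # almost all mass on the first entry
print(scaled_weights)    # mass spread across all three entries
```

Without scaling, the softmax output is close to one-hot, which leaves very small gradients for the other positions; with scaling, the distribution stays softer.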

14. What does it mean if a word attends strongly to several distant words?

  • The model produces unstable or random attention patterns
  • The model is overfitting to individual training examples
  • The model has entirely ignored all positional information
  • The model captures multiple meaningful contextual relationships
Answer : d

15. Why are positional encodings required in transformer models?

  • To remove the dependence on embedding layers for tokens
  • To provide information about the order of tokens
  • To replace attention weights during the decoding stage
  • To control the learning rate used during model training
Answer : b
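
One common way to inject order information is the sinusoidal positional encoding, sketched below. The question does not say which scheme is meant, so treating it as sinusoidal is an assumption; learned position embeddings are another option. The function name and the small `d_model` are chosen for the example.

```python
import math

def positional_encoding(position, d_model):
    """Sinusoidal encoding: sine on even dimensions, cosine on odd dimensions."""
    encoding = []
    for i in range(0, d_model, 2):
        angle = position / (10000 ** (i / d_model))
        encoding.append(math.sin(angle))
        encoding.append(math.cos(angle))
    return encoding[:d_model]  # trim in case d_model is odd

# Each position gets a distinct vector that is added to the token embedding.
for pos in range(3):
    print(pos, positional_encoding(pos, d_model=4))
```

Since self-attention by itself treats its input as an unordered set, adding a distinct vector per position is what lets the model tell "dog bites man" from "man bites dog".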