Deep Learning – IIT Ropar Week 12 Assignment Answers (Jan-Apr 2026)
1. Why does a basic encoder–decoder model struggle with long input sequences?
- It relies on a single vector to represent the entire input
- It updates the decoder state too frequently during decoding
- It removes uncommon words during preprocessing
- It predicts output tokens without using hidden states
Answer : a
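The bottleneck can be illustrated with a minimal sketch. It assumes a plain tanh recurrent encoder with random weights; the names `encode`, `W_x`, and `W_h` are hypothetical and not taken from the course material. Whatever the input length, the encoder hands the decoder a vector of the same size.

```python
import numpy as np

def encode(inputs, W_x, W_h):
    """Run a toy tanh RNN over the inputs and return only the final hidden state."""
    h = np.zeros(W_h.shape[0])
    for x in inputs:
        # Each step overwrites h, so early inputs survive only indirectly.
        h = np.tanh(W_x @ x + W_h @ h)
    return h

rng = np.random.default_rng(0)
hidden_size, input_size = 4, 3
W_x = 0.1 * rng.standard_normal((hidden_size, input_size))  # assumed random weights
W_h = 0.1 * rng.standard_normal((hidden_size, hidden_size))

short_summary = encode(rng.standard_normal((5, input_size)), W_x, W_h)
long_summary = encode(rng.standard_normal((500, input_size)), W_x, W_h)
print(short_summary.shape, long_summary.shape)  # (4,) (4,)
```

A 5-token input and a 500-token input are both compressed into 4 numbers here, which is why longer inputs lose more detail.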
2. What is the primary role of the encoder in this translation system?
- To generate translated words step by step
- To map the input sequence into hidden representations
- To compute attention weights during decoding
- To produce probability scores for output tokens
Answer : b
3. Why does the decoder generate translations sequentially?
- Output probabilities cannot be computed in parallel
- The encoder cannot process multiple words together
- Attention mechanisms only work sequentially
- Each output word depends on previous output words
Answer : d
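A rough sketch of sequential (autoregressive) decoding follows, using greedy selection as an assumed decoding strategy. `greedy_decode` and `toy_step` are hypothetical names; `toy_step` stands in for a trained decoder and simply prefers the token id that follows the last one.

```python
import numpy as np

def greedy_decode(step_fn, start_id, end_id, max_len=10):
    """Generate tokens one at a time; every step sees all previous outputs."""
    tokens = [start_id]
    for _ in range(max_len):
        scores = step_fn(tokens)           # scores depend on the prefix so far
        next_id = int(np.argmax(scores))   # greedy choice (an assumption)
        tokens.append(next_id)
        if next_id == end_id:
            break
    return tokens

def toy_step(prefix):
    """Hypothetical stand-in for a decoder over a 5-token vocabulary."""
    scores = np.zeros(5)
    scores[(prefix[-1] + 1) % 5] = 1.0
    return scores

print(greedy_decode(toy_step, start_id=0, end_id=4))  # [0, 1, 2, 3, 4]
```

Because step t needs the token chosen at step t-1, the loop cannot be parallelized across output positions at generation time.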
4. Which component is mainly responsible for losing early input information?
- The word embedding lookup table
- The softmax layer used for output prediction
- The fixed-length context vector from the encoder
- The optimization algorithm used during training
Answer : c
5. Which architectural change best addresses this limitation?
- Introducing an attention mechanism in decoding
- Increasing the size of the output vocabulary
- Adding more layers to the encoder network
- Applying stronger regularization during training
Answer : a
6. What is the main purpose of attention in this summarization system?
- To speed up training by skipping the encoder states
- To reduce the length of the input article
- To allow the decoder to focus on relevant encoder states
- To eliminate the need for an encoder network
Answer : c
7. How are attention weights typically computed?
- By copying the decoder hidden state directly
- By averaging encoder hidden states equally
- By randomly sampling encoder representations
- By normalizing alignment scores using softmax
Answer : d
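As a small illustration, attention weights can be computed as below. The sketch assumes a dot-product alignment score; other scoring functions (for example, additive scoring) are also used in practice. The function names are hypothetical.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_weights(decoder_state, encoder_states):
    """Score each encoder state against the decoder state, then normalize."""
    scores = encoder_states @ decoder_state   # dot-product alignment (assumed)
    return softmax(scores)

decoder_state = np.array([1.0, 0.0])
encoder_states = np.array([[2.0, 0.0],
                           [0.0, 2.0],
                           [-1.0, 0.0]])
weights = attention_weights(decoder_state, encoder_states)
print(weights, weights.sum())
```

The weights are all positive and sum to 1, and the encoder state most aligned with the decoder state receives the largest weight.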
8. How is the context vector formed after attention weights are computed?
- By computing a weighted sum of encoder states
- By concatenating all decoder hidden states
- By selecting only the last encoder state
- By applying normalization to output probabilities
Answer : a
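The weighted sum can be sketched in a couple of lines. The weights and encoder states below are made-up values chosen so the result is easy to check by hand; `context_vector` is a hypothetical helper name.

```python
import numpy as np

def context_vector(weights, encoder_states):
    """Weighted sum of encoder states: sum_i weights[i] * encoder_states[i]."""
    return weights @ encoder_states

weights = np.array([0.5, 0.5, 0.0])          # illustrative attention weights
encoder_states = np.array([[2.0, 0.0],
                           [0.0, 2.0],
                           [4.0, 4.0]])
print(context_vector(weights, encoder_states))  # -> [1. 1.]
```

The third encoder state has weight 0, so it contributes nothing to the context vector for this decoding step.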
9. What does a high attention weight on a word indicate?
- The word appears more frequently in training data
- The word strongly influences the current output
- The decoder has finished generating the output
- The encoder failed to process that particular word
Answer : b
10. Why is softmax used when computing attention weights?
- To prevent overfitting in the decoder network
- To reduce numerical precision during training
- To limit the number of encoder states used
- To convert scores into a probability distribution
Answer : d
11. Why is self-attention especially useful for long documents?
- It automatically reduces the length of very long documents
- It processes text strictly in a left-to-right sequential manner
- It allows all words to attend to each other regardless of distance
- It removes the need for word embeddings in the architecture
Answer : c
12. What is the role of Query and Key vectors in self-attention?
- They determine how strongly one word should attend to another
- They store the final predicted output tokens of the sequence
- They define the size of the vocabulary used by the model
- They replace positional encodings during sequence processing
Answer : a
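A minimal sketch of how Query and Key vectors produce attention strengths in self-attention is given below. The projection matrices `W_q` and `W_k` are random placeholders for learned parameters, and all names are hypothetical. Entry (i, j) of the result says how strongly token i attends to token j, for every pair of positions at once.

```python
import numpy as np

def softmax_rows(x):
    """Row-wise numerically stable softmax."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention_weights(X, W_q, W_k):
    """Return the (n, n) matrix of attention weights for n token vectors."""
    Q = X @ W_q                      # one query per token
    K = X @ W_k                      # one key per token
    d_k = Q.shape[-1]
    return softmax_rows(Q @ K.T / np.sqrt(d_k))

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))      # 4 tokens, embedding size 8 (illustrative)
W_q = rng.standard_normal((8, 3))    # assumed random stand-ins for learned weights
W_k = rng.standard_normal((8, 3))
A = self_attention_weights(X, W_q, W_k)
print(A.shape)  # (4, 4)
```

Each row of `A` sums to 1, and no entry depends on how far apart the two tokens are in the sequence.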
13. Why is the dot product scaled by √d_k in attention?
- To reduce the number of tokens processed in each attention head
- To prevent large dot products from saturating the softmax
- To increase the representational depth of the transformer
- To simplify gradient computation during backpropagation
Answer : b
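The effect of the scaling can be checked numerically. The sketch below assumes query and key components drawn from a standard normal distribution, in which case the raw dot product has variance d_k; the specific sizes are illustrative.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

rng = np.random.default_rng(0)
d_k = 512
q = rng.standard_normal(d_k)          # one query vector
K = rng.standard_normal((6, d_k))     # six key vectors

raw_scores = K @ q                    # magnitudes grow roughly like sqrt(d_k)
unscaled = softmax(raw_scores)
scaled = softmax(raw_scores / np.sqrt(d_k))

# Compare how peaked the two distributions are.
print(unscaled.max(), scaled.max())
```

Dividing by √d_k can only flatten the distribution, never sharpen it, so the scaled weights stay away from the near one-hot regime where softmax gradients become very small.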
14. What does it mean if a word attends strongly to several distant words?
- The model produces unstable or random attention patterns
- The model is overfitting to individual training examples
- The model has entirely ignored all positional information
- The model captures multiple meaningful contextual relationships
Answer : d
15. Why are positional encodings required in transformer models?
- To remove the dependence on embedding layers for tokens
- To provide information about the order of tokens
- To replace attention weights during the decoding stage
- To control the learning rate used during model training
Answer : b
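One common scheme, sketched here as an assumption since the question does not name a specific one, is the fixed sinusoidal encoding from the original Transformer; learned position embeddings are an alternative. The function name is hypothetical and the sketch assumes an even `d_model`.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Return a (seq_len, d_model) matrix of sinusoidal position encodings.

    Assumes d_model is even.
    """
    positions = np.arange(seq_len)[:, None]        # shape (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]       # even dimension indices
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even dimensions
    pe[:, 1::2] = np.cos(angles)                   # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(seq_len=5, d_model=8)
print(pe.shape)  # (5, 8)
```

Each position gets a distinct vector that is added to the token embedding, so two occurrences of the same word at different positions no longer look identical to the attention layers.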