Deep Learning – IIT Ropar Week 11 Assignment Answers (Jan-Apr 2026)


1. Why is an RNN more suitable than a feed-forward neural network for this task?

  • Because it trains faster on large text datasets
  • Because it removes the need for labeled data
  • Because it handles sequences with an internal state
  • Because it avoids weight sharing across layers
Answer : c
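
As a rough illustration of option (c), the sketch below carries a single scalar hidden state through a sequence. The weights `w_h`, `w_x` and the helper names `rnn_step` and `run_rnn` are assumptions made for this example, not part of the assignment.

```python
import math

def rnn_step(h_prev, x, w_h=0.5, w_x=1.0, b=0.0):
    """One vanilla RNN update: h_t = tanh(w_h * h_prev + w_x * x + b)."""
    return math.tanh(w_h * h_prev + w_x * x + b)

def run_rnn(inputs, h0=0.0):
    """Feed a sequence through the cell, carrying the hidden state forward."""
    h = h0
    for x in inputs:
        h = rnn_step(h, x)  # the same weights are reused at every time step
    return h

# The same three values in a different order give a different final state,
# because each update depends on what the hidden state already holds.
forward = run_rnn([1.0, 0.0, -1.0])
backward = run_rnn([-1.0, 0.0, 1.0])
print(forward, backward)
```

The final hidden state summarizes the whole sequence, which is also why a single output after the last step can be enough for a many-to-one task.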

2. If the model forgets the beginning of long emails, what is the most likely reason?

  • The optimizer updates the weights too slowly
  • The output layer has too few neurons
  • The vocabulary size is not large enough
  • The hidden state changes at every time step
Answer : d

3. Which training phenomenon explains why early words have little influence on later predictions?

  • The loss function becomes flat during training
  • The model memorizes only the recent inputs
  • Gradients become smaller over many time steps
  • The learning rate decreases automatically
Answer : c
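
The shrinking gradient in option (c) can be sketched for a scalar tanh RNN. By the chain rule, the gradient of the final state with respect to the initial state is a product of one factor per time step, and each factor here is below 1. The recurrent weight `w_h = 0.9` and the constant inputs are assumptions chosen only for illustration.

```python
import math

def gradient_wrt_initial_state(inputs, w_h=0.9, h0=0.0):
    """Return d h_T / d h_0 for h_t = tanh(w_h * h_{t-1} + x_t)."""
    h = h0
    grad = 1.0
    for x in inputs:
        h = math.tanh(w_h * h + x)
        # Chain rule: d h_t / d h_{t-1} = w_h * (1 - h_t^2), at most 0.9 here.
        grad *= w_h * (1.0 - h * h)
    return grad

short_seq = gradient_wrt_initial_state([0.5] * 5)
long_seq = gradient_wrt_initial_state([0.5] * 50)
print(short_seq, long_seq)
```

With 50 steps the product is already far smaller than with 5, so the earliest inputs barely affect the weight updates.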

4. If truncated BPTT is used with a window size of 15, what limitation does this introduce?

  • The output cannot be computed past 15 steps
  • The hidden state is reset after 15 steps
  • Only the last 15 steps influence learning
  • The network stops training after 15 steps
Answer : c
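
One way to picture option (c) is to list the time steps that gradients flow through for a loss computed at a given step. The helper `steps_in_gradient_window` is hypothetical and assumes a sliding window that ends at the current step. Implementations that cut the sequence into fixed chunks behave similarly within each chunk.

```python
def steps_in_gradient_window(current_step, window=15):
    """Time steps backpropagated through for a loss at current_step."""
    first = max(0, current_step - window + 1)
    return list(range(first, current_step + 1))

# For a loss at step 40, gradients reach back only to step 26.
window_steps = steps_in_gradient_window(40, window=15)
print(window_steps[0], window_steps[-1], len(window_steps))
```

The forward pass is not affected: the hidden state is still carried across the window boundary, and only the backward pass is cut off.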

5. Which architectural change would best help this system retain important early information?

  • Applying stronger regularization
  • Reducing the number of layers
  • Using a gated recurrent model
  • Increasing the number of output units
Answer : c

6. Why is sequence modeling required for this video-based task?

  • Because different frames have different resolutions
  • Because motion depends on changes over time
  • Because videos always contain multiple objects
  • Because the images are hard to process
Answer : b

7. What does exploding gradient behavior typically cause during training?

  • Constant output values
  • Reduced model capacity
  • Very large weight updates
  • Very slow convergence
Answer : c
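
To see why option (c) follows, consider a simplified linear recurrence h_t = w_h * h_{t-1}, where the backpropagated factor over T steps is w_h raised to the power T. The recurrent weight of 1.5 and the learning rate below are assumed values for this sketch.

```python
def backprop_factor(w_h, steps):
    """Gradient of h_T with respect to h_0 for h_t = w_h * h_{t-1}."""
    factor = 1.0
    for _ in range(steps):
        factor *= w_h
    return factor

learning_rate = 0.01
grad = backprop_factor(w_h=1.5, steps=50)
update = learning_rate * grad  # size of a plain gradient-descent step
print(grad, update)
```

Even with a small learning rate, the resulting step is large enough to throw the weights far from any reasonable values.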

8. Which technique is commonly used to control exploding gradients?

  • Increasing the learning rate
  • Limiting gradient magnitude
  • Reducing the batch size
  • Removing nonlinearities
Answer : b
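
Limiting gradient magnitude is usually done by gradient clipping. The sketch below rescales a gradient vector whenever its L2 norm exceeds a threshold. The function name `clip_by_norm` and the threshold of 5.0 are assumptions for this example, and clipping each component to a fixed range is a common alternative.

```python
import math

def clip_by_norm(grads, max_norm=1.0):
    """Rescale grads so their L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)  # already small enough, leave unchanged
    scale = max_norm / norm
    return [g * scale for g in grads]

# A gradient with norm 50 is shrunk to norm 5; its direction is preserved.
clipped = clip_by_norm([30.0, -40.0], max_norm=5.0)
print(clipped)
```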

9. Why would an LSTM outperform a vanilla RNN in this application?

  • It reorganizes time-dependent inputs into a layered structure
  • It suppresses the influence of information from distant time steps
  • It eliminates recurrent connections to simplify sequence modeling
  • It stores and controls information using gated memory mechanisms
Answer : d

10. Which LSTM gate controls how much previous information should be erased?

  • Reset gate
  • Input gate
  • Forget gate
  • Output gate
Answer : c
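
The role of the forget gate can be sketched with the standard LSTM cell-state update, c_t = f * c_{t-1} + i * tanh(candidate). In a real LSTM the gate pre-activations are computed from the current input and the previous hidden state. Here they are passed in directly as hypothetical arguments (`forget_logit`, `input_logit`) to keep the example short.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_cell_update(c_prev, forget_logit, input_logit, candidate):
    """Scalar cell-state update: c_t = f * c_prev + i * tanh(candidate)."""
    f = sigmoid(forget_logit)  # forget gate: fraction of old memory kept
    i = sigmoid(input_logit)   # input gate: fraction of new content written
    return f * c_prev + i * math.tanh(candidate)

# Forget gate near 1 keeps the old memory; near 0 erases it.
kept = lstm_cell_update(c_prev=2.0, forget_logit=10.0, input_logit=-10.0, candidate=0.0)
erased = lstm_cell_update(c_prev=2.0, forget_logit=-10.0, input_logit=-10.0, candidate=0.0)
print(kept, erased)
```

Because the gates come from a sigmoid, they act as soft switches with values between 0 and 1 rather than hard on/off decisions.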

11. Why is it reasonable to generate only one output after processing the full message?

  • Because only the final word matters
  • Because sentiment depends on the full sequence
  • Because the intermediate outputs are unavailable
  • Because the model cannot output multiple values
Answer : b

12. What is a key architectural difference between a GRU and an LSTM?

  • GRUs have no memory at all
  • GRUs process all data without recurrence
  • GRUs merge input and forget functions
  • GRUs use more gates than an LSTM
Answer : c

13. What is the function of the reset gate in a GRU?

  • It controls how the hidden state contributes to the final output at each time step
  • It determines how much new input information is added to the hidden state
  • It regulates the flow of error gradients during backpropagation through time
  • It controls how much past hidden state is used when forming the new state
Answer : d
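
A scalar sketch of a GRU step may make option (d) concrete. The reset gate scales the previous hidden state before it enters the candidate state, and a single update gate then blends the old and candidate states. Gate values are passed in directly as hypothetical arguments, the weights are assumed to be 1.0, and some references swap the roles of z and (1 - z) in the blend.

```python
import math

def gru_candidate(h_prev, x, reset_gate, w_x=1.0, w_h=1.0):
    """Candidate state: tanh(w_x * x + w_h * (reset_gate * h_prev))."""
    return math.tanh(w_x * x + w_h * (reset_gate * h_prev))

def gru_step(h_prev, x, reset_gate, update_gate):
    """New state: (1 - z) * h_prev + z * candidate."""
    candidate = gru_candidate(h_prev, x, reset_gate)
    return (1.0 - update_gate) * h_prev + update_gate * candidate

# Reset gate 0: the candidate ignores the past and depends only on x.
ignore_past = gru_candidate(h_prev=0.8, x=0.3, reset_gate=0.0)
# Reset gate 1: the full previous state feeds into the candidate.
use_past = gru_candidate(h_prev=0.8, x=0.3, reset_gate=1.0)
print(ignore_past, use_past)
```

The single `update_gate` does the work that the separate input and forget gates do in an LSTM.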

14. Why do LSTM and GRU gates use the sigmoid activation function?

  • It speeds up the training process
  • It removes the need for normalization
  • It outputs values between 0 and 1
  • It prevents gradient explosion
Answer : c

15. Which design choice best helps the model capture long-range emotional cues?

  • Reducing the size of the hidden state representation
  • Using gated recurrent units to regulate information
  • Removing earlier time steps to simplify the sequence
  • Relying only on the most recent inputs for prediction
Answer : b