In SARSA, the next action used in the update is

The action actually chosen by the current (often ε-greedy) policy in the next time step

Irrelevant, because SARSA does not require a next action

Determined by a different policy than the one being evaluated

The greedy action that maximizes the Q-value in the next state

Why is Q-Learning considered an off-policy method?

It can gather experience with a separate “behavior policy” (for example, ε-greedy) while updating its values as if it followed the greedy policy.

It does not require a replay buffer or any historical data.

It updates its estimates based on Monte Carlo returns only.

It never explores and only exploits the best action.

It abandons temporal difference learning.

In the context of TD methods, “bootstrapping” refers to which concept?

Updating value estimates based partly on other learned estimates rather than exclusively on actual returns

Learning only from complete returns collected at the end of each episode

Combining multiple policies simultaneously

Taking random actions in the environment to initialize the replay buffer

Bypassing the need for an explicit Q-value function
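
To make the bootstrapping idea concrete, here is a minimal sketch contrasting a TD(0) update with a Monte Carlo update for a tabular state-value function. The function names, the dictionary-based value table, and the numbers are illustrative assumptions, not part of the quiz.

```python
def td0_update(v, state, reward, next_state, alpha, gamma):
    """Bootstrapped update: the target uses the current estimate v[next_state]."""
    target = reward + gamma * v[next_state]
    v[state] += alpha * (target - v[state])


def mc_update(v, state, observed_return, alpha):
    """Monte Carlo update: the target is the full return observed after the episode ends."""
    v[state] += alpha * (observed_return - v[state])


# Hypothetical value tables for the demonstration.
v_td = {"A": 0.0, "B": 2.0}
td0_update(v_td, "A", reward=1.0, next_state="B", alpha=0.5, gamma=1.0)

v_mc = {"A": 0.0}
mc_update(v_mc, "A", observed_return=4.0, alpha=0.5)

print(v_td["A"], v_mc["A"])
```

The TD target depends on another learned estimate (`v["B"]`), while the Monte Carlo target depends only on rewards that were actually collected.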

In a standard DQN, the neural network typically

Outputs a single Q-value for each possible action in the environment

Automatically splits the environment into separate tasks

Receives the next action as part of the input

Directly predicts the best action without any Q-value

Which statement best describes the Monte Carlo approach in Reinforcement Learning?

It uses complete episodes to estimate returns and updates the value function only at the end of each episode.

It is fundamentally incompatible with policy evaluation.

It updates value estimates based on a single step lookahead.

It estimates value functions purely by bootstrapping from existing estimates.

It performs value updates after every state transition without waiting for episode completion.

Which technique is commonly used in DQN to stabilize learning?

Maintaining a target network that is updated slowly compared to the main Q-network.

Dynamically modifying the environment’s reward signals in every step.

Completely ignoring experience replay so as not to overfit.

Removing the discount factor to avoid infinite returns.

Using purely on-policy updates with SARSA.

In a standard DQN architecture, which statement is true about how the neural network is used?

The network takes the state as input and outputs Q-values for all possible discrete actions.

The neural network outputs the policy probabilities for each action.

The network takes the state as input and outputs a single Q-value, forcing you to run it multiple times.

The network only estimates the next state’s reward, ignoring future states.

The network directly outputs the optimal action without any Q-value estimation.
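
The state-in, all-Q-values-out shape can be sketched without a deep learning library. The toy class below is a single linear layer standing in for a real DQN; its name, sizes, and random initialization are assumptions made only for illustration.

```python
import random


class TinyQNetwork:
    """Toy linear stand-in for a DQN: state vector in, one Q-value per action out."""

    def __init__(self, state_dim, n_actions, seed=0):
        rng = random.Random(seed)
        # One weight row per discrete action.
        self.weights = [
            [rng.uniform(-0.1, 0.1) for _ in range(state_dim)]
            for _ in range(n_actions)
        ]
        self.biases = [0.0] * n_actions

    def forward(self, state):
        # A single forward pass yields Q(s, a) for every action a.
        return [
            sum(w * s for w, s in zip(row, state)) + b
            for row, b in zip(self.weights, self.biases)
        ]


net = TinyQNetwork(state_dim=4, n_actions=3)
q_values = net.forward([0.1, -0.2, 0.3, 0.0])
greedy_action = max(range(len(q_values)), key=q_values.__getitem__)
print(q_values, greedy_action)
```

Because every action's Q-value comes out of one pass, picking the greedy action is a single argmax rather than one network call per action.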

What is the main difference between Q-Learning and SARSA?

Q-Learning’s update uses the greedy action for the next state, whereas SARSA uses the action actually taken by the current policy.

Q-Learning is an on-policy method, while SARSA is off-policy.

SARSA requires knowledge of the transition probabilities.

SARSA converges faster than Q-Learning in all cases

Q-Learning does not use a discount factor, while SARSA does.
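
The difference between the two update targets can be sketched as follows. The Q-table layout (`q[state][action]`), the function names, and the numbers are hypothetical and chosen only to make the contrast visible.

```python
def sarsa_target(q, reward, next_state, next_action, gamma):
    """On-policy target: bootstrap from the action the agent actually takes next."""
    return reward + gamma * q[next_state][next_action]


def q_learning_target(q, reward, next_state, gamma):
    """Off-policy target: bootstrap from the greedy action in the next state."""
    return reward + gamma * max(q[next_state])


# Hypothetical Q-table with two states and two actions.
q = [[0.0, 0.0], [1.0, 5.0]]
reward, next_state, gamma = 1.0, 1, 0.5

# Suppose the epsilon-greedy policy explored and picked action 0 in the next state.
sarsa = sarsa_target(q, reward, next_state, next_action=0, gamma=gamma)
q_learning = q_learning_target(q, reward, next_state, gamma)
print(sarsa, q_learning)  # 1.5 3.5
```

The two targets agree whenever the action taken next happens to be the greedy one; they differ only on exploratory steps.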

In DQN for discrete actions, how does the agent select the best action after the network outputs Q-values?

It chooses the action with the highest Q-value.

It picks the action with the smallest Q-value to minimize cost.

It uses an actor network to select continuous actions.

It always picks actions in a round-robin manner.

The network outputs a probability distribution, from which the action is sampled.
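
A minimal sketch of action selection from a list of Q-values, assuming ε-greedy exploration during training (with `epsilon=0.0` it reduces to picking the highest Q-value). The function name and signature are illustrative.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick a random action, otherwise the argmax of q_values."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


q_values_example = [0.1, 0.9, 0.3]
best_action = epsilon_greedy(q_values_example, epsilon=0.0)
print(best_action)  # 1
```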

In off-policy methods, what is the main purpose of importance sampling?

To correct for the mismatch between the behavior policy (used to generate data) and the target policy (used to evaluate/improve)

To reduce the variance of Monte Carlo estimates by ignoring certain transitions

To convert on-policy data into a model-based approach

To enable purely deterministic updates in a continuous action space

To ensure that the discount factor can be changed dynamically
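
As a rough illustration of the correction, the sketch below computes the importance sampling ratio for one trajectory: the product over steps of target-policy probability divided by behavior-policy probability. The policy tables and trajectory are made-up examples, and the policies are assumed to be stored as `policy[state][action]` probabilities.

```python
def importance_ratio(trajectory, target_policy, behavior_policy):
    """Product of pi(a|s) / b(a|s) over the (state, action) pairs of a trajectory."""
    rho = 1.0
    for state, action in trajectory:
        rho *= target_policy[state][action] / behavior_policy[state][action]
    return rho


# Hypothetical policies over two states and two actions.
behavior = {"s0": [0.5, 0.5], "s1": [0.5, 0.5]}   # uniform exploration
target = {"s0": [1.0, 0.0], "s1": [0.0, 1.0]}     # deterministic greedy policy

matching = importance_ratio([("s0", 0), ("s1", 1)], target, behavior)
mismatched = importance_ratio([("s0", 1), ("s1", 1)], target, behavior)
print(matching, mismatched)  # 4.0 0.0
```

Multiplying an observed return by this ratio reweights it so that, in expectation, it estimates the value under the target policy; trajectories the target policy would never produce get weight zero.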

What is one main reason why a plain DQN might struggle in very high-dimensional continuous action spaces?

DQN outputs Q-values for discrete actions, and enumerating a continuous space is infeasible

DQN is not a function approximator

The ε-greedy strategy can only handle continuous actions

Continuous spaces do not allow for discounting of future rewards

Experience Replay is impossible to maintain for large observation spaces

In Monte Carlo methods, which of the following is a typical requirement for estimating value functions?

The method uses the average of the returns observed for each state (or state-action pair) across many episodes.

Episodes can be truncated at any time and the incomplete return is used directly.

Each episode must be guaranteed to be infinite.

The method only works with deterministic environments.

No environment interaction is needed; it relies on purely analytical solutions.
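
The averaging of returns can be sketched as first-visit Monte Carlo prediction. The episode format used here, a list of `(state, reward)` pairs where the reward is the one received after leaving that state, is an assumption for this example.

```python
from collections import defaultdict


def first_visit_mc(episodes, gamma):
    """Estimate V(s) as the average first-visit return over complete episodes."""
    returns = defaultdict(list)
    for episode in episodes:
        g = 0.0
        first_visit_return = {}
        # Walk backwards so g accumulates the discounted return from each step.
        for state, reward in reversed(episode):
            g = reward + gamma * g
            # Earlier visits overwrite later ones, leaving the first-visit return.
            first_visit_return[state] = g
        for state, g_first in first_visit_return.items():
            returns[state].append(g_first)
    return {state: sum(gs) / len(gs) for state, gs in returns.items()}


# Two hypothetical complete episodes.
episodes = [
    [("A", 0.0), ("B", 1.0)],
    [("A", 0.0), ("B", 3.0)],
]
values = first_visit_mc(episodes, gamma=1.0)
print(values)  # {'B': 2.0, 'A': 2.0}
```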

Why are Upper Confidence Bound (UCB) methods effective for action selection?

They add an exploration bonus that prioritizes actions whose value estimates are highly uncertain.

Using stochastic rewards for initialization

Avoiding suboptimal actions entirely.

Setting probabilities equal for all actions.

They eliminate the need for exploration.
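
A minimal sketch of UCB-style selection for a bandit, assuming the common UCB1 form: estimated value plus a bonus that shrinks as an action is tried more often. The function name, the constant `c`, and the sample numbers are illustrative choices.

```python
import math


def ucb1_select(values, counts, t, c=2.0):
    """Pick the action maximizing value + c * sqrt(ln(t) / count)."""
    # Try every action once before trusting any estimate.
    for action, n in enumerate(counts):
        if n == 0:
            return action
    scores = [
        v + c * math.sqrt(math.log(t) / n)
        for v, n in zip(values, counts)
    ]
    return max(range(len(scores)), key=scores.__getitem__)


# Action 0 looks slightly better but has been tried 100 times; action 1 only twice.
chosen = ucb1_select(values=[0.5, 0.4], counts=[100, 2], t=102)
print(chosen)  # 1
```

The rarely tried action wins here because its uncertainty bonus outweighs its slightly lower estimate.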

Which of the following is true regarding Monte Carlo methods and infinite episodes?

They require episodes to terminate eventually; continuing tasks need a different formulation, such as average-reward methods.

They cannot handle infinite-horizon problems at all.

They ignore discounting entirely in infinite episodes.

They rely on partial returns mid-episode for updates.

They assume episodes are infinite and never update the value function.

In Monte Carlo Reinforcement Learning, which strategy can be used to ensure that all state-action pairs are sampled (visited) eventually?

Exploring starts, where every state-action pair has a nonzero probability of being selected as the start of an episode

Using an infinitely high discount factor

Restricting the agent to a subset of actions

Relying on bootstrapping from immediate rewards

Using purely deterministic policies

In SARSA (a TD control method), the update rule for the state-action value function Q typically includes which of the following terms?

The next action actually taken by the current policy

A value function that depends on no discount factor

A direct model of state transitions

The target action chosen by an off-policy method

What does the term “Markov property” signify in Markov Decision Processes (MDPs)?

The future depends only on the present state

The rewards are discounted exponentially

The transition probabilities are stochastic

The policy is deterministic for every state

What does the Bellman Equation represent in Reinforcement Learning?

The immediate reward plus discounted future rewards

The expected long-term reward for a policy

The difference between predicted and actual rewards

The best policy derivation for multi-agent systems

The probability of state transitions
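
For reference, the recursive form behind "immediate reward plus discounted future rewards" can be written out as below. The notation is an assumption borrowed from common textbook usage: $\pi$ is the policy, $p$ the environment dynamics, and $\gamma$ the discount factor. The first line is the Bellman expectation equation for a state-value function, the second the Bellman optimality equation for action values.

```latex
v_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\bigl[\, r + \gamma\, v_\pi(s') \,\bigr]

q_*(s, a) = \sum_{s', r} p(s', r \mid s, a)\,\bigl[\, r + \gamma \max_{a'} q_*(s', a') \,\bigr]
```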

deep rl

Quiz • Other • University • Practice Problem • Easy

lucky star

25 questions

MULTIPLE CHOICE QUESTION

30 sec • 1 pt

How does the n-step TD approach differ from TD(0)?

TD(0) is only used for deterministic policies, while n-step TD is for stochastic policies.

n-step TD randomly selects how many steps to wait before an update

n-step TD updates only at the end of the episode, just like Monte Carlo

TD(0) is an on-policy method while n-step TD is off-policy.

n-step TD uses longer traces of rewards and states before performing a single update, rather than a one-step lookahead.
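
The n-step target can be sketched as a discounted sum of the next n rewards plus a discounted bootstrap from the value estimate n steps ahead. The function name and numbers are illustrative; with a single reward the same function gives the TD(0) target.

```python
def n_step_return(rewards, bootstrap_value, gamma):
    """Sum of discounted rewards over n steps plus gamma**n times the bootstrap value."""
    g = bootstrap_value
    # Fold from the last reward backwards: g = r + gamma * g.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


three_step = n_step_return([1.0, 1.0, 1.0], bootstrap_value=10.0, gamma=0.5)
one_step = n_step_return([1.0], bootstrap_value=10.0, gamma=0.5)
print(three_step, one_step)  # 3.0 6.0
```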

Which key feature distinguishes Temporal Difference (TD) learning from Monte Carlo methods?

TD requires access to the full model of the environment’s transition probabilities.

TD is only applicable to deterministic policies.

TD waits until the end of an episode to update value estimates.

TD uses bootstrapping from current estimates rather than waiting for the final outcome.

TD cannot update its estimates online.

In DQN, what is the key purpose of the Experience Replay Buffer?

It amplifies the most recent transition repeatedly to speed up learning

It only stores states without actions or rewards

It replaces the need for a target network

It ensures that all experiences are used exactly once to avoid correlation

It stores past experiences and samples them randomly to break correlation in sequential data
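
A minimal replay buffer might look like the sketch below: a fixed-capacity queue of transitions with uniform random sampling. The class name, tuple layout, and `seed` parameter are assumptions for illustration; real implementations often add batching into arrays or prioritized sampling.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of transitions with uniform random sampling."""

    def __init__(self, capacity, seed=None):
        # deque with maxlen silently evicts the oldest transition when full.
        self.buffer = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks up consecutive transitions.
        return self.rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


buf = ReplayBuffer(capacity=3, seed=0)
for t in range(5):
    buf.push(state=t, action=0, reward=1.0, next_state=t + 1, done=False)
batch = buf.sample(2)
print(len(buf), batch)
```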

Q-Learning and SARSA both estimate Q-values, but Q-Learning is considered off-policy because

The actions used in the bootstrapped target are always taken from the same policy that generates behavior

It ignores the discount factor in updates

It never uses ε-greedy exploration

It uses the same policy for both exploration and evaluation

It updates using a greedy action for the next state, not necessarily the one followed by the agent during data collection

Why does DQN use a separate target network?

To generate randomized actions for exploration

To eliminate the need for discounting future rewards

To stabilize Q-value updates by keeping target estimates fixed for a while

To convert a continuous action space into a discrete one

To independently learn a model of the transition probabilities
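
Two common ways to keep the target network lagging behind the online network can be sketched as follows: a periodic hard copy, or a soft (Polyak) blend controlled by a small `tau`. Representing parameters as flat Python lists, and the function names, are simplifying assumptions for this example.

```python
def hard_update(target_params, online_params):
    """Copy the online parameters into the target network (done every N steps)."""
    target_params[:] = online_params


def soft_update(target_params, online_params, tau):
    """Move each target parameter a fraction tau toward the online parameter."""
    for i, (t, o) in enumerate(zip(target_params, online_params)):
        target_params[i] = (1.0 - tau) * t + tau * o


target_params = [0.0, 0.0]
online_params = [1.0, 2.0]

soft_update(target_params, online_params, tau=0.5)
after_soft = list(target_params)

hard_update(target_params, online_params)
after_hard = list(target_params)
print(after_soft, after_hard)  # [0.5, 1.0] [1.0, 2.0]
```

Between updates the target parameters stay fixed, so the bootstrapped targets do not shift with every gradient step on the online network.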

Double Q-Learning was introduced primarily to address which issue?

Handling continuous actions without an actor-critic method

The inability of Q-Learning to handle function approximation

The instability caused by batch updates in Q-Learning

Overestimation bias in Q-value updates due to using max over the same Q function

Lack of exploration in Q-Learning
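
A tabular sketch of the Double Q-Learning idea: one table selects the best next action and the other table evaluates it, so a single noisy overestimate is not both chosen and trusted. The function signature, the `update_a` flag (normally a coin flip by the caller), and the numbers are assumptions for illustration.

```python
def double_q_update(q_a, q_b, state, action, reward, next_state,
                    alpha, gamma, update_a):
    """Update one table using the other table to evaluate the selected action."""
    select, evaluate = (q_a, q_b) if update_a else (q_b, q_a)
    next_values = select[next_state]
    # Action selection uses the table being updated...
    best = max(range(len(next_values)), key=lambda a: next_values[a])
    # ...but its value comes from the other table.
    target = reward + gamma * evaluate[next_state][best]
    select[state][action] += alpha * (target - select[state][action])


# Hypothetical tables: q_a overestimates action 1 in state 1, q_b does not.
q_a = [[0.0, 0.0], [1.0, 3.0]]
q_b = [[0.0, 0.0], [2.0, 0.5]]
double_q_update(q_a, q_b, state=0, action=0, reward=1.0, next_state=1,
                alpha=1.0, gamma=1.0, update_a=True)
print(q_a[0][0])  # 1.5
```

A plain Q-Learning update on `q_a` alone would have used `1.0 + max(q_a[1]) = 4.0` as the target in this example.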

TD methods differ from Dynamic Programming (DP) primarily because TD methods

Do not use the concept of value functions

Only work in deterministic environments

Always converge faster than DP methods

Can learn directly from raw experience without knowing transition probabilities

Require a perfect model of the environment’s transitions
