
Deep Reinforcement Learning – Vocabulary

Deep Reinforcement Learning – Core Ideas

  • Fundamental definition- Reinforcement Learning (RL) = a framework for sequential decision making: an agent learns optimal sequences of actions in an uncertain environment by maximizing cumulative reward.

    • Deep models supply rich, non-linear function approximators → RL agents can learn complex input–output mappings end-to-end, for instance, directly from raw pixel inputs to control signals, bypassing manual feature engineering.

  • High-level workflow (modern vs. classic engineering)- Classic control: characterize → simulate → hand-craft a controller, i.e., engineers explicitly design the control laws.

    • Deep-RL pipeline: characterize the system → simulate its behavior → run RL; the network inverts the dynamics. This means instead of defining a controller that explicitly follows system dynamics, RL learns a policy (mapping states to actions) that implicitly acts optimally within the dynamics it experiences during simulation. It discovers the action that leads to desired outcomes given the state and learned dynamics, rather than explicitly modeling the system's response to control inputs.

  • Key dependency: everything the agent can influence must be simulated (or collected on-line) → simulation quality becomes decisive. An inaccurate simulator leads to policies that perform poorly or catastrophically when transferred to the real world (the "reality gap"). High fidelity is crucial for the learned policy to generalize.

Simulation-Centric Challenges

  • Must create/own a faithful simulator for every controllable variable – costly & sometimes impossible. Developing a high-fidelity simulator for complex real-world systems (e.g., robotics, autonomous driving, social interactions) is resource-intensive, requiring deep domain expertise and computational power. For some systems, accurate physics/behavioral models might not even exist (e.g., human behavior).

  • Trade-off: simulation provides safe experimentation but can introduce reality gap. Policies trained purely in simulation often fail or degrade significantly when deployed in the real world due to discrepancies between the simulated and real dynamics/sensors/actuators. This necessitates techniques like domain randomization or sim-to-real transfer.

  • Sample complexity, stability, generalization all become throttled by simulator speed & fidelity. Training RL agents requires vast amounts of interaction data (samples). A slow simulator directly limits how much data can be generated per unit time. Low fidelity can lead to unstable training or policies that don't generalize beyond the specific, inaccurate simulation.

Model Taxonomy (high-level)

  • Model-Free vs. Model-Based- Model-free: does not learn transition model p(s'\mid s,a). This transition model is the probability distribution over the next state s' given the current state s and action a. It describes how the environment behaves.

    • Value-iteration family (e.g. Q-Table, DQN,…): Algorithms estimate the optimal value function (e.g., Q-values) which quantifies the goodness of being in a state or taking an action. The optimal policy is then derived implicitly by choosing actions that maximize this value. They are target-driven, focusing on estimating future rewards.

    • Policy-iteration / policy-gradient family (e.g. REINFORCE, PPO, SAC,…): These algorithms directly learn a policy function \pi(a \mid s) that maps states to actions. They optimize the policy parameters to maximize the expected return, often using gradient ascent.

    • Model-based: learns or is given p(s'\mid s,a). Algorithms that explicitly learn or are given the environment's transition dynamics (p(s'\mid s,a)) and/or reward function. This model is then used for planning (e.g., Monte Carlo Tree Search in AlphaZero) or to generate synthetic experience for learning value functions or policies.

    • Dyna-Q, World-Models, MBVE, MBMF, AlphaZero, etc.

DQN & DDQN – Foundations

  • DQN (Deep Q-Network)- Addresses instability when combining Q-Learning + Neural Net. Traditional Q-learning with a neural network is unstable because: (1) Correlated samples: Sequential experiences in an environment are highly correlated, violating the i.i.d. assumption necessary for stable neural network training. (2) Non-stationarity: The target Q-values (r + \gamma \max_{a'} Q(s', a'; \theta)) constantly change as the main network's weights \theta are updated, leading to a moving target problem.

    • Two critical innovations:

    • Experience replay – store tuples (s,a,r,s'), sample i.i.d. mini-batches to decorrelate. By randomly sampling from a buffer of past experiences, the correlations between successive updates are broken, making the training data appear more i.i.d.

    • Target network – periodically (or softly) copy main weights to stabilize bootstrapping. A separate neural network with fixed or slowly updated weights (\theta_{\text{target}}) computes the target Q-values (r + \gamma \max_{a'} Q(s', a'; \theta_{\text{target}})), providing a stable target for the main network to learn towards.
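
As a concrete illustration of the first innovation, here is a minimal replay-buffer sketch (a hypothetical helper, not code from the slides): it stores (s, a, r, s', done) tuples and returns uniformly sampled mini-batches, which is what de-correlates successive updates.

```python
import random
from collections import deque

import numpy as np

class ReplayBuffer:
    """Fixed-capacity store of (s, a, r, s', done) transitions with uniform sampling."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are dropped automatically

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)   # uniform random mini-batch
        states, actions, rewards, next_states, dones = map(np.array, zip(*batch))
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```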

    • Limitations:

    • Discrete action space only. The max operator requires iterating over all possible actions, which is infeasible for continuous or very large discrete action spaces.

    • Exploration largely \varepsilon-greedy (inefficient in sparse-reward domains). Simple random exploration can be inefficient, especially in complex environments with sparse rewards, as it relies heavily on chance to discover rewarding states.

  • DDQN (Double DQN)- Fixes over-estimation bias of max operator in DQN. In DQN, the same network is used both to select the action with the highest Q-value and to evaluate that Q-value. This can lead to consistently overestimating Q-values, particularly in noisy or stochastic environments, accumulating errors.

    • Uses current network for action selection while using target network for action evaluation:

      • Action selection: a^* = \arg\max_a Q(s', a; \theta_{\text{main}}) (selects the action using the main network)

      • Action evaluation: Q_{\text{target}} = r + \gamma\, Q(s', a^*; \theta_{\text{target}}) (evaluates the selected action using the target network)

    • Empirically superior in more complex tasks (e.g., Lunar-Lander).

    • Sample update (Python-like snippet highlights):

```python
import numpy as np

# Double-DQN target: select the best next action with the main network,
# evaluate it with the target network.
next_q = model.predict(next_states)              # Q(s', .; theta_main)
best_a = np.argmax(next_q, axis=1)               # a* = argmax_a Q(s', a; theta_main)
target_q = target_model.predict(next_states)     # Q(s', .; theta_target)
targets = rewards + gamma * target_q[np.arange(batch), best_a] * (1 - dones)  # batch = mini-batch size
```

    • Soft target update (Polyak): \theta_{\text{target}} \leftarrow (1-\tau)\,\theta_{\text{target}} + \tau\,\theta (a more gradual update where a small fraction \tau of the main network's weights are transferred to the target network at each step, providing a smoother transition for the targets compared to hard updates).
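
A minimal sketch of the Polyak update, assuming Keras-style models that expose get_weights() / set_weights():

```python
def soft_update(main_model, target_model, tau=0.005):
    """Polyak averaging: theta_target <- (1 - tau) * theta_target + tau * theta_main."""
    new_weights = [
        (1.0 - tau) * w_target + tau * w_main
        for w_target, w_main in zip(target_model.get_weights(), main_model.get_weights())
    ]
    target_model.set_weights(new_weights)
```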

  • Training advice- Expect lengthy training (≈ 1 day for Lunar-Lander on CPU). RL often requires millions of environmental steps to learn robust policies, translating to significant computational time.

    • Prefer GPU acceleration. Neural network computations benefit immensely from parallel processing on GPUs, drastically reducing training time.

    • Tune learning-rate (adaptive or schedule). The learning rate dictates step size in gradient descent. A low learning rate is too slow; too high can cause divergence. Adaptive methods (Adam, RMSProp) or schedules (decaying LR) are common.

    • Non-standard exploration (NoisyNet, parameter-space noise, etc.) when \varepsilon-greedy fails. For complex tasks, simple random exploration might not discover rewarding states. More sophisticated exploration strategies can inject noise into network weights (NoisyNet) or action parameters to facilitate more diverse behavior.

    • Multiple random seeds in parallel; monitor per-seed curves. Due to inherent randomness (initialization, exploration, environment stochasticity), RL training can be highly variable. Running multiple experiments with different random seeds provides a more reliable estimate of an algorithm's performance and stability.

    • Hyper-parameter sweeps (batch, replay size, target-update period, etc.). Optimal hyper-parameters vary greatly between environments and algorithms. Systematic searches (grid search, random search, Bayesian optimization) are often necessary.

DQN Family – Progressive Enhancements

  • ReplayMemory + TargetNet = baseline DQN (NIPS-2013).

  • Nature-2015 refinements (bigger net, reward-clipping, RMSProp): Reward clipping limits rewards to [-1, 1] to prevent divergent Q-values and make learning more stable. RMSProp is an adaptive learning rate optimizer that can accelerate convergence.

  • Double DQN – over-estimation fix.

  • Dueling DQN – disentangle state value vs. advantage: Q(s,a)=V(s) + A(s,a). It trains two separate streams within the network: one estimates the state-value function V(s) (how good it is to be in a state, irrespective of action) and the other estimates the advantage function A(s,a) (how much better/worse an action is compared to the average action in that state).

    • Two decoder heads; improves learning focus. This architecture encourages better learning of state values, especially in environments where many actions have similar effects, leading to improved stability and faster learning.
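
A minimal Keras-style sketch of the dueling architecture (layer sizes are illustrative, not from the slides). In practice the advantage stream is mean-centered before aggregation so that V and A remain identifiable:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_dueling_dqn(state_dim, n_actions):
    inputs = layers.Input(shape=(state_dim,))
    x = layers.Dense(128, activation="relu")(inputs)                     # shared encoder
    v = layers.Dense(1)(layers.Dense(64, activation="relu")(x))          # state-value head V(s)
    a = layers.Dense(n_actions)(layers.Dense(64, activation="relu")(x))  # advantage head A(s, a)
    # Q(s, a) = V(s) + (A(s, a) - mean_a A(s, a))
    q = layers.Lambda(
        lambda va: va[0] + (va[1] - tf.reduce_mean(va[1], axis=1, keepdims=True))
    )([v, a])
    return Model(inputs, q)
```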

  • Prioritized Experience Replay (PER)- Sample probability P(i)=\frac{p_i^{\alpha}}{\sum_j p_j^{\alpha}} proportional to TD-error magnitude |\delta_i|. Instead of uniform sampling, PER samples experiences more frequently if they have a higher TD-error (i.e., the current Q-value estimate is far from the target), meaning the agent learns more from "surprising" or "important" transitions.

    • Importance-sampling weight w_i = \left(\frac{1}{N\,P(i)}\right)^{\beta} / \max_j w_j to correct bias. Because prioritized sampling introduces bias (over-representing high-TD-error samples), importance-sampling weights are used to correct the gradients during the neural network update, ensuring that the expected value of the update remains correct.

    • Algorithm snippet (proportional variant) provided in slide.
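
Since the slide's snippet is not reproduced here, below is a minimal NumPy sketch of the proportional variant (the priority exponent \alpha, IS exponent \beta, and the small \epsilon added to priorities are the usual hyper-parameters):

```python
import numpy as np

def per_sample(td_errors, batch_size, alpha=0.6, beta=0.4, eps=1e-6):
    """Proportional prioritized sampling with importance-sampling weights."""
    priorities = (np.abs(td_errors) + eps) ** alpha      # p_i^alpha with p_i = |delta_i| + eps
    probs = priorities / priorities.sum()                # P(i) = p_i^alpha / sum_j p_j^alpha
    idx = np.random.choice(len(td_errors), batch_size, p=probs)
    n = len(td_errors)
    weights = (n * probs[idx]) ** (-beta)                # w_i = (1 / (N * P(i)))^beta
    weights /= weights.max()                             # normalize by the largest sampled weight
    return idx, weights
```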

  • N-step returns: Instead of using immediate reward plus discounted next state Q-value, N-step returns use a sum of N future rewards and then the Q-value of the state reached after N steps. This balances bias (from bootstrapping) and variance (from Monte Carlo returns).
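
A small illustrative helper showing how an N-step target can be assembled, folding the N observed rewards back-to-front onto a bootstrapped value:

```python
def n_step_target(rewards, bootstrap_value, gamma=0.99):
    """r_t + gamma*r_{t+1} + ... + gamma^(N-1)*r_{t+N-1} + gamma^N * bootstrap_value."""
    target = bootstrap_value
    for r in reversed(rewards):          # fold the N rewards in, newest first
        target = r + gamma * target
    return target

# e.g. a 3-step target with rewards [1, 0, 2] and a bootstrapped Q-value of 5.0:
# 1 + 0.99*(0 + 0.99*(2 + 0.99*5.0))
print(n_step_target([1.0, 0.0, 2.0], 5.0))
```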

  • Noisy Nets: Introduce learnable noise in the network's weights, encouraging a more efficient and correlated exploration strategy over time, often outperforming \varepsilon-greedy.

  • Distributional RL (C51, IQN): Instead of learning a single expected Q-value, these methods learn a distribution over possible returns. This provides a richer understanding of the uncertainty in returns and can lead to more robust policies.

  • Rainbow aggregates many tricks. It combines six key DQN extensions (Double DQN, Dueling DQN, Prioritized Replay, Multi-step Learning, NoisyNets, Distributional RL) into a single, highly performant agent.

Network Architectures Snapshot

  • (a) Vanilla DQN: single output head Q(s;\theta). A single output layer that predicts Q-values for all discrete actions given the current state input. Suitable for simple discrete action spaces.

  • (b) Dueling DQN: two heads V(s), A(s,a) then aggregate. Features two parallel streams after a shared convolutional or dense feature extractor. One stream outputs a scalar value function V(s), and the other outputs an advantage function A(s,a) vector for all actions. These are combined to form the final Q-values. Improves performance by separating state evaluation from action advantages.

  • (c) Policy-gradient / REINFORCE: (\pi(s;\theta) only). The network directly outputs a probability distribution over actions \pi_\theta(a\mid s). Used for stochastic policies, typically in discrete or sampled continuous action spaces.

  • (d) Two-head Actor-Critic: separate actor (\pi) and critic V sharing encoder. Has a shared neural network encoder for feature extraction, which then branches into two separate "heads": an Actor head (outputting policy \pi) and a Critic head (outputting value function V). This allows both components to benefit from shared representations.

  • (e) Decoupled AC: distinct encoders for actor & critic. Actor and Critic networks have entirely separate architectures or encoders, potentially allowing them to learn more specialized representations for their respective tasks.

  • (f) DDPG / TD3 / SAC 2019: Actor + two Critic-Q nets + target networks. These are continuous control algorithms. DDPG has an actor network (deterministic policy \mu) and a critic Q-network (evaluates Q-values). TD3 adds a second critic (twin critics for min reduction) and delayed/smoothed actor updates for stability. SAC 2019 similarly uses an actor and two Q-critics.

  • (g) Entropy-regularized SAC 2018: Actor, V-critic, and two Q critics. Similar to (f) but explicitly includes a value critic V in addition to the two Q-critics, often used to aid in entropy regularization for exploration. The policy aims to maximize both reward and entropy.

Policy Gradient (PG) Methods – Overview

  • Contrast with value-based: directly optimize policy \pi_\theta(a\mid s) w.r.t. \theta. Instead of learning Q-values and deriving a policy, PG methods directly parameterize the policy (e.g., with a neural network) and adjust its parameters \theta to maximize the expected cumulative reward. No explicit Q-table.

  • Stochastic vs. deterministic policies- Stochastic: \pi_\theta(a\mid s)=P(a|s) – inherent exploration, differentiable via likelihood-ratio. The policy outputs a probability distribution over actions, providing inherent exploration as actions are sampled from this distribution, allowing the agent to naturally discover novel behaviors. Differentiable via the likelihood-ratio trick, which forms the basis for the PG theorem.

    • Deterministic: a = \mu_\theta(s) – easier in continuous control (DDPG). The policy outputs a single, specific action for each state. This is often easier in continuous control because there's no need to sample actions from a distribution during inference, and the gradient flow is more direct. Exploration must be explicitly added (e.g., through action noise).

  • Advantages- Handles high-dim & continuous actions. By outputting probability distributions (stochastic) or direct action values (deterministic) from a neural network, PG methods scale gracefully to complex action spaces.

    • Converges to stochastic optimal policies when needed. In environments where optimal behavior requires randomization (e.g., poker, rock-paper-scissors), PG methods can naturally find stochastic policies.

    • Exploration can be shaped via entropy bonuses. An entropy term can be added to the reward function to encourage the policy to be more exploratory (i.e., assign more uniform probabilities to actions), aiding in discovering better solutions.

  • Disadvantages- High variance gradient estimate; often only local optimum. Policy gradients are typically estimated from sampled trajectories, leading to high variance, which can slow down training and make it unstable.

    • Sample-inefficient; needs large batches or variance-reduction tricks (baselines, GAE, etc.). Requires many interactions with the environment to get accurate gradient estimates, often more than value-based methods. To mitigate high variance, techniques like subtracting a baseline (e.g., the value function) or using Generalized Advantage Estimation (GAE) are crucial.

Main PG Algorithms (chronological)
  • REINFORCE (Williams 1992)- Monte-Carlo episode returns G_t; update: \theta \leftarrow \theta + \alpha\, G_t\, \nabla_\theta \log \pi_\theta(a_t \mid s_t). A foundational Monte-Carlo policy gradient algorithm that updates the policy parameters based on the total discounted return received after taking an action. The update rule increases the probability of actions that led to high returns and decreases it for actions leading to low returns.

    • Implementation trick: use sparse_categorical_crossentropy with sample weights set to the returns/advantages. This simplifies the gradient computation in deep learning frameworks by weighting each action's loss term, as shown in the sketch below.
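
A minimal Keras sketch of that trick (network size, shapes, and data are purely illustrative): minimizing the cross-entropy on the actions actually taken, weighted by the returns, is equivalent to ascending G_t \nabla_\theta \log \pi_\theta(a_t \mid s_t).

```python
import numpy as np
import tensorflow as tf

# Tiny policy network for a hypothetical 4-dim state, 2-action environment.
policy = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(4,)),
    tf.keras.layers.Dense(2, activation="softmax"),
])
policy.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# One REINFORCE update on a finished episode (illustrative random data).
states = np.random.randn(10, 4).astype("float32")   # visited states s_t
actions = np.random.randint(0, 2, size=10)          # actions a_t actually taken
returns = np.random.randn(10).astype("float32")     # discounted (and normalized) returns G_t
policy.train_on_batch(states, actions, sample_weight=returns)
```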

  • Actor-Critic family- Actor updates like PG; Critic learns V(s) or Q(s,a) as baseline → lower variance. The actor network learns the policy and decides which actions to take. The critic network estimates the value function and provides a low-variance estimate of the advantage, which guides the actor's updates. By subtracting the critic's value from the Monte Carlo return, the update focuses on how much better or worse an action was than expected.

    • A2C / A3C: synchronous vs. asynchronous workers. A2C (Advantage Actor-Critic) is the synchronous version, where multiple parallel agents collect experiences and update a centralized model. A3C (Asynchronous Advantage Actor-Critic) uses asynchronous updates, allowing agents to train on separate copies of the environment and update a global network independently, often leading to faster and more stable training.

  • TRPO: trust-region constraint via KL divergence to limit step size. Trust Region Policy Optimization constrains the KL divergence between the new and old policies to prevent large, destabilizing policy changes, ensuring more nearly monotonic improvement.

  • PPO: practical TRPO approximation; clipped surrogate objective:

    L^{\text{CLIP}}(\theta) = \mathbb{E}\big[\min\big(r_t(\theta)\,A_t,\;\text{clip}(r_t(\theta),\,1-\epsilon,\,1+\epsilon)\,A_t\big)\big]
    where r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)} (ratio of new to old policy probabilities). The clipping function limits how much the new policy can deviate from the old policy, preventing overly aggressive updates. PPO is popular due to its good balance of performance, stability, and ease of implementation.
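
A minimal TensorFlow-style sketch of the clipped surrogate loss (function and argument names are assumptions, not a reference implementation):

```python
import tensorflow as tf

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Negative clipped surrogate objective (minimized by the optimizer)."""
    ratio = tf.exp(new_log_probs - old_log_probs)                     # r_t(theta)
    clipped = tf.clip_by_value(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    surrogate = tf.minimum(ratio * advantages, clipped * advantages)
    return -tf.reduce_mean(surrogate)                                 # ascend L^CLIP = descend its negative
```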

  • DDPG: deterministic actor + critic; uses target nets & replay. Deep Deterministic Policy Gradient combines ideas from DQN (experience replay, target networks) with deterministic policy gradients suitable for continuous action spaces. It learns a deterministic actor \mu_\theta(s) and a Q-critic Q_\phi(s,a).

  • TD3: twin critics (min reduction), delayed actor update, target-policy smoothing. Twin Delayed DDPG (TD3) uses two Q-networks and takes the minimum of their predictions to reduce overestimation bias, updates the policy network less frequently than the Q-networks for stability, and adds a small clipped noise to the target action, making the Q-function smoother.
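
A sketch of the TD3 critic target showing all three tricks at once (assumes Keras-style target networks where the critics take [state, action] inputs and that rewards/dones have shape (batch, 1); names are illustrative):

```python
import tensorflow as tf

def td3_target(rewards, dones, next_states, actor_target, q1_target, q2_target,
               gamma=0.99, noise_std=0.2, noise_clip=0.5, action_limit=1.0):
    """Target-policy smoothing + twin-critic min reduction for the critic targets."""
    next_actions = actor_target(next_states)
    noise = tf.clip_by_value(
        tf.random.normal(tf.shape(next_actions), stddev=noise_std),
        -noise_clip, noise_clip)                                      # clipped smoothing noise
    next_actions = tf.clip_by_value(next_actions + noise, -action_limit, action_limit)
    q_next = tf.minimum(q1_target([next_states, next_actions]),
                        q2_target([next_states, next_actions]))       # pessimistic twin-critic estimate
    return rewards + gamma * (1.0 - dones) * q_next
```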

  • SAC: entropy-regularized (maximum-entropy) objective that includes \alpha\,\mathcal{H}[\pi(\cdot\mid s)]. Soft Actor-Critic is a maximum-entropy RL algorithm: its objective adds the entropy term \alpha\,\mathcal{H}[\pi(\cdot\mid s)], which encourages the agent not only to maximize reward but also to keep its action distribution random, promoting exploration and robustness.
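
For comparison, a sketch of the soft (entropy-regularized) Bellman target used by SAC-style critics, assuming a stochastic policy object whose sample() method returns actions and their log-probabilities (again, names and shapes are assumptions):

```python
import tensorflow as tf

def sac_target(rewards, dones, next_states, policy, q1_target, q2_target,
               alpha=0.2, gamma=0.99):
    """Soft target: r + gamma * (min_i Q_i(s', a') - alpha * log pi(a' | s'))."""
    next_actions, next_log_probs = policy.sample(next_states)         # a' ~ pi(. | s')
    q_next = tf.minimum(q1_target([next_states, next_actions]),
                        q2_target([next_states, next_actions]))
    soft_value = q_next - alpha * next_log_probs                      # entropy bonus
    return rewards + gamma * (1.0 - dones) * soft_value
```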

When to choose which
  • Use PG / Actor-Critic when:- Action space continuous. They naturally handle continuous actions by outputting real-valued action commands or parameters of continuous distributions.

    • Need stochastic behavior. In competitive or uncertain environments, a stochastic policy can be optimal, and PG methods can naturally discover this.

    • Willing to trade sample efficiency for asymptotic performance. While PG methods often require many samples, they can achieve higher asymptotic performance on complex continuous control tasks and scale better to high-dimensional state spaces compared to value-based methods.

  • Use DQN / DDQN when:- Discrete actions. They are designed for and perform optimally with discrete action spaces.

    • Limited samples → replay advantage. Experience replay allows for efficient reuse of past data, making them more sample-efficient than on-policy PG methods.

    • Classic games, tabular-like settings. Excel in environments like Atari games where the state space can be high-dimensional but actions are discrete and the environment is typically deterministic or low-stochastic.

Imitation Learning (Behavioral Cloning)

  • Concept: learn (\pi(a\mid s)) directly from expert demonstrations without explicit reward. The agent is trained like a supervised learning problem, mapping observed states to expert actions, without needing a separate reward signal from the environment.

  • Early system: ALVINN (1989) – camera → steering angles.

  • Pipeline: collect {(s,a)}_{\text{expert}} → supervised learning. The agent's policy network learns to directly mimic the actions of an expert from a dataset of state-action pairs.
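
A minimal behavioral-cloning sketch (file names, shapes, and hyper-parameters are hypothetical): the policy is fit with plain supervised learning on the expert's state-action pairs.

```python
import numpy as np
import tensorflow as tf

# Hypothetical expert dataset: states of shape (N, state_dim), discrete expert actions of shape (N,).
expert_states = np.load("expert_states.npy")
expert_actions = np.load("expert_actions.npy")

policy = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(expert_states.shape[1],)),
    tf.keras.layers.Dense(int(expert_actions.max()) + 1, activation="softmax"),
])
policy.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])

# Behavioral cloning = supervised learning: predict the expert's action from the state.
policy.fit(expert_states, expert_actions, epochs=10, batch_size=64, validation_split=0.1)
```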

  • Key issues- Covariate shift: learner drifts into unseen states → compounding error. If the learned policy deviates even slightly from the expert's trajectory, it might encounter states it has never seen in the training data. This distribution shift can lead to compounding error, where small errors accumulate over time, causing the agent to drift further and further from the desired behavior.

    • No reward → cannot surpass the teacher. The agent is limited by the expert's performance. Since it doesn't receive real-time rewards or feedback on its own actions, it cannot discover better strategies than those demonstrated by the expert.

    • Generalization limited by demo coverage; multimodal actions ignored. The agent's performance is highly dependent on the diversity and quality of the expert demonstrations. If the expert data doesn't cover all possible scenarios or if the expert's behavior is suboptimal, the learned policy will reflect these limitations.

  • Mitigations- Data augmentation or on-policy aggregation (e.g., DAgger). DAgger (Dataset Aggregation) iteratively collects new expert demonstrations on states visited by the learner's current policy, helping to bridge the covariate shift.

    • Powerful models + multi-task learning. Using expressive neural network architectures and potentially training on multiple related imitation tasks simultaneously can improve generalization.

Grand Challenges in RL

  • Algorithmic stability- Value-based: Issues like target delay (lag between main and target networks), replay size, clipping (reward or gradient clipping), and LR sensitivity (learning rate being too high or low) can cause unstable Q-value estimates and divergent training.

    • PG: High variance of gradient estimates requires careful variance reduction. Baseline design (e.g., choosing a good value function for advantage estimation) is crucial. Batch size impacts how accurately the gradient is estimated.

    • Model-based: Issues with model exploitation (policy over-optimizing for flaws in the learned model), and BPTT complexity (Backpropagation Through Time for long planning horizons) can make them difficult to train and unstable.

  • Sample complexity – expensive real-world interaction. RL algorithms typically require millions or even billions of environmental interactions, which is expensive or impossible for real-world systems (e.g., robots, clinical trials). This is a critical bottleneck for deploying RL.

  • Generalization – performance outside training distribution. Agents often fail to generalize effectively to states or environments slightly different from those encountered during training. This can be due to overfitting, lack of diverse training data, or inability to learn abstract, transferable skills.

  • Right problem formulation – source of supervision (rewards, preferences, language, demos…). Defining the source of supervision (e.g., designing effective reward functions, utilizing human preferences, incorporating language instructions, or leveraging expert demonstrations) is often the hardest part of applying RL. Poorly designed rewards can lead to unintended or suboptimal behavior (reward hacking).

  • Moravec’s Paradox – low-level perception & motor control still hard for AI. This phenomenon highlights that tasks easy for humans (e.g., real-time low-level perception, fine motor control) are difficult for AI, while tasks difficult for humans (e.g., complex calculations) are relatively easier. Grounding RL in the physical world remains a significant challenge.

  • Unexpected situations – desire agents that survive & adapt with minimal supervision. Current RL often requires extensive pre-training or finely tuned rewards for robust behavior, making it challenging for agents to handle novel or adverse conditions not seen during training.

  • Distinction between easy (closed-world, pure optimization) vs. hard universes (open-world, survival, no simulator). RL excels in highly controlled, deterministic or well-simulated environments. Real-world tasks are "open-world" – dynamic, unpredictable, often lacking perfect simulators, making current RL methods less effective.

Emerging Directions

  • Human preferences as reward signal (e.g., Deep RL from Human Preferences). Instead of manually crafting reward functions, human evaluators provide feedback (e.g., ranking trajectories). An RL agent then learns a reward model from these preferences, which is then used to train the policy.

  • Language guidance / latent language to shape objectives. Using natural language instructions or embedding a "language-like" latent space to guide exploration, define goals, or specify desired behaviors. This allows for more intuitive and flexible control of RL agents.

  • Large Language Models (LLMs) + RL- Use LLM as simulation of human conversation to train dialogue agents via RL on imagined data. LLMs can simulate user interactions or generate synthetic conversational data. RL agents can then be trained within this "simulated" conversational environment to improve dialogue policies.

    • RL selects actions maximizing desired outcomes beyond pure imitation. RL is used to fine-tune LLMs to align with human values, instructions, and preferences (e.g., RLHF - Reinforcement Learning from Human Feedback), moving beyond simple next-token prediction to more complex sequential decision-making that optimizes for abstract goals like helpfulness, harmlessness, and honesty.

Quick Road-Map of “RL 2” Content

  1. DQN → Double DQN → Dueling DQN → PER.

  2. Continuous extensions: DDPG → TD3 → SAC.

  3. Policy-Gradient path: Actor-Critic (A2C/A3C) → PPO.

  4. Multi-agent: MADDPG, QMIX.

Practical Tips & Takeaways

  • Always start with a simpler baseline (e.g., CartPole) before scaling. This ensures that the fundamental algorithm and setup are correct before tackling more complex environments, where debugging is significantly harder.

  • Track multiple seeds; watch for variance. Essential for robust evaluation of an algorithm's true performance. High variance indicates instability or sensitivity to random initialization/exploration.

  • Log return, loss, Q-max, entropy to diagnose.

    • Return: The primary metric for performance.

    • Loss: Indicates if the network is learning (decreasing loss) but can hide issues if the targets are unstable.

    • Q-max: Monitoring the maximum Q-value can indicate instability (e.g., exploding Q-values) or convergence.

    • Entropy: For PG methods, entropy indicates exploration. Decreasing entropy usually means the policy is converging to a more deterministic behavior.

  • Gradually integrate improvements (Double, Dueling, PER) – each adds complexity. Implement and test each enhancement individually to understand its impact and ensure it improves performance on your specific task.

  • For PG: ensure proper normalization of advantages; entropy coefficient tuning is crucial. Normalizing advantages helps stabilize training by keeping gradients at a reasonable scale. The entropy coefficient balances exploration vs. exploitation; tuning it allows for fine-grained control over stochasticity.
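
A typical advantage-normalization step looks like this (minimal NumPy sketch):

```python
import numpy as np

def normalize_advantages(advantages, eps=1e-8):
    """Zero-mean, unit-variance advantages keep the policy-gradient scale stable."""
    advantages = np.asarray(advantages, dtype=np.float32)
    return (advantages - advantages.mean()) / (advantages.std() + eps)
```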

  • GPU indispensable for 3-D sims or high-dim continuous tasks. Complex visual inputs and large policy/value networks require significant computational power for training.

  • Hyper-parameter search (learning rate, \gamma, batch, clip-range, \alpha entropy, \tau soft-update) \Rightarrow automatable via sweeping tools. Tools like Optuna, Ray Tune, Weights & Biases Sweeps can automate the process of finding optimal hyper-parameters, which is critical for good performance in RL.

Open Questions (Wrap-up)

  • How to specify reward robustly in evolving environments? This is a major challenge. Current research explores methods beyond manual reward shaping, such as Reward Learning from Human Preferences (where a reward function is learned from human comparative feedback on trajectories), Inverse Reinforcement Learning (IRL) (inferring the underlying reward function from expert demonstrations), Skill Discovery / Intrinsic Motivation (agents learning skills or intrinsic rewards for curiosity/novelty to explore without explicit external rewards), and Language-conditioned Rewards (using natural language to define or modify reward functions).

  • Can we achieve fully autonomous continual learning without resets? This refers to agents learning continuously in a never-ending environment, adapting to non-stationarity and avoiding catastrophic forgetting (loss of previously learned skills). While still an active research area, approaches include Elastic Weight Consolidation (EWC) and other regularization methods, replay-based methods storing and replaying past samples, modular architectures for different skills, and meta-learning for adaptation (learning to learn quickly to new tasks or changing dynamics).

  • Best way to leverage big, prior data to bootstrap RL exploration? Using large datasets (e.g., from human demonstrations, offline logs) can significantly reduce sample complexity and improve exploration. Key approaches include Offline RL (Batch RL), which learns a policy entirely from a fixed offline dataset without further interaction (challenges include distribution shift), Imitation Learning/Behavioral Cloning as Pre-training followed by online RL fine-tuning, model pre-training for model-based RL, and data augmentation or augmenting replay buffers with offline data.

  • How to remain robust as environment dynamics drift? This relates to domain generalization and robustness to real-world changes. Methods include Domain Randomization (training in diverse simulations with varying parameters to learn robust features), Adaptive Control / Meta-RL (designing agents that can quickly adapt their policy to new dynamics in real-time), System Identification / Online Adaptation (explicitly learning or adapting the environment's dynamics model online), Robust Optimization (formulating the RL problem to optimize performance against various potential disturbances), and Uncertainty-aware RL (agents explicitly model uncertainty in their predictions to guide more cautious actions).