What are the key components of a Partially Observable Stochastic Game (POSG)? Explain how a POSG differs from a standard Markov Decision Process (MDP) and why this more general framework is necessary for MARL.
N agents; state space; joint action space; transition function; individual reward functions; individual observations and an observation function; discount factor γ.
A POSG differs from an MDP in having N agents, individual rewards, individual (partial) observations, and a joint action: the transition depends on what every agent does, and each agent gets its own reward and observation. This more general framework is necessary for MARL because multiple agents act in the same environment, so the same joint action can look different to different agents. For example, a pass can be a good action for the passing agent, but if the receiving agent has not yet learned how to receive the ball, the same play looks like a bad action from the receiver's perspective.
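A minimal sketch of the POSG tuple as a data structure (the field names here are illustrative, not from any particular library):

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# Illustrative container for the POSG components listed above.
@dataclass
class POSG:
    n_agents: int                 # N agents
    states: Sequence              # state space S
    joint_actions: Sequence       # joint action space A = A_1 x ... x A_N
    transition: Callable          # T(s, joint_a) -> distribution over next states
    rewards: Sequence[Callable]   # R_i(s, joint_a, s') for each agent i
    observations: Sequence        # observation spaces O_1, ..., O_N
    obs_fn: Callable              # O(s, joint_a) -> (o_1, ..., o_N), one per agent
    gamma: float                  # discount factor
```

An MDP is the special case with n_agents = 1 and full observability (the single agent observes the state directly).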
Considering the components of a POSG, particularly the reward functions, explain how one can formally distinguish between a fully cooperative, a fully competitive (zero-sum), and a general-sum (mixed-motive) multi-agent environment.
Fully cooperative → All agents share the same reward function: they pursue a common goal and receive the same reward for their collective actions.
Fully competitive (zero-sum) → One agent's gain is another agent's loss: for every state and joint action, the agents' rewards sum to zero. With two agents, the rewards are exact negatives of each other.
General-sum → Has elements of both cooperation and competition. One agent's success does not require another agent's failure, and no fixed relationship is imposed on the individual reward functions. (The three conditions are sketched formally below.)
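A minimal formal sketch of these conditions on the reward functions R_1, ..., R_N (s is a state, a the joint action):

```latex
\begin{align*}
\text{Fully cooperative:} \quad & R_1(s,\mathbf{a}) = R_2(s,\mathbf{a}) = \dots = R_N(s,\mathbf{a}) && \forall s,\mathbf{a} \\
\text{Fully competitive (zero-sum):} \quad & \sum_{i=1}^{N} R_i(s,\mathbf{a}) = 0 && \forall s,\mathbf{a}
  \qquad (\text{for } N=2:\ R_1 = -R_2) \\
\text{General-sum:} \quad & \text{no constraint relating the } R_i
\end{align*}
```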
Explain the concept of “non-stationarity” in the context of Independent Learners (e.g., IDQN, IPPO) in MARL. Why does this phenomenon pose a significant challenge to the convergence and performance of these algorithms?
Non-stationarity is a problem that arises when multiple independent agents learn simultaneously. From the perspective of a single agent, the other agents are part of the environment; because their policies change as they learn, the environment the agent faces becomes non-stationary, with shifting dynamics and reward probabilities.
This is challenging because it violates the stationarity assumption that single-agent methods rely on: the MDP model assumes a fixed transition and reward function. An action that was good at one point in training may become bad later, and learning becomes unstable because the target is always shifting, making it difficult for the values and policy to converge. (A short formal sketch follows below.)
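A minimal sketch of why the environment becomes non-stationary, in standard notation (T is the true joint transition function, π_j the policies of the other agents): the effective transition function seen by agent i is

```latex
\[
T_i\big(s' \mid s, a_i\big) \;=\; \sum_{\mathbf{a}_{-i}} \Big(\prod_{j \neq i} \pi_j(a_j \mid s)\Big)\, T\big(s' \mid s, a_i, \mathbf{a}_{-i}\big),
\]
```

so whenever the other agents update their policies π_j, the dynamics T_i that agent i is learning against change as well.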
Describe the core idea behind the Centralized Training for Decentralized Execution (CTDE) paradigm. What are the main advantages it aims to achieve compared to purely Decentralized Training & Decen-tralized Execution (DTDE) and purely Centralized Training & Centralized Execution (CTCE)?
CTDE makes use of centrally available information during training to learn effective, coordinated policies that the agents can then execute on their own at runtime. During the training phase, a centralized mechanism can access global information such as the full environment state or the joint actions of all agents, which helps overcome the non-stationarity problem. During execution, each agent runs independently using only its local observations, without help from a central controller.
Advantage over DTDE
Improved coordination and stability: in DTDE, agents treat each other as part of the environment and therefore suffer from the non-stationarity problem described above.
Advantage over CTCE
CTCE suffers from a joint action space that explodes with the number of agents (e.g., 10 agents with 5 actions each give 5^10 ≈ 9.8 million joint actions), making it impractical for many problems, and it also needs a central controller at runtime, which rules it out in settings where agents must act on local information only. CTDE avoids both issues.
Explain the Value Decomposition Network (VDN) algorithm. Write down its core assumption regarding the joint action-value function (Qtot) and discuss one key limitation of this assumption.
The Value Decomposition Network is a value-based CTDE method for teaching a team of agents to work together: the agents learn as a group but make decisions on their own at runtime. Each agent has its own individual Q-network, and the TD error is computed on the sum of the individual Q-values against the shared team reward.
Assumption
The joint action-value function decomposes additively across agents: Qtot(τ, a) = Q1(τ1, a1) + Q2(τ2, a2) + ... + QN(τN, aN). Because Qtot is a plain sum, each agent can act greedily on its own local Q-values and still maximize the team value.
Limitation
Because of the simple additive assumption, complex synergies and interactions between agents cannot be modeled; the decomposition is too restrictive, e.g. it cannot express cases where the value of one agent's action depends on what another agent does. (A minimal sketch of the VDN loss is given below.)
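A minimal sketch of the VDN TD loss, assuming per-agent Q-networks `q_nets`, target copies `target_q_nets`, and a batch with per-agent observations and actions plus the shared team reward (all names here are illustrative):

```python
import torch
import torch.nn as nn

def vdn_loss(q_nets, target_q_nets, obs, actions, team_reward, next_obs, done, gamma=0.99):
    # VDN's additive decomposition: Q_tot(s, a) = sum_i Q_i(o_i, a_i)
    q_tot = sum(
        q_net(o).gather(1, a.unsqueeze(1)).squeeze(1)
        for q_net, o, a in zip(q_nets, obs, actions)
    )
    with torch.no_grad():
        # Each agent maximizing its own Q_i also maximizes the sum.
        target_q_tot = sum(
            tq(o2).max(dim=1).values for tq, o2 in zip(target_q_nets, next_obs)
        )
        td_target = team_reward + gamma * (1 - done) * target_q_tot
    # One shared TD error on the summed value, driven by the single team reward.
    return nn.functional.mse_loss(q_tot, td_target)
```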
What is the Individual-Global-Max (IGM) principle, and how does the QMIX algorithm attempt to satisfy it? Explain the role of the monotonic mixing network in QMIX.
The IGM (Individual-Global-Max) principle states that the joint action maximizing the team's Qtot is exactly the combination of each agent's individually greedy action: taking the argmax of each individual Qi yields the argmax of Qtot.
QMIX satisfies this principle by using a special monotonic mixing network.
The mixing network combines the individual Q-values of each agent into a total team Q-value Qtot, conditioned on the global state.
It is monotonic: its mixing weights (produced by hypernetworks from the global state) are constrained to be non-negative, which guarantees that if any agent's individual Q-value increases, the total team value cannot decrease (∂Qtot/∂Qi ≥ 0).
This means that if the agents act greedily to maximize their own Q-values, they also maximize the team's value, satisfying the IGM principle. (A minimal sketch of such a mixer is shown below.)
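A minimal sketch of a QMIX-style monotonic mixing network (layer sizes and names are illustrative). The hypernetworks map the global state to the mixing weights, and taking their absolute value keeps every weight non-negative, enforcing ∂Qtot/∂Qi ≥ 0:

```python
import torch
import torch.nn as nn

class MonotonicMixer(nn.Module):
    def __init__(self, n_agents, state_dim, embed_dim=32):
        super().__init__()
        self.n_agents, self.embed_dim = n_agents, embed_dim
        # Hypernetworks: global state -> mixing weights and biases
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)
        self.hyper_b1 = nn.Linear(state_dim, embed_dim)
        self.hyper_w2 = nn.Linear(state_dim, embed_dim)
        self.hyper_b2 = nn.Sequential(nn.Linear(state_dim, embed_dim), nn.ReLU(),
                                      nn.Linear(embed_dim, 1))

    def forward(self, agent_qs, state):
        # agent_qs: [batch, n_agents], state: [batch, state_dim]
        w1 = torch.abs(self.hyper_w1(state)).view(-1, self.n_agents, self.embed_dim)
        b1 = self.hyper_b1(state).view(-1, 1, self.embed_dim)
        hidden = torch.relu(torch.bmm(agent_qs.unsqueeze(1), w1) + b1)
        w2 = torch.abs(self.hyper_w2(state)).view(-1, self.embed_dim, 1)
        b2 = self.hyper_b2(state).view(-1, 1, 1)
        q_tot = torch.bmm(hidden, w2) + b2   # non-negative weights => monotonic in each Q_i
        return q_tot.view(-1)                # Q_tot per batch element
```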
Among the Centralized Training for Decentralized Execution (CTDE) algorithms discussed, identify one specifically designed to operate effectively in mixed cooperative-competitive environments. Describe the key architectural component(s) of this algorithm that enable this capability.
The Multi-Agent Deep Deterministic Policy Gradient (MADDPG) algorithm is designed for mixed cooperative-competitive environments.
Its key architectural component is the centralized critic: each agent's critic is trained on the observations and actions of all agents, giving it the full context needed to model the behaviour of both teammates and opponents, while each agent's actor stays decentralized and acts only on its own local observation at execution time. This enables effective learning in complex mixed settings.
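A minimal sketch of a MADDPG-style centralized critic for one agent (dimensions and names are illustrative):

```python
import torch
import torch.nn as nn

class CentralizedCritic(nn.Module):
    """Scores agent i's return from the joint observations and joint actions."""
    def __init__(self, obs_dims, act_dims, hidden=128):
        super().__init__()
        in_dim = sum(obs_dims) + sum(act_dims)   # all agents' observations + actions
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),                # Q_i(o_1..o_N, a_1..a_N)
        )

    def forward(self, all_obs, all_actions):
        x = torch.cat(list(all_obs) + list(all_actions), dim=-1)
        return self.net(x).squeeze(-1)
```

The actor for agent i, in contrast, maps only agent i's own observation to its action, so no central information is needed at execution time.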
Explain the credit assignment problem in cooperative MARL. Describe an algorithmic technique that employs a “counterfactual baseline” to address this problem.
The credit assignment problem is figuring out which agent's action deserves credit for a shared team reward.
The COMA algorithm solves this using a counterfactual baseline.
It measures an agent's contribution by comparing the centralized critic's value for the joint action actually taken against a counterfactual baseline in which only that agent's action is replaced (averaged over the agent's own policy) while the other agents' actions are held fixed. The difference isolates the value of that agent's specific action; the resulting advantage is sketched below.
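A sketch of COMA's counterfactual advantage for agent i, in standard notation (Q is the centralized critic, τ^i agent i's observation history, a^{-i} the other agents' actions):

```latex
\[
A^{i}(s, \mathbf{a}) \;=\; Q(s, \mathbf{a}) \;-\; \sum_{a'^{\,i}} \pi^{i}\big(a'^{\,i} \mid \tau^{i}\big)\, Q\big(s, (\mathbf{a}^{-i}, a'^{\,i})\big)
\]
```

The subtracted term is the expected team value if agent i had instead sampled its action from its own policy, which is exactly the "what if only this agent had acted differently" baseline.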
Explain how a policy-based CTDE algorithm, typically used for cooperative tasks, might leverage a centralized value function during training to improve the learning of decentralized policies. How does this differ from an approach where each agent has its own distinct critic designed for potentially individual rewards?
In cooperative tasks, a policy-based Centralized Training for Decentralized Execution (CTDE) algorithm like MAPPO uses a single, centralized value function V(s).
This function estimates the expected shared team return from a given (global) state. During training, all agents use this same value estimate as the baseline when computing their individual policy updates. Because everyone learns against the same stable, consistent signal, training is less noisy and coordination toward the common goal is easier.
This differs from an approach like MADDPG, where each agent has its own distinct critic Qi. That critic is designed to learn the individual agent's reward ri, which may differ from the other agents' rewards. Per-agent critics are necessary in mixed-motive or competitive settings where agents have different, non-shared goals. (A minimal sketch of the shared value function is given below.)
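A minimal sketch of the shared centralized critic in a MAPPO-style setup (sizes and names are illustrative). One V(s) network evaluates the global state, and every agent's advantage is computed against this same estimate, while each decentralized actor still only sees its own local observation:

```python
import torch
import torch.nn as nn

class CentralValue(nn.Module):
    """Single value head V(s) shared by all agents during training."""
    def __init__(self, state_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, global_state):
        return self.net(global_state).squeeze(-1)   # V(s)

# One-step advantage sketch (MAPPO uses GAE in practice); the same V for every agent:
#   A_t = r_team_t + gamma * V(s_{t+1}) - V(s_t)
def shared_advantage(value_fn, s_t, s_next, r_team, done, gamma=0.99):
    with torch.no_grad():
        return r_team + gamma * (1 - done) * value_fn(s_next) - value_fn(s_t)
```

In a MADDPG-style setup one would instead instantiate N separate critics, each trained on its own agent's reward ri.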