Scenario used to ground all subsequent agent concepts, offering a tangible example to understand agent decision-making in a dynamic environment.
Environment = simplified Blackjack game, where the agent's goal is to maximize reward (winning).
Dealer showing card (e.g., 10): this is part of the observable state that the agent uses to make decisions.
Player current sum (e.g., 14): another critical component of the state, representing the player's current hand value.
Goal: build an autonomous agent that decides whether to HIT (take another card) or STICK (stop and receive no more cards), aiming to get as close to 21 as possible without exceeding it.
Card values reminder:
Number cards 2\dots 10 keep their face value, straightforwardly contributing to the sum.
Face cards J,Q,K all worth 10, simplifying their value for calculation.
Ace can be 1 or 11 (usable-ace condition): this introduces a dynamic element to the hand value, allowing for strategic flexibility. A "usable ace" means the ace can be counted as 11 without busting.
Moves sequence:
Four cards dealt initially: two to the player and two to the dealer (one of the dealer's cards face-up, the other face-down).
Each turn player chooses Hit or Stick: this is the core decision point for the agent.
Dealer plays, game ends with Win / Draw / Lose: outcomes are determined based on specific Blackjack rules after the agent and dealer have finished their turns.
Key performance concept: Policy \pi(a\,|\,s) mapping states to actions, which dictates the agent's behavior for every possible state it encounters.
First-visit, episodic MC algorithm: within each episode, only the return following the first visit to a state (or state-action pair) is used to update its value estimate.
Core ingredients:
Exploration / Exploitation trade-off handled via many random start episodes: The agent needs to explore different actions to discover optimal strategies while also exploiting known good actions. MC methods often use techniques like epsilon-greedy exploration.
Large number of simulated episodes required $\rightarrow$ parallelizable: Due to the stochastic nature of card games and the need for comprehensive state-action value estimation, many game simulations are needed, which can be run concurrently.
State representation grid (dealer showing card $\times$ player sum $\times$ usable-ace boolean) was shown with:
1 = Hit, 0 = Stick: this grid visually represents the learned optimal policy for each possible state.
Demonstrates learned policy surface: A visualization showing the optimal action (hit or stick) for every combination of player sum, dealer card, and usable ace status.
Update rule (Monte-Carlo):
V(s) \leftarrow V(s) + \alpha\,(G_t - V(s)), where G_t is the return (total reward) observed from the first visit to state s in an episode, and \alpha is the learning rate. This rule moves the estimated value of a state toward the actual returns observed.
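To make the update rule concrete, below is a minimal, self-contained sketch of constant-\alpha, first-visit Monte-Carlo evaluation of a fixed "stick on 20 or 21" policy on a simplified Blackjack hand. The dealing logic and the helper names (draw_card, play_episode, stick_on_20) are illustrative assumptions, not the exact code of the lecture notebook.

```python
import random
from collections import defaultdict

ALPHA = 0.05  # constant learning rate

def draw_card():
    # Infinite deck: 2-10 at face value, J/Q/K count as 10, an Ace is drawn as 1.
    return min(random.randint(1, 13), 10)

def hand_value(cards):
    # A "usable ace" means an ace can be counted as 11 without busting.
    total = sum(cards)
    usable_ace = 1 in cards and total + 10 <= 21
    return (total + 10 if usable_ace else total), usable_ace

def play_episode(policy):
    """Play one simplified hand; return the visited states and the terminal reward.
    The task is undiscounted and all reward arrives at the end, so the return G
    from every visited state equals that terminal reward."""
    dealer_showing = draw_card()
    player = [draw_card(), draw_card()]
    states = []
    while True:                                   # player's turn
        total, usable_ace = hand_value(player)
        if total > 21:
            return states, -1.0                   # player busts
        states.append((total, dealer_showing, usable_ace))
        if policy(states[-1]) == "STICK":
            break
        player.append(draw_card())
    dealer = [dealer_showing, draw_card()]        # dealer's turn: hit below 17
    while hand_value(dealer)[0] < 17:
        dealer.append(draw_card())
    player_total, dealer_total = hand_value(player)[0], hand_value(dealer)[0]
    if dealer_total > 21 or player_total > dealer_total:
        return states, 1.0
    return states, 0.0 if player_total == dealer_total else -1.0

def stick_on_20(state):
    return "STICK" if state[0] >= 20 else "HIT"   # fixed policy being evaluated

V = defaultdict(float)
for _ in range(200_000):
    states, G = play_episode(stick_on_20)
    seen = set()
    for s in states:                              # first-visit: one update per state per episode
        if s not in seen:
            seen.add(s)
            V[s] += ALPHA * (G - V[s])            # V(s) <- V(s) + alpha * (G - V(s))

print(round(V[(20, 10, False)], 3))  # value of "player 20 vs dealer 10, no usable ace"
```

Because the many simulated episodes are independent, this inner loop is exactly the part that can be parallelized across workers.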
Observation: Internet/LLMs already contain Blackjack domain knowledge $\rightarrow$ opportunity for zero-shot agents. This suggests that LLMs, pre-trained on vast text corpora, can leverage existing human knowledge about game strategies without explicit RL training.
Idea $\rightarrow$ Ask an LLM (GPT-4, Claude, etc.) directly for optimal action without RL training: The LLM acts as a direct policy function, providing a decision given a state description.
Files referenced: LangChain notebook 090_LLM_Blackjack_langchain.ipynb, illustrating how LLMs can be integrated into agent workflows (a minimal prompt-as-policy sketch follows below).
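As a hedged illustration of this "prompt as policy" idea (not a reproduction of the referenced LangChain notebook), the snippet below asks a chat model directly for a HIT/STICK decision via the OpenAI Python client; the model name, prompt wording, and parsing are assumptions.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def llm_blackjack_policy(player_sum: int, dealer_showing: int, usable_ace: bool) -> str:
    """Ask the LLM for an action; returns 'HIT' or 'STICK'."""
    prompt = (
        "You are playing Blackjack. "
        f"Your current sum is {player_sum}, the dealer shows {dealer_showing}, "
        f"and you {'have' if usable_ace else 'do not have'} a usable ace. "
        "Answer with exactly one word: HIT or STICK."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",                      # assumed model name; any chat model works
        messages=[{"role": "user", "content": prompt}],
    )
    answer = response.choices[0].message.content.strip().upper()
    return "HIT" if "HIT" in answer else "STICK"

print(llm_blackjack_policy(14, 10, usable_ace=False))
```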
Analogy to existing research on Atari-GPT (generative policy learned from textual demonstration): Similar to how Atari-GPT learned to play games from text descriptions and strategy guides, LLMs can leverage their understanding of natural language to interpret game rules and optimal strategies.
Explosion of demo videos (Salesforce, Google, DeepMind): Many impressive demonstrations showcase AI agents performing complex tasks, often leading to high expectations.
Key question: are current “agents” real autonomy or orchestrated prompts? This addresses the skepticism regarding whether these systems genuinely exhibit intelligent, autonomous behavior or if their impressive feats are primarily due to clever prompt engineering and scripting.
1957 Bellman: Dynamic Programming, V(s) = \max_a \sum_{s',r} p(s',r\,|\,s,a)\,[\,r + \gamma V(s')\,]. Introduced fundamental concepts for solving sequential decision-making problems by breaking them into smaller subproblems.
1988 Sutton: Temporal-Difference (TD) Learning. Introduced learning methods that do not require a model of the environment's dynamics, learning directly from experience by bootstrapping from estimated values.
1992 Watkins: Q-Learning, Q_{t+1}(s,a) = Q_t(s,a) + \alpha\,[\,r + \gamma \max_{a'} Q_t(s',a') - Q_t(s,a)\,], a model-free reinforcement learning algorithm that learns an action-value function giving the expected utility of taking a given action in a given state.
1995 Tesauro: TD-Gammon. A seminal achievement demonstrating TD learning's power, achieving superhuman performance in Backgammon.
2013 DQN on Atari (Deep Q-Network). Combined Q-learning with deep neural networks, enabling agents to learn directly from high-dimensional sensory input like raw pixel data, leading to breakthrough performance in various Atari games.
2017 AlphaGo / AlphaZero; 2019 AlphaStar & AlphaFold. AlphaGo defeated human Go champions, AlphaZero generalized this to other games (chess, shogi) by learning from self-play, AlphaStar excelled in StarCraft II, and AlphaFold revolutionized protein folding prediction.
2023-2025 LLM-centric web agents: WebAgent, WebLINX, WebVoyager, MCP, >100 frameworks. A recent explosion in agents that integrate LLMs for robust reasoning and interaction with complex web environments, enabling broader applications.
Shift described by Zaharia et al. (2024): from single models to compound AI systems. This paradigm shift involves integrating multiple AI components, including LLMs, specialized tools, and memory modules, to achieve more complex goals.
Evolution of prompting methods:
CoT (Chain-of-Thought): Guides LLMs to perform multi-step reasoning by prompting them to generate intermediate thoughts before giving a final answer.
Zero-shot CoT: Achieves CoT reasoning without specific examples, relying solely on the LLM's pre-trained knowledge and a simple prompt like "Let's think step by step."
Self-Consistency: Improves CoT by generating multiple reasoning paths and then selecting the most consistent answer across these paths.
ReAct (Reason & Act) = combine natural-language reasoning + tool use: This method allows LLMs to interleave reasoning (planning, problem-solving) with tool usage (e.g., search, calculator) in a flexible manner, enabling more dynamic and capable agents; a minimal loop sketch appears just below.
New capabilities: memory (to retain information over time), planning (to sequence actions strategically), multi-agent collaboration (where agents work together on complex tasks), robotics (for physical world interaction), scientific discovery (automating research processes).
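To ground the ReAct pattern mentioned above, here is a compact, hedged sketch of the reason-act-observe loop with a single calculator tool; the prompt format, tool name, and model name are illustrative assumptions, not the paper's exact setup.

```python
import re
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

# One illustrative tool; names and prompt format are assumptions.
TOOLS = {"calculator": lambda expr: str(eval(expr, {"__builtins__": {}}))}

SYSTEM = (
    "Answer the question by interleaving lines of the form:\n"
    "Thought: <your reasoning>\n"
    "Action: calculator[<python expression>]\n"
    "After an Observation line is provided, continue. "
    "When done, reply with: Final Answer: <answer>"
)

def react(question: str, max_steps: int = 5) -> str:
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        reply = client.chat.completions.create(
            model="gpt-4o-mini",   # assumed model name
            messages=[{"role": "system", "content": SYSTEM},
                      {"role": "user", "content": transcript}],
        ).choices[0].message.content
        transcript += reply + "\n"
        if "Final Answer:" in reply:
            return reply.split("Final Answer:")[-1].strip()
        match = re.search(r"Action:\s*calculator\[(.+?)\]", reply)
        if match:  # run the tool and feed the observation back to the model
            transcript += f"Observation: {TOOLS['calculator'](match.group(1))}\n"
    return "No answer within step limit"

print(react("What is 17 * 23 + 5?"))
```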
Autonomous software entity able to Perceive $\rightarrow$ Decide/Reason $\rightarrow$ Act. This is the fundamental cycle of an intelligent agent: gathering information from its environment, processing that information to make a decision, and then performing an action.
Must be: Autonomous (self-governing), Proactive (initiating actions, not just reacting), Reactive (responding to environmental changes), Social (capable of interacting with other agents or humans).
Comparison vs. traditional software:
Traditional = deterministic (predictable output for given input), static (fixed behavior), user-initiated (requires human command).
Agents = adaptive (learn and change behavior), self-improving (get better over time), capable of pattern recognition & natural-language interaction (understand and produce human language, identify complex trends).
Memory: essential for maintaining state and learning from past experiences.
Short-term: typically for immediate context, like the current conversation turns.
Long-term: for persistent knowledge and learned experiences, often stored in vector databases or similar structures.
Planning module (Reflection, Self-critique, Sub-goal decomposition): enables the agent to break down complex tasks into manageable steps, evaluate its own progress, and refine its strategy.
Action executor (tool calls, code exec, search, calculator, calendar, etc.): the mechanism by which the agent interacts with its environment, performing actions by calling external functions or services.
Example tool functions: Calendar() (for scheduling), Calculator() (for numerical computations), CodeInterpreter() (for executing code to solve problems or analyze data); a sketch of a simple tool registry follows below.
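A minimal sketch of how such tool functions might be implemented and registered so an action executor can dispatch to them; the registry layout and function names follow the examples above but are otherwise assumptions.

```python
import datetime

def calendar_tool() -> str:
    """Return today's date (stand-in for a scheduling API)."""
    return datetime.date.today().isoformat()

def calculator_tool(expression: str) -> str:
    """Evaluate a basic arithmetic expression."""
    return str(eval(expression, {"__builtins__": {}}))

def code_interpreter_tool(source: str) -> str:
    """Execute a short Python snippet and return whatever it stores in `result`."""
    scope: dict = {}
    exec(source, scope)          # in a real system this would run in a sandbox
    return str(scope.get("result"))

# Registry the agent's action executor can dispatch against.
TOOL_REGISTRY = {
    "Calendar": calendar_tool,
    "Calculator": calculator_tool,
    "CodeInterpreter": code_interpreter_tool,
}

def execute_tool(name: str, *args: str) -> str:
    return TOOL_REGISTRY[name](*args)

print(execute_tool("Calculator", "2 * (3 + 4)"))
print(execute_tool("CodeInterpreter", "result = sum(range(10))"))
```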
Reactive (Stimulus–Response)
No internal state, immediate mapping o_t \rightarrow a_t: These agents directly map observations to actions based on predefined rules or learned associations, without maintaining a model of the world or a plan.
Useful as API wrappers: Simple and efficient for tasks that involve a direct response to a specific input.
Learning Agents
Data collection, RL training, policy improvement: These agents continuously gather data, learn from experience (often via reinforcement learning), and update their internal policies to improve performance over time.
Expensive & environment-specific but self-improving: Requires significant computational resources and interaction with the environment, but enables agents to achieve high levels of performance in their specific domain.
Deliberative (BDI – Beliefs/Desires/Intentions)
Maintains explicit world model, performs symbolic planning: These agents possess a detailed internal representation of their environment, a set of goals they wish to achieve (desires), and intentions (committed plans). They engage in complex reasoning and explicit planning.
Philosophically powerful but computationally heavy: Offers a robust framework for complex intelligence but can be resource-intensive due to the explicit reasoning and planning involved.
Hybrid
Layers of the three above; orchestration of sub-agents or external tools: Combines the strengths of reactive, learning, and deliberative approaches. Often involves a high-level deliberative planner orchestrating reactive components and learning modules, or coordinating specialized sub-agents.
Two main categories:
Web Agents
Observation = rendered page + DOM tree: These agents perceive web pages as humans do, by parsing the visual layout and the underlying Document Object Model structure.
Actions = clicks, form fill, keystrokes: They interact with web applications by simulating user actions like clicking buttons, entering text into forms, or typing.
Pros: human-level capability (can navigate and interact with almost any website); Cons: latency (can be slow due to web page loading times), brittleness (susceptible to changes in website layout or structure).
API Agents
Observation = structured API responses, vectors, chat logs: These agents interact with services via well-defined APIs, receiving structured data in response.
Actions = API calls: They perform actions by sending requests to specific API endpoints.
Pros: low latency (API calls are generally fast), safer (interactions are constrained by the API's design, reducing unintended side effects); Cons: requires pre-existing APIs (cannot interact with services that lack a defined API).
Scripted Workflows: Simple, predefined sequences of automated tasks, typically rule-based and requiring manual configuration.
RPA Bots (UI automation): Robots that mimic human interactions with software interfaces, automating repetitive tasks across different applications.
Conversational Workflows (LLM, RAG): Systems that use LLMs to engage in natural language conversations, often augmented with Retrieval-Augmented Generation (RAG) to access and synthesize information from internal knowledge bases.
Agentic Workflows – iterative, self-reflective reasoning agents orchestrating tasks: The most advanced level, where autonomous agents can execute complex, multi-step workflows, adapting to dynamic situations and self-correcting their plans as needed.
Goals: modularity (breaking down agents into distinct, interchangeable components), extensibility (easily adding new capabilities), human-in-the-loop (allowing human oversight and intervention), rapid experimentation (facilitating quick iteration and testing of agent designs).
Core elements for each agent:
LLM Brain (GPT-4, Claude-3, etc.): The large language model serves as the core reasoning and generation engine for the agent.
Tools / Skills: Specific functions or APIs that the agent can call to interact with the external world (e.g., web scraping, database access, code execution).
Behaviour Description via prompt templates: The agent's desired behavior, roles, and constraints are defined through carefully crafted prompts.
Memory persistence layer: Enables the agent to store and retrieve past interactions, observations, and learned knowledge, providing context for future decisions.
Coordination logic (planner / orchestrator / router): Mechanisms that manage how the agent sequences tasks, selects tools, handles external interactions, and potentially collaborates with other agents.
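To illustrate how these core elements compose, here is a deliberately minimal agent skeleton; the class, field names, and the "tool_name: argument" convention are illustrative assumptions rather than any particular framework's API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class MiniAgent:
    """Toy composition of the core elements: LLM brain, tools, behaviour prompt, memory."""
    llm: Callable[[str], str]                        # the "brain": prompt in, text out
    tools: Dict[str, Callable[[str], str]]           # callable skills the agent may invoke
    behaviour: str                                   # role / constraints via a prompt template
    memory: List[str] = field(default_factory=list)  # short-term memory of past turns

    def step(self, observation: str) -> str:
        # Coordination logic: build the prompt from behaviour, memory, and the new observation.
        prompt = "\n".join([self.behaviour, *self.memory, f"Observation: {observation}", "Action:"])
        decision = self.llm(prompt)
        # Very simple router: "tool_name: argument" invokes a tool, anything else is a final reply.
        if ":" in decision and decision.split(":", 1)[0] in self.tools:
            name, arg = decision.split(":", 1)
            decision = self.tools[name](arg.strip())
        self.memory.append(f"Observation: {observation} -> Action: {decision}")
        return decision

# Usage with a stubbed "LLM" so the sketch runs offline.
agent = MiniAgent(
    llm=lambda prompt: "calculator: 6 * 7",
    tools={"calculator": lambda expr: str(eval(expr, {"__builtins__": {}}))},
    behaviour="You are a helpful analyst. Use tools when arithmetic is needed.",
)
print(agent.step("What is 6 times 7?"))   # -> "42"
print(agent.memory)
```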
A framework for building multi-agent systems, demonstrating how specialized agents can cooperate to achieve a larger goal.
Three cooperating agents:
Query-Parser (Stock Data Analyst) $\rightarrow$ Pydantic-typed dictionary: An agent specialized in understanding natural language queries related to stock data and converting them into a structured, machine-readable format.
Code-Writer (Senior Python Dev) $\rightarrow$ produces script: An agent responsible for generating Python code based on the parsed query, for example, to fetch and process stock data.
Code-Executor / Plotter (implicit) $\rightarrow$ runs script & returns figure: An agent or component that executes the generated code and, if the task involves visualization, renders and returns a plot or figure.
Demonstrated end-to-end pipeline: “Plot 2024 stock values of IBM & TSLA”, showing how the agents seamlessly hand off tasks to complete a complex request.
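As a hedged sketch of the hand-off between the Query-Parser and the Code-Writer, the Pydantic-typed dictionary for that request might look as follows; the field names are assumptions inferred from the example query.

```python
from datetime import date
from typing import List
from pydantic import BaseModel

class StockQuery(BaseModel):
    """Structured output of the Query-Parser agent, consumed by the Code-Writer agent."""
    tickers: List[str]          # e.g. ["IBM", "TSLA"]
    start_date: date
    end_date: date
    metric: str = "close"       # which price series to plot

# What "Plot 2024 stock values of IBM & TSLA" might parse into:
query = StockQuery(
    tickers=["IBM", "TSLA"],
    start_date=date(2024, 1, 1),
    end_date=date(2024, 12, 31),
)
print(query.model_dump())  # pydantic v2; use .dict() on v1
```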
Repository OktayBalaban/Wordle_Bot: An open-source example illustrating an agent that iteratively plays the popular word game Wordle.
Agent iteratively guesses, observes colored feedback, updates word list: The bot proposes a word, interprets the color-coded feedback (green, yellow, gray), and uses this information to narrow down the possible correct words for the next guess, demonstrating perception, reasoning, and action in a constrained environment.
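Independent of the linked repository's actual implementation, the core filtering step such a bot performs can be sketched as follows: given a guess and its observed feedback, keep only the candidate words that would have produced that feedback (duplicate-letter handling is simplified here).

```python
def feedback(guess: str, answer: str) -> str:
    """Wordle-style feedback: 'g' = green (right spot), 'y' = yellow (in word), '-' = gray."""
    result = []
    for i, ch in enumerate(guess):
        if answer[i] == ch:
            result.append("g")
        elif ch in answer:
            result.append("y")
        else:
            result.append("-")
    return "".join(result)

def prune(candidates: list[str], guess: str, observed: str) -> list[str]:
    """Keep only words that would have produced the observed feedback for this guess."""
    return [w for w in candidates if feedback(guess, w) == observed]

words = ["crane", "slate", "cable", "eagle", "chase"]
# Suppose we guessed "crane" and saw: green C, gray R, yellow A, gray N, green E.
print(prune(words, "crane", "g-y-g"))   # -> ["cable"]
```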
LangGraph is a library designed to build stateful, multi-actor applications with LLMs, providing a flexible way to define complex agentic workflows.
Multi-agent routing pattern:
Researcher $\rightarrow$ may issue a search() function call or finish: An agent specialized in gathering information, which can decide to perform a web search if more data is needed or conclude its task once the query is satisfied.
Chart-Generator $\rightarrow$ transforms data into visualization: An agent focused on data visualization, receiving structured data and creating charts or graphs.
Router node decides next hop based on agent messages ("continue" / "FINAL ANSWER"): A central component that directs the flow of execution between different agents or decides when the overall task is complete, based on explicit signals from the agents.
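A simplified, hedged sketch of this routing pattern with LangGraph's StateGraph; the node bodies are stubs and the exact LangGraph API surface may differ slightly between versions.

```python
from typing import TypedDict
from langgraph.graph import StateGraph, END

class AgentState(TypedDict):
    messages: list[str]

def researcher(state: AgentState) -> AgentState:
    # Stub: a real node would call an LLM and possibly a search() tool.
    return {"messages": state["messages"] + ["researcher: data gathered, continue"]}

def chart_generator(state: AgentState) -> AgentState:
    # Stub: a real node would turn the gathered data into a figure.
    return {"messages": state["messages"] + ["chart_generator: FINAL ANSWER"]}

def router(state: AgentState) -> str:
    # The router inspects the last message to decide the next hop.
    return "finish" if "FINAL ANSWER" in state["messages"][-1] else "continue"

graph = StateGraph(AgentState)
graph.add_node("researcher", researcher)
graph.add_node("chart_generator", chart_generator)
graph.set_entry_point("researcher")
graph.add_conditional_edges("researcher", router, {"continue": "chart_generator", "finish": END})
graph.add_conditional_edges("chart_generator", router, {"continue": "researcher", "finish": END})

app = graph.compile()
print(app.invoke({"messages": ["user: Plot 2024 stock values of IBM & TSLA"]}))
```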
Motivation: standardize agent-to-agent & agent-to-tool interaction. Aims to provide a universal language for how AI components communicate and collaborate, addressing the fragmentation in the agent ecosystem.
Client–Server architecture:
MCP Client embedded in host app, interpolates prompts, discovers tools: This client-side component acts as an interface, preparing prompts for models and identifying available tools.
MCP Server exposes: Tools (model-controlled), Resources (data), Prompts (templates): The server-side component publishes various capabilities that agents can utilize, defining how models can control tools, access data, and utilize predefined prompt structures.
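To make the exchange concrete, here is an illustrative JSON-RPC-style message sequence written as Python dicts; the method names (tools/list, tools/call) follow the MCP specification, but the example tool and exact payload shapes are assumptions.

```python
# Client -> server: discover what tools the server exposes.
list_tools_request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/list",
}

# Server -> client: one advertised tool with a JSON-Schema for its arguments.
list_tools_response = {
    "jsonrpc": "2.0",
    "id": 1,
    "result": {
        "tools": [{
            "name": "get_stock_price",          # illustrative tool, not from the spec
            "description": "Return the latest closing price for a ticker symbol.",
            "inputSchema": {
                "type": "object",
                "properties": {"ticker": {"type": "string"}},
                "required": ["ticker"],
            },
        }]
    },
}

# Client -> server: the model decides to call the discovered tool with concrete arguments.
call_tool_request = {
    "jsonrpc": "2.0",
    "id": 2,
    "method": "tools/call",
    "params": {"name": "get_stock_price", "arguments": {"ticker": "IBM"}},
}
```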
Advantages over bespoke APIs:
One-time setup vs. per-tool wiring: Reduces development effort by establishing a general communication protocol instead of custom integrations for each tool or agent.
Automatic discovery, dynamic adaptability, built-in compatibility & scalability: Agents can automatically discover and use new tools/services, adapt to changing environments, ensure interoperability, and scale more easily.
Quote (Yang 2025): tools can be seen as low-autonomy agents; conversely, agents as high-autonomy tools. This highlights the fluidity between what constitutes a tool and an agent based on the level of autonomous decision-making and complexity.
Distinction:
AI Safety: focuses on preventing AI systems from causing unintended harm to humans, society, or the environment. This includes issues like bias, misuse, and catastrophic risks.
AI Security: focuses on protecting AI systems themselves from malicious attacks, such as data poisoning, adversarial attacks, or unauthorized access.
Physical-world stakes high (Waymo fully autonomous vehicles $\rightarrow$ up to 27\% of the SF ride-share market within 20 months): The deployment of AI agents in critical real-world systems, like self-driving cars, underscores the severe consequences of safety and security failures.
Attackers already leveraging agents:
AgentPoison (Chen 2024): adversary inserts malicious memory embeddings $\rightarrow$ agent issues dangerous actions (e.g., sudden stop in self-driving). This demonstrates how subtle manipulations of an agent's internal state can lead to harmful or disruptive behavior.
Exponential growth in cyber-threat volume (Amazon sees 7.5\times 10^8 attempted attacks per day, up from 1\times 10^8 a few months earlier): The increasing sophistication and volume of cyberattacks, partly driven by AI-assisted tools, pose a significant challenge.
Conclusions:
In near term, AI helps attackers more than defenders $\rightarrow$ proactive defenses essential: The rapid advancement of offensive AI capabilities outpaces defensive measures, necessitating a forward-looking approach to security.
Build provably secure systems, red-team agents, apply alignment & robust training: Strategies to enhance AI security include formal verification, adversarial testing (red-teaming), ensuring AI values align with human values, and developing models that are resilient to attacks.
Workforce impact: McKinsey & Mollick discussion — companies focus on low-end automation but smarter models can perform high-end tasks cheaper. This suggests a potential shift where AI agents will increasingly automate not just repetitive, low-skilled jobs but also more complex, knowledge-based roles.
Case study (Cao 2025): autonomous agent recreated entire Cochrane systematic-review issue in 2 days, saving 12 person-years. This highlights the immense efficiency gains and productivity increases possible with advanced AI agents, capable of automating highly specialized and time-consuming tasks.
Promise: hyperefficient virtual coworkers executing multi-step workflows with planner, analyst, checker roles. AI agents can act as intelligent assistants, taking on various roles within a complex workflow (e.g., planning, data analysis, verification), leading to unprecedented levels of productivity.
Risk: displacement of white-collar jobs, need for reskilling & governance. The flip side of increased efficiency is the potential for significant job displacement across various sectors, necessitating proactive measures for workforce reskilling and robust ethical and regulatory governance frameworks.
Blackjack MC example links to earlier lectures on Monte-Carlo and Exploration-Exploitation: Reinforces how fundamental RL concepts are applied to build decision-making agents.
Bellman $\rightarrow$ Q-Learning $\rightarrow$ DQN timeline shows deep-RL evolution culminating in LLM agents: Illustrates the historical progression from theoretical foundations to practical, high-performance AI systems that now incorporate LLMs.
ReAct combines symbolic reasoning (earlier logic lectures) with learned policies (RL lectures): Demonstrates a powerful synergy between symbolic AI (rule-based logic, planning) and statistical learning (reinforcement learning, neural networks) in modern agent design.
>100 agent frameworks as of 1 Jun 2025 (aiagentsdirectory.com): Indicates a rapidly expanding and diverse ecosystem of tools and platforms for building AI agents.
Waymo captured 0 $\rightarrow$ 27\% ride-share market share in 20 months (YipitData): A concrete example of rapid market penetration and real-world impact of autonomous AI systems.
Amazon cyber-attempts rose from 1\times10^8 to 7.5\times10^8 per day: Illustrates the escalating scale of cyber threats, emphasizing the urgent need for robust AI security solutions.
Cochrane review automation: 12 person-years saved: A compelling testament to the potential of AI agents to significantly boost productivity and efficiency in knowledge work.
Bellman Optimality: V^*(s) = \max_a \sum_{s'} p(s'\,|\,s,a)\,[\,r(s,a,s') + \gamma V^*(s')\,] defines the optimal value function for a state, i.e. the maximum expected return achievable from that state under the best possible action choices.
Q-Learning update: Q(s,a) \leftarrow Q(s,a) + \alpha\,[\,r + \gamma \max_{a'} Q(s',a') - Q(s,a)\,] describes how the Q-value (expected future reward for taking an action in a state) is updated from the observed reward and the maximum Q-value of the next state, converging to the optimal action-values.
Monte-Carlo value update: V(s) \leftarrow V(s) + \alpha\,(G - V(s)) shows how the estimated value of a state is nudged by a fraction \alpha of the difference between the observed episode return G and the current estimate V(s).
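For completeness, both tabular updates can be written as short Python functions over a dictionary of estimates; this is a generic sketch, not tied to any particular environment.

```python
def q_learning_update(Q, s, a, r, s_prime, actions, alpha=0.1, gamma=0.99):
    """Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]"""
    best_next = max(Q.get((s_prime, a2), 0.0) for a2 in actions)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (r + gamma * best_next - Q.get((s, a), 0.0))

def mc_value_update(V, s, G, alpha=0.05):
    """V(s) <- V(s) + alpha * (G - V(s)), with G the observed episode return."""
    V[s] = V.get(s, 0.0) + alpha * (G - V.get(s, 0.0))

Q, V = {}, {}
q_learning_update(Q, s="s0", a="hit", r=1.0, s_prime="s1", actions=["hit", "stick"])
mc_value_update(V, s="s0", G=1.0)
print(Q, V)
```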
Agents broaden AI from model inference to autonomous decision-making chains: The shift is from merely predicting outputs from inputs to creating systems that can perceive, reason, plan, and act independently over extended periods.
Multiple archetypes exist; choosing depends on latency, risk, domain, need for learning: Different agent architectures (reactive, learning, deliberative, hybrid) are suited for various applications, based on performance requirements and complexity of the problem.
Framework ecosystem is exploding; standards like MCP aim to tame complexity: The proliferation of agent development tools highlights the field's rapid growth, while emerging standards seek to bring order and interoperability.
Safety & security must evolve in parallel with capability: As AI agents become more powerful and deployed in sensitive domains, robust safety and security measures are paramount to mitigate risks and ensure responsible development.