Testing Multi-Agent Reinforcement Learning in Local Environments
Category
Multi-Agent Systems
Started On
June 2026
Version
v0.7
Status
In Progress
Last Updated
29 June 2026
Read Time
12 min
Description
An engineering experiment exploring how multiple LLM-based agents can improve their strategy through pure self-play and structured reflection without human feedback.
Context
I built this because I was curious about how multi-agent reinforcement learning (MARL) could be applied to complex strategy games without massive compute. I wanted to see what could be done with a lightweight local setup.
Why I Started This
I started this experiment because I wanted to bridge the gap between static LLM reasoning and dynamic environment interactions. Every system I was studying, from AlphaStar to OpenAI Five, relied on massive compute clusters. I wanted to build a microscopic version of that loop.
Research Question
"Can multiple LLM-based agents improve their strategy through pure self-play and structured reflection without human feedback?"
Initial Hypothesis
I believe giving agents structured feedback after every game and storing it in long-term memory will significantly improve their long-term planning and win rate against baseline bots.
Background Research
- [1] Reviewed the ReAct and Reflexion papers to understand prompt-based reasoning.
- [2] Analyzed the Voyager paper for skill library construction in Minecraft.
- [3] Studied OpenAI's Hide and Seek environment for emergent behaviors.
- [4] Read CrewAI's documentation for orchestrating multi-agent workflows.
Tech Stack Architecture
Models
GPT-4o, Claude 3.5 Sonnet, Llama 3 70B
Frameworks
LangGraph, CrewAI, AutoGen
Memory
Short-Term Memory, Reflection Memory
Retrieval
Vector Database, Knowledge Graph
Infrastructure
Python, Docker, Qdrant
Tools
Terminal, Code Execution, Browser
Building Process
Day 01
Created first multi-agent prototype using LangGraph. Agents could pass basic messages but lacked context between turns.
Day 03
Added short-term memory and basic environment interactions. The agents could now 'see' the board state.
Day 05
Implemented tool calling for the agents to run simulations. They started predicting opponent moves.
Day 07
Agents started communicating effectively, but frequently hallucinated game rules when forced into a corner.
Day 10
Performance dropped massively due to context window overflow. The prompt was too bloated.
Day 12
Fixed reasoning loop by adding a summarization module that compressed the game history.
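The Day 12 fix can be sketched roughly as below. This is a minimal illustration, not the project's actual module: `compress_history`, `keep_recent`, and the `summarize` callable are hypothetical names, and the stand-in summarizer only counts turns where a real setup would presumably prompt a model.

```python
from typing import Callable, List


def compress_history(
    turns: List[str],
    summarize: Callable[[List[str]], str],
    keep_recent: int = 4,
) -> List[str]:
    """Replace all but the most recent turns with a single summary line."""
    if keep_recent < 1:
        raise ValueError("keep_recent must be at least 1")
    if len(turns) <= keep_recent:
        return list(turns)  # nothing to compress yet
    older, recent = turns[:-keep_recent], turns[-keep_recent:]
    return [f"[summary] {summarize(older)}"] + recent


def count_summary(older: List[str]) -> str:
    # Placeholder summarizer (assumption): a real one would call an LLM.
    return f"{len(older)} earlier turns omitted"


history = [f"turn {i}" for i in range(1, 11)]
compressed = compress_history(history, count_summary, keep_recent=3)
```

The prompt then carries one summary line plus the last few raw turns, so its size stays roughly constant as the game gets longer.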
The Experiments
01.
GPT-4o
Change Made
Added Reflection Module
Observation
Reasoning and strategy improved over 5 episodes.
Problems
Latency was high and API costs skyrocketed.
Conclusion
Keep reflection, but trigger it less frequently.
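One way to trigger reflection less often is a simple schedule. The policy below (reflect after any loss, otherwise on every Nth episode) is an assumption for illustration; the write-up does not say which trigger was used, and `should_reflect` and `every_n` are hypothetical names.

```python
def should_reflect(episode: int, lost: bool, every_n: int = 5) -> bool:
    """Reflect after a loss, or on every `every_n`-th episode otherwise."""
    if every_n < 1:
        raise ValueError("every_n must be at least 1")
    return lost or episode % every_n == 0


# True marks a lost episode; episodes are numbered from 1.
outcomes = [False, True, False, False, False, False]
reflect_on = [
    episode
    for episode, lost in enumerate(outcomes, start=1)
    if should_reflect(episode, lost)
]
```

Here only episodes 2 (a loss) and 5 (the periodic slot) would pay the latency and API cost of a reflection call.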
02.
Claude 3.5 Sonnet
Change Made
Removed Reflection, increased context
Observation
Much faster inference and execution.
Problems
Lost long-term planning ability in later rounds.
Conclusion
Reflection is mandatory for multi-step strategy.
03.
Llama 3 70B (Local)
Change Made
Hybrid Memory Architecture
Observation
Best long-term recall and zero API costs.
Problems
Retrieval became the bottleneck (too slow).
Conclusion
Need to optimize the vector search pipeline.
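The hybrid memory idea can be pictured as a small recency buffer combined with similarity search over stored reflections. The sketch below is an assumption about the shape of such a design, not the code used in the experiment: `HybridMemory` is a hypothetical class, the embedding is a toy bag-of-words stand-in, and the brute-force scan takes the place of a real vector database.

```python
import math
from collections import Counter, deque
from typing import Deque, List, Tuple


def embed(text: str) -> Counter:
    # Toy bag-of-words vector (assumption); a real setup would use an embedding model.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class HybridMemory:
    """Short-term buffer of recent events plus searchable reflection notes."""

    def __init__(self, short_term_size: int = 3) -> None:
        self.short_term: Deque[str] = deque(maxlen=short_term_size)
        self.reflections: List[Tuple[str, Counter]] = []

    def observe(self, event: str) -> None:
        self.short_term.append(event)  # oldest event drops off automatically

    def add_reflection(self, note: str) -> None:
        self.reflections.append((note, embed(note)))

    def context(self, query: str, top_k: int = 1) -> List[str]:
        # Brute-force ranking: every stored reflection is scored on each call.
        q = embed(query)
        ranked = sorted(self.reflections, key=lambda r: cosine(q, r[1]), reverse=True)
        return [note for note, _ in ranked[:top_k]] + list(self.short_term)


memory = HybridMemory(short_term_size=2)
memory.add_reflection("defend the left flank early")
memory.add_reflection("trade pieces when ahead on material")
for event in ["move 1", "move 2", "move 3"]:
    memory.observe(event)
result = memory.context("opponent attacks the left flank")
```

The linear scan in `context` scores every reflection per query, which is the kind of cost that grows with memory size and that an indexed vector search is meant to avoid.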
Things That Broke
Memory overflow leading to agent confusion in late-game scenarios.
Infinite loops when agents tried to delegate tasks to each other.
Prompt failures when parsing complex XML-formatted tool outputs.
Slow inference making real-time strategy impossible.
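The delegation loops could be guarded against by tracking the hand-off chain and refusing to revisit an agent. This is a hedged sketch under assumed names: `delegate`, `routes` (a mapping from each agent to the agent it hands off to, or `None` if it handles the task itself), and `max_depth` are all hypothetical and not taken from any framework listed above.

```python
from typing import Dict, List, Optional


class DelegationLoopError(RuntimeError):
    """Raised when a task would bounce between agents indefinitely."""


def delegate(
    task: str,
    agent: str,
    routes: Dict[str, Optional[str]],
    chain: Optional[List[str]] = None,
    max_depth: int = 5,
) -> List[str]:
    """Follow hand-offs until an agent keeps the task; return the chain taken."""
    chain = list(chain or [])
    if agent in chain:
        path = " -> ".join(chain + [agent])
        raise DelegationLoopError(f"cycle while delegating {task!r}: {path}")
    if len(chain) >= max_depth:
        raise DelegationLoopError(f"delegation depth {max_depth} exceeded for {task!r}")
    chain.append(agent)
    next_agent = routes.get(agent)
    if next_agent is None:
        return chain  # this agent handles the task itself
    return delegate(task, next_agent, routes, chain, max_depth)


handled_by = delegate("scout enemy base", "planner", {"planner": "scout", "scout": None})
```

With a cyclic mapping such as `{"planner": "scout", "scout": "planner"}`, the call raises `DelegationLoopError` instead of looping.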
What I Learned
Reflection improved reasoning much more than just increasing the context window.
Memory management was significantly harder than expected.
Simpler prompts with strict constraints performed better than complex ones.
Tool calling requires rigorous error handling and validation.
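The last point, together with the XML parsing failures noted earlier, suggests validating tool output in code before it ever reaches the prompt. The sketch below assumes a hypothetical `<result><move>…</move><score>…</score></result>` schema; the real tool format is not given in this write-up, and `ToolResult` and `parse_tool_output` are illustrative names.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToolResult:
    ok: bool
    move: Optional[str] = None
    score: Optional[float] = None
    error: Optional[str] = None


def parse_tool_output(raw: str) -> ToolResult:
    """Parse and validate tool output, returning an error instead of raising."""
    try:
        root = ET.fromstring(raw)
    except ET.ParseError as exc:
        return ToolResult(ok=False, error=f"malformed XML: {exc}")
    if root.tag != "result":
        return ToolResult(ok=False, error=f"unexpected root <{root.tag}>")
    move = (root.findtext("move") or "").strip()
    score_text = root.findtext("score")
    if not move or score_text is None:
        return ToolResult(ok=False, error="missing <move> or <score>")
    try:
        score = float(score_text)
    except ValueError:
        return ToolResult(ok=False, error=f"non-numeric score: {score_text!r}")
    return ToolResult(ok=True, move=move, score=score)


good = parse_tool_output("<result><move>e4</move><score>0.5</score></result>")
truncated = parse_tool_output("<result><move>e4</move>")
```

Returning a structured error lets the agent loop retry or fall back with a short message, rather than feeding a broken payload back into the prompt.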
Open Questions
- Can agents learn effectively without ANY human-written rules?
- Can a world model improve the agents' planning horizon?
- How can latency be reduced to support real-time execution?
Future Improvements
Implement a fully persistent hierarchical memory system.
Add a dedicated 'Critic' agent to evaluate plans before execution.
Build a real-time visual dashboard to monitor agent states.
Integrate Reinforcement Learning from AI Feedback (RLAIF).