
Testing Multi-Agent Reinforcement Learning in Local Environments

Started On

June 2026

Version

v0.7

Status

In Progress

Last Updated

29 June 2026

Read Time

12 min

Description

An engineering experiment exploring how multiple LLM-based agents can improve their strategy through pure self-play and structured reflection without human feedback.

Context

I built this because I was curious about how multi-agent reinforcement learning (MARL) could be applied to complex strategy games without massive compute. I wanted to see what could be done with a lightweight local setup.

Why I Started This

I started this experiment because I wanted to bridge the gap between static LLM reasoning and dynamic environment interactions. Every system I was studying, from AlphaStar to OpenAI Five, relied on massive compute clusters. I wanted to build a microscopic version of that loop.

Research Question

"Can multiple LLM-based agents improve their strategy through pure self-play and structured reflection without human feedback?"

Initial Hypothesis

I believe giving agents structured feedback after every game and storing it in long-term memory will significantly improve their long-term planning and win rate against baseline bots.

Background Research

[1] Reviewed the ReAct and Reflexion papers to understand prompt-based reasoning.
[2] Analyzed the Voyager paper for skill library construction in Minecraft.
[3] Studied OpenAI's Hide and Seek environment for emergent behaviors.
[4] Read CrewAI's documentation for orchestrating multi-agent workflows.

Tech Stack Architecture

Models

GPT-4o, Claude 3.5 Sonnet, Llama 3 70B

Frameworks

LangGraph, CrewAI, AutoGen

Memory

Short-Term Memory, Reflection Memory

Retrieval

Vector Database, Knowledge Graph

Infrastructure

Python, Docker, Qdrant

Tools

Terminal, Code Execution, Browser

Building Process

Day 01

Created first multi-agent prototype using LangGraph. Agents could pass basic messages but lacked context between turns.

Day 03

Added short-term memory and basic environment interactions. The agents could now 'see' the board state.

Day 05

Implemented tool calling for the agents to run simulations. They started predicting opponent moves.

Day 07

Agents started communicating effectively, but frequently hallucinated game rules when forced into a corner.

Day 10

Performance dropped massively due to context window overflow. The prompt was too bloated.

Day 12

Fixed reasoning loop by adding a summarization module that compressed the game history.
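
A minimal sketch of the kind of history compression described for Day 12. It assumes turns are plain strings and that the summarizer is any callable; in the real setup that would presumably be an LLM call, and here a hypothetical stand-in just truncates each turn. The names compress_history, keep_recent, and naive_summarizer are illustrative and not taken from the project.

```python
from typing import Callable, List


def naive_summarizer(turns: List[str]) -> str:
    """Hypothetical stand-in for an LLM summarization call."""
    # Keep only the first 40 characters of each older turn.
    return " | ".join(t[:40] for t in turns)


def compress_history(
    history: List[str],
    keep_recent: int = 4,
    summarize: Callable[[List[str]], str] = naive_summarizer,
) -> List[str]:
    """Replace all but the last `keep_recent` turns with one summary entry."""
    if keep_recent <= 0:
        raise ValueError("keep_recent must be positive")
    if len(history) <= keep_recent:
        return list(history)  # nothing to compress yet
    older, recent = history[:-keep_recent], history[-keep_recent:]
    return ["[summary] " + summarize(older)] + recent


if __name__ == "__main__":
    turns = [f"turn {i}: moved piece" for i in range(10)]
    for line in compress_history(turns, keep_recent=3):
        print(line)
```

The recent turns stay verbatim so the agent still sees the exact current state, while older turns collapse into a single entry that keeps the prompt size bounded.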

The Experiments

01.

GPT-4o

Change Made

Added Reflection Module

Observation

Reasoning and strategy improved over 5 episodes.

Problems

Latency was high and API costs skyrocketed.

Conclusion

Keep reflection, but trigger it less frequently.
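
One way to trigger reflection less often is a small scheduler that gates the expensive call. This is a sketch under assumptions: the interval of three episodes and the "always reflect after a loss" rule are hypothetical choices, not settings from the experiment, and ReflectionScheduler is an illustrative name.

```python
from dataclasses import dataclass, field


@dataclass
class ReflectionScheduler:
    """Decides when to run the (expensive) reflection step."""

    every_n: int = 3              # hypothetical interval, in episodes
    reflect_on_loss: bool = True  # assumption: losses are worth reflecting on
    _since_last: int = field(default=0, init=False, repr=False)

    def should_reflect(self, won: bool) -> bool:
        """Call once per finished episode; returns True when reflection is due."""
        self._since_last += 1
        due = self._since_last >= self.every_n
        if due or (self.reflect_on_loss and not won):
            self._since_last = 0
            return True
        return False


if __name__ == "__main__":
    scheduler = ReflectionScheduler(every_n=3)
    for episode, won in enumerate([True, True, True, False, True], start=1):
        print(episode, scheduler.should_reflect(won))
```

Raising every_n trades reflection quality for lower latency and API spend, which is the tension this experiment ran into.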

02.

Claude 3.5 Sonnet

Change Made

Removed Reflection, increased context

Observation

Much faster inference and execution.

Problems

Lost long-term planning ability in later rounds.

Conclusion

Reflection is mandatory for multi-step strategy.

03.

Llama 3 70B (Local)

Change Made

Hybrid Memory Architecture

Observation

Best long-term recall and zero API costs.

Problems

Retrieval became the bottleneck (too slow).

Conclusion

Need to optimize the vector search pipeline.
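
The log does not spell out what the hybrid memory looked like, so the following is only one plausible reading: a bounded short-term buffer for recent events plus a reflection store searched by embedding similarity. Embeddings are assumed to be precomputed lists of floats, and HybridMemory and its methods are hypothetical names. The brute-force search here scans every stored reflection, which illustrates why retrieval can become the bottleneck as the store grows.

```python
import math
from collections import deque
from typing import Deque, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class HybridMemory:
    """Short-term buffer plus similarity-searched reflection store (sketch)."""

    def __init__(self, short_term_size: int = 5) -> None:
        # Oldest events fall off automatically once the buffer is full.
        self.short_term: Deque[str] = deque(maxlen=short_term_size)
        self.reflections: List[Tuple[List[float], str]] = []

    def observe(self, event: str) -> None:
        self.short_term.append(event)

    def add_reflection(self, embedding: Sequence[float], text: str) -> None:
        self.reflections.append((list(embedding), text))

    def recall(self, query: Sequence[float], k: int = 2) -> List[str]:
        # Linear scan over all reflections: O(n) per query.
        ranked = sorted(
            self.reflections,
            key=lambda item: cosine(query, item[0]),
            reverse=True,
        )
        return [text for _, text in ranked[:k]]

    def build_context(self, query: Sequence[float], k: int = 2) -> List[str]:
        """Relevant reflections first, then the most recent events."""
        return self.recall(query, k) + list(self.short_term)
```

A dedicated vector database would typically replace the linear scan in recall with an approximate nearest-neighbor index, which is where the optimization work would go.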

Things That Broke

Memory overflow leading to agent confusion in late-game scenarios.
Infinite loops when agents tried to delegate tasks to each other.
Prompt failures when parsing complex XML-formatted tool outputs.
Slow inference making real-time strategy impossible.
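
The delegation loops listed above can be guarded against with a cycle and depth check before any hand-off happens. This is a sketch, not the fix used in the project: the routing table, the max_depth limit, and the names resolve_delegation and DelegationError are all hypothetical.

```python
from typing import Dict, List, Optional


class DelegationError(RuntimeError):
    """Raised when a delegation chain loops or grows too deep."""


def resolve_delegation(
    start: str,
    routing: Dict[str, Optional[str]],
    max_depth: int = 5,
) -> List[str]:
    """Follow hand-offs from `start` and return the chain of agents.

    `routing` maps each agent to the agent it delegates to, or None
    (or a missing key) if the agent handles the task itself.
    """
    chain = [start]
    while routing.get(chain[-1]) is not None:
        nxt = routing[chain[-1]]
        if nxt in chain:
            raise DelegationError("cycle: " + " -> ".join(chain + [nxt]))
        if len(chain) >= max_depth:
            raise DelegationError(f"chain exceeds max_depth={max_depth}")
        chain.append(nxt)
    return chain


if __name__ == "__main__":
    print(resolve_delegation("planner", {"planner": "scout", "scout": None}))
    try:
        resolve_delegation("a", {"a": "b", "b": "a"})
    except DelegationError as exc:
        print("blocked:", exc)
```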

What I Learned

Reflection improved reasoning much more than just increasing the context window.
Memory management was significantly harder than expected.
Simpler prompts with strict constraints performed better than complex ones.
Tool calling requires rigorous error handling and validation.
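
As an illustration of the last point, here is a minimal sketch of validating an XML-formatted tool output before it reaches the prompt. The schema (a root element with status and move children) and the name parse_tool_output are assumptions made for the example; the actual tool output format is not given in this log. Parsing uses the standard library's xml.etree.ElementTree.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Tuple

REQUIRED_FIELDS = ("status", "move")  # hypothetical schema


def parse_tool_output(raw: str) -> Tuple[bool, Dict[str, str]]:
    """Return (ok, fields) on success or (False, {"error": ...}) on failure."""
    try:
        root = ET.fromstring(raw.strip())
    except ET.ParseError as exc:
        # Malformed XML: report it instead of passing garbage to the agent.
        return False, {"error": f"malformed XML: {exc}"}

    # Flatten direct children into a tag -> text mapping.
    fields = {child.tag: (child.text or "").strip() for child in root}
    missing = [name for name in REQUIRED_FIELDS if not fields.get(name)]
    if missing:
        return False, {"error": "missing fields: " + ", ".join(missing)}
    return True, fields


if __name__ == "__main__":
    good = "<tool_result><status>ok</status><move>e4</move></tool_result>"
    print(parse_tool_output(good))
    print(parse_tool_output("<tool_result><status>ok</status>"))
```

Returning a structured error rather than raising lets the calling agent retry or fall back without the failure leaking into its reasoning context.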