APEE: Adaptive Poly-Agentic Evaluation Ecosystem

A comprehensive framework for evaluating and benchmarking multi-agent AI systems using LLM-as-a-Judge methodology: 12 collaborative scenarios tested across 6 collaboration patterns, scored by an ensemble of judge models.

Overview

The Adaptive Poly-Agentic Evaluation Ecosystem (APEE) is a framework for systematically evaluating multi-agent AI systems. It uses LLM-as-a-Judge evaluation, in which larger language models (20-24B parameters) judge the outputs of smaller agent models (3-4B parameters), producing nuanced, reasoned scores rather than simple heuristics.

LLM-as-a-Judge
Large models evaluate smaller agent outputs with detailed feedback and reasoning.
Ensemble Judges
Multiple judge models from different families reduce evaluation bias.
Three-Tier Metrics
Individual (L1) → Collaborative (L2) → Ecosystem (L3) evaluation levels.
6 Collaboration Patterns
Parallel, sequential, debate, hierarchical, consensus, peer review.
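The judging loop behind these pieces reduces to two small steps: parse a numeric score out of each judge's free-text verdict, then average across judges from different families. The sketch below illustrates that flow; the `Score: <n>/10` reply format and the plain-mean aggregation are assumptions, and the actual calls to the judge models (e.g. via `ollama.chat`) are omitted.

```python
import re
from statistics import mean

def parse_score(judge_reply: str) -> float:
    """Extract the numeric score from a judge reply like 'Score: 8/10'.

    Assumes judges were prompted to answer in that format; APEE's
    actual reply format may differ.
    """
    m = re.search(r"Score:\s*(\d+(?:\.\d+)?)\s*/\s*10", judge_reply)
    if m is None:
        raise ValueError("no score found in judge reply")
    return float(m.group(1))

def ensemble_score(judge_replies: list[str]) -> float:
    """Average scores from multiple judge models to reduce single-judge bias."""
    return mean(parse_score(reply) for reply in judge_replies)
```

In practice each reply would come from a different judge model (e.g. gpt-oss:20b and mistral-small3.2:24b), so a systematic bias in one model family is damped by the other.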

Quick Start

Terminal
# Clone and install
git clone https://github.com/ahjavid/technical-notes-blog.git
cd technical-notes-blog/posts/apee-evaluation-ecosystem
pip install -e .

# Pull agent models (ollama pull takes one model per invocation)
ollama pull llama3.2:3b
ollama pull qwen2.5-coder:3b
ollama pull phi4-mini:3.8b

# Pull judge models
ollama pull gpt-oss:20b
ollama pull mistral-small3.2:24b

# Run LLM-as-a-Judge evaluation
python examples/proper_apee_evaluation.py

# Run advanced evaluation modes
python examples/advanced_evaluation_demo.py --mode all

Architecture

Agents (small 3B models, diverse families):

| Role | Model | Family | Strength |
|------|-------|--------|----------|
| Coder (Executor) | llama3.2:3b | Llama | Code generation |
| Analyst (Analyzer) | qwen2.5-coder:3b | Qwen | Analysis |
| Reviewer | phi4-mini:3.8b | Phi | Code review |

Judges (large models, different families):

| Judge | Model | Size | Family |
|-------|-------|------|--------|
| Judge 1 | gpt-oss:20b | 20B | GPT-OSS |
| Judge 2 | mistral-small3.2:24b | 24B | Mistral |
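The roster above can be expressed as a plain config that drives the evaluation loop. The dict layout and field names below are illustrative, not APEE's actual schema.

```python
# Illustrative roster config; field names are assumptions, not APEE's schema.
AGENTS = {
    "coder": {"model": "llama3.2:3b", "family": "llama", "strength": "code_generation"},
    "analyst": {"model": "qwen2.5-coder:3b", "family": "qwen", "strength": "analysis"},
    "reviewer": {"model": "phi4-mini:3.8b", "family": "phi", "strength": "code_review"},
}

JUDGES = ["gpt-oss:20b", "mistral-small3.2:24b"]

def distinct_families(roster: dict) -> set[str]:
    """Diversity check: agents should span different model families
    so that family-specific quirks don't dominate the ecosystem."""
    return {spec["family"] for spec in roster.values()}
```

Keeping agents and judges in distinct families is deliberate: a judge from the same family as an agent tends to score its stylistic quirks more leniently.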

Model Benchmark Results

Highlights: phi4-mini:3.8b best overall (0.892); llama3.2:3b second (0.889). 6 models tested across 19 scenarios.
December 2025 Rankings (6 Models × 19 Scenarios)
  • 🥇 phi4-mini:3.8b: Quality 0.892 (±0.092) — Best overall, excels at code_review (0.991)
  • 🥈 llama3.2:3b: Quality 0.889 (±0.130) — Best code_generation (0.983), fastest at 3000ms
  • 🥉 gemma3:4b: Quality 0.876 (±0.121) — Strong code_debug (0.960)
  • 4. qwen3:4b: Quality 0.866 (±0.077) — Lowest variance, best instruction_following (0.840)
  • 5. granite4:3b: Quality 0.837 (±0.101) — Fastest (1904ms), best math (0.889)
  • 6. qwen2.5-coder:3b: Quality 0.829 (±0.138) — Best qa_reasoning (0.957)

Full Performance Matrix

| Category | qwen3 | llama3.2 | gemma3 | granite4 | qwen2.5-coder | phi4-mini |
|----------|-------|----------|--------|----------|---------------|-----------|
| analysis | 0.861 | 0.934 | 0.837 | 0.812 | **0.939** | 0.927 |
| code_debug | 0.776 | 0.933 | 0.960 | 0.901 | **0.965** | 0.934 |
| code_explanation | **0.893** | 0.837 | 0.871 | 0.836 | 0.855 | 0.876 |
| code_generation | 0.927 | **0.983** | 0.922 | 0.853 | 0.888 | 0.927 |
| code_review | 0.912 | 0.983 | 0.969 | 0.928 | 0.790 | **0.991** |
| instruction_following | **0.840** | 0.571 | 0.607 | 0.599 | 0.635 | 0.713 |
| math | 0.719 | 0.800 | 0.838 | **0.889** | 0.538 | 0.855 |
| qa_reasoning | 0.900 | 0.936 | 0.928 | 0.895 | **0.957** | 0.903 |
| reasoning | 0.800 | **0.909** | 0.890 | 0.894 | 0.897 | 0.899 |
| summarization | **0.912** | 0.842 | 0.839 | 0.765 | 0.792 | 0.766 |

Best score in each category is highlighted. Scores come from enhanced heuristic scoring combining ROUGE overlap, keyword matching, and constraint validation.
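The heuristic scorer can be approximated by combining a ROUGE-1 recall term, keyword coverage, and a constraint check. The sketch below is a simplified stand-in: the 0.5/0.3/0.2 weights and the exact term definitions are assumptions, not APEE's actual configuration.

```python
from collections import Counter

def rouge1_recall(reference: str, candidate: str) -> float:
    """Fraction of reference unigrams recovered by the candidate (ROUGE-1 recall)."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum(min(ref[w], cand[w]) for w in ref)
    return overlap / max(sum(ref.values()), 1)

def keyword_coverage(keywords: list[str], candidate: str) -> float:
    """Fraction of expected keywords that appear in the candidate text."""
    text = candidate.lower()
    hits = sum(1 for k in keywords if k.lower() in text)
    return hits / max(len(keywords), 1)

def heuristic_score(reference: str, candidate: str, keywords: list[str],
                    constraints_passed: bool,
                    weights: tuple = (0.5, 0.3, 0.2)) -> float:
    """Weighted blend of ROUGE recall, keyword coverage, and constraint validation.

    The weights are illustrative, not the ones APEE actually uses.
    """
    w_rouge, w_kw, w_con = weights
    return (w_rouge * rouge1_recall(reference, candidate)
            + w_kw * keyword_coverage(keywords, candidate)
            + w_con * (1.0 if constraints_passed else 0.0))
```

A perfect candidate (full unigram recall, all keywords present, constraints satisfied) scores 1.0 under these weights; each missing component degrades the score proportionally.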

LLM-as-a-Judge Results

December 2025 Evaluation
  • Adversarial review leads: 8.0/10 overall; the debate pattern excels (L2 = 8.2)
  • L2 collaborative scores improved: average 7.3/10, up from 5.7/10 before the message-handling fix
  • L3 ecosystem metrics are strongest: all scenarios score 7.2-8.7/10
  • Score range 6.6-8.0, mean 7.03/10 across all 12 scenarios (Basic mode)

Multi-Agent Collaborative Evaluation (Basic Mode)

| Scenario | Pattern | L1 | L2 | L3 | Overall |
|----------|---------|----|----|----|---------|
| Adversarial Review | Debate | 7.8 | 8.2 | 8.0 | 8.0 |
| Constrained Problem | Debate | 6.8 | 7.8 | 8.0 | 7.5 |
| Research Synthesis | Sequential | 7.8 | 6.4 | 8.4 | 7.3 |
| Creative Collab | Debate | 6.0 | 7.8 | 8.1 | 7.3 |
| Knowledge Transfer | Sequential | 7.2 | 6.8 | 8.0 | 7.3 |
| Conflict Resolution | Consensus | 7.3 | 7.5 | 7.2 | 7.3 |
| Realtime Collab | Parallel | 7.3 | 6.8 | 8.4 | 7.2 |
| Scalability Test | Hierarchical | 7.5 | 7.2 | 7.6 | 7.1 |
| Doc Sprint | Peer Review | 6.2 | 7.5 | 7.6 | 7.1 |
| Error Recovery | Hierarchical | 6.4 | 7.4 | 7.8 | 7.1 |
| Collab Code Review | Peer Review | 6.8 | 7.0 | 7.8 | 7.0 |
| Emergent Behavior | Parallel | 7.4 | 5.0 | 8.7 | 6.6 |

Advanced Evaluation Patterns (Phase 7)

Four evaluation modes with different characteristics:

Basic Mode
Avg: 6.69 | Range: 5.8-7.8
Standard LLM-as-a-Judge. Best for quick assessment.
Progressive Mode
Avg: 6.79 | Range: 6.1-7.4
4 depth levels with fail-fast. Best for large-scale screening.
Jury Mode
Avg: 6.83 | Range: 6.2-7.8
4 personas (Skeptic, Literalist, Optimist, Pragmatist).
Calibrated Mode
Avg: 6.66 | Range: 5.5-7.7
Calibration + jury. Best for novel/ambiguous tasks.
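Jury mode's four personas can be sketched as system prompts plus a simple aggregation. The persona wording and the plain-mean verdict below are illustrative assumptions, not APEE's exact prompts or weighting.

```python
from statistics import mean

# Hypothetical persona framings; APEE's actual prompts may differ.
PERSONAS = {
    "Skeptic": "Assume the answer is wrong until the evidence convinces you.",
    "Literalist": "Score strictly on whether every instruction was followed.",
    "Optimist": "Give credit for sound partial progress and good ideas.",
    "Pragmatist": "Score on how useful the answer is in practice.",
}

def persona_messages(persona: str, task: str, answer: str) -> list[dict]:
    """Build a chat request for one juror (e.g. to pass to ollama.chat)."""
    return [
        {"role": "system", "content": PERSONAS[persona]},
        {"role": "user", "content": f"Task:\n{task}\n\nAnswer:\n{answer}\n\nScore 1-10."},
    ]

def jury_verdict(scores_by_persona: dict[str, float]) -> float:
    """Aggregate per-persona scores; a plain mean is one plausible choice."""
    return mean(scores_by_persona.values())
```

Running the same judge model under four contrasting personas surfaces disagreement cheaply: a wide spread between the Skeptic and the Optimist is itself a signal that the task is ambiguous.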

Phase 6: Visualization & Dashboard

📈 Interactive Visualization
Plotly charts for performance analysis, pattern comparison, and metric breakdown.
🔍 Anomaly Detection
Z-score (threshold: 2.0) and IQR-based detection for identifying evaluation outliers.
🎛️ REST API Dashboard
FastAPI server on port 8000 for real-time evaluation monitoring.
📊 Export Formats
HTML reports, JSON data, and CSV exports for all evaluation results.
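Both anomaly detectors are a few lines each. The sketch below uses a crude index-based quartile (production code would typically interpolate) and assumes at least two scores; it illustrates the z-score (threshold 2.0) and IQR rules named above rather than reproducing APEE's exact implementation.

```python
from statistics import mean, stdev

def zscore_outliers(scores: list[float], threshold: float = 2.0) -> list[float]:
    """Flag scores more than `threshold` standard deviations from the mean."""
    mu, sigma = mean(scores), stdev(scores)
    if sigma == 0:
        return []
    return [s for s in scores if abs(s - mu) / sigma > threshold]

def iqr_outliers(scores: list[float], k: float = 1.5) -> list[float]:
    """Flag scores outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's fences)."""
    xs = sorted(scores)
    n = len(xs)
    q1, q3 = xs[n // 4], xs[(3 * n) // 4]  # crude quartiles, no interpolation
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [s for s in scores if s < lo or s > hi]
```

The two rules complement each other: z-score assumes roughly normal score distributions, while the IQR fence is robust when a single scenario drags the mean.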

Conclusion

APEE represents a significant step toward rigorous, reproducible LLM evaluation. By implementing multiple evaluation paradigms (basic, progressive, jury, calibrated), researchers and practitioners can choose the right balance between evaluation depth and computational cost.

Get Involved
APEE is an active research project. Check the full documentation or view on GitHub.
Tags: Multi-Agent AI · LLM Evaluation · Ollama · Benchmarking · Python