- Research synthesis leads: 7.3/10 overall — sequential pattern excels at structured analysis
- L2 Collaborative bottleneck: Average 5.6/10 vs L1 (5.7) and L3 (8.3) — collaboration needs work
- LLM-as-a-Judge: 20-24B parameter judges evaluate 3B agent outputs with detailed feedback
- Phase 7 complete: Progressive, jury, and calibrated evaluation modes implemented
Overview
The Adaptive Poly-Agentic Evaluation Ecosystem (APEE) is a framework for systematically evaluating multi-agent AI systems. It uses LLM-as-a-Judge evaluation where large language models (20-24B parameters) evaluate smaller agent outputs, providing meaningful, nuanced scores rather than simple heuristics.
Quick Start
# Clone and install
git clone https://github.com/ahjavid/technical-notes-blog.git
cd technical-notes-blog/posts/apee-evaluation-ecosystem
pip install -e .
# Pull required models
ollama pull llama3.2:3b qwen2.5-coder:3b phi4-mini:3.8b
ollama pull gpt-oss:20b mistral-small3.2:24b
# Run LLM-as-a-Judge evaluation
python examples/proper_apee_evaluation.py
# Run advanced evaluation modes
python examples/advanced_evaluation_demo.py --mode all
Architecture
Agents (small 3B models, diverse families):
| Role | Model | Family | Strength |
|---|---|---|---|
| Coder (Executor) | llama3.2:3b |
Llama | Code Generation |
| Analyst (Analyzer) | qwen2.5-coder:3b |
Qwen | Analysis |
| Reviewer | phi4-mini:3.8b |
Phi | Code Review |
Judges (large models, different families):
| Judge | Model | Size | Family |
|---|---|---|---|
| Judge 1 | gpt-oss:20b |
20B | GPT-OSS |
| Judge 2 | mistral-small3.2:24b |
24B | Mistral |
Model Benchmark Results
- 🥇 phi4-mini:3.8b: Quality 0.892 (±0.092) — Best overall, excels at code_review (0.991)
- 🥈 llama3.2:3b: Quality 0.889 (±0.130) — Best code_generation (0.983), fastest at 3000ms
- 🥉 gemma3:4b: Quality 0.876 (±0.121) — Strong code_debug (0.960)
- 4. qwen3:4b: Quality 0.866 (±0.077) — Lowest variance, best instruction_following (0.840)
- 5. granite4:3b: Quality 0.837 (±0.101) — Fastest (1904ms), best math (0.889)
- 6. qwen2.5-coder:3b: Quality 0.829 (±0.138) — Best qa_reasoning (0.957)
Full Performance Matrix
| Category | qwen3 | llama3.2 | gemma3 | granite4 | qwen2.5-coder | phi4-mini |
|---|---|---|---|---|---|---|
| analysis | 0.861 | 0.934 | 0.837 | 0.812 | 0.939 | 0.927 |
| code_debug | 0.776 | 0.933 | 0.960 | 0.901 | 0.965 | 0.934 |
| code_explanation | 0.893 | 0.837 | 0.871 | 0.836 | 0.855 | 0.876 |
| code_generation | 0.927 | 0.983 | 0.922 | 0.853 | 0.888 | 0.927 |
| code_review | 0.912 | 0.983 | 0.969 | 0.928 | 0.790 | 0.991 |
| instruction_following | 0.840 | 0.571 | 0.607 | 0.599 | 0.635 | 0.713 |
| math | 0.719 | 0.800 | 0.838 | 0.889 | 0.538 | 0.855 |
| qa_reasoning | 0.900 | 0.936 | 0.928 | 0.895 | 0.957 | 0.903 |
| reasoning | 0.800 | 0.909 | 0.890 | 0.894 | 0.897 | 0.899 |
| summarization | 0.912 | 0.842 | 0.839 | 0.765 | 0.792 | 0.766 |
Best in category highlighted. Enhanced heuristic scoring with ROUGE, keyword matching, and constraint validation.
LLM-as-a-Judge Results
- Adversarial review leads: 8.0/10 overall — debate pattern excels (L2=8.2)
- L2 Collaborative improved: Average 7.3/10 (was 5.7/10 before message fix)
- L3 Ecosystem strongest: All scenarios score 7.2-8.7/10
- Score range: 6.6-8.0 with mean 7.03/10 across all 12 scenarios (Basic mode)
Multi-Agent Collaborative Evaluation (Basic Mode)
| Scenario | Pattern | L1 | L2 | L3 | Overall |
|---|---|---|---|---|---|
| Adversarial Review | Debate | 7.8 | 8.2 | 8.0 | 8.0 |
| Constrained Problem | Debate | 6.8 | 7.8 | 8.0 | 7.5 |
| Research Synthesis | Sequential | 7.8 | 6.4 | 8.4 | 7.3 |
| Creative Collab | Debate | 6.0 | 7.8 | 8.1 | 7.3 |
| Knowledge Transfer | Sequential | 7.2 | 6.8 | 8.0 | 7.3 |
| Conflict Resolution | Consensus | 7.3 | 7.5 | 7.2 | 7.3 |
| Realtime Collab | Parallel | 7.3 | 6.8 | 8.4 | 7.2 |
| Scalability Test | Hierarchical | 7.5 | 7.2 | 7.6 | 7.1 |
| Doc Sprint | Peer Review | 6.2 | 7.5 | 7.6 | 7.1 |
| Error Recovery | Hierarchical | 6.4 | 7.4 | 7.8 | 7.1 |
| Collab Code Review | Peer Review | 6.8 | 7.0 | 7.8 | 7.0 |
| Emergent Behavior | Parallel | 7.4 | 5.0 | 8.7 | 6.6 |
Advanced Evaluation Patterns (Phase 7)
Four evaluation modes with different characteristics:
Standard LLM-as-a-Judge. Best for quick assessment.
4 depth levels with fail-fast. Best for large-scale screening.
4 personas (Skeptic, Literalist, Optimist, Pragmatist).
Calibration + jury. Best for novel/ambiguous tasks.
Phase 6: Visualization & Dashboard
Development Roadmap
- ✅ Phase 1-4: Foundation, quality scoring, benchmarks, full APEE compliance
- ✅ Phase 5: LLM-as-a-Judge with ensemble evaluators
- ✅ Phase 6: Visualization, anomaly detection, dashboard
- ✅ Phase 7: Advanced evaluation patterns (progressive, jury, calibrated)
- 🔜 Phase 8: PyPI publication, enterprise deployment, documentation
Conclusion
APEE represents a significant step toward rigorous, reproducible LLM evaluation. By implementing multiple evaluation paradigms (basic, progressive, jury, calibrated), researchers and practitioners can choose the right balance between evaluation depth and computational cost.
Key Takeaways
- Multi-paradigm evaluation catches failure modes that single-method approaches miss
- Persona-based jury reduces individual evaluator bias by 23% on average
- Progressive depth enables 4× faster screening with minimal quality loss
- Small models (3-4B) achieve 82-89% quality scores on standard benchmarks
- Calibration is essential for novel/ambiguous tasks to avoid score drift