Building a Complete AI Agent Evaluation Ecosystem: From Instrumentation to Intelligence
Author: Kshitij Thakkar
Date: November 2025
Tags: MCP, Agent Evaluation, OpenTelemetry, Gradio, Google Gemini, HuggingFace
TraceVerse: A Complete AI Agent Evaluation Ecosystem
Introduction
Everyone wants to ship AI agents. Almost nobody can prove they work.
Most teams are still stuck with:
- scattered logs,
- ad-hoc eval scripts,
- screenshots of leaderboards,
- zero understanding of why a specific agent or model fails.
So I built TraceVerse — a 4-project open-source ecosystem that turns raw agent runs into traces, metrics, datasets, and AI-powered insights you can actually act on.
Instead of "I think this model is better," you get:
"Here's the trace, the cost, the failure pattern, the GPU profile, and the dataset backing that conclusion."
TL;DR – What TraceVerse Actually Does
TraceVerse is four projects that plug into each other:
- TraceVerde (genai_otel_instrument): Zero-code OpenTelemetry instrumentation for LLMs, agents, tools, and infra. → Every LLM call, tool call, DB query, and vector search is traced with tokens, cost, latency, GPU usage, and CO₂.
- SMOLTRACE (smoltrace): Evaluation engine built on smolagents. → Runs agent evals and writes 4 Hugging Face Datasets per run (leaderboard, results, traces, metrics).
- TraceMind MCP Server: MCP server exposing 11 tools + 3 resources + 3 prompts. → Uses Gemini to analyze leaderboards, debug traces, estimate costs, generate synthetic tasks, and push datasets.
- TraceMind-AI (Gradio app): Interactive UI + autonomous agent on top of the MCP server. → Lets you explore evals, compare models, inspect traces, estimate costs, and ask "why did this fail?" in natural language.
End result: You go from "I ran some agents and logged numbers" to "I have reproducible HF datasets, full OTEL traces, and an agent that can explain my agents."
Why This Matters for Hugging Face
This ecosystem is built around Hugging Face, not just "using it":
- Every SMOLTRACE evaluation creates 4 structured datasets on the Hub (leaderboard, results, traces, metrics).
- TraceMind MCP Server and TraceMind-AI run as Hugging Face Spaces, using Gradio's MCP integration.
- The stack is designed for smolagents: agents are evaluated, traced, and analyzed using HF's own agent framework.
- Evaluations can be executed via HF Jobs, turning evaluations into real compute usage, not just local scripts.
So TraceVerse isn't just another toy agent demo. It's an opinionated blueprint for:
"How Hugging Face models + Datasets + Spaces + Jobs + smolagents + MCP can work together as a complete agent evaluation and observability platform."
From here, the rest of this post dives into the four projects, how they fit together, and what I learned building the full pipeline.
The TraceVerse Ecosystem
🔭 TraceVerde                      📊 SMOLTRACE
(genai_otel_instrument)            (Evaluation Engine)
          ↓                                  ↓
     Instruments                         Evaluates
      LLM calls                            agents
          ↓                                  ↓
          └────────────────┬────────────────┘
                           ↓
                   Generates Datasets
             (leaderboard, traces, metrics)
                           ↓
          ┌────────────────┴────────────────┐
          ↓                                  ↓
🛠️ TraceMind MCP Server            🧠 TraceMind-AI
(AI-Powered Analysis)              (Interactive Platform)
The Four Projects
| Project | Purpose | Technology | Distribution |
|---|---|---|---|
| TraceVerde | Zero-code OpenTelemetry instrumentation | Python, OTEL SDK | PyPI |
| SMOLTRACE | Agent evaluation engine | Python, smolagents | PyPI |
| TraceMind MCP Server | AI-powered analysis tools | Gradio, MCP, Gemini | HuggingFace Space |
| TraceMind-AI | Interactive evaluation platform | Gradio, smolagents | HuggingFace Space |
Project 1: TraceVerde (genai_otel_instrument)
Why it exists: You can't fix what you can't see. TraceVerde gives you zero-code observability for every LLM call, tool invocation, and agent step—with automatic cost enrichment, GPU metrics, and CO₂ tracking. Instead of scattered print statements, you get structured OpenTelemetry spans across 17+ providers.
The Problem
You've built an AI agent. Now you want to understand what's happening inside:
- Which LLM calls are being made?
- How long does each step take?
- How many tokens are being consumed?
- What's the cost breakdown?
Traditional logging is scattered and inconsistent. You end up with print statements everywhere, making it impossible to get a unified view of your agent's behavior.
The Solution
TraceVerde provides zero-code OpenTelemetry instrumentation for GenAI applications. Add two lines of code, and every LLM call across 17+ providers is automatically traced with rich semantic attributes.
Installation
pip install genai-otel-instrument
Usage
import genai_otel
# That's it! All LLM calls are now instrumented
genai_otel.instrument()
# Or with explicit configuration
genai_otel.instrument(
service_name="my-agent",
exporter_type="otlp",
endpoint="http://localhost:4318"
)
What Gets Captured
Every LLM call automatically includes:
{
"gen_ai.system": "openai",
"gen_ai.request.model": "gpt-4",
"gen_ai.operation.name": "chat",
"gen_ai.usage.prompt_tokens": 150,
"gen_ai.usage.completion_tokens": 89,
"gen_ai.usage.total_tokens": 239,
"gen_ai.usage.cost.total": 0.0045, # Auto-enriched!
"gen_ai.response.finish_reasons": ["stop"]
}
Key Features
- Zero-Code Instrumentation: No decorators, no wrappers; just call instrument() and you're done
- 17+ Provider Support: OpenAI, Anthropic, Google, Cohere, Mistral, and more
- Cost Enrichment: Automatic cost calculation for 340+ models across 20 providers
- Framework Integration: Works with LangChain, LlamaIndex, smolagents, CrewAI
- Standard OTEL Export: Send traces to Jaeger, Zipkin, or any OTEL-compatible backend
Technical Architecture
# Core instrumentation flow
┌─────────────────┐     ┌──────────────────┐     ┌─────────────────┐
│   Your Agent    │ ──▶ │    TraceVerde    │ ──▶ │  OTEL Exporter  │
│ (any provider)  │     │  (auto-patches)  │     │ (traces/spans)  │
└─────────────────┘     └──────────────────┘     └─────────────────┘
                                  │
                                  ▼
                        ┌──────────────────┐
                        │  Cost Enricher   │
                        │  (340+ models)   │
                        └──────────────────┘
TraceVerde uses Python's import hooks to automatically patch LLM client libraries. When you call openai.chat.completions.create(), TraceVerde:
- Intercepts the call
- Creates an OTEL span with semantic attributes
- Executes the original call
- Records response attributes (tokens, finish reason)
- Enriches with cost data from the pricing database
- Exports the span to your configured backend
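Conceptually, the wrapper around each intercepted call looks something like the sketch below. This is a simplified illustration, not TraceVerde's actual internals (the helper names traced and lookup_cost are hypothetical); the real package applies this pattern for you via import hooks.
# Conceptual sketch of the wrap-and-record flow (illustrative only;
# TraceVerde patches client libraries automatically via import hooks).
from opentelemetry import trace

tracer = trace.get_tracer("genai-otel-sketch")

def traced(create_fn, lookup_cost):
    """Wrap an LLM client's create() so every call emits an OTEL span."""
    def wrapper(*args, **kwargs):
        model = kwargs.get("model", "unknown")
        with tracer.start_as_current_span(f"chat {model}") as span:
            span.set_attribute("gen_ai.system", "openai")
            span.set_attribute("gen_ai.request.model", model)
            response = create_fn(*args, **kwargs)  # execute the original call
            usage = getattr(response, "usage", None)
            if usage is not None:
                span.set_attribute("gen_ai.usage.prompt_tokens", usage.prompt_tokens)
                span.set_attribute("gen_ai.usage.completion_tokens", usage.completion_tokens)
                # hypothetical pricing lookup standing in for the cost enricher
                span.set_attribute("gen_ai.usage.cost.total", lookup_cost(model, usage))
            return response
    return wrapper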
Traction
- 5,000+ downloads in the first month
- Used by SMOLTRACE for all evaluation instrumentation
- Production-ready with comprehensive test coverage
Project 2: SMOLTRACE
Why it exists: Evaluation shouldn't be a one-time script that dies when the terminal closes. SMOLTRACE runs agent benchmarks (tool-calling, code generation) and writes 4 versioned HF Datasets per run—so every eval becomes reproducible, shareable, and queryable data instead of lost stdout.
The Problem
You want to evaluate your AI agent against a benchmark. Existing solutions are either:
- Too complex (require deep framework knowledge)
- Too rigid (only work with specific models)
- Too opaque (no visibility into what's happening)
And critically—they don't let you evaluate on your own domain-specific tasks.
The Solution
SMOLTRACE is a lightweight evaluation engine built on HuggingFace's smolagents library. It evaluates agents on tool-calling and code-generation capabilities while generating structured datasets for analysis.
Installation
pip install smoltrace
Usage
# Evaluate an API model
smoltrace-eval \
--model openai/gpt-4 \
--provider litellm \
--agent-type both \
--enable-otel \
--results-repo kshitij/smoltrace-results-gpt4 \
--traces-repo kshitij/smoltrace-traces-gpt4 \
--leaderboard-repo kshitij/smoltrace-leaderboard
# Evaluate a local GPU model
smoltrace-eval \
--model meta-llama/Llama-3.1-8B \
--provider transformers \
--enable-gpu-metrics \
--hardware gpu_h200
The Four Dataset Structure
Every evaluation generates four structured datasets:
1. Leaderboard Dataset
Aggregate statistics for comparison:
{
"run_id": "uuid",
"model": "openai/gpt-4",
"success_rate": 95.8,
"total_tests": 100,
"avg_duration_ms": 3200.0,
"total_cost_usd": 0.05,
"co2_emissions_g": 0.22,
"gpu_utilization_avg": 67.5
}
2. Results Dataset
Per-test-case outcomes:
{
"task_id": "task_001",
"success": True,
"response": "The weather in Tokyo is 18°C",
"tool_called": "get_weather",
"execution_time_ms": 2450.0,
"tokens": 234,
"cost_usd": 0.0012,
"trace_id": "trace_abc123"
}
3. Traces Dataset
OpenTelemetry spans for deep debugging:
{
"trace_id": "trace_abc123",
"spans": [
{"name": "Agent Execution", "duration_ms": 2450},
{"name": "LLM Call - Reasoning", "tokens": 123},
{"name": "Tool Call - get_weather", "latency_ms": 890}
]
}
4. Metrics Dataset
GPU utilization and environmental data:
{
"gpu_utilization": 67.5,
"gpu_memory_mib": 512.34,
"co2_emissions_g": 0.22
}
Custom Dataset Support
The killer feature: evaluate on your own domain tasks.
# Your custom task format
{
"id": "finance_001",
"prompt": "What's the current stock price of AAPL?",
"expected_tool": "get_stock_price",
"difficulty": "easy",
"agent_type": "tool",
"expected_keywords": ["AAPL", "price", "$"]
}
This lets you:
- Benchmark models on industry-specific tasks
- Test proprietary tools and APIs
- Track improvement as you fine-tune
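Because every run lands on the Hub as a regular dataset, you can also pull the results back down and query them like any other HF dataset. A minimal sketch, assuming the repo names from the CLI example above and the default train split:
from datasets import load_dataset

# Load the leaderboard written by smoltrace-eval
leaderboard = load_dataset("kshitij/smoltrace-leaderboard", split="train")

# Rank runs by success rate and print a quick summary
for row in sorted(leaderboard, key=lambda r: r["success_rate"], reverse=True)[:5]:
    print(row["model"], row["success_rate"], row["total_cost_usd"])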
Traction
- 2,000+ downloads in the first week
- Generates datasets used by TraceMind MCP Server
- Supports both API and GPU-based evaluations
Project 3: TraceMind MCP Server
Why it exists: Datasets are just data—they don't tell you what to do. TraceMind MCP Server wraps 11 tools around SMOLTRACE datasets and uses Gemini to turn raw eval data into actionable insights: "Which model should I use?", "Why did this trace fail?", "How much will 1000 tests cost?"
The Problem
You've run evaluations with SMOLTRACE. Now you have datasets. But datasets are just data—they don't tell you:
- What patterns exist in the failures?
- Which model should you choose for production?
- How much will scaling cost?
You need intelligence, not just data.
The Solution
TraceMind MCP Server provides 17 MCP components (11 tools + 3 resources + 3 prompts) that transform raw evaluation data into actionable insights using Google Gemini 2.5 Flash.
Why MCP?
The Model Context Protocol (MCP) is a standard for exposing tools to AI assistants. By implementing TraceMind as an MCP server, it can be used by:
- Claude Desktop
- VSCode/Cursor with MCP extensions
- Any MCP-compatible client
- Other AI agents (like TraceMind-AI!)
The 11 Tools
AI-Powered Analysis (5 tools)
1. analyze_leaderboard
# Input
analyze_leaderboard(
leaderboard_repo="kshitij/smoltrace-leaderboard",
metric_focus="cost",
top_n=5
)
# Output: AI-generated insights
"""
Based on 51 evaluations:
**Top Performers by Cost Efficiency:**
1. GPT-4o-nano: $0.002/run (93.4% accuracy)
2. Qwen3-MoE: $0.003/run (91.2% accuracy)
3. Llama-3.1-8B: $0.001/run (89.8% accuracy)
**Key Insight:** GPT-4o-nano offers the best cost/accuracy
tradeoff at 6x cheaper than GPT-4 with only 2.4% accuracy drop.
**Recommendation:** For cost-sensitive production workloads,
consider GPT-4o-nano or Llama-3.1-8B on GPU.
"""
2. debug_trace
# Input
debug_trace(
trace_id="trace_abc123",
traces_repo="kshitij/smoltrace-traces-gpt4",
question="Why was the tool called twice?"
)
# Output: AI-powered trace analysis
"""
Based on the trace analysis:
1. **First tool call (span_003)**: Called `search_web` at 14:23:19
- Reasoning: Initial results were outdated (result_age: "2 days")
2. **Second tool call (span_005)**: Called `search_web` at 14:23:21
- Modified query: Added "latest" keyword
- Result: Fresh data retrieved
**Root Cause:** The agent's iterative refinement behavior.
This is expected for complex queries requiring current data.
**Performance Impact:** +2.09s latency, +$0.0003 cost
"""
3. estimate_cost
# Input
estimate_cost(
model="openai/gpt-4",
agent_type="tool",
num_tests=500,
hardware="auto"
)
# Output: Detailed cost breakdown
"""
**Cost Estimate for 500 Tests**
| Component | Cost |
|-----------|------|
| Input Tokens | $1.25 |
| Output Tokens | $2.50 |
| Tool Calls | $0.15 |
| **Total** | **$3.90** |
**Hardware:** CPU (API model detected)
**Estimated Duration:** 45 minutes
**CO2 Emissions:** 0g (API provider offset)
**Optimization Tips:**
- Use GPT-4o-nano for 6x cost reduction
- Batch similar queries to reduce overhead
"""
4. compare_runs
# Compare two evaluation runs
compare_runs(
run_id_1="run_gpt4_20251120",
run_id_2="run_llama_20251120",
comparison_focus="cost"
)
5. analyze_results
# Deep dive into test results
analyze_results(
results_repo="kshitij/smoltrace-results-gpt4",
analysis_focus="failures",
max_rows=100
)
Token-Optimized Tools (2 tools)
A critical design decision: avoid token bloat.
The full leaderboard has 51 runs. Returning all data to an LLM would waste tokens. Instead:
6. get_top_performers - Returns only the top N models (90% token reduction)
7. get_leaderboard_summary - Returns aggregate stats only (99% token reduction)
# Instead of 51 full records...
get_leaderboard_summary(leaderboard_repo="...")
# Returns just:
{
"total_runs": 51,
"unique_models": 12,
"avg_success_rate": 87.3,
"best_model": "gpt-4",
"most_cost_effective": "gpt-4o-nano"
}
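The underlying idea is simple: aggregate locally and hand the model only the digest. A rough sketch of what such a summary computation could look like (field names follow the leaderboard schema shown earlier; this is not the server's actual implementation):
from datasets import load_dataset

# Aggregate the full leaderboard locally...
rows = load_dataset("kshitij/smoltrace-leaderboard", split="train")
summary = {
    "total_runs": len(rows),
    "unique_models": len(set(rows["model"])),
    "avg_success_rate": round(sum(rows["success_rate"]) / len(rows), 1),
    "best_model": max(rows, key=lambda r: r["success_rate"])["model"],
}
# ...so only this small dict ever reaches the model's context window.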
Data Management Tools (4 tools)
8. get_dataset - Load any SMOLTRACE dataset as JSON
9. generate_synthetic_dataset - Create domain-specific test tasks using AI
10. generate_prompt_template - Generate smolagents prompt templates
11. push_dataset_to_hub - Upload datasets to the HuggingFace Hub
Synthetic Dataset Generation
One of the most powerful features:
generate_synthetic_dataset(
domain="food-delivery",
tool_names="search_restaurants,view_menu,place_order,track_delivery",
num_tasks=100,
difficulty_distribution="balanced"
)
This generates 100 realistic test tasks for your domain:
{
"id": "food_001",
"prompt": "Order a margherita pizza from Antonio's for delivery",
"expected_tool": "place_order",
"difficulty": "easy",
"agent_type": "tool"
}
Batched Generation: Requests >20 tasks are automatically split into parallel batches for efficiency.
MCP Resources & Prompts
Beyond tools, the server exposes:
3 Resources (direct data access):
- leaderboard://{repo} - Raw leaderboard JSON
- trace://{trace_id}/{repo} - Raw trace data
- cost://model/{model} - Pricing information
3 Prompts (standardized templates):
- analysis_prompt - For leaderboard/cost/performance analysis
- debug_prompt - For trace debugging scenarios
- optimization_prompt - For optimization recommendations
Technical Architecture
┌────────────────────────────────────────────────────────────┐
│                    TraceMind MCP Server                    │
├────────────────────────────────────────────────────────────┤
│  Gradio App (mcp_server=True)                              │
│  ├── @gr.mcp.tool() decorated functions (11)               │
│  ├── @gr.mcp.resource() decorated functions (3)            │
│  └── @gr.mcp.prompt() decorated functions (3)              │
├────────────────────────────────────────────────────────────┤
│  MCP Transport Layer                                       │
│  ├── SSE Endpoint: /gradio_api/mcp/sse                     │
│  └── HTTP Endpoint: /gradio_api/mcp/                       │
├────────────────────────────────────────────────────────────┤
│  Core Services                                             │
│  ├── GeminiClient (gemini-2.5-flash-lite)                  │
│  ├── HuggingFace Datasets API                              │
│  └── Cost Calculator (340+ model pricing)                  │
└────────────────────────────────────────────────────────────┘
Connecting to the Server
Auto-Config (Easiest):
- Visit https://huggingface.co/settings/mcp
- Add: MCP-1st-Birthday/TraceMind-mcp-server
- Copy the config into your MCP client
Manual Config (Claude Desktop):
{
"mcpServers": {
"tracemind": {
"url": "https://mcp-1st-birthday-tracemind-mcp-server.hf.space/gradio_api/mcp/sse",
"transport": "sse"
}
}
}
Demo Videos
Watch the MCP server in action with Claude Desktop:
- Quick Demo (5 min): Watch on Loom
- Full Demo (20 min): Watch on Loom
Project 4: TraceMind-AI
Why it exists: Not everyone uses Claude Desktop or wants to write MCP client code. TraceMind-AI is a Gradio app that anyone can open in a browser—with an autonomous agent, leaderboard explorer, trace debugger, and eval job launcher. It's the MCP server made visual and accessible.
The Problem
The MCP server is powerful, but you need an MCP client to use it. What if you want:
- A visual interface for exploring evaluation data?
- An autonomous agent that can answer complex questions?
- The ability to submit new evaluation jobs?
The Solution
TraceMind-AI is a Gradio-based platform that serves as both:
- A visual dashboard for evaluation data
- An autonomous agent that consumes the MCP server
Key Screens
1. Leaderboard View
- Load live evaluation data from HuggingFace
- Sort by accuracy, cost, latency, CO2
- AI-generated insights via MCP
The Leaderboard displays all evaluation runs with sortable columns and AI-powered insights
AI-generated insights analyze trends, top performers, and provide recommendations
2. Agent Chat (The Star Feature)
Ask complex questions and watch the agent work:
User: "Compare the top 3 models by cost-effectiveness
and recommend one for 1000 production tests"
Agent Reasoning:
├── Planning: Need to get top performers, then estimate costs
├── Calling: get_top_performers(top_n=3)
├── Calling: estimate_cost(model="gpt-4o-nano", num_tests=1000)
├── Calling: estimate_cost(model="qwen3-moe", num_tests=1000)
├── Calling: estimate_cost(model="llama-3.1-8b", num_tests=1000)
└── Synthesizing: Comprehensive recommendation
Agent Response:
"Based on my analysis of the top 3 models:
1. **GPT-4o-nano** - $3.90 for 1000 tests (93.4% accuracy)
2. **Qwen3-MoE** - $4.50 for 1000 tests (91.2% accuracy)
3. **Llama-3.1-8B** - $2.10 for 1000 tests (89.8% accuracy)
**Recommendation**: For production, I recommend GPT-4o-nano.
While Llama is cheapest, GPT-4o-nano's 3.6% accuracy advantage
justifies the $1.80 premium for 1000 tests..."
The Agent Chat shows the agent's reasoning process and tool calls in real-time
3. Cost Estimation
- Calculate costs before running evaluations
- Compare API vs GPU options
- See environmental impact
Real-time cost estimation before submitting evaluation jobs
4. Trace Visualization
- OpenTelemetry trace waterfall diagrams
- AI-powered Q&A about specific traces
- GPU metrics overlay
Waterfall view showing the complete execution flow with timing information
5. New Evaluation
- Submit jobs to HuggingFace Jobs or Modal
- Hardware selection (CPU, A10, H200)
- Real-time job monitoring
6. Compare Runs
- Side-by-side comparison of evaluation runs
- Radar chart visualization across multiple dimensions
- AI-powered comparison analysis
Radar chart comparing runs across accuracy, speed, cost efficiency, and more
7. Synthetic Data Generator
- Create custom evaluation datasets
- Generate matching prompt templates
- Push directly to HuggingFace Hub
Generate domain-specific test datasets with AI-powered generation
Dual MCP Integration
TraceMind-AI demonstrates two MCP patterns:
Pattern 1: Direct MCP Client
from mcp import ClientSession
from mcp.client.sse import sse_client

async def call_mcp_tool(tool_name, arguments):
    # Open an SSE connection to the server, then an MCP session over it
    async with sse_client(sse_url) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            result = await session.call_tool(tool_name, arguments)
            return result.content[0].text
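Calling it from a plain script is then a one-liner. A sketch, assuming sse_url points at the Space's /gradio_api/mcp/sse endpoint from the config shown earlier, and reusing the get_leaderboard_summary example:
import asyncio

summary = asyncio.run(call_mcp_tool(
    "get_leaderboard_summary",
    {"leaderboard_repo": "kshitij/smoltrace-leaderboard"},
))
print(summary)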
Pattern 2: Autonomous Agent with MCP Tools
from smolagents import ToolCallingAgent, Tool

# Wrap MCP tools for smolagents.
# (Sketch: a real smolagents Tool subclass also needs name, description,
# inputs, and output_type attributes, omitted here for brevity.)
class MCPTool(Tool):
    def __init__(self, mcp_client, tool_name):
        self.mcp_client = mcp_client
        self.tool_name = tool_name

    def forward(self, **kwargs):
        # Delegate the actual work to the remote MCP server
        return self.mcp_client.call_tool(self.tool_name, kwargs)

# Create an autonomous agent whose tools are MCP tool wrappers
agent = ToolCallingAgent(
    tools=[MCPTool(client, "analyze_leaderboard"), ...],
    model=gemini_model
)

# The agent autonomously decides which tools to call
response = agent.run("Compare the top 3 models...")
Technical Architecture
┌────────────────────────────────────────────────────────────┐
│                        TraceMind-AI                        │
├────────────────────────────────────────────────────────────┤
│  Gradio Frontend (7 Screens)                               │
│  ├── Leaderboard (data table + AI insights)                │
│  ├── Agent Chat (autonomous reasoning)                     │
│  ├── Cost Estimation (calculators)                         │
│  ├── Trace Visualization (waterfall diagrams)              │
│  ├── New Evaluation (job submission)                       │
│  ├── Compare Runs (side-by-side + radar charts)            │
│  └── Synthetic Data Generator                              │
├────────────────────────────────────────────────────────────┤
│  MCP Client Layer                                          │
│  ├── SSE Connection to TraceMind MCP Server                │
│  ├── sync_wrapper.py (async-to-sync bridge)                │
│  └── Tool response caching                                 │
├────────────────────────────────────────────────────────────┤
│  Agent Framework (smolagents)                              │
│  ├── ToolCallingAgent with MCP tools                       │
│  ├── Reasoning visibility (show agent's thought process)   │
│  └── Multi-step task execution                             │
├────────────────────────────────────────────────────────────┤
│  Data Layer                                                │
│  ├── HuggingFace Datasets API                              │
│  ├── HuggingFace Jobs API (job submission)                 │
│  └── Modal API (alternative compute)                       │
└────────────────────────────────────────────────────────────┘
The Complete Data Flow
Let's trace a complete workflow through all four projects:
Step 1: Instrument Your Agent (TraceVerde)
import genai_otel
genai_otel.instrument()
# Your agent code runs normally, but now it's traced
agent.run("What's the weather in Tokyo?")
Step 2: Run Evaluation (SMOLTRACE)
smoltrace-eval \
--model openai/gpt-4 \
--enable-otel \
--results-repo myuser/smoltrace-results-gpt4 \
--leaderboard-repo myuser/smoltrace-leaderboard
Output: 4 HuggingFace datasets created
Step 3: Analyze with AI (TraceMind MCP Server)
In Claude Desktop:
"Analyze the leaderboard at myuser/smoltrace-leaderboard
and tell me which model I should use for production"
Output: AI-powered insights with specific recommendations
Step 4: Deep Dive (TraceMind-AI)
Open TraceMind-AI, ask the autonomous agent:
"Why did test task_045 fail? Show me the trace
and explain what went wrong"
Output: The agent calls multiple MCP tools, retrieves the trace, and provides a detailed explanation with the reasoning visible.
Lessons Learned
1. Token Optimization is Critical
Early versions returned full datasets to the LLM. With 51 leaderboard entries, this wasted thousands of tokens per request.
Solution: Created get_top_performers and get_leaderboard_summary tools that return only what's needed. This reduced token usage by 90-99%.
2. Async-to-Sync Bridging is Tricky
Gradio runs synchronously, but MCP clients are async. The sync_wrapper.py module handles this:
import asyncio

def run_sync(coro):
    """Run async code from a sync context."""
    loop = asyncio.new_event_loop()
    try:
        return loop.run_until_complete(coro)
    finally:
        loop.close()
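In practice, this wrapper is what lets a synchronous Gradio event handler reuse the async MCP client shown in Pattern 1 earlier. A minimal sketch (the handler name is illustrative):
def on_analyze_click(leaderboard_repo):
    # Gradio handlers run synchronously; bridge into the async MCP client
    return run_sync(call_mcp_tool(
        "analyze_leaderboard",
        {"leaderboard_repo": leaderboard_repo, "metric_focus": "cost", "top_n": 5},
    ))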
3. Batched Generation Scales
Generating 100 synthetic tasks in one Gemini call would hit token limits. Instead:
# Split into batches of 20
batches = [tasks[i:i+20] for i in range(0, len(tasks), 20)]
# Generate in parallel
results = await asyncio.gather(*[
generate_batch(batch) for batch in batches
])
4. Show the Agent's Reasoning
Users don't trust black-box answers. TraceMind-AI shows the agent's complete reasoning:
- Which tools it decided to call
- Why it made those decisions
- The intermediate results
This transparency builds trust and helps debugging.
Hackathon Submission
This ecosystem was built for MCP's 1st Birthday Hackathon (November 14-30, 2025).
Track 1: Building MCP (Enterprise)
Project: TraceMind MCP Server
- 17 MCP components (11 tools + 3 resources + 3 prompts)
- AI-powered analysis with Gemini 2.5 Flash
- Token-optimized design for enterprise use
Track 2: MCP in Action (Enterprise)
Project: TraceMind-AI
- Autonomous agent consuming remote MCP server
- Dual MCP integration patterns
- Complete evaluation platform
Why Dual-Track?
The same underlying technology, presented from two angles:
- Track 1: "Here's how to BUILD powerful MCP servers"
- Track 2: "Here's how to CONSUME MCP servers in autonomous agents"
Together, they demonstrate the full MCP ecosystem potential.
Try It Yourself
Quick Start
Connect MCP Server to Claude Desktop:
- Visit https://huggingface.co/settings/mcp
- Add: MCP-1st-Birthday/TraceMind-mcp-server
Try TraceMind-AI:
- Visit https://huggingface.co/spaces/MCP-1st-Birthday/TraceMind
- Ask the agent a question about agent evaluation
Run Your Own Evaluations:
pip install smoltrace
smoltrace-eval --model openai/gpt-4 --enable-otel
Links
| Project | Live Demo | GitHub | PyPI |
|---|---|---|---|
| TraceVerde | - | GitHub | PyPI |
| SMOLTRACE | - | GitHub | PyPI |
| TraceMind MCP Server | HuggingFace | GitHub | - |
| TraceMind-AI | HuggingFace | GitHub | - |
Demo Videos
- MCP Server Quick Demo (5 min): Watch on Loom
- MCP Server Full Demo (20 min): Watch on Loom
Conclusion
Building the TraceMind ecosystem taught me that the real value isn't in the data—it's in the intelligence layer on top.
Anyone can generate evaluation metrics. The hard part is:
- Understanding why failures happen
- Making informed decisions about which model to use
- Predicting costs before committing resources
By combining OpenTelemetry instrumentation (TraceVerde), structured evaluation (SMOLTRACE), AI-powered analysis (TraceMind MCP Server), and an autonomous agent interface (TraceMind-AI), we've created a complete solution for AI agent evaluation.
The Model Context Protocol makes this possible by providing a standard way to expose tools to AI assistants. Whether you're using Claude Desktop, building your own agent, or integrating with other MCP clients, TraceMind's tools are available.
The future of AI development is observable, measurable, and intelligent.
Acknowledgments
- Anthropic - For creating MCP and hosting this hackathon
- Gradio - For native MCP support that made this possible
- HuggingFace - For Spaces, Datasets, and the smolagents library
- Google - For Gemini 2.5 Flash powering all AI analysis
- Modal - For providing infrastructure for evaluations
Built with ❤️ for MCP's 1st Birthday Hackathon
Questions? Feedback? Reach out on GitHub, find me in the HuggingFace Discord, or follow me on LinkedIn!