Evaluation Harness

Test your multi-turn conversation system at scale. The cyclic conversation graph becomes a node inside an evaluation DAG.

Why This Example?

This example shows the flip side of the natural hierarchy: a cycle (the conversation loop) nested inside a DAG (the evaluation pipeline).

Same graph. Different context. Build once, reuse everywhere.

The Architecture

┌─────────────────────────────────────────────────────────────────┐
│                    EVALUATION PIPELINE (DAG)                    │
│                                                                 │
│  load_test_cases → conversation → score → aggregate → report   │
│                         │                                       │
│                         ▼                                       │
│              ┌─────────────────────┐                           │
│              │  CONVERSATION LOOP  │                           │
│              │     (cyclic)        │                           │
│              │                     │                           │
│              │  rag → accumulate   │                           │
│              │   ↑         ↓       │                           │
│              │   └── continue? ────┘                           │
│              └─────────────────────┘                           │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Complete Implementation

Test Data Format
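
The original test data listing is not reproduced here. As a minimal sketch, assume each test case carries an id, a list of user turns, and keywords expected in the answers. The `EvalCase` class and all field names are hypothetical, chosen only for illustration. A JSON test file could then be parsed like this:

```python
import json
from dataclasses import dataclass, field

# Hypothetical schema: these field names are illustrative assumptions,
# not the format used by the original example.
@dataclass(frozen=True)
class EvalCase:
    id: str
    turns: list[str]
    expected_keywords: list[str] = field(default_factory=list)

SAMPLE = """
[
  {"id": "refund-policy",
   "turns": ["What is your refund policy?", "Does it cover digital goods?"],
   "expected_keywords": ["refund", "digital"]},
  {"id": "greeting", "turns": ["Hello!"]}
]
"""

def load_test_cases(raw: str) -> list[EvalCase]:
    """Parse a JSON array of test-case objects into EvalCase records."""
    return [EvalCase(**item) for item in json.loads(raw)]

cases = load_test_cases(SAMPLE)
```

Keeping each case as a small immutable record makes it straightforward to pass cases between pipeline stages without accidental mutation.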

Key Patterns Demonstrated

1. Cycle Inside DAG

The conversation graph (cyclic) runs inside the evaluation_pipeline (DAG):
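
The nesting can be sketched in plain Python, without assuming any particular graph library. The names `rag`, `run_conversation`, `score`, and `evaluation_pipeline` below are stand-ins that mirror the diagram, not the original implementation. The loop lives entirely inside one function, and the outer pipeline calls that function like any other acyclic stage:

```python
# Hypothetical stand-in: a real rag step would retrieve documents and
# call a model. Here it only echoes the question.
def rag(question: str, history: list[str]) -> str:
    return f"answer to: {question}"

def run_conversation(turns: list[str], max_turns: int = 10) -> list[str]:
    """Cyclic part: rag -> accumulate -> continue? until the turns run out."""
    history: list[str] = []
    pending = list(turns)
    # "continue?": stop when no turns remain or the turn budget is spent.
    while pending and len(history) < 2 * max_turns:
        question = pending.pop(0)
        answer = rag(question, history)   # rag
        history += [question, answer]     # accumulate
    return history

def score(history: list[str]) -> float:
    """Fraction of turns that received a non-empty answer."""
    answers = history[1::2]
    return sum(bool(a) for a in answers) / len(answers) if answers else 0.0

def evaluation_pipeline(test_cases: list[list[str]]) -> dict:
    """Acyclic part: each stage runs once, in order."""
    histories = [run_conversation(turns) for turns in test_cases]  # conversation
    scores = [score(h) for h in histories]                         # score
    mean = sum(scores) / len(scores) if scores else 0.0            # aggregate
    return {"n": len(scores), "mean_score": mean}                  # report

report = evaluation_pipeline([["hi", "tell me more"], ["bye"]])
```

From the pipeline's point of view the conversation is a single step with inputs and outputs. Only the inner function knows that it loops.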

2. Same Graph, Different Context

The conversation graph is the same one used in production. We're just running it in an evaluation context:
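
One way to picture this, as a sketch under assumed names (`conversation`, `scripted`, and `echo_respond` are hypothetical): the conversation logic takes its source of user messages as a parameter. Production supplies a live source, while evaluation supplies scripted turns.

```python
from typing import Callable, Optional

def conversation(
    next_user_message: Callable[[], Optional[str]],
    respond: Callable[[str, list], str],
) -> list[tuple[str, str]]:
    """Run turns until the message source is exhausted (returns None)."""
    transcript: list[tuple[str, str]] = []
    while (msg := next_user_message()) is not None:
        transcript.append((msg, respond(msg, transcript)))
    return transcript

def scripted(turns: list[str]) -> Callable[[], Optional[str]]:
    """Evaluation context: replay a fixed list of user turns."""
    it = iter(turns)
    return lambda: next(it, None)

def echo_respond(msg: str, transcript: list) -> str:
    # Stand-in for the real response step; numbers each reply by turn.
    return f"[{len(transcript) + 1}] {msg}"

# In production, next_user_message would read from a UI or socket instead.
transcript = conversation(scripted(["hi", "bye"]), echo_respond)
```

The conversation function is unchanged between the two contexts. Only the injected message source differs.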

3. Parallel Test Execution

Run multiple test conversations concurrently:
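
A minimal sketch of concurrent execution using only `asyncio` from the standard library. The `run_conversation` coroutine is a placeholder for the real conversation, and the `max_concurrency` limit is an assumption added here to avoid overloading a backing model or API:

```python
import asyncio

async def run_conversation(turns: list[str]) -> list[str]:
    """Placeholder conversation: one awaited call per turn."""
    answers = []
    for turn in turns:
        await asyncio.sleep(0)  # stands in for an awaited model or retrieval call
        answers.append(f"answer to: {turn}")
    return answers

async def run_all(test_cases: list[list[str]], max_concurrency: int = 4) -> list[list[str]]:
    """Run every test conversation concurrently, bounded by a semaphore."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(turns: list[str]) -> list[str]:
        async with semaphore:
            return await run_conversation(turns)

    # gather returns results in the same order as the inputs.
    return await asyncio.gather(*(guarded(turns) for turns in test_cases))

results = asyncio.run(run_all([["a", "b"], ["c"]]))
```

Because `asyncio.gather` preserves input order, each result can be matched back to its test case by position.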

4. Pure Scoring Functions

Scoring is a pure function, so it is easy to test:
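
As an illustration, here is a hypothetical keyword-based scorer (the metric itself is an assumption, not the one from the original example). It depends only on its arguments and has no side effects, so plain assertions are enough to test it:

```python
def keyword_score(response: str, expected_keywords: list[str]) -> float:
    """Fraction of expected keywords found in the response (case-insensitive)."""
    if not expected_keywords:
        return 1.0  # nothing was required, so nothing is missing
    text = response.lower()
    hits = sum(1 for kw in expected_keywords if kw.lower() in text)
    return hits / len(expected_keywords)

# Same inputs always give the same output: no mocks or fixtures needed.
assert keyword_score("Refunds cover digital goods", ["refund", "digital"]) == 1.0
assert keyword_score("Hello there", ["refund", "digital"]) == 0.0
```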

Extending the Pattern

A/B Testing

Compare two conversation implementations:
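
A sketch of the comparison, assuming each implementation can be called with a list of user turns and returns a list of responses. The two variants and the scoring rule below are toy placeholders:

```python
from typing import Callable

# Toy variants standing in for two real conversation implementations.
def conversation_a(turns: list[str]) -> list[str]:
    return [f"A: {t}" for t in turns]

def conversation_b(turns: list[str]) -> list[str]:
    return [f"B: {t} (with sources)" for t in turns]

def score(responses: list[str]) -> float:
    """Placeholder metric: fraction of responses that mention sources."""
    if not responses:
        return 0.0
    return sum("sources" in r for r in responses) / len(responses)

def compare(
    test_cases: list[list[str]],
    variants: dict[str, Callable[[list[str]], list[str]]],
) -> dict[str, float]:
    """Run every variant on the same test cases and return mean scores."""
    results = {}
    for name, run in variants.items():
        scores = [score(run(turns)) for turns in test_cases]
        results[name] = sum(scores) / len(scores) if scores else 0.0
    return results

summary = compare([["q1"], ["q2", "q3"]], {"a": conversation_a, "b": conversation_b})
```

Running both variants on identical test cases with an identical scorer keeps the comparison fair: the only thing that varies is the conversation implementation.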

Regression Testing

Compare against baseline responses:
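
One possible approach, sketched with `difflib.SequenceMatcher` from the standard library: flag any test case whose current response drifts too far from the stored baseline. The similarity measure and the `0.8` threshold are assumptions; a semantic metric may suit real responses better.

```python
from difflib import SequenceMatcher

def find_regressions(
    baseline: dict[str, str],
    current: dict[str, str],
    min_similarity: float = 0.8,  # assumed threshold, tune per project
) -> list[str]:
    """Return ids of test cases whose response diverged from the baseline."""
    regressions = []
    for case_id, expected in baseline.items():
        actual = current.get(case_id, "")  # a missing response counts as a regression
        similarity = SequenceMatcher(None, expected, actual).ratio()
        if similarity < min_similarity:
            regressions.append(case_id)
    return regressions

baseline = {"refund": "Refunds take 5 days.", "shipping": "We ship worldwide."}
current = {"refund": "Refunds take 5 days.", "shipping": "No idea."}
regressed = find_regressions(baseline, current)
```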

CI Integration
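
The original CI configuration is not reproduced here. As a hedged sketch, a CI job can gate on the evaluation by turning the report into a process exit code. The report shape (`mean_score`) and the threshold are assumptions:

```python
def ci_exit_code(report: dict, min_mean_score: float = 0.8) -> int:
    """Return 0 when the evaluation passes the threshold, 1 otherwise."""
    passed = report["mean_score"] >= min_mean_score
    status = "PASS" if passed else "FAIL"
    print(f"{status}: mean_score={report['mean_score']:.2f} "
          f"(threshold {min_mean_score:.2f})")
    return 0 if passed else 1

# In a CI entry point, the result would be passed to sys.exit(), e.g.
#   sys.exit(ci_exit_code(report))
# so that a failing evaluation fails the build.
code = ci_exit_code({"mean_score": 0.91})
```

Most CI systems treat a non-zero exit code as a failed step, so no CI-specific integration is needed beyond running the script.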

What's Next?
