Archetype: Benchmark evaluation product

Use when

Buyers must pick among vendors, models, or strategies using apples-to-apples evidence.

Common users

  • ML platform leads
  • Procurement science teams
  • Competition governance boards

Minimal MVP

  • Frozen evaluation dataset(s) with licenses noted
  • Runner that logs a hardware + software fingerprint (see the sketch after this list)
  • Leaderboard with confidence intervals
  • Regression tests when baselines move
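
The runner's fingerprint is what makes results comparable across machines and reruns. A minimal sketch of what that capture might record, assuming a Python harness; the field names and the fingerprint.json filename are illustrative, not a required schema:

```python
"""Sketch of a run fingerprint for a benchmark runner (illustrative schema)."""
import json
import platform
import subprocess
import sys
from datetime import datetime, timezone


def collect_fingerprint() -> dict:
    """Capture enough hardware/software state to tie results to an environment."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "python": sys.version,
        "platform": platform.platform(),
        "machine": platform.machine(),
        "processor": platform.processor(),
        # Hypothetical: record the harness commit so results map back to code.
        "git_sha": subprocess.run(
            ["git", "rev-parse", "HEAD"], capture_output=True, text=True
        ).stdout.strip(),
        # Pinned package versions for the environment that produced the run.
        "packages": subprocess.run(
            [sys.executable, "-m", "pip", "freeze"], capture_output=True, text=True
        ).stdout.splitlines(),
    }


if __name__ == "__main__":
    with open("fingerprint.json", "w") as f:
        json.dump(collect_fingerprint(), f, indent=2)
```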

Required files

  • product.yaml (hypothetical example after this list)
  • product-brief.md
  • data-contract.md
  • evaluation.md
  • demo.md
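
The archetype does not fix a schema for product.yaml. A hypothetical minimal example, with every key invented for illustration, tying together the frozen datasets, the runner fingerprint, and the leaderboard uncertainty:

```yaml
# Hypothetical product.yaml for a benchmark evaluation product.
# Every key below is illustrative, not a required schema.
name: vendor-model-benchmark
archetype: benchmark-evaluation
datasets:
  - id: support-tickets-v3          # frozen snapshot, never mutated in place
    license: CC-BY-4.0
    sha256: "<checksum of the frozen archive>"
metrics:
  primary: macro_f1
  secondary: [latency_p95_ms, cost_per_1k_requests_usd]
runner:
  entrypoint: python run_benchmark.py
  fingerprint: fingerprint.json     # hardware + software capture per run
leaderboard:
  confidence_interval: bootstrap_95
```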

Evaluation

  • Statistical significance on the primary metric (see the sketch after this list)
  • Robustness sweeps (noise, latency)
  • Reproducibility score (rerun variance)
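
A minimal sketch of how these checks might be computed, assuming per-example scores from each system on the same frozen dataset; the helper names, resample count, and seed are placeholders rather than a required harness API:

```python
"""Sketch of evaluation checks: bootstrap CI, paired significance, rerun variance."""
import numpy as np


def bootstrap_ci(scores: np.ndarray, n_resamples: int = 1000, alpha: float = 0.05):
    """Bootstrap confidence interval on the mean of the primary metric."""
    rng = np.random.default_rng(0)
    means = np.array([
        rng.choice(scores, size=len(scores), replace=True).mean()
        for _ in range(n_resamples)
    ])
    return np.quantile(means, alpha / 2), np.quantile(means, 1 - alpha / 2)


def paired_significance(a: np.ndarray, b: np.ndarray, n_resamples: int = 1000) -> float:
    """Rough bootstrap p-value that system A beats system B on the same examples."""
    rng = np.random.default_rng(0)
    diffs = a - b
    boot = np.array([
        rng.choice(diffs, size=len(diffs), replace=True).mean()
        for _ in range(n_resamples)
    ])
    return float((boot <= 0).mean())  # fraction of resamples where A does not win


def rerun_variance(run_means: list[float]) -> float:
    """Reproducibility score: variance of the primary metric across reruns."""
    return float(np.var(run_means, ddof=1))
```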

Reference products

  • GLUE-style leaderboards adapted to private data

Common mistakes

  • Leaderboards without versioned data
  • Cherry-picked slices
  • Ignoring cost-to-serve

Agent prompt

You are building a benchmark microproduct. Freeze datasets, document licensing, script the harness, and only then publish comparative tables with uncertainty estimates.