
Eval Pipelines

Evaluation pipelines let you score model outputs systematically. Define scoring functions, run them against a prompt dataset, and track quality across model versions and prompt changes.

Overview

An eval pipeline has three parts: a dataset of inputs, a set of scoring functions, and a runner that maps inputs through the model and scores each output.

Dataset: A list of prompt inputs, optionally with expected outputs for reference scoring.
Scorers: Functions that accept a prompt + output and return a numeric score (0–1) or a pass/fail.
Runner: Calls vault.infer() for each item and collects scorer results into a results object.
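The three parts can be sketched as plain types and a small generic runner. This is a minimal illustration, not the SDK's API: the `EvalItem`, `EvalResult`, and `runPipeline` names are made up here, and `infer` stands in for a `vault.infer()`-style call.

```typescript
// Illustrative shapes for the three parts of a pipeline.
interface EvalItem {
  prompt: string;
  expected?: string; // optional reference output
}

type Scorer = (prompt: string, output: string) => number;

interface EvalResult {
  prompt: string;
  output: string;
  scores: number[];
  average: number;
}

// The runner wires the parts together: call the model for each item,
// run every scorer on the output, and collect the results.
async function runPipeline(
  dataset: EvalItem[],
  scorers: Scorer[],
  infer: (prompt: string) => Promise<string>,
): Promise<EvalResult[]> {
  const results: EvalResult[] = [];
  for (const item of dataset) {
    const output = await infer(item.prompt);
    const scores = scorers.map((s) => s(item.prompt, output));
    results.push({
      prompt: item.prompt,
      output,
      scores,
      average: scores.reduce((a, b) => a + b, 0) / scores.length,
    });
  }
  return results;
}
```

Keeping the model call behind a plain `infer` function also makes the runner easy to test with a stub.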

Defining Evals

A scorer is a function that takes the prompt and the model output and returns a score. Keep scorers pure and synchronous where possible.

evals/scorers.ts
export type Scorer = (prompt: string, output: string) => number;

// Score by output length: penalize over-verbose responses
export const brevityScorer: Scorer = (_prompt, output) => {
  const words = output.trim().split(/\s+/).length;
  return words <= 80 ? 1 : Math.max(0, 1 - (words - 80) / 200);
};

// Score by keyword presence
export function keywordScorer(keywords: string[]): Scorer {
  return (_prompt, output) => {
    const found = keywords.filter((kw) =>
      output.toLowerCase().includes(kw.toLowerCase())
    );
    return found.length / keywords.length;
  };
}

// LLM-as-judge scorer (async)
export async function llmJudgeScorer(
  vault: VaultClient,
  prompt: string,
  output: string,
): Promise<number> {
  const result = await vault.infer({
    model: 'vault-3-turbo',
    prompt: `Rate this response 0-10. Prompt: ${prompt}\nResponse: ${output}\nReturn only the number.`,
    maxTokens: 4,
  });
  const score = parseInt(result.text.trim(), 10);
  // Clamp in case the judge returns something outside 0-10
  return isNaN(score) ? 0 : Math.min(1, Math.max(0, score / 10));
}
LLM-as-judge scorers are powerful but slow and costly. Use them selectively. Run them on a sample of outputs rather than the full dataset.
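One way to run a judge on a sample rather than the full dataset is a small filtering helper. This is a sketch; `sampleForJudging` and its `sampleRate` parameter are illustrative names, and the optional `rng` argument exists so the sampling can be made deterministic in tests.

```typescript
// Keep each item with probability `sampleRate` (0-1). Pass a seeded or
// constant `rng` for reproducible sampling; defaults to Math.random.
function sampleForJudging<T>(
  items: T[],
  sampleRate: number,
  rng: () => number = Math.random,
): T[] {
  return items.filter(() => rng() < sampleRate);
}
```

Run the cheap scorers on everything, then pass only `sampleForJudging(results, 0.1)` through the LLM judge.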

Running a Pipeline

The runner maps your dataset through the model and collects scores. The example below processes items sequentially; for larger datasets, add concurrency control to stay within rate limits.

evals/run.ts
import { vault } from '@/lib/vault';
import { brevityScorer, keywordScorer } from './scorers';

const dataset = [
  { prompt: 'Summarize the Vault SDK in one sentence.' },
  { prompt: 'What is a workspace in Vault?' },
  { prompt: 'How do I stream a response?' },
];

const scorers = [
  brevityScorer,
  keywordScorer(['Vault', 'SDK', 'API']),
];

async function runEvals() {
  const results = [];

  for (const item of dataset) {
    const result = await vault.infer({
      model: 'vault-3-turbo',
      prompt: item.prompt,
      maxTokens: 128,
    });

    const scores = scorers.map((scorer) =>
      scorer(item.prompt, result.text)
    );

    results.push({
      prompt:  item.prompt,
      output:  result.text,
      scores,
      average: scores.reduce((a, b) => a + b, 0) / scores.length,
    });
  }

  return results;
}

runEvals().then(console.log);
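For larger datasets, the sequential loop above becomes slow, but firing every request at once risks rate limits. A bounded worker pool is a middle ground. This is a dependency-free sketch (the `mapWithConcurrency` name is ours, not part of the SDK); a library such as p-limit achieves the same thing.

```typescript
// Run `fn` over `items` with at most `limit` calls in flight at once.
// Workers pull the next index from a shared counter; since JavaScript is
// single-threaded and there is no await between read and increment, the
// counter needs no locking.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  async function worker(): Promise<void> {
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i]);
    }
  }
  await Promise.all(
    Array.from({ length: Math.min(limit, items.length) }, worker),
  );
  return results;
}
```

In the runner, the `for...of` loop would become something like `mapWithConcurrency(dataset, 4, (item) => ...)`, keeping at most four inference calls in flight.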

Analyzing Results

Once you have results, compare across runs. Track the average score per scorer, flag any items below a threshold, and diff scores between model versions.

avg_score: Mean across all scorers and dataset items. The headline metric.
pass_rate: Percentage of items with an average score above your threshold (e.g. 0.7).
scorer_breakdown: Per-scorer averages to identify which dimensions are regressing.
failures: Items with a score below the threshold. Review these manually.
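These metrics fall out of the results array with a short reduction. A minimal sketch, assuming the result shape produced by the runner above; the `summarize` function and default threshold are illustrative, not part of the SDK.

```typescript
interface EvalResult {
  prompt: string;
  output: string;
  scores: number[]; // one entry per scorer, same order for every item
  average: number;
}

// Compute avg_score, pass_rate, scorer_breakdown, and failures.
function summarize(results: EvalResult[], threshold = 0.7) {
  if (results.length === 0) throw new Error('no results to summarize');
  const avgScore =
    results.reduce((sum, r) => sum + r.average, 0) / results.length;
  const failures = results.filter((r) => r.average < threshold);
  const passRate = (results.length - failures.length) / results.length;
  // Per-scorer mean, indexed the same way as each item's scores array.
  const scorerBreakdown = results[0].scores.map(
    (_, i) => results.reduce((sum, r) => sum + r.scores[i], 0) / results.length,
  );
  return { avgScore, passRate, scorerBreakdown, failures };
}
```

Persist the summary alongside the model version and prompt revision so runs can be diffed over time.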