# Eval Pipelines
Evaluation pipelines let you score model outputs systematically. Define scoring functions, run them against a prompt dataset, and track quality across model versions and prompt changes.
## Overview
An eval pipeline has three parts: a dataset of inputs, a set of scoring functions, and a runner that maps inputs through the model and scores each output.
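In outline, the three parts fit together as below. This is an illustrative sketch, not Vault SDK exports: the `EvalItem`, `Model`, and `run` names are assumptions, and the model is stood in for by a synchronous function so the shape is easy to see.

```typescript
// Illustrative shapes for the three parts (names are not SDK exports).
type EvalItem = { prompt: string };
type Scorer = (prompt: string, output: string) => number;
type Model = (prompt: string) => string; // sync stand-in for the model call

// A minimal runner: map each input through the model, score each output.
function run(dataset: EvalItem[], model: Model, scorers: Scorer[]) {
  return dataset.map(({ prompt }) => {
    const output = model(prompt);
    const scores = scorers.map((scorer) => scorer(prompt, output));
    return { prompt, output, scores };
  });
}

// Example with a mock model and a trivial non-empty-output scorer
const results = run(
  [{ prompt: 'hi' }],
  (p) => p.toUpperCase(),
  [(_p, o) => (o.length > 0 ? 1 : 0)],
);
console.log(results[0].output); // "HI"
```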
## Defining Evals
A scorer is a function that takes the prompt and the model output and returns a score, by convention in the range 0 to 1. Keep scorers pure and synchronous where possible.
```typescript
export type Scorer = (prompt: string, output: string) => number;

// Score by output length — penalise over-verbose responses
export const brevityScorer: Scorer = (_prompt, output) => {
  const words = output.trim().split(/\s+/).length;
  return words <= 80 ? 1 : Math.max(0, 1 - (words - 80) / 200);
};
```
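To make the piecewise formula concrete, here is a self-contained check (the scorer body is re-declared so the snippet runs on its own): responses at or under 80 words score 1, then the score decays linearly and bottoms out at 0 once the response is 200 words over the limit.

```typescript
// Re-declared from brevityScorer so this example is self-contained.
const brevity = (output: string): number => {
  const words = output.trim().split(/\s+/).length;
  return words <= 80 ? 1 : Math.max(0, 1 - (words - 80) / 200);
};

console.log(brevity('short answer'));                 // 1 — under 80 words
console.log(brevity(Array(180).fill('w').join(' '))); // 0.5 — 100 words over
console.log(brevity(Array(300).fill('w').join(' '))); // 0 — floored at zero
```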
```typescript
// Score by keyword presence
export function keywordScorer(keywords: string[]): Scorer {
  return (_prompt, output) => {
    const found = keywords.filter((kw) =>
      output.toLowerCase().includes(kw.toLowerCase())
    );
    return found.length / keywords.length;
  };
}
```
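A quick usage check (the factory is re-declared so the snippet is self-contained): the score is simply the fraction of keywords found, matched case-insensitively.

```typescript
// Re-declared from keywordScorer so this example is self-contained.
const keywordScorer =
  (keywords: string[]) =>
  (_prompt: string, output: string): number => {
    const found = keywords.filter((kw) =>
      output.toLowerCase().includes(kw.toLowerCase()),
    );
    return found.length / keywords.length;
  };

const score = keywordScorer(['Vault', 'SDK', 'API'])(
  'q',
  'The Vault SDK wraps the REST API.',
);
console.log(score); // 1 — all three keywords found, case-insensitively
```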
```typescript
// LLM-as-judge scorer (async)
export async function llmJudgeScorer(
  vault: VaultClient,
  prompt: string,
  output: string,
): Promise<number> {
  const result = await vault.infer({
    model: 'vault-3-turbo',
    prompt: `Rate this response 0-10. Prompt: ${prompt}\nResponse: ${output}\nReturn only the number.`,
    maxTokens: 4,
  });
  const score = parseInt(result.text.trim(), 10);
  return isNaN(score) ? 0 : score / 10;
}
```

## Running a Pipeline
The runner maps your dataset through the model and collects the scores. The example below processes items sequentially, which is the simplest way to stay within rate limits; if you parallelize, cap the number of in-flight requests.
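When sequential processing is too slow, one common form of concurrency control is a bounded worker pool. The sketch below is illustrative, not part of the Vault SDK: `mapWithConcurrency` is a hypothetical helper that runs at most `limit` calls at a time.

```typescript
// Illustrative helper (not a Vault SDK export): map over items with at
// most `limit` async calls in flight at once.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  // Each worker claims the next unprocessed index until the list is drained.
  async function worker(): Promise<void> {
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i]);
    }
  }
  await Promise.all(
    Array.from({ length: Math.min(limit, items.length) }, () => worker()),
  );
  return results;
}
```

With this helper, the sequential `for` loop in the runner could become a single `await mapWithConcurrency(dataset, 4, …)` call with the per-item body moved into the callback.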
```typescript
import { vault } from '@/lib/vault';
import { brevityScorer, keywordScorer } from './scorers';

const dataset = [
  { prompt: 'Summarize the Vault SDK in one sentence.' },
  { prompt: 'What is a workspace in Vault?' },
  { prompt: 'How do I stream a response?' },
];

const scorers = [
  brevityScorer,
  keywordScorer(['Vault', 'SDK', 'API']),
];

async function runEvals() {
  const results = [];
  for (const item of dataset) {
    const result = await vault.infer({
      model: 'vault-3-turbo',
      prompt: item.prompt,
      maxTokens: 128,
    });
    const scores = scorers.map((scorer) =>
      scorer(item.prompt, result.text)
    );
    results.push({
      prompt: item.prompt,
      output: result.text,
      scores,
      average: scores.reduce((a, b) => a + b, 0) / scores.length,
    });
  }
  return results;
}

runEvals().then(console.log);
```

## Analyzing Results
Once you have results, compare across runs. Track the average score per scorer, flag any items below a threshold, and diff scores between model versions.
| Metric | Description |
| --- | --- |
| `avg_score` | Mean across all scorers and dataset items. The headline metric. |
| `pass_rate` | Percentage of items with an average score above your threshold (e.g. 0.7). |
| `scorer_breakdown` | Per-scorer averages, to identify which dimensions are regressing. |
| `failures` | Items with a score below the threshold. Review these manually. |
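As a sketch, these metrics can be computed directly from the runner's output. The `Result` shape mirrors the objects the runner pushes, and the 0.7 threshold is just an example; `summarize` is a hypothetical helper, not an SDK export.

```typescript
// Hypothetical summary over runner results (assumes a non-empty array).
type Result = { prompt: string; scores: number[]; average: number };

function summarize(results: Result[], threshold = 0.7) {
  const avgScore =
    results.reduce((sum, r) => sum + r.average, 0) / results.length;
  const failures = results.filter((r) => r.average < threshold);
  const passRate = 1 - failures.length / results.length;
  // Per-scorer averages; scorers are positional, matching the scorers array.
  const scorerCount = results[0].scores.length;
  const scorerBreakdown = Array.from({ length: scorerCount }, (_, i) =>
    results.reduce((sum, r) => sum + r.scores[i], 0) / results.length,
  );
  return { avgScore, passRate, scorerBreakdown, failures };
}
```

Persisting this summary per run makes it straightforward to diff `scorer_breakdown` between model versions and spot which dimension regressed.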