Observability
Instrument your Vault integration with structured logging, distributed tracing, and cost tracking. Know what your models are doing in production before your users notice something is wrong.
Request Logging
Log every inference request with timing and token counts. Use structured JSON so logs are queryable.
lib/vault.ts
import { VaultClient } from '@vault/sdk';
const client = new VaultClient({
apiKey: process.env.VAULT_API_KEY!,
hooks: {
onRequest({ model, prompt }) {
console.log(JSON.stringify({
event: 'vault.request',
model,
promptChars: prompt.length, // character count, not tokens; real token counts arrive in onResponse
ts: Date.now(),
}));
},
onResponse({ model, usage, latencyMs }) {
console.log(JSON.stringify({
event: 'vault.response',
model,
inputTokens: usage.inputTokens,
outputTokens: usage.outputTokens,
latencyMs,
ts: Date.now(),
}));
},
},
});
export { client as vault };

Avoid logging prompt content in production. It may contain PII. Log token counts and model names only.
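If you still need to correlate repeated prompts across requests, one option is to log a short hash of the prompt instead of its content. A minimal sketch; the `promptFingerprint` helper is illustrative, not part of the SDK:

```typescript
import { createHash } from 'crypto';

// Hypothetical helper: a stable, non-reversible fingerprint of the prompt.
// Identical prompts produce identical fingerprints, so requests can be
// grouped in logs without the prompt text itself ever being written out.
export function promptFingerprint(prompt: string): string {
  return createHash('sha256').update(prompt).digest('hex').slice(0, 12);
}
```

Logging `promptFingerprint(prompt)` in the `onRequest` hook keeps logs queryable by prompt identity while remaining PII-free.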
Tracing
Attach a trace ID to each inference call to correlate it with the upstream request in your observability platform.
app/api/infer/route.ts
import { vault } from '@/lib/vault';
import { randomUUID } from 'crypto';
export async function POST(req: Request) {
const traceId = req.headers.get('x-trace-id') ?? randomUUID();
const result = await vault.infer({
model: 'vault-3-turbo',
prompt: await req.text(),
metadata: { traceId },
});
return Response.json({ text: result.text }, {
headers: { 'x-trace-id': traceId },
});
}

Cost Tracking
Use token counts from the response to calculate cost per request. Aggregate daily to track spend trends.
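Given the per-token prices listed below, cost per request is a multiply-and-sum over the usage counts. A sketch; the `requestCost` helper and the `PRICES` map are illustrative, not part of the SDK:

```typescript
// Per-million-token prices (USD), mirroring the pricing table in this section.
const PRICES: Record<string, { input: number; output: number }> = {
  'vault-3-turbo': { input: 0.5, output: 1.5 },
  'vault-3-pro': { input: 3.0, output: 15.0 },
  'vault-3-mini': { input: 0.1, output: 0.3 },
};

// Cost in USD for a single request, from the usage block of the response.
export function requestCost(
  model: string,
  inputTokens: number,
  outputTokens: number,
): number {
  const p = PRICES[model];
  if (!p) throw new Error(`unknown model: ${model}`);
  return (inputTokens * p.input + outputTokens * p.output) / 1_000_000;
}
```

Summing `requestCost` into a daily counter keyed by model gives the spend trend to alert on.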
vault-3-turbo: $0.50 / 1M input tokens, $1.50 / 1M output tokens
vault-3-pro: $3.00 / 1M input tokens, $15.00 / 1M output tokens
vault-3-mini: $0.10 / 1M input tokens, $0.30 / 1M output tokens

Alerts
Set up alerts on these signals to catch problems before they impact users.
p95 latency > 5s: Check model load; consider switching this endpoint to vault-3-mini.
error_rate > 1%: Inspect error codes. 429 means rate limit, 5xx means model instability.
daily_cost > threshold: Audit token usage. Common cause: prompts growing unbounded with context.
output_tokens spike: Check the maxTokens cap. A missing cap allows unbounded completions.
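The first three thresholds can be evaluated from aggregated metrics. A minimal sketch; the `Signals` shape and function name are assumptions, not a Vault API:

```typescript
interface Signals {
  p95LatencyMs: number; // from latencyMs in the response logs
  errorRate: number;    // failed requests / total requests
  dailyCostUsd: number; // aggregated from token counts and the price table
}

// Return the names of any alert conditions the current metrics window trips.
export function trippedAlerts(s: Signals, dailyCostThresholdUsd: number): string[] {
  const alerts: string[] = [];
  if (s.p95LatencyMs > 5000) alerts.push('p95_latency');
  if (s.errorRate > 0.01) alerts.push('error_rate');
  if (s.dailyCostUsd > dailyCostThresholdUsd) alerts.push('daily_cost');
  return alerts;
}
```

Detecting an output_tokens spike needs a baseline window rather than a fixed threshold, so it is omitted from this sketch.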