getsentry/vitest-evals

vitest-evals

End-to-end evaluation framework for AI agents, built on Vitest.

Installation

npm install -D vitest-evals

Quick Start

import { describeEval } from "vitest-evals";

describeEval("deploy agent", {
  data: async () => [
    { input: "Deploy the latest release to production", expected: "deployed" },
    { input: "Roll back the last deploy", expected: "rolled back" },
  ],
  task: async (input) => {
    const response = await myAgent.run(input);
    return response;
  },
  scorers: [
    async ({ output, expected }) => ({
      score: output.toLowerCase().includes(expected.toLowerCase()) ? 1.0 : 0.0,
    }),
  ],
  threshold: 0.8,
});

Tasks

Tasks process inputs and return outputs. Two formats are supported:

// Simple: just return a string
const task = async (input) => "response";

// With tool tracking: return a TaskResult
const task = async (input) => ({
  result: "response",
  toolCalls: [
    { name: "search", arguments: { query: "..." }, result: {...} }
  ]
});
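As a sketch, a task that wraps a tool-using agent and records its calls in a TaskResult might look like this (`runSearch` is a hypothetical helper for illustration, not part of vitest-evals):

```javascript
// Sketch only: `runSearch` is a made-up stand-in for a real tool.
const runSearch = async (query) => ({ hits: [query.toUpperCase()] });

// A task that returns a TaskResult so scorers can inspect tool usage.
const searchTask = async (input) => {
  const toolResult = await runSearch(input);
  return {
    result: `Found ${toolResult.hits.length} result(s)`,
    toolCalls: [
      { name: "search", arguments: { query: input }, result: toolResult },
    ],
  };
};
```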

Test Data

Each test case requires an input field. Use name to give tests a descriptive label:

data: async () => [
  { name: "simple deploy", input: "Deploy to staging" },
  { name: "deploy with rollback", input: "Deploy to prod, roll back if errors" },
],

Additional fields (like expected, expectedTools) are passed through to scorers.
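For example, a custom per-case field (here `maxLength`, an illustrative name rather than a built-in) arrives on the scorer's argument object alongside input and output:

```javascript
// `maxLength` is a custom field on the test case, not a vitest-evals built-in.
// Extra fields on a case are forwarded to each scorer unchanged, e.g.:
//   data: async () => [{ input: "Summarize briefly", maxLength: 80 }],
const maxLengthScorer = async ({ output, maxLength }) => ({
  score: output.length <= maxLength ? 1.0 : 0.0,
});
```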

Lifecycle Hooks

Use beforeEach and afterEach for setup and teardown:

describeEval("agent with database", {
  beforeEach: async () => {
    await db.seed();
  },
  afterEach: async () => {
    await db.clean();
  },
  data: async () => [{ input: "Find recent errors" }],
  task: myAgentTask,
  scorers: [async ({ output }) => ({ score: output.includes("error") ? 1.0 : 0.0 })],
});

Scorers

Scorers evaluate outputs and return a score (0-1). Use built-in scorers or create your own.

ToolCallScorer

Checks whether the expected tools were called with the correct arguments.

import { ToolCallScorer } from "vitest-evals";

describeEval("tool usage", {
  data: async () => [
    {
      input: "Find Italian restaurants",
      expectedTools: [
        { name: "search", arguments: { type: "restaurant" } },
        { name: "filter", arguments: { cuisine: "italian" } },
      ],
    },
  ],
  task: myTask,
  scorers: [ToolCallScorer()],
});

// Strict order and parameters
scorers: [ToolCallScorer({ ordered: true, params: "strict" })];

// Partial matching: don't require every expected tool, but disallow unexpected extras
scorers: [ToolCallScorer({ requireAll: false, allowExtras: false })];

Default behavior:

  • Strict parameter matching (exact equality required)
  • Any order allowed
  • Extra tools allowed
  • All expected tools required
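Those defaults amount to roughly the following logic (a simplified sketch for intuition, not the library's actual implementation; note the JSON-stringify comparison is key-order-sensitive, unlike a proper deep equality):

```javascript
// Simplified sketch of ToolCallScorer's defaults; NOT the real implementation.
// Every expected call must appear somewhere (any order, exact-equal arguments);
// extra tool calls in `actual` are ignored.
const deepEqual = (a, b) => JSON.stringify(a) === JSON.stringify(b);

const matchesDefaults = (expected, actual) =>
  expected.every((exp) =>
    actual.some(
      (act) => act.name === exp.name && deepEqual(act.arguments, exp.arguments)
    )
  );
```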

StructuredOutputScorer

Checks whether the output matches the expected structured data (JSON).

import { StructuredOutputScorer } from "vitest-evals";

describeEval("query generation", {
  data: async () => [
    {
      input: "Show me errors from today",
      expected: {
        dataset: "errors",
        query: "",
        sort: "-timestamp",
        timeRange: { statsPeriod: "24h" },
      },
    },
  ],
  task: myTask,
  scorers: [StructuredOutputScorer()],
});

// Fuzzy matching
scorers: [StructuredOutputScorer({ match: "fuzzy" })];

// Custom validation
scorers: [
  StructuredOutputScorer({
    match: (expected, actual, key) => {
      if (key === "age") return actual >= 18 && actual <= 100;
      return expected === actual;
    },
  }),
];

Custom Scorers

// Inline scorer
const LengthScorer = async ({ output }) => ({
  score: output.length > 50 ? 1.0 : 0.0,
});

// TypeScript scorer with custom options
import { type ScoreFn, type BaseScorerOptions } from "vitest-evals";

interface CustomOptions extends BaseScorerOptions {
  minLength: number;
}

const TypedScorer: ScoreFn<CustomOptions> = async (opts) => ({
  score: opts.output.length >= opts.minLength ? 1.0 : 0.0,
});

AI SDK Integration

See src/ai-sdk-integration.test.ts for a complete example with the Vercel AI SDK.

Transform provider responses to our format:

import { generateText } from "ai";
import { openai } from "@ai-sdk/openai";

const task = async (input) => {
  const { text, steps } = await generateText({
    model: openai("gpt-4o"),
    prompt: input,
    tools: { myTool: myToolDefinition },
  });

  return {
    result: text,
    toolCalls: steps
      .flatMap((step) => step.toolCalls)
      .map((call) => ({
        name: call.toolName,
        arguments: call.args,
      })),
  };
};

Advanced Usage

Using autoevals

For evaluation using the autoevals library:

import { Factuality, ClosedQA } from "autoevals";

scorers: [
  Factuality,
  ClosedQA.partial({
    criteria: "Does the answer mention Paris?",
  }),
];

Skip Tests Conditionally

describeEval("gpt-4 tests", {
  skipIf: () => !process.env.OPENAI_API_KEY,
  // ...
});

Existing Test Suites

For integration with existing Vitest test suites, the .toEval() matcher is available:

Deprecated: .toEval() is deprecated. Prefer describeEval(), which offers better test organization and supports multiple scorers.

import "vitest-evals";

test("capital check", () => {
  const simpleFactuality = async ({ output, expected }) => ({
    score: output.toLowerCase().includes(expected.toLowerCase()) ? 1.0 : 0.0,
  });

  expect("What is the capital of France?").toEval(
    "Paris",
    answerQuestion,
    simpleFactuality,
    0.8
  );
});

Configuration

Separate Eval Configuration

Create vitest.evals.config.ts:

import { defineConfig } from "vitest/config";
import defaultConfig from "./vitest.config";

export default defineConfig({
  ...defaultConfig,
  test: {
    ...defaultConfig.test,
    include: ["src/**/*.eval.{js,ts}"],
  },
});

Run evals separately:

vitest --config=vitest.evals.config.ts

Development

pnpm install
pnpm test
