getsentry/vitest-evals

vitest-evals

End-to-end evaluation framework for AI agents, built on Vitest.

Installation

npm install -D vitest-evals

Quick Start

import { describeEval } from "vitest-evals";

describeEval("deploy agent", {
  data: async () => [
    { input: "Deploy the latest release to production", expected: "deployed" },
    { input: "Roll back the last deploy", expected: "rolled back" },
  ],
  task: async (input) => {
    const response = await myAgent.run(input);
    return response;
  },
  scorers: [
    async ({ output, expected }) => ({
      score: output.toLowerCase().includes(expected.toLowerCase()) ? 1.0 : 0.0,
    }),
  ],
  threshold: 0.8,
});

Tasks

Tasks process inputs and return outputs. Two formats are supported:

// Simple: just return a string
const task = async (input) => "response";

// With tool tracking: return a TaskResult
const task = async (input) => ({
  result: "response",
  toolCalls: [
    { name: "search", arguments: { query: "..." }, result: {...} }
  ]
});
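As a sketch, a task that wraps a tool-using agent and records its calls in a TaskResult might look like this (`runSearch` is a hypothetical helper for illustration, not part of vitest-evals):

```javascript
// Sketch only: `runSearch` is a made-up stand-in for a real tool.
const runSearch = async (query) => ({ hits: [query.toUpperCase()] });

// A task that returns a TaskResult so scorers can inspect tool usage.
const searchTask = async (input) => {
  const toolResult = await runSearch(input);
  return {
    result: `Found ${toolResult.hits.length} result(s)`,
    toolCalls: [
      { name: "search", arguments: { query: input }, result: toolResult },
    ],
  };
};
```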

Test Data

Each test case requires an input field. Use name to give tests a descriptive label:

data: async () => [
  { name: "simple deploy", input: "Deploy to staging" },
  { name: "deploy with rollback", input: "Deploy to prod, roll back if errors" },
],

Additional fields (like expected, expectedTools) are passed through to scorers.
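For example, a custom per-case field (here `maxLength`, an illustrative name rather than a built-in) arrives on the scorer's argument object alongside input and output:

```javascript
// `maxLength` is a custom field on the test case, not a vitest-evals built-in.
// Extra fields on a case are forwarded to each scorer unchanged, e.g.:
//   data: async () => [{ input: "Summarize briefly", maxLength: 80 }],
const maxLengthScorer = async ({ output, maxLength }) => ({
  score: output.length <= maxLength ? 1.0 : 0.0,
});
```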

Lifecycle Hooks

Use beforeEach and afterEach for setup and teardown:

describeEval("agent with database", {
  beforeEach: async () => {
    await db.seed();
  },
  afterEach: async () => {
    await db.clean();
  },
  data: async () => [{ input: "Find recent errors" }],
  task: myAgentTask,
  scorers: [async ({ output }) => ({ score: output.includes("error") ? 1.0 : 0.0 })],
});

Scorers

Scorers evaluate outputs and return a score (0-1). Use built-in scorers or create your own.

ToolCallScorer

Checks whether the expected tools were called with the correct arguments.

import { ToolCallScorer } from "vitest-evals";

describeEval("tool usage", {
  data: async () => [
    {
      input: "Find Italian restaurants",
      expectedTools: [
        { name: "search", arguments: { type: "restaurant" } },
        { name: "filter", arguments: { cuisine: "italian" } },
      ],
    },
  ],
  task: myTask,
  scorers: [ToolCallScorer()],
});

// Strict order and parameters
scorers: [ToolCallScorer({ ordered: true, params: "strict" })];

// Partial matching: don't require every expected tool, but disallow unexpected extras
scorers: [ToolCallScorer({ requireAll: false, allowExtras: false })];

Default behavior:

  • Strict parameter matching (exact equality required)
  • Any order allowed
  • Extra tools allowed
  • All expected tools required
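Those defaults amount to roughly the following logic (a simplified sketch for intuition, not the library's actual implementation; note the JSON-stringify comparison is key-order-sensitive, unlike a proper deep equality):

```javascript
// Simplified sketch of ToolCallScorer's defaults; NOT the real implementation.
// Every expected call must appear somewhere (any order, exact-equal arguments);
// extra tool calls in `actual` are ignored.
const deepEqual = (a, b) => JSON.stringify(a) === JSON.stringify(b);

const matchesDefaults = (expected, actual) =>
  expected.every((exp) =>
    actual.some(
      (act) => act.name === exp.name && deepEqual(act.arguments, exp.arguments)
    )
  );
```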

StructuredOutputScorer

Checks whether the output matches the expected structured data (JSON).

import { StructuredOutputScorer } from "vitest-evals";

describeEval("query generation", {
  data: async () => [
    {
      input: "Show me errors from today",
      expected: {
        dataset: "errors",
        query: "",
        sort: "-timestamp",
        timeRange: { statsPeriod: "24h" },
      },
    },
  ],
  task: myTask,
  scorers: [StructuredOutputScorer()],
});

// Fuzzy matching
scorers: [StructuredOutputScorer({ match: "fuzzy" })];

// Custom validation
scorers: [
  StructuredOutputScorer({
    match: (expected, actual, key) => {
      if (key === "age") return actual >= 18 && actual <= 100;
      return expected === actual;
    },
  }),
];

Custom Scorers

// Inline scorer
const LengthScorer = async ({ output }) => ({
  score: output.length > 50 ? 1.0 : 0.0,
});

// TypeScript scorer with custom options
import { type ScoreFn, type BaseScorerOptions } from "vitest-evals";

interface CustomOptions extends BaseScorerOptions {
  minLength: number;
}

const TypedScorer: ScoreFn<CustomOptions> = async (opts) => ({
  score: opts.output.length >= opts.minLength ? 1.0 : 0.0,
});

AI SDK Integration

See src/ai-sdk-integration.test.ts for a complete example with the Vercel AI SDK.

Transform provider responses to our format:

import { generateText } from "ai";
import { openai } from "@ai-sdk/openai";

const task = async (input) => {
  const { text, steps } = await generateText({
    model: openai("gpt-4o"),
    prompt: input,
    tools: { myTool: myToolDefinition },
  });

  return {
    result: text,
    toolCalls: steps
      .flatMap((step) => step.toolCalls)
      .map((call) => ({
        name: call.toolName,
        arguments: call.args,
      })),
  };
};

Advanced Usage

Using autoevals

For evaluation using the autoevals library:

import { Factuality, ClosedQA } from "autoevals";

scorers: [
  Factuality,
  ClosedQA.partial({
    criteria: "Does the answer mention Paris?",
  }),
];

Skip Tests Conditionally

describeEval("gpt-4 tests", {
  skipIf: () => !process.env.OPENAI_API_KEY,
  // ...
});

Existing Test Suites

For integration with existing Vitest test suites, the .toEval() matcher is available:

Deprecated: .toEval() is deprecated. Prefer describeEval(), which offers better test organization and supports multiple scorers.

import "vitest-evals";

test("capital check", () => {
  const simpleFactuality = async ({ output, expected }) => ({
    score: output.toLowerCase().includes(expected.toLowerCase()) ? 1.0 : 0.0,
  });

  expect("What is the capital of France?").toEval(
    "Paris",
    answerQuestion,
    simpleFactuality,
    0.8
  );
});

Configuration

Separate Eval Configuration

Create vitest.evals.config.ts:

import { defineConfig } from "vitest/config";
import defaultConfig from "./vitest.config";

export default defineConfig({
  ...defaultConfig,
  test: {
    ...defaultConfig.test,
    include: ["src/**/*.eval.{js,ts}"],
  },
});

Run evals separately:

vitest --config=vitest.evals.config.ts

Development

pnpm install
pnpm test
