# MCP Evals Documentation

## Overview

MCP Evals is a Node.js package and GitHub Action for evaluating MCP (Model Context Protocol) tool implementations using LLM-based scoring. This helps ensure your MCP server's tools are working correctly and performing well.

## Installation

### As a Node.js Package

```bash
npm install mcp-evals
```

### As a GitHub Action

Add the following to your workflow file:

```yaml
name: Run MCP Evaluations

on:
  pull_request:
    types: [opened, synchronize, reopened]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      pull-requests: write
    steps:
      - uses: actions/checkout@v4

      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: '20'

      - name: Install dependencies
        run: npm install

      - name: Run MCP Evaluations
        uses: mclenhard/mcp-evals@v1.0.9
        with:
          evals_path: 'src/evals/evals.ts'
          server_path: 'src/index.ts'
          openai_api_key: ${{ secrets.OPENAI_API_KEY }}
          model: 'gpt-4' # Optional, defaults to gpt-4
```
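Note that `${{ secrets.OPENAI_API_KEY }}` is read from your repository's GitHub Actions secrets, so the key must be added there before the workflow can run.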

## Usage

### 1. Create Your Evaluation File

Create a file (e.g., `evals.ts`) that exports your evaluation configuration:

```typescript
import { EvalConfig, EvalFunction, grade } from 'mcp-evals';
import { openai } from '@ai-sdk/openai';

const weatherEval: EvalFunction = {
  name: 'Weather Tool Evaluation',
  description: 'Evaluates the accuracy and completeness of weather information retrieval',
  run: async () => {
    const result = await grade(openai('gpt-4'), 'What is the weather in New York?');
    return JSON.parse(result);
  },
};

const config: EvalConfig = {
  model: openai('gpt-4'),
  evals: [weatherEval],
};

export default config;

export const evals = [
  weatherEval,
  // add other evals here
];
```
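Each eval's `run` function calls `grade` with a model and a natural-language query that exercises the tool under test; `grade` returns a JSON string that parses into the `EvalResult` structure described below.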

### 2. Run the Evaluations

#### As a Node.js Package

You can run the evaluations using the CLI:

```bash
npx mcp-eval path/to/your/evals.ts path/to/your/server.ts
```

#### As a GitHub Action

The action will automatically:

- Run your evaluations
- Post the results as a comment on the PR
- Update the comment if the PR is updated

## Evaluation Results

Each evaluation returns an object with the following structure:

```typescript
interface EvalResult {
  accuracy: number;         // Score from 1-5
  completeness: number;     // Score from 1-5
  relevance: number;        // Score from 1-5
  clarity: number;          // Score from 1-5
  reasoning: number;        // Score from 1-5
  overall_comments: string; // Summary of strengths and weaknesses
}
```
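If you consume these results programmatically (for example, to fail a CI job on low scores), a small helper can aggregate the five numeric dimensions. This is a hypothetical sketch, not part of the package:

```typescript
// Hypothetical helper: average the five 1-5 scores and compare
// the result against a minimum acceptable value.
function meetsThreshold(result: EvalResult, minAverage = 3): boolean {
  const scores = [
    result.accuracy,
    result.completeness,
    result.relevance,
    result.clarity,
    result.reasoning,
  ];
  const average = scores.reduce((sum, score) => sum + score, 0) / scores.length;
  return average >= minAverage;
}
```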

## Configuration

### Environment Variables

- `OPENAI_API_KEY`: Your OpenAI API key (required)

### Evaluation Configuration

The `EvalConfig` interface requires:

- `model`: The language model to use for evaluation (e.g., GPT-4)
- `evals`: Array of evaluation functions to run

Each evaluation function must implement the following (see the sketch after this list):

- `name`: Name of the evaluation
- `description`: Description of what the evaluation tests
- `run`: Async function that takes a model and returns an `EvalResult`
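Put together, the shapes look roughly like this. This is a sketch inferred from the documentation above, not the package's authoritative type definitions; the `LanguageModel` type name is an assumption borrowed from the Vercel AI SDK that `@ai-sdk/openai` plugs into:

```typescript
import type { LanguageModel } from 'ai'; // assumed type from the Vercel AI SDK

// Sketch of the interfaces described above; consult the package's
// own type declarations for the authoritative definitions.
interface EvalFunction {
  name: string;        // name of the evaluation
  description: string; // what the evaluation tests
  run: (model?: LanguageModel) => Promise<EvalResult>; // produces an EvalResult
}

interface EvalConfig {
  model: LanguageModel;  // model used for grading
  evals: EvalFunction[]; // evaluations to run
}
```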

## License

MIT