# MCP Evals Documentation

## Overview

MCP Evals is a Node.js package and GitHub Action for evaluating MCP (Model Context Protocol) tool implementations using LLM-based scoring. This helps ensure your MCP server's tools are working correctly and performing well.

## Installation

### As a Node.js Package
```bash
npm install mcp-evals
```
### As a GitHub Action

Add the following to your workflow file:
```yaml
name: Run MCP Evaluations

on:
  pull_request:
    types: [opened, synchronize, reopened]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      pull-requests: write
    steps:
      - uses: actions/checkout@v4

      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: '20'

      - name: Install dependencies
        run: npm install

      - name: Run MCP Evaluations
        uses: mclenhard/mcp-evals@v1.0.9
        with:
          evals_path: 'src/evals/evals.ts'
          server_path: 'src/index.ts'
          openai_api_key: ${{ secrets.OPENAI_API_KEY }}
          model: 'gpt-4' # Optional, defaults to gpt-4
```
## Usage

### 1. Create Your Evaluation File

Create a file (e.g., `evals.ts`) that exports your evaluation configuration:
```typescript
import { EvalConfig } from 'mcp-evals';
import { openai } from "@ai-sdk/openai";
import { grade, EvalFunction } from "mcp-evals";

const weatherEval: EvalFunction = {
  name: 'Weather Tool Evaluation',
  description: 'Evaluates the accuracy and completeness of weather information retrieval',
  run: async () => {
    const result = await grade(openai("gpt-4"), "What is the weather in New York?");
    return JSON.parse(result);
  }
};

const config: EvalConfig = {
  model: openai("gpt-4"),
  evals: [weatherEval]
};

export default config;

export const evals = [
  weatherEval,
  // add other evals here
];
```
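If your server exposes more than one tool, each can get its own eval and they can be registered together. A minimal sketch of that pattern, assuming a hypothetical second tool (the eval name and prompt are placeholders, not part of mcp-evals):

```typescript
// Hypothetical second eval; the name and prompt are placeholders.
const forecastEval: EvalFunction = {
  name: 'Forecast Tool Evaluation',
  description: 'Evaluates multi-day forecast retrieval',
  run: async () => {
    const result = await grade(openai("gpt-4"), "What will the weather in New York be over the next three days?");
    return JSON.parse(result);
  }
};

// Register both evals so they run together.
const config: EvalConfig = {
  model: openai("gpt-4"),
  evals: [weatherEval, forecastEval]
};
```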
### 2. Run the Evaluations

#### As a Node.js Package

You can run the evaluations using the CLI:
```bash
npx mcp-eval path/to/your/evals.ts path/to/your/server.ts
```
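The CLI reads your OpenAI key from the environment (see Configuration below), so a local run might look like the following; the key value is a placeholder and the paths match the workflow example above:

```bash
# Placeholder key; substitute your real OPENAI_API_KEY
OPENAI_API_KEY=sk-your-key npx mcp-eval src/evals/evals.ts src/index.ts
```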
#### As a GitHub Action

The action will automatically:

- Run your evaluations
- Post the results as a comment on the PR
- Update the comment if the PR is updated
## Evaluation Results

Each evaluation returns an object with the following structure:

```typescript
interface EvalResult {
  accuracy: number;         // Score from 1-5
  completeness: number;     // Score from 1-5
  relevance: number;        // Score from 1-5
  clarity: number;          // Score from 1-5
  reasoning: number;        // Score from 1-5
  overall_comments: string; // Summary of strengths and weaknesses
}
```
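In CI you will often want to collapse these five scores into a single pass/fail signal. Below is a minimal sketch of one way to do that; `averageScore`, `assertQuality`, and the 4.0 threshold are illustrative helpers, not part of mcp-evals, and the sketch assumes the `EvalResult` type is importable (if it isn't, declare it locally from the interface above):

```typescript
import type { EvalResult } from 'mcp-evals'; // assumption: the type is exported

// Hypothetical helper: average the five 1-5 scores.
function averageScore(result: EvalResult): number {
  const { accuracy, completeness, relevance, clarity, reasoning } = result;
  return (accuracy + completeness + relevance + clarity + reasoning) / 5;
}

// Hypothetical quality gate: throw (failing the run) below an example threshold.
function assertQuality(result: EvalResult, minScore = 4.0): void {
  const avg = averageScore(result);
  if (avg < minScore) {
    throw new Error(`Eval scored ${avg.toFixed(2)} (< ${minScore}): ${result.overall_comments}`);
  }
}
```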
## Configuration

### Environment Variables

- `OPENAI_API_KEY`: Your OpenAI API key (required)
### Evaluation Configuration

The `EvalConfig` interface requires:

- `model`: The language model to use for evaluation (e.g., GPT-4)
- `evals`: Array of evaluation functions to run
Each evaluation function must implement the following (a minimal skeleton follows the list):

- `name`: Name of the evaluation
- `description`: Description of what the evaluation tests
- `run`: Async function that takes a model and returns an `EvalResult`
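Putting those three members together, a skeleton eval might look like the sketch below; `my-tool` and the prompt are placeholders, and note that `run` receives the grading model (the earlier example constructed its own instead):

```typescript
import { grade, EvalFunction } from "mcp-evals";

// Skeleton eval function; 'my-tool' and the prompt are placeholders.
const myToolEval: EvalFunction = {
  name: 'My Tool Evaluation',
  description: 'Evaluates responses produced via my-tool',
  run: async (model) => {
    // Grade a prompt that should exercise the tool under test.
    const result = await grade(model, "A prompt that exercises my-tool");
    return JSON.parse(result);
  }
};
```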
## License

MIT