# AI Package

This package contains AI agents and tools built with the Mastra framework.
## Structure

```
src/
├── agents/      # AI agents
│   ├── weather-agent.ts
│   └── weather-agent.test.ts
├── tools/       # Tools for agents
│   ├── weather-tool.ts
│   └── weather-tool.test.ts
└── workflows/   # Workflows (if any)
```
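For orientation, here is a minimal sketch of what a tool/agent pair in this layout typically looks like with Mastra. The identifiers, model choice, and returned data below are illustrative assumptions, not this package's actual implementation:

```ts
// Illustrative sketch only; names, model, and return values are assumptions.
import { openai } from '@ai-sdk/openai';
import { Agent } from '@mastra/core/agent';
import { createTool } from '@mastra/core/tools';
import { z } from 'zod';

// src/tools/weather-tool.ts: a tool validates input with zod and returns structured data.
export const weatherTool = createTool({
  id: 'get-weather',
  description: 'Get the current weather for a location',
  inputSchema: z.object({
    location: z.string().describe('City name'),
  }),
  execute: async ({ context }) => ({
    // A real tool would call a weather API here; stub data keeps the sketch self-contained.
    location: context.location,
    temperature: 21,
    conditions: 'Partly cloudy',
  }),
});

// src/agents/weather-agent.ts: an agent wires a model, instructions, and tools together.
export const weatherAgent = new Agent({
  name: 'Weather Agent',
  instructions:
    'You answer weather questions using the available tool. Ask for a location if one is missing.',
  model: openai('gpt-4o-mini'),
  tools: { weatherTool },
});
```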
## Testing

This project uses Bun's native test runner for both unit tests and evaluations.

### Running Tests

```bash
# Run all tests
bun test

# Run tests in watch mode
bun test --watch

# Run tests with coverage
bun test --coverage

# Run a specific test file
bun test src/agents/weather-agent.test.ts

# Run only evaluation tests (filter by test name)
bun test -t "eval:"
```
### Test Types

1. **Integration Tests** (a minimal test file covering the first two types is sketched after this list)
   - Test agent functionality end-to-end
   - Verify tool integration
   - Check conversation context handling
   - Validate error handling
2. **Unit Tests**
   - Test individual tool functionality
   - Validate input/output schemas
   - Test configuration and setup
3. **Evaluation Tests (LLM-as-Judge)**
   - **Answer Relevancy**: Does the response address the query?
   - **Helpfulness**: How well does it handle missing information?
   - **Error Handling**: Graceful handling of invalid inputs
   - **Tone Consistency**: Professional and appropriate tone
   - **Factual Accuracy**: Realistic data without hallucination
   - **Tool Usage**: Appropriate use of available tools
   - **Safety & Bias**: Free from harmful or biased content
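The sketch below shows how the unit and integration styles map onto a `bun:test` file. It reuses the illustrative `weatherAgent` from the Structure section; the test names and assertions are examples, not this package's actual tests:

```ts
import { describe, expect, it } from 'bun:test';
import { weatherAgent } from './weather-agent';

describe('weather-agent', () => {
  // Unit-style check: the agent module loads and is wired up.
  it('should be properly configured', () => {
    expect(weatherAgent).toBeDefined();
  });

  // Integration-style check: exercises a real LLM call end-to-end.
  it('should generate response for weather query', async () => {
    const result = await weatherAgent.generate('What is the weather in Tokyo?');
    expect(result.text.length).toBeGreaterThan(0);
  });
});
```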
### Environment Setup

Create a `.env` file with your API keys:

```bash
OPENAI_API_KEY=your_openai_api_key_here
```
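Since the package includes an `env.d.ts`, the required variables can also be typed so `process.env.OPENAI_API_KEY` is checked at compile time. This is a sketch, not the file's actual contents:

```ts
// env.d.ts (sketch): declare the environment variables the tests rely on.
declare namespace NodeJS {
  interface ProcessEnv {
    OPENAI_API_KEY: string;
  }
}
```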
### Test Configuration

Tests are configured with appropriate timeouts:

- Unit tests: default timeout (5s)
- Integration tests: 30-45s for LLM calls
- Evaluation tests: 45-60s for complex evaluations
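With `bun:test`, a per-test timeout in milliseconds can be passed as the third argument to `it`/`test`. The 45s below is an illustrative value for an LLM-backed test, not one taken from this package:

```ts
import { expect, it } from 'bun:test';
import { weatherAgent } from './weather-agent';

// LLM-backed tests get a longer budget than the 5s default.
it(
  'should generate response for weather query',
  async () => {
    const result = await weatherAgent.generate('What is the weather in Tokyo?');
    expect(result.text.length).toBeGreaterThan(0);
  },
  45_000,
);
```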
### Evaluation Methodology

The evaluation tests use an LLM-as-Judge approach (a simplified sketch follows this list):

- **Test Case**: the agent generates a response to a specific input
- **Evaluation**: a second LLM evaluates the response against the criteria
- **Scoring**: 0-1 scale with specific pass/fail thresholds
- **Reporting**: detailed scores and reasoning are logged to the console
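A stripped-down version of that flow, assuming the Vercel AI SDK's `generateObject` for the judge call; the rubric prompt, judge model, and 0.7 threshold are illustrative assumptions rather than this package's actual scorers:

```ts
import { describe, expect, it } from 'bun:test';
import { openai } from '@ai-sdk/openai';
import { generateObject } from 'ai';
import { z } from 'zod';
import { weatherAgent } from './weather-agent';

describe('Evaluations', () => {
  it(
    'eval: answer relevancy for basic weather query',
    async () => {
      const query = 'What is the weather in Tokyo?';
      const response = await weatherAgent.generate(query);

      // A second model judges the response and returns a structured verdict.
      const { object: verdict } = await generateObject({
        model: openai('gpt-4o-mini'),
        schema: z.object({
          score: z.number().min(0).max(1),
          reasoning: z.string(),
        }),
        prompt: `Score from 0 to 1 how well this response answers the query.\nQuery: ${query}\nResponse: ${response.text}`,
      });

      console.info(
        `Answer Relevancy Score: ${(verdict.score * 100).toFixed(1)}% - ${verdict.reasoning}`,
      );
      expect(verdict.score).toBeGreaterThanOrEqual(0.7); // assumed quality-gate threshold
    },
    60_000,
  );
});
```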
### Example Test Output

```
✓ should be properly configured
✓ should generate response for weather query
✓ eval: answer relevancy for basic weather query
  Answer Relevancy Score: 85.0% - Response provides accurate weather data for requested location
✓ eval: helpfulness when location is missing
  Helpfulness Score: 92.0% - Politely asks for location specification
✓ eval: safety and bias detection
  Safety Score: 98.0% - Response is neutral and factual
```
### Adding New Tests

- For new agents: create `{agent-name}.test.ts` alongside the agent file
- For new tools: create `{tool-name}.test.ts` alongside the tool file
- For evaluations: add new test cases to the "Evaluations" describe block
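As a starting point for a new tool, the test file might look like the sketch below. It assumes the tool object exposes its zod `inputSchema`, which should be confirmed against the actual tool definition:

```ts
// Hypothetical {tool-name}.test.ts skeleton, shown here for the illustrative weather tool.
import { describe, expect, it } from 'bun:test';
import { weatherTool } from './weather-tool';

describe('weather-tool', () => {
  it('accepts a valid location', () => {
    const parsed = weatherTool.inputSchema?.safeParse({ location: 'Tokyo' });
    expect(parsed?.success).toBe(true);
  });

  it('rejects input without a location', () => {
    const parsed = weatherTool.inputSchema?.safeParse({});
    expect(parsed?.success).toBe(false);
  });
});
```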
### CI/CD Integration

Tests can be run in CI environments:

```bash
# In CI pipeline (the JUnit output path is an example)
bun test --coverage --reporter=junit --reporter-outfile=./junit.xml
```

The evaluation tests fail if scores fall below the defined thresholds, so they act as quality gates in the pipeline.