# AI Package

This package contains AI agents and tools built with the Mastra framework.

## Structure

```
src/
├── agents/                  # AI agents
│   ├── weather-agent.ts
│   └── weather-agent.test.ts
├── tools/                   # Tools for agents
│   ├── weather-tool.ts
│   └── weather-tool.test.ts
└── workflows/               # Workflows (if any)
```

## Testing

This project uses Bun's native testing framework for both unit tests and evaluations.

### Running Tests

```bash
# Run all tests
bun test

# Run tests in watch mode
bun test --watch

# Run tests with coverage
bun test --coverage

# Run a specific test file
bun test src/agents/weather-agent.test.ts

# Run only evaluation tests (filter by test name)
bun test -t "eval:"
```

### Test Types

#### 1. Integration Tests

- Test agent functionality end-to-end
- Verify tool integration
- Check conversation context handling
- Validate error handling

#### 2. Unit Tests

- Test individual tool functionality
- Validate input/output schemas
- Test configuration and setup

#### 3. Evaluation Tests (LLM-as-Judge)

- **Answer Relevancy**: Does the response address the query?
- **Helpfulness**: How well does it handle missing information?
- **Error Handling**: Graceful handling of invalid inputs
- **Tone Consistency**: Professional and appropriate tone
- **Factual Accuracy**: Realistic data without hallucination
- **Tool Usage**: Appropriate use of available tools
- **Safety & Bias**: Free from harmful or biased content

### Environment Setup

Create a `.env` file with your API keys:

```bash
OPENAI_API_KEY=your_openai_api_key_here
```

### Test Configuration

Tests are configured with timeouts appropriate to each type:

- Unit tests: default timeout (5s)
- Integration tests: 30-45s to allow for live LLM calls
- Evaluation tests: 45-60s for complex evaluations

A per-test timeout is passed as the third argument to Bun's `it`; the integration-test sketch at the end of this README shows this.

### Evaluation Methodology

The evaluation tests use an LLM-as-Judge approach:

1. **Test Case**: The agent generates a response to a specific input
2. **Evaluation**: A separate LLM evaluates that response against the criteria above
3. **Scoring**: Responses are scored on a 0-1 scale with per-criterion pass/fail thresholds
4. **Reporting**: Detailed scores and reasoning are logged to the console

A hedged sketch of this loop appears at the end of this README.

### Example Test Output

```
✓ should be properly configured
✓ should generate response for weather query
✓ eval: answer relevancy for basic weather query
  Answer Relevancy Score: 85.0% - Response provides accurate weather data for requested location
✓ eval: helpfulness when location is missing
  Helpfulness Score: 92.0% - Politely asks for location specification
✓ eval: safety and bias detection
  Safety Score: 98.0% - Response is neutral and factual
```

### Adding New Tests

1. **For new agents**: Create `{agent-name}.test.ts` alongside the agent file
2. **For new tools**: Create `{tool-name}.test.ts` alongside the tool file (see the unit-test sketch at the end of this README)
3. **For evaluations**: Add new test cases to the "Evaluations" describe block

### CI/CD Integration

Tests can be run in CI environments:

```bash
# In a CI pipeline: emit a JUnit report and collect coverage
bun test --reporter=junit --reporter-outfile=junit.xml --coverage
```

The evaluation tests fail if scores drop below the defined thresholds, so they act as quality gates.
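
### Sketch: Integration Test with a Per-Test Timeout

The sketch below shows how the integration budget from "Test Configuration" is applied per test: Bun's `it` accepts a timeout in milliseconds as its third argument. The `weatherAgent` export and its `generate()` return shape are assumptions about this repo, not verified API.

```typescript
import { describe, it, expect } from "bun:test";
import { weatherAgent } from "./weather-agent"; // assumed export name

describe("weather agent (integration)", () => {
  it(
    "should generate response for weather query",
    async () => {
      // Assumes the Mastra agent exposes generate() returning { text };
      // adjust to the actual agent API if it differs.
      const result = await weatherAgent.generate("What's the weather in Tokyo?");
      expect(result.text.length).toBeGreaterThan(0);
    },
    45_000 // per-test timeout in ms: the integration budget noted above
  );
});
```

Setting the timeout on the individual test, rather than via a global `--timeout` flag, lets fast unit tests keep the 5s default.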
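
### Sketch: LLM-as-Judge Evaluation

A hedged sketch of the loop described under "Evaluation Methodology". The judge prompt, the model name (`gpt-4o-mini`), the 0.7 threshold, and the `weatherAgent` import are illustrative assumptions, not this project's actual configuration.

```typescript
import { describe, it, expect } from "bun:test";
import OpenAI from "openai";
import { weatherAgent } from "./weather-agent"; // assumed export name

const judge = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Ask a second model to score how directly the response addresses the query.
async function scoreRelevancy(query: string, answer: string): Promise<number> {
  const res = await judge.chat.completions.create({
    model: "gpt-4o-mini", // assumed judge model
    messages: [
      {
        role: "system",
        content:
          "You are an evaluator. Given a query and a response, reply with " +
          "ONLY a number between 0 and 1 for how directly the response addresses the query.",
      },
      { role: "user", content: `Query: ${query}\n\nResponse: ${answer}` },
    ],
  });
  return Number.parseFloat(res.choices[0].message.content ?? "0");
}

describe("Evaluations", () => {
  it(
    "eval: answer relevancy for basic weather query",
    async () => {
      const query = "What is the weather in Berlin?";
      const { text } = await weatherAgent.generate(query); // assumed Mastra API
      const score = await scoreRelevancy(query, text);
      console.log(`Answer Relevancy Score: ${(score * 100).toFixed(1)}%`);
      expect(score).toBeGreaterThanOrEqual(0.7); // assumed pass/fail threshold
    },
    60_000 // evaluation tests get the longest timeout (45-60s)
  );
});
```

Prefixing the test name with `eval:` is what lets `bun test -t "eval:"` select only the evaluation suite.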
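
### Sketch: Unit Test for a New Tool

A sketch of step 2 under "Adding New Tests", assuming `weather-tool.ts` exports a tool built with Mastra's `createTool` and that its `inputSchema` is a Zod schema exposing `safeParse()`; the property names are assumptions to adapt to the actual tool definition.

```typescript
import { describe, it, expect } from "bun:test";
import { weatherTool } from "./weather-tool"; // assumed export name

describe("weatherTool", () => {
  it("should be properly configured", () => {
    // Mastra tools created with createTool carry an id and an input schema (assumed).
    expect(weatherTool.id).toBeDefined();
    expect(weatherTool.inputSchema).toBeDefined();
  });

  it("rejects input that is missing required fields", () => {
    // Zod schemas expose safeParse(); success is false for invalid input.
    const result = weatherTool.inputSchema.safeParse({});
    expect(result.success).toBe(false);
  });
});
```

Unit tests like these run well within Bun's default 5s timeout, so no third argument is needed.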