
README.md

AI Package

This package contains AI agents and tools built with the Mastra framework.

Structure

src/
├── agents/           # AI agents
│   ├── weather-agent.ts
│   └── weather-agent.test.ts
├── tools/            # Tools for agents
│   ├── weather-tool.ts
│   └── weather-tool.test.ts
└── workflows/        # Workflows (if any)

Testing

This project uses Bun's native testing framework for both unit tests and evaluations.

Running Tests

# Run all tests
bun test

# Run tests in watch mode
bun test --watch

# Run tests with coverage
bun test --coverage

# Run specific test file
bun test src/agents/weather-agent.test.ts

# Run only evaluation tests (filter by test name)
bun test --test-name-pattern "eval:"

Test Types

1. Integration Tests

  • Test agent functionality end-to-end (a sketch follows this list)
  • Verify tool integration
  • Check conversation context handling
  • Validate error handling
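
A minimal sketch of such a test, assuming a hypothetical runWeatherAgent helper that runs the agent and returns its final text reply (the real export in weather-agent.ts may differ); per-test timeouts are covered under Test Configuration below:

import { describe, expect, test } from 'bun:test';

// Hypothetical helper that runs the weather agent end-to-end and returns its
// final text reply; the actual export in src/agents/weather-agent.ts may differ.
import { runWeatherAgent } from './weather-agent';

describe('weather agent (integration)', () => {
  test('generates a response for a basic weather query', async () => {
    const reply = await runWeatherAgent('What is the weather in Tokyo?');
    expect(reply).toBeDefined();
    expect(reply.toLowerCase()).toContain('tokyo'); // loose sanity check on the reply
  });
});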

2. Unit Tests

  • Test individual tool functionality
  • Validate input/output schemas (see the sketch after this list)
  • Test configuration and setup
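
A minimal sketch of a schema-focused unit test; the inputSchema shape below is an assumption, since the real schema lives in src/tools/weather-tool.ts:

import { describe, expect, test } from 'bun:test';
import { z } from 'zod';

// Assumed shape of the tool's input schema; the real definition may differ.
const weatherInputSchema = z.object({ location: z.string().min(1) });

describe('weather tool (unit)', () => {
  test('accepts a valid location', () => {
    expect(weatherInputSchema.safeParse({ location: 'Berlin' }).success).toBe(true);
  });

  test('rejects a missing location', () => {
    expect(weatherInputSchema.safeParse({}).success).toBe(false);
  });
});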

3. Evaluation Tests (LLM-as-Judge)

  • Answer Relevancy: Does the response address the query? (see the sketch after this list)
  • Helpfulness: How well does it handle missing information?
  • Error Handling: Graceful handling of invalid inputs
  • Tone Consistency: Professional and appropriate tone
  • Factual Accuracy: Realistic data without hallucination
  • Tool Usage: Appropriate use of available tools
  • Safety & Bias: Free from harmful or biased content
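
A minimal sketch of how one of these criteria might be asserted, assuming a hypothetical scoreAnswerRelevancy judge helper that returns a score between 0 and 1 (one way to build such a judge is sketched under Evaluation Methodology below):

import { describe, expect, test } from 'bun:test';

// Hypothetical imports; the real agent entry point and judge utility in this
// package may have different names, signatures, and locations.
import { runWeatherAgent } from './weather-agent';
import { scoreAnswerRelevancy } from '../../evals/scorers';

describe('Evaluations', () => {
  test('eval: answer relevancy for basic weather query', async () => {
    const query = 'What is the weather in Paris today?';
    const response = await runWeatherAgent(query);

    const score = await scoreAnswerRelevancy(query, response);
    console.info(`Answer Relevancy Score: ${(score * 100).toFixed(1)}%`);

    expect(score).toBeGreaterThanOrEqual(0.8); // pass/fail threshold
  });
});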

Environment Setup

Create a .env file with your API keys:

OPENAI_API_KEY=your_openai_api_key_here
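
A guard in a shared test setup file can fail fast when the key is missing; a small sketch (the package may already handle this through env.d.ts or its own configuration):

// Fail fast before any LLM-backed test runs if the judge model is unreachable.
if (!process.env.OPENAI_API_KEY) {
  throw new Error('OPENAI_API_KEY is not set; integration and evaluation tests require it.');
}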

Test Configuration

Tests are configured with appropriate timeouts (a per-test override is sketched after this list):

  • Unit tests: Default timeout (5s)
  • Integration tests: 30-45s for LLM calls
  • Evaluation tests: 45-60s for complex evaluations
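
With bun:test, a per-test budget can be passed as the optional third argument to test(); for example, the integration sketch above could be given a 45-second budget (the value is illustrative):

import { expect, test } from 'bun:test';
import { runWeatherAgent } from './weather-agent'; // hypothetical helper, as above

test(
  'generates a response for a basic weather query',
  async () => {
    const reply = await runWeatherAgent('What is the weather in Tokyo?');
    expect(reply.length).toBeGreaterThan(0);
  },
  45_000 // 45 seconds, matching the integration-test budget above
);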

Evaluation Methodology

The evaluation tests use an LLM-as-Judge approach where:

  1. Test Case: Agent generates response to a specific input
  2. Evaluation: Another LLM model evaluates the response against criteria (a sketch of such a judge follows this list)
  3. Scoring: 0-1 scale with specific thresholds for pass/fail
  4. Reporting: Detailed scores and reasoning logged to console
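
A minimal sketch of step 2, using the OpenAI SDK directly as the judge; the package's actual scorers may be implemented differently, and the model name and prompt wording here are illustrative:

import OpenAI from 'openai';

const judge = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Ask a judge model to grade a response against one criterion and return a
// score between 0 and 1.
export async function scoreResponse(
  criterion: string,
  input: string,
  response: string
): Promise<number> {
  const completion = await judge.chat.completions.create({
    model: 'gpt-4o-mini',
    messages: [
      {
        role: 'user',
        content: [
          `Criterion: ${criterion}`,
          `User input: ${input}`,
          `Agent response: ${response}`,
          'Reply with only a number between 0 and 1 indicating how well the response meets the criterion.',
        ].join('\n'),
      },
    ],
  });

  const raw = completion.choices[0]?.message?.content ?? '0';
  const score = Number.parseFloat(raw);
  return Number.isNaN(score) ? 0 : Math.min(Math.max(score, 0), 1);
}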

Example Test Output

✓ should be properly configured
✓ should generate response for weather query
✓ eval: answer relevancy for basic weather query
  Answer Relevancy Score: 85.0% - Response provides accurate weather data for requested location

✓ eval: helpfulness when location is missing  
  Helpfulness Score: 92.0% - Politely asks for location specification

✓ eval: safety and bias detection
  Safety Score: 98.0% - Response is neutral and factual

Adding New Tests

  1. For new agents: Create {agent-name}.test.ts alongside the agent file
  2. For new tools: Create {tool-name}.test.ts alongside the tool file (a starter skeleton follows this list)
  3. For evaluations: Add new test cases to the "Evaluations" describe block
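
A starter skeleton for a new tool's test file; every name below is a placeholder to replace with the real tool's exports:

// src/tools/my-tool.test.ts (placeholder path and names)
import { describe, expect, test } from 'bun:test';

describe('my-tool', () => {
  test('is properly configured', () => {
    // Unit-level checks: input/output schema shape, defaults, configuration.
    expect(true).toBe(true); // replace with real assertions
  });

  describe('Evaluations', () => {
    test('eval: answer relevancy for a representative query', async () => {
      // LLM-as-Judge check: score the response and assert a threshold,
      // as described under Evaluation Methodology above.
    });
  });
});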

CI/CD Integration

Tests can be run in CI environments:

# In CI pipeline (the JUnit output path is an example)
bun test --coverage --reporter=junit --reporter-outfile=./junit.xml

The evaluation tests will fail if scores fall below defined thresholds, ensuring quality gates are maintained.