
README.md

AI Package

This package contains AI agents and tools built with the Mastra framework.

Structure

src/
├── agents/           # AI agents
│   ├── weather-agent.ts
│   └── weather-agent.test.ts
├── tools/            # Tools for agents
│   ├── weather-tool.ts
│   └── weather-tool.test.ts
└── workflows/        # Workflows (if any)

Testing

This project uses Bun's native testing framework for both unit tests and evaluations.

Running Tests

# Run all tests
bun test

# Run tests in watch mode
bun test --watch

# Run tests with coverage
bun test --coverage

# Run specific test file
bun test src/agents/weather-agent.test.ts

# Run only evaluation tests (filter by test name)
bun test --test-name-pattern "eval:"

Test Types

1. Integration Tests

  • Test agent functionality end-to-end (a sketch follows this list)
  • Verify tool integration
  • Check conversation context handling
  • Validate error handling
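
A minimal sketch of such a test, assuming a hypothetical runWeatherAgent helper that runs the agent and returns its final text reply (the real export in weather-agent.ts may differ); per-test timeouts are covered under Test Configuration below:

import { describe, expect, test } from 'bun:test';

// Hypothetical helper that runs the weather agent end-to-end and returns its
// final text reply; the actual export in src/agents/weather-agent.ts may differ.
import { runWeatherAgent } from './weather-agent';

describe('weather agent (integration)', () => {
  test('generates a response for a basic weather query', async () => {
    const reply = await runWeatherAgent('What is the weather in Tokyo?');
    expect(reply).toBeDefined();
    expect(reply.toLowerCase()).toContain('tokyo'); // loose sanity check on the reply
  });
});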

2. Unit Tests

  • Test individual tool functionality
  • Validate input/output schemas (see the sketch after this list)
  • Test configuration and setup
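
A minimal sketch of a schema-focused unit test; the inputSchema shape below is an assumption, since the real schema lives in src/tools/weather-tool.ts:

import { describe, expect, test } from 'bun:test';
import { z } from 'zod';

// Assumed shape of the tool's input schema; the real definition may differ.
const weatherInputSchema = z.object({ location: z.string().min(1) });

describe('weather tool (unit)', () => {
  test('accepts a valid location', () => {
    expect(weatherInputSchema.safeParse({ location: 'Berlin' }).success).toBe(true);
  });

  test('rejects a missing location', () => {
    expect(weatherInputSchema.safeParse({}).success).toBe(false);
  });
});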

3. Evaluation Tests (LLM-as-Judge)

  • Answer Relevancy: Does the response address the query? (see the sketch after this list)
  • Helpfulness: How well does it handle missing information?
  • Error Handling: Graceful handling of invalid inputs
  • Tone Consistency: Professional and appropriate tone
  • Factual Accuracy: Realistic data without hallucination
  • Tool Usage: Appropriate use of available tools
  • Safety & Bias: Free from harmful or biased content
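
A minimal sketch of how one of these criteria might be asserted, assuming a hypothetical scoreAnswerRelevancy judge helper that returns a score between 0 and 1 (one way to build such a judge is sketched under Evaluation Methodology below):

import { describe, expect, test } from 'bun:test';

// Hypothetical imports; the real agent entry point and judge utility in this
// package may have different names, signatures, and locations.
import { runWeatherAgent } from './weather-agent';
import { scoreAnswerRelevancy } from '../../evals/scorers';

describe('Evaluations', () => {
  test('eval: answer relevancy for basic weather query', async () => {
    const query = 'What is the weather in Paris today?';
    const response = await runWeatherAgent(query);

    const score = await scoreAnswerRelevancy(query, response);
    console.info(`Answer Relevancy Score: ${(score * 100).toFixed(1)}%`);

    expect(score).toBeGreaterThanOrEqual(0.8); // pass/fail threshold
  });
});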

Environment Setup

Create a .env file with your API keys:

OPENAI_API_KEY=your_openai_api_key_here
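
A guard in a shared test setup file can fail fast when the key is missing; a small sketch (the package may already handle this through env.d.ts or its own configuration):

// Fail fast before any LLM-backed test runs if the judge model is unreachable.
if (!process.env.OPENAI_API_KEY) {
  throw new Error('OPENAI_API_KEY is not set; integration and evaluation tests require it.');
}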

Test Configuration

Tests are configured with appropriate timeouts (a per-test override is sketched after this list):

  • Unit tests: Default timeout (5s)
  • Integration tests: 30-45s for LLM calls
  • Evaluation tests: 45-60s for complex evaluations
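
With bun:test, a per-test budget can be passed as the optional third argument to test(); for example, the integration sketch above could be given a 45-second budget (the value is illustrative):

import { expect, test } from 'bun:test';
import { runWeatherAgent } from './weather-agent'; // hypothetical helper, as above

test(
  'generates a response for a basic weather query',
  async () => {
    const reply = await runWeatherAgent('What is the weather in Tokyo?');
    expect(reply.length).toBeGreaterThan(0);
  },
  45_000 // 45 seconds, matching the integration-test budget above
);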

Evaluation Methodology

The evaluation tests use an LLM-as-Judge approach where:

  1. Test Case: Agent generates response to a specific input
  2. Evaluation: Another LLM model evaluates the response against criteria (a sketch of such a judge follows this list)
  3. Scoring: 0-1 scale with specific thresholds for pass/fail
  4. Reporting: Detailed scores and reasoning logged to console
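
A minimal sketch of step 2, using the OpenAI SDK directly as the judge; the package's actual scorers may be implemented differently, and the model name and prompt wording here are illustrative:

import OpenAI from 'openai';

const judge = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Ask a judge model to grade a response against one criterion and return a
// score between 0 and 1.
export async function scoreResponse(
  criterion: string,
  input: string,
  response: string
): Promise<number> {
  const completion = await judge.chat.completions.create({
    model: 'gpt-4o-mini',
    messages: [
      {
        role: 'user',
        content: [
          `Criterion: ${criterion}`,
          `User input: ${input}`,
          `Agent response: ${response}`,
          'Reply with only a number between 0 and 1 indicating how well the response meets the criterion.',
        ].join('\n'),
      },
    ],
  });

  const raw = completion.choices[0]?.message?.content ?? '0';
  const score = Number.parseFloat(raw);
  return Number.isNaN(score) ? 0 : Math.min(Math.max(score, 0), 1);
}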

Example Test Output

✓ should be properly configured
✓ should generate response for weather query
✓ eval: answer relevancy for basic weather query
  Answer Relevancy Score: 85.0% - Response provides accurate weather data for requested location

✓ eval: helpfulness when location is missing  
  Helpfulness Score: 92.0% - Politely asks for location specification

✓ eval: safety and bias detection
  Safety Score: 98.0% - Response is neutral and factual

Adding New Tests

  1. For new agents: Create {agent-name}.test.ts alongside the agent file
  2. For new tools: Create {tool-name}.test.ts alongside the tool file (a starter skeleton follows this list)
  3. For evaluations: Add new test cases to the "Evaluations" describe block
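
A starter skeleton for a new tool's test file; every name below is a placeholder to replace with the real tool's exports:

// src/tools/my-tool.test.ts (placeholder path and names)
import { describe, expect, test } from 'bun:test';

describe('my-tool', () => {
  test('is properly configured', () => {
    // Unit-level checks: input/output schema shape, defaults, configuration.
    expect(true).toBe(true); // replace with real assertions
  });

  describe('Evaluations', () => {
    test('eval: answer relevancy for a representative query', async () => {
      // LLM-as-Judge check: score the response and assert a threshold,
      // as described under Evaluation Methodology above.
    });
  });
});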

CI/CD Integration

Tests can be run in CI environments:

# In CI pipeline (the JUnit output path is an example)
bun test --coverage --reporter=junit --reporter-outfile=./junit.xml

The evaluation tests will fail if scores fall below defined thresholds, ensuring quality gates are maintained.