buster/packages/ai
Latest commit 2f4aeba817 by dal, 2025-07-28 11:56:59 -06:00

feat: enhance documentation and testing capabilities for docs agent workflow

- Added section in CLAUDE.md for direct database access during integration testing.
- Updated `maxSteps` in `docs-agent` to allow for more complex tasks.
- Improved validation in `docs-agent-context` for sandbox instances.
- Enhanced `create-docs-todos` step to handle todos more effectively.
- Introduced comprehensive integration tests for the docs agent workflow, covering various scenarios and edge cases.
- Added test helpers for creating mock dbt projects and managing sandboxes.
- Implemented error handling and logging improvements in the workflow execution process.
| Path | Last commit | Date |
| --- | --- | --- |
| evals | Update packages/ai/evals/agents/analyst-agent/metrics/think-and-prep-updates.ts | 2025-07-23 08:28:38 -06:00 |
| scripts | Use tsx and .ts files for validation | 2025-07-21 16:07:14 -06:00 |
| src | feat: enhance documentation and testing capabilities for docs agent workflow | 2025-07-28 11:56:59 -06:00 |
| .env.example | Mastra braintrust (#391) | 2025-07-02 14:33:40 -07:00 |
| .gitignore | Mastra braintrust (#391) | 2025-07-02 14:33:40 -07:00 |
| CLAUDE.md | refactor: rename respondWithoutAnalysis to respondWithoutAssetCreation | 2025-07-23 08:12:23 -06:00 |
| README.md | Mastra braintrust (#391) | 2025-07-02 14:33:40 -07:00 |
| biome.json | update ai biome settings | 2025-07-22 12:20:51 -06:00 |
| env.d.ts | Mastra braintrust (#391) | 2025-07-02 14:33:40 -07:00 |
| package.json | dry run | 2025-07-25 18:29:35 -06:00 |
| tsconfig.json | Update inlcude | 2025-07-12 23:46:09 -06:00 |
| turbo.json | update database dev | 2025-07-15 22:26:13 -06:00 |
| vitest.config.ts | feat: add Google Vertex AI and improve model handling | 2025-07-23 07:22:52 -06:00 |

README.md

# AI Package

This package contains AI agents and tools built with the Mastra framework.

## Structure

```
src/
├── agents/           # AI agents
│   ├── weather-agent.ts
│   └── weather-agent.test.ts
├── tools/            # Tools for agents
│   ├── weather-tool.ts
│   └── weather-tool.test.ts
└── workflows/        # Workflows (if any)
```

## Testing

This project uses Bun's native testing framework for both unit tests and evaluations.

### Running Tests

```bash
# Run all tests
bun test

# Run tests in watch mode
bun test --watch

# Run tests with coverage
bun test --coverage

# Run a specific test file
bun test src/agents/weather-agent.test.ts

# Run only evaluation tests (filter by test name)
bun test -t "eval:"
```

### Test Types

1. **Integration Tests**
   - Test agent functionality end-to-end
   - Verify tool integration
   - Check conversation context handling
   - Validate error handling

2. **Unit Tests**
   - Test individual tool functionality
   - Validate input/output schemas
   - Test configuration and setup

3. **Evaluation Tests (LLM-as-Judge)**
   - **Answer Relevancy**: Does the response address the query?
   - **Helpfulness**: How well does it handle missing information?
   - **Error Handling**: Graceful handling of invalid inputs
   - **Tone Consistency**: Professional and appropriate tone
   - **Factual Accuracy**: Realistic data without hallucination
   - **Tool Usage**: Appropriate use of available tools
   - **Safety & Bias**: Free from harmful or biased content
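
The sketch below illustrates the unit-test flavor with Bun's `bun:test` runner. It is a minimal, hypothetical example: the import path, the `id` property, and the Zod-style `inputSchema` on `weatherTool` are assumptions about how a Mastra `createTool` tool is typically shaped, not this package's verified API.

```typescript
// src/tools/weather-tool.test.ts -- illustrative sketch only
import { describe, expect, test } from 'bun:test';
// Assumed export name and path, matching the structure shown above
import { weatherTool } from './weather-tool';

describe('weatherTool (unit)', () => {
  test('is configured with an id and an input schema', () => {
    expect(weatherTool.id).toBeDefined();
    expect(weatherTool.inputSchema).toBeDefined();
  });

  test('rejects input with no location', () => {
    // Assumes a Zod schema; safeParse returns { success: false } on bad input
    const result = weatherTool.inputSchema.safeParse({});
    expect(result.success).toBe(false);
  });
});
```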

## Environment Setup

Create a `.env` file with your API keys:

```bash
OPENAI_API_KEY=your_openai_api_key_here
```

## Test Configuration

Tests are configured with appropriate timeouts:

- Unit tests: default timeout (5s)
- Integration tests: 30-45s for LLM calls
- Evaluation tests: 45-60s for complex evaluations
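
In Bun's Jest-compatible runner, a per-test timeout can be passed as the third argument to `test` (in milliseconds), which is how the longer budgets above would normally be expressed. The test body here is only a stand-in for a real LLM call.

```typescript
import { expect, test } from 'bun:test';

// 45s budget to match the integration-test guidance above
test('integration: generates a response for a weather query', async () => {
  const response = await Promise.resolve('sunny'); // placeholder for an agent/LLM call
  expect(response).toBeTruthy();
}, 45_000);
```

The default 5s budget can also be raised for a whole run with `bun test --timeout <ms>`.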

## Evaluation Methodology

The evaluation tests use an LLM-as-Judge approach where:

1. **Test Case**: The agent generates a response to a specific input
2. **Evaluation**: Another LLM evaluates the response against the criteria
3. **Scoring**: Scores are on a 0-1 scale with specific thresholds for pass/fail
4. **Reporting**: Detailed scores and reasoning are logged to the console
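
A minimal sketch of that flow is shown below, using the OpenAI SDK as the judge and a hypothetical `weatherAgent` with a Mastra-style `generate` call that returns a `.text` field. The prompt, model name, and 0.7 threshold are illustrative choices, not the package's actual eval harness.

```typescript
import { expect, test } from 'bun:test';
import OpenAI from 'openai';
// Hypothetical import; the real agent lives under src/agents in this package
import { weatherAgent } from './weather-agent';

const judge = new OpenAI(); // reads OPENAI_API_KEY from the environment

test('eval: answer relevancy for basic weather query', async () => {
  // 1. Test case: the agent answers a specific input
  const answer = await weatherAgent.generate('What is the weather in Berlin?');

  // 2. Evaluation: a second model scores the answer against the criterion
  const completion = await judge.chat.completions.create({
    model: 'gpt-4o-mini',
    messages: [
      {
        role: 'user',
        content:
          'On a 0-1 scale, how well does this answer address "What is the weather in Berlin?" ' +
          `Reply with only the number.\n\nAnswer: ${answer.text}`,
      },
    ],
  });

  // 3. Scoring and 4. Reporting: parse the score, log it, and gate on a threshold
  const score = Number(completion.choices[0]?.message?.content ?? 0);
  console.info(`Answer Relevancy Score: ${(score * 100).toFixed(1)}%`);
  expect(score).toBeGreaterThanOrEqual(0.7); // illustrative threshold
}, 60_000);
```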

## Example Test Output

```
✓ should be properly configured
✓ should generate response for weather query
✓ eval: answer relevancy for basic weather query
  Answer Relevancy Score: 85.0% - Response provides accurate weather data for requested location

✓ eval: helpfulness when location is missing
  Helpfulness Score: 92.0% - Politely asks for location specification

✓ eval: safety and bias detection
  Safety Score: 98.0% - Response is neutral and factual
```

## Adding New Tests

1. **For new agents**: Create `{agent-name}.test.ts` alongside the agent file
2. **For new tools**: Create `{tool-name}.test.ts` alongside the tool file
3. **For evaluations**: Add new test cases to the "Evaluations" `describe` block
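
For the third point, a new case slots into the existing suite roughly as follows. `scoreWithJudge`, its import path, and the `weatherAgent` call are hypothetical stand-ins for whatever judge utility and agent the suite already defines.

```typescript
import { describe, expect, test } from 'bun:test';
// Hypothetical helpers; the real suite would supply its own equivalents
import { scoreWithJudge } from './eval-helpers';
import { weatherAgent } from './weather-agent';

describe('Evaluations', () => {
  // ...existing eval cases...

  test('eval: tone consistency for a frustrated user', async () => {
    const response = await weatherAgent.generate('This app never works. Weather in Oslo?');
    const score = await scoreWithJudge('tone consistency', response.text);
    expect(score).toBeGreaterThanOrEqual(0.8); // illustrative threshold
  }, 60_000);
});
```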

## CI/CD Integration

Tests can be run in CI environments:

```bash
# In CI pipeline
bun test --reporter=junit --coverage
```

The evaluation tests will fail if scores fall below the defined thresholds, ensuring quality gates are maintained.