suna/backend/ARCHITECTURE.md

349 lines
8.0 KiB
Markdown

# Backend Architecture for Agent Workflow System
## Overview
This document describes the scalable, modular, and extensible backend architecture for the agent workflow system, designed to work like Zapier but optimized for AI agent workflows.
## Architecture Principles
1. **Scalability**: Horizontal scaling through microservices and queue-based processing
2. **Modularity**: Clear separation of concerns with pluggable components
3. **Extensibility**: Easy to add new triggers, nodes, and integrations
4. **Reliability**: Fault tolerance, retries, and graceful degradation
5. **Performance**: Async processing, caching, and efficient resource usage
## System Components
### 1. API Gateway Layer
- **Load Balancer**: Distributes traffic across API instances
- **Authentication**: JWT-based auth with role-based access control
- **Rate Limiting**: Per-user and per-IP rate limits
- **Request Routing**: Routes to appropriate services
### 2. Trigger System
#### Supported Trigger Types:
- **Webhook Triggers**
- Unique URLs per workflow
- HMAC signature validation
- Custom header validation
- Request/response transformation
- **Schedule Triggers**
- Cron-based scheduling
- Timezone support
- Execution windows
- Missed execution handling
- **Event Triggers**
- Real-time event bus (Redis Pub/Sub)
- Event filtering and routing
- Event replay capability
- **Polling Triggers**
- Configurable intervals
- Change detection
- Rate limiting
- **Manual Triggers**
- UI-based execution
- API-based execution
- Bulk execution support
### 3. Workflow Engine
#### Core Components:
- **Workflow Orchestrator**
- Manages workflow lifecycle
- Handles execution flow
- Manages dependencies
- Error handling and retries
- **Workflow Executor**
- Executes individual nodes
- Manages parallel execution
- Resource allocation
- Performance monitoring
- **State Manager**
- Distributed state management (Redis)
- Execution context persistence
- Checkpoint and recovery
- Real-time status updates
### 4. Node Types
- **Agent Nodes**: AI-powered processing with multiple models
- **Tool Nodes**: Integration with external services
- **Transform Nodes**: Data manipulation and formatting
- **Condition Nodes**: If/else and switch logic
- **Loop Nodes**: For/while iterations
- **Parallel Nodes**: Concurrent execution branches
- **Webhook Nodes**: HTTP requests to external services
- **Delay Nodes**: Time-based delays
### 5. Data Flow
```
Trigger → Queue → Orchestrator → Executor → Node → Output
↓ ↓ ↓
State Manager Tool Service Results
```
### 6. Storage Architecture
- **PostgreSQL**: Workflow definitions, configurations, audit logs
- **Redis**: Execution state, queues, caching, pub/sub
- **S3/Blob Storage**: Large files, logs, execution artifacts
- **TimescaleDB**: Time-series data, metrics, analytics
### 7. Queue System
- **RabbitMQ**: Task queuing, priority queues, dead letter queues
- **Kafka**: Event streaming, audit trail, real-time analytics
## Execution Flow
### 1. Trigger Phase
```python
1. Trigger fires (webhook/schedule/event/etc)
2. Validate trigger configuration
3. Create ExecutionContext
4. Queue workflow for execution
```
### 2. Orchestration Phase
```python
1. Load workflow definition
2. Build execution graph
3. Determine execution order
4. Initialize state management
```
### 3. Execution Phase
```python
1. Execute nodes in topological order
2. Handle parallel branches
3. Manage data flow between nodes
4. Update execution state
```
### 4. Completion Phase
```python
1. Aggregate results
2. Execute post-processing
3. Trigger downstream workflows
4. Clean up resources
```
## Scalability Features
### Horizontal Scaling
- Stateless API servers
- Distributed queue workers
- Shared state via Redis
- Database read replicas
### Performance Optimization
- Connection pooling
- Result caching
- Batch processing
- Async I/O throughout
### Resource Management
- Worker pool management
- Memory limits per execution
- CPU throttling
- Concurrent execution limits
## Security
### Authentication & Authorization
- JWT tokens with refresh
- API key authentication
- OAuth2 integration
- Role-based permissions
### Data Security
- Encryption at rest
- TLS for all communications
- Secret management (Vault)
- Audit logging
### Webhook Security
- HMAC signature validation
- IP whitelisting
- Rate limiting
- Request size limits
## Monitoring & Observability
### Metrics
- Prometheus metrics
- Custom business metrics
- Performance tracking
- Resource utilization
### Logging
- Structured logging
- Centralized log aggregation
- Log levels and filtering
- Correlation IDs
### Tracing
- Distributed tracing (OpenTelemetry)
- LLM monitoring (Langfuse)
- Execution visualization
- Performance profiling
### Alerting
- Error rate monitoring
- SLA tracking
- Resource alerts
- Custom alerts
## Error Handling
### Retry Strategies
- Exponential backoff
- Circuit breakers
- Dead letter queues
- Manual retry options
### Failure Modes
- Node-level failures
- Workflow-level failures
- System-level failures
- Graceful degradation
## API Endpoints
### Workflow Management
```
POST /api/workflows # Create workflow
GET /api/workflows/:id # Get workflow
PUT /api/workflows/:id # Update workflow
DELETE /api/workflows/:id # Delete workflow
POST /api/workflows/:id/activate # Activate workflow
POST /api/workflows/:id/pause # Pause workflow
```
### Execution Management
```
POST /api/workflows/:id/execute # Manual execution
GET /api/executions/:id # Get execution status
POST /api/executions/:id/cancel # Cancel execution
GET /api/executions/:id/logs # Get execution logs
```
### Trigger Management
```
GET /api/workflows/:id/triggers # List triggers
POST /api/workflows/:id/triggers # Add trigger
PUT /api/triggers/:id # Update trigger
DELETE /api/triggers/:id # Remove trigger
```
### Webhook Endpoints
```
POST /webhooks/:path # Webhook receiver
GET /api/webhooks # List webhooks
```
## Database Schema
### Core Tables
```sql
-- Workflows table
CREATE TABLE workflows (
id UUID PRIMARY KEY,
name VARCHAR(255),
description TEXT,
project_id UUID,
status VARCHAR(50),
definition JSONB,
created_at TIMESTAMP,
updated_at TIMESTAMP
);
-- Workflow executions
CREATE TABLE workflow_executions (
id UUID PRIMARY KEY,
workflow_id UUID,
status VARCHAR(50),
started_at TIMESTAMP,
completed_at TIMESTAMP,
context JSONB,
result JSONB,
error TEXT
);
-- Triggers
CREATE TABLE triggers (
id UUID PRIMARY KEY,
workflow_id UUID,
type VARCHAR(50),
config JSONB,
is_active BOOLEAN
);
-- Webhook registrations
CREATE TABLE webhook_registrations (
id UUID PRIMARY KEY,
workflow_id UUID,
path VARCHAR(255) UNIQUE,
secret VARCHAR(255),
config JSONB
);
```
## Deployment
### Docker Compose (Development)
```yaml
services:
api:
build: .
ports:
- "8000:8000"
depends_on:
- postgres
- redis
- rabbitmq
worker:
build: .
command: python -m workflow_engine.worker
depends_on:
- postgres
- redis
- rabbitmq
scheduler:
build: .
command: python -m workflow_engine.scheduler
depends_on:
- postgres
- redis
```
### Kubernetes (Production)
- Deployment manifests for each service
- Horizontal Pod Autoscaling
- Service mesh (Istio)
- Persistent volume claims
## Future Enhancements
1. **Workflow Versioning**: Track and manage workflow versions
2. **A/B Testing**: Test different workflow variations
3. **Workflow Templates**: Pre-built workflow templates
4. **Advanced Analytics**: Detailed execution analytics
5. **Multi-tenancy**: Full isolation between projects
6. **Workflow Marketplace**: Share and monetize workflows
7. **Visual Debugging**: Step-through debugging
8. **Performance Optimization**: ML-based optimization