mirror of https://github.com/kortix-ai/suna.git
349 lines
8.0 KiB
Markdown
349 lines
8.0 KiB
Markdown
|
# Backend Architecture for Agent Workflow System
|
||
|
|
||
|
## Overview
|
||
|
|
||
|
This document describes the scalable, modular, and extensible backend architecture for the agent workflow system, designed to work like Zapier but optimized for AI agent workflows.
|
||
|
|
||
|
## Architecture Principles
|
||
|
|
||
|
1. **Scalability**: Horizontal scaling through microservices and queue-based processing
|
||
|
2. **Modularity**: Clear separation of concerns with pluggable components
|
||
|
3. **Extensibility**: Easy to add new triggers, nodes, and integrations
|
||
|
4. **Reliability**: Fault tolerance, retries, and graceful degradation
|
||
|
5. **Performance**: Async processing, caching, and efficient resource usage
|
||
|
|
||
|
## System Components
|
||
|
|
||
|
### 1. API Gateway Layer
|
||
|
|
||
|
- **Load Balancer**: Distributes traffic across API instances
|
||
|
- **Authentication**: JWT-based auth with role-based access control
|
||
|
- **Rate Limiting**: Per-user and per-IP rate limits
|
||
|
- **Request Routing**: Routes to appropriate services
|
||
|
|
||
|
### 2. Trigger System
|
||
|
|
||
|
#### Supported Trigger Types:
|
||
|
|
||
|
- **Webhook Triggers**
|
||
|
- Unique URLs per workflow
|
||
|
- HMAC signature validation
|
||
|
- Custom header validation
|
||
|
- Request/response transformation
|
||
|
|
||
|
- **Schedule Triggers**
|
||
|
- Cron-based scheduling
|
||
|
- Timezone support
|
||
|
- Execution windows
|
||
|
- Missed execution handling
|
||
|
|
||
|
- **Event Triggers**
|
||
|
- Real-time event bus (Redis Pub/Sub)
|
||
|
- Event filtering and routing
|
||
|
- Event replay capability
|
||
|
|
||
|
- **Polling Triggers**
|
||
|
- Configurable intervals
|
||
|
- Change detection
|
||
|
- Rate limiting
|
||
|
|
||
|
- **Manual Triggers**
|
||
|
- UI-based execution
|
||
|
- API-based execution
|
||
|
- Bulk execution support
|
||
|
|
||
|
### 3. Workflow Engine
|
||
|
|
||
|
#### Core Components:
|
||
|
|
||
|
- **Workflow Orchestrator**
|
||
|
- Manages workflow lifecycle
|
||
|
- Handles execution flow
|
||
|
- Manages dependencies
|
||
|
- Error handling and retries
|
||
|
|
||
|
- **Workflow Executor**
|
||
|
- Executes individual nodes
|
||
|
- Manages parallel execution
|
||
|
- Resource allocation
|
||
|
- Performance monitoring
|
||
|
|
||
|
- **State Manager**
|
||
|
- Distributed state management (Redis)
|
||
|
- Execution context persistence
|
||
|
- Checkpoint and recovery
|
||
|
- Real-time status updates
|
||
|
|
||
|
### 4. Node Types
|
||
|
|
||
|
- **Agent Nodes**: AI-powered processing with multiple models
|
||
|
- **Tool Nodes**: Integration with external services
|
||
|
- **Transform Nodes**: Data manipulation and formatting
|
||
|
- **Condition Nodes**: If/else and switch logic
|
||
|
- **Loop Nodes**: For/while iterations
|
||
|
- **Parallel Nodes**: Concurrent execution branches
|
||
|
- **Webhook Nodes**: HTTP requests to external services
|
||
|
- **Delay Nodes**: Time-based delays
|
||
|
|
||
|
### 5. Data Flow
|
||
|
|
||
|
```
|
||
|
Trigger → Queue → Orchestrator → Executor → Node → Output
|
||
|
↓ ↓ ↓
|
||
|
State Manager Tool Service Results
|
||
|
```
|
||
|
|
||
|
### 6. Storage Architecture
|
||
|
|
||
|
- **PostgreSQL**: Workflow definitions, configurations, audit logs
|
||
|
- **Redis**: Execution state, queues, caching, pub/sub
|
||
|
- **S3/Blob Storage**: Large files, logs, execution artifacts
|
||
|
- **TimescaleDB**: Time-series data, metrics, analytics
|
||
|
|
||
|
### 7. Queue System
|
||
|
|
||
|
- **RabbitMQ**: Task queuing, priority queues, dead letter queues
|
||
|
- **Kafka**: Event streaming, audit trail, real-time analytics
|
||
|
|
||
|
## Execution Flow
|
||
|
|
||
|
### 1. Trigger Phase
|
||
|
```python
|
||
|
1. Trigger fires (webhook/schedule/event/etc)
|
||
|
2. Validate trigger configuration
|
||
|
3. Create ExecutionContext
|
||
|
4. Queue workflow for execution
|
||
|
```
|
||
|
|
||
|
### 2. Orchestration Phase
|
||
|
```python
|
||
|
1. Load workflow definition
|
||
|
2. Build execution graph
|
||
|
3. Determine execution order
|
||
|
4. Initialize state management
|
||
|
```
|
||
|
|
||
|
### 3. Execution Phase
|
||
|
```python
|
||
|
1. Execute nodes in topological order
|
||
|
2. Handle parallel branches
|
||
|
3. Manage data flow between nodes
|
||
|
4. Update execution state
|
||
|
```
|
||
|
|
||
|
### 4. Completion Phase
|
||
|
```python
|
||
|
1. Aggregate results
|
||
|
2. Execute post-processing
|
||
|
3. Trigger downstream workflows
|
||
|
4. Clean up resources
|
||
|
```
|
||
|
|
||
|
## Scalability Features
|
||
|
|
||
|
### Horizontal Scaling
|
||
|
- Stateless API servers
|
||
|
- Distributed queue workers
|
||
|
- Shared state via Redis
|
||
|
- Database read replicas
|
||
|
|
||
|
### Performance Optimization
|
||
|
- Connection pooling
|
||
|
- Result caching
|
||
|
- Batch processing
|
||
|
- Async I/O throughout
|
||
|
|
||
|
### Resource Management
|
||
|
- Worker pool management
|
||
|
- Memory limits per execution
|
||
|
- CPU throttling
|
||
|
- Concurrent execution limits
|
||
|
|
||
|
## Security
|
||
|
|
||
|
### Authentication & Authorization
|
||
|
- JWT tokens with refresh
|
||
|
- API key authentication
|
||
|
- OAuth2 integration
|
||
|
- Role-based permissions
|
||
|
|
||
|
### Data Security
|
||
|
- Encryption at rest
|
||
|
- TLS for all communications
|
||
|
- Secret management (Vault)
|
||
|
- Audit logging
|
||
|
|
||
|
### Webhook Security
|
||
|
- HMAC signature validation
|
||
|
- IP whitelisting
|
||
|
- Rate limiting
|
||
|
- Request size limits
|
||
|
|
||
|
## Monitoring & Observability
|
||
|
|
||
|
### Metrics
|
||
|
- Prometheus metrics
|
||
|
- Custom business metrics
|
||
|
- Performance tracking
|
||
|
- Resource utilization
|
||
|
|
||
|
### Logging
|
||
|
- Structured logging
|
||
|
- Centralized log aggregation
|
||
|
- Log levels and filtering
|
||
|
- Correlation IDs
|
||
|
|
||
|
### Tracing
|
||
|
- Distributed tracing (OpenTelemetry)
|
||
|
- LLM monitoring (Langfuse)
|
||
|
- Execution visualization
|
||
|
- Performance profiling
|
||
|
|
||
|
### Alerting
|
||
|
- Error rate monitoring
|
||
|
- SLA tracking
|
||
|
- Resource alerts
|
||
|
- Custom alerts
|
||
|
|
||
|
## Error Handling
|
||
|
|
||
|
### Retry Strategies
|
||
|
- Exponential backoff
|
||
|
- Circuit breakers
|
||
|
- Dead letter queues
|
||
|
- Manual retry options
|
||
|
|
||
|
### Failure Modes
|
||
|
- Node-level failures
|
||
|
- Workflow-level failures
|
||
|
- System-level failures
|
||
|
- Graceful degradation
|
||
|
|
||
|
## API Endpoints
|
||
|
|
||
|
### Workflow Management
|
||
|
```
|
||
|
POST /api/workflows # Create workflow
|
||
|
GET /api/workflows/:id # Get workflow
|
||
|
PUT /api/workflows/:id # Update workflow
|
||
|
DELETE /api/workflows/:id # Delete workflow
|
||
|
POST /api/workflows/:id/activate # Activate workflow
|
||
|
POST /api/workflows/:id/pause # Pause workflow
|
||
|
```
|
||
|
|
||
|
### Execution Management
|
||
|
```
|
||
|
POST /api/workflows/:id/execute # Manual execution
|
||
|
GET /api/executions/:id # Get execution status
|
||
|
POST /api/executions/:id/cancel # Cancel execution
|
||
|
GET /api/executions/:id/logs # Get execution logs
|
||
|
```
|
||
|
|
||
|
### Trigger Management
|
||
|
```
|
||
|
GET /api/workflows/:id/triggers # List triggers
|
||
|
POST /api/workflows/:id/triggers # Add trigger
|
||
|
PUT /api/triggers/:id # Update trigger
|
||
|
DELETE /api/triggers/:id # Remove trigger
|
||
|
```
|
||
|
|
||
|
### Webhook Endpoints
|
||
|
```
|
||
|
POST /webhooks/:path # Webhook receiver
|
||
|
GET /api/webhooks # List webhooks
|
||
|
```
|
||
|
|
||
|
## Database Schema
|
||
|
|
||
|
### Core Tables
|
||
|
|
||
|
```sql
|
||
|
-- Workflows table
|
||
|
CREATE TABLE workflows (
|
||
|
id UUID PRIMARY KEY,
|
||
|
name VARCHAR(255),
|
||
|
description TEXT,
|
||
|
project_id UUID,
|
||
|
status VARCHAR(50),
|
||
|
definition JSONB,
|
||
|
created_at TIMESTAMP,
|
||
|
updated_at TIMESTAMP
|
||
|
);
|
||
|
|
||
|
-- Workflow executions
|
||
|
CREATE TABLE workflow_executions (
|
||
|
id UUID PRIMARY KEY,
|
||
|
workflow_id UUID,
|
||
|
status VARCHAR(50),
|
||
|
started_at TIMESTAMP,
|
||
|
completed_at TIMESTAMP,
|
||
|
context JSONB,
|
||
|
result JSONB,
|
||
|
error TEXT
|
||
|
);
|
||
|
|
||
|
-- Triggers
|
||
|
CREATE TABLE triggers (
|
||
|
id UUID PRIMARY KEY,
|
||
|
workflow_id UUID,
|
||
|
type VARCHAR(50),
|
||
|
config JSONB,
|
||
|
is_active BOOLEAN
|
||
|
);
|
||
|
|
||
|
-- Webhook registrations
|
||
|
CREATE TABLE webhook_registrations (
|
||
|
id UUID PRIMARY KEY,
|
||
|
workflow_id UUID,
|
||
|
path VARCHAR(255) UNIQUE,
|
||
|
secret VARCHAR(255),
|
||
|
config JSONB
|
||
|
);
|
||
|
```
|
||
|
|
||
|
## Deployment
|
||
|
|
||
|
### Docker Compose (Development)
|
||
|
```yaml
|
||
|
services:
|
||
|
api:
|
||
|
build: .
|
||
|
ports:
|
||
|
- "8000:8000"
|
||
|
depends_on:
|
||
|
- postgres
|
||
|
- redis
|
||
|
- rabbitmq
|
||
|
|
||
|
worker:
|
||
|
build: .
|
||
|
command: python -m workflow_engine.worker
|
||
|
depends_on:
|
||
|
- postgres
|
||
|
- redis
|
||
|
- rabbitmq
|
||
|
|
||
|
scheduler:
|
||
|
build: .
|
||
|
command: python -m workflow_engine.scheduler
|
||
|
depends_on:
|
||
|
- postgres
|
||
|
- redis
|
||
|
```
|
||
|
|
||
|
### Kubernetes (Production)
|
||
|
- Deployment manifests for each service
|
||
|
- Horizontal Pod Autoscaling
|
||
|
- Service mesh (Istio)
|
||
|
- Persistent volume claims
|
||
|
|
||
|
## Future Enhancements
|
||
|
|
||
|
1. **Workflow Versioning**: Track and manage workflow versions
|
||
|
2. **A/B Testing**: Test different workflow variations
|
||
|
3. **Workflow Templates**: Pre-built workflow templates
|
||
|
4. **Advanced Analytics**: Detailed execution analytics
|
||
|
5. **Multi-tenancy**: Full isolation between projects
|
||
|
6. **Workflow Marketplace**: Share and monetize workflows
|
||
|
7. **Visual Debugging**: Step-through debugging
|
||
|
8. **Performance Optimization**: ML-based optimization
|