suna/backend/CHANGELOG_ERROR_HANDLING.md

# Error Handling Enhancement Changelog

## Overview
Extended the existing `AnthropicException - Overloaded` handling into general error detection with model fallback for multiple LLM providers (Anthropic, OpenRouter, OpenAI, xAI).

## Changes Made

### 1. Enhanced `services/llm.py`

**Added:**
- `detect_error_and_suggest_fallback()` function (lines 102-175)
  - Detects specific error types from different LLM providers
  - Suggests appropriate fallback models based on current model and error type
  - Returns a tuple: `(should_fallback, fallback_model, error_type)`

**Modified:**
- `make_llm_api_call()` function (lines 320-340)
  - Enhanced retry logic to use new error detection function
  - Better handling of fallback-eligible errors on final retry attempt
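
The retry behavior described above could look roughly like the following. This is a minimal synchronous sketch, not the code in `make_llm_api_call()`: the name `call_with_retries`, the `Detector` alias, and the choice to re-raise the original exception on the final attempt are all assumptions made for illustration, and backoff between attempts is omitted.

```python
from typing import Callable, Optional, Tuple

# Hypothetical alias for a detector with the same return shape as
# detect_error_and_suggest_fallback(): (should_fallback, fallback_model, error_type).
Detector = Callable[[Exception, str], Tuple[bool, Optional[str], Optional[str]]]


def call_with_retries(
    call: Callable[[str], str],
    model: str,
    detect: Detector,
    max_retries: int = 3,
) -> str:
    """Retry call(model); on the final attempt, re-raise fallback-eligible errors as-is."""
    last_error: Optional[Exception] = None
    for attempt in range(max_retries):
        try:
            return call(model)
        except Exception as exc:
            last_error = exc
            if attempt == max_retries - 1:
                should_fallback, _, _ = detect(exc, model)
                if should_fallback:
                    # Assumption: keep the original exception so the caller
                    # can run its own detection and switch models.
                    raise
    # Errors that are not fallback-eligible are wrapped in a generic failure.
    raise RuntimeError(f"LLM call failed after {max_retries} attempts") from last_error
```

Re-raising the original exception keeps its message intact, which matters if callers such as the auto-continue wrapper inspect `str(e)` to decide on a fallback.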

### 2. Updated `agentpress/thread_manager.py`

**Modified:**
- Auto-continue wrapper exception handling (lines 479-495)
  - Replaced hardcoded `AnthropicException - Overloaded` check
  - Integrated `detect_error_and_suggest_fallback()` function
  - Enhanced logging with specific error types
  - Dynamic fallback model selection

**Before:**
```python
if ("AnthropicException - Overloaded" in str(e)):
    logger.error(f"AnthropicException - Overloaded detected - Falling back to OpenRouter: {str(e)}", exc_info=True)
    llm_model = f"openrouter/{llm_model}"
```

**After:**
```python
should_fallback, fallback_model, error_type = detect_error_and_suggest_fallback(e, llm_model)
if should_fallback:
    logger.error(f"{error_type} detected - Falling back to {fallback_model}: {str(e)}", exc_info=True)
    llm_model = fallback_model
```

### 3. Updated `agentpress/response_processor.py`

**Modified:**
- Streaming response processing exception handling (lines 802-820)
  - Replaced hardcoded `AnthropicException - Overloaded` check
  - Integrated `detect_error_and_suggest_fallback()` function
  - Enhanced error logging with specific error types
  - Improved trace event naming

**Before:**
```python
if (not "AnthropicException - Overloaded" in str(e)):
    ...  # Handle non-Anthropic errors (body elided)
else:
    logger.error(f"AnthropicException - Overloaded detected - Falling back to OpenRouter: {str(e)}", exc_info=True)
```

**After:**
```python
should_fallback, fallback_model, error_type = detect_error_and_suggest_fallback(e, llm_model)
if not should_fallback:
    ...  # Handle non-fallback errors (body elided)
else:
    logger.error(f"{error_type} detected - Falling back to {fallback_model}: {str(e)}", exc_info=True)
```

### 4. Added Comprehensive Testing

**Created:**
- `tests/test_error_handling.py` - Comprehensive test suite covering:
  - All supported error types (15 test cases)
  - Case insensitivity testing
  - Model-specific fallback strategies
  - Edge cases and error conditions

**Test Coverage:**
- Anthropic-specific errors (overloaded)
- OpenRouter-specific errors (connection, rate limit)
- OpenAI-specific errors (rate limit, connection, service unavailable)
- xAI-specific errors (rate limit, connection)
- Generic errors (connection, rate limit, service unavailable)
- Unknown error handling
- Case insensitivity validation
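
To illustrate the shape of such tests, here is a small sketch of a case-insensitivity check and an unknown-error check. It does not reproduce `tests/test_error_handling.py`: `detect_stub` is a hypothetical stand-in that recognizes only the Anthropic overloaded message, and both its `"anthropic_overloaded"` label and its `(False, None, None)` result for unknown errors are assumptions.

```python
import unittest
from typing import Optional, Tuple


def detect_stub(error: Exception, model: str) -> Tuple[bool, Optional[str], Optional[str]]:
    """Hypothetical stand-in for detect_error_and_suggest_fallback()."""
    # Lower-casing the message makes the substring match case-insensitive.
    if "anthropicexception - overloaded" in str(error).lower():
        return True, f"openrouter/{model}", "anthropic_overloaded"
    return False, None, None


class CaseInsensitivityTest(unittest.TestCase):
    def test_message_case_does_not_matter(self):
        variants = (
            "AnthropicException - Overloaded",
            "ANTHROPICEXCEPTION - OVERLOADED",
            "anthropicexception - overloaded",
        )
        for text in variants:
            with self.subTest(text=text):
                should_fallback, fallback_model, error_type = detect_stub(
                    Exception(text), "anthropic/claude-3-sonnet"
                )
                self.assertTrue(should_fallback)
                self.assertEqual(fallback_model, "openrouter/anthropic/claude-3-sonnet")
                self.assertEqual(error_type, "anthropic_overloaded")

    def test_unknown_error_has_no_fallback(self):
        result = detect_stub(Exception("something unrelated"), "gpt-4o")
        self.assertEqual(result, (False, None, None))
```

Saved as a module, tests like these can be run with `python -m unittest`.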

### 5. Documentation

**Created:**
- `docs/ERROR_HANDLING.md` - Comprehensive documentation covering:
  - System overview and architecture
  - Supported error types and fallback strategies
  - Implementation details and usage examples
  - Testing procedures and benefits

## Supported Error Types

### Provider-Specific Errors
1. **Anthropic:** `AnthropicException - Overloaded`
2. **OpenRouter:** Connection/timeout, rate limit errors
3. **OpenAI:** Rate limit, connection, service unavailable errors
4. **xAI:** Rate limit, connection errors

### Generic Error Patterns
1. **Connection/Timeout:** `"connection"`, `"timeout"`
2. **Rate Limiting:** `"rate limit"`, `"quota"`
3. **Service Issues:** `"service unavailable"`, `"internal server error"`, `"bad gateway"`
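
One way to match these generic patterns is a case-insensitive substring lookup. The sketch below uses the substrings listed above; the function name `classify_generic_error` and the category labels are hypothetical, and the real detection logic may order or combine the checks differently.

```python
from typing import Optional

# Substrings come from the list above; the category keys are hypothetical labels.
GENERIC_PATTERNS = {
    "connection": ("connection", "timeout"),
    "rate_limit": ("rate limit", "quota"),
    "service_unavailable": ("service unavailable", "internal server error", "bad gateway"),
}


def classify_generic_error(error: Exception) -> Optional[str]:
    """Return the first matching category for the error message, or None."""
    message = str(error).lower()  # case-insensitive comparison
    for category, needles in GENERIC_PATTERNS.items():
        if any(needle in message for needle in needles):
            return category
    return None
```

Because dictionaries preserve insertion order, a message that matches several categories is reported under the first one listed.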

## Fallback Strategies

### Hierarchical Approach
1. **Provider-Specific:** Use provider-specific fallback models
2. **OpenRouter Migration:** Switch to OpenRouter versions if not already using them
3. **Model Family:** Within OpenRouter, try different models of the same family
4. **No Fallback:** Return `should_fallback` as `False` if no appropriate fallback is found

### Model Mapping Examples
- `anthropic/claude-3-sonnet` → `openrouter/anthropic/claude-sonnet-4`
- `gpt-4o` → `openrouter/openai/gpt-4o`
- `xai/grok-4` → `openrouter/x-ai/grok-4`
- `openrouter/anthropic/claude-3-sonnet` → `openrouter/anthropic/claude-sonnet-4` (for connection issues)
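
The mapping examples above can be expressed as a simple lookup. This sketch is illustrative only: the table and function names are hypothetical, the `"connection"` error-type label is an assumption, and returning `None` for unmapped models is a simplification rather than documented behavior.

```python
from typing import Optional

# Pairs copied from the mapping examples above.
DIRECT_TO_OPENROUTER = {
    "anthropic/claude-3-sonnet": "openrouter/anthropic/claude-sonnet-4",
    "gpt-4o": "openrouter/openai/gpt-4o",
    "xai/grok-4": "openrouter/x-ai/grok-4",
}

# Same-family alternative for a model that is already served through OpenRouter.
WITHIN_OPENROUTER = {
    "openrouter/anthropic/claude-3-sonnet": "openrouter/anthropic/claude-sonnet-4",
}


def suggest_fallback_model(current_model: str, error_type: str) -> Optional[str]:
    """Return a fallback model for current_model, or None if no mapping applies."""
    if current_model.startswith("openrouter/"):
        # The listed OpenRouter example applies to connection issues only.
        if error_type == "connection":
            return WITHIN_OPENROUTER.get(current_model)
        return None
    # Models not yet on OpenRouter move to their OpenRouter counterpart.
    return DIRECT_TO_OPENROUTER.get(current_model)
```

For example, `suggest_fallback_model("gpt-4o", "rate_limit")` returns `"openrouter/openai/gpt-4o"`, while an OpenRouter model with a non-connection error gets no suggestion in this sketch.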

## Benefits

1. **Improved Reliability:** Automatic fallback to alternative models
2. **Better User Experience:** Reduced downtime due to provider issues
3. **Comprehensive Coverage:** Handles multiple error types from different providers
4. **Intelligent Fallbacks:** Context-aware fallback suggestions
5. **Enhanced Logging:** Specific error types for better monitoring
6. **Backward Compatibility:** Maintains existing functionality while extending capabilities

## Testing Results

All 15 test cases pass. They cover:
- ✅ Anthropic overloaded errors
- ✅ OpenRouter connection and rate limit errors
- ✅ OpenAI rate limit, connection, and service errors
- ✅ xAI rate limit and connection errors
- ✅ Generic error patterns
- ✅ Case insensitivity
- ✅ Unknown error handling

## Future Considerations

1. **Configurable Fallbacks:** Allow user configuration of preferred fallback models
2. **Fallback Chains:** Support multiple sequential fallback attempts
3. **Performance Tracking:** Monitor fallback success rates and response times
4. **Health Monitoring:** Proactive provider health assessment
5. **Cost Optimization:** Consider pricing when suggesting fallbacks