# Error Handling Enhancement Changelog
## Overview
Extended the existing `AnthropicException - Overloaded` error handling to support comprehensive error detection and fallback strategies for multiple LLM providers.
## Changes Made
### 1. Enhanced `services/llm.py`
**Added:**
- `detect_error_and_suggest_fallback()` function (lines 102-175)
  - Detects specific error types from different LLM providers
  - Suggests appropriate fallback models based on current model and error type
  - Returns tuple: `(should_fallback, fallback_model, error_type)`
**Modified:**
- `make_llm_api_call()` function (lines 320-340)
  - Enhanced retry logic to use new error detection function
  - Better handling of fallback-eligible errors on final retry attempt
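
The retry change described above can be pictured with a minimal sketch. This is not the code in `services/llm.py`: the function name `call_with_fallback_sketch`, its parameters, and the decision to make exactly one extra attempt on the fallback model after the final retry are all assumptions made for illustration. The `detect` argument stands in for a function with the `(should_fallback, fallback_model, error_type)` return shape.

```python
from typing import Callable, Optional, Tuple

# Return shape of the detector: (should_fallback, fallback_model, error_type)
Detection = Tuple[bool, Optional[str], Optional[str]]


def call_with_fallback_sketch(
    call_model: Callable[[str], str],
    detect: Callable[[Exception, str], Detection],
    model: str,
    max_retries: int = 3,
) -> str:
    """Hypothetical retry loop: retry the same model, then try a fallback once."""
    last_error: Optional[Exception] = None
    for attempt in range(max_retries):
        try:
            return call_model(model)
        except Exception as exc:
            last_error = exc
            is_final_attempt = attempt == max_retries - 1
            should_fallback, fallback_model, _error_type = detect(exc, model)
            # Assumption: only the final failed attempt triggers the fallback.
            if is_final_attempt and should_fallback and fallback_model:
                return call_model(fallback_model)
    raise RuntimeError(f"all {max_retries} attempts failed") from last_error
```

A real implementation would likely also wait between attempts and log the detected error type; both are omitted here to keep the sketch short.
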
### 2. Updated `agentpress/thread_manager.py`
**Modified:**
- Auto-continue wrapper exception handling (lines 479-495)
  - Replaced hardcoded `AnthropicException - Overloaded` check
  - Integrated `detect_error_and_suggest_fallback()` function
  - Enhanced logging with specific error types
  - Dynamic fallback model selection
**Before:**
```python
if ("AnthropicException - Overloaded" in str(e)):
    logger.error(f"AnthropicException - Overloaded detected - Falling back to OpenRouter: {str(e)}", exc_info=True)
    llm_model = f"openrouter/{llm_model}"
```
**After:**
```python
should_fallback, fallback_model, error_type = detect_error_and_suggest_fallback(e, llm_model)
if should_fallback:
    logger.error(f"{error_type} detected - Falling back to {fallback_model}: {str(e)}", exc_info=True)
    llm_model = fallback_model
```
### 3. Updated `agentpress/response_processor.py`
**Modified:**
- Streaming response processing exception handling (lines 802-820)
  - Replaced hardcoded `AnthropicException - Overloaded` check
  - Integrated `detect_error_and_suggest_fallback()` function
  - Enhanced error logging with specific error types
  - Improved trace event naming
**Before:**
```python
if (not "AnthropicException - Overloaded" in str(e)):
    # Handle non-Anthropic errors
    ...
else:
    logger.error(f"AnthropicException - Overloaded detected - Falling back to OpenRouter: {str(e)}", exc_info=True)
```
**After:**
```python
should_fallback, fallback_model, error_type = detect_error_and_suggest_fallback(e, llm_model)
if not should_fallback:
    # Handle non-fallback errors
    ...
else:
    logger.error(f"{error_type} detected - Falling back to {fallback_model}: {str(e)}", exc_info=True)
```
### 4. Added Comprehensive Testing
**Created:**
- `tests/test_error_handling.py` - Comprehensive test suite covering:
  - All supported error types (15 test cases)
  - Case insensitivity testing
  - Model-specific fallback strategies
  - Edge cases and error conditions
**Test Coverage:**
- Anthropic-specific errors (overloaded)
- OpenRouter-specific errors (connection, rate limit)
- OpenAI-specific errors (rate limit, connection, service unavailable)
- xAI-specific errors (rate limit, connection)
- Generic errors (connection, rate limit, service unavailable)
- Unknown error handling
- Case insensitivity validation
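
For readers unfamiliar with this style of test, the case-insensitivity and unknown-error checks listed above might look roughly like the following. This is a sketch, not an excerpt from `tests/test_error_handling.py`: `detect_stub` is a hypothetical stand-in that only shares the three-element return shape of the real function, and the model name `example-model` is made up.

```python
import unittest
from typing import Optional, Tuple


def detect_stub(error: Exception, model: str) -> Tuple[bool, Optional[str], Optional[str]]:
    """Stand-in detector (assumption): recognizes only rate-limit messages."""
    if "rate limit" in str(error).lower():
        return True, f"openrouter/{model}", "rate_limit"
    return False, None, None


class FallbackDetectionSketchTest(unittest.TestCase):
    def test_case_insensitivity(self):
        # The same message in different casings should be detected identically.
        for text in ("rate limit exceeded", "RATE LIMIT EXCEEDED", "Rate Limit Exceeded"):
            with self.subTest(text=text):
                should_fallback, fallback_model, error_type = detect_stub(
                    Exception(text), "example-model"
                )
                self.assertTrue(should_fallback)
                self.assertEqual(fallback_model, "openrouter/example-model")
                self.assertEqual(error_type, "rate_limit")

    def test_unknown_error_has_no_fallback(self):
        # Unrecognized errors should not trigger a fallback.
        self.assertEqual(
            detect_stub(ValueError("bad input"), "example-model"),
            (False, None, None),
        )
```

Saved as a module, a file like this can be run with `python -m unittest`.
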
### 5. Documentation
**Created:**
- `docs/ERROR_HANDLING.md` - Comprehensive documentation covering:
  - System overview and architecture
  - Supported error types and fallback strategies
  - Implementation details and usage examples
  - Testing procedures and benefits
## Supported Error Types
### Provider-Specific Errors
1. **Anthropic:** `AnthropicException - Overloaded`
2. **OpenRouter:** Connection/timeout, rate limit errors
3. **OpenAI:** Rate limit, connection, service unavailable errors
4. **xAI:** Rate limit, connection errors
### Generic Error Patterns
1. **Connection/Timeout:** `"connection"`, `"timeout"`
2. **Rate Limiting:** `"rate limit"`, `"quota"`
3. **Service Issues:** `"service unavailable"`, `"internal server error"`, `"bad gateway"`
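
The generic patterns above amount to case-insensitive substring matching on the error message. A minimal sketch of that idea follows; the table name `GENERIC_PATTERNS`, the category labels, and the function `classify_generic_error` are hypothetical and are not taken from `services/llm.py`.

```python
from typing import Optional

# Hypothetical pattern table built from the substrings listed above.
GENERIC_PATTERNS = {
    "connection_timeout": ("connection", "timeout"),
    "rate_limit": ("rate limit", "quota"),
    "service_issue": ("service unavailable", "internal server error", "bad gateway"),
}


def classify_generic_error(error: Exception) -> Optional[str]:
    """Return a hypothetical category label, or None if no pattern matches."""
    message = str(error).lower()  # lowercase once so matching ignores case
    for label, needles in GENERIC_PATTERNS.items():
        if any(needle in message for needle in needles):
            return label
    return None
```

Because the table is checked in insertion order, a message that matches several categories is assigned the first one; the real function may prioritize differently.
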
## Fallback Strategies
### Hierarchical Approach
1. **Provider-Specific:** Use provider-specific fallback models
2. **OpenRouter Migration:** Switch to OpenRouter versions if not already using them
3. **Model Family:** Within OpenRouter, try different models of the same family
4. **No Fallback:** Return `should_fallback = False` if no appropriate fallback is found
### Model Mapping Examples
- `anthropic/claude-3-sonnet` → `openrouter/anthropic/claude-sonnet-4`
- `gpt-4o` → `openrouter/openai/gpt-4o`
- `xai/grok-4` → `openrouter/x-ai/grok-4`
- `openrouter/anthropic/claude-3-sonnet` → `openrouter/anthropic/claude-sonnet-4` (for connection issues)
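
The hierarchy and the mapping examples can be combined into a small lookup sketch. Everything named here (`EXAMPLE_FALLBACKS`, `suggest_fallback_sketch`) is hypothetical. The table contains only the four example mappings above, the sketch ignores the error type even though the last mapping is described as applying to connection issues, and the generic `openrouter/` prefix step is an assumption modeled on the earlier hardcoded behavior.

```python
from typing import Optional, Tuple

# Hypothetical table holding only the mapping examples listed above.
EXAMPLE_FALLBACKS = {
    "anthropic/claude-3-sonnet": "openrouter/anthropic/claude-sonnet-4",
    "gpt-4o": "openrouter/openai/gpt-4o",
    "xai/grok-4": "openrouter/x-ai/grok-4",
    "openrouter/anthropic/claude-3-sonnet": "openrouter/anthropic/claude-sonnet-4",
}


def suggest_fallback_sketch(current_model: str) -> Tuple[bool, Optional[str]]:
    """Return (should_fallback, fallback_model) for a model name."""
    # Steps 1 and 3: an explicit mapping wins when one exists.
    fallback = EXAMPLE_FALLBACKS.get(current_model)
    if fallback is not None:
        return True, fallback
    # Step 2 (assumption): move to OpenRouter if not already using it.
    if not current_model.startswith("openrouter/"):
        return True, f"openrouter/{current_model}"
    # Step 4: nothing suitable, so signal that no fallback is available.
    return False, None
```

For example, `suggest_fallback_sketch("gpt-4o")` yields `(True, "openrouter/openai/gpt-4o")`, while an OpenRouter model with no table entry yields `(False, None)`.
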
## Benefits
1. **Improved Reliability:** Automatic fallback to alternative models
2. **Better User Experience:** Reduced downtime due to provider issues
3. **Comprehensive Coverage:** Handles multiple error types from different providers
4. **Intelligent Fallbacks:** Context-aware fallback suggestions
5. **Enhanced Logging:** Specific error types for better monitoring
6. **Backward Compatibility:** Maintains existing functionality while extending capabilities
## Testing Results
All 15 test cases pass successfully, covering:
- ✅ Anthropic overloaded errors
- ✅ OpenRouter connection and rate limit errors
- ✅ OpenAI rate limit, connection, and service errors
- ✅ xAI rate limit and connection errors
- ✅ Generic error patterns
- ✅ Case insensitivity
- ✅ Unknown error handling
## Future Considerations
1. **Configurable Fallbacks:** Allow user configuration of preferred fallback models
2. **Fallback Chains:** Support multiple sequential fallback attempts
3. **Performance Tracking:** Monitor fallback success rates and response times
4. **Health Monitoring:** Proactive provider health assessment
5. **Cost Optimization:** Consider pricing when suggesting fallbacks