mirror of https://github.com/kortix-ai/suna.git
151 lines
5.5 KiB
Markdown
151 lines
5.5 KiB
Markdown
# Error Handling Enhancement Changelog
|
|
|
|
## Overview
|
|
Extended the existing `AnthropicException - Overloaded` error handling to support comprehensive error detection and fallback strategies for multiple LLM providers.
|
|
|
|
## Changes Made
|
|
|
|
### 1. Enhanced `services/llm.py`
|
|
|
|
**Added:**
|
|
- `detect_error_and_suggest_fallback()` function (lines 102-175)
|
|
- Detects specific error types from different LLM providers
|
|
- Suggests appropriate fallback models based on current model and error type
|
|
- Returns tuple: (should_fallback, fallback_model, error_type)
|
|
|
|
**Modified:**
|
|
- `make_llm_api_call()` function (lines 320-340)
|
|
- Enhanced retry logic to use new error detection function
|
|
- Better handling of fallback-eligible errors on final retry attempt
|
|
|
|
### 2. Updated `agentpress/thread_manager.py`
|
|
|
|
**Modified:**
|
|
- Auto-continue wrapper exception handling (lines 479-495)
|
|
- Replaced hardcoded `AnthropicException - Overloaded` check
|
|
- Integrated `detect_error_and_suggest_fallback()` function
|
|
- Enhanced logging with specific error types
|
|
- Dynamic fallback model selection
|
|
|
|
**Before:**
|
|
```python
|
|
if ("AnthropicException - Overloaded" in str(e)):
|
|
logger.error(f"AnthropicException - Overloaded detected - Falling back to OpenRouter: {str(e)}", exc_info=True)
|
|
llm_model = f"openrouter/{llm_model}"
|
|
```
|
|
|
|
**After:**
|
|
```python
|
|
should_fallback, fallback_model, error_type = detect_error_and_suggest_fallback(e, llm_model)
|
|
if should_fallback:
|
|
logger.error(f"{error_type} detected - Falling back to {fallback_model}: {str(e)}", exc_info=True)
|
|
llm_model = fallback_model
|
|
```
|
|
|
|
### 3. Updated `agentpress/response_processor.py`
|
|
|
|
**Modified:**
|
|
- Streaming response processing exception handling (lines 802-820)
|
|
- Replaced hardcoded `AnthropicException - Overloaded` check
|
|
- Integrated `detect_error_and_suggest_fallback()` function
|
|
- Enhanced error logging with specific error types
|
|
- Improved trace event naming
|
|
|
|
**Before:**
|
|
```python
|
|
if (not "AnthropicException - Overloaded" in str(e)):
|
|
# Handle non-Anthropic errors
|
|
else:
|
|
logger.error(f"AnthropicException - Overloaded detected - Falling back to OpenRouter: {str(e)}", exc_info=True)
|
|
```
|
|
|
|
**After:**
|
|
```python
|
|
should_fallback, fallback_model, error_type = detect_error_and_suggest_fallback(e, llm_model)
|
|
if not should_fallback:
|
|
# Handle non-fallback errors
|
|
else:
|
|
logger.error(f"{error_type} detected - Falling back to {fallback_model}: {str(e)}", exc_info=True)
|
|
```
|
|
|
|
### 4. Added Comprehensive Testing
|
|
|
|
**Created:**
|
|
- `tests/test_error_handling.py` - Comprehensive test suite covering:
|
|
- All supported error types (15 test cases)
|
|
- Case insensitivity testing
|
|
- Model-specific fallback strategies
|
|
- Edge cases and error conditions
|
|
|
|
**Test Coverage:**
|
|
- Anthropic-specific errors (overloaded)
|
|
- OpenRouter-specific errors (connection, rate limit)
|
|
- OpenAI-specific errors (rate limit, connection, service unavailable)
|
|
- xAI-specific errors (rate limit, connection)
|
|
- Generic errors (connection, rate limit, service unavailable)
|
|
- Unknown error handling
|
|
- Case insensitivity validation
|
|
|
|
### 5. Documentation
|
|
|
|
**Created:**
|
|
- `docs/ERROR_HANDLING.md` - Comprehensive documentation covering:
|
|
- System overview and architecture
|
|
- Supported error types and fallback strategies
|
|
- Implementation details and usage examples
|
|
- Testing procedures and benefits
|
|
|
|
## Supported Error Types
|
|
|
|
### Provider-Specific Errors
|
|
1. **Anthropic:** `AnthropicException - Overloaded`
|
|
2. **OpenRouter:** Connection/timeout, rate limit errors
|
|
3. **OpenAI:** Rate limit, connection, service unavailable errors
|
|
4. **xAI:** Rate limit, connection errors
|
|
|
|
### Generic Error Patterns
|
|
1. **Connection/Timeout:** `"connection"`, `"timeout"`
|
|
2. **Rate Limiting:** `"rate limit"`, `"quota"`
|
|
3. **Service Issues:** `"service unavailable"`, `"internal server error"`, `"bad gateway"`
|
|
|
|
## Fallback Strategies
|
|
|
|
### Hierarchical Approach
|
|
1. **Provider-Specific:** Use provider-specific fallback models
|
|
2. **OpenRouter Migration:** Switch to OpenRouter versions if not already using them
|
|
3. **Model Family:** Within OpenRouter, try different models of the same family
|
|
4. **No Fallback:** Return `False` if no appropriate fallback is found
|
|
|
|
### Model Mapping Examples
|
|
- `anthropic/claude-3-sonnet` → `openrouter/anthropic/claude-sonnet-4`
|
|
- `gpt-4o` → `openrouter/openai/gpt-4o`
|
|
- `xai/grok-4` → `openrouter/x-ai/grok-4`
|
|
- `openrouter/anthropic/claude-3-sonnet` → `openrouter/anthropic/claude-sonnet-4` (for connection issues)
|
|
|
|
## Benefits
|
|
|
|
1. **Improved Reliability:** Automatic fallback to alternative models
|
|
2. **Better User Experience:** Reduced downtime due to provider issues
|
|
3. **Comprehensive Coverage:** Handles multiple error types from different providers
|
|
4. **Intelligent Fallbacks:** Context-aware fallback suggestions
|
|
5. **Enhanced Logging:** Specific error types for better monitoring
|
|
6. **Backward Compatibility:** Maintains existing functionality while extending capabilities
|
|
|
|
## Testing Results
|
|
|
|
All 15 test cases pass successfully, covering:
|
|
- ✅ Anthropic overloaded errors
|
|
- ✅ OpenRouter connection and rate limit errors
|
|
- ✅ OpenAI rate limit, connection, and service errors
|
|
- ✅ xAI rate limit and connection errors
|
|
- ✅ Generic error patterns
|
|
- ✅ Case insensitivity
|
|
- ✅ Unknown error handling
|
|
|
|
## Future Considerations
|
|
|
|
1. **Configurable Fallbacks:** Allow user configuration of preferred fallback models
|
|
2. **Fallback Chains:** Support multiple sequential fallback attempts
|
|
3. **Performance Tracking:** Monitor fallback success rates and response times
|
|
4. **Health Monitoring:** Proactive provider health assessment
|
|
5. **Cost Optimization:** Consider pricing when suggesting fallbacks |