suna/backend/CHANGELOG_ERROR_HANDLING.md

Error Handling Enhancement Changelog

Overview

Extended the existing AnthropicException - Overloaded handling into general error detection for multiple LLM providers, with a suggested fallback model for each fallback-eligible error.

Changes Made

1. Enhanced services/llm.py

Added:

  • detect_error_and_suggest_fallback() function (lines 102-175)
    • Detects specific error types from different LLM providers
    • Suggests appropriate fallback models based on current model and error type
    • Returns tuple: (should_fallback, fallback_model, error_type)
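
The return contract can be sketched as follows. This is a minimal illustration, not the code in services/llm.py: it handles only the Anthropic overloaded case, and the error-type labels and the guard against double-prefixing are assumptions.

```python
from typing import Optional, Tuple


def detect_error_and_suggest_fallback(
    error: Exception, current_model: str
) -> Tuple[bool, Optional[str], str]:
    """Return (should_fallback, fallback_model, error_type).

    Sketch only: a single error pattern is recognized here.
    """
    message = str(error).lower()  # lowercase so matching is case-insensitive

    if "anthropicexception - overloaded" in message:
        # Assumption: a model already routed through OpenRouter has no
        # further fallback in this sketch.
        if current_model.startswith("openrouter/"):
            return False, None, "anthropic_overloaded"
        # Mirrors the old hardcoded behavior: prefix the model with openrouter/.
        return True, f"openrouter/{current_model}", "anthropic_overloaded"

    # Anything unrecognized is reported as not fallback-eligible.
    return False, None, "unknown"
```

Callers unpack the tuple and switch models only when the first element is true.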

Modified:

  • make_llm_api_call() function (lines 320-340)
    • Enhanced retry logic to use new error detection function
    • Better handling of fallback-eligible errors on final retry attempt
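
One way such a retry loop could look is sketched below. The helper name call_with_retries, its parameters, and the choice to re-raise the original exception on the final attempt are all assumptions for illustration; the actual logic in make_llm_api_call() may differ.

```python
import time
from typing import Callable, Optional, Tuple

# Hypothetical detector type: (error, model) -> (should_fallback, fallback_model, error_type)
Detector = Callable[[Exception, str], Tuple[bool, Optional[str], str]]


def call_with_retries(
    call: Callable[[str], str],
    model: str,
    detect: Detector,
    max_retries: int = 3,
    delay_seconds: float = 0.0,
) -> str:
    """Call `call(model)` up to `max_retries` times."""
    last_error: Optional[Exception] = None
    for attempt in range(1, max_retries + 1):
        try:
            return call(model)
        except Exception as exc:
            last_error = exc
            should_fallback, _fallback_model, _error_type = detect(exc, model)
            if attempt == max_retries and should_fallback:
                # Final attempt and the error is fallback-eligible: re-raise it
                # unchanged so the caller can switch to the fallback model.
                raise
            if attempt < max_retries:
                time.sleep(delay_seconds)
    # Retries exhausted on an error with no suggested fallback.
    raise RuntimeError(f"LLM call failed after {max_retries} attempts") from last_error
```

Re-raising the original exception keeps its message intact, so a caller that runs the same detector on it reaches the same fallback decision.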

2. Updated agentpress/thread_manager.py

Modified:

  • Auto-continue wrapper exception handling (lines 479-495)
    • Replaced hardcoded AnthropicException - Overloaded check
    • Integrated detect_error_and_suggest_fallback() function
    • Enhanced logging with specific error types
    • Dynamic fallback model selection

Before:

if ("AnthropicException - Overloaded" in str(e)):
    logger.error(f"AnthropicException - Overloaded detected - Falling back to OpenRouter: {str(e)}", exc_info=True)
    llm_model = f"openrouter/{llm_model}"

After:

should_fallback, fallback_model, error_type = detect_error_and_suggest_fallback(e, llm_model)
if should_fallback:
    logger.error(f"{error_type} detected - Falling back to {fallback_model}: {str(e)}", exc_info=True)
    llm_model = fallback_model

3. Updated agentpress/response_processor.py

Modified:

  • Streaming response processing exception handling (lines 802-820)
    • Replaced hardcoded AnthropicException - Overloaded check
    • Integrated detect_error_and_suggest_fallback() function
    • Enhanced error logging with specific error types
    • Improved trace event naming

Before:

if (not "AnthropicException - Overloaded" in str(e)):
    # Handle non-Anthropic errors
else:
    logger.error(f"AnthropicException - Overloaded detected - Falling back to OpenRouter: {str(e)}", exc_info=True)

After:

should_fallback, fallback_model, error_type = detect_error_and_suggest_fallback(e, llm_model)
if not should_fallback:
    # Handle non-fallback errors
else:
    logger.error(f"{error_type} detected - Falling back to {fallback_model}: {str(e)}", exc_info=True)

4. Added Comprehensive Testing

Created:

  • tests/test_error_handling.py - Comprehensive test suite covering:
    • All supported error types (15 test cases)
    • Case insensitivity testing
    • Model-specific fallback strategies
    • Edge cases and error conditions

Test Coverage:

  • Anthropic-specific errors (overloaded)
  • OpenRouter-specific errors (connection, rate limit)
  • OpenAI-specific errors (rate limit, connection, service unavailable)
  • xAI-specific errors (rate limit, connection)
  • Generic errors (connection, rate limit, service unavailable)
  • Unknown error handling
  • Case insensitivity validation
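
A test in this style might look like the sketch below. It is not taken from tests/test_error_handling.py: detect_stub is a hypothetical stand-in for detect_error_and_suggest_fallback() so that the example runs on its own, and the rate_limit label is an assumed error-type name.

```python
import unittest


def detect_stub(error: Exception, model: str):
    """Stand-in detector: recognizes only rate-limit messages."""
    if "rate limit" in str(error).lower():
        return True, "openrouter/openai/gpt-4o", "rate_limit"
    return False, None, "unknown"


class TestErrorDetection(unittest.TestCase):
    def test_case_insensitive_match(self):
        # The same error text in different casings must be classified alike.
        for message in ("Rate Limit exceeded", "RATE LIMIT EXCEEDED", "rate limit exceeded"):
            with self.subTest(message=message):
                should_fallback, _, error_type = detect_stub(Exception(message), "gpt-4o")
                self.assertTrue(should_fallback)
                self.assertEqual(error_type, "rate_limit")

    def test_unknown_error_has_no_fallback(self):
        result = detect_stub(Exception("invalid api key"), "gpt-4o")
        self.assertEqual(result, (False, None, "unknown"))
```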

5. Documentation

Created:

  • docs/ERROR_HANDLING.md - Comprehensive documentation covering:
    • System overview and architecture
    • Supported error types and fallback strategies
    • Implementation details and usage examples
    • Testing procedures and benefits

Supported Error Types

Provider-Specific Errors

  1. Anthropic: AnthropicException - Overloaded
  2. OpenRouter: Connection/timeout, rate limit errors
  3. OpenAI: Rate limit, connection, service unavailable errors
  4. xAI: Rate limit, connection errors

Generic Error Patterns

  1. Connection/Timeout: "connection", "timeout"
  2. Rate Limiting: "rate limit", "quota"
  3. Service Issues: "service unavailable", "internal server error", "bad gateway"
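
The generic patterns above amount to case-insensitive substring matching. A minimal sketch, assuming the category labels shown here (they are illustrative names, not necessarily those used in services/llm.py):

```python
from typing import Optional

# Substrings come from the list above; the dictionary keys are assumed labels.
GENERIC_ERROR_PATTERNS = {
    "connection_error": ("connection", "timeout"),
    "rate_limit": ("rate limit", "quota"),
    "service_unavailable": ("service unavailable", "internal server error", "bad gateway"),
}


def classify_generic_error(error: Exception) -> Optional[str]:
    """Return the first matching category, or None if nothing matches."""
    message = str(error).lower()  # lowercase once so matching ignores case
    for error_type, needles in GENERIC_ERROR_PATTERNS.items():
        if any(needle in message for needle in needles):
            return error_type
    return None
```

Categories are checked in dictionary order, so a message that contains both "timeout" and "quota" is classified as a connection error in this sketch.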

Fallback Strategies

Hierarchical Approach

  1. Provider-Specific: Use provider-specific fallback models
  2. OpenRouter Migration: Switch to OpenRouter versions if not already using them
  3. Model Family: Within OpenRouter, try different models of the same family
  4. No Fallback: Return should_fallback = False if no appropriate fallback is found

Model Mapping Examples

  • anthropic/claude-3-sonnet → openrouter/anthropic/claude-sonnet-4
  • gpt-4o → openrouter/openai/gpt-4o
  • xai/grok-4 → openrouter/x-ai/grok-4
  • openrouter/anthropic/claude-3-sonnet → openrouter/anthropic/claude-sonnet-4 (for connection issues)
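
The hierarchy and the example mappings above can be combined into a simple lookup, sketched here. The function name suggest_fallback_model and both table names are hypothetical. The sketch also merges the provider-specific and OpenRouter-migration steps into one table and ignores the error type, so the last mapping applies to every error rather than only to connection issues.

```python
from typing import Optional, Tuple

OPENROUTER_PREFIX = "openrouter/"

# Steps 1-2: direct-provider models and their OpenRouter equivalents
# (entries copied from the mapping examples in this changelog).
OPENROUTER_EQUIVALENTS = {
    "anthropic/claude-3-sonnet": "openrouter/anthropic/claude-sonnet-4",
    "gpt-4o": "openrouter/openai/gpt-4o",
    "xai/grok-4": "openrouter/x-ai/grok-4",
}

# Step 3: alternatives within the same model family on OpenRouter.
SAME_FAMILY_ALTERNATIVES = {
    "openrouter/anthropic/claude-3-sonnet": "openrouter/anthropic/claude-sonnet-4",
}


def suggest_fallback_model(current_model: str) -> Tuple[bool, Optional[str]]:
    """Return (should_fallback, fallback_model) for the current model."""
    if current_model.startswith(OPENROUTER_PREFIX):
        target = SAME_FAMILY_ALTERNATIVES.get(current_model)
    else:
        target = OPENROUTER_EQUIVALENTS.get(current_model)

    # Step 4: no mapping found, so report that no fallback is available.
    if target is None:
        return False, None
    return True, target
```

Keeping the mappings in plain dictionaries makes it straightforward to add or change a fallback without touching the selection logic.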

Benefits

  1. Improved Reliability: Automatic fallback to alternative models
  2. Better User Experience: Reduced downtime due to provider issues
  3. Comprehensive Coverage: Handles multiple error types from different providers
  4. Intelligent Fallbacks: Context-aware fallback suggestions
  5. Enhanced Logging: Specific error types for better monitoring
  6. Backward Compatibility: Maintains existing functionality while extending capabilities

Testing Results

All 15 test cases pass successfully, covering:

  • Anthropic overloaded errors
  • OpenRouter connection and rate limit errors
  • OpenAI rate limit, connection, and service errors
  • xAI rate limit and connection errors
  • Generic error patterns
  • Case insensitivity
  • Unknown error handling

Future Considerations

  1. Configurable Fallbacks: Allow user configuration of preferred fallback models
  2. Fallback Chains: Support multiple sequential fallback attempts
  3. Performance Tracking: Monitor fallback success rates and response times
  4. Health Monitoring: Proactive provider health assessment
  5. Cost Optimization: Consider pricing when suggesting fallbacks