Part 2: Failure Handling

Weight: 35%

Add robustness to handle real-world network failures.


Objectives

  • Implement timeout handling for slow/unresponsive servers
  • Add retry logic with exponential backoff
  • Understand at-least-once vs at-most-once semantics

Background

In distributed systems, failures are inevitable:

  • Network delays: Packets get delayed or lost
  • Server overload: Slow response times
  • Partial failures: Some requests succeed, others fail

Your service must handle these gracefully.


Requirements

1. Client-Side Timeouts

Add timeout configuration to your client:

import grpc

import calculator_pb2
import calculator_pb2_grpc

# Create the channel; deadlines are set per call, not on the channel
channel = grpc.insecure_channel(
    'localhost:50051',
    options=[
        ('grpc.max_receive_message_length', -1),
    ]
)
stub = calculator_pb2_grpc.CalculatorStub(channel)

# Make the RPC call with a deadline
response = stub.Add(
    calculator_pb2.BinaryOperation(a=10, b=5),
    timeout=2.0  # 2-second timeout
)
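
When the deadline expires, the call raises grpc.RpcError with status code DEADLINE_EXCEEDED. A minimal sketch of catching and logging it (the logging setup and message wording are illustrative, not prescribed):

import logging

try:
    response = stub.Add(
        calculator_pb2.BinaryOperation(a=10, b=5),
        timeout=2.0
    )
except grpc.RpcError as err:
    if err.code() == grpc.StatusCode.DEADLINE_EXCEEDED:
        # Required: log the timeout event before deciding what to do next
        logging.warning('Add(10, 5) exceeded the 2.0 s deadline')
    raise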

Requirements:

  • ✅ Default timeout: 2 seconds
  • ✅ Timeout should be configurable
  • ✅ Client logs timeout events

2. Retry Logic

Implement retry with exponential backoff by completing the skeleton below (one possible finished version is sketched after the requirements list):

def call_with_retry(stub, operation, request, max_retries=3):
    """
    Call RPC with retry logic.

    Args:
        stub: gRPC stub
        operation: RPC method to call
        request: Request message
        max_retries: Maximum number of retry attempts

    Returns:
        Response from successful call

    Raises:
        grpc.RpcError: If all retries fail
    """
    # TODO: Loop through max_retries attempts
    # TODO: Try calling operation(request, timeout=2.0)
    # TODO: On grpc.RpcError, check whether the code is transient (DEADLINE_EXCEEDED or UNAVAILABLE)
    # TODO: Calculate backoff: (2 ** attempt) + random.uniform(0, 1)
    # TODO: Sleep for backoff duration
    # TODO: If the error is not transient (e.g. INVALID_ARGUMENT), raise immediately
    # TODO: If all retries exhausted, raise the error
    pass

Requirements:

  • ✅ Maximum 3 retry attempts
  • ✅ Exponential backoff: 1s, 2s, 4s
  • ✅ Add jitter (random 0-1s) to prevent thundering herd
  • ✅ Log each retry attempt
  • ✅ Only retry transient errors (DEADLINE_EXCEEDED, UNAVAILABLE); never retry invalid input
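
For reference, a minimal sketch of one way the finished function could look. The set of retryable codes, the logging calls, and the extra timeout parameter are illustrative choices, not part of the required interface:

import logging
import random
import time

import grpc

# Codes treated as transient in this sketch; anything else is raised immediately
RETRYABLE_CODES = {grpc.StatusCode.DEADLINE_EXCEEDED, grpc.StatusCode.UNAVAILABLE}

def call_with_retry(stub, operation, request, max_retries=3, timeout=2.0):
    """Call an RPC; retry transient failures with exponential backoff plus jitter."""
    for attempt in range(max_retries + 1):  # initial call + up to max_retries retries
        try:
            return operation(request, timeout=timeout)
        except grpc.RpcError as err:
            if err.code() not in RETRYABLE_CODES:
                raise  # e.g. INVALID_ARGUMENT: retrying will not help
            if attempt == max_retries:
                raise  # all retries exhausted
            backoff = (2 ** attempt) + random.uniform(0, 1)  # ~1 s, 2 s, 4 s plus jitter
            logging.warning('Attempt %d failed with %s; retrying in %.1f s',
                            attempt + 1, err.code().name, backoff)
            time.sleep(backoff)

Here stub is kept only to match the skeleton's signature; operation is already a bound stub method such as stub.Add.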

3. Simulated Failures

Modify your server to simulate failures for testing:

import random
import time

import grpc

import calculator_pb2
import calculator_pb2_grpc

class CalculatorServicer(calculator_pb2_grpc.CalculatorServicer):
    def __init__(self, failure_rate=0.3, slow_rate=0.2):
        """
        Args:
            failure_rate: Probability of crashing (0.0-1.0)
            slow_rate: Probability of slow response (0.0-1.0)
        """
        self.failure_rate = failure_rate
        self.slow_rate = slow_rate

    def Add(self, request, context):
        # Simulate random crash
        if random.random() < self.failure_rate:
            context.abort(
                grpc.StatusCode.UNAVAILABLE,
                'Simulated server failure'
            )

        # Simulate slow response
        if random.random() < self.slow_rate:
            time.sleep(3)  # Longer than client timeout

        # Normal operation
        result = request.a + request.b
        return calculator_pb2.Result(value=result)

Configuration:

  • --failure-rate 0.3 (30% of requests abort with UNAVAILABLE)
  • --slow-rate 0.2 (20% of requests sleep past the client timeout)
  • ✅ Configurable via command-line arguments (see the sketch below)
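
A minimal sketch of wiring these flags to the servicer with argparse. It assumes the CalculatorServicer from the snippet above lives in the same server.py, and the serve() scaffolding (thread pool size, port) is an assumption carried over from Part 1:

import argparse
from concurrent import futures

import grpc
import calculator_pb2_grpc

def serve():
    parser = argparse.ArgumentParser()
    parser.add_argument('--failure-rate', type=float, default=0.0)
    parser.add_argument('--slow-rate', type=float, default=0.0)
    args = parser.parse_args()

    server = grpc.server(futures.ThreadPoolExecutor(max_workers=10))
    calculator_pb2_grpc.add_CalculatorServicer_to_server(
        CalculatorServicer(failure_rate=args.failure_rate, slow_rate=args.slow_rate),
        server,
    )
    server.add_insecure_port('[::]:50051')
    server.start()
    server.wait_for_termination()

# Example: python server.py --failure-rate 0.3 --slow-rate 0.2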

Testing Requirements

Test Scenarios

Run your client under different failure conditions:

Scenario            | Server Config                      | Expected Behavior
Normal              | No failures                        | All operations succeed
Occasional timeouts | --slow-rate 0.3                    | Retries succeed within 3 attempts
High failure rate   | --failure-rate 0.5                 | Some operations fail after 3 retries
Combined            | --failure-rate 0.3 --slow-rate 0.2 | Mix of retries and failures

Required Tests

  1. Timeout Test: Server sleeps 3s, client times out and retries
  2. Retry Success: First attempt fails, second succeeds
  3. All Retries Failed: All 3 attempts timeout, operation fails
  4. No Retry on Invalid Input: Division by zero doesn't retry (see the test sketch below)
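
A sketch of how test 4 could be written without a live server, assuming call_with_retry can be imported from client.py. FakeRpcError and the import path are illustrative test scaffolding, not part of the gRPC API:

import grpc
from unittest import mock

from client import call_with_retry  # assumed import path

class FakeRpcError(grpc.RpcError):
    """Test double that mimics an RPC error carrying a status code."""
    def __init__(self, code):
        self._code = code

    def code(self):
        return self._code

def test_no_retry_on_invalid_argument():
    # The fake operation always fails with INVALID_ARGUMENT (e.g. division by zero)
    operation = mock.Mock(side_effect=FakeRpcError(grpc.StatusCode.INVALID_ARGUMENT))
    try:
        call_with_retry(stub=None, operation=operation, request=None)
    except grpc.RpcError:
        pass  # the error should propagate unchanged
    assert operation.call_count == 1  # invalid input must not be retried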

Deliverables

📦 Updated Code:

  • server.py - Add failure simulation
  • client.py - Add timeout and retry logic
  • demo_part2.py - Demonstrate failure handling

📊 Test Results:

Create a test report showing (one way to collect these numbers is sketched after the list):

  • Number of successful operations
  • Number of retries performed
  • Average response time
  • Failure rate observed
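
A minimal sketch of how demo_part2.py might collect these numbers. The dictionary keys and helper name are illustrative, and the retry count is assumed to be incremented inside your call_with_retry loop:

import time

import grpc

from client import call_with_retry  # assumed import path

stats = {'success': 0, 'failure': 0, 'retries': 0, 'latencies': []}

def timed_call(stub, operation, request):
    """Run one operation through call_with_retry and record the outcome."""
    start = time.monotonic()
    try:
        response = call_with_retry(stub, operation, request)
        stats['success'] += 1
        return response
    except grpc.RpcError:
        stats['failure'] += 1
        raise
    finally:
        stats['latencies'].append(time.monotonic() - start)

# After the run:
#   average response time = sum(stats['latencies']) / len(stats['latencies'])
#   observed failure rate = stats['failure'] / (stats['success'] + stats['failure'])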

📹 Demo Video (2 minutes):

Show:

  1. Normal operation (no failures)
  2. Server with 30% failure rate - retries working
  3. Server with high delay - timeouts occurring
  4. All retries exhausted - graceful failure

Grading Rubric

Criterion              | Points | Description
Timeout Implementation | 8      | Correct timeout configuration
Retry Logic            | 12     | Exponential backoff with jitter
Failure Simulation     | 5      | Server simulates failures correctly
Retry Selectivity      | 5      | Only retries appropriate errors
Logging                | 3      | Clear logs showing retry behavior
Demo                   | 2      | Demonstrates all scenarios
Total                  | 35     |

Analysis Questions

Answer these in your report:

Question 1

Why shouldn't we retry division-by-zero errors? What category of errors should never be retried?

Question 2

What happens if 1000 clients all retry at the same time after a server crash? How does jitter help?

Question 3

Is your calculator service using at-least-once or at-most-once semantics? Why?


Tips

Testing Tip

Use time.sleep() in your server to simulate slow responses:

time.sleep(3)  # Simulate 3-second delay

Common Mistakes

  • Retrying on all errors (including invalid input)
  • Not adding jitter to backoff
  • Forgetting to log retry attempts
  • Hardcoding timeout values

Understanding At-Least-Once

With retries enabled, the calculator operations exhibit at-least-once semantics:

  • The operation may be executed more than once
  • For calculator operations (add, multiply, etc.) this is safe: they are stateless, so re-executing them changes nothing on the server and returns the same result
  • But what about stateful operations? (See Part 3!)

Next Steps

Your calculator can now handle transient failures! But what about operations that modify state? Move on to Part 3: Idempotency.


Resources