Part 2: Failure Handling

Weight: 35%

Add robustness to handle real-world network failures.


Objectives

  • Implement timeout handling for slow/unresponsive servers
  • Add retry logic with exponential backoff
  • Understand at-least-once vs at-most-once semantics

Background

In distributed systems, failures are inevitable:

  • Network delays: Packets get delayed or lost
  • Server overload: Slow response times
  • Partial failures: Some requests succeed, others fail

Your service must handle these gracefully.


Requirements

1. Client-Side Timeouts

Add timeout configuration to your client:

import grpc

import calculator_pb2
import calculator_pb2_grpc

# Create the channel; deadlines are set per call, not on the channel
channel = grpc.insecure_channel(
    'localhost:50051',
    options=[
        ('grpc.max_receive_message_length', -1),
    ]
)
stub = calculator_pb2_grpc.CalculatorStub(channel)

# Make the RPC call with a deadline
response = stub.Add(
    calculator_pb2.BinaryOperation(a=10, b=5),
    timeout=2.0  # 2-second timeout
)
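
When the deadline expires, the call raises grpc.RpcError with status code DEADLINE_EXCEEDED. A minimal sketch of catching and logging it (the logging setup and message wording are illustrative, not prescribed):

import logging

try:
    response = stub.Add(
        calculator_pb2.BinaryOperation(a=10, b=5),
        timeout=2.0
    )
except grpc.RpcError as err:
    if err.code() == grpc.StatusCode.DEADLINE_EXCEEDED:
        # Required: log the timeout event before deciding what to do next
        logging.warning('Add(10, 5) exceeded the 2.0 s deadline')
    raise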

Requirements:

  • ✅ Default timeout: 2 seconds
  • ✅ Timeout should be configurable
  • ✅ Client logs timeout events

2. Retry Logic

Implement retry with exponential backoff by completing the skeleton below (one possible finished version is sketched after the requirements list):

def call_with_retry(stub, operation, request, max_retries=3):
    """
    Call RPC with retry logic.

    Args:
        stub: gRPC stub
        operation: RPC method to call
        request: Request message
        max_retries: Maximum number of retry attempts

    Returns:
        Response from successful call

    Raises:
        grpc.RpcError: If all retries fail
    """
    # TODO: Loop through max_retries attempts
    # TODO: Try calling operation(request, timeout=2.0)
    # TODO: On grpc.RpcError, check whether the code is transient (DEADLINE_EXCEEDED or UNAVAILABLE)
    # TODO: Calculate backoff: (2 ** attempt) + random.uniform(0, 1)
    # TODO: Sleep for backoff duration
    # TODO: If the error is not transient (e.g. INVALID_ARGUMENT), raise immediately
    # TODO: If all retries exhausted, raise the error
    pass

Requirements:

  • ✅ Maximum 3 retry attempts
  • ✅ Exponential backoff: 1s, 2s, 4s
  • ✅ Add jitter (random 0-1s) to prevent thundering herd
  • ✅ Log each retry attempt
  • ✅ Only retry transient errors (DEADLINE_EXCEEDED, UNAVAILABLE); never retry invalid input
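
For reference, a minimal sketch of one way the finished function could look. The set of retryable codes, the logging calls, and the extra timeout parameter are illustrative choices, not part of the required interface:

import logging
import random
import time

import grpc

# Codes treated as transient in this sketch; anything else is raised immediately
RETRYABLE_CODES = {grpc.StatusCode.DEADLINE_EXCEEDED, grpc.StatusCode.UNAVAILABLE}

def call_with_retry(stub, operation, request, max_retries=3, timeout=2.0):
    """Call an RPC; retry transient failures with exponential backoff plus jitter."""
    for attempt in range(max_retries + 1):  # initial call + up to max_retries retries
        try:
            return operation(request, timeout=timeout)
        except grpc.RpcError as err:
            if err.code() not in RETRYABLE_CODES:
                raise  # e.g. INVALID_ARGUMENT: retrying will not help
            if attempt == max_retries:
                raise  # all retries exhausted
            backoff = (2 ** attempt) + random.uniform(0, 1)  # ~1 s, 2 s, 4 s plus jitter
            logging.warning('Attempt %d failed with %s; retrying in %.1f s',
                            attempt + 1, err.code().name, backoff)
            time.sleep(backoff)

Here stub is kept only to match the skeleton's signature; operation is already a bound stub method such as stub.Add.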

3. Simulated Failures

Modify your server to simulate failures for testing:

import random
import time

import grpc

import calculator_pb2
import calculator_pb2_grpc

class CalculatorServicer(calculator_pb2_grpc.CalculatorServicer):
    def __init__(self, failure_rate=0.3, slow_rate=0.2):
        """
        Args:
            failure_rate: Probability of crashing (0.0-1.0)
            slow_rate: Probability of slow response (0.0-1.0)
        """
        self.failure_rate = failure_rate
        self.slow_rate = slow_rate

    def Add(self, request, context):
        # Simulate random crash
        if random.random() < self.failure_rate:
            context.abort(
                grpc.StatusCode.UNAVAILABLE,
                'Simulated server failure'
            )

        # Simulate slow response
        if random.random() < self.slow_rate:
            time.sleep(3)  # Longer than client timeout

        # Normal operation
        result = request.a + request.b
        return calculator_pb2.Result(value=result)

Configuration:

  • --failure-rate 0.3 (30% of requests abort with UNAVAILABLE)
  • --slow-rate 0.2 (20% of requests sleep past the client timeout)
  • ✅ Configurable via command-line arguments (see the sketch below)
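
A minimal sketch of wiring these flags to the servicer with argparse. It assumes the CalculatorServicer from the snippet above lives in the same server.py, and the serve() scaffolding (thread pool size, port) is an assumption carried over from Part 1:

import argparse
from concurrent import futures

import grpc
import calculator_pb2_grpc

def serve():
    parser = argparse.ArgumentParser()
    parser.add_argument('--failure-rate', type=float, default=0.0)
    parser.add_argument('--slow-rate', type=float, default=0.0)
    args = parser.parse_args()

    server = grpc.server(futures.ThreadPoolExecutor(max_workers=10))
    calculator_pb2_grpc.add_CalculatorServicer_to_server(
        CalculatorServicer(failure_rate=args.failure_rate, slow_rate=args.slow_rate),
        server,
    )
    server.add_insecure_port('[::]:50051')
    server.start()
    server.wait_for_termination()

# Example: python server.py --failure-rate 0.3 --slow-rate 0.2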

Testing Requirements

Test Scenarios

Run your client under different failure conditions:

Scenario            | Server Config                      | Expected Behavior
Normal              | No failures                        | All operations succeed
Occasional timeouts | --slow-rate 0.3                    | Retries succeed within 3 attempts
High failure rate   | --failure-rate 0.5                 | Some operations fail after 3 retries
Combined            | --failure-rate 0.3 --slow-rate 0.2 | Mix of retries and failures

Required Tests

  1. Timeout Test: Server sleeps 3s, client times out and retries
  2. Retry Success: First attempt fails, second succeeds
  3. All Retries Failed: All 3 attempts timeout, operation fails
  4. No Retry on Invalid Input: Division by zero doesn't retry (see the test sketch below)
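
A sketch of how test 4 could be written without a live server, assuming call_with_retry can be imported from client.py. FakeRpcError and the import path are illustrative test scaffolding, not part of the gRPC API:

import grpc
from unittest import mock

from client import call_with_retry  # assumed import path

class FakeRpcError(grpc.RpcError):
    """Test double that mimics an RPC error carrying a status code."""
    def __init__(self, code):
        self._code = code

    def code(self):
        return self._code

def test_no_retry_on_invalid_argument():
    # The fake operation always fails with INVALID_ARGUMENT (e.g. division by zero)
    operation = mock.Mock(side_effect=FakeRpcError(grpc.StatusCode.INVALID_ARGUMENT))
    try:
        call_with_retry(stub=None, operation=operation, request=None)
    except grpc.RpcError:
        pass  # the error should propagate unchanged
    assert operation.call_count == 1  # invalid input must not be retried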

Deliverables

📦 Updated Code:

  • server.py - Add failure simulation
  • client.py - Add timeout and retry logic
  • demo_part2.py - Demonstrate failure handling

📊 Test Results:

Create a test report showing (one way to collect these numbers is sketched after the list):

  • Number of successful operations
  • Number of retries performed
  • Average response time
  • Failure rate observed
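
A minimal sketch of how demo_part2.py might collect these numbers. The dictionary keys and helper name are illustrative, and the retry count is assumed to be incremented inside your call_with_retry loop:

import time

import grpc

from client import call_with_retry  # assumed import path

stats = {'success': 0, 'failure': 0, 'retries': 0, 'latencies': []}

def timed_call(stub, operation, request):
    """Run one operation through call_with_retry and record the outcome."""
    start = time.monotonic()
    try:
        response = call_with_retry(stub, operation, request)
        stats['success'] += 1
        return response
    except grpc.RpcError:
        stats['failure'] += 1
        raise
    finally:
        stats['latencies'].append(time.monotonic() - start)

# After the run:
#   average response time = sum(stats['latencies']) / len(stats['latencies'])
#   observed failure rate = stats['failure'] / (stats['success'] + stats['failure'])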

📹 Demo Video (2 minutes):

Show:

  1. Normal operation (no failures)
  2. Server with 30% failure rate - retries working
  3. Server with high delay - timeouts occurring
  4. All retries exhausted - graceful failure

Grading Rubric

Criterion              | Points | Description
Timeout Implementation | 8      | Correct timeout configuration
Retry Logic            | 12     | Exponential backoff with jitter
Failure Simulation     | 5      | Server simulates failures correctly
Retry Selectivity      | 5      | Only retries appropriate errors
Logging                | 3      | Clear logs showing retry behavior
Demo                   | 2      | Demonstrates all scenarios
Total                  | 35     |

Analysis Questions

Answer these in your report:

Question 1

Why shouldn't we retry division-by-zero errors? What category of errors should never be retried?

Question 2

What happens if 1000 clients all retry at the same time after a server crash? How does jitter help?

Question 3

Is your calculator service using at-least-once or at-most-once semantics? Why?


Tips

Testing Tip

Use time.sleep() in your server to simulate slow responses:

time.sleep(3)  # Simulate 3-second delay

Common Mistakes

  • Retrying on all errors (including invalid input)
  • Not adding jitter to backoff
  • Forgetting to log retry attempts
  • Hardcoding timeout values

Understanding At-Least-Once

With retries enabled, the calculator operations exhibit at-least-once semantics:

  • The operation may be executed more than once
  • For calculator operations (add, multiply, etc.) this is safe: they are stateless, so re-executing them changes nothing on the server and returns the same result
  • But what about stateful operations? (See Part 3!)

Next Steps

Your calculator can now handle transient failures! But what about operations that modify state? Move on to Part 3: Idempotency.


Resources