Part 2: Failure Handling
Weight: 35%
Add robustness to handle real-world network failures.
Objectives
- Implement timeout handling for slow/unresponsive servers
- Add retry logic with exponential backoff
- Understand at-least-once vs at-most-once semantics
Background
In distributed systems, failures are inevitable:
- Network delays: Packets get delayed or lost
- Server overload: Slow response times
- Partial failures: Some requests succeed, others fail
Your service must handle these gracefully.
Requirements
1. Client-Side Timeouts
Add timeout configuration to your client:
```python
# Create the channel (note: the deadline is set per call, not on the channel)
channel = grpc.insecure_channel(
    'localhost:50051',
    options=[
        ('grpc.max_receive_message_length', -1),
    ]
)

# Make an RPC call with a deadline
response = stub.Add(
    calculator_pb2.BinaryOperation(a=10, b=5),
    timeout=2.0  # 2-second timeout
)
```
Requirements:
- ✅ Default timeout: 2 seconds
- ✅ Timeout should be configurable
- ✅ Client logs timeout events (see the sketch below)
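To satisfy the logging requirement, one minimal pattern (assuming the stub and message types from Part 1) is to catch the `RpcError` and inspect its status code:

```python
import logging

import grpc

TIMEOUT_S = 2.0  # keep this configurable rather than hardcoded

try:
    response = stub.Add(
        calculator_pb2.BinaryOperation(a=10, b=5),
        timeout=TIMEOUT_S,
    )
except grpc.RpcError as err:
    if err.code() == grpc.StatusCode.DEADLINE_EXCEEDED:
        # The deadline expired before the server answered.
        logging.warning("Add timed out after %.1fs", TIMEOUT_S)
    else:
        raise
```

Pulling the timeout into a constant (or a command-line flag) keeps it configurable, which also avoids one of the common mistakes listed at the end of this part.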
2. Retry Logic
Implement retry with exponential backoff:
```python
def call_with_retry(stub, operation, request, max_retries=3):
    """
    Call an RPC with retry logic.

    Args:
        stub: gRPC stub
        operation: RPC method to call
        request: Request message
        max_retries: Maximum number of retry attempts

    Returns:
        Response from the successful call

    Raises:
        grpc.RpcError: If all retries fail
    """
    # TODO: Loop through the retry attempts
    # TODO: Try calling operation(request, timeout=2.0)
    # TODO: On grpc.RpcError, check whether the error code is retryable
    #       (DEADLINE_EXCEEDED, or the UNAVAILABLE errors the server simulates)
    # TODO: If not retryable, raise immediately
    # TODO: Calculate backoff: (2 ** attempt) + random.uniform(0, 1)
    # TODO: Sleep for the backoff duration
    # TODO: If all retries are exhausted, raise the error
    pass
```
Requirements:
- ✅ Maximum 3 retry attempts
- ✅ Exponential backoff: 1s, 2s, 4s
- ✅ Add jitter (random 0-1s) to prevent thundering herd
- ✅ Log each retry attempt
- ✅ Only retry transient errors (DEADLINE_EXCEEDED and the simulated UNAVAILABLE); raise anything else immediately (one possible implementation is sketched below)
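Putting the TODOs and requirements together, here is one possible shape of the finished helper. It is a sketch, not the only correct answer: it reads "maximum 3 retry attempts" as up to three retries after the initial call, which produces the required 1s, 2s, 4s backoff schedule, and it treats both DEADLINE_EXCEEDED and the server's simulated UNAVAILABLE as retryable.

```python
import logging
import random
import time

import grpc

# Transient codes worth retrying; anything else is passed through unchanged.
RETRYABLE = {grpc.StatusCode.DEADLINE_EXCEEDED, grpc.StatusCode.UNAVAILABLE}

def call_with_retry(stub, operation, request, max_retries=3):
    """Call `operation`, retrying transient failures with backoff and jitter."""
    for attempt in range(max_retries + 1):  # initial call + up to max_retries retries
        try:
            return operation(request, timeout=2.0)
        except grpc.RpcError as err:
            if err.code() not in RETRYABLE:
                raise  # e.g. INVALID_ARGUMENT: retrying cannot help
            if attempt == max_retries:
                raise  # retries exhausted, surface the last error
            backoff = (2 ** attempt) + random.uniform(0, 1)  # 1s, 2s, 4s plus jitter
            logging.warning("Attempt %d failed (%s); retrying in %.2fs",
                            attempt + 1, err.code(), backoff)
            time.sleep(backoff)
```

If your grader counts the initial call as one of the three attempts, change the loop to `range(max_retries)`; document whichever interpretation you pick.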
3. Simulated Failures
Modify your server to simulate failures for testing:
```python
import random
import time

import grpc

class CalculatorServicer(calculator_pb2_grpc.CalculatorServicer):
    def __init__(self, failure_rate=0.3, slow_rate=0.2):
        """
        Args:
            failure_rate: Probability of crashing (0.0-1.0)
            slow_rate: Probability of a slow response (0.0-1.0)
        """
        self.failure_rate = failure_rate
        self.slow_rate = slow_rate

    def Add(self, request, context):
        # Simulate a random crash
        if random.random() < self.failure_rate:
            context.abort(
                grpc.StatusCode.UNAVAILABLE,
                'Simulated server failure'
            )

        # Simulate a slow response
        if random.random() < self.slow_rate:
            time.sleep(3)  # longer than the client timeout

        # Normal operation
        result = request.a + request.b
        return calculator_pb2.Result(value=result)
```
Configuration:
- ✅ `--failure-rate 0.3` (30% of requests crash)
- ✅ `--slow-rate 0.2` (20% of requests time out)
- ✅ Both configurable via command-line arguments (an argparse sketch follows below)
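One minimal way to wire those flags up with argparse; this sketch assumes the `CalculatorServicer` class above and the generated `calculator_pb2_grpc` module from Part 1:

```python
import argparse
from concurrent import futures

import grpc

import calculator_pb2_grpc
# CalculatorServicer is the class defined above in server.py

def serve():
    parser = argparse.ArgumentParser()
    parser.add_argument('--failure-rate', type=float, default=0.3,
                        help='probability a request is aborted (0.0-1.0)')
    parser.add_argument('--slow-rate', type=float, default=0.2,
                        help='probability a request sleeps past the client timeout')
    args = parser.parse_args()

    server = grpc.server(futures.ThreadPoolExecutor(max_workers=10))
    calculator_pb2_grpc.add_CalculatorServicer_to_server(
        CalculatorServicer(failure_rate=args.failure_rate,
                           slow_rate=args.slow_rate),
        server)
    server.add_insecure_port('localhost:50051')
    server.start()
    server.wait_for_termination()

if __name__ == '__main__':
    serve()
```

Each test scenario below can then be launched directly, e.g. `python server.py --failure-rate 0.5`.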
Testing Requirements
Test Scenarios
Run your client under different failure conditions:
| Scenario | Server Config | Expected Behavior |
|---|---|---|
| Normal | No failures | All operations succeed |
| Occasional timeouts | `--slow-rate 0.3` | Retries succeed within 3 attempts |
| High failure rate | `--failure-rate 0.5` | Some operations fail after 3 retries |
| Combined | `--failure-rate 0.3 --slow-rate 0.2` | Mix of retries and failures |
Required Tests
- Timeout Test: Server sleeps 3s, client times out and retries
- Retry Success: First attempt fails, second succeeds (an example test is sketched after this list)
- All Retries Failed: All 3 attempts timeout, operation fails
- No Retry on Invalid Input: Division by zero doesn't retry
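As a concrete starting point for the Retry Success test, here is a sketch that fakes a timeout without a real server; it assumes `call_with_retry` is importable from client.py (adjust the import to your layout) and patches `time.sleep` so the backoff does not slow the test down:

```python
import time
from unittest import mock

import grpc

from client import call_with_retry  # adjust to wherever your helper lives

class FakeDeadlineError(grpc.RpcError):
    """Minimal stand-in for a DEADLINE_EXCEEDED error in tests."""
    def code(self):
        return grpc.StatusCode.DEADLINE_EXCEEDED

def test_retry_success_after_one_timeout():
    calls = []

    def flaky_operation(request, timeout=None):
        calls.append(request)
        if len(calls) == 1:
            raise FakeDeadlineError()  # first attempt "times out"
        return 'ok'                    # second attempt succeeds

    with mock.patch.object(time, 'sleep'):  # skip the real backoff wait
        result = call_with_retry(None, flaky_operation, request='dummy')

    assert result == 'ok'
    assert len(calls) == 2
```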
Deliverables
📦 Updated Code:
- `server.py` - Add failure simulation
- `client.py` - Add timeout and retry logic
- `demo_part2.py` - Demonstrate failure handling
📊 Test Results:
Create a test report showing (a helper for collecting these numbers is sketched after this list):
- Number of successful operations
- Number of retries performed
- Average response time
- Failure rate observed
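One lightweight way to collect these numbers in demo_part2.py; the `ClientStats` name is our own illustration, not something the assignment requires:

```python
import time

import grpc

class ClientStats:
    """Tallies the figures the test report asks for."""

    def __init__(self):
        self.successes = 0
        self.failures = 0
        self.retries = 0       # increment this from inside your retry loop
        self.latencies = []    # seconds per top-level operation

    def timed_call(self, func, *args, **kwargs):
        start = time.monotonic()
        try:
            result = func(*args, **kwargs)
            self.successes += 1
            return result
        except grpc.RpcError:
            self.failures += 1
            raise
        finally:
            self.latencies.append(time.monotonic() - start)

    def summary(self):
        total = self.successes + self.failures
        avg = sum(self.latencies) / len(self.latencies) if self.latencies else 0.0
        rate = self.failures / total if total else 0.0
        return (f"{self.successes}/{total} succeeded, {self.retries} retries, "
                f"avg response {avg:.2f}s, observed failure rate {rate:.0%}")
```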
📹 Demo Video (2 minutes):
Show:
- Normal operation (no failures)
- Server with 30% failure rate - retries working
- Server with high delay - timeouts occurring
- All retries exhausted - graceful failure
Grading Rubric
| Criterion | Points | Description |
|---|---|---|
| Timeout Implementation | 8 | Correct timeout configuration |
| Retry Logic | 12 | Exponential backoff with jitter |
| Failure Simulation | 5 | Server simulates failures correctly |
| Retry Selectivity | 5 | Only retries appropriate errors |
| Logging | 3 | Clear logs showing retry behavior |
| Demo | 2 | Demonstrates all scenarios |
| Total | 35 | |
Analysis Questions
Answer these in your report:
Question 1
Why shouldn't we retry division-by-zero errors? What category of errors should never be retried?
Question 2
What happens if 1000 clients all retry at the same time after a server crash? How does jitter help?
Question 3
Is your calculator service using at-least-once or at-most-once semantics? Why?
Tips
Testing Tip
Use time.sleep() in your server to simulate slow responses:
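For example, a throwaway handler tweak (not part of the graded server) that forces every call past the 2-second client deadline:

```python
def Add(self, request, context):
    time.sleep(3)  # always exceed the client's 2s deadline
    return calculator_pb2.Result(value=request.a + request.b)
```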
Common Mistakes
- Retrying on all errors (including invalid input)
- Not adding jitter to backoff
- Forgetting to log retry attempts
- Hardcoding timeout values
Understanding At-Least-Once
With retries enabled, the calculator operations exhibit at-least-once semantics:
- The operation may be executed multiple times
- For calculator operations (add, multiply, etc.), re-execution is safe because they are stateless: running the same addition twice returns the same result and changes nothing on the server
- But what about stateful operations? (See Part 3!)
Next Steps
Your calculator can now handle transient failures! But what about operations that modify state? Move on to Part 3: Idempotency.