
This is a Coding Guide on Understanding the Failure Cascades that are Triggered by Retries when using RPC or Event-Driven Architectures

Tech | By Gavin Wallace | 19/01/2026 | 6 Mins Read

In this tutorial, we build an interactive comparison between a synchronous RPC-based system and an asynchronous, event-driven architecture, with the goal of understanding how distributed systems behave under failure and load. We first simulate downstream services with variable latency and transient errors, then drive both architectures with bursty traffic patterns, examining metrics such as tail latency, retry counts, and dead-letter queuing to see why tightly coupled RPC designs are more prone to cascading failures. Along the way, we exercise the practical mechanisms engineers use to stop cascades in production systems: retries, exponential backoff, circuit breakers, bulkheads, and queues. Check out the FULL CODES here.

import asyncio, random, time, math, statistics
from dataclasses import dataclass, field
from collections import deque


def now_ms():
    # Monotonic clock in milliseconds for latency measurements.
    return time.perf_counter() * 1000.0


def pctl(xs, p):
    # Linear-interpolation percentile; p is a fraction in [0, 1].
    if not xs:
        return None
    xs2 = sorted(xs)
    k = (len(xs2) - 1) * p
    f = math.floor(k)
    c = math.ceil(k)
    if f == c:
        return xs2[int(k)]
    return xs2[f] + (xs2[c] - xs2[f]) * (k - f)


@dataclass
class Stats:
    latencies_ms: list = field(default_factory=list)
    ok: int = 0
    fail: int = 0
    dropped: int = 0
    retries: int = 0
    timeouts: int = 0
    cb_open: int = 0
    dlq: int = 0


   def summary(self, name):
       l = self.latencies_ms
       return {
           "name": name,
           "ok": self.ok,
           "fail": self.fail,
           "dropped": self.dropped,
           "retries": self.retries,
           "timeouts": self.timeouts,
           "cb_open": self.cb_open,
           "dlq": self.dlq,
           "lat_p50_ms": round(pctl(l, 0.50), 2) if l else None,
           "lat_p95_ms": round(pctl(l, 0.95), 2) if l else None,
           "lat_p99_ms": round(pctl(l, 0.99), 2) if l else None,
           "lat_mean_ms": round(statistics.mean(l), 2) if l else None,
       }

This defines the core data structures and utilities used throughout the tutorial: a timing helper, a percentile function, and a unified metrics container that tracks successes, failures, retries, and latency tails, so the RPC and event-driven implementations can be measured and compared on equal footing. Check out the FULL CODES here.
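
To make the metrics plumbing concrete, here is a small smoke test of our own (not from the original article; the sample latencies are fabricated for illustration) that exercises Stats and pctl directly:

# Hypothetical smoke test for the metrics helpers; values are made up.
s = Stats()
s.latencies_ms.extend([8.2, 9.1, 7.6, 55.0, 10.3])  # fabricated sample latencies
s.ok, s.fail = 4, 1
print(s.summary("demo"))        # p95/p99 are dominated by the 55 ms outlier
print(pctl([1, 2, 3, 4], 0.5))  # 2.5: interpolates between the two middle samples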

@dataclass
class FailureModel:
   base_latency_ms: float = 8.0
   jitter_ms: float = 6.0
   fail_prob: float = 0.05
    overload_fail_prob: float = 0.40
   overload_latency_ms: float = 50.0


   def sample(self, load_factor: float):
       base = self.base_latency_ms + random.random() * self.jitter_ms
       if load_factor > 1.0:
           base += (load_factor - 1.0) * self.overload_latency_ms
           fail_p = min(0.95, self.fail_prob + (load_factor - 1.0) * self.overload_fail_prob)
       else:
           fail_p = self.fail_prob
        return base, random.random() < fail_p


class CircuitBreaker:
    # Reconstructed from the record() and allow() call sites below; the breaker
    # opens for open_ms once failures in the recent window reach fail_threshold.
    # Default parameter values are assumptions, not the article's exact numbers.
    def __init__(self, window=20, fail_threshold=10, open_ms=500):
        self.events = deque(maxlen=window)
        self.window = window
        self.fail_threshold = fail_threshold
        self.open_ms = open_ms
        self.open_until_ms = 0.0

    def allow(self):
        return now_ms() >= self.open_until_ms


   def record(self, ok: bool):
       self.events.append(not ok)
       if len(self.events) >= self.window and sum(self.events) >= self.fail_threshold:
           self.open_until_ms = now_ms() + self.open_ms


class Bulkhead:
   def __init__(self, limit):
       self.sem = asyncio.Semaphore(limit)


   async def __aenter__(self):
       await self.sem.acquire()


   async def __aexit__(self, exc_type, exc, tb):
       self.sem.release()


def exp_backoff(attempt, base_ms=20, cap_ms=400):
    # Full-jitter exponential backoff, capped at cap_ms.
    return random.random() * min(cap_ms, base_ms * (2 ** (attempt - 1)))

Here we model failure behavior alongside the resilience primitives that contain it: a failure model whose latency and error rate degrade under overload, a circuit breaker that sheds traffic after repeated failures, a bulkhead that caps concurrency, and jittered exponential backoff for retries. Together these let us experiment with different distributed-system configurations. Check out the FULL CODES here.
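
Before wiring these into a service, a quick standalone check (our addition, with assumed parameter values) shows the breaker tripping and recovering, and the backoff delays growing:

# Hypothetical check of CircuitBreaker and exp_backoff (assumed parameters).
cb = CircuitBreaker(window=5, fail_threshold=3, open_ms=200)
for _ in range(5):
    cb.record(ok=False)   # a fully failing window trips the breaker
print(cb.allow())         # False: calls are shed while the breaker is open
time.sleep(0.25)
print(cb.allow())         # True again once open_ms has elapsed

print([round(exp_backoff(a)) for a in range(1, 6)])  # jittered, capped at 400 ms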

class DownstreamService:
    def __init__(self, fm: FailureModel, capacity_rps=250):
        self.fm = fm
       self.capacity_rps = capacity_rps
       self._inflight = 0


   async def handle(self, payload: dict):
       self._inflight += 1
       try:
           load_factor = max(0.5, self._inflight / (self.capacity_rps / 10))
           lat, should_fail = self.fm.sample(load_factor)
           await asyncio.sleep(lat / 1000.0)
            if should_fail:
               raise RuntimeError("downstream_error")
           return {"status": "ok"}
       finally:
           self._inflight -= 1


async def rpc_call(
   svc,
   req,
   stats,
   timeout_ms=120,
   max_retries=0,
   cb=None,
   bulkhead=None,
):
   t0 = now_ms()
    if cb and not cb.allow():
        stats.cb_open += 1
        stats.fail += 1
        return False

    attempt = 0
    while True:
        attempt += 1
        try:
            if bulkhead:
                async with bulkhead:
                    await asyncio.wait_for(svc.handle(req), timeout=timeout_ms / 1000.0)
            else:
                await asyncio.wait_for(svc.handle(req), timeout=timeout_ms / 1000.0)
            stats.latencies_ms.append(now_ms() - t0)
            stats.ok += 1
            if cb:
                cb.record(True)
            return True
        except asyncio.TimeoutError:
            stats.timeouts += 1
        except Exception:
            pass
        if cb:
            cb.record(False)
        if attempt > max_retries:
            stats.fail += 1
            return False
        stats.retries += 1
        await asyncio.sleep(exp_backoff(attempt) / 1000.0)

This implements the synchronous RPC path and exposes how it interacts with the downstream service: timeouts, retries, and in-flight load feed directly into end-to-end latency, showing how RPC coupling magnifies transient problems under high traffic. Check out the FULL CODES here.
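
As a quick experiment of our own (parameter values are illustrative, not from the article), we can fire a concurrent burst of RPCs with and without retries and compare the summaries. Retries can help when failures are transient, but under saturation they mostly add load and lengthen the tail:

# Hypothetical driver comparing retry settings on the RPC path.
async def rpc_burst_demo():
    for retries in (0, 3):
        svc = DownstreamService(FailureModel(fail_prob=0.2), capacity_rps=50)
        stats = Stats()
        await asyncio.gather(*[
            rpc_call(svc, {"id": i}, stats, timeout_ms=120, max_retries=retries)
            for i in range(200)
        ])
        print(stats.summary(f"rpc_retries={retries}"))

asyncio.run(rpc_burst_demo())  # inside a notebook, use: await rpc_burst_demo()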

@dataclass
class Event:
    id: int
    tries: int = 0


class EventBus:
   def __init__(self, max_queue=5000):
       self.q = asyncio.Queue(maxsize=max_queue)


   async def publish(self, e: Event):
       try:
           self.q.put_nowait(e)
            return True
        except asyncio.QueueFull:
            return False


async def event_consumer(
   bus,
   svc,
   stats,
   stop,
   max_retries=0,
   dlq=None,
   bulkhead=None,
   timeout_ms=200,
):
    while not stop.is_set() or not bus.q.empty():
       try:
           e = await asyncio.wait_for(bus.q.get(), timeout=0.2)
       except asyncio.TimeoutError:
            continue


       t0 = now_ms()
       e.tries += 1
       try:
            if bulkhead:
                async with bulkhead:
                   await asyncio.wait_for(svc.handle({"id": e.id}), timeout=timeout_ms / 1000.0)
           else:
               await asyncio.wait_for(svc.handle({"id": e.id}), timeout=timeout_ms / 1000.0)
           stats.ok += 1
           stats.latencies_ms.append(now_ms() - t0)
        except Exception:
            stats.fail += 1
            if e.tries <= max_retries:
                stats.retries += 1
                await asyncio.sleep(exp_backoff(e.tries) / 1000.0)
                await bus.publish(e)   # requeue the event for another attempt
            elif dlq is not None:
                stats.dlq += 1
                dlq.append(e)          # retries exhausted: dead-letter the event

The asynchronous pipeline is built from a queue and background consumers. Events are handled independently of the originating request, failed events are retried with backoff, and unrecoverable messages are routed to a dead-letter queue. This shows how decoupling improves resilience while introducing new operational concerns. Check out the FULL CODES here.
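
One of those operational concerns is backpressure. The sketch below is our addition (the queue size is deliberately tiny) and shows how a bounded queue makes publish() shed load at the edge instead of accumulating unbounded work:

# Hypothetical backpressure demo with a deliberately tiny queue.
async def backpressure_demo():
    bus = EventBus(max_queue=5)
    dropped = 0
    for i in range(20):
        if not await bus.publish(Event(id=i)):
            dropped += 1  # queue full: the producer sheds load immediately
    print(f"accepted={bus.q.qsize()}, dropped={dropped}")  # 5 accepted, 15 dropped

asyncio.run(backpressure_demo())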

async def generate_requests(total=2000, burst=350, gap_ms=80):
    # Reconstructed bursty workload generator: emit `burst` requests at a time,
    # pausing gap_ms between bursts (shape inferred from the parameter names).
    reqs = []
    rid = 0
    while rid < total:
        for _ in range(min(burst, total - rid)):
            reqs.append({"id": rid})
            rid += 1
        await asyncio.sleep(gap_ms / 1000.0)
    return reqs

Finally, we orchestrate the experiment by driving both architectures under the same bursty workload: we run the RPC and event-driven implementations, collect their metrics, shut the consumers down cleanly, and combine latency, failure, and throughput behavior into a system-level comparison.
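
The orchestration code itself did not survive in this copy of the article, so the driver below is a reconstruction under stated assumptions: it reuses only the names defined above, and the workload sizes, consumer count, and resilience settings are illustrative choices, not the article's exact values:

# Reconstructed experiment driver (assumed sizes; not the article's exact code).
async def run_experiment():
    reqs = await generate_requests(total=600, burst=200, gap_ms=50)

    # RPC path: synchronous calls with retries, a breaker, and a bulkhead.
    # For simplicity we fire the whole workload at once; a faithful replay
    # would pace the bursts as generate_requests did.
    rpc_stats = Stats()
    svc_rpc = DownstreamService(FailureModel(), capacity_rps=100)
    cb, bh = CircuitBreaker(), Bulkhead(limit=50)
    await asyncio.gather(*[
        rpc_call(svc_rpc, r, rpc_stats, max_retries=2, cb=cb, bulkhead=bh)
        for r in reqs
    ])

    # Event path: publish everything, let background consumers drain the queue.
    ev_stats, dlq, stop = Stats(), deque(), asyncio.Event()
    bus = EventBus(max_queue=2000)
    svc_ev = DownstreamService(FailureModel(), capacity_rps=100)
    consumers = [
        asyncio.create_task(event_consumer(bus, svc_ev, ev_stats, stop,
                                           max_retries=2, dlq=dlq))
        for _ in range(8)
    ]
    for r in reqs:
        if not await bus.publish(Event(id=r["id"])):
            ev_stats.dropped += 1
    while not bus.q.empty():
        await asyncio.sleep(0.05)
    stop.set()                    # consumers exit once the queue is drained
    await asyncio.gather(*consumers)

    print(rpc_stats.summary("rpc"))
    print(ev_stats.summary("event"))

asyncio.run(run_experiment())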

In summary, the trade-offs between RPC-driven and event-driven architectures for distributed systems become clear. RPC delivers lower latency while dependencies remain healthy, but it is fragile under saturation: timeouts and retries can escalate into system-wide failure. Event-driven approaches decouple producers from consumers, absorb bursts through buffering, and localize failures, but they require careful handling of retries and backpressure to avoid unbounded queues. As this tutorial shows, the key to resilience in distributed systems is not a specific architecture; it is the combination of disciplined communication patterns, failure-handling mechanisms, and capacity-aware design.


Check out the FULL CODES here. Also, feel free to follow us on Twitter, join our Facebook group and our 100k+ ML SubReddit, and subscribe to our Newsletter. You can now also join us on Telegram.


Michal Sutter is a data scientist with a master's degree in Data Science from the University of Padova. With a strong foundation in machine learning, statistical analysis, and data engineering, he excels at transforming large datasets into actionable insights.
