AI-trends.today

A Coding Guide to Understanding Failure Cascades Triggered by Retries in RPC and Event-Driven Architectures

Tech | By Gavin Wallace | 19/01/2026 | 6 Mins Read
In this tutorial, we build an interactive comparison of an RPC-based synchronous system and an asynchronous, event-driven architecture, with the goal of understanding how distributed systems behave under failure and load. We first simulate downstream services with variable latency and transient errors, then drive both architectures with bursty traffic patterns and examine metrics such as tail latency, retry counts, and dead-letter queuing to see why RPC-coupled designs are more prone to cascading failures. Along the way, we exercise the practical mechanisms engineers use to stop cascades in production systems: retries, exponential backoff, circuit breakers, bulkheads, and queues.

import asyncio, random, time, math, statistics
from dataclasses import dataclass, field
from collections import deque


def now_ms():
    return time.perf_counter() * 1000.0


def pctl(xs, p):
    if not xs:
        return None
    xs2 = sorted(xs)
    k = (len(xs2) - 1) * p
    f = math.floor(k)
    c = math.ceil(k)
    if f == c:
        return xs2[int(k)]
    return xs2[f] + (xs2[c] - xs2[f]) * (k - f)


@dataclass
class Stats:
    latencies_ms: list = field(default_factory=list)
    ok: int = 0
    fail: int = 0
    dropped: int = 0
    retries: int = 0
    timeouts: int = 0
    cb_open: int = 0
    dlq: int = 0


   def summary(self, name):
       l = self.latencies_ms
       return {
           "name": name,
           "ok": self.ok,
           "fail": self.fail,
           "dropped": self.dropped,
           "retries": self.retries,
           "timeouts": self.timeouts,
           "cb_open": self.cb_open,
           "dlq": self.dlq,
           "lat_p50_ms": round(pctl(l, 0.50), 2) if l else None,
           "lat_p95_ms": round(pctl(l, 0.95), 2) if l else None,
           "lat_p99_ms": round(pctl(l, 0.99), 2) if l else None,
           "lat_mean_ms": round(statistics.mean(l), 2) if l else None,
       }

Here we define the core data structures and utilities used throughout the tutorial: a high-resolution timer, a linear-interpolation percentile helper, and a unified metrics container that tracks latency, failures, and tail behavior. With these in place, we can measure the RPC and event-driven implementations on equal footing.
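To sanity-check the percentile helper, here is a standalone restatement of pctl exercised on a known distribution (the sample latencies are made up for illustration):

```python
import math

def pctl(xs, p):
    # Linear-interpolation percentile over a sorted copy of xs.
    if not xs:
        return None
    xs2 = sorted(xs)
    k = (len(xs2) - 1) * p
    f, c = math.floor(k), math.ceil(k)
    if f == c:
        return xs2[int(k)]
    return xs2[f] + (xs2[c] - xs2[f]) * (k - f)

lat = [10.0, 12.0, 15.0, 40.0, 220.0]  # made-up latencies in ms
print(pctl(lat, 0.50), pctl(lat, 0.95))
```

Note how a single 220 ms outlier barely moves the median but dominates the p95, which is why tail percentiles, not means, reveal overload.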

@dataclass
class FailureModel:
    base_latency_ms: float = 8.0
    jitter_ms: float = 6.0
    fail_prob: float = 0.05
    overload_fail_prob: float = 0.40
    overload_latency_ms: float = 50.0


    def sample(self, load_factor: float):
        base = self.base_latency_ms + random.random() * self.jitter_ms
        if load_factor > 1.0:
            base += (load_factor - 1.0) * self.overload_latency_ms
            fail_p = min(0.95, self.fail_prob + (load_factor - 1.0) * self.overload_fail_prob)
        else:
            fail_p = self.fail_prob
        return base, random.random() < fail_p


class CircuitBreaker:
    # Reconstructed: count recent failures in a sliding window and, once the
    # threshold is crossed, reject calls until open_ms has elapsed.
    def __init__(self, fail_threshold=10, window=20, open_ms=500):
        self.fail_threshold = fail_threshold
        self.window = window
        self.open_ms = open_ms
        self.events = deque(maxlen=window)
        self.open_until_ms = 0.0


    def allow(self):
        return now_ms() >= self.open_until_ms


    def record(self, ok: bool):
        self.events.append(not ok)
        if len(self.events) >= self.window and sum(self.events) >= self.fail_threshold:
            self.open_until_ms = now_ms() + self.open_ms


class Bulkhead:
   def __init__(self, limit):
       self.sem = asyncio.Semaphore(limit)


   async def __aenter__(self):
       await self.sem.acquire()


   async def __aexit__(self, exc_type, exc, tb):
       self.sem.release()


def exp_backoff(attempt, base_ms=20, cap_ms=400):
    return random.random() * min(cap_ms, base_ms * (2 ** (attempt - 1)))

These resilience primitives model how systems stay stable under stress: the failure model makes latency and error rates sensitive to load, the circuit breaker stops calls to an unhealthy dependency, the bulkhead caps concurrency, and exponential backoff with jitter spaces out retries. Together they let us experiment with different distributed-system configurations.
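The backoff logic can be exercised in isolation. This sketch restates a full-jitter exponential backoff with the same shape as the tutorial's exp_backoff helper (the base and cap defaults are assumptions):

```python
import random

def exp_backoff(attempt, base_ms=20, cap_ms=400):
    # Full jitter: sleep a uniform-random slice of an exponentially
    # growing window, capped at cap_ms so waits stay bounded.
    return random.random() * min(cap_ms, base_ms * (2 ** (attempt - 1)))

random.seed(42)
for attempt in range(1, 7):
    window = min(400, 20 * 2 ** (attempt - 1))
    print(f"attempt {attempt}: window {window} ms, sleep {exp_backoff(attempt):.1f} ms")
```

The jitter matters as much as the exponent: without it, clients that failed together retry together, re-creating the same load spike on every attempt.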

class DownstreamService:
    def __init__(self, fm: FailureModel, capacity_rps=250):
        self.fm = fm
       self.capacity_rps = capacity_rps
       self._inflight = 0


   async def handle(self, payload: dict):
       self._inflight += 1
       try:
           load_factor = max(0.5, self._inflight / (self.capacity_rps / 10))
           lat, should_fail = self.fm.sample(load_factor)
           await asyncio.sleep(lat / 1000.0)
            if should_fail:
                raise RuntimeError("downstream_error")
           return {"status": "ok"}
       finally:
           self._inflight -= 1


async def rpc_call(
   svc,
   req,
   stats,
   timeout_ms=120,
   max_retries=0,
   cb=None,
   bulkhead=None,
):
   t0 = now_ms()
    if cb and not cb.allow():
        stats.cb_open += 1
        stats.fail += 1
        return False


    attempt = 0
    while True:
        attempt += 1
        try:
            if bulkhead:
                async with bulkhead:
                    await asyncio.wait_for(svc.handle(req), timeout=timeout_ms / 1000.0)
            else:
                await asyncio.wait_for(svc.handle(req), timeout=timeout_ms / 1000.0)
            stats.latencies_ms.append(now_ms() - t0)
            stats.ok += 1
            if cb:
                cb.record(True)
            return True
        except asyncio.TimeoutError:
            stats.timeouts += 1
        except Exception:
            pass
        stats.fail += 1
        if cb:
            cb.record(False)
        if attempt > max_retries:
            return False
        stats.retries += 1
        await asyncio.sleep(exp_backoff(attempt) / 1000.0)

We then implement the synchronous RPC path and examine how it interacts with the downstream service. Timeouts, retries, and in-flight load directly shape latency, and under high traffic this coupling magnifies transient problems into cascades.
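A quick back-of-the-envelope shows why retries amplify load on a degraded dependency: if each attempt fails independently with probability p and we allow n retries, a single logical request costs 1 + p + p² + … + pⁿ attempts. A minimal sketch (the probabilities are illustrative):

```python
def expected_attempts(p_fail, max_retries):
    # Attempt i happens only if the previous i-1 attempts all failed:
    # E[attempts] = 1 + p + p^2 + ... + p^max_retries
    return sum(p_fail ** i for i in range(max_retries + 1))

for p in (0.05, 0.5, 0.9):
    print(p, expected_attempts(p, 3))
```

At a 5% failure rate retries are nearly free, but once a dependency is failing 90% of the time, three retries more than triple its traffic, which is exactly the feedback loop that turns an overloaded service into a dead one.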

@dataclass
class Event:
    id: int
    tries: int = 0


class EventBus:
   def __init__(self, max_queue=5000):
       self.q = asyncio.Queue(maxsize=max_queue)


    async def publish(self, e: Event):
        try:
            self.q.put_nowait(e)
            return True
        except asyncio.QueueFull:
            return False


async def event_consumer(
   bus,
   svc,
   stats,
   stop,
   max_retries=0,
   dlq=None,
   bulkhead=None,
   timeout_ms=200,
):
    while not stop.is_set() or not bus.q.empty():
        try:
            e = await asyncio.wait_for(bus.q.get(), timeout=0.2)
        except asyncio.TimeoutError:
            continue


        t0 = now_ms()
        e.tries += 1
        try:
            if bulkhead:
                async with bulkhead:
                    await asyncio.wait_for(svc.handle({"id": e.id}), timeout=timeout_ms / 1000.0)
            else:
                await asyncio.wait_for(svc.handle({"id": e.id}), timeout=timeout_ms / 1000.0)
            stats.ok += 1
            stats.latencies_ms.append(now_ms() - t0)
        except Exception:
            stats.fail += 1
            if e.tries <= max_retries:
                stats.retries += 1
                await asyncio.sleep(exp_backoff(e.tries) / 1000.0)
                if not await bus.publish(e):
                    stats.dropped += 1
            elif dlq is not None:
                dlq.append(e)
                stats.dlq += 1

The asynchronous pipeline is built from a queue and background consumers. Events are handled independently of the original request, failed events are retried with backoff, and unrecoverable messages are routed to a dead-letter queue. This shows how decoupling improves resilience while introducing new operational concerns.
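The retry-then-dead-letter flow can be sketched in a self-contained form; consume_with_dlq and flaky are hypothetical names, and the handler's failure rule is contrived so the outcome is deterministic:

```python
import asyncio

async def consume_with_dlq(queue, dlq, handler, max_retries=2):
    # Drain the queue; failed items are re-enqueued for retry,
    # then parked on the dead-letter list once retries run out.
    while not queue.empty():
        item, tries = await queue.get()
        try:
            await handler(item)
        except Exception:
            if tries < max_retries:
                queue.put_nowait((item, tries + 1))
            else:
                dlq.append(item)

async def flaky(item):
    # Contrived handler: even ids always fail, odd ids succeed.
    if item % 2 == 0:
        raise RuntimeError("boom")

async def demo():
    q, dlq = asyncio.Queue(), []
    for i in range(4):
        q.put_nowait((i, 0))
    await consume_with_dlq(q, dlq, flaky)
    return dlq

dead = asyncio.run(demo())
print(dead)  # the even ids end up dead-lettered
```

The dead-letter list is what keeps retries bounded: a poison message that can never succeed is taken out of circulation instead of looping forever and starving healthy events.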

async def generate_requests(total=2000, burst=350, gap_ms=80):
    # Reconstructed: emit request ids in bursts, pausing gap_ms between bursts.
    reqs, rid = [], 0
    while rid < total:
        for _ in range(min(burst, total - rid)):
            reqs.append({"id": rid}); rid += 1
        await asyncio.sleep(gap_ms / 1000.0)
    return reqs

Finally, we orchestrate the experiment by driving both architectures with the same bursty workload, collecting their metrics, and shutting the consumers down cleanly. The combined latency, failure, and throughput numbers give a system-level comparison of RPC and event-driven behavior.
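The clean-shutdown step can be sketched with an asyncio.Event: producers finish, the stop flag is set, and consumers drain the queue before exiting (all names here are illustrative):

```python
import asyncio

async def consumer(q, stop, out):
    # Keep consuming while work may still arrive OR the queue isn't drained.
    while not stop.is_set() or not q.empty():
        try:
            item = await asyncio.wait_for(q.get(), timeout=0.05)
        except asyncio.TimeoutError:
            continue
        out.append(item)

async def demo():
    q, stop, out = asyncio.Queue(), asyncio.Event(), []
    workers = [asyncio.create_task(consumer(q, stop, out)) for _ in range(2)]
    for i in range(10):
        q.put_nowait(i)
    stop.set()                      # signal that no more work is coming
    await asyncio.gather(*workers)  # consumers drain the queue, then exit
    return out

drained = asyncio.run(demo())
print(sorted(drained))
```

Checking the stop flag and the queue depth together is the important detail: checking the flag alone would abandon buffered events, skewing the experiment's throughput numbers.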

In summary, the trade-offs between RPC-driven and event-driven architectures are clearly visible. RPC offers lower latency while dependencies remain healthy but is fragile under saturation, where timeouts and retries can escalate into system-wide failures. Event-driven approaches decouple producers from consumers, absorb bursts through buffering, and localize failures, but they require careful handling of retries and backpressure to avoid unbounded queues. The key to resilience in distributed systems is not a specific architecture; it is the combination of disciplined communication patterns, failure-handling mechanisms, and capacity-aware design.




Michal Sutter is a data scientist with a master’s degree in Data Science from the University of Padova. He excels at transforming large datasets into actionable insights and has a strong foundation in machine learning, statistical analysis, and data engineering.

