When computers give up too fast: The real cost

Violetta Pidvolotska
17 Min Read

When unchecked delays transform system sluggishness into complete service disruptions.

[Image credit: Kevin Ku]

For distributed systems that directly interact with users, prolonged response times frequently indicate a more critical problem than explicit errors. If service responses consistently fall short of user expectations, the difference between a “lagging” system and a “collapsed” one quickly diminishes, regardless of whether all individual services are technically operational.

I’ve observed this recurring issue in various system architectures. A specific incident highlighted to me the significant impact of implicit default settings on production system performance. The critical factor wasn’t the slowdown itself, but rather how an “unlimited waiting” default silently depleted resources well before any standard failure metrics were triggered.

Specific details have been altered to safeguard proprietary data.

The cascade from sluggishness to total service failure

The problem emerged through user support requests, not automated alerts. These reports started surfacing in the early hours:

  • Product pages fail to load.
  • The checkout process freezes.
  • The website is experiencing significant delays.

Concurrently, our monitoring dashboards showed subtle but concerning shifts. CPU utilization rose, memory consumption intensified, and thread pools became saturated, yet reported error rates remained minimal. Product pages started to freeze periodically: certain requests would finish, while others would hang for such extended periods that users resorted to refreshing, opening new browser tabs, and ultimately abandoning their sessions.

I was the on-call engineer that week. A recent deployment had occurred, so I promptly initiated a rollback. This action yielded no change, indicating that the problem wasn’t tied to a particular update, but rather to the system’s inherent behavior when facing persistent performance degradation.

Within hours, the consequences became quantifiable. Product page abandonment saw a dramatic rise. Conversion rates plummeted by double-digit percentages. The influx of support tickets surged. Customers began migrating to rival platforms. By day’s end, the incident culminated in a six-figure financial setback and, more significantly, a noticeable erosion of customer confidence.

The more challenging inquiry wasn’t about the specific component that failed, but rather why user experience degraded so severely before any of our alert systems triggered. The system surpassed the users’ tolerance level significantly before it activated any of our monitoring thresholds. Our alerting mechanisms were configured for explicit failures—such as errors, unhealthy instances, or clear resource saturation—whereas performance latency was primarily relegated to dashboards and not configured to trigger notifications.

The overlooked mode of system failure

Our product pages presented prices denominated in the user’s local currency. This functionality required the Product Service to invoke a downstream currency exchange API. This particular dependency didn’t crash; instead, it intermittently experienced slow responses for a duration sufficient to initiate a cascading failure.

During my deeper investigation of the incident, a specific detail became apparent. The Product Service utilized an HTTP client configured with its default settings, which meant the request timeout was practically boundless. While frontend browsers typically ceased waiting after approximately 30 seconds, on the backend, these requests persisted, consuming resources long after the user had abandoned their attempt.
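To make the failure mode concrete, here is a minimal, self-contained sketch using only the Python standard library (the stalled server and URL are stand-ins for illustration, not the actual services from the incident). A TCP listener that never responds simulates the hung dependency; the commented-out default call would block forever, while an explicit timeout fails fast:

```python
import socket
import urllib.error
import urllib.request

def start_stalled_server():
    """A TCP socket that completes handshakes via the kernel backlog
    but never sends a response -- a stand-in for a hung dependency."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))
    srv.listen(5)                     # never call accept(): clients just wait
    return srv, srv.getsockname()[1]

srv, port = start_stalled_server()
url = f"http://127.0.0.1:{port}/convert"

# urllib.request.urlopen(url)        # default: no timeout, blocks indefinitely
try:
    urllib.request.urlopen(url, timeout=2.0)   # explicit bound: fail fast
except (socket.timeout, urllib.error.URLError):
    print("gave up after 2s instead of holding a worker thread hostage")
```

The asymmetry the incident exposed is exactly the commented-out line: nothing in the code looks wrong, yet every call inherits an unbounded wait.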

Default timeout chart

Violetta Pidvolotska

This discrepancy proved far more critical than anticipated. The initial few stalled currency conversion calls monopolized Product Service worker threads and external connections, causing subsequent incoming requests to backlog behind tasks that no longer had an active user waiting. As soon as the shared resource pools began to reach saturation, the issue expanded beyond “just the currency function.” Even requests that had no need for currency conversion experienced slowdowns due to contention for the same thread pool and overall internal capacity.

By that stage, the dependent service didn’t have to outright fail to bring down our system. Its mere slowness, combined with our unbounded waiting, was sufficient. This wasn’t a failure caused by errors, but a capacity breakdown: stalled concurrent operations accumulated faster than they could be processed, latency spread throughout the system, and overall throughput plummeted, all without a single exception being logged.

Certain temporary remedies offered only brief respite. Restarting service instances or reducing incoming traffic momentarily eased the strain, but the improvement was never sustained. As long as requests were permitted to wait without a defined limit, the system continued to pile up tasks faster than it could process them.

Once we eventually identified the indefinite waiting as the core problem, the immediate solution seemed straightforward: implement a timeout. However, the true insight gained was much more profound.

Implicit settings that subtly influence system operation

Superficially, this appeared to be a mere configuration error. In truth, it demonstrated the profound impact of prevalent default settings on how systems behave in live environments.

Many popular libraries and frameworks ship with infinite or extremely long default timeouts. In Java, the standard HTTP clients interpret a timeout of zero as “wait forever” unless a value is provided. In Python, the requests library blocks indefinitely unless a timeout is passed explicitly. The browser Fetch API has no timeout option at all; callers must abort slow requests themselves.

These default configurations are not accidental; rather, they are designed to be universally applicable. Libraries prioritize the successful completion of individual requests because they cannot determine the acceptable “slowness” threshold for your particular system. The responsibility for ensuring resilience during partial failures thus falls upon the application developer.

Live production environments seldom experience failures in optimal circumstances. Instead, they typically falter under stress, during intermittent service disruptions, through repeated attempts, and due to authentic user interactions. Under such conditions, indefinite waiting poses a significant risk. Default settings that appear benign during the development phase inadvertently dictate critical architectural choices in a production setting.

During a subsequent team audit of our services, we discovered numerous calls either lacked any timeout configuration or were set with values that no longer aligned with actual production latencies. These default behaviors had been influencing system performance for years, without our conscious decision or oversight.

Understanding the mindset driving extended timeout durations

This incident brought to light more than just an omitted timeout setting. It unveiled a prevalent operational mindset that many teams, including our own at the time, implicitly adopted.

This approach presumes:

  1. Dependent services are typically quick
  2. Performance degradation is an infrequent occurrence
  3. Pre-configured settings are generally adequate
  4. Prolonging the wait duration enhances the likelihood of successful completion

This perspective prioritizes the triumph of single requests, frequently compromising the broader reliability of the entire system. Consequently, teams often remain unaware of their actual effective timeouts, various services employ divergent timeout values, and certain calls completely lack any timeout definition.

Infinite timeout chart

Violetta Pidvolotska

Even in instances where timeouts are configured, they frequently exceed what typical user interaction patterns would warrant. For our specific scenario, users would attempt a retry within a few seconds and typically give up entirely after roughly ten. Extending the wait time beyond this point offered no benefit to the outcome; it simply wasted system resources.

Excessive timeouts can also obscure more fundamental architectural flaws. Should a request routinely time out due to retrieving an abundance of items, the root cause isn’t the timeout setting itself, but rather the absence of proper pagination or inefficient request structuring. By solely optimizing for the success of individual requests, development teams inadvertently sacrifice system-wide robustness.

Timeouts as mechanisms for failure isolation

Prior to this event, we largely perceived timeouts as mere configurable parameters. Subsequent to it, we began to view them as essential boundaries for containing failures.

A timeout specifies the point at which a failure must cease. Absent timeouts, even a solitary slow dependency has the potential to silently exhaust threads, network connections, and memory throughout the entire system. Conversely, with carefully selected timeouts, sluggishness remains localized, preventing its proliferation into a comprehensive system outage.

We implemented a series of intentional modifications:

1. Implemented client-side timeout enforcement

The caller decides when to stop waiting. The incident made starkly clear that load balancers, proxies, and servers alone could not reliably prevent indefinite hangs.

2. Established clear, comprehensive deadlines for user interactions

Subsequent downstream invocations were restricted to using only the remaining time allocated; any waiting past that limit represented unproductive effort with no prospect of yielding a better result.

Deadline timeout chart

Violetta Pidvolotska

We rendered these deadlines unambiguous and transferable. For HTTP communication, we conveyed a complete transaction deadline using a singular X-Request-Deadline header, enabling each service to calculate the remaining available time and configure individual call timeouts appropriately. We opted for a full transaction deadline, rather than a hop-by-hop timeout, as it integrates seamlessly across different service layers and retry mechanisms.
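A minimal sketch of how such a transaction-wide deadline can be derived and enforced at each hop. The header name matches the convention described above; the budget values and helper names are illustrative, and the scheme assumes hosts have roughly synchronized wall clocks:

```python
import time

DEADLINE_HEADER = "X-Request-Deadline"   # absolute epoch deadline, as above

def remaining_budget(headers, default_budget=10.0):
    """Seconds left before the whole transaction's deadline.
    Assumes roughly synchronized wall clocks across hosts."""
    raw = headers.get(DEADLINE_HEADER)
    if raw is None:
        return default_budget            # entry point: start a fresh budget
    return max(0.0, float(raw) - time.time())

def downstream_timeout(headers, per_call_cap=3.0):
    """Per-call timeout = min(local cap, what's left of the deadline)."""
    budget = remaining_budget(headers)
    if budget <= 0.0:
        raise TimeoutError("transaction deadline already passed")
    return min(per_call_cap, budget)     # never wait past the shared deadline

# The entry point stamps the deadline once; every hop just reads it.
headers = {DEADLINE_HEADER: str(time.time() + 10.0)}
print(f"this hop may wait {downstream_timeout(headers):.1f}s")
```

Because the header carries an absolute deadline rather than a per-hop value, retries and nested calls all subtract from the same budget instead of each restarting the clock.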

In the context of gRPC communications, native deadline features facilitated the propagation of remaining time across distinct service boundaries. We further extended this principle by integrating the same boundary into our internal request context, ensuring that background processes would terminate once their allocated time budget was exhausted.

3. Adopted a meticulous approach to selecting timeout values

Network connection timeouts were maintained at brief durations, directly correlating with network characteristics. Request timeouts, conversely, were determined by observed production latency data, rather than speculative assumptions.

Instead of depending on mean values, our attention centered on the 99th and 99.9th percentiles. In scenarios where the 50th percentile was near the 99th, we deliberately provided additional buffer to prevent minor performance dips from escalating into widespread timeout occurrences. This strategy enabled us to grasp the behavior of slow requests during peak loads and to establish timeouts that safeguarded system capacity without triggering superfluous failures.

For instance, if 99% of requests successfully concluded within 300 milliseconds, establishing a timeout between 350 and 400 milliseconds offered a more appropriate compromise than setting it to several tens of seconds. Any behavior beyond this threshold became a deliberate product choice. In our specific instance, when currency conversion failed due to a timeout, we implemented a fallback to display prices in the default currency. Users consistently indicated a preference for a partial or less-than-perfect response over an interminable wait.
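Deriving a timeout from tail latency can be as simple as a nearest-rank percentile over observed samples plus headroom. The latency samples below are synthetic, and the 25% buffer factor is an assumption for the sketch, not a rule:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile of observed latencies (seconds)."""
    ordered = sorted(samples)
    rank = math.ceil(p / 100.0 * len(ordered))
    return ordered[max(0, rank - 1)]

def suggest_timeout(latencies, p=99.0, buffer=1.25):
    """Timeout = observed tail latency plus headroom, not a guess."""
    return percentile(latencies, p) * buffer

# 1,000 synthetic observations: most ~120ms, a slow tail up to ~300ms
latencies = [0.12] * 950 + [0.25] * 40 + [0.30] * 10
print(f"p99 = {percentile(latencies, 99):.3f}s, "
      f"timeout ~= {suggest_timeout(latencies):.3f}s")
```

Recomputing this from fresh production samples, rather than hard-coding the result, is what keeps the value honest as latency distributions drift.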

Furthermore, we maintained a cautious approach to retry logic within user-facing workflows. A retry attempt that disregards the overall end-to-end deadline is more detrimental than simply not retrying; it needlessly duplicates effort long after the user has disengaged. This illustrates how seemingly “beneficial” retries can degenerate into overwhelming retry storms when systems experience partial slowdowns.
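A retry loop that honors the end-to-end deadline might look like this sketch (the dependency, backoff values, and exception type are illustrative):

```python
import time

def retry_within_deadline(fn, deadline, base_delay=0.05):
    """Retry only while the end-to-end deadline still has budget left;
    a retry the user will never see is pure wasted work."""
    attempt = 0
    while True:
        try:
            return fn()
        except ConnectionError:
            attempt += 1
            delay = base_delay * (2 ** attempt)      # exponential backoff
            if time.monotonic() + delay >= deadline:
                raise TimeoutError("deadline exhausted; not retrying")
            time.sleep(delay)

attempts = []
def flaky_dependency():
    attempts.append(1)
    if len(attempts) < 3:
        raise ConnectionError("transient failure")   # fails twice, then works
    return "ok"

result = retry_within_deadline(flaky_dependency, time.monotonic() + 1.0)
print(f"succeeded on attempt {len(attempts)}: {result}")
```

The key line is the budget check before each sleep: once the user-facing deadline would be blown, the loop stops generating load instead of joining a retry storm.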

Collectively, we institutionalized these choices by establishing shared client default configurations and implementing a compulsory review checklist, applied to both new and existing call pathways, thereby preventing the silent reemergence of indefinite waiting.

Ensuring timeout effectiveness

Timeouts should always be transparent. Following the incident, our attention shifted to three key areas:

1. Enhancing timeout visibility

Each timeout event generated a structured log entry, detailing the associated dependency and the remaining time budget. We monitored timeout frequencies as key metrics and configured alerts for persistent upward trends, rather than isolated peaks. Escalating timeout rates transformed into proactive warnings, instead of unexpected discoveries during outages. Crucially, we revised our paging system to incorporate alerts for user-affecting latency and indications of “uncompleted requests,” moving beyond solely monitoring error rates.
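As an illustration, a timeout handler can emit one structured record per event. The field names here are an invented convention for the sketch, not a standard schema:

```python
import json
import logging
import time

logger = logging.getLogger("timeouts")

def timeout_record(dependency, timeout_s, budget_left_s):
    """Build and log one structured entry per timeout event, so
    dashboards can aggregate by dependency and alert on trends."""
    record = {
        "event": "dependency_timeout",
        "dependency": dependency,
        "timeout_seconds": timeout_s,
        "budget_left_seconds": round(budget_left_s, 3),
        "ts": time.time(),
    }
    logger.warning(json.dumps(record))
    return record

rec = timeout_record("currency-exchange", 0.4, 1.732)
print(rec["dependency"], rec["budget_left_seconds"])
```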

2. Ceasing to treat timeout values as static parameters

As traffic volumes expand, external dependencies evolve, and system architectures mature, timeout values deemed appropriate a year prior frequently become obsolete. We consequently undertook reviews of timeout configurations whenever there were shifts in traffic patterns, the integration of new dependencies, or alterations in latency distributions.

3. Proactively verifying timeout behavior to preempt actual incidents

Injecting simulated latency into non-production environments quickly uncovered stalled calls, runaway retries, and missing fallbacks. It also forced us to separate two distinct questions: what fails under heavy load, and what fails under prolonged slowness.

Conventional load testing addressed the former. The latter was elucidated through fault-injection and latency experiments, a methodology of deliberate failure sometimes referred to as chaos engineering. Through the deliberate introduction of delays and intermittent freezes, we confirmed that deadlines effectively terminated tasks, queues did not expand indefinitely, and fallback procedures functioned as designed.
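A minimal stand-in for such a latency-injection check: wrap a dependency call with an artificial delay and verify that the time budget actually cuts it off. The dependency and delay values are hypothetical:

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

def with_injected_latency(fn, delay_s):
    """Wrap a dependency call with artificial delay -- a minimal
    stand-in for the latency-injection experiments described above."""
    def wrapped(*args, **kwargs):
        time.sleep(delay_s)
        return fn(*args, **kwargs)
    return wrapped

def currency_convert(amount):             # hypothetical dependency
    return amount * 0.92

slow_convert = with_injected_latency(currency_convert, delay_s=1.0)

# Verify a 200ms budget actually cuts the call off under injected latency.
with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(slow_convert, 100)
    try:
        future.result(timeout=0.2)
        print("BUG: call should have been cut off")
    except FutureTimeout:
        print("timeout fired as expected under injected latency")
```

In practice we ran the equivalent experiments against staging services rather than in-process wrappers, but the assertion is the same: under injected slowness, deadlines fire, queues stay bounded, and fallbacks engage.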

Enduring insights gained

This particular incident fundamentally altered my perspective on timeouts.

A timeout represents a judgment call regarding utility. Beyond a specific threshold, extended waiting ceases to enhance the user experience. Instead, it merely amplifies the wasteful processing carried out by a system long after the user has abandoned the interaction.

Furthermore, a timeout is a strategic choice concerning containment. Lacking defined waiting periods, localized failures can escalate into complete system disruptions due to resource depletion, manifesting as blocked threads, overwhelmed resource pools, ever-expanding queues, and widespread latency.

Should there be a singular lesson to extract from this account, it is to consciously establish timeouts and align them with allocated budgets. Base these decisions on observed user behavior. Assess latency at the 99th percentile, moving beyond simple averages. Ensure timeouts are transparently observable and explicitly determine the action taken upon their activation. Furthermore, compartmentalize system capacity to prevent any single slow dependency from debilitating the entire infrastructure.

Indefinite waiting is far from benign; it carries a tangible cost to reliability. If you do not intentionally limit waiting durations, the system itself will eventually impose those limits upon you, often catastrophically.

This piece is featured within the Foundry Expert Contributor Network.