
Monday, January 09, 2023

We did some Critical Path Work in the Enterprise

We did some critical path work in the enterprise, so this was of interest, though we did not address many latency aspects. Technical, but worth a look.

Distributed Latency Profiling through Critical Path Tracing, by Brian Eaton, Jeff Stewart, Jon Tedesco, and N. Cihan Tas, in CACM

Communications of the ACM, January 2023, Vol. 66 No. 1, Pages 44-51, DOI: 10.1145/3570522

For complex distributed systems that include services that constantly evolve in functionality and data, keeping overall latency to a minimum is a challenging task. Critical path tracing (CPT) is a new applied mechanism for gathering critical path latency profiles in large-scale distributed applications. It is currently enabled in hundreds of different Google services, which provides valuable day-to-day data for latency analysis.

Fast turnaround time is an essential feature for any online service. In determining the root causes of high latency in a distributed system, the goal is to answer a key optimization question: Given a distributed system and workload, which subcomponents can be optimized to reduce latency?

Low latency is an important feature for many Google applications, such as Search,4 and latency-analysis tools play a critical role in sustaining low latency at scale. The systems evolve constantly because of code and deployment changes, as well as shifting traffic patterns. Parallel execution is essential, both across service boundaries and within individual services. Different slices of traffic have different latency characteristics.

CPT provides detailed and actionable information about which subcomponents of a distributed system are contributing to overall latency. This article presents results and experiences as observed in using CPT in a particular application: Google Search.

Critical path describes the ordered list of steps that directly contribute to the slowest path of request processing through a distributed system, so optimizing these steps reduces overall latency. Individual services have many subcomponents, and CPT relies on software frameworks17 to identify which subcomponents are on the critical path. When one service calls another, RPC (remote procedure call) metadata propagates critical path information from the callee back to the caller. The caller then merges critical paths from its dependencies into a unified critical path for the entire request.
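The merge step can be sketched roughly as follows. This is a minimal illustration under assumed names (`Step`, `merge_critical_path`, the `rpc:` prefix), not the paper's actual API: the caller replaces the opaque RPC-wait step in its own path with the breakdown the callee returned in RPC metadata.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Step:
    # One critical-path step: which subcomponent ran and for how long.
    subcomponent: str
    latency_ms: float
    children: List["Step"] = field(default_factory=list)

def merge_critical_path(caller_steps, callee_name, callee_path):
    """Splice the callee's critical path (returned via RPC metadata)
    into the caller's path, under the step that made the RPC."""
    merged = []
    for step in caller_steps:
        if step.subcomponent == callee_name:
            # Replace the opaque RPC wait with the callee's own breakdown.
            merged.append(Step(step.subcomponent, step.latency_ms,
                               children=list(callee_path)))
        else:
            merged.append(step)
    return merged

# Example: A1's path contains an RPC to B1; B1 reports its own critical path.
a1_path = [Step("A1/parse", 2.0), Step("rpc:B1", 10.0), Step("A1/render", 1.5)]
b1_path = [Step("B1/lookup", 7.0), Step("B1/format", 2.5)]
unified = merge_critical_path(a1_path, "rpc:B1", b1_path)
```

Applied recursively up the call tree, this yields a single unified critical path for the entire request.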

The unified critical path is logged with other request metadata. Log analysis is used to select requests of interest, and then critical paths from those requests are aggregated to create critical path profiles. The tracing process is efficient, allowing large numbers of requests to be sampled. The resulting profiles give detailed and precise information about the root causes of latency in distributed systems.
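The aggregation step might look like the following sketch, where each sampled request contributes a flattened critical path and the profile ranks subcomponents by their total critical-path time. The function name and data layout are assumptions for illustration, not the production pipeline:

```python
from collections import defaultdict

def critical_path_profile(sampled_paths):
    """Aggregate per-request critical paths into a profile:
    total critical-path time attributed to each subcomponent."""
    totals = defaultdict(float)
    for path in sampled_paths:  # one flattened critical path per sampled request
        for subcomponent, latency_ms in path:
            totals[subcomponent] += latency_ms
    # Rank subcomponents by their contribution to overall latency.
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

# Two sampled requests from the example system.
requests = [
    [("A1", 3.0), ("B1", 10.0), ("A1", 1.0)],
    [("A1", 2.0), ("B2", 6.0)],
]
profile = critical_path_profile(requests)
# In this tiny sample, B1 contributes the most critical-path time.
```

In practice the log-analysis step first filters to requests of interest (a traffic slice, a latency percentile), then aggregates only those paths.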

An example system. Consider the distributed system in Figure 1, which consists of three services, each with two subcomponents. The purpose of the system is to receive a request from the user, perform some processing, and return a response. Arrows show the direction of requests; responses flow in the opposite direction.

Figure 1. A simple distributed system.

The system divides work across many subcomponents. Requests arrive at Service A, which hands request processing off to subcomponent A1. A1 in turn relies on subcomponents B1, B2, and A2, which have their own dependencies. Some subcomponents can be invoked multiple times during request processing (for example, A2, B2, and C2 all call into C1).

Even though the system architecture is apparent from the figure, the actual latency characteristics of the system are hard to predict. For example, is A1 able to invoke A2 and B1 in parallel, or is there a data dependency so the call to B1 must complete before the call to A2 can proceed? How much internal processing does A2 perform before calling into B2? What about after receiving the response from B2? Are any of these requests repeated? What is the latency distribution of each processing step? And how do the answers to all of these questions change depending on the incoming request?
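The impact of such data dependencies can be shown with a toy calculation (latencies invented for illustration): if A1 must wait for B1 before calling A2, their latencies add; if it can fan out to both concurrently, only the slower branch is on the critical path.

```python
def sequential_latency(branch_latencies_ms):
    # Data dependency: each call must finish before the next starts,
    # so every branch is on the critical path and latencies add up.
    return sum(branch_latencies_ms)

def parallel_latency(branch_latencies_ms):
    # No data dependency: calls proceed concurrently, so only the
    # slowest branch determines the critical path.
    return max(branch_latencies_ms)

b1_ms, a2_ms = 12.0, 8.0
seq = sequential_latency([b1_ms, a2_ms])  # B1 then A2: 20.0 ms
par = parallel_latency([b1_ms, a2_ms])    # B1 and A2 in parallel: 12.0 ms
```

This is why a critical path profile is more actionable than per-service latency alone: in the parallel case, optimizing A2 does nothing for end-to-end latency until B1 is faster.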

Without good answers to these questions, efforts to improve overall system latency will be poorly targeted and might go to waste. For example, in Figure 1, to reduce the overall latency of A1 and its downstream subcomponents, you must know which of these subcomponents actually impact the end-to-end system latency. Before deciding to optimize, you need to know whether A1 → A2 actually matters. ...

