Traffic Control
This article describes the detailed design of Celeborn Worker
's traffic control.
Design Goal
The design goal of Traffic Control is to prevent Worker
OOM without harming performance. At the
same time, Celeborn tries to achieve fairness without harming performance.
Celeborn reaches the goal through Back Pressure
and Congestion Control
.
Data Flow
From the Worker
's perspective, the income data flow comes from two sources:
ShuffleClient
that pushes primary data to the primaryWorker
- Primary
Worker
that sends data replication to the replicaWorker
The buffered memory can be released when the following conditions are satisfied:
- Data is flushed to file
- If replication is on, after primary data is written to wire
The basic idea is that, when Worker
is under high memory pressure, slow down or stop income data, and at same
time force flush to release memory.
Back Pressure
Back Pressure
defines three watermarks:
Pause Receive
watermark (defaults to 0.85). If used direct memory ratio exceeds this,Worker
will pause receiving data fromShuffleClient
, and force flush buffered data into file.Pause Replicate
watermark (defaults to 0.95). If used direct memory ratio exceeds this,Worker
will pause receiving both data fromShuffleClient
and replica data from primaryWorker
, and force flush buffered data into file.Resume
watermark (defaults to 0.7). When eitherPause Receive
orPause Replicate
is triggered, to resume receiving data fromShuffleClient
, the used direct memory ratio should decrease under this watermark.
Worker
high-frequently checks used direct memory ratio, and triggers Pause Receive
, Pause Replicate
and Resume
accordingly. The state machine is as follows:
Back Pressure
is the basic traffic control and can't be disabled. Users can tune the three watermarks through the
following configuration.
celeborn.worker.directMemoryRatio*
Congestion Control
Congestion Control
is an optional mechanism for traffic control, the purpose is to slow down the push rate
from ShuffleClient
when memory is under pressure, and suppress those who occupied the most resources in the
last time window. It defines two watermarks:
Low Watermark
, under which everything goes OKHigh Watermark
, when exceeds this, top users will be Congestion Controlled
Celeborn uses UserIdentifier
to identify users. Worker
collects bytes pushed from each user in the last time
window. When used direct memory exceeds High Watermark
, users who occupied more resources than the average
occupation will receive Congestion Control
message.
ShuffleClient
controls the push ratio in a fashion that is very like TCP Congestion Control
. Initially, it's in
Slow Start
phase, with a low push rate but increases very fast. When threshold is reached, it transfers to
Congestion Avoidance
phase, which slowly increases push rate. Upon receiving Congestion Control
, it goes back
to Slow Start
phase.
Congestion Control
can be enabled and tuned by the following configurations:
celeborn.worker.congestionControl.*