[Bug] BE write path wedges: a BE-to-BE load-stream brpc socket (8060) goes "Broken" and is never revived (4.0.6, coupled mode) #64708

Description

@ojalberts-itc

Search before asking

  • I had searched in the issues and found no similar issues.

Version

We are on Apache Doris 4.0.6 GA (x86_64, AVX2). The running build string reads doris-4.0.6-rc02,
which is misleading -- it is the GA, not a hand-picked release candidate. The official
apache-doris-4.0.6-bin-x64.tar.gz was cut from the 4.0.6-rc02 tag and the embedded build string
was never bumped to drop the -rc02 suffix. We verified the artifact two independent ways so this
is not dismissed as an RC:

  1. Build-commit match. Our binary reports commit 1663f25c16f; the Apache Doris 4.0.6 release is
    commit 1663f25. A binary can only embed that commit if it was built from that exact tree.
  2. Cross-mirror sha512 match. apache-doris-4.0.6-bin-x64.tar.gz.sha512 is byte-identical on the
    official release host and the mirror our deploy pulls from:
    8f869c4399088d3dc34e5ade10047495e42c7c0583fb32156adaf0794a56e5942b8c0142c05fc145d58d4148daf0ee8d0dde73c9aab0224f39b2435f406c8ef8.
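
The cross-mirror check in step 2 can be repeated locally against the downloaded tarball. The sketch below is illustrative only: the helper names are ours, and it assumes the published .sha512 file starts with the hex digest as its first whitespace-separated token.

```python
import hashlib


def sha512_of_file(path, chunk_size=1 << 20):
    """Stream the file through SHA-512 so a multi-GB tarball is not loaded into memory."""
    digest = hashlib.sha512()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_sha512_file(text):
    """Assumption: the digest is the first whitespace-separated token of the .sha512 file."""
    return text.split()[0].lower()


def matches_published(path, published_text):
    """True when the local artifact hashes to the published digest."""
    return sha512_of_file(path) == parse_sha512_file(published_text)
```

Running matches_published against the .sha512 text from both the official host and the deploy mirror should give True twice if the artifact is the GA tarball.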

MySQL-wire version reported: 5.7.99. We have not yet tested 4.1.x.

What's Wrong?

On a coupled-mode 4.0.6 cluster, the BE write path wedges. INSERT / CREATE TABLE AS SELECT /
MV-refresh statements hang and then fail with "failed to write enough replicas N/M ... due to
connection errors", while every node still reports Alive=true and reads keep working. Once wedged,
only a full restart of all BE processes recovers it -- a single-BE restart does not.

We chased this for a full day with live instrumentation and found three distinct causes, not one.
Two we have handled ourselves: one was our own bug and is fixed, the other is mitigated. The third
is the reason for this report: a BE-to-BE load-stream brpc socket on port 8060 goes "Broken" and is
never revived, and we cannot fix it from the 4.0.6 config surface. We believe it is an upstream
defect of the Apache brpc #1168 class.

| # | Cause | Trigger | Status |
| --- | --- | --- | --- |
| 1 | Our security group was missing a BE-to-BE self-ingress on port 8040 (webserver_port, clone snapshot download), so clone REPAIR could never complete and the FE ran an unbounded repair-clone storm | replica repair / single-BE restart | Fixed -- our IaC bug, not Doris |
| 2 | brpc load-stream-open stall on 8060 under heavy multi-replica load | heavy multi-replica INSERT | Mitigated with experimental_enable_single_replica_insert=true |
| 3 | A BE-to-BE load-stream brpc socket (8060) goes "Broken" and is never revived | accumulation of ~6--7 BE-to-BE stream opens | Open -- suspected upstream defect, this report |

Cause #1 was our mistake. We include it so it is clear the residual (cause #3) is independent of it:
after we fixed the security group and clones completed cleanly, cause #3 still reproduces.

The bug (cause #3)

On a healthy cluster, repl=3 writes succeed. After roughly 6--7 successful BE-to-BE load-stream-open
operations, a specific brpc socket on the BE-to-BE load-stream path (8060) enters a state where the
next load-stream-open RPC to the affected peer parks until the RPC timeout (~534 s) and then fails,
taking down the whole write path. The socket is never revived. Reads and the Thrift heartbeat (9050,
a separate threadpool) stay healthy the entire time, so SHOW BACKENDS shows Alive=true
throughout.

Signature (verbatim, BE be.WARNING and FE fe.log)

Coordinator-side open failure -- 60s is the tablet_writer_open_rpc_timeout_sec default:

load_stream_stub.cpp:591 open stream failed: [INTERNAL_ERROR]Failed to connect to backend <id>:
  [E1008]Reached timeout=60000ms @10.0.0.105:8060

The long park -- the RPC itself stalls ~534s before failing:

brpc_closure.h RPC meet failed: [E1008]Reached timeout=533999ms @10.0.0.105:8060
brpc_closure.h RPC meet failed: [E1008]Reached timeout=533999ms @10.0.0.118:8060
brpc_closure.h RPC meet failed: [E1008]Reached timeout=533999ms @10.0.0.155:8060

Loopback proof that this is in-process, not network or security group -- a BE times out opening a
tablet-writer to its own 8060, and cancels a load-stream whose source and destination are the same
BE:

load_id=..., txn_id=6, node=10.0.0.118:8060, open failed, err: ... RPC call is timed out,
  error_text=[E1008]Reached timeout=60000ms @10.0.0.118:8060, host: 10.0.0.118

load_stream_stub ... src_id=...499, dst_id=...499, stream_id=1740 is cancelled ...
  write enough replicas 1/3
brpc_client_cache.h:326 open brpc connection to 10.0.0.105:8060 failed:
  [E1008]Reached timeout=60000ms

User-facing FE error:

failed to open DeltaWriter <id>: failed to write enough replicas 1/3 for tablet <id>
  due to connection errors

At the original bring-up wedge the same error read ... 0/1 ....
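
To see which peers and which timeout values dominate, the [E1008] lines above can be tallied from be.WARNING. This is a minimal sketch; the regular expression is derived only from the log lines quoted in this report, and the function names are ours.

```python
import re
from collections import Counter

# Matches "[E1008]Reached timeout=60000ms @10.0.0.105:8060"; the "@host:port"
# suffix is optional because some lines wrap before it.
E1008 = re.compile(
    r"\[E1008\]Reached timeout=(?P<ms>\d+)ms(?: @(?P<host>[\d.]+):(?P<port>\d+))?"
)


def parse_e1008(line):
    """Return (timeout_ms, host, port) for an E1008 line, or None if it is not one."""
    match = E1008.search(line)
    if match is None:
        return None
    port = int(match["port"]) if match["port"] else None
    return int(match["ms"]), match["host"], port


def tally_e1008(lines):
    """Count E1008 timeouts per (timeout_ms, host, port)."""
    return Counter(p for p in map(parse_e1008, lines) if p is not None)
```

A tally dominated by 533999 ms entries points at the long park, while 60000 ms entries are the open-RPC timeouts.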

In-process capture at a live wedge (pstack + bvar)

We captured the in-process BE state during a live wedge (2026-06-22, doris-4.0.6-rc02, commit
1663f25c16f), before the recovery restart. Full dumps are attached.

A write thread (gstack, BE A) is parked in the brpc load-stream OPEN -- the V2 path:

bthread_id_join
 -> brpc::Channel::CallMethod
 -> doris::FailureDetectChannel::CallMethod        be/src/util/brpc_client_cache.h:121
 -> doris::LoadStreamStub::open                    be/src/vec/sink/load_stream_stub.cpp:195   (txn_id=4442, total_streams=2, idle_timeout_ms=30000)
 -> doris::LoadStreamStubs::open                   be/src/vec/sink/load_stream_stub.cpp:574
 -> doris::vectorized::VTabletWriterV2::_open_streams_to_backend   be/src/vec/sink/writer/vtablet_writer_v2.cpp:317
 -> doris::vectorized::VTabletWriterV2::_open_streams              vtablet_writer_v2.cpp:296
 -> doris::vectorized::VTabletWriterV2::open                       vtablet_writer_v2.cpp:272
 -> doris::vectorized::AsyncResultWriter::process_block            be/src/vec/sink/writer/async_result_writer.cpp:119
 -> doris::vectorized::AsyncResultWriter::start_writer             async_result_writer.cpp:105
 -> doris::ThreadPool::dispatch_thread

On another BE the same root appears via the V1 writer path -- VNodeChannel::open_wait
(vtablet_writer.cpp:704) -> bthread_id_join. Both are parked on the brpc load-stream OPEN RPC to
a peer backend.

It is not worker-pool exhaustion, not compaction, not a stub leak -- brpc /vars (8060) and
/metrics (8040) at the wedge:

| BE | bthread_worker_usage / count | load_channel_count | tablet_writer_count | brpc_stream_endpoint_stub_count | compaction (base+cumulative) |
| --- | --- | --- | --- | --- | --- |
| A | 0.20 / 256 | 2 | 8 | 4 | 0 |
| B | 54.6 / 256 | 3 | 9 | 4 | 0 |

Workers are nowhere near the 256 ceiling -- the write threads are parked on the RPC, not starved.
Load channels and tablet writers are open and stuck; the stub count is the steady-state 4 (no leak);
compaction is fully idle. rpcz was empty (off by default; :8060/rpcz/enable did not enable it at
runtime on this build), so the parked-RPC evidence is the gstack above.

load_stream_stub cancellations appear across all BEs for the same load.

Trigger: accumulation, not a timer or idle decay

  • Across two instrumented runs the wedge fired after 7 successful repl=3 write operations in one
    run and after 6 in the other. It tracks the number of BE-to-BE load-stream opens, not a
    wall-clock interval.
  • It fires both during a heavy multi-replica load and ~7--11 minutes after a load while the cluster
    is otherwise idle (no further writes issued).
  • A restart followed by 60 minutes of pure idle with no load did not wedge. So it is load-induced,
    not idle decay.
  • brpc_stream_endpoint_stub_count stayed at 4 across the wedge -- no stub-count leak. It is a
    specific socket going Broken, not stub exhaustion.

Recovery

A full restart of all BE processes clears it. A single-BE restart does not -- the rejoined BE's peers
still hold the broken stub, so it rejoins a wedged mesh.

What You Expected?

When a BE-to-BE load-stream brpc connection breaks, brpc should revive it (or the load-stream-open
RPC should fail fast and the channel reconnect), so the write path recovers on its own. Instead the
open RPC parks ~534s and the entire write path wedges while every node still reports Alive=true, and
only a full BE-fleet restart clears it. A single broken socket should not require dropping all BE
processes to recover.

How to Reproduce?

  1. Coupled-mode 4.0.6 cluster, 3 FE + 4 BE, default replication 3, stock be.conf.
  2. Create native UNIQUE-KEY merge-on-write tables, replication_num=3,
    DISTRIBUTED BY HASH(...) BUCKETS 16.
  3. Run a sequence of multi-replica writes that each open BE-to-BE load streams -- repeated
    INSERT ... SELECT / CREATE TABLE AS SELECT of a few million rows. In our case, four such loads
    plus a handful of UPDATE ... FROM statements per cycle.
  4. After ~6--7 such operations -- during the load, or within ~7--11 minutes after -- a write hangs and
    then fails with "write enough replicas N/3 ... connection errors". be.WARNING shows
    [E1008]Reached timeout ... @<be>:8060, including a loopback @<self>:8060.
  5. SELECT 1 and SHOW BACKENDS (Alive=true) keep working. Only a full BE restart recovers.

We have not reduced this to a minimal standalone reproducer; it reproduces reliably under our normal
multi-replica load. We will run a targeted reproducer if you suggest one.

Anything Else?

Search / prior art

We searched the issue tracker, the load_stream / move-memtable PR history, the 4.1.x changelogs,
and community forums (English and Chinese) before filing, and found no exact match for the full
signature. The closest structural match is Apache brpc #1168 -- after a downstream node fault the
upstream socket enters a "Broken" state and the health check never revives it; recovery requires
restarting the upstream. Adjacent load-stream lifecycle fixes already in 4.0.6: #34883,
#39231 / #39762, #60148, #60285. Possibly related and unconfirmed for 4.0.x: #56120 ("close brpc
stream after load stream is closed"). If a maintainer recognises this as known or already fixed, a
pointer to the PR is the fastest resolution.

Environment

  • Mode: coupled (storage-compute together), FE + BE only. No FoundationDB / Meta Service / Recycler /
    S3 storage vault. Native tablet data lives on BE-local EBS.
  • Topology: 3 FE (HA followers) + 4 BE. Each node 8 vCPU / 64 GiB RAM.
  • BE storage: one dedicated 500 GB gp3 volume per BE, xfs (noatime,nodiratime), mounted
    /var/lib/doris/storage, gp3 baseline 3000 IOPS / 125 MiB/s.
  • OS / JDK: Amazon Linux 2023, Amazon Corretto 17.
  • Replication: Doris default 3, across 4 BEs.
  • Workload: read Apache Iceberg through a Glue / S3 external catalog, then write the aggregated
    result into native Doris UNIQUE-KEY merge-on-write tables -- CREATE TABLE AS SELECT,
    INSERT ... SELECT, and a few UPDATE ... FROM statements. About 5M rows per table, 4 tables.
  • be.conf is effectively stock. The only non-default overrides are mem_limit = 80%,
    storage_root_path, and priority_networks = <self>/32. No brpc / clone / timeout tuning was set
    initially. Full dump at the end.

What we ruled out, with positive evidence

All of the environment-layer suspects below were tested directly, while wedged (raw probes on
2026-06-22).

| Hypothesis | Verdict | Evidence |
| --- | --- | --- |
| TCP / network / security group / routing on 8060 | Ruled out | Raw TCP (/dev/tcp) to :8060 is OPEN to the peer and over loopback to self on all 4 BEs while the brpc RPC on the same port times out; the listener is healthy (LISTEN 0 1024 0.0.0.0:8060). A brpc call failing to its own loopback :8060 while raw TCP to that port succeeds cannot be network/SG/routing. |
| Host firewall (iptables / nftables / firewalld) | Ruled out | All 4 BEs: iptables 0 non-policy rules (default-ACCEPT), ip6tables 0, nft ruleset empty, firewalld inactive/absent. No host firewall exists. |
| SELinux | Ruled out | getenforce = Permissive on all 4 (policy targeted, mode permissive) -- it logs but cannot block. |
| ENA bandwidth throttle | Ruled out | bw_in/out_allowance_exceeded are non-zero cumulative but Δ=0 over a 50s sample during the idle wedge (they moved only during the loads); pps_allowance_exceeded=0, conntrack_allowance_exceeded=0. No active throttle while wedged. |
| conntrack / ephemeral ports | Ruled out | nf_conntrack module not loaded; ~53--60 of ~28k ephemeral ports used, 3 TIME-WAIT. Neither is exhausted. |
| Kernel / OOM / packet drops | Ruled out | dmesg / journalctl -k show no drop/deny/reject/oom/conntrack/throttle lines for the window. |
| Deployment / OS-tuning misconfig | Ruled out | Our install sets all Doris-required kernel tuning (vm.max_map_count=2000000 -- live-confirmed, swap off, nofile 655350, THP madvise) and runs start_be.sh's preflight, which the official apache/doris container deployment skips (SKIP_CHECK_ULIMIT=true). The official FE/BE images add no brpc/network/timeout config we lack -- only priority_networks. So it is not a deployment misconfiguration. |
| Compaction / merge-on-write delete-bitmap publish | Ruled out | Captured live at the wedge: every compaction metric is 0 on all 4 BEs -- doris_be_compaction_task_state_total{base,cumulative}=0, doris_be_disks_compaction_score=0, doris_be_compaction_used_permits=0, doris_be_compaction_waitting_permits=0, doris_be_load_channel_count=0, doris_be_tablet_writer_count=0. |
| Resource exhaustion (CPU / memory / IO) | Ruled out | At the wedge the BEs are near-idle: load avg ~0.0--0.09, ~55--60 GB RAM free, doris_be at 2--3% CPU. EBS volumes idle (VolumeReadOps=0, under 1 write IOPS, VolumeQueueLength ~0). |
| BE soft memory limit / flush back-pressure | Ruled out | Workload-group total_mem_used 0--158 MB against an ~53 GB limit; zero memory-exceed or MemoryGc-cancel lines. Memory would climb if flush stalled. |
| Crash / auto-restart / kernel OOM | Ruled out | NRestarts=0, single MainPID for the whole window on every BE; dmesg and journalctl -k empty for the window. |
| Replication factor (repl=3 itself) | Ruled out | A fresh-cluster full 4-table build at repl=3 completed cleanly and sustained, 0 errors. Earlier "repl=3 triggers it" readings were confounded by clusters already degraded by prior single-BE-restart experiments. |
| BE thread-pool exhaustion | Not the cause | No BE thread pool is pegged at the wedge: EvHttpServer at pool size 128, pipeline schedulers at normal 8/16, no compaction or memtable pool active. |
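
The raw-TCP probe from the first row can also be scripted instead of using /dev/tcp. The following is a rough equivalent, not the exact probe we ran; the function name and the default timeout are illustrative.

```python
import socket


def tcp_port_open(host, port, timeout_s=3.0):
    """Return True if a plain TCP connect to host:port succeeds within timeout_s.

    A True result only proves the listener accepts connections at the TCP
    layer; it says nothing about whether brpc dispatches RPCs on that socket.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:  # covers refused, unreachable, and timed-out connects
        return False
```

During the wedge this kind of probe returns True for every peer and for loopback on :8060 while the brpc open RPC on the same port still times out, which is why a TCP-level check is not sufficient as a writability signal.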

The one mechanism consistent with all of this is a brpc load-stream socket going Broken and never
being revived: raw TCP to :8060 connects (peer and loopback) while every brpc RPC on it times out;
Doris's own health check evicts the stub (remove brpc stub from cache) and recreates it, and the
new stub still times out; the errors are connect/open timeouts (never "Connection refused" or
"reset"); and it clears only when the process is dropped. That is the Apache brpc #1168 class.

Config we tried that did not fix it

| Setting | Where | Result |
| --- | --- | --- |
| enable_brpc_connection_check = true | be.conf, immutable, rolling restart | No effect. This is the mechanism that should periodically check brpc connections and close/recreate broken ones (brpc_connection_check_timeout_ms = 10s default), but it did not revive the broken load-stream socket. Wedged again at +8 minutes. Kept as general hardening. |
| experimental_enable_single_replica_insert = true | FE global var | Partial and unreliable. Loads write one replica and clone the rest, so a single load completes instead of hanging, but the idle wedge still fires afterward and a later load still hung despite the setting. |

We did not raise tablet_writer_open_rpc_timeout_sec or brpc_socket_max_unwritten_bytes beyond
defaults, because those mask the symptom -- a longer park -- rather than revive the socket. If you
believe a specific brpc knob is the fix, we will test it.

The two causes we handled ourselves

We are listing these so it is clear the residual is isolated, and because one of them was our
mistake and we would rather name it than route around it.

  • Cause #1, our security-group bug -- fixed. Our BE security group self-referenced 8060 (brpc) and
    9060 (be_port) but not 8040 (webserver_port, the HTTP port used for clone snapshot download
    between BEs). Loads over brpc 8060 worked, but clone REPAIR over
    http://<be>:8040/api/_tablet/_download timed out
    ([HTTP_ERROR]Connection timed out after 15000 milliseconds), so missing replicas never healed
    and the FE ran an unbounded VERY_HIGH repair-clone storm that saturated the BEs. Adding the 8040
    BE-to-BE self-ingress rule fixed it: clones finish, drain to 0, replicas heal. This was our
    infrastructure error, not a Doris defect. We mention it only because, once fixed, cause #3 still
    reproduces -- which proves cause #3 is independent of it.
  • Cause #2, load-stream-open stall under heavy multi-replica load -- mitigated. Distinct from the
    8040 clone path; this is on the 8060 write path. Mitigated, not cured, by
    experimental_enable_single_replica_insert.

Detection and the workaround we run today

  • Detection. The Thrift heartbeat (9050) runs on a separate threadpool from the brpc write path
    (8060), so SHOW BACKENDS ... Alive=true is not a writability signal -- it stayed green for ~2.5
    hours while every write was dead. We added a write-readiness canary, a small bounded INSERT over
    the 8060 path, to our health check, and a wedge now surfaces in seconds instead of hours.
  • Recovery. A full BE-fleet restart. A single-BE restart does not clear it.
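
The write-readiness canary amounts to a time-bounded write. A minimal sketch of the idea follows; run_insert is a hypothetical callable that performs one small INSERT through whatever MySQL-protocol client is in use, and the 10-second bound is an illustrative value, not our production setting.

```python
import threading


def write_canary(run_insert, timeout_s=10.0):
    """Run one small write with a hard time bound.

    Returns "ok" if the write finished, "failed" if it raised, and "wedged"
    if it was still running when the bound expired.
    """
    outcome = {}

    def worker():
        try:
            run_insert()
            outcome["ok"] = True
        except Exception as exc:  # any client or server error counts as a failure
            outcome["ok"] = False
            outcome["error"] = repr(exc)

    # Daemon thread: a parked INSERT must not keep the health check process alive.
    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout_s)
    if thread.is_alive():
        return "wedged"
    return "ok" if outcome.get("ok") else "failed"
```

The separate "wedged" result matters because a hung write raises no error by itself; it has to be detected by the time bound.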

Code-level analysis (Doris 4.0.6, bundled brpc 1.4.0)

We traced the captured stacks/logs into the 4.0.6 source. The load-bearing finding, from the
target BE's be.INFO during the wedge:

  • PInternalService::open_load_stream logs "open load stream, load_id=..." (internal_service.cpp:416)
    as the first line of the handler. During the wedge there were 0 such handler-entry lines on the
    target BEs in the wedge window, versus 1700+ historically -- while the BE worker pools sat
    idle (pstack: threads parked in blocking_get, not saturated; a saturated pool would fail
    try_offer fast, not time out at 60s).
  • So the inbound open_load_stream RPC never reaches the Doris service handler. Combined with raw
    TCP to :8060 being OPEN, the stall is between TCP-accept and service-dispatch -- inside brpc
    1.4.0, below Doris's load-stream code. Doris's handler is not the stall point; it is never
    entered.
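
The handler-entry count can be reproduced from be.INFO with a short script. The sketch below assumes glog-style line prefixes (a severity letter followed by YYYYMMDD HH:MM:SS), which is an assumption about the log format and should be adjusted if the prefix differs; the function name is ours.

```python
from datetime import datetime

# Handler-entry marker logged by PInternalService::open_load_stream.
MARKER = "open load stream, load_id="


def count_handler_entries(lines, start, end):
    """Count handler-entry lines whose timestamp falls within [start, end]."""
    count = 0
    for line in lines:
        if MARKER not in line:
            continue
        try:
            # Assumed prefix: one severity character, then "YYYYMMDD HH:MM:SS".
            stamp = datetime.strptime(line[1:18], "%Y%m%d %H:%M:%S")
        except ValueError:
            continue  # skip lines that do not carry the assumed prefix
        if start <= stamp <= end:
            count += 1
    return count
```

A count of 0 on the target BE for the wedge window, against a non-zero count for a comparable healthy window, is the signal described in the first bullet.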

We did not pin the exact brpc 1.4.0 line -- it is in the bundled submodule
(thirdparty/vars.sh, apache/brpc tag 1.4.0), and the runtime probe that would pin it (brpc
rpcz / socket bvars) could not be enabled at runtime on this build. One secondary, non-root nuance
we found: FailureDetectChannel invalidates a cached channel only on EHOSTDOWN, not on a timeout
(brpc_client_cache.h:80,125) -- but we captured 249 EHOSTDOWN (Host is down) and channel
rebuilds happened anyway and did not recover the wedge, so that is at most a hardening suggestion, not
the cause.

What we have captured and what else we can provide

We have captured the in-process state at a live wedge. Attached:

  • gstack thread dumps of doris_be on the two BEs with parked write threads (full ~1747-thread
    dumps),
  • brpc /vars (94 KB) and /metrics from each, showing the worker / load-channel / compaction
    state,
  • be.WARNING tails with the [E1008] open failures and the FailureDetectChannel probe failures.

We could not get rpcz -- it is off by default and :8060/rpcz/enable did not enable it at runtime
on this build. If there is a flag or build option to turn rpcz on, tell us and we will capture it. We
can also pull a full gdb -p thread apply all bt, more specific brpc bvars, or FE-side state on
request.

One caveat on timing. This is a dev POC and we are moving on with our implementation, so the cluster
will not stay up indefinitely. The reproducer, the captures above, and any candidate-build testing
are only available while the cluster is still running -- so the sooner we can act on this, the
better.

Questions for the maintainers

  1. Our evidence says the open_load_stream RPC never reaches the server handler (0 handler-entry logs,
    idle pools) while raw TCP to :8060 is open -- consistent with a brpc 1.4.0 socket/stream that is
    accepted at TCP but never dispatched, and never revived (the brpc #1168 class). Is this a known
    brpc 1.4.0 defect on the load-stream path, and is there a fixing PR or a brpc version that resolves
    it?
  2. Is the load-stream brpc_client_cache expected to revive a Broken socket automatically? In our
    capture it never did, and enable_brpc_connection_check=true did not help. Is that the intended
    recovery path, and should it have recovered the socket?
  3. Is there a supported config that makes a Broken load-stream socket fail fast and reconnect, rather
    than park ~534s on the open RPC?
  4. Is there evidence that 4.1.x (4.1.2 specifically) contains a relevant brpc / load-stream fix? We
    will run the upgrade test -- restart one BE, drive the load sequence, watch the canary -- and
    report back.

Appendix -- config

be.conf (stock defaults plus these managed overrides only):

JAVA_HOME = /usr/lib/jvm/java-17-amazon-corretto.x86_64
storage_root_path = /var/lib/doris/storage
priority_networks = <node_private_ip>/32
mem_limit = 80%
be_port = 9060            # shipped default
webserver_port = 8040     # shipped default
heartbeat_service_port = 9050   # shipped default
brpc_port = 8060          # shipped default
# added later as hardening; did NOT fix the wedge:
enable_brpc_connection_check = true

fe.conf (stock defaults plus these overrides only):

JAVA_HOME = /usr/lib/jvm/java-17-amazon-corretto.x86_64
meta_dir = /var/lib/doris/fe-meta
priority_networks = <node_private_ip>/32
http_port = 8030          # shipped default
rpc_port = 9020           # shipped default
query_port = 9030         # shipped default
edit_log_port = 9010      # shipped default
# FE global var set at runtime; mitigates cause #2, not cause #3:
experimental_enable_single_replica_insert = true

Table shape (representative):

CREATE TABLE evo_persons (
  identity_hash    varchar(32) NOT NULL,
  id_numbers_hash  varchar(32) NOT NULL,
  ...                          -- aggregated attribute and counter columns
)
UNIQUE KEY(identity_hash, id_numbers_hash)
DISTRIBUTED BY HASH(identity_hash) BUCKETS 16
PROPERTIES ('replication_num'='3', 'enable_unique_key_merge_on_write'='true');

Ports:

| Port | Service | At the wedge |
| --- | --- | --- |
| 8060 | brpc (tablet-writer / load-stream OPEN) | timed out, all directions including loopback |
| 8040 | webserver (clone snapshot download) | timed out until our security-group fix (cause #1); fine after |
| 9050 | Thrift heartbeat (separate threadpool) | stayed responsive, so SHOW BACKENDS showed Alive=true |

wedge.10.0.0.105.tar.gz
wedge.10.0.0.118.tar.gz
wedge.10.0.0.155.tar.gz
wedge.10.0.0.229.tar.gz

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct
