Search before asking
I had searched in the issues and found no similar issues.
Version
We are on Apache Doris 4.0.6 GA (x86_64, AVX2). The running build string reads doris-4.0.6-rc02,
which is misleading -- it is the GA, not a hand-picked release candidate. The official apache-doris-4.0.6-bin-x64.tar.gz was cut from the 4.0.6-rc02 tag and the embedded build string
was never bumped to drop the -rc02 suffix. We verified the artifact two independent ways so this
is not dismissed as an RC:
Build-commit match. Our binary reports commit 1663f25c16f; the Apache Doris 4.0.6 release is
commit 1663f25. A binary can only embed that commit if it was built from that exact tree.
Cross-mirror sha512 match. apache-doris-4.0.6-bin-x64.tar.gz.sha512 is byte-identical on the
official release host and the mirror our deploy pulls from: 8f869c4399088d3dc34e5ade10047495e42c7c0583fb32156adaf0794a56e5942b8c0142c05fc145d58d4148daf0ee8d0dde73c9aab0224f39b2435f406c8ef8.
MySQL-wire version reported: 5.7.99. We have not yet tested 4.1.x.
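For anyone who wants to repeat the artifact check locally, here is a minimal sketch that hashes a downloaded tarball and compares it with the contents of the published .sha512 file. It assumes the .sha512 file uses the common "<hex digest>  <filename>" layout; the function names are illustrative and not part of any Doris tooling.

```python
import hashlib


def sha512_of(path, chunk_size=1 << 20):
    """Stream the file through SHA-512 so a large tarball is never held in memory."""
    digest = hashlib.sha512()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def expected_digest(sha512_file_text):
    """Return the first whitespace-separated token of the .sha512 file.

    Assumption: the file looks like '<hex digest>  <filename>'.
    """
    return sha512_file_text.split()[0].lower()


def artifact_matches(tarball_path, sha512_file_text):
    """True when the local tarball hashes to the published digest."""
    return sha512_of(tarball_path) == expected_digest(sha512_file_text)
```

Calling artifact_matches with the path to apache-doris-4.0.6-bin-x64.tar.gz and the text of its .sha512 file complements the cross-mirror comparison described above: it ties the deployed bytes, not just the checksum files, to the release.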
What's Wrong?
On a coupled-mode 4.0.6 cluster, the BE write path wedges. INSERT / CREATE TABLE AS SELECT /
MV-refresh hang and then fail with failed to write enough replicas N/M ... due to connection errors, while every node still reports Alive=true and reads keep working. Once wedged, only a
full restart of all BE processes recovers it -- a single-BE restart does not.
We chased this for a full day with live instrumentation and found three distinct causes, not one.
Two we handled ourselves: one was our own bug and is fixed, the other is mitigated. The third is the
reason for this report: a BE-to-BE load-stream brpc socket on port 8060 goes "Broken" and is never
revived, and we cannot fix it from the 4.0.6 config surface. We believe it is an upstream defect of
the Apache brpc #1168 class.
| # | Cause | Trigger | Status |
|---|---|---|---|
| 1 | Our security group was missing a BE-to-BE self-ingress on port 8040 (webserver_port, clone snapshot download), so clone REPAIR could never complete and the FE ran an unbounded repair-clone storm | replica repair / single-BE restart | Fixed -- our IaC bug, not Doris |
| 2 | brpc load-stream-open stall on 8060 under heavy multi-replica load | heavy multi-replica INSERT | Mitigated with experimental_enable_single_replica_insert=true |
| 3 | A BE-to-BE load-stream brpc socket (8060) goes "Broken" and is never revived | accumulation of ~6--7 BE-to-BE stream opens | Open -- suspected upstream defect, this report |
Cause #1 was our mistake. We include it so it is clear the residual (cause #3) is independent of it:
after we fixed the security group and clones completed cleanly, cause #3 still reproduces.
The bug (cause #3)
On a healthy cluster, repl=3 writes succeed. After roughly 6--7 successful BE-to-BE load-stream-open
operations, a specific brpc socket on the BE-to-BE load-stream path (8060) enters a state where the
next load-stream-open RPC to the affected peer parks until the RPC timeout (~534 s) and then fails,
taking down the whole write path. The socket is never revived. Reads and the Thrift heartbeat (9050,
a separate threadpool) stay healthy the entire time, so SHOW BACKENDS shows Alive=true
throughout.
Signature (verbatim, BE be.WARNING and FE fe.log)
Coordinator-side open failure -- 60s is the tablet_writer_open_rpc_timeout_sec default:
load_stream_stub.cpp:591 open stream failed: [INTERNAL_ERROR]Failed to connect to backend <id>:
[E1008]Reached timeout=60000ms @10.0.0.105:8060
The long park -- the RPC itself stalls ~534s before failing:
Loopback proof that this is in-process, not network or security group -- a BE times out opening a
tablet-writer to its own 8060, and cancels a load-stream whose source and destination are the same
BE:
load_id=..., txn_id=6, node=10.0.0.118:8060, open failed, err: ... RPC call is timed out,
error_text=[E1008]Reached timeout=60000ms @10.0.0.118:8060, host: 10.0.0.118
load_stream_stub ... src_id=...499, dst_id=...499, stream_id=1740 is cancelled ...
write enough replicas 1/3
brpc_client_cache.h:326 open brpc connection to 10.0.0.105:8060 failed:
[E1008]Reached timeout=60000ms
User-facing FE error:
failed to open DeltaWriter <id>: failed to write enough replicas 1/3 for tablet <id>
due to connection errors
At the original bring-up wedge the same error read ... 0/1 ....
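To pick the loopback cases out of a long be.WARNING tail, a small parser along these lines may help. It is a sketch that assumes each log record keeps the "[E1008]Reached timeout=...ms @ip:port" text and, where present, the trailing "host: ip" field on a single line, as in the excerpts above; the function name and the returned keys are hypothetical.

```python
import re

# Patterns taken from the excerpts above; anything else about the log layout is an assumption.
E1008 = re.compile(r"\[E1008\]Reached timeout=(?P<ms>\d+)ms @(?P<peer>[\d.]+):(?P<port>\d+)")
HOST = re.compile(r"host: (?P<host>[\d.]+)")


def classify(line):
    """Return details for an [E1008] timeout line, or None for any other line.

    'loopback' is True only when the reporting host equals the target peer,
    i.e. the BE timed out talking to its own brpc port.
    """
    hit = E1008.search(line)
    if hit is None:
        return None
    reporter = HOST.search(line)
    reporter_ip = reporter.group("host") if reporter else None
    return {
        "timeout_ms": int(hit.group("ms")),
        "peer": hit.group("peer"),
        "port": int(hit.group("port")),
        "loopback": reporter_ip is not None and reporter_ip == hit.group("peer"),
    }
```

Lines without a "host:" field are reported with loopback set to False, since the reporting BE cannot be determined from the line alone.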
In-process capture at a live wedge (pstack + bvar)
We captured the in-process BE state during a live wedge (2026-06-22, doris-4.0.6-rc02, commit 1663f25c16f), before the recovery restart. Full dumps are attached.
A write thread (gstack, BE A) is parked in the brpc load-stream OPEN -- the V2 path:
On another BE the same root appears via the V1 writer path -- VNodeChannel::open_wait
(vtablet_writer.cpp:704) -> bthread_id_join. Both are parked on the brpc load-stream OPEN RPC to
a peer backend.
It is not worker-pool exhaustion, not compaction, not a stub leak -- brpc /vars (8060) and /metrics (8040) at the wedge:
| BE | bthread_worker_usage / count | load_channel_count | tablet_writer_count | brpc_stream_endpoint_stub_count | compaction (base+cumulative) |
|---|---|---|---|---|---|
| A | 0.20 / 256 | 2 | 8 | 4 | 0 |
| B | 54.6 / 256 | 3 | 9 | 4 | 0 |
Workers are nowhere near the 256 ceiling -- the write threads are parked on the RPC, not starved.
Load channels and tablet writers are open and stuck; the stub count is the steady-state 4 (no leak);
compaction is fully idle. rpcz was empty (off by default; :8060/rpcz/enable did not enable it at
runtime on this build), so the parked-RPC evidence is the gstack above.
load_stream_stub cancellations appear across all BEs for the same load.
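The worker-usage reading above can be recomputed from a saved /vars dump with a few lines of Python. This is a sketch under two assumptions: the dump is plain text with one "name : value" pair per line, and the worker ceiling is exposed as bthread_worker_count next to bthread_worker_usage. Both should be checked against the actual dump before relying on it.

```python
def parse_vars(text):
    """Parse a brpc /vars text dump into a dict of name -> raw string value.

    Assumption: one 'name : value' pair per line; other lines are ignored.
    """
    values = {}
    for line in text.splitlines():
        name, sep, value = line.partition(" : ")
        if sep:
            values[name.strip()] = value.strip()
    return values


def worker_saturation(vars_map):
    """Fraction of bthread workers in use (0.0 = idle, 1.0 = fully busy)."""
    usage = float(vars_map["bthread_worker_usage"])
    count = float(vars_map["bthread_worker_count"])
    return usage / count
```

With the BE B figures from the table (54.6 of 256), the ratio comes out at about 0.21, which is the basis for saying the pool is not starved.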
Trigger: accumulation, not a timer or idle decay
Across two instrumented runs the wedge fired after 7 successful repl=3 write operations in one run
and after 6 in the other. It tracks the number of BE-to-BE load-stream opens, not a wall-clock interval.
It fires both during a heavy multi-replica load and ~7--11 minutes after a load while the cluster
is otherwise idle (no further writes issued).
A restart followed by 60 minutes of pure idle with no load did not wedge. So it is load-induced,
not idle decay.
brpc_stream_endpoint_stub_count stayed at 4 across the wedge -- no stub-count leak. It is a
specific socket going Broken, not stub exhaustion.
Recovery
A full restart of all BE processes clears it. A single-BE restart does not -- the rejoined BE's peers
still hold the broken stub, so it rejoins a wedged mesh.
What You Expected?
When a BE-to-BE load-stream brpc connection breaks, brpc should revive it (or the load-stream-open
RPC should fail fast and the channel reconnect), so the write path recovers on its own. Instead the
open RPC parks ~534s and the entire write path wedges while every node still reports Alive=true, and
only a full BE-fleet restart clears it. A single broken socket should not require dropping all BE
processes to recover.
How to Reproduce?
Coupled-mode 4.0.6 cluster, 3 FE + 4 BE, default replication 3, stock be.conf.
Run a sequence of multi-replica writes that each open BE-to-BE load streams -- repeated INSERT ... SELECT / CREATE TABLE AS SELECT of a few million rows. In our case, four such loads
plus a handful of UPDATE ... FROM statements per cycle.
After ~6--7 such operations -- during the load, or within ~7--11 minutes after -- a write hangs and
fails write enough replicas N/3 ... connection errors. be.WARNING shows [E1008]Reached timeout ... @<be>:8060, including a loopback @<self>:8060.
SELECT 1 and SHOW BACKENDS (Alive=true) keep working. Only a full BE restart recovers.
We have not reduced this to a minimal standalone reproducer; it reproduces reliably under our normal
multi-replica load. We will run a targeted reproducer if you suggest one.
Anything Else?
Search / prior art
We searched the issue tracker, the load_stream / move-memtable PR history, the 4.1.x changelogs,
and community forums (English and Chinese) before filing, and found no exact match for the full
signature. The closest structural match is Apache brpc #1168 -- after a downstream node fault the
upstream socket enters a "Broken" state and the health check never revives it; recovery requires
restarting the upstream. Adjacent load-stream lifecycle fixes already in 4.0.6: #34883, #39231 / #39762, #60148, #60285. Possibly related and unconfirmed for 4.0.x: #56120 ("close brpc
stream after load stream is closed"). If a maintainer recognises this as known or already fixed, a
pointer to the PR is the fastest resolution.
Environment
Mode: coupled (storage-compute together), FE + BE only. No FoundationDB / Meta Service / Recycler /
S3 storage vault. Native tablet data lives on BE-local EBS.
Topology: 3 FE (HA followers) + 4 BE. Each node 8 vCPU / 64 GiB RAM.
BE storage: one dedicated 500 GB gp3 volume per BE, xfs (noatime,nodiratime), mounted /var/lib/doris/storage, gp3 baseline 3000 IOPS / 125 MiB/s.
OS / JDK: Amazon Linux 2023, Amazon Corretto 17.
Replication: Doris default 3, across 4 BEs.
Workload: read Apache Iceberg through a Glue / S3 external catalog, then write the aggregated
result into native Doris UNIQUE-KEY merge-on-write tables -- CREATE TABLE AS SELECT, INSERT ... SELECT, and a few UPDATE ... FROM statements. About 5M rows per table, 4 tables.
be.conf is effectively stock. The only non-default overrides are mem_limit = 80%, storage_root_path, and priority_networks = <self>/32. No brpc / clone / timeout tuning was set
initially. Full dump at the end.
What we ruled out, with positive evidence
All of the environment-layer suspects below were tested directly, while wedged (raw probes on
2026-06-22).
| Hypothesis | Verdict | Evidence |
|---|---|---|
| TCP / network / security group / routing on 8060 | Ruled out | Raw TCP (/dev/tcp) to :8060 is OPEN to the peer and over loopback to self on all 4 BEs while the brpc RPC on the same port times out; the listener is healthy (LISTEN 0 1024 0.0.0.0:8060). A brpc call failing to its own loopback :8060 while raw TCP to that port succeeds cannot be network/SG/routing. |
| Host firewall (iptables / nftables / firewalld) | Ruled out | All 4 BEs: iptables 0 non-policy rules (default-ACCEPT), ip6tables 0, nft ruleset empty, firewalld inactive/absent. No host firewall exists. |
| SELinux | Ruled out | getenforce = Permissive on all 4 (policy targeted, mode permissive) -- it logs but cannot block. |
| ENA bandwidth throttle | Ruled out | bw_in/out_allowance_exceeded are non-zero cumulative but Δ=0 over a 50s sample during the idle wedge (they moved only during the loads); pps_allowance_exceeded=0, conntrack_allowance_exceeded=0. No active throttle while wedged. |
| conntrack / ephemeral ports | Ruled out | nf_conntrack module not loaded; ~53--60 of ~28k ephemeral ports used, 3 TIME-WAIT. Neither is exhausted. |
| Kernel / OOM / packet drops | Ruled out | dmesg / journalctl -k show no drop/deny/reject/oom/conntrack/throttle lines for the window. |
| Deployment / OS-tuning misconfig | Ruled out | Our install sets all Doris-required kernel tuning (vm.max_map_count=2000000 -- live-confirmed, swap off, nofile 655350, THP madvise) and runs start_be.sh's preflight, which the official apache/doris container deployment skips (SKIP_CHECK_ULIMIT=true). The official FE/BE images add no brpc/network/timeout config we lack -- only priority_networks. So it is not a deployment misconfiguration. |
| Compaction / merge-on-write delete-bitmap publish | Ruled out | Captured live at the wedge: every compaction metric is 0 on all 4 BEs -- doris_be_compaction_task_state_total{base,cumulative}=0, doris_be_disks_compaction_score=0, doris_be_compaction_used_permits=0, doris_be_compaction_waitting_permits=0, doris_be_load_channel_count=0, doris_be_tablet_writer_count=0. |
| Resource exhaustion (CPU / memory / IO) | Ruled out | At the wedge the BEs are near-idle: load avg ~0.0--0.09, ~55--60 GB RAM free, doris_be at 2--3% CPU. EBS volumes idle (VolumeReadOps=0, under 1 write IOPS, VolumeQueueLength ~0). |
| BE soft memory limit / flush back-pressure | Ruled out | Workload-group total_mem_used 0--158 MB against an ~53 GB limit; zero memory-exceed or MemoryGc-cancel lines. Memory would climb if flush stalled. |
| Crash / auto-restart / kernel OOM | Ruled out | NRestarts=0, single MainPID for the whole window on every BE; dmesg and journalctl -k empty for the window. |
| Replication factor (repl=3 itself) | Ruled out | A fresh-cluster full 4-table build at repl=3 completed cleanly and sustained, 0 errors. Earlier "repl=3 triggers it" readings were confounded by clusters already degraded by prior single-BE-restart experiments. |
| BE thread-pool exhaustion | Not the cause | No BE thread pool is pegged at the wedge: EvHttpServer at pool size 128, pipeline schedulers at normal 8/16, no compaction or memtable pool active. |
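The raw-TCP check in the first row was done with /dev/tcp from a shell. An equivalent probe in Python, for anyone who wants to script it across the fleet, might look like the sketch below. It only tests whether the TCP handshake completes; it says nothing about whether brpc will dispatch an RPC on that port, which is exactly the distinction this report relies on.

```python
import socket


def tcp_open(host, port, timeout_s=3.0):
    """Return True if a plain TCP connection to host:port can be established.

    This checks the kernel-level handshake only; no brpc request is sent.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False
```

Running tcp_open(be_ip, 8060) from every BE against every BE, including itself, reproduces the peer and loopback matrix described in the table.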
The one mechanism consistent with all of this is a brpc load-stream socket going Broken and never
being revived: raw TCP to :8060 connects (peer and loopback) while every brpc RPC on it times out;
Doris's own health check evicts the stub (remove brpc stub from cache) and recreates it, and the
new stub still times out; the errors are connect/open timeouts (never "Connection refused" or
"reset"); and it clears only when the process is dropped. That is the Apache brpc #1168 class.
Config we tried that did not fix it
| Setting | Where | Result |
|---|---|---|
| enable_brpc_connection_check = true | be.conf, immutable, rolling restart | No effect. This is the mechanism that should periodically check brpc connections and close/recreate broken ones (brpc_connection_check_timeout_ms = 10s default), but it did not revive the broken load-stream socket. Wedged again at +8 minutes. Kept as general hardening. |
| experimental_enable_single_replica_insert = true | FE global var | Partial and unreliable. Loads write one replica and clone the rest, so a single load completes instead of hanging, but the idle wedge still fires afterward and a later load still hung despite the setting. |
We did not raise tablet_writer_open_rpc_timeout_sec or brpc_socket_max_unwritten_bytes beyond
defaults, because those mask the symptom -- a longer park -- rather than revive the socket. If you
believe a specific brpc knob is the fix, we will test it.
The two causes we fixed ourselves
We are listing these so it is clear the residual is isolated, and because one of them was our
mistake and we would rather name it than route around it.
Cause #1, our security-group bug -- fixed. Our BE security group self-referenced 8060 (brpc) and
9060 (be_port) but not 8040 (webserver_port, the HTTP port used for clone snapshot download
between BEs). Loads over brpc 8060 worked, but clone REPAIR over http://<be>:8040/api/_tablet/_download timed out
([HTTP_ERROR]Connection timed out after 15000 milliseconds), so missing replicas never healed
and the FE ran an unbounded VERY_HIGH repair-clone storm that saturated the BEs. Adding the 8040
BE-to-BE self-ingress rule fixed it: clones finish, drain to 0, replicas heal. This was our
infrastructure error, not a Doris defect. We mention it only because, once fixed, cause #3 still
reproduces -- which proves cause #3 is independent of it.
Cause #2, load-stream-open stall under heavy multi-replica load -- mitigated. Distinct from the
8040 clone path; this is on the 8060 write path. Mitigated, not cured, by experimental_enable_single_replica_insert.
Detection and the workaround we run today
Detection. The Thrift heartbeat (9050) runs on a separate threadpool from the brpc write path
(8060), so SHOW BACKENDS ... Alive=true is not a writability signal -- it stayed green for ~2.5
hours while every write was dead. We added a write-readiness canary, a small bounded INSERT over
the 8060 path, to our health check, and a wedge now surfaces in seconds instead of hours.
Recovery. A full BE-fleet restart. A single-BE restart does not clear it.
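The write-readiness canary can be sketched roughly as follows. The INSERT itself is passed in as a callable so the example stays independent of any particular MySQL client library; the function name, the status strings and the 10 s default are illustrative assumptions, not the exact check we run.

```python
import threading
import time


def write_canary(do_insert, timeout_s=10.0):
    """Run a small bounded write and classify the outcome.

    do_insert: zero-argument callable that performs the INSERT (hypothetical,
    supplied by the caller). Returns (status, elapsed_seconds) where status is
    'ok', 'failed' (the write raised) or 'wedged' (no answer within timeout_s).
    """
    outcome = {}

    def run():
        try:
            do_insert()
            outcome["status"] = "ok"
        except Exception:  # any client-side error counts as a failed write
            outcome["status"] = "failed"

    # Daemon thread: a write that never returns must not block process exit.
    worker = threading.Thread(target=run, daemon=True)
    started = time.monotonic()
    worker.start()
    worker.join(timeout_s)
    elapsed = time.monotonic() - started
    if worker.is_alive():
        return "wedged", elapsed
    return outcome["status"], elapsed
```

The point of the separate 'wedged' status is that a parked load-stream open does not raise quickly; it has to be detected by the absence of an answer, which is what Alive=true cannot tell you.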
Code-level analysis (Doris 4.0.6, bundled brpc 1.4.0)
We traced the captured stacks/logs into the 4.0.6 source. The load-bearing finding, from the
target BE's be.INFO during the wedge:
PInternalService::open_load_stream logs "open load stream, load_id=..." (internal_service.cpp:416)
as the first line of the handler. During the wedge there were 0 such handler-entry lines on the
target BEs in the wedge window, versus 1700+ historically -- while the BE worker pools sat idle (pstack: threads parked in blocking_get, not saturated; a saturated pool would fail try_offer fast, not time out at 60s).
So the inbound open_load_stream RPC never reaches the Doris service handler. Combined with raw
TCP to :8060 being OPEN, the stall is between TCP-accept and service-dispatch -- inside brpc
1.4.0, below Doris's load-stream code. Doris's handler is not the stall point; it is never entered.
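The handler-entry count can be reproduced from be.INFO with a short script. The sketch below assumes glog-style record prefixes of the form "I20260622 14:03:11.123456 ..." (severity letter, date, time); if the BE is configured with a different log format, the timestamp pattern needs adjusting. The marker string is the handler's first log line quoted above.

```python
import re
from datetime import datetime

MARKER = "open load stream, load_id="
# Assumption: glog-style prefix, e.g. "I20260622 14:03:11.123456 ...".
STAMP = re.compile(r"^[IWEF](\d{8} \d{2}:\d{2}:\d{2})")


def handler_entries(lines, start, end):
    """Count open_load_stream handler-entry lines with start <= timestamp < end."""
    count = 0
    for line in lines:
        if MARKER not in line:
            continue
        stamp = STAMP.match(line)
        if stamp is None:
            continue  # continuation line or unexpected format; skip it
        when = datetime.strptime(stamp.group(1), "%Y%m%d %H:%M:%S")
        if start <= when < end:
            count += 1
    return count
```

Comparing the count for the wedge window with the count for an earlier healthy window of the same length on the same BE gives the "0 versus 1700+" style contrast without reading the log by eye.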
We did not pin the exact brpc 1.4.0 line -- it is in the bundled submodule
(thirdparty/vars.sh, apache/brpc tag 1.4.0), and the runtime probe that would pin it (brpc rpcz / socket bvars) could not be enabled at runtime on this build. One secondary, non-root nuance
we found: FailureDetectChannel invalidates a cached channel only on EHOSTDOWN, not on a timeout
(brpc_client_cache.h:80,125) -- but we captured 249 EHOSTDOWN (Host is down) and channel
rebuilds happened anyway and did not recover the wedge, so that is at most a hardening suggestion, not
the cause.
What we have captured and what else we can provide
We have captured the in-process state at a live wedge. Attached:
gstack thread dumps of doris_be on the two BEs with parked write threads (full ~1747-thread
dumps),
brpc /vars (94 KB) and /metrics from each, showing the worker / load-channel / compaction
state,
be.WARNING tails with the [E1008] open failures and the FailureDetectChannel probe failures.
We could not get rpcz -- it is off by default and :8060/rpcz/enable did not enable it at runtime
on this build. If there is a flag or build option to turn rpcz on, tell us and we will capture it. We
can also pull a full gdb backtrace (gdb -p <pid>, then thread apply all bt), more specific brpc
bvars, or FE-side state on request.
One caveat on timing. This is a dev POC and we are moving on with our implementation, so the cluster
will not stay up indefinitely. The reproducer, the captures above, and any candidate-build testing
are only available while the cluster is still running -- so the sooner we can act on this, the
better.
Questions for the maintainers
Our evidence says the open_load_stream RPC never reaches the server handler (0 handler-entry logs,
idle pools) while raw TCP to :8060 is open -- consistent with a brpc 1.4.0 socket/stream that is
accepted at TCP but never dispatched, and never revived (the brpc #1168 class). Is this a known
brpc 1.4.0 defect on the load-stream path, and is there a fixing PR or a brpc version that resolves
it?
Is the load-stream brpc_client_cache expected to revive a Broken socket automatically? In our
capture it never did, and enable_brpc_connection_check=true did not help. Is that the intended
recovery path, and should it have recovered the socket?
Is there a supported config that makes a Broken load-stream socket fail fast and reconnect, rather
than park ~534s on the open RPC?
Is there evidence that 4.1.x (4.1.2 specifically) contains a relevant brpc / load-stream fix? We
will run the upgrade test -- restart one BE, drive the load sequence, watch the canary -- and
report back.
Appendix -- config
be.conf (stock defaults plus these managed overrides only):
JAVA_HOME = /usr/lib/jvm/java-17-amazon-corretto.x86_64
storage_root_path = /var/lib/doris/storage
priority_networks = <node_private_ip>/32
mem_limit = 80%
be_port = 9060 # shipped default
webserver_port = 8040 # shipped default
heartbeat_service_port = 9050 # shipped default
brpc_port = 8060 # shipped default
# added later as hardening; did NOT fix the wedge:
enable_brpc_connection_check = true
fe.conf (stock defaults plus these overrides only):
JAVA_HOME = /usr/lib/jvm/java-17-amazon-corretto.x86_64
meta_dir = /var/lib/doris/fe-meta
priority_networks = <node_private_ip>/32
http_port = 8030 # shipped default
rpc_port = 9020 # shipped default
query_port = 9030 # shipped default
edit_log_port = 9010 # shipped default
# FE global var set at runtime; mitigates cause #2, not cause #3:
experimental_enable_single_replica_insert = true
Table shape (representative):
CREATE TABLE evo_persons (
identity_hash varchar(32) NOT NULL,
id_numbers_hash varchar(32) NOT NULL,
... -- aggregated attribute and counter columns
)
UNIQUE KEY(identity_hash, id_numbers_hash)
DISTRIBUTED BY HASH(identity_hash) BUCKETS 16
PROPERTIES ('replication_num'='3', 'enable_unique_key_merge_on_write'='true');
Ports:
| Port | Service | At the wedge |
|---|---|---|
| 8060 | brpc (tablet-writer / load-stream OPEN) | timed out, all directions including loopback |
| 8040 | webserver (clone snapshot download) | timed out until our security-group fix (cause #1); fine after |
| 9050 | Thrift heartbeat (separate threadpool) | stayed responsive, so SHOW BACKENDS showed Alive=true |
wedge.10.0.0.105.tar.gz
wedge.10.0.0.118.tar.gz
wedge.10.0.0.155.tar.gz
wedge.10.0.0.229.tar.gz