[Bug] BE write path wedges: a BE-to-BE load-stream brpc socket (8060) goes "Broken" and is never revived (4.0.6, coupled mode) #64708

### Search before asking

- [x] I had searched in the [issues](https://github.com/apache/doris/issues?q=is%3Aissue) and found no similar issues.


### Version

We are on Apache Doris 4.0.6 GA (x86_64, AVX2). The running build string reads `doris-4.0.6-rc02`,
which is misleading -- it is the GA, not a hand-picked release candidate. The official
`apache-doris-4.0.6-bin-x64.tar.gz` was cut from the `4.0.6-rc02` tag and the embedded build string
was never bumped to drop the `-rc02` suffix. We verified the artifact two independent ways so this
is not dismissed as an RC:

1. Build-commit match. Our binary reports commit `1663f25c16f`; the Apache Doris 4.0.6 release is
   commit `1663f25`. A binary can only embed that commit if it was built from that exact tree.
2. Cross-mirror sha512 match. `apache-doris-4.0.6-bin-x64.tar.gz.sha512` is byte-identical on the
   official release host and the mirror our deploy pulls from:
   `8f869c4399088d3dc34e5ade10047495e42c7c0583fb32156adaf0794a56e5942b8c0142c05fc145d58d4148daf0ee8d0dde73c9aab0224f39b2435f406c8ef8`.
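
The second check can be repeated locally. This is a minimal sketch, assuming the tarball and its `.sha512` file are already downloaded; the function names are hypothetical, and the `.sha512` file is assumed to carry the hex digest as its first whitespace-separated token.

```python
import hashlib
from pathlib import Path


def sha512_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file through SHA-512 and return the hex digest."""
    digest = hashlib.sha512()
    with path.open("rb") as fh:
        # Read in fixed-size chunks so a multi-GB tarball is never held in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(artifact: Path, sha512_file: Path) -> bool:
    """Compare the artifact digest with the first token of the .sha512 file.

    Assumption: the published file starts with the hex digest, optionally
    followed by the file name.
    """
    published = sha512_file.read_text().split()[0].lower()
    return sha512_of(artifact) == published
```

Running `matches_published` against the copy from each mirror, and comparing the two `.sha512` files byte for byte, reproduces the cross-mirror check.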

MySQL-wire version reported: `5.7.99`. We have not yet tested 4.1.x.

### What's Wrong?

On a coupled-mode 4.0.6 cluster, the BE write path wedges. `INSERT`, `CREATE TABLE AS SELECT`, and
materialized-view refresh hang and then fail with `failed to write enough replicas N/M ... due to connection
errors`, while every node still reports `Alive=true` and reads keep working. Once wedged, only a
full restart of all BE processes recovers it -- a single-BE restart does not.

We chased this for a full day with live instrumentation and found three distinct causes, not one.
Two were ours and are fixed. The third is the reason for this report: a BE-to-BE load-stream brpc
socket on port 8060 goes "Broken" and is never revived, and we cannot fix it from the 4.0.6 config
surface. We believe it is an upstream defect of the Apache brpc
[#1168](https://www.mail-archive.com/dev@brpc.apache.org/msg03092.html) class.

| # | Cause | Trigger | Status |
| - | ----- | ------- | ------ |
| 1 | Our security group was missing a BE-to-BE self-ingress on port 8040 (`webserver_port`, clone snapshot download), so clone REPAIR could never complete and the FE ran an unbounded repair-clone storm | replica repair / single-BE restart | Fixed -- our IaC bug, not Doris |
| 2 | brpc load-stream-open stall on 8060 under heavy multi-replica load | heavy multi-replica `INSERT` | Mitigated with `experimental_enable_single_replica_insert=true` |
| 3 | A BE-to-BE load-stream brpc socket (8060) goes "Broken" and is never revived | accumulation of ~6--7 BE-to-BE stream opens | Open -- suspected upstream defect, this report |

Cause #1 was our mistake. We include it so it is clear the residual (cause #3) is independent of it:
after we fixed the security group and clones completed cleanly, cause #3 still reproduces.

### The bug (cause #3)

On a healthy cluster, repl=3 writes succeed. After roughly 6--7 successful BE-to-BE load-stream-open
operations, a specific brpc socket on the BE-to-BE load-stream path (8060) enters a state where the
next load-stream-open RPC to the affected peer parks until the RPC timeout (~534s) and then fails,
taking down the whole write path. The socket is never revived. Reads and the Thrift heartbeat (9050,
a separate threadpool) stay healthy the entire time, so `SHOW BACKENDS` shows `Alive=true`
throughout.

#### Signature (verbatim, BE `be.WARNING` and FE `fe.log`)

Coordinator-side open failure -- 60s is the `tablet_writer_open_rpc_timeout_sec` default:

```text
load_stream_stub.cpp:591 open stream failed: [INTERNAL_ERROR]Failed to connect to backend <id>:
  [E1008]Reached timeout=60000ms @10.0.0.105:8060
```

The long park -- the RPC itself stalls ~534s before failing:

```text
brpc_closure.h RPC meet failed: [E1008]Reached timeout=533999ms @10.0.0.105:8060
brpc_closure.h RPC meet failed: [E1008]Reached timeout=533999ms @10.0.0.118:8060
brpc_closure.h RPC meet failed: [E1008]Reached timeout=533999ms @10.0.0.155:8060
```

Loopback proof that this is in-process, not network or security group -- a BE times out opening a
tablet-writer to its own 8060, and cancels a load-stream whose source and destination are the same
BE:

```text
load_id=..., txn_id=6, node=10.0.0.118:8060, open failed, err: ... RPC call is timed out,
  error_text=[E1008]Reached timeout=60000ms @10.0.0.118:8060, host: 10.0.0.118

load_stream_stub ... src_id=...499, dst_id=...499, stream_id=1740 is cancelled ...
  write enough replicas 1/3
```

The brpc client cache reports the same 60s timeout when opening a connection to a peer:

```text
brpc_client_cache.h:326 open brpc connection to 10.0.0.105:8060 failed:
  [E1008]Reached timeout=60000ms
```

User-facing FE error:

```text
failed to open DeltaWriter <id>: failed to write enough replicas 1/3 for tablet <id>
  due to connection errors
```

At the original bring-up wedge the same error read `... 0/1 ...`.
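
To tally these signatures per peer, the sketch below groups `[E1008]` timeouts by target address. It is a hedged example: the regular expression is derived only from the lines quoted above, the function name is hypothetical, and lines that carry no `@ip:port` suffix (such as the `brpc_client_cache.h` continuation line) are deliberately skipped.

```python
import re
from collections import defaultdict

# Pattern taken from the quoted log lines: "[E1008]Reached timeout=<ms>ms @<ip>:<port>".
E1008 = re.compile(r"\[E1008\]Reached timeout=(\d+)ms @(\d+\.\d+\.\d+\.\d+):(\d+)")


def timeouts_by_peer(lines):
    """Return {"ip:port": [timeout_ms, ...]} for every E1008 timeout found."""
    peers = defaultdict(list)
    for line in lines:
        for ms, ip, port in E1008.findall(line):
            peers[f"{ip}:{port}"].append(int(ms))
    return dict(peers)


if __name__ == "__main__":
    sample = [
        "brpc_closure.h RPC meet failed: [E1008]Reached timeout=533999ms @10.0.0.105:8060",
        "  [E1008]Reached timeout=60000ms @10.0.0.105:8060",
    ]
    for peer, values in timeouts_by_peer(sample).items():
        print(peer, values)
```

Separating the 60000ms open timeouts from the 533999ms parks per peer makes it easy to see whether a loopback target is among the affected addresses.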

#### In-process capture at a live wedge (gstack + bvar)

We captured the in-process BE state during a live wedge (2026-06-22, `doris-4.0.6-rc02`, commit
`1663f25c16f`), before the recovery restart. Full dumps are attached.

A write thread (`gstack`, BE A) is parked in the brpc load-stream OPEN -- the V2 path:

```text
bthread_id_join
 -> brpc::Channel::CallMethod
 -> doris::FailureDetectChannel::CallMethod        be/src/util/brpc_client_cache.h:121
 -> doris::LoadStreamStub::open                    be/src/vec/sink/load_stream_stub.cpp:195   (txn_id=4442, total_streams=2, idle_timeout_ms=30000)
 -> doris::LoadStreamStubs::open                   be/src/vec/sink/load_stream_stub.cpp:574
 -> doris::vectorized::VTabletWriterV2::_open_streams_to_backend   be/src/vec/sink/writer/vtablet_writer_v2.cpp:317
 -> doris::vectorized::VTabletWriterV2::_open_streams              vtablet_writer_v2.cpp:296
 -> doris::vectorized::VTabletWriterV2::open                       vtablet_writer_v2.cpp:272
 -> doris::vectorized::AsyncResultWriter::process_block            be/src/vec/sink/writer/async_result_writer.cpp:119
 -> doris::vectorized::AsyncResultWriter::start_writer             async_result_writer.cpp:105
 -> doris::ThreadPool::dispatch_thread
```

On another BE the same root appears via the V1 writer path -- `VNodeChannel::open_wait`
(`vtablet_writer.cpp:704`) -> `bthread_id_join`. Both are parked on the brpc load-stream OPEN RPC to
a peer backend.

It is **not worker-pool exhaustion, not compaction, not a stub leak** -- brpc `/vars` (8060) and
`/metrics` (8040) at the wedge:

| BE | `bthread_worker_usage` / count | `load_channel_count` | `tablet_writer_count` | `brpc_stream_endpoint_stub_count` | compaction (base+cumulative) |
| -- | ------------------------------ | -------------------- | --------------------- | --------------------------------- | ---------------------------- |
| A | 0.20 / 256 | 2 | 8 | 4 | 0 |
| B | 54.6 / 256 | 3 | 9 | 4 | 0 |

Workers are nowhere near the 256 ceiling -- the write threads are parked on the RPC, not starved.
Load channels and tablet writers are open and stuck; the stub count is the steady-state 4 (no leak);
compaction is fully idle. `rpcz` was empty (off by default; `:8060/rpcz/enable` did not enable it at
runtime on this build), so the parked-RPC evidence is the `gstack` above.
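
The worker-headroom reading can be reproduced from a saved `/vars` dump. This is a rough sketch, assuming the plain-text `name : value` layout that brpc's `/vars` page emits; the function names are hypothetical and only the two bvars from the table are used.

```python
def parse_vars(text):
    """Parse 'name : value' lines into a dict.

    Numeric values become floats; anything else is kept as a string.
    Lines without the ' : ' separator are ignored.
    """
    out = {}
    for line in text.splitlines():
        name, sep, value = line.partition(" : ")
        if not sep:
            continue
        value = value.strip()
        try:
            out[name.strip()] = float(value)
        except ValueError:
            out[name.strip()] = value
    return out


def worker_utilisation(bvars):
    """Fraction of bthread workers in use: usage divided by worker count."""
    return bvars["bthread_worker_usage"] / bvars["bthread_worker_count"]


if __name__ == "__main__":
    # Values for BE B, copied from the table above.
    dump = "bthread_worker_usage : 54.6\nbthread_worker_count : 256\n"
    print(f"{worker_utilisation(parse_vars(dump)):.1%}")
```

For BE B this is 54.6 / 256, roughly a fifth of the pool, which is why starvation does not explain the parked threads.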

`load_stream_stub` cancellations appear across all BEs for the same load.

#### Trigger: accumulation, not a timer or idle decay

- Across two instrumented runs the wedge fired after 7 successful repl=3 write operations in one run
  and after 6 in the other. It tracks the number of BE-to-BE load-stream opens, not a wall-clock
  interval.
- It fires both during a heavy multi-replica load and ~7--11 minutes after a load while the cluster
  is otherwise idle (no further writes issued).
- A restart followed by 60 minutes of pure idle with no load did not wedge. So it is load-induced,
  not idle decay.
- `brpc_stream_endpoint_stub_count` stayed at 4 across the wedge -- no stub-count leak. It is a
  specific socket going Broken, not stub exhaustion.

#### Recovery

A full restart of all BE processes clears it. A single-BE restart does not -- the rejoined BE's peers
still hold the broken stub, so it rejoins a wedged mesh.

### What You Expected?

When a BE-to-BE load-stream brpc connection breaks, brpc should revive it (or the load-stream-open
RPC should fail fast and the channel reconnect), so the write path recovers on its own. Instead the
open RPC parks ~534s and the entire write path wedges while every node still reports `Alive=true`, and
only a full BE-fleet restart clears it. A single broken socket should not require dropping all BE
processes to recover.

### How to Reproduce?

1. Coupled-mode 4.0.6 cluster, 3 FE + 4 BE, default replication 3, stock be.conf.
2. Create native UNIQUE-KEY merge-on-write tables, `replication_num=3`,
   `DISTRIBUTED BY HASH(...) BUCKETS 16`.
3. Run a sequence of multi-replica writes that each open BE-to-BE load streams -- repeated
   `INSERT ... SELECT` / `CREATE TABLE AS SELECT` of a few million rows. In our case, four such loads
   plus a handful of `UPDATE ... FROM` statements per cycle.
4. After ~6--7 such operations -- during the load, or within ~7--11 minutes after -- a write hangs and
   then fails with `failed to write enough replicas N/3 ... due to connection errors`. `be.WARNING`
   shows `[E1008]Reached timeout ... @<be>:8060`, including a loopback `@<self>:8060`.
5. `SELECT 1` and `SHOW BACKENDS` (`Alive=true`) keep working. Only a full BE restart recovers.

We have not reduced this to a minimal standalone reproducer; it reproduces reliably under our normal
multi-replica load. We will run a targeted reproducer if you suggest one.

### Anything Else?

### Search / prior art

We searched the issue tracker, the `load_stream` / move-memtable PR history, the 4.1.x changelogs,
and community forums (English and Chinese) before filing, and found no exact match for the full
signature. The closest structural match is Apache brpc #1168 -- after a downstream node fault the
upstream socket enters a "Broken" state and the health check never revives it; recovery requires
restarting the upstream. Adjacent load-stream lifecycle fixes already in 4.0.6: #34883,
#39231 / #39762, #60148, #60285. Possibly related and unconfirmed for 4.0.x: #56120 ("close brpc
stream after load stream is closed"). If a maintainer recognises this as known or already fixed, a
pointer to the PR is the fastest resolution.

### Environment

- Mode: coupled (storage-compute together), FE + BE only. No FoundationDB / Meta Service / Recycler /
  S3 storage vault. Native tablet data lives on BE-local EBS.
- Topology: 3 FE (HA followers) + 4 BE. Each node 8 vCPU / 64 GiB RAM.
- BE storage: one dedicated 500 GB gp3 volume per BE, xfs (`noatime,nodiratime`), mounted
  `/var/lib/doris/storage`, gp3 baseline 3000 IOPS / 125 MiB/s.
- OS / JDK: Amazon Linux 2023, Amazon Corretto 17.
- Replication: Doris default 3, across 4 BEs.
- Workload: read Apache Iceberg through a Glue / S3 external catalog, then write the aggregated
  result into native Doris UNIQUE-KEY merge-on-write tables -- `CREATE TABLE AS SELECT`,
  `INSERT ... SELECT`, and a few `UPDATE ... FROM` statements. About 5M rows per table, 4 tables.
- be.conf is effectively stock. The only non-default overrides are `mem_limit = 80%`,
  `storage_root_path`, and `priority_networks = <self>/32`. No brpc / clone / timeout tuning was set
  initially. Full dump at the end.

### What we ruled out, with positive evidence

All of the environment-layer suspects below were tested **directly, while wedged** (raw probes on
2026-06-22).

| Hypothesis | Verdict | Evidence |
| ---------- | ------- | -------- |
| TCP / network / security group / routing on 8060 | Ruled out | Raw TCP (`/dev/tcp`) to `:8060` is **OPEN to the peer and over loopback to self** on all 4 BEs while the brpc RPC on the same port times out; the listener is healthy (`LISTEN 0 1024 0.0.0.0:8060`). A brpc call failing to its own loopback `:8060` while raw TCP to that port succeeds cannot be network/SG/routing. |
| Host firewall (iptables / nftables / firewalld) | Ruled out | All 4 BEs: `iptables` 0 non-policy rules (default-ACCEPT), `ip6tables` 0, `nft` ruleset empty, `firewalld` inactive/absent. No host firewall exists. |
| SELinux | Ruled out | `getenforce` = **Permissive** on all 4 (policy `targeted`, mode permissive) -- it logs but cannot block. |
| ENA bandwidth throttle | Ruled out | `bw_in/out_allowance_exceeded` are non-zero **cumulative** but **Δ=0 over a 50s sample during the idle wedge** (they moved only during the loads); `pps_allowance_exceeded`=0, `conntrack_allowance_exceeded`=0. No active throttle while wedged. |
| conntrack / ephemeral ports | Ruled out | `nf_conntrack` module not loaded; ~53--60 of ~28k ephemeral ports used, 3 TIME-WAIT. Neither is exhausted. |
| Kernel / OOM / packet drops | Ruled out | `dmesg` / `journalctl -k` show no drop/deny/reject/oom/conntrack/throttle lines for the window. |
| Deployment / OS-tuning misconfig | Ruled out | Our install sets all Doris-required kernel tuning (`vm.max_map_count=2000000` -- live-confirmed, swap off, `nofile` 655350, THP madvise) and **runs `start_be.sh`'s preflight**, which the official `apache/doris` container deployment *skips* (`SKIP_CHECK_ULIMIT=true`). The official FE/BE images add no brpc/network/timeout config we lack -- only `priority_networks`. So it is not a deployment misconfiguration. |
| Compaction / merge-on-write delete-bitmap publish | Ruled out | Captured live at the wedge: every compaction metric is 0 on all 4 BEs -- `doris_be_compaction_task_state_total{base,cumulative}=0`, `doris_be_disks_compaction_score=0`, `doris_be_compaction_used_permits=0`, `doris_be_compaction_waitting_permits=0`, `doris_be_load_channel_count=0`, `doris_be_tablet_writer_count=0`. |
| Resource exhaustion (CPU / memory / IO) | Ruled out | At the wedge the BEs are near-idle: load avg ~0.0--0.09, ~55--60 GB RAM free, `doris_be` at 2--3% CPU. EBS volumes idle (`VolumeReadOps=0`, under 1 write IOPS, `VolumeQueueLength` ~0). |
| BE soft memory limit / flush back-pressure | Ruled out | Workload-group `total_mem_used` 0--158 MB against an ~53 GB limit; zero memory-exceed or MemoryGc-cancel lines. Memory would climb if flush stalled. |
| Crash / auto-restart / kernel OOM | Ruled out | `NRestarts=0`, single MainPID for the whole window on every BE; `dmesg` and `journalctl -k` empty for the window. |
| Replication factor (repl=3 itself) | Ruled out | A fresh-cluster full 4-table build at repl=3 completed cleanly and sustained, 0 errors. Earlier "repl=3 triggers it" readings were confounded by clusters already degraded by prior single-BE-restart experiments. |
| BE thread-pool exhaustion | Not the cause | No BE thread pool is pegged at the wedge: EvHttpServer at pool size 128, pipeline schedulers at normal 8/16, no compaction or memtable pool active. |

The one mechanism consistent with all of this is a brpc load-stream socket going Broken and never
being revived: raw TCP to `:8060` connects (peer and loopback) while every brpc RPC on it times out;
Doris's own health check evicts the stub (`remove brpc stub from cache`) and recreates it, and the
new stub still times out; the errors are connect/open *timeouts* (never "Connection refused" or
"reset"); and it clears only when the process is dropped. That is the Apache brpc #1168 class.
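
The TCP-versus-RPC distinction drawn above can be sketched as follows. This is an illustration, not the probe we ran (that was `/dev/tcp` from a shell): `tcp_connects` and `classify` are hypothetical names, and the result of the brpc call itself is passed in by the caller rather than issued here.

```python
import socket


def tcp_connects(host: str, port: int, timeout_s: float = 3.0) -> bool:
    """Return True if a plain TCP handshake to host:port completes in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused, unreachable, and timed-out connects.
        return False


def classify(tcp_ok: bool, rpc_ok: bool) -> str:
    """Locate the fault from the two observations.

    - TCP fails: network, security group, routing, or a dead listener.
    - TCP succeeds but the RPC times out: the stall is inside the process,
      between accept and dispatch.
    """
    if not tcp_ok:
        return "network-or-listener"
    if not rpc_ok:
        return "in-process"
    return "healthy"
```

At the wedge every BE, including the loopback case, falls in the `in-process` branch: `tcp_connects(<be>, 8060)` is true while the load-stream open on the same port times out.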

### Config we tried that did not fix it

| Setting | Where | Result |
| ------- | ----- | ------ |
| `enable_brpc_connection_check = true` | be.conf, immutable, rolling restart | No effect. This is the mechanism that should periodically check brpc connections and close/recreate broken ones (`brpc_connection_check_timeout_ms` = 10s default), but it did not revive the broken load-stream socket. Wedged again at +8 minutes. Kept as general hardening. |
| `experimental_enable_single_replica_insert = true` | FE global var | Partial and unreliable. Loads write one replica and clone the rest, so a single load completes instead of hanging, but the idle wedge still fires afterward and a later load still hung despite the setting. |

We did not raise `tablet_writer_open_rpc_timeout_sec` or `brpc_socket_max_unwritten_bytes` beyond
defaults, because those mask the symptom -- a longer park -- rather than revive the socket. If you
believe a specific brpc knob is the fix, we will test it.

### The two causes we fixed ourselves

We are listing these so it is clear the residual is isolated, and because one of them was our
mistake and we would rather name it than route around it.

- Cause #1, our security-group bug -- fixed. Our BE security group self-referenced 8060 (brpc) and
  9060 (be_port) but not 8040 (`webserver_port`, the HTTP port used for clone snapshot download
  between BEs). Loads over brpc 8060 worked, but clone REPAIR over
  `http://<be>:8040/api/_tablet/_download` timed out
  (`[HTTP_ERROR]Connection timed out after 15000 milliseconds`), so missing replicas never healed
  and the FE ran an unbounded VERY_HIGH repair-clone storm that saturated the BEs. Adding the 8040
  BE-to-BE self-ingress rule fixed it: clones finish, drain to 0, replicas heal. This was our
  infrastructure error, not a Doris defect. We mention it only because, once fixed, cause #3 still
  reproduces -- which proves #3 is independent of it.
- Cause #2, load-stream-open stall under heavy multi-replica load -- mitigated. Distinct from the
  8040 clone path; this is on the 8060 write path. Mitigated, not cured, by
  `experimental_enable_single_replica_insert`.

### Detection and the workaround we run today

- Detection. The Thrift heartbeat (9050) runs on a separate threadpool from the brpc write path
  (8060), so `SHOW BACKENDS ... Alive=true` is not a writability signal -- it stayed green for ~2.5
  hours while every write was dead. We added a write-readiness canary, a small bounded `INSERT` over
  the 8060 path, to our health check, and a wedge now surfaces in seconds instead of hours.
- Recovery. A full BE-fleet restart. A single-BE restart does not clear it.
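
A minimal sketch of the canary's timing logic, under stated assumptions: `do_insert` is a hypothetical caller-supplied function that runs one small `INSERT` through whatever MySQL-protocol client is in use, and the 10-second bound is an illustrative default, not a tuned value.

```python
import threading
from typing import Callable


def write_canary(do_insert: Callable[[], None], timeout_s: float = 10.0) -> str:
    """Run one bounded write and report 'ok', 'failed', or 'wedged'.

    'wedged' means the write neither returned nor raised within timeout_s,
    which is the signature of the parked load-stream open.
    """
    outcome = {}

    def worker():
        try:
            do_insert()
            outcome["status"] = "ok"
        except Exception:  # any client or server error counts as a failed write
            outcome["status"] = "failed"

    # Daemon thread: a parked write must not keep the health checker alive.
    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout_s)
    if thread.is_alive():
        return "wedged"
    return outcome["status"]
```

The health check treats both `failed` and `wedged` as not writable, regardless of what `SHOW BACKENDS` reports. Bounding the wait in the checker matters because the server-side park lasts far longer than any useful alerting window.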

### Code-level analysis (Doris 4.0.6, bundled brpc 1.4.0)

We traced the captured stacks/logs into the 4.0.6 source. The load-bearing finding, from the
**target** BE's `be.INFO` during the wedge:

- `PInternalService::open_load_stream` logs `"open load stream, load_id=..."` (internal_service.cpp:416)
  as the first line of the handler. During the wedge there were **0** such handler-entry lines on the
  target BEs in the wedge window, versus **1700+** historically -- while the BE worker pools sat
  **idle** (gstack: threads parked in `blocking_get`, not saturated; a saturated pool would fail
  `try_offer` fast, not time out at 60s).
- So the inbound `open_load_stream` RPC **never reaches the Doris service handler**. Combined with raw
  TCP to `:8060` being OPEN, the stall is between TCP-accept and service-dispatch -- inside **brpc
  1.4.0**, below Doris's load-stream code. Doris's handler is not the stall point; it is never entered.

We did **not** pin the exact brpc 1.4.0 line -- it is in the bundled submodule
(`thirdparty/vars.sh`, `apache/brpc` tag `1.4.0`), and the runtime probe that would pin it (brpc
`rpcz` / socket bvars) could not be enabled at runtime on this build. One secondary, non-root nuance
we found: `FailureDetectChannel` invalidates a cached channel only on `EHOSTDOWN`, not on a timeout
(`brpc_client_cache.h:80,125`) -- but we captured 249 `EHOSTDOWN` (`Host is down`) and channel
rebuilds happened anyway and did not recover the wedge, so that is at most a hardening suggestion, not
the cause.
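
The handler-entry count can be reproduced from `be.INFO` with a short script. This is a sketch under assumptions: the marker string is the one quoted above, the glog-style line prefix (`I20260622 10:00:01.000000 ...`) is assumed rather than copied from our logs, and `handler_entries` is a hypothetical name.

```python
import re
from datetime import datetime

# Assumed glog-style prefix: severity letter, yyyymmdd, space, HH:MM:SS.
GLOG_TS = re.compile(r"^[IWEF](\d{8}) (\d{2}:\d{2}:\d{2})")
MARKER = "open load stream, load_id="


def handler_entries(lines, start, end):
    """Count handler-entry lines whose timestamp falls in [start, end)."""
    count = 0
    for line in lines:
        if MARKER not in line:
            continue
        match = GLOG_TS.match(line)
        if not match:
            continue  # skip lines that do not carry the assumed prefix
        ts = datetime.strptime(match.group(1) + match.group(2), "%Y%m%d%H:%M:%S")
        if start <= ts < end:
            count += 1
    return count
```

Calling it once with the wedge window and once with an earlier healthy window on the same target BE gives the 0-versus-historical comparison directly.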

### What we have captured and what else we can provide

We have captured the in-process state at a live wedge. Attached:

- `gstack` thread dumps of `doris_be` on the two BEs with parked write threads (full ~1747-thread
  dumps),
- brpc `/vars` (94 KB) and `/metrics` from each, showing the worker / load-channel / compaction
  state,
- `be.WARNING` tails with the `[E1008]` open failures and the `FailureDetectChannel` probe failures.

We could not get `rpcz` -- it is off by default and `:8060/rpcz/enable` did not enable it at runtime
on this build. If there is a flag or build option to turn rpcz on, tell us and we will capture it. We
can also pull a full `gdb -p` `thread apply all bt`, more specific brpc `bvar`s, or FE-side state on
request.

One caveat on timing. This is a dev POC and we are moving on with our implementation, so the cluster
will not stay up indefinitely. The reproducer, the captures above, and any candidate-build testing
are only available while the cluster is still running -- so the sooner we can act on this, the
better.

### Questions for the maintainers

1. Our evidence says the `open_load_stream` RPC never reaches the server handler (0 handler-entry logs,
   idle pools) while raw TCP to `:8060` is open -- consistent with a brpc 1.4.0 socket/stream that is
   accepted at TCP but never dispatched, and never revived (the brpc #1168 class). Is this a known
   brpc 1.4.0 defect on the load-stream path, and is there a fixing PR or a brpc version that resolves
   it?
2. Is the load-stream `brpc_client_cache` expected to revive a Broken socket automatically? In our
   capture it never did, and `enable_brpc_connection_check=true` did not help. Is that the intended
   recovery path, and should it have recovered the socket?
3. Is there a supported config that makes a Broken load-stream socket fail fast and reconnect, rather
   than park ~534s on the open RPC?
4. Is there evidence that 4.1.x (4.1.2 specifically) contains a relevant brpc / load-stream fix? We
   will run the upgrade test -- restart one BE, drive the load sequence, watch the canary -- and
   report back.

### Appendix -- config

be.conf (stock defaults plus these managed overrides only):

```text
JAVA_HOME = /usr/lib/jvm/java-17-amazon-corretto.x86_64
storage_root_path = /var/lib/doris/storage
priority_networks = <node_private_ip>/32
mem_limit = 80%
be_port = 9060            # shipped default
webserver_port = 8040     # shipped default
heartbeat_service_port = 9050   # shipped default
brpc_port = 8060          # shipped default
# added later as hardening; did NOT fix the wedge:
enable_brpc_connection_check = true
```

fe.conf (stock defaults plus these overrides only):

```text
JAVA_HOME = /usr/lib/jvm/java-17-amazon-corretto.x86_64
meta_dir = /var/lib/doris/fe-meta
priority_networks = <node_private_ip>/32
http_port = 8030          # shipped default
rpc_port = 9020           # shipped default
query_port = 9030         # shipped default
edit_log_port = 9010      # shipped default
# FE global var set at runtime; mitigates cause #2, not cause #3:
experimental_enable_single_replica_insert = true
```

Table shape (representative):

```sql
CREATE TABLE evo_persons (
  identity_hash    varchar(32) NOT NULL,
  id_numbers_hash  varchar(32) NOT NULL,
  ...                          -- aggregated attribute and counter columns
)
UNIQUE KEY(identity_hash, id_numbers_hash)
DISTRIBUTED BY HASH(identity_hash) BUCKETS 16
PROPERTIES ('replication_num'='3', 'enable_unique_key_merge_on_write'='true');
```

Ports:

| Port | Service | At the wedge |
| ---- | ------- | ------------ |
| 8060 | brpc (tablet-writer / load-stream OPEN) | timed out, all directions including loopback |
| 8040 | webserver (clone snapshot download) | timed out until our security-group fix (cause #1); fine after |
| 9050 | Thrift heartbeat (separate threadpool) | stayed responsive, so `SHOW BACKENDS` showed Alive=true |

[wedge.10.0.0.105.tar.gz](https://github.com/user-attachments/files/29211906/wedge.10.0.0.105.tar.gz)
[wedge.10.0.0.118.tar.gz](https://github.com/user-attachments/files/29211904/wedge.10.0.0.118.tar.gz)
[wedge.10.0.0.155.tar.gz](https://github.com/user-attachments/files/29211903/wedge.10.0.0.155.tar.gz)
[wedge.10.0.0.229.tar.gz](https://github.com/user-attachments/files/29211905/wedge.10.0.0.229.tar.gz)

### Are you willing to submit PR?

- [ ] Yes I am willing to submit a PR!

### Code of Conduct

- [x] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)

