fix(ompi): correct integer `Avg` scaling in `AllReduce` and `ReduceScatter` by GordonYang1 · Pull Request #37 · InfiniTensor/InfiniCCL

GordonYang1 · 2026-06-10T07:20:22Z

Summary

This PR fixes an integer Avg (average) calculation error in the OpenMPI implementations of the reduce-family collectives AllReduce and ReduceScatter.

The host-side averaging step scaled each element with typed_buf[i] *= static_cast<T>(scale), where scale = 1 / world_size. For any world_size > 1 this reciprocal is a fraction in (0, 1); when the element type T is an integer, static_cast<T>(scale) truncates it to 0, so the entire Avg result is zeroed out. The fix performs the scaling in floating point (through double) and then casts back to T for integer types, while leaving the floating-point path unchanged. This mirrors the integer Avg fix already applied to Reduce in #28, restoring consistency across the reduce family.
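A minimal host-only sketch of the failure mode and the fix. It assumes `scale` is a `double` equal to `1.0 / world_size`; the PR text writes this as `1 / world_size`. `OldAvg` and `NewAvg` are hypothetical names used for illustration, not InfiniCCL functions.

```cpp
#include <cstdint>

// Hypothetical reproduction of the pre-fix scaling (not InfiniCCL code): the
// reciprocal is cast to T before multiplying, so for an integer T and any
// world_size > 1 the multiplier is truncated to 0.
template <typename T>
T OldAvg(T sum, int world_size) {
  const double scale = 1.0 / world_size;
  sum *= static_cast<T>(scale);
  return sum;
}

// Hypothetical reproduction of the fixed integer path: multiply in double and
// cast only the final product back to T.
template <typename T>
T NewAvg(T sum, int world_size) {
  const double scale = 1.0 / world_size;
  return static_cast<T>(static_cast<double>(sum) * scale);
}
```

With 8 ranks each contributing 5 to an `int32_t` buffer, the reduced sum is 40. `OldAvg<std::int32_t>(40, 8)` returns 0 because `static_cast<std::int32_t>(0.125)` is 0, while `NewAvg<std::int32_t>(40, 8)` returns 5.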

Changes

Reduce-family Avg correctness fix
- src/ompi/impl/all_reduce.h: guard the host-side Avg scaling with if constexpr (std::is_integral_v<T>); for integer types, compute the product in double and cast it back to T (static_cast<T>(static_cast<double>(typed_buf[i]) * scale)), and keep the existing in-place multiply for floating-point types.
- src/ompi/impl/reduce_scatter.h: apply the identical integer-safe Avg scaling fix.
- This aligns both ops with the Reduce implementation fixed in #28 (feat: support Reduce with OpenMPI backend implementation), so all three reduce-family collectives now share the same correct averaging behavior.
Includes
- add #include <type_traits> to both files for std::is_integral_v.
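The guarded scaling described in the Changes list can be sketched as a standalone template. `ScaleBufferForAvg` and its `(void*, count, world_size)` signature are hypothetical and chosen only for illustration. The real logic lives inline in src/ompi/impl/all_reduce.h and src/ompi/impl/reduce_scatter.h, and `scale` is again assumed to be a `double`.

```cpp
#include <cstddef>
#include <cstdint>
#include <type_traits>

// Hypothetical helper mirroring the guarded Avg scaling: `buf` holds the
// element-wise sum across ranks and is rescaled in place to the average.
template <typename T>
void ScaleBufferForAvg(void* buf, std::size_t count, int world_size) {
  T* typed_buf = static_cast<T*>(buf);
  const double scale = 1.0 / static_cast<double>(world_size);
  for (std::size_t i = 0; i < count; ++i) {
    if constexpr (std::is_integral_v<T>) {
      // Integer path: compute the product in double, then truncate back to T.
      typed_buf[i] =
          static_cast<T>(static_cast<double>(typed_buf[i]) * scale);
    } else {
      // Floating-point path: the unchanged in-place multiply.
      typed_buf[i] *= static_cast<T>(scale);
    }
  }
}
```

For example, an `int64_t` buffer `{8, 16, -24}` averaged over 8 ranks becomes `{1, 2, -3}`, and a `float` buffer `{1.0f}` averaged over 4 ranks becomes `{0.25f}`. Because `if constexpr` discards the branch that does not apply, each instantiation compiles only one path.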

Platform and Backend Affected

Platform

N/A- CPU
N/A- NVIDIA GPU
N/A- Iluvatar GPU
N/A- MetaX GPU
N/A- Moore Threads GPU
N/A- Cambricon MLU

Backend

OpenMPI
MPICH

Performance Impact

No performance impact
Performance improved
Performance regression possible

The averaging loop still runs once over the output buffer exactly as before; for integer types each element is now computed through double, a negligible host-side per-element cost, and the floating-point path is byte-for-byte unchanged. For reference, the heterogeneous run (8 ranks, 4 MB per rank, Float32 + Sum) measured AllReduce at 12.352 ms (0.55 GB/s bus BW), Reduce at 5.640 ms (1.21 GB/s bus BW), and ReduceScatter at 77.529 ms (4 MB recv / 32 MB send per rank).
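The quoted `AllReduce` bus bandwidth can be reproduced from its latency under two assumptions that the PR does not state. The first is the nccl-tests convention: algorithm bandwidth = bytes / time, and all-reduce bus bandwidth = algorithm bandwidth × 2(n − 1)/n. The second is binary units: 4 MB = 1,048,576 float32 = 4,194,304 bytes, and GB/s = 2^30 bytes/s. The function names are illustrative only.

```cpp
#include <cstddef>

// Illustrative helpers (not part of InfiniCCL), assuming nccl-tests-style
// bandwidth definitions and binary (2^30) units.
constexpr double kGiB = 1024.0 * 1024.0 * 1024.0;

// Algorithm bandwidth in GiB/s: bytes per rank divided by elapsed time.
double AlgBandwidthGiBps(std::size_t bytes, double seconds) {
  return static_cast<double>(bytes) / kGiB / seconds;
}

// All-reduce bus bandwidth: in a ring all-reduce each rank sends and receives
// 2(n - 1)/n times the buffer size, so the algorithm bandwidth is scaled by it.
double AllReduceBusBandwidthGiBps(std::size_t bytes, double seconds,
                                  int ranks) {
  const double factor = 2.0 * (ranks - 1) / ranks;
  return AlgBandwidthGiBps(bytes, seconds) * factor;
}
```

With 8 ranks and 12.352 ms, `AllReduceBusBandwidthGiBps(4194304, 0.012352, 8)` evaluates to about 0.553, which matches the reported 0.55 GB/s.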

Known Issues & Future Work

The averaging is performed on the host after the sum reduction, using static_cast to convert the floating-point scaled result back to T. Supporting CPU custom types cleanly would require a unified host-side Cast (the existing TODO(lzm)); this limitation is shared across the reduce family.
For integer dtypes the scaled result is truncated toward zero when it is cast back to T (e.g. 100 / 16 = 6.25 → 6), consistent with the behavior already shipped in Reduce. NCCL-exact rounding is not attempted.
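The integer truncation semantics can be checked in isolation. This sketch assumes `scale` is the `double` `1.0 / world_size`; `ScaledAvg` and `DividedAvg` are hypothetical names. A `static_cast` from `double` to an integer truncates toward zero, so a negative sum such as -100 over 16 ranks yields -6. Scaling by a reciprocal rather than dividing has one edge case: `1.0 / 49` is not exact in IEEE 754 double precision, and `49.0 * (1.0 / 49)` evaluates to 0.9999999999999999. A sum of 49 over 49 ranks therefore truncates to 0 rather than 1. Plain integer division, which C++ also defines to truncate toward zero, has no such rounding step.

```cpp
#include <type_traits>

// Hypothetical model of the fixed integer path: scale by the reciprocal in
// double, then truncate toward zero when casting back to T.
template <typename T>
T ScaledAvg(T sum, int world_size) {
  static_assert(std::is_integral_v<T>, "`T` must be an integer type.");
  const double scale = 1.0 / world_size;
  return static_cast<T>(static_cast<double>(sum) * scale);
}

// Hypothetical alternative: exact integer division, which also truncates
// toward zero but never rounds an intermediate reciprocal.
template <typename T>
T DividedAvg(T sum, int world_size) {
  static_assert(std::is_integral_v<T>, "`T` must be an integer type.");
  return static_cast<T>(sum / world_size);
}
```

Both helpers agree on the 100 / 16 example and on negative sums. They differ only in cases where the rounded reciprocal pushes the product just below an integer, such as 49 / 49.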
The fp16/bf16 reduction limitation is unchanged and out of scope here: kFloat16 / kBFloat16 map to MPI_BYTE, so reducing them as raw bytes is incorrect. This is a pre-existing, codebase-wide limitation shared by all reduce-family collectives, pending a unified Cast / typed-reduction path.
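Adding the raw bytes of two half-precision `1.0` values by hand shows why a byte-wise reduction cannot produce the right answer. This is a standalone illustration, not a model of what the MPI library actually does with `MPI_BYTE`. `AddBytewise` and `HalfToFloat` are hypothetical helpers, and bytes are extracted arithmetically, so no byte order is assumed.

```cpp
#include <cmath>
#include <cstdint>
#include <limits>

// Hypothetical helper: adds two 16-bit patterns one byte at a time with 8-bit
// wraparound and no carry between bytes, as a naive byte-wise sum would.
std::uint16_t AddBytewise(std::uint16_t a, std::uint16_t b) {
  const auto lo = static_cast<std::uint8_t>((a & 0xFF) + (b & 0xFF));
  const auto hi = static_cast<std::uint8_t>((a >> 8) + (b >> 8));
  return static_cast<std::uint16_t>((hi << 8) | lo);
}

// Hypothetical helper: decodes an IEEE 754 binary16 bit pattern to float.
float HalfToFloat(std::uint16_t h) {
  const int sign = (h >> 15) & 0x1;
  const int exponent = (h >> 10) & 0x1F;
  const int mantissa = h & 0x3FF;
  float value;
  if (exponent == 0) {
    // Subnormal: mantissa * 2^-24.
    value = std::ldexp(static_cast<float>(mantissa), -24);
  } else if (exponent == 0x1F) {
    value = mantissa == 0 ? std::numeric_limits<float>::infinity()
                          : std::numeric_limits<float>::quiet_NaN();
  } else {
    // Normal: (1024 + mantissa) * 2^(exponent - 25).
    value = std::ldexp(static_cast<float>(mantissa + 0x400), exponent - 25);
  }
  return sign ? -value : value;
}
```

The binary16 pattern for 1.0 is 0x3C00. Adding it to itself byte by byte gives 0x7800, which decodes to 32768.0 rather than 2.0 (0x4000). The carry between the exponent and mantissa fields is exactly what a typed reduction handles and a byte-wise one cannot.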

Test Results

Validated on a MetaX–NVIDIA heterogeneous cluster over the OpenMPI backend via scripts/run_examples.py:

server: NVIDIA, 4 GPUs, ranks 0–3 (built with Devices [cpu, nvidia], Backends [ompi]).
test: MetaX, 4 GPUs, ranks 4–7 (built with Devices [cpu, metax], Backends [ompi]).
8 ranks total; message size 1,048,576 float32 (4 MB) per rank (ReduceScatter: 4 MB recv / 32 MB send per rank); 2 warm-up + 20 profiled iterations.
All bundled example programs report Correct: YES.

Note: the bundled examples all run Float32 + Sum, which does not exercise the integer Avg path that this PR fixes. The fix was therefore additionally verified with a dedicated int32 / int64 + Avg check driving the real infinicclAllReduce / infinicclReduceScatter: before the fix both ops returned 0 (the entire result zeroed), after the fix both return the correct average. The full example regression above confirms the unaffected Float32 + Sum path is not broken.

Test Involved Platform

Test Involved Backend

OpenMPI
MPICH

all_gather.log
all_reduce.log
all_to_all.log
broadcast.log
gather.log
reduce.log
reduce_scatter.log
scatter.log
send_recv.log

Checklist

Every contributor must verify every item below before requesting
review. Tick each box only after the check has actually been performed —
do not tick speculatively. If an item truly does not apply, replace the
checkbox with N/A and briefly explain why in an inline comment.

Title, Branch, and Commits

PR title follows Conventional Commits (e.g. feat: …, fix(nccl): …).
Branch name follows <type>/xxx-yyyy-zzzz where <type> matches the PR title's Conventional Commits type and words are joined with hyphens (see CONTRIBUTING.md §Branches).
Each commit message follows Conventional Commits.
Small PR is a single squashable commit; or, for a large PR, every commit is meaningful, well-formed, and independently reviewable (see CONTRIBUTING.md §Pull Requests).
No stray merge commits from master — the branch is rebased cleanly on top of the current master.
No fixup! / squash! / wip commits remain.

Scope and Design

Changes are minimal — no unrelated modifications were introduced (CONTRIBUTING.md §Code/General).
No dead code, commented-out blocks, debug prints, printf/std::cout/print(...) left behind, or TODO without an owner and issue link.
No unrelated formatting churn that would obscure the diff.
N/A- Public API changes (if any) are intentional, documented, and reflected in affected callers/tests.

General Code Hygiene

The code is self-explanatory; comments were added only where the intent or rationale is non-obvious (CONTRIBUTING.md §Code/General).
Every modified or added file ends with a single trailing newline (CONTRIBUTING.md §Code/General).
No trailing whitespace, inconsistent indentation, or mixed formatting styles remain.
Identifiers referenced in comments or error messages are wrapped in Markdown backticks (e.g. the `AllReduce` implementation) (CONTRIBUTING.md §Code/General).
All comments and error messages are in English (CONTRIBUTING.md §Code/General).
Comments and error messages are complete sentences — capitalized first letter, terminal punctuation — unless the language/framework convention says otherwise (CONTRIBUTING.md §Code/General; §Python).

C++ Specific (if C++ files changed)

Code follows the Google C++ Style Guide strictly.
clang-format (version 16, per .github/workflows/clang-format.yml) has been run against all modified applicable files; the diff is clean.
No exceptions are thrown. Error paths use assert with messages that include at least __FILE__, __LINE__, and __func__ (CONTRIBUTING.md §C++).
N/A- Error and warning message wording follows the LLVM Coding Standards (CONTRIBUTING.md §C++).
N/A- Constructor initializer list order matches member declaration order (CONTRIBUTING.md §C++).
Exactly one blank line between classes, between classes and functions, and between functions (CONTRIBUTING.md §C++).
Exactly one blank line between members (functions and variables) within a class (CONTRIBUTING.md §C++).
Exactly one blank line before and after the contents of a namespace (CONTRIBUTING.md §C++).

Python Specific (if Python files changed)

N/A- Code is PEP 8 compliant; ruff check passes cleanly on CI (see .github/workflows/ruff.yml).
N/A- ruff format --check passes cleanly — if not, run ruff format and commit the result.
N/A- Comments are complete English sentences, starting with a capital letter and ending with punctuation; Markdown backticks are used for code references (CONTRIBUTING.md §Python).
N/A- Framework-specific conventions (e.g. lowercase pytest.skip messages without terminal period) are honored where applicable (CONTRIBUTING.md §Python).
N/A- No blank line between the function signature and the body when there is no docstring or comment (CONTRIBUTING.md §Python).
N/A- A blank line is present before and after if, for, and similar control-flow statements (CONTRIBUTING.md §Python).
N/A- A blank line appears before each return, except when it directly follows a control-flow statement (CONTRIBUTING.md §Python).
N/A- Docstrings (if any) follow PEP 257 (CONTRIBUTING.md §Python).
N/A- Type hints are added / kept consistent with the surrounding code.

Testing

All applicable example programs have been built and tested successfully on at least one supported heterogeneous cluster setup.

Build, CI, and Tooling

N/A- New backends or devices have been added to auto-detection in CMakeLists.txt under if(AUTO_DETECT_DEVICES) or to if(AUTO_DETECT_BACKENDS) if applicable.
Both CI workflows (clang-format.yml, ruff.yml) are green locally (or expected to be green on CI).

Documentation

N/A- README.md, CONTRIBUTING.md, or inline docs updated when behavior, build flags, or developer workflow changed.
N/A- Any user-visible breaking change is called out explicitly under "Summary" and in the commit/PR title with a ! or BREAKING CHANGE: footer.

Security and Safety

No secrets, access tokens, internal URLs, customer data, or personal hardware identifiers have been committed.
N/A- Third-party code is license-compatible and attributed.
No unsafe pointer arithmetic, uninitialized reads, or missing bounds checks were introduced.

Commit 5c9cb40: fix(ompi): correct integer `Avg` scaling in `AllReduce` and `ReduceScatter`

GordonYang1 requested a review from Ziminli June 10, 2026 07:24

GordonYang1 wants to merge 1 commit into InfiniTensor:master from GordonYang1:fix/reduce-avg-calculation