
fix(ompi): correct integer Avg scaling in AllReduce and ReduceScatter #37

Open
GordonYang1 wants to merge 1 commit into InfiniTensor:master from GordonYang1:fix/reduce-avg-calculation


GordonYang1 commented Jun 10, 2026

Collaborator

Summary

This PR fixes an integer Avg (average) calculation error in the OpenMPI implementations of the reduce-family collectives AllReduce and ReduceScatter.

The host-side averaging step scaled each element with typed_buf[i] *= static_cast<T>(scale), where scale = 1 / world_size. For any world_size > 1 this reciprocal is a fraction in (0, 1); when the element type T is an integer, static_cast<T>(scale) truncates it to 0, so the entire Avg result is zeroed out. The fix performs the scaling in floating point (through double) and then casts back to T for integer types, while leaving the floating-point path unchanged. This mirrors the integer Avg fix already applied to Reduce in #28, restoring consistency across the reduce family.
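The failure mode described above can be reproduced in isolation. The following is a minimal sketch, not the actual code from all_reduce.h: the function name ScaleAvgTruncating and the std::vector buffer are hypothetical stand-ins, and scale is assumed to be a double holding 1.0 / world_size.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the pre-fix host-side Avg scaling step.
// Assumption: `scale` is a double equal to 1.0 / world_size.
template <typename T>
void ScaleAvgTruncating(std::vector<T>& typed_buf, int world_size) {
  assert(world_size > 0);
  const double scale = 1.0 / world_size;
  for (std::size_t i = 0; i < typed_buf.size(); ++i) {
    // For an integral T and world_size > 1, static_cast<T>(scale) is 0,
    // so every element is multiplied by zero.
    typed_buf[i] *= static_cast<T>(scale);
  }
}
```

With an integer element type and world_size = 4, every element of the buffer becomes 0, while the same call on a float buffer scales as intended.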

Changes

  • Reduce-family Avg correctness fix

    • src/ompi/impl/all_reduce.h: guard the host-side Avg scaling with if constexpr (std::is_integral_v<T>); for integer types scale through double (static_cast<T>(static_cast<double>(typed_buf[i]) * scale)) before casting back to T, and keep the existing in-place multiply for floating-point types.
    • src/ompi/impl/reduce_scatter.h: apply the identical integer-safe Avg scaling fix.
    • This aligns both ops with the Reduce implementation fixed in #28 (feat: support Reduce with OpenMPI backend implementation), so all three reduce-family collectives now share the same correct averaging behavior.
  • Includes

    • add #include <type_traits> to both files for std::is_integral_v.
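The guarded scaling described in the Changes list can be sketched roughly as follows. This is an illustration under assumptions rather than the code in the diff: the helper name ScaleAvg and its signature are hypothetical, and only the two expressions inside the branches are taken from the description above. It requires C++17 for if constexpr.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <type_traits>

// Hypothetical helper showing the integer-safe Avg scaling.
// Assumption: `scale` is a double equal to 1.0 / world_size.
template <typename T>
void ScaleAvg(T* typed_buf, std::size_t count, int world_size) {
  assert(world_size > 0);
  const double scale = 1.0 / world_size;
  for (std::size_t i = 0; i < count; ++i) {
    if constexpr (std::is_integral_v<T>) {
      // Integer path: scale in double, then cast the result back to T.
      typed_buf[i] = static_cast<T>(static_cast<double>(typed_buf[i]) * scale);
    } else {
      // Floating-point path: the original in-place multiply is kept.
      typed_buf[i] *= static_cast<T>(scale);
    }
  }
}
```

Because the branch is selected at compile time, the floating-point instantiation contains only the original multiply, which is consistent with the claim that the floating-point path is unchanged.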

Platform and Backend Affected

Platform

  • N/A- CPU
  • N/A- NVIDIA GPU
  • N/A- Iluvatar GPU
  • N/A- MetaX GPU
  • N/A- Moore Threads GPU
  • N/A- Cambricon MLU

Backend

  • OpenMPI
  • MPICH

Performance Impact

  • No performance impact
  • Performance improved
  • Performance regression possible

The averaging loop still runs once over the output buffer exactly as before; for integer types each element is now computed through double, a negligible host-side per-element cost, and the floating-point path is byte-for-byte unchanged. For reference, the heterogeneous run (8 ranks, 4 MB per rank, Float32 + Sum) measured AllReduce at 12.352 ms (0.55 GB/s bus BW), Reduce at 5.640 ms (1.21 GB/s bus BW), and ReduceScatter at 77.529 ms (4 MB recv / 32 MB send per rank).

Known Issues & Future Work

  • The averaging is performed on the host after the sum reduction, using static_cast to convert the scaled floating-point result back to T. A unified host-side Cast (the existing TODO(lzm)) would be needed to support CPU custom types cleanly; this limitation is shared across the reduce family.
  • For integer dtypes the average truncates toward zero after dividing (e.g. 100 / 16 → 6), consistent with the behavior already shipped in Reduce. NCCL-exact rounding is not attempted.
  • The fp16/bf16 reduction limitation is unchanged and out of scope here: kFloat16 / kBFloat16 map to MPI_BYTE, so reducing them as raw bytes is incorrect. This is a pre-existing, codebase-wide limitation shared by all reduce-family collectives, pending a unified Cast / typed-reduction path.
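To make the integer rounding behavior noted above concrete, here is a small sketch of the per-element computation. The helper AvgTruncated is hypothetical and exists only for illustration; it applies the same double-then-cast expression to a single already-summed value.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical single-element version of the integer Avg scaling.
// The cast from double to an integer type discards the fractional part,
// so the result is truncated toward zero rather than rounded or floored.
inline std::int32_t AvgTruncated(std::int32_t sum, int world_size) {
  assert(world_size > 0);
  const double scale = 1.0 / world_size;
  return static_cast<std::int32_t>(static_cast<double>(sum) * scale);
}
```

For example, AvgTruncated(100, 16) yields 6 (from 6.25) and AvgTruncated(-100, 16) yields -6 (from -6.25), whereas a floor-based average would give -7 for the negative case.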

Test Results

Validated on a MetaX–NVIDIA heterogeneous cluster over the OpenMPI backend via scripts/run_examples.py:

  • server: NVIDIA, 4 GPUs, ranks 0–3 (built with Devices [cpu, nvidia], Backends [ompi]).
  • test: MetaX, 4 GPUs, ranks 4–7 (built with Devices [cpu, metax], Backends [ompi]).
  • 8 ranks total; message size 1,048,576 float32 (4 MB) per rank (ReduceScatter: 4 MB recv / 32 MB send per rank); 2 warm-up + 20 profiled iterations.
  • All bundled example programs report Correct: YES.

Note: the bundled examples all run Float32 + Sum, which does not exercise the integer Avg path that this PR fixes. The fix was therefore additionally verified with a dedicated int32 / int64 + Avg check driving the real infinicclAllReduce / infinicclReduceScatter: before the fix both ops returned 0 (the entire result zeroed), after the fix both return the correct average. The full example regression above confirms the unaffected Float32 + Sum path is not broken.

Test Involved Platform

  • CPU
  • NVIDIA GPU
  • Iluvatar GPU
  • MetaX GPU
  • Moore Threads GPU
  • Cambricon MLU

Test Involved Backend

  • OpenMPI
  • MPICH

all_gather.log
all_reduce.log
all_to_all.log
broadcast.log
gather.log
reduce.log
reduce_scatter.log
scatter.log
send_recv.log


Checklist

Every contributor must verify every item below before requesting
review. Tick each box only after the check has actually been performed —
do not tick speculatively. If an item truly does not apply, replace the
checkbox with N/A and briefly explain why in an inline comment.

Title, Branch, and Commits

  • PR title follows Conventional Commits (e.g. feat: …, fix(nccl): …).
  • Branch name follows <type>/xxx-yyyy-zzzz where <type> matches the PR title's Conventional Commits type and words are joined with hyphens (see CONTRIBUTING.md §Branches).
  • Each commit message follows Conventional Commits.
  • Small PR is a single squashable commit; or, for a large PR, every commit is meaningful, well-formed, and independently reviewable (see CONTRIBUTING.md §Pull Requests).
  • No stray merge commits from master — the branch is rebased cleanly on top of the current master.
  • No fixup! / squash! / wip commits remain.

Scope and Design

  • Changes are minimal — no unrelated modifications were introduced (CONTRIBUTING.md §Code/General).
  • No dead code, commented-out blocks, debug prints, printf/std::cout/print(...) left behind, or TODO without an owner and issue link.
  • No unrelated formatting churn that would obscure the diff.
  • N/A- Public API changes (if any) are intentional, documented, and reflected in affected callers/tests.

General Code Hygiene

  • The code is self-explanatory; comments were added only where the intent or rationale is non-obvious (CONTRIBUTING.md §Code/General).
  • Every modified or added file ends with a single trailing newline (CONTRIBUTING.md §Code/General).
  • No trailing whitespace, inconsistent indentation, or mixed formatting styles remain.
  • Identifiers referenced in comments or error messages are wrapped in Markdown backticks (e.g. the `AllReduce` implementation) (CONTRIBUTING.md §Code/General).
  • All comments and error messages are in English (CONTRIBUTING.md §Code/General).
  • Comments and error messages are complete sentences — capitalized first letter, terminal punctuation — unless the language/framework convention says otherwise (CONTRIBUTING.md §Code/General; §Python).

C++ Specific (if C++ files changed)

  • Code follows the Google C++ Style Guide strictly.
  • clang-format (version 16, per .github/workflows/clang-format.yml) has been run against all modified applicable files; the diff is clean.
  • No exceptions are thrown. Error paths use assert with messages that include at least __FILE__, __LINE__, and __func__ (CONTRIBUTING.md §C++).
  • N/A- Error and warning message wording follows the LLVM Coding Standards (CONTRIBUTING.md §C++).
  • N/A- Constructor initializer list order matches member declaration order (CONTRIBUTING.md §C++).
  • Exactly one blank line between classes, between classes and functions, and between functions (CONTRIBUTING.md §C++).
  • Exactly one blank line between members (functions and variables) within a class (CONTRIBUTING.md §C++).
  • Exactly one blank line before and after the contents of a namespace (CONTRIBUTING.md §C++).

Python Specific (if Python files changed)

  • N/A- Code is PEP 8 compliant; ruff check passes cleanly on CI (see .github/workflows/ruff.yml).
  • N/A- ruff format --check passes cleanly — if not, run ruff format and commit the result.
  • N/A- Comments are complete English sentences, starting with a capital letter and ending with punctuation; Markdown backticks are used for code references (CONTRIBUTING.md §Python).
  • N/A- Framework-specific conventions (e.g. lowercase pytest.skip messages without terminal period) are honored where applicable (CONTRIBUTING.md §Python).
  • N/A- No blank line between the function signature and the body when there is no docstring or comment (CONTRIBUTING.md §Python).
  • N/A- A blank line is present before and after if, for, and similar control-flow statements (CONTRIBUTING.md §Python).
  • N/A- A blank line appears before each return, except when it directly follows a control-flow statement (CONTRIBUTING.md §Python).
  • N/A- Docstrings (if any) follow PEP 257 (CONTRIBUTING.md §Python).
  • N/A- Type hints are added / kept consistent with the surrounding code.

Testing

  • All applicable example programs have been built and tested successfully on at least one supported heterogeneous cluster setup.

Build, CI, and Tooling

  • N/A- New backends or devices have been added to auto-detection in CMakeLists.txt under if(AUTO_DETECT_DEVICES) or to if(AUTO_DETECT_BACKENDS) if applicable.
  • Both CI workflows (clang-format.yml, ruff.yml) are green locally (or expected to be green on CI).

Documentation

  • N/A- README.md, CONTRIBUTING.md, or inline docs updated when behavior, build flags, or developer workflow changed.
  • N/A- Any user-visible breaking change is called out explicitly under "Summary" and in the commit/PR title with a ! or BREAKING CHANGE: footer.

Security and Safety

  • No secrets, access tokens, internal URLs, customer data, or personal hardware identifiers have been committed.
  • N/A- Third-party code is license-compatible and attributed.
  • No unsafe pointer arithmetic, uninitialized reads, or missing bounds checks were introduced.

@GordonYang1 GordonYang1 requested a review from Ziminli June 10, 2026 07:24
