
[4970] Generate tuning inputs on GPU via splitmix64 device RNG#4971

Open
itikhono wants to merge 2 commits into
ROCm:developfrom
itikhono:gpu-device-bench-inputs


@itikhono itikhono commented Jun 16, 2026

Contributor

This PR covers the first part of issue #4970.
It eliminates the "input-gen + H2D (CPU waste)" part; the GPU part (caused by the bundle increase 1->10) remains.

  • During op/program tuning, candidate inputs were generated on the host (xorshf96 PRNG) and copied to the device for every candidate. This replaces that with a device kernel that fills tuning inputs directly on the GPU, removing the per-candidate host PRNG + H2D copy.
  • New device::generate_random uses a counter-based splitmix64 RNG (seed + i * golden_ratio_step → splitmix64), so output is deterministic per seed and reproducible across candidates for fair comparison.
  • time_program now allocates inputs with allocate_gpu and fills them via gpu_generate_random (recurses tuple sub-objects), while fill_map inputs keep the host-fill path.
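
A minimal host-side sketch of the counter-based idea described above, not the PR's actual HIP kernel. The mixing constants are those of the publicly known splitmix64 finalizer; the names `random_at`, `to_unit_float`, and `generate`, and the mapping of 64-bit outputs to floats in [0, 1), are assumptions made for illustration only.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Weyl-sequence increment used by splitmix64 (2^64 / golden ratio).
constexpr std::uint64_t golden_ratio_step = 0x9E3779B97F4A7C15ULL;

// splitmix64 finalizer: mixes a 64-bit counter into a well-scrambled value.
constexpr std::uint64_t splitmix64_mix(std::uint64_t z)
{
    z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
    z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
    return z ^ (z >> 31);
}

// Counter-based access: element i depends only on (seed, i), so every
// element can be computed independently, e.g. one per GPU thread.
constexpr std::uint64_t random_at(std::uint64_t seed, std::uint64_t i)
{
    return splitmix64_mix(seed + i * golden_ratio_step);
}

// Hypothetical mapping to a float in [0, 1): keep the top 24 bits.
inline float to_unit_float(std::uint64_t x)
{
    return static_cast<float>(x >> 40) * (1.0f / 16777216.0f);
}

// Host stand-in for a device fill: the loop body is what each thread would run.
inline std::vector<float> generate(std::uint64_t seed, std::size_t n)
{
    std::vector<float> out(n);
    for(std::size_t i = 0; i < n; ++i)
        out[i] = to_unit_float(random_at(seed, i));
    return out;
}
```

Because there is no sequential state, two calls to `generate` with the same seed and length produce identical buffers, which is the property that makes candidate timings comparable.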

Behavior parity with the old host path

  • bool: handled by visit_all → normalize<bool> → 0/1, identical to the old special-case.
  • fp4x2 (only non-computable type): visit_all would throw, so generation falls back to a raw byte fill — matching the old uint8 host behavior.
  • tuples: same seed across sub-objects, same as the previous generate_argument.
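
The three parity rules above can be modeled on the host roughly as follows. This is a sketch under stated assumptions, not MIGraphX code: `buffer`, `buffer_kind`, `fill_bool`, `fill_raw_bytes`, and `fill` are hypothetical names, and taking the low bit for bool and the top byte for raw fills are illustrative choices rather than what the PR does.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Counter-based splitmix64 value for index i (same scheme as described above).
inline std::uint64_t random_at(std::uint64_t seed, std::uint64_t i)
{
    std::uint64_t z = seed + i * 0x9E3779B97F4A7C15ULL;
    z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
    z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
    return z ^ (z >> 31);
}

// Hypothetical buffer model: a leaf of bytes or a tuple of sub-buffers.
enum class buffer_kind { boolean, raw, tuple };

struct buffer
{
    buffer_kind k = buffer_kind::raw;
    std::vector<std::uint8_t> bytes; // used by boolean and raw leaves
    std::vector<buffer> subs;        // used by tuples
};

// bool: every element is normalized to exactly 0 or 1 (low bit is an assumption).
inline void fill_bool(std::vector<std::uint8_t>& dst, std::uint64_t seed)
{
    for(std::size_t i = 0; i < dst.size(); ++i)
        dst[i] = static_cast<std::uint8_t>(random_at(seed, i) & 1u);
}

// Non-computable types: no arithmetic conversion, just raw random bytes.
inline void fill_raw_bytes(std::vector<std::uint8_t>& dst, std::uint64_t seed)
{
    for(std::size_t i = 0; i < dst.size(); ++i)
        dst[i] = static_cast<std::uint8_t>(random_at(seed, i) >> 56);
}

// Tuples recurse, passing the same seed to every sub-object.
inline void fill(buffer& b, std::uint64_t seed)
{
    switch(b.k)
    {
    case buffer_kind::boolean: fill_bool(b.bytes, seed); break;
    case buffer_kind::raw: fill_raw_bytes(b.bytes, seed); break;
    case buffer_kind::tuple:
        for(auto& sub : b.subs)
            fill(sub, seed);
        break;
    }
}
```

One consequence of reusing the seed is that two tuple sub-objects of the same type and size receive identical contents, which matches the behavior the description attributes to the previous generate_argument path.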

Performance

  • No FPS regression across the YOLO model family (within noise).
  • Compile/tuning time improved by up to ~6.6x at batch 64 on MI350 and ~10x at batch 32 on R9700 (measured together with reverting the bundle increase 1->10).

Test plan

  • test_gpu_generate_random: seed determinism + range, half type, empty shape no-op, tuple fills every sub-buffer, non-computable (fp4x2) raw-byte fill — 5/5 pass.
  • YOLO compile + inference sweep (fork vs develop).

Perf testing for YOLO-family models (MI350):

Measured with migraphx-driver perf; no real difference was detected, and the results are quite noisy.


Different models, batch 4:

| Model | Fixed, img/s | Develop (before), img/s | Δ |
|---|---|---|---|
| yolov8m | 1255.7 | 1242.5 | +1.1% |
| yolov9m | 1177.1 | 1111.2 | +5.9% |
| yolov10m | 1337.1 | 1303.4 | +2.6% |
| yolo11m | 1583.1 | 1564.2 | +1.2% |
| yolo12m | 1341.4 | 1350.1 | −0.6% |
| yolo26m | 1415.0 | 1407.2 | +0.6% |

Copilot AI review requested due to automatic review settings June 16, 2026 17:16
@itikhono itikhono requested a review from causten as a code owner June 16, 2026 17:16
@github-actions

Contributor

Thank you for your contribution! Since this is an external pull request, a maintainer must review PR and add the "ok-to-test" label if it is approved for testing.

Copilot AI left a comment

Contributor

Pull request overview

This PR improves GPU compile/tuning throughput by generating candidate input buffers directly on the GPU (splitmix64 counter-based RNG) instead of generating on the host and copying H2D per candidate.

Changes:

  • Added a GPU-side random-fill kernel (device::generate_random) and a host wrapper (gpu_generate_random) that recurses into tuple sub-objects.
  • Updated time_program tuning path to allocate parameter buffers on GPU and fill them via gpu_generate_random (keeping fill_map on the host-fill path).
  • Added a GPU unit test covering determinism, supported types, empty shapes, tuples, and non-computable types.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

| File | Description |
|---|---|
| test/gpu/generate_random.cpp | New GPU test coverage for deterministic RNG fill, tuples, and non-computable types. |
| src/targets/gpu/time_op.cpp | Switches tuning input creation to GPU allocation + GPU RNG fill (except fill_map). |
| src/targets/gpu/include/migraphx/gpu/hip.hpp | Exposes gpu_generate_random API. |
| src/targets/gpu/include/migraphx/gpu/device/generate_random.hpp | Declares new device-side RNG entrypoint. |
| src/targets/gpu/hip.cpp | Implements gpu_generate_random wrapper with tuple recursion. |
| src/targets/gpu/device/generate_random.cpp | Implements splitmix64-based device kernel to fill buffers. |

Comment thread src/targets/gpu/time_op.cpp Outdated
Comment thread src/targets/gpu/device/generate_random.cpp Outdated
Comment thread src/targets/gpu/device/generate_random.cpp Outdated

MIGRAPHX_GPU_EXPORT void gpu_fill(context& ctx, const argument& dst, int value = 0);

MIGRAPHX_GPU_EXPORT void gpu_generate_random(context& ctx, const argument& dst, unsigned long seed);

Collaborator

This should take a shape instead of an argument and return an argument:

argument gpu_generate_random(context& ctx, const shape& s, unsigned long seed)

Contributor Author

I originally aligned it with the gpu_fill function, but I agree the new signature is better. Done.

Comment thread test/gpu/generate_random.cpp Outdated
@itikhono itikhono requested a review from pfultz2 June 17, 2026 09:28

3 participants