[4970] Generate tuning inputs on GPU via splitmix64 device RNG#4971
Open
itikhono wants to merge 2 commits into
Conversation
Contributor
Thank you for your contribution! Since this is an external pull request, a maintainer must review PR and add the "ok-to-test" label if it is approved for testing.
Contributor
Pull request overview
This PR improves GPU compile/tuning throughput by generating candidate input buffers directly on the GPU (splitmix64 counter-based RNG) instead of generating on the host and copying H2D per candidate.
Changes:
- Added a GPU-side random-fill kernel (`device::generate_random`) and a host wrapper (`gpu_generate_random`) that recurses into tuple sub-objects.
- Updated `time_program` tuning path to allocate parameter buffers on GPU and fill them via `gpu_generate_random` (keeping `fill_map` on the host-fill path).
- Added a GPU unit test covering determinism, supported types, empty shapes, tuples, and non-computable types.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| test/gpu/generate_random.cpp | New GPU test coverage for deterministic RNG fill, tuples, and non-computable types. |
| src/targets/gpu/time_op.cpp | Switches tuning input creation to GPU allocation + GPU RNG fill (except fill_map). |
| src/targets/gpu/include/migraphx/gpu/hip.hpp | Exposes gpu_generate_random API. |
| src/targets/gpu/include/migraphx/gpu/device/generate_random.hpp | Declares new device-side RNG entrypoint. |
| src/targets/gpu/hip.cpp | Implements gpu_generate_random wrapper with tuple recursion. |
| src/targets/gpu/device/generate_random.cpp | Implements splitmix64-based device kernel to fill buffers. |
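
For readers unfamiliar with the technique, the counter-based splitmix64 fill named in the table above can be sketched on the host in plain C++. This is only an illustration under stated assumptions, not the kernel from this PR: the names `sketch_mix`, `sketch_random_at`, `sketch_to_unit_float` and `sketch_fill_random` are hypothetical, and the mapping to a float in [0, 1) is an assumed choice. The mixing constants are the standard splitmix64 ones.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Standard splitmix64 output mixing step.
inline std::uint64_t sketch_mix(std::uint64_t z)
{
    z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
    z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
    return z ^ (z >> 31);
}

// Counter-based draw: element i depends only on (seed, i), so every
// GPU thread could compute its own element without shared RNG state.
inline std::uint64_t sketch_random_at(std::uint64_t seed, std::uint64_t i)
{
    constexpr std::uint64_t golden_ratio_step = 0x9E3779B97F4A7C15ULL;
    return sketch_mix(seed + i * golden_ratio_step);
}

// Assumed mapping (not taken from the PR): top 24 bits -> float in [0, 1).
inline float sketch_to_unit_float(std::uint64_t bits)
{
    return static_cast<float>(bits >> 40) * (1.0f / 16777216.0f);
}

// Host-side stand-in for the device fill: one independent draw per index.
std::vector<float> sketch_fill_random(std::size_t n, std::uint64_t seed)
{
    std::vector<float> out(n);
    for(std::size_t i = 0; i < n; ++i)
        out[i] = sketch_to_unit_float(sketch_random_at(seed, i));
    return out;
}
```

Because each element is a pure function of the seed and its index, two fills with the same seed produce identical buffers, which is the property that makes candidate timings comparable.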
pfultz2
requested changes
Jun 16, 2026
MIGRAPHX_GPU_EXPORT void gpu_fill(context& ctx, const argument& dst, int value = 0);
MIGRAPHX_GPU_EXPORT void gpu_generate_random(context& ctx, const argument& dst, unsigned long seed);
Collaborator
This should take a shape instead of an argument and return an argument:
argument gpu_generate_random(context& ctx, const shape& s, unsigned long seed)
Contributor
Author
I originally aligned it with the gpu_fill function, but I agree the new signature is better. Done.
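
Editorial illustration of the shape-in, argument-out form discussed in this thread, including recursion into tuple sub-objects. Everything here is a hypothetical stand-in: `toy_shape`, `toy_argument` and `toy_generate_random` are not MIGraphX types or functions, there is no `context` parameter, and reusing the same seed for every sub-object is an assumption made only for the sketch.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a shape: either a flat buffer of `elements`
// values, or a tuple when `sub_shapes` is non-empty.
struct toy_shape
{
    std::size_t elements = 0;
    std::vector<toy_shape> sub_shapes;
};

// Hypothetical stand-in for an argument that owns its data.
struct toy_argument
{
    std::vector<std::uint64_t> data;
    std::vector<toy_argument> sub_objects;
};

// Standard splitmix64 output mixing step.
inline std::uint64_t toy_mix(std::uint64_t z)
{
    z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
    z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
    return z ^ (z >> 31);
}

// Takes a shape and returns a freshly allocated, filled argument,
// so the caller never has to allocate the destination first.
toy_argument toy_generate_random(const toy_shape& s, std::uint64_t seed)
{
    toy_argument result;
    if(!s.sub_shapes.empty())
    {
        // Tuple: fill every sub-buffer (same seed reused; an assumption).
        for(const auto& sub : s.sub_shapes)
            result.sub_objects.push_back(toy_generate_random(sub, seed));
        return result;
    }
    result.data.resize(s.elements); // empty shape: nothing to fill
    for(std::size_t i = 0; i < s.elements; ++i)
        result.data[i] = toy_mix(seed + i * 0x9E3779B97F4A7C15ULL);
    return result;
}
```

With this form, allocation and fill happen in one call, which is the practical difference from a `gpu_fill`-style function that writes into a caller-provided destination.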
pfultz2
approved these changes
Jun 17, 2026
This PR covers the first part of issue #4970.

It eliminates the "input-gen + H2D (CPU waste)" part; the GPU part (caused by the bundle increase 1->10) remains.
- `device::generate_random` uses a counter-based splitmix64 RNG (`seed + i * golden_ratio_step` → splitmix64), so output is deterministic per seed and reproducible across candidates for fair comparison.
- `time_program` now allocates inputs with `allocate_gpu` and fills them via `gpu_generate_random` (which recurses into tuple sub-objects), while `fill_map` inputs keep the host-fill path.

Behavior parity with the old host path
- bool: `visit_all` → `normalize<bool>` → 0/1, identical to the old special case.
- Non-computable types: `visit_all` would throw, so generation falls back to a raw byte fill, matching the old `uint8` host behavior.
- generate_argument.

Performance

Test plan
- `test_gpu_generate_random`: seed determinism + range, half type, empty shape no-op, tuple fills every sub-buffer, non-computable (fp4x2) raw-byte fill. 5/5 pass.

Perf testing for YOLO-family models (MI350):
Used `migraphx-driver perf`; no actual difference was detected, and the results are quite noisy.
different models, batch 4
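
The two parity cases in the description (bool normalized to 0/1, and a raw byte fill for types that cannot be visited) can be pictured with a small host-side sketch. This is a rough illustration under assumptions, not the code in this PR: `parity_mix`, `parity_fill_bool` and `parity_fill_raw_bytes` are hypothetical names, and taking the low bit or low byte of each draw is an assumed reduction.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Standard splitmix64 output mixing step.
inline std::uint64_t parity_mix(std::uint64_t z)
{
    z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
    z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
    return z ^ (z >> 31);
}

// bool case: every stored element is exactly 0 or 1.
// (Low-bit reduction is an assumption made for this sketch.)
std::vector<std::uint8_t> parity_fill_bool(std::size_t n, std::uint64_t seed)
{
    std::vector<std::uint8_t> out(n);
    for(std::size_t i = 0; i < n; ++i)
        out[i] = static_cast<std::uint8_t>(parity_mix(seed + i * 0x9E3779B97F4A7C15ULL) & 1u);
    return out;
}

// Fallback case: the buffer is treated as opaque bytes, so any element
// type can be filled without interpreting its encoding.
std::vector<std::uint8_t> parity_fill_raw_bytes(std::size_t bytes, std::uint64_t seed)
{
    std::vector<std::uint8_t> out(bytes);
    for(std::size_t i = 0; i < bytes; ++i)
        out[i] = static_cast<std::uint8_t>(parity_mix(seed + i * 0x9E3779B97F4A7C15ULL) & 0xFFu);
    return out;
}
```

The byte-level fallback never needs to know the element type, which is why it can stand in for types such as fp4x2 that a typed visitor cannot handle.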