
Vectorize Resize#4967

Open
pfultz2 wants to merge 10 commits into develop from resize-vectorize


@pfultz2 pfultz2 commented Jun 15, 2026

Collaborator

Motivation

Enable vectorization for resize, which helps improve the performance of the kernel.

Technical Details

Because the input and output have different sizes, the vectorize arg transformer cannot be applied uniformly across all tensors. Instead, the output and the input are vectorized separately. Output vectorization can be applied in most cases, but input vectorization can only be applied when the fastest axis is not being resized, so it mainly helps layouts such as NHWC.
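The rule above can be sketched on the host side as follows. This is a minimal illustration, not the code in this PR: the names `fastest_axis`, `vector_width`, `vec_plan`, and `plan_vectorization` are hypothetical, the candidate widths 4 and 2 are an assumption, and the scale and coordinate-transform checks that the real eligibility logic also performs are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using dims = std::vector<std::size_t>;

// Hypothetical helper: the fastest axis is the one with the smallest stride.
std::size_t fastest_axis(const dims& strides)
{
    assert(!strides.empty());
    std::size_t axis = 0;
    for(std::size_t i = 1; i < strides.size(); i++)
    {
        if(strides[i] < strides[axis])
            axis = i;
    }
    return axis;
}

// Hypothetical helper: widest vector (4, 2, or 1) that divides the axis length.
std::size_t vector_width(std::size_t len)
{
    if(len % 4 == 0)
        return 4;
    if(len % 2 == 0)
        return 2;
    return 1;
}

struct vec_plan
{
    std::size_t axis;      // fastest axis of the output
    std::size_t out_width; // vector width used for output stores
    std::size_t in_width;  // vector width used for input loads (1 means scalar)
};

// Assumes the input and output share the same layout, so the fastest axis
// of the output is also the fastest axis of the input.
vec_plan plan_vectorization(const dims& in_lens, const dims& out_lens, const dims& out_strides)
{
    assert(in_lens.size() == out_lens.size());
    assert(out_lens.size() == out_strides.size());
    const std::size_t axis      = fastest_axis(out_strides);
    const std::size_t out_width = vector_width(out_lens[axis]);
    // The input can only be read as vectors when the fastest axis is not resized.
    const bool pass_through = in_lens[axis] == out_lens[axis];
    return {axis, out_width, pass_through ? out_width : 1};
}
```

For a 1x8x16x16 tensor resized to 1x8x32x32, NHWC strides make the channel axis the fastest one. It is not resized, so both loads and stores can be vectorized. With NCHW strides the fastest axis is the width, which is resized, so only the output stores are vectorized.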

Changelog Category

Add a CHANGELOG.md entry for any option other than Not Applicable

    • Added: New functionality.
    • Changed: Changes to existing functionality.
    • Removed: Functionality or support that has been removed. (Compared to a previous release)
    • Optimized: Component performance that has been optimized or improved.
    • Resolved Issues: Known issues from a previous version that have been resolved.
    • Not Applicable: This PR is not to be included in the changelog.

@pfultz2 pfultz2 changed the title from "Resize GPU improvements" to "Vectorize Resize" Jun 15, 2026
@gh-app-migraphx-bot-pr-write

gh-app-migraphx-bot-pr-write Bot commented Jun 15, 2026

| Test | Batch | New Rate (400886) | Old Rate (b69836) | Diff | Status |
|---|---|---|---|---|---|
| torchvision-resnet50 | 64 | nan | 3,154.40 | nan | |
| torchvision-resnet50_fp16 | 64 | nan | 6,635.18 | nan | |
| torchvision-densenet121 | 32 | nan | 2,694.69 | nan | |
| torchvision-densenet121_fp16 | 32 | nan | 4,526.13 | nan | |
| torchvision-inceptionv3 | 32 | nan | 1,797.12 | nan | |
| torchvision-inceptionv3_fp16 | 32 | nan | 2,819.16 | nan | |
| cadene-inceptionv4 | 16 | nan | 824.35 | nan | |
| cadene-resnext64x4 | 16 | nan | 783.08 | nan | |
| slim-mobilenet | 64 | nan | 8,386.62 | nan | |
| slim-nasnetalarge | 64 | nan | 228.34 | nan | |
| slim-resnet50v2 | 64 | nan | 3,313.18 | nan | |
| bert-mrpc-onnx | 8 | nan | 1,172.65 | nan | |
| bert-mrpc-tf | 1 | nan | 493.16 | nan | |
| pytorch-examples-wlang-gru | 1 | nan | 327.56 | nan | |
| pytorch-examples-wlang-lstm | 1 | nan | 465.24 | nan | |
| torchvision-resnet50_1 | 1 | nan | 768.99 | nan | |
| cadene-dpn92_1 | 1 | nan | 453.05 | nan | |
| cadene-resnext101_1 | 1 | nan | 363.80 | nan | |
| onnx-taau-downsample | 1 | nan | 399.85 | nan | |
| dlrm-criteoterabyte | 1 | nan | 32.43 | nan | |
| dlrm-criteoterabyte_fp16 | 1 | nan | 51.82 | nan | |
| agentmodel | 1 | nan | 10,024.97 | nan | |
| unet_fp16 | 2 | nan | 56.82 | nan | |
| resnet50v1_fp16 | 1 | nan | 953.97 | nan | |
| resnet50v1_int8 | 1 | nan | 932.02 | nan | |
| bert_base_cased_fp16 | 64 | nan | 1,097.66 | nan | |
| bert_large_uncased_fp16 | 32 | nan | 346.32 | nan | |
| bert_large_fp16 | 1 | nan | 203.57 | nan | |
| distilgpt2_fp16 | 16 | nan | 2,085.63 | nan | |
| yolov5s | 1 | nan | 564.77 | nan | |
| tinyllama | 1 | nan | 45.96 | nan | |
| vicuna-fastchat | 1 | nan | 44.01 | nan | |
| whisper-tiny-encoder | 1 | nan | 417.38 | nan | |
| whisper-tiny-decoder | 1 | nan | 413.20 | nan | |
| llama2_7b | 1 | nan | 19.07 | nan | |
| qwen1.5-7b | 1 | nan | 22.86 | nan | |
| phi3-3.8b | 1 | nan | 25.80 | nan | |
| llama3-8b | 1 | nan | 18.06 | nan | |
| whisper-large-encoder | 1 | nan | 7.78 | nan | |
| whisper-large-decoder | 1 | nan | 7.09 | nan | |
| mistral-7b | 1 | nan | 23.37 | nan | |
| FLUX.1-schnell | 1 | nan | 305.98 | nan | |

Regressions detected 🔴

@gh-app-migraphx-bot-pr-write

gh-app-migraphx-bot-pr-write Bot commented Jun 15, 2026

Test Status Result
bert-mrpc-onnx PASSED: MIGraphX meets tolerance
bert-mrpc-tf ERROR - check error output
traceback
Traceback (most recent call last):
  File "/src/AMDMIGraphX/tools/accuracy/accuracy_checker.py", line 377, in <module>
    main()
  File "/src/AMDMIGraphX/tools/accuracy/accuracy_checker.py", line 313, in main
    import tensorflow as tf
  File "/usr/local/lib/python3.10/dist-packages/tensorflow/__init__.py", line 38, in <module>
    from tensorflow.python.tools import module_util as _module_util
  File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/__init__.py", line 36, in <module>
    from tensorflow.python import pywrap_tensorflow as _pywrap_tensorflow
  File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/pywrap_tensorflow.py", line 26, in <module>
    self_check.preload_check()
  File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/platform/self_check.py", line 63, in preload_check
    from tensorflow.python.platform import _pywrap_cpu_feature_guard
ImportError: libamdhip64.so.6: cannot open shared object file: No such file or directory
pytorch-examples-wlang-gru PASSED: MIGraphX meets tolerance
pytorch-examples-wlang-lstm PASSED: MIGraphX meets tolerance
dlrm-criteoterabyte PASSED: MIGraphX meets tolerance
agentmodel PASSED: MIGraphX meets tolerance
unet PASSED: MIGraphX meets tolerance
resnet50v1 PASSED: MIGraphX meets tolerance
bert_base_cased_fp16 PASSED: MIGraphX meets tolerance
bert_large_uncased_fp16 🔴 FAILED: MIGraphX is not within tolerance - check verbose output
bert_large PASSED: MIGraphX meets tolerance
yolov5s PASSED: MIGraphX meets tolerance
tinyllama PASSED: MIGraphX meets tolerance
vicuna-fastchat PASSED: MIGraphX meets tolerance
whisper-tiny-encoder PASSED: MIGraphX meets tolerance
whisper-tiny-decoder PASSED: MIGraphX meets tolerance
distilgpt2_fp16 🔴 FAILED: MIGraphX is not within tolerance - check verbose output
llama2_7b PASSED: MIGraphX meets tolerance
qwen1.5-7b PASSED: MIGraphX meets tolerance
phi3-3.8b PASSED: MIGraphX meets tolerance
llama3-8b PASSED: MIGraphX meets tolerance
whisper-large-encoder ERROR - check error output
traceback
2026-06-17 12:07:52.141276 [WARN] [/data/src/onnx/onnx_parser.cpp:282] Model has unbound symbolic dimension(s): batch_size, encoder_sequence_length, feature_size. These default to 1 and may cause unexpected behavior. Try setting --dim-param @<name> <value> or --input-dim @<input> <dims> if program compilation fails.
Traceback (most recent call last):
  File "/src/AMDMIGraphX/tools/accuracy/accuracy_checker.py", line 377, in <module>
    main()
  File "/src/AMDMIGraphX/tools/accuracy/accuracy_checker.py", line 224, in main
    model = migraphx.parse_onnx(model_name, default_dim_value=batch)
RuntimeError: /data/src/include/migraphx/op/convolution.hpp:113: normalize_compute_shape: CONVOLUTION: mismatched channel numbers: input channels (1) != weights channels (80) * group (1)
whisper-large-decoder PASSED: MIGraphX meets tolerance
mistral-7b PASSED: MIGraphX meets tolerance
FLUX.1-schnell PASSED: MIGraphX meets tolerance

@codecov

codecov Bot commented Jun 16, 2026


Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files
@@           Coverage Diff            @@
##           develop    #4967   +/-   ##
========================================
  Coverage    92.73%   92.73%           
========================================
  Files          594      594           
  Lines        31340    31340           
========================================
  Hits         29063    29063           
  Misses        2277     2277           

@pfultz2 pfultz2 marked this pull request as ready for review June 17, 2026 14:12
@pfultz2 pfultz2 requested a review from causten as a code owner June 17, 2026 14:12

Copilot AI left a comment

Contributor


Pull request overview

This PR updates the GPU resize JIT path to enable vectorized stores on the output (and conditionally vectorized loads on the input when the fastest axis is a true pass-through), aiming to improve kernel performance for common layouts like NHWC.

Changes:

  • Refactors the device-side resize kernels to run through a shared resize_apply wrapper that supports mixed (scalar vs vectorized) input/output handling.
  • Updates the GPU JIT resize kernel template to apply vectorize<N, Axis>() transformers to out (and to input when safe), and adjusts launch sizing to operate on vectorized output elements.
  • Adds host-side logic to decide when input vectorization is safe based on stride/scale/coordinate transform constraints.
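The mixed scalar and vectorized handling described in the list above can be illustrated with a CPU-side sketch. This is not the `resize_apply` wrapper from the PR: `resize_nearest_hwc`, the `vec` alias, and the floor-based source coordinate are assumptions chosen for illustration. Each loop iteration stands in for one GPU work item that loads and stores `N` contiguous channel values, so the number of work items is the output element count divided by `N`.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a GPU vector type holding N floats.
template <std::size_t N>
using vec = std::array<float, N>;

// Nearest-neighbour resize of an HWC image (channels are the fastest axis).
// N is the vector width; N == 1 is the plain scalar path.
template <std::size_t N>
std::vector<float> resize_nearest_hwc(const std::vector<float>& in,
                                      std::size_t ih,
                                      std::size_t iw,
                                      std::size_t oh,
                                      std::size_t ow,
                                      std::size_t c)
{
    assert(c % N == 0);
    assert(in.size() == ih * iw * c);
    std::vector<float> out(oh * ow * c);
    const std::size_t groups     = c / N;
    const std::size_t work_items = out.size() / N; // launch size counts vectors
    for(std::size_t i = 0; i < work_items; i++)
    {
        // Decompose the vector index into (row, column, channel group).
        const std::size_t g = i % groups;
        const std::size_t x = (i / groups) % ow;
        const std::size_t y = i / (groups * ow);
        // Assumed coordinate rule: floor(dst * in_len / out_len).
        const std::size_t sy = y * ih / oh;
        const std::size_t sx = x * iw / ow;
        // The channel axis is a pass-through, so N channels are contiguous
        // in both tensors and can be moved as one vector.
        vec<N> v;
        const float* src = in.data() + (sy * iw + sx) * c + g * N;
        std::copy(src, src + N, v.begin());
        std::copy(v.begin(), v.end(), out.begin() + i * N);
    }
    return out;
}
```

Because the channel axis is not resized, the `N == 4` and `N == 1` instantiations produce identical results. If the fastest axis were resized instead, neighbouring output elements would map to non-contiguous input elements, and only the output side could keep the vector form.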

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

| File | Description |
|---|---|
| src/targets/gpu/kernels/include/migraphx/kernels/resize.hpp | Adds vectorization support via a new resize_apply wrapper and updates nearest/linear/cubic implementations to use it. |
| src/targets/gpu/jit/resize.cpp | Applies vectorize transformers in the generated kernel, selects vectorization parameters, and updates launch sizing and input-vectorization eligibility logic. |

Comment on lines +209 to +214
+template <index_int Axis, class Input, class Output, class Outv, class Compute>
+__device__ void resize_apply(Input input, Output out, Outv outv, Compute compute)
+{
-    auto idx       = make_index();
-    auto in_shape  = input.get_shape();
-    auto out_shape = output.get_shape();
-
-    idx.global_stride(out_shape.elements(), [&](auto out_idx) {
-        auto in_idx = compute_nearest_idx<CoordOp, NearestOp>(in_shape, out_shape, out_idx, scales);
-        output[out_idx] = input[in_idx];
-    });
+    auto idx = make_index();
+    constexpr index_int ivn = tensor_vec_size<Input>(); // >= 2 only for a pass-through fast axis
+    constexpr index_int ovn = tensor_vec_size<Outv>();
Comment thread src/targets/gpu/jit/resize.cpp Outdated
pfultz2 and others added 3 commits June 17, 2026 09:49
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

2 participants