[AIMIGRAPHX-1054] Log debug symbols when exceptions are thrown #4978

Open
eddieliao wants to merge 8 commits into develop from debug_symbols_exceptions

@eddieliao eddieliao commented Jun 16, 2026

Contributor

Motivation

This PR logs debug symbols at the debug severity level when an exception is thrown.

Technical Details

A scope guard records the number of uncaught exceptions when a code section (usually compute or insert) is entered. If that number is higher when the scope exits, an exception is propagating and the associated debug symbols are logged. The case in onnx_parser builds the debug symbol manually, as there is no instruction to refer back to yet.
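For readers unfamiliar with the technique, here is a minimal sketch of an exception-only scope guard built on `std::uncaught_exceptions()` (C++17). The names `scope_fail_sketch` and `run_step` are hypothetical and are not the classes or functions added by this PR; the logged string is only a placeholder.

```cpp
#include <exception>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical exception-only scope guard (illustration, not the PR's class).
template <class F>
class scope_fail_sketch
{
    public:
    explicit scope_fail_sketch(F f)
        : fn(std::move(f)), entry_count(std::uncaught_exceptions())
    {
    }
    scope_fail_sketch(const scope_fail_sketch&)            = delete;
    scope_fail_sketch& operator=(const scope_fail_sketch&) = delete;

    ~scope_fail_sketch() noexcept
    {
        // More uncaught exceptions than at entry means this scope is being
        // left because an exception is propagating through it.
        if(std::uncaught_exceptions() > entry_count)
            fn();
    }

    private:
    F fn;
    int entry_count;
};

// Hypothetical guarded section: the callback runs only on the throwing path.
void run_step(bool fail, std::vector<std::string>& log)
{
    scope_fail_sketch guard{[&] { log.push_back("debug symbols: MyBadAdd"); }};
    if(fail)
        throw std::runtime_error("shape mismatch");
}
```

Calling `run_step(false, log)` leaves `log` untouched, while `run_step(true, log)` appends one entry before the exception reaches the caller. Comparing counts rather than testing for "any uncaught exception" keeps the guard correct when it is itself constructed during unwinding.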

Changelog Category

Add a CHANGELOG.md entry for any option other than Not Applicable

    • Added: New functionality.
    • Changed: Changes to existing functionality.
    • Removed: Functionality or support that has been removed. (Compared to a previous release)
    • Optimized: Component performance that has been optimized or improved.
    • Resolved Issues: Known issues from a previous version that have been resolved.
    • Not Applicable: This PR is not to be included in the changelog.

@eddieliao eddieliao requested a review from Copilot June 16, 2026 23:17
@eddieliao eddieliao self-assigned this Jun 16, 2026
@eddieliao eddieliao added the enhancement and Changelog: Added labels Jun 16, 2026

Copilot AI left a comment

Pull request overview

This PR adds exception-path debug logging to help correlate thrown exceptions with MIGraphX debug symbols during evaluation, shape recomputation, instruction replacement, and ONNX parsing.

Changes:

  • Introduces a small on_scope_fail / scope_fail_guard utility (exception-only scope guard).
  • Adds log_debug_symbols_on_exception(const instruction&) noexcept and wires it into core execution/shape paths.
  • Extends ONNX parsing to emit a debug log containing the node’s debug symbol when parsing throws.
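A rough sketch of why a helper like this would be declared `noexcept`: it runs from a destructor while the stack is unwinding, so a second escaping exception would call `std::terminate`. Everything below is an assumption made for illustration: `instruction_sketch`, `debug_log`, `log_debug_symbols_sketch`, `fail_logger`, and `eval_sketch` are hypothetical stand-ins, and the message format is invented rather than taken from the PR's code.

```cpp
#include <exception>
#include <set>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for an instruction carrying debug symbols.
struct instruction_sketch
{
    std::string name;
    std::set<std::string> debug_symbols;
};

// Stand-in log sink so the behavior is observable; a real logger would go here.
inline std::vector<std::string>& debug_log()
{
    static std::vector<std::string> lines;
    return lines;
}

// Called during stack unwinding, so nothing may escape from it.
void log_debug_symbols_sketch(const instruction_sketch& ins) noexcept
{
    try
    {
        if(ins.debug_symbols.empty())
            return;
        std::ostringstream ss;
        ss << "Exception thrown in '" << ins.name << "' with debug symbols:";
        for(const auto& s : ins.debug_symbols)
            ss << ' ' << s;
        debug_log().push_back(ss.str());
    }
    catch(...)
    {
        // Swallow: logging must never replace the original exception.
    }
}

// Minimal exception-only guard bound to one instruction.
struct fail_logger
{
    const instruction_sketch& ins;
    int entry_count = std::uncaught_exceptions();
    ~fail_logger()
    {
        if(std::uncaught_exceptions() > entry_count)
            log_debug_symbols_sketch(ins);
    }
};

// Hypothetical evaluation step that throws on mismatched lengths.
int eval_sketch(const instruction_sketch& ins, int lhs_len, int rhs_len)
{
    fail_logger guard{ins};
    if(lhs_len != rhs_len)
        throw std::runtime_error("shape mismatch");
    return lhs_len;
}
```

With this shape, the original exception still reaches the caller unchanged; the log line is purely additional context.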

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 10 comments.

| File | Description |
| --- | --- |
| src/program.cpp | Adds a scope-fail guard in generic_eval to log instruction debug symbols on thrown exceptions. |
| src/onnx/onnx_parser.cpp | Adds scope-fail logging for node parsing failures (manual debug symbol construction). |
| src/module.cpp | Adds scope-fail logging around multiple shape/replace codepaths. |
| src/instruction.cpp | Adds scope-fail logging during shape recomputation and introduces the log_debug_symbols_on_exception helper. |
| src/include/migraphx/scope_guard.hpp | New exception-only scope guard implementation. |
| src/include/migraphx/instruction.hpp | Exposes the new logging helper in the public instruction API. |

Comment thread src/include/migraphx/scope_guard.hpp
Comment thread src/program.cpp Outdated
Comment thread src/onnx/onnx_parser.cpp Outdated
Comment thread src/module.cpp
Comment thread src/module.cpp
Comment thread src/module.cpp
Comment thread src/module.cpp
Comment thread src/instruction.cpp
Comment thread src/instruction.cpp
Comment thread src/instruction.cpp
eddieliao and others added 2 commits June 16, 2026 16:23
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
@eddieliao eddieliao marked this pull request as ready for review June 16, 2026 23:37
@eddieliao eddieliao requested a review from causten as a code owner June 16, 2026 23:37
@eddieliao eddieliao requested review from CharlieL7 and pfultz2 June 16, 2026 23:37
@eddieliao eddieliao requested a review from a team as a code owner June 16, 2026 23:39
codecov Bot commented Jun 17, 2026

Codecov Report

❌ Patch coverage is 75.67568% with 9 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/instruction.cpp | 66.67% | 4 Missing ⚠️ |
| src/onnx/onnx_parser.cpp | 69.23% | 4 Missing ⚠️ |
| src/module.cpp | 80.00% | 1 Missing ⚠️ |

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #4978      +/-   ##
===========================================
- Coverage    92.73%   92.71%   -0.02%     
===========================================
  Files          592      593       +1     
  Lines        31289    31323      +34     
===========================================
+ Hits         29015    29040      +25     
- Misses        2274     2283       +9     

| Files with missing lines | Coverage Δ |
| --- | --- |
| src/include/migraphx/instruction.hpp | 100.00% <ø> (ø) |
| src/include/migraphx/scope_guard.hpp | 100.00% <100.00%> (ø) |
| src/program.cpp | 81.67% <100.00%> (+0.02%) ⬆️ |
| src/module.cpp | 89.39% <80.00%> (-0.05%) ⬇️ |
| src/instruction.cpp | 90.03% <66.67%> (-0.74%) ⬇️ |
| src/onnx/onnx_parser.cpp | 88.28% <69.23%> (-0.72%) ⬇️ |

@gh-app-migraphx-bot-pr-write

| Test | Batch | New Rate (647468) | Old Rate (0043a5)* | Diff | Status |
| --- | --- | --- | --- | --- | --- |
| torchvision-resnet50 | 64 | 3,163.73 | 3,153.59 | 0.32% | |
| torchvision-resnet50_fp16 | 64 | 6,674.82 | 6,635.99 | 0.59% | |
| torchvision-densenet121 | 32 | 2,709.61 | 2,694.92 | 0.55% | |
| torchvision-densenet121_fp16 | 32 | 4,545.67 | 4,513.21 | 0.72% | |
| torchvision-inceptionv3 | 32 | 1,801.99 | 1,796.76 | 0.29% | |
| torchvision-inceptionv3_fp16 | 32 | 2,820.15 | 2,812.91 | 0.26% | |
| cadene-inceptionv4 | 16 | 825.96 | 824.94 | 0.12% | |
| cadene-resnext64x4 | 16 | 784.86 | 784.29 | 0.07% | |
| slim-mobilenet | 64 | 8,438.35 | 8,391.26 | 0.56% | |
| slim-nasnetalarge | 64 | 229.08 | 228.80 | 0.12% | |
| slim-resnet50v2 | 64 | 3,330.68 | 3,314.14 | 0.50% | |
| bert-mrpc-onnx | 8 | 1,171.02 | 1,171.25 | -0.02% | |
| bert-mrpc-tf | 1 | 484.08 | 496.75 | -2.55% | |
| pytorch-examples-wlang-gru | 1 | 339.16 | 327.37 | 3.60% | |
| pytorch-examples-wlang-lstm | 1 | 446.20 | 505.79 | -11.78% | 🔴 |
| torchvision-resnet50_1 | 1 | 750.54 | 763.35 | -1.68% | |
| cadene-dpn92_1 | 1 | 449.44 | 453.50 | -0.89% | |
| cadene-resnext101_1 | 1 | 363.70 | 363.65 | 0.01% | |
| onnx-taau-downsample | 1 | 400.80 | 399.46 | 0.34% | |
| dlrm-criteoterabyte | 1 | 32.68 | 32.42 | 0.80% | |
| dlrm-criteoterabyte_fp16 | 1 | 52.61 | 51.82 | 1.52% | |
| agentmodel | 1 | 9,403.45 | 12,562.69 | -25.15% | 🔴 |
| unet_fp16 | 2 | 57.24 | 56.86 | 0.68% | |
| resnet50v1_fp16 | 1 | 944.30 | 1,007.81 | -6.30% | 🔴 |
| resnet50v1_int8 | 1 | 929.58 | 927.24 | 0.25% | |
| bert_base_cased_fp16 | 64 | 1,103.15 | 1,098.00 | 0.47% | |
| bert_large_uncased_fp16 | 32 | 347.46 | 346.68 | 0.23% | |
| bert_large_fp16 | 1 | 204.00 | 203.83 | 0.08% | |
| distilgpt2_fp16 | 16 | 2,098.61 | 2,096.89 | 0.08% | |
| yolov5s | 1 | 567.25 | 562.13 | 0.91% | |
| tinyllama | 1 | 46.02 | 45.95 | 0.15% | |
| vicuna-fastchat | 1 | 43.99 | 44.14 | -0.33% | |
| whisper-tiny-encoder | 1 | 418.79 | 417.62 | 0.28% | |
| whisper-tiny-decoder | 1 | 415.56 | 412.86 | 0.65% | |
| llama2_7b | 1 | 20.47 | 20.35 | 0.60% | |
| qwen1.5-7b | 1 | 23.69 | 23.60 | 0.37% | |
| phi3-3.8b | 1 | 26.69 | 26.88 | -0.73% | |
| llama3-8b | 1 | 21.83 | 21.81 | 0.13% | |
| whisper-large-encoder | 1 | 10.32 | 10.28 | 0.31% | |
| whisper-large-decoder | 1 | 102.24 | 106.05 | -3.58% | |
| mistral-7b | 1 | 23.84 | 23.79 | 0.23% | |
| FLUX.1-schnell | 1 | 750.00 | 753.45 | -0.46% | |

Regressions detected 🔴

* No develop baseline was found for this PR's branch point; compared against the latest available develop run instead.
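
The Diff column appears to be the relative change of the new rate against the old rate. The sketch below is an assumption inferred from the rows above, not the bot's actual code, and `percent_diff` is a hypothetical name. The threshold the bot uses to flag a regression is not stated, so none is encoded here.

```cpp
#include <cmath>

// Relative change of new_rate versus old_rate, in percent.
// Assumed formula; it reproduces the table's Diff values to two decimals.
double percent_diff(double new_rate, double old_rate)
{
    return (new_rate - old_rate) / old_rate * 100.0;
}

// Round to two decimals, matching how the table presents the value.
double round2(double value) { return std::round(value * 100.0) / 100.0; }
```

For example, the agentmodel row (9,403.45 against 12,562.69) works out to roughly -25.15%, which matches the reported figure.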

@gh-app-migraphx-bot-pr-write

Test Status Result
bert-mrpc-onnx PASSED: MIGraphX meets tolerance
bert-mrpc-tf ERROR - check error output
traceback
    Traceback (most recent call last):
      File "/src/AMDMIGraphX/tools/accuracy/accuracy_checker.py", line 377, in <module>
        main()
      File "/src/AMDMIGraphX/tools/accuracy/accuracy_checker.py", line 313, in main
        import tensorflow as tf
      File "/usr/local/lib/python3.10/dist-packages/tensorflow/__init__.py", line 38, in <module>
        from tensorflow.python.tools import module_util as _module_util
      File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/__init__.py", line 36, in <module>
        from tensorflow.python import pywrap_tensorflow as _pywrap_tensorflow
      File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/pywrap_tensorflow.py", line 26, in <module>
        self_check.preload_check()
      File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/platform/self_check.py", line 63, in preload_check
        from tensorflow.python.platform import _pywrap_cpu_feature_guard
    ImportError: libamdhip64.so.6: cannot open shared object file: No such file or directory
pytorch-examples-wlang-gru PASSED: MIGraphX meets tolerance
pytorch-examples-wlang-lstm PASSED: MIGraphX meets tolerance
dlrm-criteoterabyte PASSED: MIGraphX meets tolerance
agentmodel PASSED: MIGraphX meets tolerance
unet PASSED: MIGraphX meets tolerance
resnet50v1 PASSED: MIGraphX meets tolerance
bert_base_cased_fp16 PASSED: MIGraphX meets tolerance
bert_large_uncased_fp16 🔴 FAILED: MIGraphX is not within tolerance - check verbose output
bert_large PASSED: MIGraphX meets tolerance
yolov5s PASSED: MIGraphX meets tolerance
tinyllama PASSED: MIGraphX meets tolerance
vicuna-fastchat PASSED: MIGraphX meets tolerance
whisper-tiny-encoder PASSED: MIGraphX meets tolerance
whisper-tiny-decoder PASSED: MIGraphX meets tolerance
distilgpt2_fp16 PASSED: MIGraphX meets tolerance
llama2_7b PASSED: MIGraphX meets tolerance
qwen1.5-7b PASSED: MIGraphX meets tolerance
phi3-3.8b PASSED: MIGraphX meets tolerance
llama3-8b PASSED: MIGraphX meets tolerance
whisper-large-encoder ERROR - check error output
traceback
    2026-06-17 00:58:24.501977 [WARN] [/data/src/onnx/onnx_parser.cpp:283] Model has unbound symbolic dimension(s): batch_size, encoder_sequence_length, feature_size. These default to 1 and may cause unexpected behavior. Try setting --dim-param @<name> <value> or --input-dim @<input> <dims> if program compilation fails.
    Traceback (most recent call last):
      File "/src/AMDMIGraphX/tools/accuracy/accuracy_checker.py", line 377, in <module>
        main()
      File "/src/AMDMIGraphX/tools/accuracy/accuracy_checker.py", line 224, in main
        model = migraphx.parse_onnx(model_name, default_dim_value=batch)
    RuntimeError: /data/src/include/migraphx/op/convolution.hpp:113: normalize_compute_shape: CONVOLUTION: mismatched channel numbers: input channels (1) != weights channels (80) * group (1)
whisper-large-decoder PASSED: MIGraphX meets tolerance
mistral-7b ERROR - check error output
traceback
2026-06-17 00:59:41.472898 [WARN] [/data/src/onnx/onnx_parser.cpp:283] Model has unbound symbolic dimension(s): batch_size, sequence_length. These default to 1 and may cause unexpected behavior. Try setting --dim-param @<name> <value> or --input-dim @<input> <dims> if program compilation fails.
FLUX.1-schnell PASSED: MIGraphX meets tolerance

Comment thread src/onnx/onnx_parser.cpp
@CharlieL7

Collaborator

What is the output when an exception is thrown with debug symbols?

@eddieliao

Contributor Author

> What is the output when an exception is thrown with debug symbols?

2026-06-18 16:45:09.437400 [INFO] [/code/AMDMIGraphX/src/driver/main.cpp:1215] Running [ MIGraphX Version: 2.16.0.20250912-17-529-g647468bb3 ]: ./build/bin/driver run tmp/ds-example/bad_add.onnx --debug-symbols --log-level debug
2026-06-18 16:45:09.437468 [INFO] [/code/AMDMIGraphX/src/driver/main.cpp:482] Reading: tmp/ds-example/bad_add.onnx
2026-06-18 16:45:09.601557 [DEBUG] [/code/AMDMIGraphX/src/onnx/onnx_parser.cpp:588] Exception thrown while parsing node 'Add' with debug symbols: MyBadAdd
2026-06-18 16:45:09.601708 [ERROR] [/code/AMDMIGraphX/src/onnx/onnx.cpp:87] module: "main"
2026-06-18 16:45:09.601717 [ERROR] [/code/AMDMIGraphX/src/onnx/onnx.cpp:87] b = @param:b -> float_type, {4, 5}, {5, 1}
2026-06-18 16:45:09.601725 [ERROR] [/code/AMDMIGraphX/src/onnx/onnx.cpp:87] a = @param:a -> float_type, {2, 3}, {3, 1}
2026-06-18 16:45:09.601734 [ERROR] [/code/AMDMIGraphX/src/onnx/onnx.cpp:87] 
2026-06-18 16:45:09.601749 [ERROR] [/code/AMDMIGraphX/src/onnx/onnx.cpp:87] 
terminate called after throwing an instance of 'migraphx::version_2_16_0::exception'
  what():  /code/AMDMIGraphX/src/common.cpp:48: operator(): COMPUTE_BROADCASTLEN: shape {2, 3} and {4, 5} mismatch!

Labels

Changelog: Added New functionality. enhancement New feature or request


3 participants