fix: ProcessingS3Output's s3_uri to be an optional field (5559) #5755

sagemaker-bot wants to merge 2 commits into aws:master

Conversation
sagemaker-bot left a comment

🤖 AI Code Review
This PR fixes ProcessingS3Output's s3_uri to be optional, allowing SageMaker to auto-generate S3 paths. The logic changes look reasonable and tests are comprehensive. However, there are several issues: the PR description mentions changing the shapes definition but no change to shapes is included in the diff, the import line exceeds 100 characters, hardcoded defaults should be constants, and there's a potential issue with the default local_path when creating s3_output from scratch.
      from sagemaker.core.local.local_session import LocalSession
      from sagemaker.core.helper.session_helper import Session
    - from sagemaker.core.shapes import ProcessingInput, ProcessingOutput, ProcessingS3Input
    + from sagemaker.core.shapes import ProcessingInput, ProcessingOutput, ProcessingS3Input, ProcessingS3Output
This import line exceeds the 100-character line length limit. Please break it into multiple lines:

    from sagemaker.core.shapes import (
        ProcessingInput,
        ProcessingOutput,
        ProcessingS3Input,
        ProcessingS3Output,
    )

    # Generate a name for the ProcessingOutput if it doesn't have one.
    if output.output_name is None:
        output.output_name = "output-{}".format(count)
    # If s3_output is None, create a default one with None s3_uri
Missing shapes definition change: The PR description states that s3_uri should be made optional (defaulting to None) on ProcessingS3Output in the shapes definition, but no change to the shapes file is included in this diff. Without that change, ProcessingS3Output(s3_uri=None, ...) or ProcessingS3Output(local_path=..., s3_upload_mode=...) (without s3_uri) will likely fail Pydantic validation if s3_uri is currently a required field. Please include the shapes definition change or confirm that s3_uri is already optional in the shapes.
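The validation behavior at stake can be illustrated with a minimal Pydantic sketch. The two classes below are stand-ins, not the real sagemaker-core shapes: a required field rejects None, while an Optional field with a default accepts both None and omission.

```python
# Stand-in models (not the real sagemaker-core shapes) showing why
# ProcessingS3Output(s3_uri=None) fails while s3_uri stays required.
from typing import Optional

from pydantic import BaseModel, ValidationError


class RequiredUri(BaseModel):
    s3_uri: str  # required: passing None (or omitting it) fails validation


class OptionalUri(BaseModel):
    s3_uri: Optional[str] = None  # optional: None and omission are both fine


try:
    RequiredUri(s3_uri=None)
except ValidationError:
    print("required s3_uri rejected None")

print(OptionalUri().s3_uri)  # None
```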
$context sagemaker-core/src/sagemaker/core/shapes/shapes.py
🤖 Iteration #1 — Review Comments Addressed

Description
Make Changes
Behavior
Testing
Comments reviewed: 9
sagemaker-bot left a comment

🤖 AI Code Review
This PR makes ProcessingS3Output.s3_uri optional to allow SageMaker to auto-generate S3 paths, matching V2 behavior. The approach is sound and well-tested, but there are a few issues: the shapes file is auto-generated and shouldn't be manually edited, the Processor class doesn't use Pydantic BaseModel, and there's a line length violation.
    s3_uri: Optional[StrPipeVar] = None
    s3_upload_mode: StrPipeVar
    local_path: Optional[StrPipeVar] = Unassigned()
Critical concern: The shapes.py file in sagemaker-core is auto-generated from API definitions (as noted in the SDK architecture: "auto-generated from API definitions"). Manual edits to this file will be overwritten the next time the code generator runs.
The correct fix should be made in the shapes definition/generation configuration (e.g., a JSON/YAML model or codegen template) so that s3_uri is generated as Optional. If this is truly a bug in the API model where the SageMaker API actually accepts S3Uri as optional for ProcessingS3Output, the codegen input should be updated.
Please confirm whether this file is safe to edit manually or if the change needs to go into the code generation source.
    s3_uri: Optional[StrPipeVar] = None
    s3_upload_mode: StrPipeVar
    local_path: Optional[StrPipeVar] = Unassigned()
The field ordering changed here: s3_uri (now optional with a default) is listed before s3_upload_mode (required, no default). Plain Pydantic models do not enforce declaration order, but stdlib dataclasses reject a field without a default that follows a field with one, so if these shapes are generated as dataclass-style models this ordering would fail. If reordering is needed, put s3_upload_mode first:

    s3_upload_mode: StrPipeVar
    s3_uri: Optional[StrPipeVar] = None
    local_path: Optional[StrPipeVar] = Unassigned()

Please verify which model style the generated shapes use.
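As a quick check on that ordering concern, here is a stand-in model (not the real shape) assuming the shapes are plain Pydantic models: Pydantic accepts a defaulted field declared before a required one.

```python
# Sketch with a stand-in model: in Pydantic, a defaulted field may precede
# a required one (stdlib dataclasses would reject this ordering).
from typing import Optional

from pydantic import BaseModel


class S3OutputSketch(BaseModel):
    s3_uri: Optional[str] = None  # defaulted field declared first
    s3_upload_mode: str           # required field declared after it


out = S3OutputSketch(s3_upload_mode="EndOfJob")
print(out.s3_uri, out.s3_upload_mode)  # None EndOfJob
```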
    if getattr(self.sagemaker_session, "local_mode", False) and parse_result.scheme == "file":
    # If the output's s3_uri is None or not an s3_uri, create one.
    if output.s3_output.s3_uri is None:
        parse_result_scheme = ""
Line length likely exceeds 100 characters:

    if getattr(self.sagemaker_session, "local_mode", False) and parse_result_scheme == "file":

This should be wrapped to stay within the 100-character limit.
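One way to wrap it, shown as a runnable sketch; the session object below is a stand-in for the real sagemaker_session.

```python
# Stand-in objects so the wrapped condition is runnable in isolation.
from types import SimpleNamespace

sagemaker_session = SimpleNamespace(local_mode=True)
parse_result_scheme = "file"

# Parenthesizing the condition keeps every line under 100 characters:
if (
    getattr(sagemaker_session, "local_mode", False)
    and parse_result_scheme == "file"
):
    handled_as_local_file = True
else:
    handled_as_local_file = False

print(handled_as_local_file)  # True
```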
    if output.output_name is None:
        output.output_name = "output-{}".format(count)
    # If s3_output is None, create a default one with None s3_uri.
    # The s3_uri will be auto-generated below based on job/pipeline context.
The default local_path and s3_upload_mode values are hardcoded here. Consider whether these should match whatever defaults the user might expect from the ProcessingS3Output model itself. If ProcessingS3Output already has defaults for local_path and s3_upload_mode, you could simplify:

    output.s3_output = ProcessingS3Output(
        s3_upload_mode=DEFAULT_S3_UPLOAD_MODE,
        local_path=DEFAULT_PROCESSING_LOCAL_OUTPUT_PATH,
    )

Also, the local_path on the auto-created ProcessingS3Output may conflict with the user's actual ProcessingOutput.app_managed or other output configurations. Is /opt/ml/processing/output always the correct default?
    logger = logging.getLogger(__name__)

    # Default values used when creating a ProcessingS3Output for outputs
Nit: DEFAULT_S3_UPLOAD_MODE should use the enum or constant from the shapes/API model if one exists, rather than a raw string "EndOfJob". This avoids drift if the API model changes the accepted values.
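If no such constant exists yet, a str-backed enum is one option. This is a hypothetical sketch: S3UploadMode and its members are illustrative names, not the real sagemaker-core definitions.

```python
# Hypothetical enum for the accepted S3 upload modes; the member values
# mirror the API strings, so comparisons against raw strings still work.
from enum import Enum


class S3UploadMode(str, Enum):
    CONTINUOUS = "Continuous"
    END_OF_JOB = "EndOfJob"


DEFAULT_S3_UPLOAD_MODE = S3UploadMode.END_OF_JOB

# A str-backed member compares equal to the raw API string:
print(DEFAULT_S3_UPLOAD_MODE == "EndOfJob")  # True
```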
    assert "test-job" in result[0].s3_output.s3_uri
    assert "my-output" in result[0].s3_output.s3_uri


    def test_normalize_outputs_with_none_s3_uri_and_pipeline_config_generates_join(self, mock_session):
Line exceeds 100 characters. Please wrap this line:

    def test_normalize_outputs_with_none_s3_uri_and_pipeline_config_generates_join(
        self, mock_session
    ):

    image_uri="test-image:latest",
    instance_count=1,
    instance_type="ml.m5.xlarge",
    sagemaker_session=mock_session,
Good test coverage! However, the mock_session fixture is used but I don't see it defined in this diff. Please confirm it's defined elsewhere in the test file (e.g., as a conftest.py fixture or earlier in the file). If it's the fixture from the existing test class, note that these tests are in a new class and pytest fixtures defined inside another class won't be inherited.
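One way to share the fixture across test classes is to define it at module level or in conftest.py. This is a sketch; the session attributes configured below are assumptions about what the tests need.

```python
# A module-level fixture (or one placed in conftest.py) is visible to every
# test class in the module; a fixture defined inside one class is not.
import pytest
from unittest.mock import MagicMock


@pytest.fixture
def mock_session():
    session = MagicMock()
    session.local_mode = False  # assumed attribute consulted by Processor
    return session
```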
    def test_normalize_outputs_with_none_s3_output_generates_s3_path(self, mock_session):
        """When s3_output is None, _normalize_outputs should create s3_output and auto-generate URI."""
        processor = Processor(
            role="arn:aws:iam::123456789012:role/SageMakerRole",
The import of Join is inside the test function body. While this works, it's better practice to put imports at the top of the file. If Join is only needed for this assertion, consider using hasattr or checking the type name string instead, or move the import to the top of the file with the other imports.
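The type-name alternative mentioned above looks like this. It is a sketch: the Join class below is a local stand-in so the snippet runs without sagemaker installed.

```python
# Stand-in class; in the real test, `value` would be the object produced by
# _normalize_outputs under a pipeline config.
class Join:
    """Local stand-in for sagemaker's pipeline Join type."""


value = Join()

# Compare by class name instead of importing the real Join at the use site:
is_join = type(value).__name__ == "Join"
print(is_join)  # True
```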
    # After _normalize_outputs, s3_uri should always be populated.
    # If it is still None at serialization time, omit S3Uri from the dict
    # rather than sending None to the API. This is a defensive guard;
    # _normalize_outputs is expected to fill in s3_uri before we reach here.
The defensive guard to omit S3Uri when it's None is good, but the comment says "_normalize_outputs is expected to fill in s3_uri before we reach here." If that's the case, consider adding a logger.warning() when s3_uri is None at this point, since it indicates an unexpected code path. Silent omission could make debugging harder:

    if processing_output.s3_output.s3_uri is not None:
        s3_output_dict["S3Uri"] = processing_output.s3_output.s3_uri
    else:
        logger.warning(
            "s3_uri is None for output '%s' at serialization time; "
            "expected _normalize_outputs to have populated it.",
            processing_output.output_name,
        )
Description

In V3 (sagemaker-core), ProcessingS3Output's s3_uri field is mandatory, preventing users from leaving the output destination as None to let SageMaker auto-generate an S3 path (partitioned by job_name/step_name/output_name). In V2, ProcessingOutput.destination could be None and the SDK would auto-generate an S3 URI during normalization. The fix requires: (1) making s3_uri optional (defaulting to None) on ProcessingS3Output in the shapes definition, (2) updating _normalize_outputs() to handle None s3_uri by auto-generating a path, (3) allowing ProcessingOutput to be created without s3_output entirely and having the normalizer create it, and (4) updating _processing_output_to_request_dict to handle optional s3_uri.

Related Issue
Related issue: 5559
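The auto-generation the description refers to can be sketched as a pure path-building helper. The helper name and the exact path layout (s3://bucket/job-name/output/output-name) are assumptions for illustration; the real normalizer may partition differently.

```python
# Hypothetical helper mirroring a V2-style default output destination.
def generate_output_s3_uri(default_bucket: str, job_name: str, output_name: str) -> str:
    """Build a default S3 destination for a processing output."""
    return "s3://{}/{}/output/{}".format(default_bucket, job_name, output_name)


uri = generate_output_s3_uri("my-bucket", "test-job", "my-output")
print(uri)  # s3://my-bucket/test-job/output/my-output
```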
Changes Made

- sagemaker-core/src/sagemaker/core/processing.py
- sagemaker-core/tests/unit/test_processing.py

AI-Generated PR
This PR was automatically generated by the PySDK Issue Agent.
Merge Checklist

- PR title follows the "prefix: description" format