[FLINK-36317][runtime] Populate ArchivedExecutionGraph with CheckpointStatsSnapshot in WaitingForResources state#28519

Open
kaustubhbutte17 wants to merge 2 commits into apache:master from kaustubhbutte17:FLINK-36317-checkpoint-stats-waiting-for-resources

@kaustubhbutte17

What is the purpose of the change

When a Flink job fails and restarts, it transitions through Restarting → WaitingForResources. While the job is in WaitingForResources, the previousExecutionGraph (which contains checkpoint statistics from the prior execution) is available, but the REST API/Web UI cannot access these stats because StateWithoutExecutionGraph.getJob() creates a sparse ArchivedExecutionGraph with empty checkpoint data.

This PR fixes that by:

  1. Adding withCheckpointStatsSnapshot() to ArchivedExecutionGraph — creates a copy with different checkpoint stats
  2. Overriding getJob() in WaitingForResources — attaches checkpoint stats from the previousExecutionGraph when available
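
The two steps above can be sketched as follows. This is a minimal, self-contained illustration rather than the code in this PR: StatsSnapshot, ArchivedGraph and StateWithoutGraph are hypothetical stand-ins for CheckpointStatsSnapshot, ArchivedExecutionGraph and StateWithoutExecutionGraph, and the sketch assumes the previous graph is simply null on the first start.

```java
public class CheckpointStatsSketch {

    /** Hypothetical stand-in for CheckpointStatsSnapshot; carries only one counter. */
    public static final class StatsSnapshot {
        public final long completedCheckpoints;

        public StatsSnapshot(long completedCheckpoints) {
            this.completedCheckpoints = completedCheckpoints;
        }
    }

    /** Hypothetical stand-in for ArchivedExecutionGraph: immutable, with a copy method. */
    public static final class ArchivedGraph {
        private final String jobName;
        private final StatsSnapshot checkpointStats; // null in a sparse graph

        public ArchivedGraph(String jobName, StatsSnapshot checkpointStats) {
            this.jobName = jobName;
            this.checkpointStats = checkpointStats;
        }

        /** Step 1: return a copy that differs only in its checkpoint stats. */
        public ArchivedGraph withCheckpointStatsSnapshot(StatsSnapshot stats) {
            return new ArchivedGraph(jobName, stats);
        }

        public String getJobName() {
            return jobName;
        }

        public StatsSnapshot getCheckpointStatsSnapshot() {
            return checkpointStats;
        }
    }

    /** Hypothetical stand-in for StateWithoutExecutionGraph: always builds a sparse graph. */
    public static class StateWithoutGraph {
        private final String jobName;

        public StateWithoutGraph(String jobName) {
            this.jobName = jobName;
        }

        public ArchivedGraph getJob() {
            return new ArchivedGraph(jobName, null);
        }
    }

    /** Step 2: override getJob() to carry over stats from the previous graph, if any. */
    public static final class WaitingForResources extends StateWithoutGraph {
        private final ArchivedGraph previousGraph; // assumed null on first start

        public WaitingForResources(String jobName, ArchivedGraph previousGraph) {
            super(jobName);
            this.previousGraph = previousGraph;
        }

        @Override
        public ArchivedGraph getJob() {
            ArchivedGraph sparse = super.getJob();
            if (previousGraph == null || previousGraph.getCheckpointStatsSnapshot() == null) {
                return sparse; // nothing to preserve, keep the default behavior
            }
            return sparse.withCheckpointStatsSnapshot(previousGraph.getCheckpointStatsSnapshot());
        }
    }

    public static void main(String[] args) {
        ArchivedGraph previous = new ArchivedGraph("demo-job", new StatsSnapshot(7));

        ArchivedGraph afterRestart = new WaitingForResources("demo-job", previous).getJob();
        System.out.println(afterRestart.getCheckpointStatsSnapshot().completedCheckpoints); // prints 7

        ArchivedGraph firstStart = new WaitingForResources("demo-job", null).getJob();
        System.out.println(firstStart.getCheckpointStatsSnapshot()); // prints null
    }
}
```

Returning a copy instead of mutating the sparse graph keeps the archived graph immutable in this sketch, which is presumably why the PR adds a with-style method rather than a setter.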

Brief change log

  • Added ArchivedExecutionGraph.withCheckpointStatsSnapshot() method
  • Overrode WaitingForResources.getJob() to preserve checkpoint stats from the previous execution graph
  • Added 2 unit tests in WaitingForResourcesTest

Verifying this change

This change is verified by new unit tests:

  • testGetJobIncludesCheckpointStatsFromPreviousExecutionGraph — verifies checkpoint stats from a mock previous execution graph are preserved
  • testGetJobWithoutPreviousExecutionGraphReturnsNullCheckpointStats — verifies the default behavior when no previous execution graph exists

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changed class annotated with @Public(Evolving): no
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): no
  • Anything that affects deployment or recovery: no
  • The (broadcasting) coordination functionality: no

Documentation

  • Does this pull request introduce a new feature? no
  • If yes, how is the feature documented? not applicable

…pshot in WaitingForResources state

When a job restarts and enters WaitingForResources with a previousExecutionGraph,
the checkpoint statistics from the previous execution are now preserved in the
ArchivedExecutionGraph returned by getJob(). Previously, these stats were lost
because StateWithoutExecutionGraph.getJob() creates a sparse archived graph
with empty checkpoint stats.

This change:
- Adds withCheckpointStatsSnapshot() to ArchivedExecutionGraph for creating
  a copy with different checkpoint stats
- Overrides getJob() in WaitingForResources to attach checkpoint stats from
  the previousExecutionGraph when available
- Adds tests verifying checkpoint stats preservation

Signed-off-by: Kaustubh Butte <kaustubhbutte17@gmail.com>
@flinkbot

flinkbot commented Jun 23, 2026

Collaborator

CI report:

Bot commands

The @flinkbot bot supports the following commands:
  • @flinkbot run azure: re-run the last Azure build

@spuru9

spuru9 commented Jun 23, 2026

Contributor

Hi @kaustubhbutte17
There appears to be a linting issue; running mvn spotless:apply should fix it.
Also, can you add the component name to the PR title, as specified in the guidelines/AGENTS.md?

…Test

Signed-off-by: Kaustubh Butte <kaustubhbutte17@gmail.com>
@kaustubhbutte17 kaustubhbutte17 changed the title from "[FLINK-36317] Populate ArchivedExecutionGraph with CheckpointStatsSnapshot in WaitingForResources state" to "[FLINK-36317][runtime] Populate ArchivedExecutionGraph with CheckpointStatsSnapshot in WaitingForResources state" on Jun 23, 2026
@kaustubhbutte17

Author

Thanks @spuru9 for the review! Fixed both issues:

  1. Ran mvn spotless:apply to fix formatting
  2. Updated PR title to include [runtime] component

Pushed the fix in the latest commit.

@kaustubhbutte17

Author

@flinkbot run azure
