Persist endpoint UUID for vector_search_endpoints drift detection#5127
Conversation
…ctor_search_endpoints
Two fixes that together make `vector_search_endpoints` plan/deploy correctly
across out-of-band changes the user can trigger from the console.
1) endpoint UUID drift detection.
The endpoint name is stable but its UUID changes if the endpoint is
deleted and recreated with the same name. Without persisting that UUID
the planner couldn't detect:
- the endpoint itself being replaced out-of-band (permissions silently
rebound to a different backing endpoint);
- any caller that depends on endpoint_uuid (e.g. permissions object_id)
racing the recreate.
VectorSearchEndpointState now embeds CreateEndpoint and adds
EndpointUuid. DoCreate records the UUID from the create response;
DoUpdate copies it from entry.RemoteState so unrelated updates (e.g.
min_qps) don't blank it out. OverrideChangeDesc classifies
endpoint_uuid drift as Recreate when saved != remote, Skip otherwise.
drift/recreated_same_name flips from a "badness snapshot" to the
recreate behavior, with a permissions block on the endpoint to verify
the cascade rebinds correctly.
2) ignore_remote_changes for endpoint.budget_policy_id.
The API returns effective_budget_policy_id on Get, which folds in
workspace-inherited policy. That value rarely matches the user-set
budget_policy_id, so every plan was seeing drift on a field the user
never touched. drift/budget_policy now asserts the field is correctly
ignored.
drift/min_qps/out.plan.direct.json regenerates to include the new
endpoint_uuid skip entry in the detailed plan.
Co-authored-by: Isaac
Will be handled in an SDK update. Keeping this PR focused on the endpoint UUID drift fix. Co-authored-by: Isaac
| "action": "skip", | ||
| "reason": "state-only field", | ||
| "old": "[UUID]", | ||
| "remote": "[UUID]" |
There was a problem hiding this comment.
I guess both old and remote point to the same uuid? Can we use add_repl.py to add replacement for exact UUID that we're replacing so that test confirms that's it's the same?
| "changes": { | ||
| "endpoint_uuid": { | ||
| "action": "skip", | ||
| "reason": "state-only field", |
There was a problem hiding this comment.
Is "state-only field" correct description? It also exists on the resource itself, e.g. you can see the value in "remote".
We should probably use the same name as etag, same idea there? (not that "custom" is a good name, but makes sense to align there).
bundle/migrate/dashboards/out.plan_after_migrate.json: "etag": {
bundle/migrate/dashboards/out.plan_after_migrate.json- "action": "skip",
bundle/migrate/dashboards/out.plan_after_migrate.json- "reason": "custom",
bundle/migrate/dashboards/out.plan_after_migrate.json- "old": "[NUMID]",
bundle/migrate/dashboards/out.plan_after_migrate.json- "remote": "[NUMID]"
bundle/migrate/dashboards/out.plan_after_migrate.json- },
| } | ||
| if savedUuid != "" && remoteUuid != "" && savedUuid != remoteUuid { | ||
| change.Action = deployplan.Recreate | ||
| change.Reason = "endpoint replaced out-of-band" |
There was a problem hiding this comment.
~/work/cli/bundle % git grep Reason deployplan/plan.go
deployplan/plan.go: Reason string `json:"reason,omitempty"`
deployplan/plan.go:// Possible values for Reason field
deployplan/plan.go: ReasonBackendDefault = "backend_default"
deployplan/plan.go: ReasonAlias = "alias"
deployplan/plan.go: ReasonRemoteAlreadySet = "remote_already_set"
deployplan/plan.go: ReasonEmpty = "empty"
deployplan/plan.go: ReasonCustom = "custom"
deployplan/plan.go: ReasonDrop = "!drop"
We should use a constant there. I'd just reuse ReasonCustom here, same as we do for dashboard.etag, we don't need to explain too much in reason field, it's more of a debugging tool.
also, we don't need to have separate reasons for recreate and skip cases, action already distinguishes those two.
Drop custom Reason strings from OverrideChangeDesc and let the framework default to ReasonCustom when the action is changed (matches dashboard.etag). Register MY_ENDPOINT_UUID via add_repl.py in the min_qps drift test so the plan output shows both old and remote rendering as the same labeled token, proving they're the same UUID rather than two arbitrary UUIDs both masked to [UUID] by the generic regex. Regenerate refschema out.fields.txt to include the new endpoint_uuid STATE classification. Co-authored-by: Isaac
## Summary - Reverts #5127 (`Persist endpoint UUID for vector_search_endpoints drift detection`) and the follow-up changelog entry from #5192. - The badness #5127 was meant to fix — bundle silently rebinding permissions to a different backing endpoint after an out-of-band recreate — was actually addressed by the testserver fix in #5186 (`testserver: 404 on permissions GET when V2 parent is gone`). With the testserver matching real V2 cloud behavior, bundle correctly observes that the new endpoint has no permissions and creates them, with no permanent drift afterwards. UUID persistence in state is no longer necessary. - Reworks the `drift/recreated_same_name` acceptance test: keeps endpoint permissions in `databricks.yml`, drops the obsolete "recreate detected" assertion, and adds a post-deploy `bundle plan` to confirm there is no permanent drift. ## Test plan - [x] `./task build` clean. - [x] `go test ./acceptance -run 'TestAccept/bundle/resources/vector_search_endpoints/drift'` — all green (terraform + direct). - [x] `go test ./bundle/direct/dresources/...` — green. - [x] `./task lint-q` — clean. - [x] Verified post-deploy plan shows `Plan: 0 to add, 0 to change, 0 to delete, 2 unchanged` after an out-of-band endpoint recreate, so permissions don't end up in permanent drift even without UUID-based recreate detection. This pull request and its description were written by Isaac.
) ## Changes Persist `endpoint_uuid` in state and detect identity drift on `vector_search_endpoints`. The endpoint name is stable but its UUID changes if the endpoint is deleted and recreated by name (e.g. via the workspace UI). Without persisting the UUID: - The bundle silently rebound permissions to a different backing endpoint without recreating the endpoint resource. - Anything else referencing `endpoint_uuid` (most importantly the permissions object_id, but also indexes added on top in the next PR) raced the recreate. `VectorSearchEndpointState` now embeds `vectorsearch.CreateEndpoint` and adds `EndpointUuid`. `DoCreate` records the UUID from the create response; `DoUpdate` copies it from `entry.RemoteState` so unrelated updates (e.g. `min_qps`) don't blank it out. `OverrideChangeDesc` classifies `endpoint_uuid` drift as `Recreate` when saved differs from remote, `Skip` otherwise. `drift/recreated_same_name` flips from a "badness snapshot" (which captured the old behavior of permissions silently rebinding) to the recreate behavior, with a permissions block on the endpoint to verify the cascade rebinds correctly. `drift/min_qps/out.plan.direct.json` regenerates to include the new `endpoint_uuid` skip entry in the detailed plan. ## Why Splitting this out of the larger `vector_search_indexes` PR ([#5123](#5123)) so it can land independently. The index PR builds on the persisted UUID for orphan detection, but the endpoint UUID work stands on its own and is useful regardless. ## Tests - `make fmtfull`, `make checks`, `make lintfull` — clean. - `make test` — green (`libs/apps/runlocal` needed `NODE_OPTIONS=` for the harness leak; unrelated). `bundle/internal/schema TestRequiredAnnotationsForNewFields` panics, which is failing on `main` for unrelated reasons. - `go test ./acceptance -run 'TestAccept/bundle/resources/vector_search_endpoints'` — all green, including the flipped `drift/recreated_same_name`. _This PR was written by Claude Code._
## Summary - Reverts #5127 (`Persist endpoint UUID for vector_search_endpoints drift detection`) and the follow-up changelog entry from #5192. - The badness #5127 was meant to fix — bundle silently rebinding permissions to a different backing endpoint after an out-of-band recreate — was actually addressed by the testserver fix in #5186 (`testserver: 404 on permissions GET when V2 parent is gone`). With the testserver matching real V2 cloud behavior, bundle correctly observes that the new endpoint has no permissions and creates them, with no permanent drift afterwards. UUID persistence in state is no longer necessary. - Reworks the `drift/recreated_same_name` acceptance test: keeps endpoint permissions in `databricks.yml`, drops the obsolete "recreate detected" assertion, and adds a post-deploy `bundle plan` to confirm there is no permanent drift. ## Test plan - [x] `./task build` clean. - [x] `go test ./acceptance -run 'TestAccept/bundle/resources/vector_search_endpoints/drift'` — all green (terraform + direct). - [x] `go test ./bundle/direct/dresources/...` — green. - [x] `./task lint-q` — clean. - [x] Verified post-deploy plan shows `Plan: 0 to add, 0 to change, 0 to delete, 2 unchanged` after an out-of-band endpoint recreate, so permissions don't end up in permanent drift even without UUID-based recreate detection. This pull request and its description were written by Isaac.
…abricks#5192) (databricks#5193) ## Summary - Reverts databricks#5127 (`Persist endpoint UUID for vector_search_endpoints drift detection`) and the follow-up changelog entry from databricks#5192. - The badness databricks#5127 was meant to fix — bundle silently rebinding permissions to a different backing endpoint after an out-of-band recreate — was actually addressed by the testserver fix in databricks#5186 (`testserver: 404 on permissions GET when V2 parent is gone`). With the testserver matching real V2 cloud behavior, bundle correctly observes that the new endpoint has no permissions and creates them, with no permanent drift afterwards. UUID persistence in state is no longer necessary. - Reworks the `drift/recreated_same_name` acceptance test: keeps endpoint permissions in `databricks.yml`, drops the obsolete "recreate detected" assertion, and adds a post-deploy `bundle plan` to confirm there is no permanent drift. ## Test plan - [x] `./task build` clean. - [x] `go test ./acceptance -run 'TestAccept/bundle/resources/vector_search_endpoints/drift'` — all green (terraform + direct). - [x] `go test ./bundle/direct/dresources/...` — green. - [x] `./task lint-q` — clean. - [x] Verified post-deploy plan shows `Plan: 0 to add, 0 to change, 0 to delete, 2 unchanged` after an out-of-band endpoint recreate, so permissions don't end up in permanent drift even without UUID-based recreate detection. This pull request and its description were written by Isaac.
## Changes
Adds `vector_search_indexes` as a first-class DABs resource on the
direct engine, alongside the existing `vector_search_endpoints`. Direct
engine only — vector search has no Terraform provider.
```yaml
resources:
vector_search_endpoints:
my_endpoint:
name: my-endpoint
endpoint_type: STANDARD
vector_search_indexes:
my_index:
name: main.default.my_index
endpoint_name: ${resources.vector_search_endpoints.my_endpoint.name}
primary_key: id
index_type: DELTA_SYNC
delta_sync_index_spec:
source_table: main.default.source
pipeline_type: TRIGGERED
grants:
- principal: data-engineers
privileges: [SELECT]
```
What's included:
- **Resource model** in `bundle/config/resources/vector_search_index.go`
(with `grants`) and `bundle/direct/dresources/vector_search_index.go`
(state, lifecycle, drift classification). `RemapState` round-trips
`index_subtype` so a populated remote subtype isn't classified as drift
on the next plan.
- **UC grants** wired through the generic grants path with securable
type `table`.
- **`recreate_on_changes`** for immutable spec fields (`name`,
`endpoint_name`, `index_type`, `index_subtype`, `primary_key`,
`delta_sync_index_spec`, `direct_access_index_spec`);
`delta_sync_index_spec.columns_to_sync` marked `ignore_remote_changes`
(request-only field — see follow-up note below). The index API has no
rename or update path, so any config-side change has to round-trip
through delete + create.
- **Index orphaning detection**: index state persists the
`endpoint_uuid` of the endpoint it was created against. `DoRead` looks
up the current endpoint UUID by name; if the endpoint was deleted
out-of-band the lookup returns `""` and `OverrideChangeDesc` classifies
the saved-vs-remote mismatch as `Recreate`. Builds on the endpoint UUID
persistence merged in #5127.
- **Async delete handling**: new optional `WaitAfterDelete` adapter
method (sibling to `WaitAfterCreate` / `WaitAfterUpdate`). For VS
indexes it polls `GetIndex` until 404 (15-minute cap). `apply.Recreate`
runs `DoDelete → DeleteState → WaitAfterDelete → DoCreate → SaveState →
WaitAfterCreate`, so a wait-time failure leaves the bundle consistent.
Replaces the prior `SaveState("", nil, nil)` placeholder that produced
`invalid state: empty id` planning failures on partial recreate.
- **Destructive-action prompt** for VS indexes in `bundle/phases/`. The
message intentionally covers both Delta Sync ("re-runs the embedding
pipeline") and Direct Access ("upserted vectors lost") in one paragraph
— picking a type-specific message from the bundle config would be wrong
on type changes (`DELTA_SYNC` → `DIRECT_ACCESS` recreates would describe
the destination type while the actual teardown is of the source type).
- **Dev-mode name prefixing** for indexes prefixes only the leaf
component of `catalog.schema.name`, since catalog and schema are
external references (the previous behavior produced invalid names like
`dev_jan_main.default.my_index`). The mutator skips names that still
carry literal `${...}` tokens, since the leaf split would otherwise
inject the prefix inside the trailing ref expression itself.
- **Testserver** enforces endpoint existence on index create. Index
status returns `Ready: true` immediately, matching the convention used
by every other slow resource the testserver fakes (endpoints → `ONLINE`,
database instances → `AVAILABLE`, apps → `RUNNING`).
`index_type` / spec-block consistency is intentionally **not** validated
client-side — the CreateIndex API rejects mismatched combinations at
deploy time, and replicating that check in DABs would just duplicate
backend logic.
## Why
The direct engine recently gained `vector_search_endpoints` (#4887).
This PR extends the support to indexes, which were the missing half.
Along the way it surfaces and fixes a number of issues:
- Without persisted endpoint UUIDs, identity drift was undetectable. An
index pointing at a deleted-and-recreated endpoint would appear live by
name but its backing endpoint was gone, leading to confusing "index
already exists" errors on subsequent deploys. #5127 added the same UUID
tracking on the endpoint side; this PR mirrors it on the index side so
the orphan is caught.
- The async deletion model isn't documented in the SDK, but `recreate`
deploys hit it every time. Without a wait, every recreate failed on the
immediate Create.
- `apply.Recreate` was writing a malformed empty-ID state entry as its
"delete state" step, which then poisoned the next plan with `invalid
state: empty id`.
- Recreating a VS index is genuinely expensive — Delta Sync re-runs the
full embedding pipeline; Direct Access loses every upserted vector. The
destructive-action prompt now reflects that.
## Follow-ups
- **`delta_sync_index_spec.columns_to_sync`** is request-only in the SDK
today: the field is accepted on `Create` but the `Get` response doesn't
echo it back, which is why we mark it `ignore_remote_changes` here.
There's an open backend PR to expose `columns_to_sync` on the read path;
once the SDK is regenerated against that, we can drop the
`ignore_remote_changes` entry and let normal drift detection handle the
field.
- **`vector_search_endpoints.budget_policy_id`** drift (effective vs.
requested) and the SDK doc-comment for
**`vector_search_endpoints.usage_policy_id`** are intentionally not in
this PR — both will be addressed by the next SDK bump and the
corresponding `./task generate-schema` regen.
## Tests
- `./task fmt`, `./task checks`, `./task lint` — all clean.
- `./task test` — unit tests green across `bundle/...`.
- New unit test `TestVectorSearchIndexNameWithUnresolvedRefsLeftAlone`
in `apply_target_mode_test.go` exercises the leaf-prefix skip on
`${var.catalog}.${var.schema}.${var.index}`.
- New acceptance directories under
`acceptance/bundle/resources/vector_search_indexes/`: `basic`,
`drift/columns_to_sync`, `drift/deleted_remotely`,
`drift/orphaned_endpoint`, `recreate/index_type`,
`recreate/mixed_types`, `grants/select`.
- The recreate request log
(`recreate/index_type/out.requests.recreate.direct.json`) captures `GET
→ DELETE → GET → POST` with `--get` enabled in `print_requests.py`. The
middle `GET` is the `WaitAfterDelete` poll; if a future change drops the
wait the regenerated capture loses that line and the test fails.
- `acceptance/bundle/validate/presets_name_prefix` covers the leaf-only
name prefix on a 3-part index name.
- `acceptance/bundle/invariant/configs/vector_search_index.yml.tmpl`
exercises the resource through the invariant matrix; the testserver
enforces endpoint existence on index create.
- Live tested with `--profile tmp` against staging across initial deploy
/ drift / recreate / destroy.
_This PR was written by Claude Code._
Changes
Persist
endpoint_uuidin state and detect identity drift onvector_search_endpoints.The endpoint name is stable but its UUID changes if the endpoint is deleted and recreated by name (e.g. via the workspace UI). Without persisting the UUID:
endpoint_uuid(most importantly the permissions object_id, but also indexes added on top in the next PR) raced the recreate.VectorSearchEndpointStatenow embedsvectorsearch.CreateEndpointand addsEndpointUuid.DoCreaterecords the UUID from the create response;DoUpdatecopies it fromentry.RemoteStateso unrelated updates (e.g.min_qps) don't blank it out.OverrideChangeDescclassifiesendpoint_uuiddrift asRecreatewhen saved differs from remote,Skipotherwise.drift/recreated_same_nameflips from a "badness snapshot" (which captured the old behavior of permissions silently rebinding) to the recreate behavior, with a permissions block on the endpoint to verify the cascade rebinds correctly.drift/min_qps/out.plan.direct.jsonregenerates to include the newendpoint_uuidskip entry in the detailed plan.Why
Splitting this out of the larger
vector_search_indexesPR (#5123) so it can land independently. The index PR builds on the persisted UUID for orphan detection, but the endpoint UUID work stands on its own and is useful regardless.Tests
make fmtfull,make checks,make lintfull— clean.make test— green (libs/apps/runlocalneededNODE_OPTIONS=for the harness leak; unrelated).bundle/internal/schema TestRequiredAnnotationsForNewFieldspanics, which is failing onmainfor unrelated reasons.go test ./acceptance -run 'TestAccept/bundle/resources/vector_search_endpoints'— all green, including the flippeddrift/recreated_same_name.This PR was written by Claude Code.