
skip service-events otlp pipelines when sigv4auth validation fails #2139

Open
jj22ee wants to merge 2 commits into aws:main from jj22ee:appsignals-logs-translator-pr-3

@jj22ee jj22ee commented Jun 2, 2026

Contributor

Description of the issue

Prevent CWAgent startup failure when Application Signals is enabled but the sigv4auth extension cannot resolve AWS credentials via the AWS SDK Go v2 default chain, which happens most commonly on on-prem hosts.

The Application Signals OTLP metrics export and the logs pipelines route data through the OTel sigv4auth extension, which signs requests using the AWS SDK Go v2 default credential chain. sigv4auth.Validate() eagerly resolves credentials during the collector's config-validation phase. If the v2 chain cannot resolve credentials, validation fails and CWAgent fails to start.

Description of changes

Pre-check whether sigv4auth can resolve credentials before registering the pipelines that depend on it, and degrade gracefully instead of failing startup.

  1. translator/translate/otel/extension/sigv4auth/translator.go — the new CanResolveCredentials() method builds a throwaway sigv4auth config with the same region/role the real extension would use and runs xconfmap.Validate() on it, exercising the exact code path that would otherwise fail at collector startup. The throwaway config is needed because no sigv4auth instance exists yet at translation time.

  2. translator/translate/otel/pipeline/applicationsignals/translators.go — NewTranslators() — updated so that when credentials cannot be resolved:

    • Metrics: register a single default pipeline (OTLP receiver → AppSignals processors → EMF exporter), with no routing connector and no OTLP export — it never references sigv4auth.
    • Logs: skip entirely.

The check runs the same resolution the real extension would, in the same credential environment, moments before the collector's own validation. If it passes, the real sigv4auth instances will pass too. If it fails, they would fail identically, so we avoid creating them and log a warning instead.

Behavior summary:

| sigv4auth creds resolvable | Metrics | Logs |
| --- | --- | --- |
| Yes | Routing → EMF + OTLP destinations (unchanged) | Full logs pipelines (unchanged) |
| No | Single default pipeline: OTLP → EMF only | None (skipped) |

License

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Tests

Run CWAgent with an Application Signals configuration in a Docker environment, without the fix and without credentials:

  -e AWS_ACCESS_KEY_ID="" \
  -e AWS_SECRET_ACCESS_KEY="" \
  -e AWS_SHARED_CREDENTIALS_FILE="/nonexistent" \
  -e AWS_CONFIG_FILE="/nonexistent" \
  -e AWS_EC2_METADATA_DISABLED="true" \
E! [EC2] Fetch identity document from EC2 metadata fail: EC2MetadataRequestError: failed to get EC2 instance identity document
2026/06/02 02:50:59 W! retry [0/3], unable to get http response from http://169.254.170.2/v2/metadata, error: unable to get response from http://169.254.170.2/v2/metadata, error: Get "http://169.254.170.2/v2/metadata": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
2026/06/02 02:51:00 W! retry [1/3], unable to get http response from http://169.254.170.2/v2/metadata, error: unable to get response from http://169.254.170.2/v2/metadata, error: Get "http://169.254.170.2/v2/metadata": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
2026/06/02 02:51:01 W! retry [2/3], unable to get http response from http://169.254.170.2/v2/metadata, error: unable to get response from http://169.254.170.2/v2/metadata, error: Get "http://169.254.170.2/v2/metadata": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
2026/06/02 02:51:01 I! access ECS task metadata fail with response unable to get response from http://169.254.170.2/v2/metadata, error: Get "http://169.254.170.2/v2/metadata": context deadline exceeded (Client.Timeout exceeded while awaiting headers), assuming I'm not running in ECS.

...

2026/06/02 02:51:01 E! failed to generate YAML configuration validation content: invalid otel config: extensions::sigv4auth/logs: could not retrieve credential provider: failed to refresh cached credentials, no EC2 IMDS role found, operation error ec2imds: GetMetadata, access disabled to EC2 IMDS via client option, or "AWS_EC2_METADATA_DISABLED" environment variable
extensions::sigv4auth/monitoring: could not retrieve credential provider: failed to refresh cached credentials, no EC2 IMDS role found, operation error ec2imds: GetMetadata, access disabled to EC2 IMDS via client option, or "AWS_EC2_METADATA_DISABLED" environment variable

Run CWAgent with the same configuration and environment, with the fix:

/opt/aws/amazon-cloudwatch-agent/bin/default_linux_config.json does not exist or cannot read. Skipping it.
2026/06/02 02:54:03 W! Skipping Application Signals OTLP metrics pipeline: AWS credentials unavailable for sigv4auth
2026/06/02 02:54:03 W! Skipping Application Signals logs pipeline: AWS credentials unavailable for sigv4auth
    shared_credential_file = "/root/.aws/credentials"
        shared_credentials_file:
            - /root/.aws/credentials
        shared_credentials_file:
            - /root/.aws/credentials
        shared_credentials_file:
            - /root/.aws/credentials

...

2026/06/02 02:54:00 I! attempt to access ECS task metadata to determine whether I'm running in ECS.
2026/06/02 02:54:03 I! access ECS task metadata fail with response unable to get response from http://169.254.170.2/v2/metadata, error: Get "http://169.254.170.2/v2/metadata": context deadline exceeded (Client.Timeout exceeded while awaiting headers), assuming I'm not running in ECS.
2026-06-02T02:54:03Z I! {"caller":"service@v0.124.0/service.go:266","msg":"Starting CWAgent...","Version":"Unknown","NumCPU":14}
2026-06-02T02:54:03Z I! {"caller":"service@v0.124.0/service.go:289","msg":"Everything is ready. Begin running and processing data."}
...

Also added unit and sampleConfig tests:

  • TestNewTranslatorsMetricsNoCredentials — 1 translator (default) when creds unresolvable
  • TestNewTranslatorsLogsNoCredentials — 0 log translators when creds unresolvable
  • TestTranslatorMetricsDefault — default metrics pipeline structure (OTLP → EMF)
  • TestAppSignalsNoCredentialsConfig — sampleConfig YAML for the no-credentials config

Requirements

Before committing your code, please do the following steps.

  1. Run make fmt and make fmt-sh
  2. Run make lint

Integration Tests

To run integration tests against this PR, add the ready for testing label.

@jj22ee jj22ee requested a review from a team as a code owner June 2, 2026 04:02
@github-actions

Contributor

This PR was marked stale due to lack of activity.

@github-actions github-actions Bot added the Stale label Jun 10, 2026