ras: aest: extend AEST support to Device Tree frontend#603

Draft
umang-chheda wants to merge 24 commits into
qualcomm-linux:qcom-6.18.y from
umang-chheda:edac-post

Conversation

@umang-chheda

This series extends Tian Ruidong’s [1] ACPI-based AEST support series
to also cover Device Tree based platforms.

While the existing AEST driver relies on the AEST ACPI table [3], many
embedded Arm platforms use Device Tree exclusively and cannot use the
driver today. This series adds a DT frontend that mirrors the ACPI
implementation and feeds the same core driver, keeping ACPI and DT
paths functionally equivalent.

Along the way, several correctness issues were identified in the core
driver and are fixed in the first part of this series.

The DT frontend is mutually exclusive with ACPI and does not introduce
any DT-specific logic into the core.

Ruidong Tian and others added 24 commits May 19, 2026 17:03
This patch introduces the creation of AEST platform devices, where each
device represents a logical "error node device" grouping one or more
AEST nodes from the ACPI table.

Instead of relying on the optional 'error_node_device' field in the AEST
table[1], this commit uses the interrupt number as the sole identifier for
the parent device. This design simplifies the driver logic by providing a
single, consistent mechanism for grouping nodes.

The 'error_node_device' field can be unspecified, but an AEST node is
always physically associated with a parent component. The interrupt
number serves as a reliable proxy for this association. This approach
is based on the safe assumption that distinct hardware components (e.g.,
SMMU, CMN, GIC) are assigned unique error interrupts and do not share
them.
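
The grouping described above can be sketched in user-space C as follows. This is only an illustration under stated assumptions: `struct node`, `struct dev_table`, `dev_for_irq()` and `group_nodes()` are hypothetical names, not the driver's actual structures or functions, and the fixed-size table stands in for whatever allocation the driver really uses.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_DEVS 8

/* Hypothetical, simplified stand-ins for the driver's structures. */
struct node { int irq; int dev_index; };
struct dev_table { int irq[MAX_DEVS]; size_t count; };

/* Return the index of the device that owns 'irq', creating one if needed. */
static int dev_for_irq(struct dev_table *t, int irq)
{
    for (size_t i = 0; i < t->count; i++)
        if (t->irq[i] == irq)
            return (int)i;
    assert(t->count < MAX_DEVS);
    t->irq[t->count] = irq;
    return (int)t->count++;
}

/* Assign every node to the device keyed by its interrupt number. */
static void group_nodes(struct dev_table *t, struct node *nodes, size_t n)
{
    for (size_t i = 0; i < n; i++)
        nodes[i].dev_index = dev_for_irq(t, nodes[i].irq);
}
```

Two nodes that report the same interrupt number end up under the same parent device; a node with a different interrupt gets a device of its own.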

[1]: https://developer.arm.com/documentation/den0085/latest

Signed-off-by: Ruidong Tian <tianruidong@linux.alibaba.com>
Link: https://patch.msgid.link/20260122094656.73399-2-tianruidong@linux.alibaba.com
Signed-off-by: Umang Chheda <umang.chheda@oss.qualcomm.com>
Parse register information from the AEST table in the probe function,
create the corresponding structures, and map the AEST records.

Signed-off-by: Ruidong Tian <tianruidong@linux.alibaba.com>
Link: https://patch.msgid.link/20260122094656.73399-3-tianruidong@linux.alibaba.com
Signed-off-by: Umang Chheda <umang.chheda@oss.qualcomm.com>
Support for various AEST group formats allows for flexible configuration of
AEST node address space sizes and maximum record counts per group.

Signed-off-by: Ruidong Tian <tianruidong@linux.alibaba.com>
Link: https://patch.msgid.link/20260122094656.73399-4-tianruidong@linux.alibaba.com
Signed-off-by: Umang Chheda <umang.chheda@oss.qualcomm.com>
…IO register

Use record_read/write as a single accessor for both system registers and
MMIO registers, which keeps the code concise.
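
A minimal user-space sketch of such a common accessor is shown below. Only the names record_read and record_write come from this patch; their signatures here, `enum access_type`, `struct record`, `record_slot()` and the two simulated register arrays are assumptions made for illustration. The real driver would use system register and MMIO accessors instead of plain arrays.

```c
#include <assert.h>
#include <stdint.h>

enum access_type { ACCESS_SYSREG, ACCESS_MMIO };

/* Simulated backing stores; real code would use sysreg and MMIO accessors. */
static uint64_t fake_sysregs[4];
static uint64_t fake_mmio[4];

/* Hypothetical record descriptor: how to reach it and which slot it uses. */
struct record {
    enum access_type type;
    unsigned int index;
};

/* Pick the backing store once, so callers never branch on the access type. */
static uint64_t *record_slot(const struct record *r)
{
    assert(r->index < 4);
    return r->type == ACCESS_SYSREG ? &fake_sysregs[r->index]
                                    : &fake_mmio[r->index];
}

static uint64_t record_read(const struct record *r)
{
    return *record_slot(r);
}

static void record_write(const struct record *r, uint64_t v)
{
    *record_slot(r) = v;
}
```

Callers use the same two functions for every record and the access method is decided in one place.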

Signed-off-by: Ruidong Tian <tianruidong@linux.alibaba.com>
Link: https://patch.msgid.link/20260122094656.73399-5-tianruidong@linux.alibaba.com
Signed-off-by: Umang Chheda <umang.chheda@oss.qualcomm.com>
The RAS version of a component can be probed via its ERRDEVARCH register.

In cases where a component (e.g., SMMU) does not implement an ERRDEVARCH
register, the driver falls back to using the RAS version of the Processing
Element (PE).

Signed-off-by: Ruidong Tian <tianruidong@linux.alibaba.com>
Link: https://patch.msgid.link/20260122094656.73399-6-tianruidong@linux.alibaba.com
Signed-off-by: Umang Chheda <umang.chheda@oss.qualcomm.com>
Add the injection registers described in the Common Fault Injection Model
Extension.

Signed-off-by: Ruidong Tian <tianruidong@linux.alibaba.com>
Link: https://patch.msgid.link/20260122094656.73399-7-tianruidong@linux.alibaba.com
Signed-off-by: Umang Chheda <umang.chheda@oss.qualcomm.com>
The CE threshold defines the number of Correctable Errors (CE) that
must occur in a record before triggering an interrupt. Error records
support multiple threshold configurations, including 8B, 16B, and 32B.
This patch detects the supported threshold settings for error records
and sets the default threshold to 1, ensuring an interrupt is generated
for every CE occurrence.
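
One way to picture a threshold of 1 is the following user-space model. It rests on two assumptions that this commit message does not state: that the listed configurations are counter widths of 8, 16 and 32 bits, and that the threshold is realized by preloading a counter that raises the interrupt when it overflows. All names (`struct ce_counter`, `ce_set_threshold()`, `ce_record_error()`) are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Simulated corrected-error counter of a given width (assumed 8, 16 or 32). */
struct ce_counter {
    unsigned int width;
    uint64_t value;
    int overflowed;
};

static uint64_t ce_max(const struct ce_counter *c)
{
    return (UINT64_C(1) << c->width) - 1;
}

/* Preload the counter so that it overflows after 'threshold' errors. */
static void ce_set_threshold(struct ce_counter *c, uint64_t threshold)
{
    assert(threshold >= 1 && threshold <= ce_max(c));
    c->value = ce_max(c) - threshold + 1;
    c->overflowed = 0;
}

/* Model one corrected error: increment and flag overflow on wrap-around. */
static void ce_record_error(struct ce_counter *c)
{
    if (c->value == ce_max(c)) {
        c->value = 0;
        c->overflowed = 1;  /* point at which an interrupt would be raised */
    } else {
        c->value++;
    }
}
```

In this model a threshold of 1 preloads the counter to its maximum value, so the very first corrected error wraps it and signals overflow.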

Signed-off-by: Ruidong Tian <tianruidong@linux.alibaba.com>
Link: https://patch.msgid.link/20260122094656.73399-8-tianruidong@linux.alibaba.com
Signed-off-by: Umang Chheda <umang.chheda@oss.qualcomm.com>
The interrupt numbers for certain error records may be explicitly
programmed into their configuration register.

For PPIs, each core maintains its own copy of the aest_device
structure.

Given that handling RAS errors entails complex processes such as EDAC
and memory_failure, all handling is deferred to and handled within a
bottom-half context.

Signed-off-by: Ruidong Tian <tianruidong@linux.alibaba.com>
Link: https://patch.msgid.link/20260122094656.73399-9-tianruidong@linux.alibaba.com
Signed-off-by: Umang Chheda <umang.chheda@oss.qualcomm.com>
Move the configuration of interrupts and CE thresholds
into the CPU hotplug callbacks for the per-CPU AEST node.

Signed-off-by: Ruidong Tian <tianruidong@linux.alibaba.com>
Link: https://patch.msgid.link/20260122094656.73399-10-tianruidong@linux.alibaba.com
Signed-off-by: Umang Chheda <umang.chheda@oss.qualcomm.com>
Expose certain AEST driver information to userspace.

Only root can access these interfaces because they include
hardware-sensitive information:

  ls /sys/kernel/debug/aest/
  memory<id> smmu<id> ...

  ls /sys/kernel/debug/aest/memory<id>/
  record0 record1 ...

All details at:
        Documentation/ABI/testing/debugfs-aest

Signed-off-by: Ruidong Tian <tianruidong@linux.alibaba.com>
Link: https://patch.msgid.link/20260122094656.73399-11-tianruidong@linux.alibaba.com
Signed-off-by: Umang Chheda <umang.chheda@oss.qualcomm.com>
This commit introduces error counting functionality for AEST records.
Previously, error statistics were not directly available for individual
error records or AEST nodes.

Signed-off-by: Ruidong Tian <tianruidong@linux.alibaba.com>
Link: https://patch.msgid.link/20260122094656.73399-12-tianruidong@linux.alibaba.com
Signed-off-by: Umang Chheda <umang.chheda@oss.qualcomm.com>
This commit introduces the ability to configure the Corrected Error (CE)
threshold for AEST records through debugfs. This allows administrators to
dynamically adjust the CE threshold for error reporting.

Signed-off-by: Ruidong Tian <tianruidong@linux.alibaba.com>
Link: https://patch.msgid.link/20260122094656.73399-13-tianruidong@linux.alibaba.com
Signed-off-by: Umang Chheda <umang.chheda@oss.qualcomm.com>
AEST offers both soft and hard injection. Soft injection simulates errors
in software, providing flexibility to define the error register content.
Hard injection, on the other hand, utilizes error injection registers to
introduce hardware faults, strictly requiring values that adhere to their
specifications.

Read Documentation/ABI/testing/debugfs-aest to learn how to use them.

Signed-off-by: Ruidong Tian <tianruidong@linux.alibaba.com>
Link: https://patch.msgid.link/20260122094656.73399-14-tianruidong@linux.alibaba.com
Signed-off-by: Umang Chheda <umang.chheda@oss.qualcomm.com>
The AEST table includes vendor error nodes to support components that do
not implement the standard Arm RAS architecture[1]. Each vendor node may
have its own initialization and interrupt handling functions. This patch
supplies a framework to process vendor error nodes; the vendor processing
function is bound to the vendor HID.

[1]: https://developer.arm.com/documentation/ddi0587/latest/

Signed-off-by: Ruidong Tian <tianruidong@linux.alibaba.com>
Link: https://patch.msgid.link/20260122094656.73399-15-tianruidong@linux.alibaba.com
Signed-off-by: Umang Chheda <umang.chheda@oss.qualcomm.com>
The CMN (Coherent Mesh Network) architecture incorporates five distinct
device types. Each device type is associated with an error group register
set. The struct aest_cmn_700 models a single CMN instance, while struct
aest_cmn_700_child represents an individual CMN device.

CMN's error records utilize a memory-mapped single error record view [1].
Critically, one error record corresponds to one AEST node, implying that
a single CMN instance can generate hundreds of AEST nodes. To manage this
scale, this driver introduces a virtual AEST node, which represents an
entire CMN device, such as an HNI or HNF. This allows an HNF AEST node,
for instance, to leverage its errgsr register to pinpoint which specific
error record has reported an error.

During the AEST probe phase, the CMN AEST driver identifies the CMN node
type using the cmn_node_info register. It then reorganizes all AEST nodes
belonging to the same CMN node type into a cohesive CMN AEST node
structure. To locate the relevant CMN register addresses, the CMN's
presence in the DSDT is required, along with the CMN node offset
specified in the AEST vendor specification data [1].

[1]: https://developer.arm.com/documentation/102308/latest/

Signed-off-by: Ruidong Tian <tianruidong@linux.alibaba.com>
Link: https://patch.msgid.link/20260122094656.73399-16-tianruidong@linux.alibaba.com
Signed-off-by: Umang Chheda <umang.chheda@oss.qualcomm.com>
Add a trace event for hardware errors reported by the ARMv8
RAS extension registers. A userspace application can monitor this
trace event and decode the error information.

Signed-off-by: Ruidong Tian <tianruidong@linux.alibaba.com>
Link: https://patch.msgid.link/20260122094656.73399-17-tianruidong@linux.alibaba.com
Signed-off-by: Umang Chheda <umang.chheda@oss.qualcomm.com>
… messages

Two related fixes for processor nodes with ACPI_AEST_PROC_FLAG_SHARED
or ACPI_AEST_PROC_FLAG_GLOBAL set (e.g. cluster L3 cache, DSU):

1. aest_dev_is_oncore() returns true for any PROCESSOR_ERROR_NODE,
   causing shared processor nodes (which use an SPI) to take the
   cpuhp/PPI path.  cpuhp_setup_state() is called instead of
   aest_online_dev(), so aest_config_irq() is never called and the
   hardware IRQ-config register is never programmed.

   Fix aest_dev_is_oncore() to check irq_is_percpu() on the registered
   IRQ.  Only nodes whose FHI or ERI is a per-CPU PPI take the oncore
   path; nodes with an SPI take aest_online_dev().

2. alloc_aest_node_name() uses processor_id for the node name of all
   processor nodes.  Shared/global nodes have processor_id=0 (the
   field is unused when SHARED/GLOBAL is set), so every shared node
   and the per-PE node for CPU 0 both got the name "processor.0",
   making error logs ambiguous.

   For shared/global nodes, build the name as
   "processor.<resource_type>.<device_id>" (e.g. "processor.cache.1")
   so each node has a unique, meaningful identifier.  Per-PE nodes
   keep the original "processor.<mpidr>" form.

   Also add proc_flags to struct aest_event so aest_print() can
   distinguish shared from per-PE nodes and print an appropriate
   message.
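
The naming rule from the second fix can be sketched in user-space C as below. This is not the driver's alloc_aest_node_name(); `struct proc_node` and `node_name()` are hypothetical, and printing the MPIDR in hexadecimal is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical simplified view of a processor node. */
struct proc_node {
    bool shared_or_global;      /* SHARED or GLOBAL flag set */
    const char *resource_type;  /* e.g. "cache" */
    unsigned int device_id;
    unsigned long long mpidr;   /* only meaningful for per-PE nodes */
};

/* Build the node name; returns the length snprintf() reports. */
static int node_name(const struct proc_node *n, char *buf, size_t len)
{
    assert(buf && len > 0);
    if (n->shared_or_global)
        return snprintf(buf, len, "processor.%s.%u",
                        n->resource_type, n->device_id);
    /* Hexadecimal MPIDR formatting is an assumption of this sketch. */
    return snprintf(buf, len, "processor.%llx", n->mpidr);
}
```

With this rule a shared L3 cache node with device_id 1 no longer collides with the per-PE node for CPU 0.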

Link: https://lore.kernel.org/lkml/20260505-aest-devicetree-support-v1-1-d5d6ffacf0a5@oss.qualcomm.com/
Signed-off-by: Umang Chheda <umang.chheda@oss.qualcomm.com>
The error counts visible under:
  /sys/kernel/debug/aest/<dev>/processor<cpu>/<node>/err_count

always reported zero, even though corrected errors (CEs) were being
serviced by the interrupt handler. aest_oncore_dev_init_debugfs() sets
up per CPU debugfs entries but wired them up incorrectly in two places:

- this_cpu_ptr(adev->adev_oncore) was used inside for_each_possible_cpu().
  This always selects the slot for the CPU executing the init code, so all
  debugfs files ended up referencing the same per CPU aest_device instance
  instead of the CPU indicated by the loop variable.

- The code referenced adev->nodes[i], i.e. the template nodes allocated
  before __setup_ppi, rather than the per-CPU copies at
  percpu_dev->nodes[i]. The IRQ handler updates CE counters in the per-CPU
  records created by __setup_ppi; the template records are never touched
  at runtime, so err_count always read as zero.

Fix this by:

- Using per_cpu_ptr(adev->adev_oncore, cpu) when iterating over CPUs.

- Wiring debugfs files to percpu_dev->nodes[i] so counters reflect the
  data updated by the IRQ handler.

- Using adev->nodes[i].name for debugfs directory names. The per-CPU node
  receives name via a shallow memcpy and is not the authoritative source.
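
The difference between indexing by the loop variable and indexing by the running CPU can be modelled in user space as follows. The array stands in for the kernel's per-CPU allocation, `dev_for_cpu()` plays the role of per_cpu_ptr(), and all names in the sketch (`struct percpu_dev`, `wire_views()`, `NR_CPUS` as a plain constant) are hypothetical.

```c
#include <assert.h>

#define NR_CPUS 4

/* Simulated per-CPU device state; the kernel uses a real per-CPU allocation. */
struct percpu_dev { unsigned long err_count; };

static struct percpu_dev percpu_devs[NR_CPUS];

/* Stand-in for per_cpu_ptr(): the slot of an explicitly named CPU. */
static struct percpu_dev *dev_for_cpu(int cpu)
{
    assert(cpu >= 0 && cpu < NR_CPUS);
    return &percpu_devs[cpu];
}

/* Wire one debugfs-like view per CPU to that CPU's own counters. */
static void wire_views(struct percpu_dev *views[NR_CPUS])
{
    for (int cpu = 0; cpu < NR_CPUS; cpu++)
        views[cpu] = dev_for_cpu(cpu);  /* not the slot of the running CPU */
}
```

Had the loop used the slot of the CPU running the init code, every entry of `views` would have pointed at the same structure.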

Link: https://lore.kernel.org/lkml/20260505-aest-devicetree-support-v1-2-d5d6ffacf0a5@oss.qualcomm.com/
Signed-off-by: Umang Chheda <umang.chheda@oss.qualcomm.com>
The record_implemented bitmap uses the same semantics as the rest of
the driver: a SET bit means the record is NOT implemented (skip it),
a CLEAR bit means the record IS implemented (process it).

aest_node_init_debugfs() and aest_node_err_count_show() were iterating
all record_count records unconditionally, creating debugfs entries and
accumulating error counts for unimplemented records too.

Fix both functions to skip records where the corresponding bit is set
in node->record_implemented, consistent with how aest_node_foreach_record()
handles the same bitmap.
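
The inverted bitmap semantics are easy to get wrong, so here is a small user-space sketch of the skip logic. `struct node`, `record_is_implemented()` and `node_err_count()` are hypothetical simplifications, and the 64-record limit is an assumption of the sketch rather than a property of the driver.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical simplified node: up to 64 records, one bitmap bit each. */
struct node {
    uint64_t record_implemented;  /* SET bit: record is NOT implemented */
    unsigned int record_count;
    unsigned long err_count[64];
};

static int record_is_implemented(const struct node *n, unsigned int i)
{
    assert(i < 64);
    return !(n->record_implemented & (UINT64_C(1) << i));
}

/* Sum error counts over implemented records only. */
static unsigned long node_err_count(const struct node *n)
{
    unsigned long total = 0;

    for (unsigned int i = 0; i < n->record_count; i++)
        if (record_is_implemented(n, i))
            total += n->err_count[i];
    return total;
}
```

A record whose bit is set contributes nothing to the total, whatever stale value its counter holds.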

Link: https://lore.kernel.org/lkml/20260505-aest-devicetree-support-v1-3-d5d6ffacf0a5@oss.qualcomm.com/
Signed-off-by: Umang Chheda <umang.chheda@oss.qualcomm.com>
The driver unconditionally calls panic() whenever an unrecoverable,
uncontainable UE (UET_UC or UET_UEU) is detected. There is no way
for the user to suppress this behaviour, which makes it difficult to
test UE injection or to run in environments where a kernel panic on
every UE is undesirable.

Add a module parameter `aest_panic_on_ue`. When set to 0, the driver
logs the UE and continues instead of panicking.

Usage:
  # Boot time (kernel cmdline)
  aest.aest_panic_on_ue=0

  # Runtime
  echo 0 > /sys/module/aest/parameters/aest_panic_on_ue
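
The resulting decision can be modelled in user-space C as shown below. Only `aest_panic_on_ue`, UET_UC and UET_UEU are taken from this patch; `UET_OTHER`, `enum ue_action` and `handle_ue()` are hypothetical, and the default of true is an assumption chosen to match the previous unconditional behaviour.

```c
#include <assert.h>
#include <stdbool.h>

/* UET_OTHER is a placeholder for every non-fatal error type. */
enum ue_type { UET_UC, UET_UEU, UET_OTHER };
enum ue_action { ACTION_LOG, ACTION_PANIC };

/* Assumed default: keep panicking unless the user opts out. */
static bool aest_panic_on_ue = true;

/* Decide what to do with an error of the given type. */
static enum ue_action handle_ue(enum ue_type type)
{
    assert(type == UET_UC || type == UET_UEU || type == UET_OTHER);
    bool fatal = (type == UET_UC || type == UET_UEU);

    return (fatal && aest_panic_on_ue) ? ACTION_PANIC : ACTION_LOG;
}
```

In the kernel the flag would be exposed with a module parameter so that both the command-line and sysfs forms above reach the same variable.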

Link: https://lore.kernel.org/lkml/20260505-aest-devicetree-support-v1-4-d5d6ffacf0a5@oss.qualcomm.com/
Signed-off-by: Umang Chheda <umang.chheda@oss.qualcomm.com>
The Arm Error Source Table (AEST) specification describes how firmware
exposes RAS error source topology to the operating system. On ACPI
systems this information is provided via the AEST ACPI table.

Introduce Device Tree bindings that provide an equivalent description
of AEST error sources for DT-based platforms.

Link: https://lore.kernel.org/lkml/20260505-aest-devicetree-support-v1-5-d5d6ffacf0a5@oss.qualcomm.com/
Signed-off-by: Umang Chheda <umang.chheda@oss.qualcomm.com>
Add a Device Tree frontend for the Arm AEST RAS framework, allowing the
existing AEST core driver to be used on DT-only systems.

The DT frontend parses the "arm,aest" Device Tree hierarchy and populates
the same internal structures as the ACPI-based implementation. It is
initialized at the same layer as ACPI and is mutually exclusive with it,
ensuring identical behaviour regardless of the firmware interface in use.

Link: https://lore.kernel.org/lkml/20260505-aest-devicetree-support-v1-6-d5d6ffacf0a5@oss.qualcomm.com/
Signed-off-by: Umang Chheda <umang.chheda@oss.qualcomm.com>
Add AEST RAS error source nodes for the Lemans SoC.

The DT describes a processor error source covering all CPU cores and a
shared L3 cache error source for the cluster. These nodes model the
hardware error reporting blocks and associated interrupts as required
by the Arm AEST specification.

Link: https://lore.kernel.org/lkml/20260505-aest-devicetree-support-v1-7-d5d6ffacf0a5@oss.qualcomm.com/
Co-developed-by: Faruque Ansari <faruque.ansari@oss.qualcomm.com>
Signed-off-by: Faruque Ansari <faruque.ansari@oss.qualcomm.com>
Signed-off-by: Umang Chheda <umang.chheda@oss.qualcomm.com>
Add AEST RAS error source nodes for the Monaco SoC.

The DT describes a processor error source covering all CPU cores and a
shared L3 cache error source for the cluster. These nodes model the
hardware error reporting blocks and associated interrupts as required
by the Arm AEST specification.

Link: https://lore.kernel.org/lkml/20260505-aest-devicetree-support-v1-8-d5d6ffacf0a5@oss.qualcomm.com/
Co-developed-by: Faruque Ansari <faruque.ansari@oss.qualcomm.com>
Signed-off-by: Faruque Ansari <faruque.ansari@oss.qualcomm.com>
Signed-off-by: Umang Chheda <umang.chheda@oss.qualcomm.com>
@qswat-orbit-external

Merge Check Failed: No CR Numbers Found

Error: No Change Request numbers were found.

Please add Change Request numbers to your pull request description in the format CRs-Fixed: 12345 or link GitHub issues that are associated with Change Requests.

@qcomlnxci qcomlnxci requested review from a team, Komal-Bajaj, quic-kaushalk and trsoni May 19, 2026 16:36