Thermal Sensor HAL

The Thermal Sensor HAL manages platform thermal state signalling for RDK-E devices. It abstracts the underlying hardware sensors, vendor thermal policy engine, and cooling device behaviour, presenting a unified, event-driven interface to the RDK Middleware.

Thermal thresholds are defined at multiple severity levels. This lets the system and user-space components receive early warning events and tell the user that the device is under thermal stress and will shut down if the condition is not rectified.

The vendor thermal policy remains responsible for determining exact threshold values and corresponding mitigation strategies. When any level is breached, the HAL emits a well-defined ActionEvent describing the new thermal condition.

This layered model keeps thermal management flexible across hardware implementations while ensuring consistent signalling behaviour to upper layers.


References

Interface Definition sensor/current/thermal
API Documentation TBD
HAL Interface Type AIDL and Binder
Initialization - TBC systemd - hal-sensor-thermal.service
VTS Tests TBC


Overview

The Thermal HAL allows RDK Middleware to receive high-level thermal action events, triggered by vendor-defined thermal policies.

Typical Use Cases

  • Prepare for thermal shutdown due to high board or SoC temperature
  • Gracefully handle shutdown events
  • Collect telemetry on thermal behaviour in the field

Implementation Requirements

| # | Requirement | Comments |
| --- | --- | --- |
| HAL.THERMAL.1 | Shall provide an event-driven API to report vendor-defined thermal state changes to RDK Middleware using the State enum. | |
| HAL.THERMAL.2 | Shall not expose or require Middleware to manage raw temperature thresholds or thermal policy decisions. | |
| HAL.THERMAL.3 | Shall emit all thermal state change events to registered Middleware clients and support registering / unregistering of such clients. | |
| HAL.THERMAL.4 | Shall support querying the current thermal state at any time via a getCurrentThermalState() API. | |
| HAL.THERMAL.5 | Shall support optional reporting of current temperature readings for platform sensors via getCurrentTemperatures(). | |
| HAL.THERMAL.6 | Shall provide a vendorInfo string field in thermal state change events for vendor-specific debug or telemetry purposes. | |
| HAL.THERMAL.7 | Shall update getCurrentThermalState() coherently with emitted state change events (CRITICAL_TEMPERATURE_EXCEEDED when cooling, NORMAL after recovery, CRITICAL_SHUTDOWN_IMMINENT before shutdown). | Ensures predictable state → event alignment. |

Interface Definition

| Interface Definition File | Description |
| --- | --- |
| com/rdk/hal/sensor/thermal/IThermalSensor.aidl | Main service interface for registering listeners and querying state/telemetry. |
| com/rdk/hal/sensor/thermal/IThermalEventListener.aidl | One-way callback for thermal state change events. |
| com/rdk/hal/sensor/thermal/ActionEvent.aidl | Parcelable event payload, including state, timestampMonotonicMs, and temperatureReading. |
| com/rdk/hal/sensor/thermal/State.aidl | Thermal state enumeration: NORMAL, CRITICAL_TEMPERATURE_EXCEEDED, CRITICAL_TEMPERATURE_RECOVERED, CRITICAL_SHUTDOWN_IMMINENT. |
| com/rdk/hal/sensor/thermal/TemperatureReading.aidl | Optional per-sensor telemetry record (°C + timestamp). |
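
The listener and query behaviour described by HAL.THERMAL.3, HAL.THERMAL.4 and HAL.THERMAL.7 can be illustrated with a small in-process model. This is only a sketch in Python, not the AIDL interface itself: there is no Binder transport, and the class name `ThermalSensorModel`, the methods `register_listener`, `unregister_listener` and `report_state`, and the exact field layout of the two records are assumptions made for illustration. Only the state names, `getCurrentThermalState()`, and the field names `state`, `timestampMonotonicMs`, `temperatureReading` and `vendorInfo` come from this page.

```python
import time
from dataclasses import dataclass
from enum import IntEnum
from typing import Callable, List, Optional


class State(IntEnum):
    NORMAL = 0
    CRITICAL_TEMPERATURE_EXCEEDED = 1
    CRITICAL_TEMPERATURE_RECOVERED = 2
    CRITICAL_SHUTDOWN_IMMINENT = 3


@dataclass(frozen=True)
class TemperatureReading:
    # Hypothetical layout: the page only says "°C + timestamp" per sensor.
    sensor_name: str
    celsius: float
    timestamp_monotonic_ms: int


@dataclass(frozen=True)
class ActionEvent:
    state: State
    timestamp_monotonic_ms: int
    temperature_reading: Optional[TemperatureReading] = None
    vendor_info: str = ""


Listener = Callable[[ActionEvent], None]


class ThermalSensorModel:
    """In-process stand-in for the thermal service (illustrative only)."""

    def __init__(self) -> None:
        self._state = State.NORMAL
        self._listeners: List[Listener] = []

    def register_listener(self, listener: Listener) -> bool:
        # Returns False if the listener is already registered.
        if listener in self._listeners:
            return False
        self._listeners.append(listener)
        return True

    def unregister_listener(self, listener: Listener) -> bool:
        # Returns False if the listener was never registered.
        if listener not in self._listeners:
            return False
        self._listeners.remove(listener)
        return True

    def get_current_thermal_state(self) -> State:
        return self._state

    def report_state(self, state: State,
                     reading: Optional[TemperatureReading] = None,
                     vendor_info: str = "") -> None:
        """Called by the vendor policy side when it decides on a new state."""
        if state == self._state:
            return  # No state change, so no event is emitted.
        # Update the queryable state before notifying, so a client that
        # queries from inside its callback sees the state the event carries.
        self._state = state
        now_ms = time.monotonic_ns() // 1_000_000
        event = ActionEvent(state, now_ms, reading, vendor_info)
        for listener in list(self._listeners):
            listener(event)
```

Updating the stored state before invoking the callbacks is one simple way to keep the query result and the emitted events aligned; a real implementation would also need to consider concurrency and one-way Binder delivery, which this sketch ignores.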

Initialization

The systemd hal-sensor-thermal.service unit file is provided by the vendor layer to start the service and should include Wants or Requires directives to start any platform driver services it depends upon.

Upon starting, the service shall register the IThermalSensor interface with the Service Manager using the string IThermalSensor.serviceName and immediately become operational.


System Context

The Thermal HAL fits into the system architecture as the thermal state signalling layer between vendor-defined thermal policy engines and the RDK Middleware.

It enables consistent and portable notification of thermal state changes to the Middleware and Applications, allowing system UX and behaviour to be adapted accordingly.

The HAL abstracts away the diversity of hardware implementations and thermal policy tuning across platforms, exposing only well-defined thermal state change events to the RDK stack.

Design Principles

  • Vendor Controls Policy

Thermal thresholds, response logic, and mitigation strategies are entirely defined and owned by the vendor platform. These are established by the hardware design team as part of the thermal envelope validation process and are not configurable by RDK middleware.

The vendor implementation must:

  • Define and enforce temperature thresholds approved for the hardware design.
  • Ensure the platform behaves deterministically when limits are exceeded.
  • Apply any platform-specific thermal regulation or mitigation automatically (e.g. fan control) if supported.
  • Trigger a system-controlled shutdown when required — this action is outside the control of upper software layers.
  • Update the vendor HAL layer if observed behaviour deviates from validated specifications.

In essence, thermal compliance is a hardware validation responsibility, not a runtime tuning exercise. The HAL exposes events and telemetry for visibility — it does not make policy decisions.

  • HAL Emits State Change Events

The HAL standardizes how temperature and mitigation state changes are reported. Events carry portable, structured data that allows middleware to log, visualize, or correlate system behaviour without influencing platform policy.

  • Middleware Acts on Events

Middleware may use events for UX adaptation, application awareness, or telemetry, but it does not modify hardware thresholds or take control actions. The separation ensures deterministic hardware behaviour even if middleware is delayed or offline.

  • Telemetry Optional

Middleware may periodically query temperature sensors for analytics or backend reporting. This is an optional diagnostic feature — it must not interfere with the vendor’s thermal control mechanisms.

  • Platform-Specific Cooling Technologies Supported

The HAL design accommodates vendor-unique mitigation strategies (e.g. fan profiles, passive heat spreading). Such features remain internal to the vendor implementation but are surfaced via standardized event types for consistent observability.

  • Explicit Shutdown Signalling

When a thermal limit forces a platform-controlled shutdown, the HAL must emit a final state change event indicating that condition. This provides clear auditability and compliance traceability for safety and user-experience requirements.

graph RL
    subgraph HAL
      B1[Vendor Thermal Policy Engine] --> C1[Thermal HAL Layer]
    end
    C1 --> D1[RDK Middleware]
    subgraph "Vendor / Hardware"
        A1[SoC Thermal Sensors] --> B1
        A2[External Sensors: DDR, USB, PMIC] --> B1
        A3[Cooling Devices: Fan, TEC, Heatsink] --> B1
    end

    classDef background fill:#121212,stroke:none,color:#E0E0E0;
    classDef blue fill:#1565C0,stroke:#E0E0E0,stroke-width:2px,color:#E0E0E0;
    classDef wheat fill:#FFB74D,stroke:#424242,stroke-width:2px,color:#000000;
    classDef green fill:#4CAF50,stroke:#E0E0E0,stroke-width:2px,color:#FFFFFF;

    D1:::blue
    C1:::wheat
    A1:::green
    A2:::green
    A3:::green
    B1:::wheat

Thermal Policy Ownership

| Aspect | Owner |
| --- | --- |
| Sensor thresholds | Vendor |
| Policy decisions | Vendor |
| Cooling device control | Vendor |
| State change signalling | HAL |
| Middleware reaction | RDK Middleware |
| App reaction | Applications (via MW) |

Thermal States

The Thermal HAL exposes a small, extensible set of thermal states, represented by the State enum.

These states represent high-level system thermal conditions determined by the platform's vendor-defined Thermal Policy Engine.

They are intended to support:

  • Managing application behaviour
  • Logging and reporting field telemetry
  • Supporting regulatory and UX requirements

The HAL emits state change events containing these states, abstracting away platform-specific thresholds and control logic.

State Enum

enum State {
    /**
     * @brief Normal thermal conditions.
     */
    NORMAL = 0,

    /**
     * @brief Temperature has exceeded a critical threshold.
     *        Platform will be in active mitigation if possible.
     */
    CRITICAL_TEMPERATURE_EXCEEDED = 1,

    /**
     * @brief Temperature has recovered from a critical event.
     *        Platform will return to normal state.
     */
    CRITICAL_TEMPERATURE_RECOVERED = 2,

    /**
     * @brief Shutdown is imminent due to critical thermal breach.
     */
    CRITICAL_SHUTDOWN_IMMINENT = 3
}

Thermal State Machine

getCurrentThermalState() returns one of the following values:

| State | Description |
| --- | --- |
| NORMAL | No mitigation active; platform within safe thermal limits. |
| CRITICAL_TEMPERATURE_EXCEEDED | Platform has entered a critical temperature condition; vendor-defined mitigation is active where the platform supports it. |
| CRITICAL_TEMPERATURE_RECOVERED | Platform has recovered from a critical temperature condition and is returning to the normal state. |
| CRITICAL_SHUTDOWN_IMMINENT | Platform is entering forced thermal shutdown; critical platform-level actions are imminent. |

Typical transition model:

NORMAL  →  CRITICAL_TEMPERATURE_EXCEEDED  →  CRITICAL_SHUTDOWN_IMMINENT
   ↑          ↓                    ↓
   └──────────┴──── CRITICAL_TEMPERATURE_RECOVERED (returns to NORMAL)
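
One way to read the typical transition model is as a lookup table of allowed next states. The sketch below is an interpretation, not part of the HAL: the helper names `is_typical_transition` and `first_atypical_step` are hypothetical, and treating CRITICAL_SHUTDOWN_IMMINENT as terminal is an assumption. The model is described as typical rather than exhaustive, so a vendor policy may legitimately produce other transitions (for example, going straight to shutdown); a check like this is better suited to flagging sequences for review in telemetry than to rejecting events.

```python
from enum import IntEnum
from typing import Dict, Optional, Sequence, Set, Tuple


class State(IntEnum):
    NORMAL = 0
    CRITICAL_TEMPERATURE_EXCEEDED = 1
    CRITICAL_TEMPERATURE_RECOVERED = 2
    CRITICAL_SHUTDOWN_IMMINENT = 3


# Edges taken from the "typical transition model" diagram.
# Assumption: nothing follows CRITICAL_SHUTDOWN_IMMINENT before power-off.
TYPICAL_TRANSITIONS: Dict[State, Set[State]] = {
    State.NORMAL: {State.CRITICAL_TEMPERATURE_EXCEEDED},
    State.CRITICAL_TEMPERATURE_EXCEEDED: {
        State.CRITICAL_TEMPERATURE_RECOVERED,
        State.CRITICAL_SHUTDOWN_IMMINENT,
    },
    State.CRITICAL_TEMPERATURE_RECOVERED: {State.NORMAL},
    State.CRITICAL_SHUTDOWN_IMMINENT: set(),
}


def is_typical_transition(current: State, new: State) -> bool:
    """True if current -> new is one of the edges in the typical model."""
    return new in TYPICAL_TRANSITIONS[current]


def first_atypical_step(states: Sequence[State]) -> Optional[Tuple[State, State]]:
    """Return the first (previous, next) pair not in the typical model."""
    for previous, new in zip(states, states[1:]):
        if not is_typical_transition(previous, new):
            return (previous, new)
    return None
```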

Interaction Flow Examples

Normal Operation

sequenceDiagram
    participant Sensors
    participant Policy
    participant HAL
    participant MW
    participant App

    Sensors->>Policy: Report current temperatures
    Policy->>HAL: Detect critical threshold breach, activate mitigation
    HAL->>MW: Emit onThermalStateChange(state=CRITICAL_TEMPERATURE_EXCEEDED)
    MW->>Telemetry: Log event
    HAL->>MW: Emit onThermalStateChange(state=CRITICAL_TEMPERATURE_RECOVERED)
    MW->>App: Resume normal behaviour

Critical Shutdown Path

sequenceDiagram
    participant Sensors
    participant Policy
    participant HAL
    participant MW
    participant App

    Sensors->>Policy: Detect catastrophic thermal violation
    Policy->>HAL: Emit CRITICAL_SHUTDOWN_IMMINENT
    HAL->>MW: Emit onThermalStateChange(state=CRITICAL_SHUTDOWN_IMMINENT)
    MW->>App: Display warning, stop activity
    MW->>Telemetry: Report shutdown event
    HAL->>SystemControl: Initiate hardware shutdown
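
The middleware side of the two flows above could look roughly like the following. This is a sketch under assumptions: `MiddlewareThermalListener` and its two lists are hypothetical stand-ins for the real telemetry and application notification paths, the action strings are invented labels, and logging every event (rather than only the ones shown in the diagrams) is a choice made here for simplicity. The handler only observes and reacts; it never touches thresholds or cooling, in line with the design principles.

```python
from enum import IntEnum
from typing import List


class State(IntEnum):
    NORMAL = 0
    CRITICAL_TEMPERATURE_EXCEEDED = 1
    CRITICAL_TEMPERATURE_RECOVERED = 2
    CRITICAL_SHUTDOWN_IMMINENT = 3


class MiddlewareThermalListener:
    """Illustrative middleware reaction to thermal state change events."""

    def __init__(self) -> None:
        self.telemetry: List[str] = []          # stand-in for a telemetry sink
        self.app_notifications: List[str] = []  # stand-in for app messaging

    def on_thermal_state_change(self, state: State, vendor_info: str = "") -> None:
        # Assumption: every event is logged, including vendorInfo if present.
        self.telemetry.append(
            f"thermal_state={state.name} vendor_info={vendor_info!r}"
        )
        if state is State.CRITICAL_SHUTDOWN_IMMINENT:
            # Critical shutdown path: warn the user and stop activity.
            self.app_notifications.append("display_warning_and_stop_activity")
        elif state is State.CRITICAL_TEMPERATURE_RECOVERED:
            # Recovery: applications may resume normal behaviour.
            self.app_notifications.append("resume_normal_behaviour")
        # CRITICAL_TEMPERATURE_EXCEEDED and NORMAL are only logged here.
```

A short usage example: feeding `CRITICAL_TEMPERATURE_EXCEEDED` followed by `CRITICAL_TEMPERATURE_RECOVERED` produces two telemetry entries and a single `resume_normal_behaviour` notification.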

Platform Policy Metadata

Vendors may define platform-specific thermal policy hints in the HAL Feature Profile (HFP) for reference and telemetry use.

sensor:
  thermal:
    - id: "soc_die"
      sensorName: "SoC Die" # Used in com.rdk.hal.sensor.thermal/TemperatureReading.aidl
      location: "CPU"       # Used in com.rdk.hal.sensor.thermal/TemperatureReading.aidl

      # -------------------------------------------------------------------------
      # SENSOR CHARACTERISTICS
      # -------------------------------------------------------------------------
      # sensor_reading_range_celsius:
      #   Absolute measurable range of the physical sensor. These limits are
      #   hardware-defined and represent what the ADC/IC can report, not what is
      #   considered safe for operation.
      sensor_reading_range_celsius:
        min: -40
        max: 125

      # operational_temperature_celsius:
      #   Normal safe operating envelope for the platform under steady-state load.
      #   Middleware should regard this as the “normal zone”.
      #   Anything above the max here enters mitigation or alarm territory.
      operational_temperature_celsius:
        min: -20
        max: 95

      # -------------------------------------------------------------------------
      # POLICY TRIGGER POINTS
      # -------------------------------------------------------------------------
      # The trigger values define how the vendor policy transitions between
      # State values. They MUST satisfy:
      #   recovered  <  exceeded  <  shutdown
      #
      # • critical_temperature_recovered_celsius :
      #       Threshold below which CRITICAL_TEMPERATURE_RECOVERED is emitted
      #       and the system returns to NORMAL state.
      # • critical_temperature_exceeded_celsius :
      #       Point at which CRITICAL_TEMPERATURE_EXCEEDED is emitted and
      #       mitigation (if supported) becomes active.  This is usually a few
      #       degrees ABOVE operational max to allow early warning.
      # • entering_critical_shutdown_celsius :
      #       Hard limit at which CRITICAL_SHUTDOWN_IMMINENT is emitted and
      #       hardware shutdown is initiated.
      triggers:
        critical_temperature_recovered_celsius: 90
        critical_temperature_exceeded_celsius: 98
        entering_critical_shutdown_celsius: 115

      policy:
        shutdown_min_downtime_s: 900
        recovery:
          strategy: TIME_BASED
          min_cooldown_seconds: 240

      vendor:
        vendorCode: 1001  # Used in com.rdk.hal.sensor.thermal/TemperatureReading.aidl
        vendorInfo: "Primary die sensor used for critical trip points." # Used in com.rdk.hal.sensor.thermal/TemperatureReading.aidl

    - id: "board"
      sensorName: "Mainboard Ambient" # Used in com.rdk.hal.sensor.thermal/TemperatureReading.aidl
      location: "Board"           # Used in com.rdk.hal.sensor.thermal/TemperatureReading.aidl

      # Absolute measurable range of this sensor device.
      sensor_reading_range_celsius:
        min: -20
        max: 80

      # Expected steady-state board temperature range under normal conditions.
      operational_temperature_celsius:
        min: 0
        max: 65

      # Trigger thresholds for this domain.
      #   Recovered < Exceeded < Shutdown  (must hold true)
      triggers:
        critical_temperature_recovered_celsius: 60
        critical_temperature_exceeded_celsius: 70
        entering_critical_shutdown_celsius: 72   # ~10% below sensor max

      policy:
        shutdown_min_downtime_s: 600
        recovery:
          strategy: TIME_BASED
          min_cooldown_seconds: 180

      vendor:
        vendorCode: 1002  # Used in com.rdk.hal.sensor.thermal/TemperatureReading.aidl
        vendorInfo: "Used for general system thermal monitoring." # Used in com.rdk.hal.sensor.thermal/TemperatureReading.aidl
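
The ordering rule stated in the comments above (recovered < exceeded < shutdown) is easy to check mechanically. The sketch below is one possible check, not an official tool: `validate_triggers` is a hypothetical helper, the dictionaries are hand-copied from the two sensor entries rather than parsed from the file, and the additional check that trigger points lie inside `sensor_reading_range_celsius` is an assumption added here, not a rule stated in the profile.

```python
from typing import Any, Dict, List

# Values copied by hand from the "soc_die" and "board" entries above.
SOC_DIE: Dict[str, Any] = {
    "id": "soc_die",
    "sensor_reading_range_celsius": {"min": -40, "max": 125},
    "triggers": {
        "critical_temperature_recovered_celsius": 90,
        "critical_temperature_exceeded_celsius": 98,
        "entering_critical_shutdown_celsius": 115,
    },
}

BOARD: Dict[str, Any] = {
    "id": "board",
    "sensor_reading_range_celsius": {"min": -20, "max": 80},
    "triggers": {
        "critical_temperature_recovered_celsius": 60,
        "critical_temperature_exceeded_celsius": 70,
        "entering_critical_shutdown_celsius": 72,
    },
}


def validate_triggers(sensor: Dict[str, Any]) -> List[str]:
    """Return a list of problems found; an empty list means no problems."""
    triggers = sensor["triggers"]
    recovered = triggers["critical_temperature_recovered_celsius"]
    exceeded = triggers["critical_temperature_exceeded_celsius"]
    shutdown = triggers["entering_critical_shutdown_celsius"]
    reading_range = sensor["sensor_reading_range_celsius"]

    problems: List[str] = []
    # Rule stated in the profile comments.
    if not (recovered < exceeded < shutdown):
        problems.append("triggers must satisfy recovered < exceeded < shutdown")
    # Assumption (not stated in the profile): a trigger outside the sensor's
    # measurable range could never be observed, so it is flagged as well.
    if recovered < reading_range["min"] or shutdown > reading_range["max"]:
        problems.append("trigger points fall outside sensor_reading_range_celsius")
    return problems
```

Both example sensors pass these checks; swapping the recovered and exceeded values in either entry would be reported as an ordering problem.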