This Xilinx Answer discusses the following topics:
Error Management Hardware
Zynq MPSoC has a dedicated error handler to aggregate all of the fatal errors across the SoC and handle them. Refer to the TRM/Architecture Spec for details.
To summarize, each fatal error routed to the Error Manager can either be handled in hardware (triggering an SRST or PoR) or raise an interrupt to the PMU.
Error Management in PMU Firmware
The PMU Firmware (PMUFW) provides APIs to register custom error handlers or assign a default SRST/PoR action in response to an Error. There is a specific module (xpfw_mod_em.c) already provided in the PMUFW and it is enabled by default.
All error handling code should reside in this module, which already includes a couple of examples for handling WDT errors.
Actions for each error can be set up using the XPfw_EmSetAction API, which takes an ErrorId, an ActionId, and an ErrorHandler parameter.
The following actions are supported for the parameter ActionId:
|EM_ACTION_POR||Trigger a Power-On-Reset|
|EM_ACTION_SRST||Trigger a System Reset|
|EM_ACTION_CUSTOM||Call the custom handler registered as ErrorHandler parameter|
Below is a list of Error IDs for the ErrorId parameter:
|EM_ERR_ID_CSU_ROM||Errors logged by CSU Boot ROM (CBR)|
|EM_ERR_ID_PMU_PB||Errors logged by PMU Boot ROM (PBR) in the pre-boot stage|
|EM_ERR_ID_PMU_SERVICE||Errors logged by PBR in service mode|
|EM_ERR_ID_PMU_FW||Errors logged by PMUFW|
|EM_ERR_ID_PMU_UC||Un-Correctable Errors logged by PMU HW. This includes PMU ROM validation Error, PMU TMR Error, uncorrectable PMU RAM ECC Error, and PMU Local Register Address Error.|
|EM_ERR_ID_CSU||CSU Hardware related Errors|
|EM_ERR_ID_PLL_LOCK||Errors set when a PLL loses lock (these should be enabled only after the PLL has locked)|
|EM_ERR_ID_PL||PL Generic Errors passed to PS|
|EM_ERR_ID_TO||All Time-out Errors [FPS_TO, LPS_TO]|
|EM_ERR_ID_AUX3||Auxiliary Error 3|
|EM_ERR_ID_AUX2||Auxiliary Error 2|
|EM_ERR_ID_AUX1||Auxiliary Error 1|
|EM_ERR_ID_AUX0||Auxiliary Error 0|
|EM_ERR_ID_DFT||Error associated with the unexpected enablement of DFT features|
|EM_ERR_ID_CLK_MON||Clock Monitor Error|
|EM_ERR_ID_XMPU||XMPU Errors [LPS XMPU, FPS XMPU]|
|EM_ERR_ID_PWR_SUPPLY||Supply Detection Failure Errors|
|EM_ERR_ID_FPD_SWDT||FPD System Watch-Dog Timer Error|
|EM_ERR_ID_LPD_SWDT||LPD System Watch-Dog Timer Error|
|EM_ERR_ID_RPU_CCF||Asserted if any of the RPU CCF errors are generated|
|EM_ERR_ID_RPU_LS||Asserted if any of the RPU lock-step errors are generated|
|EM_ERR_ID_FPD_TEMP||FPD Temperature Shutdown Alert|
|EM_ERR_ID_LPD_TEMP||LPD Temperature Shutdown Alert|
|EM_ERR_ID_RPU1||RPU1 Error including both Correctable and Uncorrectable Errors|
|EM_ERR_ID_RPU0||RPU0 Error including both Correctable and Uncorrectable Errors|
|EM_ERR_ID_OCM_ECC||OCM Uncorrectable ECC Error|
|EM_ERR_ID_DDR_ECC||DDR Uncorrectable ECC Error|
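To illustrate how the three parameters fit together, here is a minimal host-side sketch; it is not PMUFW source. Only the function name, the parameter names, and the EM_* constant names come from the text above. The u8/s32 typedefs, the XPfw_ErrorHandler_t type, the numeric values of the constants, the ActionTable storage, and the validation rules are assumptions made for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Host-side stand-ins for PMUFW types (assumed, not the real headers). */
typedef uint8_t u8;
typedef int32_t s32;
typedef void (*XPfw_ErrorHandler_t)(u8 ErrorId);

/* Placeholder numeric values; the real ones come from the PMUFW headers. */
enum { EM_ACTION_POR = 1, EM_ACTION_SRST = 2, EM_ACTION_CUSTOM = 3 };
enum { EM_ERR_ID_OCM_ECC = 1, EM_ERR_ID_DDR_ECC = 2, DEMO_ERR_ID_MAX = 3 };

/* One entry per error ID: which action to take, and an optional handler. */
struct ErrorAction {
    u8 ActionId;
    XPfw_ErrorHandler_t Handler;
};

static struct ErrorAction ActionTable[DEMO_ERR_ID_MAX];

/* Stand-in with the same name as the PMUFW API, for illustration only. */
s32 XPfw_EmSetAction(u8 ErrorId, u8 ActionId, XPfw_ErrorHandler_t ErrorHandler)
{
    /* Reject error IDs outside the demo table. */
    if (ErrorId == 0U || ErrorId >= DEMO_ERR_ID_MAX) {
        return -1;
    }
    /* Only the three actions listed in the table above are accepted. */
    if (ActionId != EM_ACTION_POR && ActionId != EM_ACTION_SRST &&
        ActionId != EM_ACTION_CUSTOM) {
        return -1;
    }
    /* A custom action is meaningless without a handler to call. */
    if (ActionId == EM_ACTION_CUSTOM && ErrorHandler == NULL) {
        return -1;
    }
    ActionTable[ErrorId].ActionId = ActionId;
    ActionTable[ErrorId].Handler = ErrorHandler;
    return 0;
}
```

In this sketch a call such as XPfw_EmSetAction(EM_ERR_ID_DDR_ECC, EM_ACTION_SRST, NULL) returns 0 and records the action, while a custom action without a handler is rejected.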
Example for Error Management (Custom Handler)
In the example below, an OCM uncorrectable error (EM_ERR_ID_OCM_ECC) is considered.
A custom handler is registered for this error in the PMUFW. In a more realistic case the handler might reload the corrupted memory, but in this example it only clears the error and prints a message.
Adding an Error Handler in the PMUFW:
Diff for xpfw_mod_em.c
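As a rough host-side illustration of what such a change could look like (a sketch under stated assumptions, not the actual diff), the code below registers a custom handler that prints a message and clears an error flag. The handler name OcmErrHandler, the MockOcmErrStatus variable standing in for the OCM error status register, the MockRaiseError helper, and the numeric constant values are all hypothetical. XPfw_EmSetAction here is a stand-in with an assumed signature, and printf takes the place of the firmware's own print routine.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Host-side stand-ins for PMUFW types (assumed, not the real headers). */
typedef uint8_t u8;
typedef int32_t s32;
typedef uint32_t u32;
typedef void (*XPfw_ErrorHandler_t)(u8 ErrorId);

enum { EM_ACTION_CUSTOM = 3 };  /* placeholder value */
enum { EM_ERR_ID_OCM_ECC = 1 }; /* placeholder value */

/* Hypothetical stand-in for the OCM error status register. */
static u32 MockOcmErrStatus;

/* Minimal stand-in for the Error Manager: remembers one registration. */
static u8 RegisteredErrorId;
static XPfw_ErrorHandler_t RegisteredHandler;

s32 XPfw_EmSetAction(u8 ErrorId, u8 ActionId, XPfw_ErrorHandler_t ErrorHandler)
{
    if (ActionId != EM_ACTION_CUSTOM || ErrorHandler == NULL) {
        return -1;
    }
    RegisteredErrorId = ErrorId;
    RegisteredHandler = ErrorHandler;
    return 0;
}

/* Custom handler: report the error, then clear the (mock) status. */
void OcmErrHandler(u8 ErrorId)
{
    printf("EM: OCM ECC error detected (ErrorId %u)\n", (unsigned)ErrorId);
    MockOcmErrStatus = 0U;
}

/* The registration call that module initialization would make. */
s32 RegisterOcmHandler(void)
{
    return XPfw_EmSetAction(EM_ERR_ID_OCM_ECC, EM_ACTION_CUSTOM, OcmErrHandler);
}

/* Simulates the Error Manager interrupt reaching the firmware. */
void MockRaiseError(u8 ErrorId)
{
    if (RegisteredHandler != NULL && ErrorId == RegisteredErrorId) {
        RegisteredHandler(ErrorId);
    }
}
```

Calling RegisterOcmHandler() and then MockRaiseError(EM_ERR_ID_OCM_ECC) runs the handler once. On real hardware the handler would clear the error in the OCM registers rather than in a variable.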
Execute from an R5/A53 target on the XSDB:
The above code is in Tcl for debugging. The same code can easily be ported to C source by replacing mwr/mrd with Xil_Out32/Xil_In32.
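The mwr/mrd to Xil_Out32/Xil_In32 mapping can be sketched as follows. To keep the example runnable on a host, Xil_Out32 and Xil_In32 are replaced by stand-ins backed by a small array. MOCK_REG_BASE, MockRegs, and WriteThenReadBack are hypothetical names, and no real OCM register addresses are used. On an R5/A53 bare-metal target the stand-ins would be dropped in favor of the definitions from xil_io.h.

```c
#include <stdint.h>

typedef uint32_t u32;
typedef uintptr_t UINTPTR;

/* Hypothetical register window backed by an array (host-side only). */
#define MOCK_REG_BASE  0x1000U
#define MOCK_REG_COUNT 4U
static u32 MockRegs[MOCK_REG_COUNT];

/* Map an address in the mock window to its backing array slot. */
static u32 *MockRegPtr(UINTPTR Addr)
{
    return &MockRegs[((Addr - MOCK_REG_BASE) / 4U) % MOCK_REG_COUNT];
}

/* Stand-ins for the xil_io.h accessors. */
void Xil_Out32(UINTPTR Addr, u32 Value)
{
    *MockRegPtr(Addr) = Value;
}

u32 Xil_In32(UINTPTR Addr)
{
    return *MockRegPtr(Addr);
}

/*
 * Tcl:  mwr <addr> <value>   becomes   Xil_Out32(addr, value);
 * Tcl:  mrd <addr>           becomes   value = Xil_In32(addr);
 */
u32 WriteThenReadBack(UINTPTR Addr, u32 Value)
{
    Xil_Out32(Addr, Value);
    return Xil_In32(Addr);
}
```

Each mwr line in the script becomes one Xil_Out32 call, and each mrd line becomes one Xil_In32 call whose return value can be printed or checked.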
Example for Error Management (PoR as a response to Error)
Some errors are severe enough that recovery is not feasible without resetting the entire system.
In such cases, PoR or SRST can be used as the action. In this example, PoR is used as the response to the OCM ECC double-bit error.
Here is the code that adds the PoR as an action:
Diff for xpfw_mod_em.c
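For comparison with the custom-handler case, a registration that selects PoR might look like the host-side sketch below. XPfw_EmSetAction is again a stand-in with an assumed signature. The numeric constant values, the Last* variables, and the SetOcmEccToPor helper are hypothetical, and passing NULL as the handler for a hardware action is an assumption of this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Host-side stand-ins for PMUFW types (assumed, not the real headers). */
typedef uint8_t u8;
typedef int32_t s32;
typedef void (*XPfw_ErrorHandler_t)(u8 ErrorId);

enum { EM_ACTION_POR = 1 };     /* placeholder value */
enum { EM_ERR_ID_OCM_ECC = 1 }; /* placeholder value */

/* Records the most recent registration so it can be inspected. */
static u8 LastErrorId;
static u8 LastActionId;
static XPfw_ErrorHandler_t LastHandler;

/* Stand-in with the same name as the PMUFW API, for illustration only. */
s32 XPfw_EmSetAction(u8 ErrorId, u8 ActionId, XPfw_ErrorHandler_t ErrorHandler)
{
    LastErrorId = ErrorId;
    LastActionId = ActionId;
    LastHandler = ErrorHandler;
    return 0;
}

/* PoR is carried out in hardware, so no handler is supplied here. */
s32 SetOcmEccToPor(void)
{
    return XPfw_EmSetAction(EM_ERR_ID_OCM_ECC, EM_ACTION_POR, NULL);
}
```

The only difference from the custom-handler registration is the ActionId and the absence of a handler; replacing EM_ACTION_POR with EM_ACTION_SRST would select a system reset instead.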
The Tcl script used to inject the OCM ECC error is the same as in the example above.
Once the error is triggered, a PoR occurs and all processors are held in reset, just as they would be after a fresh power-on.
PMU RAM is also cleared during a PoR, so the PMUFW needs to be reloaded.