SPECIFY - WHEA error records - Corrected Machine Check Interrupt

Table of Contents

Specify has not caught the Machine Check Exceptions

Suppose you encounter a WHEA in specify where there are no specific MCE records (Instead it states CMCI_NOTIFY_TYPE_GUID in the Notify type - This stands for Corrected Machine Check Interrupt), this means you will have to go through the error packets and figure out what it means.

For context, CMCI - Corrected Machine Check Interrupt - Means that the WHEA error was encountered by Windows, but was corrected accordingly. This is why the severity of the crash is not fatal but “Warning”. /RTS-Extra-Docs/assets/img/WHEA_Analysis/WHEA_Error_Records.png

This however does not mean there isn’t an issue with the hardware. A WHEA error can still be a potential hardware problem after all.

Error Descriptors

To do this, we analyze what the error packet specifically says. The specify report has multiple error descriptors split up into sections (in accordance with the error packets). Each section of the packet refers to a section in the Error Descriptors.

Error Descriptor Sections

In this example, we have 4 sections A, B, C, D, corresponding to the 4 sections represented in the error packet: /RTS-Extra-Docs/assets/img/WHEA_Analysis/WHEA_Error_Descriptor.png

One way to tell one section from another is that all sections start with offset 0x00. This is true for all offsets present in the Error packet. Each error descriptor has their own details, mainly the section type. The section type depicts what GUID the error packet belongs to so it can be identified accordingly.

You can see in Sections A, B and D of this particular packet, the section types refer to the following generic calls that gives us more information about the WHEA:

  • Section A - MEMORY_ERROR_SECTION_GUID - This tells us the WHEA is reportedly within the memory section of the device. In this case it is the CPU.
  • Section B - PROCESSOR_GENERIC_ERROR_SECTION_GUID - This is what tells us the CPU is at fault. Still, we are not fully aware whether if by memory, it means the memory controller or cache.
  • Section D - RECOVERY_INFO_SECTION_GUID - Contains the recovery info, since this was a corrected machine check, the details are present here on what steps Windows has taken to fix this.

WHEA_XPF_MCA_SECTION

You may notice there is one section here which contains a “UNKNOWN GUID”, section C. The GUID states 8a1e1d01-42f9-4557-9c33-565e5cc3f7e8, which Googling reveals it is a WHEA_XPF_MCA_SECTION. This is a machine check exception error section structure, meaning we can find more details regarding it. This is exactly what we need out of this

The section structure is as follows:

typedef struct _WHEA_XPF_MCA_SECTION {
  ULONG             VersionNumber;
  WHEA_CPU_VENDOR   CpuVendor;
  LARGE_INTEGER     Timestamp;
  ULONG             ProcessorNumber;
  MCG_STATUS        GlobalStatus;
  ULONGLONG         InstructionPointer;
  ULONG             BankNumber;
  MCI_STATUS        Status;
  ULONGLONG         Address;
  ULONGLONG         Misc;
  ULONG             ExtendedRegisterCount;
  ULONG             ApicId;
  union {
    ULONGLONG                   ExtendedRegisters[WHEA_XPF_MCA_EXTREG_MAX_COUNT];
    WHEA_AMD_EXTENDED_REGISTERS AMDExtendedRegisters;
  };
  MCG_CAP           GlobalCapability;
  XPF_RECOVERY_INFO RecoveryInfo;
  ULONG             ExBankCount;
  ULONG             BankNumberEx[WHEA_XPF_MCA_EXBANK_COUNT];
  MCI_STATUS        StatusEx[WHEA_XPF_MCA_EXBANK_COUNT];
  ULONGLONG         AddressEx[WHEA_XPF_MCA_EXBANK_COUNT];
  ULONGLONG         MiscEx[WHEA_XPF_MCA_EXBANK_COUNT];
} WHEA_XPF_MCA_SECTION, *PWHEA_XPF_MCA_SECTION;

With this, we can now translate the error packet. Remember, all error packets start from offset 0x00 in Specify:

03 00 00 00 02 00 00 00 05 7e 33 ba 71 4a db 01  - 0x00 - .........~3ºqJÛ.
0b 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  - 0x10 - ................
00 00 00 00 00 00 00 00 35 01 01 01 00 40 20 9c  - 0x20 - ........5....@ .
08 19 b3 a4 06 00 00 00 00 00 00 00 fe 0f 1a d0  - 0x30 - ..³¤........þ..Ð
0a 00 00 00 0b 00 00 00 00 00 00 00 b0 00 10 00  - 0x40 - ............°...
01 24 1f 1a 6e 00 00 00 fd 01 00 00 27 00 00 00  - 0x50 - .$..n...ý...'...
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  - 0x60 - ................
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  - 0x70 - ................
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  - 0x80 - ................
3b 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  - 0x90 - ;...............
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  - 0xA0 - ................
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  - 0xB0 - ................
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  - 0xC0 - ................
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  - 0xD0 - ................
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  - 0xE0 - ................
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  - 0xF0 - ................
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  - 0x100 - ................
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  - 0x110 - ................

Translation and analysis

The error packet is quite hard to read by itself. So let’s pretty it up a bit by splitting up bytes into what we know so we can translate it properly. We would need to know the size of the items present, and the variable types in the struct already tells us how many bytes each variable is already:

Variable Name Variable Type Variable Size (Bytes)
Version Number ULONG 4
CPU Vendor WHEA_CPU_VENDOR 4
Timestamp LARGE_INTEGER 8
Processor Number ULONG 4
Global Status MCG_STATUS 8
Instruction Pointer ULONGLONG 8
Bank Number ULONG 4
MCI Status MCI_STATUS 8

These are the ones that we care for the most. As a matter of fact, almost everything after MCI status can be ignored as we do not have internal documentation regarding how those work (Damn you manufacturers!). We can now split the error packet accordingly:

03 00 00 00             - Version No.
02 00 00 00             - CPU Vendor
05 7e 33 ba 71 4a db 01 - Timestamp
0b 00 00 00             - Processor No.
00 00 00 00 00 00 00 00 - MCG Status
00 00 00 00 00 00 00 00 - Instruction Pointer
00 00 00 00             - Bank Number
35 01 01 01 00 40 20 9c - MCI Status - This is the one we need!!!

We can ignore the rest here:
08 19 b3 a4 06 00 00 00 00 00 00 00 fe 0f 1a d0
0a 00 00 00 0b 00 00 00 00 00 00 00 b0 00 10 00
01 24 1f 1a 6e 00 00 00 fd 01 00 00 27 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
3b 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

Another thing to take note here is the fact that all of these values are little endian, so the bytes are “backwards”. In this case, 03 00 00 00 is read 00 00 00 03. 12 34 56 78 would be read 78 56 34 12, etc.

We can now translate the MCI status codes into the following:

  • CPU Vendor - AMD (01 is Intel, 02 is AMD)
  • Processor Number - 11 (Thread 11, Core 5)
  • MCI status - 01 35 - Memory Error - (After running wheaceerror.py - You can download the code here: https://github.com/Mr-KayZ/WHEA-CE-Error/blob/main/wheamceerror.py) /RTS-Extra-Docs/assets/img/WHEA_Analysis/wheamcerror_python.png

Considering all other surrounding errors, this could be a memory controller level error too. But this is enough to tell us all the details for now. We can assume this most likely is due to the Ryzen Voltage bug, but also keep an eye out on which core/thread this is always occurring on. If its the same core/thread, you may also potentially be dealing with a fried core as well.