AMD Instinct MI200: New Linux Patches Reveal the Amount of HBM2E Memory

AMD won't be talking about its Instinct MI200 compute GPU for some time yet, but Linux patches continue to reveal new features and capabilities of the upcoming product. Apparently, the GPU that will power the Frontier exascale supercomputer will have a memory subsystem that supports up to 128GB of HBM2E DRAM.

We already know that AMD's Instinct MI200 compute GPU, codenamed Aldebaran, will be based on the CDNA 2 architecture and will combine two dies in a single package linked by AMD's high-performance Infinity interconnect. Phoronix reports that one of AMD's latest Linux patches for its AMD64 EDAC driver reveals the memory architecture of the Instinct MI200. Apparently, each Aldebaran die has four unified memory controllers (UMCs). Each UMC supports eight channels, and each channel is connected to 2GB of second-generation High Bandwidth Memory (HBM2 or HBM2E).

Recall that an HBM2 stack has a 1024-bit interface, which is often loosely referred to as a single HBM2 channel. Internally, however, a stack consists of two, four, or eight DRAM dies on a base logic die, with two 128-bit channels per DRAM die. In total, an HBM2 stack exposes up to eight 128-bit channels across its 1024-bit interface.

It is not entirely clear what AMD means by a channel here, but it most likely refers to the eight 128-bit channels within a 1024-bit HBM2 stack. If so, each UMC drives one stack, and each Aldebaran die connects to four HBM2 stacks over a 4096-bit memory interface. With each channel addressing 2GB of memory, one die can address up to 64GB (4 UMCs × 8 channels × 2GB), and two dies can address up to 128GB. The actual bandwidth of Aldebaran's memory subsystem is unknown, but assuming AMD uses SK Hynix's latest 3.6Gb/s HBM2E stacks, the memory subsystem would provide the GPU with up to about 3.69TB/s of bandwidth (8192 bits × 3.6Gb/s ÷ 8).
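The capacity and bandwidth arithmetic can be checked with a few lines of Python. This is only a sketch: the die, UMC, channel, and 2GB-per-channel counts come from the patch description, while the 128-bit channel width and the 3.6Gb/s per-pin rate are assumptions rather than anything AMD has confirmed. The constant and function names are made up for illustration.

```python
# Figures reported from the EDAC patch description.
DIES = 2
UMCS_PER_DIE = 4
CHANNELS_PER_UMC = 8
GB_PER_CHANNEL = 2

# Assumptions, not confirmed by AMD: each "channel" is a 128-bit HBM2
# channel, and the stacks run at 3.6 Gb/s per pin.
CHANNEL_WIDTH_BITS = 128
PIN_RATE_GBPS = 3.6


def capacity_gb(dies=DIES):
    """Total addressable HBM capacity in GB for the given number of dies."""
    return dies * UMCS_PER_DIE * CHANNELS_PER_UMC * GB_PER_CHANNEL


def bus_width_bits(dies=DIES):
    """Aggregate memory interface width in bits."""
    return dies * UMCS_PER_DIE * CHANNELS_PER_UMC * CHANNEL_WIDTH_BITS


def peak_bandwidth_gbs(dies=DIES, pin_rate_gbps=PIN_RATE_GBPS):
    """Theoretical peak bandwidth in GB/s: width * per-pin rate / 8 bits per byte."""
    return bus_width_bits(dies) * pin_rate_gbps / 8


if __name__ == "__main__":
    print("capacity per die (GB):", capacity_gb(dies=1))
    print("capacity per package (GB):", capacity_gb())
    print("bus width per die (bits):", bus_width_bits(dies=1))
    print("peak bandwidth per package (GB/s):", round(peak_bandwidth_gbs(), 1))
```

The result is a raw theoretical peak. It does not account for ECC overhead or for whatever data rate AMD actually ships.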

Compute GPUs must use ECC memory, so some of the bandwidth and capacity is used for error correction. As a result, not all of the Instinct MI200's 128GB of memory will actually be available to applications.

Linux Prepares For AMD Servers With Aldebaran GPU Nodes Sporting HBM2

The latest public code patches on the mailing list today are preparing for newer AMD heterogeneous servers that will have Aldebaran GPU nodes connected via xGMI links to the CPU(s) and the GPU dies in turn having HBM2 memory.

These new heterogeneous AMD system details were revealed today as part of a set of patches prepping the AMD64 EDAC (Error Detection And Correction) kernel driver code for non-CPU nodes. The AMD64 EDAC driver has traditionally handled the detection and reporting of system DRAM ECC errors, and it is now being extended to GPU node memory that is accessible from the CPUs via the xGMI high-speed interconnect.
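For readers unfamiliar with EDAC, the kernel exposes per-memory-controller error counters through sysfs under `/sys/devices/system/edac/mc/`. The following Python sketch reads the corrected (`ce_count`) and uncorrected (`ue_count`) error counts for each controller. How the Aldebaran GPU nodes will appear in this tree once the patches land is an assumption here, not something the patches' description confirms, and the function name `read_edac_counters` is purely illustrative.

```python
from pathlib import Path

# Standard EDAC sysfs location for memory controllers on Linux.
EDAC_MC_ROOT = Path("/sys/devices/system/edac/mc")


def _read(path):
    """Return the stripped contents of a sysfs file, or None if unreadable."""
    try:
        return path.read_text().strip()
    except OSError:
        return None


def _to_int(text):
    """Convert a sysfs counter string to int, or None if missing or malformed."""
    try:
        return int(text)
    except (TypeError, ValueError):
        return None


def read_edac_counters(root=EDAC_MC_ROOT):
    """Return one dict per mcN directory found under root, ordered by N."""
    root = Path(root)
    if not root.is_dir():
        return []
    controllers = [
        p for p in root.iterdir()
        if p.is_dir() and p.name.startswith("mc") and p.name[2:].isdigit()
    ]
    controllers.sort(key=lambda p: int(p.name[2:]))
    return [
        {
            "controller": mc.name,
            "name": _read(mc / "mc_name"),
            "ce_count": _to_int(_read(mc / "ce_count")),  # corrected errors
            "ue_count": _to_int(_read(mc / "ue_count")),  # uncorrected errors
        }
        for mc in controllers
    ]


if __name__ == "__main__":
    for entry in read_edac_counters():
        print(entry)
```

On a machine without an EDAC driver loaded, the directory is absent and the function simply returns an empty list.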

The public patches note that there will be systems with GPU nodes connected via xGMI links and the GPU dies have HBM2 memory. The patches go on to confirm those nodes as being Aldebaran, the codename for a next-gen AMD CDNA GPU/accelerator that saw initial kernel driver support in Linux 5.13 and continues seeing more open-source driver work around it. Aldebaran is the apparent successor to MI100 'Arcturus' and thus presumably will debut as something along the lines of the AMD Instinct MI200.

These patches published a short time ago note that Aldebaran has two dies (further confirming Aldebaran as an MCM design), each with four unified memory controllers (UMCs). Each unified memory controller manages eight memory channels, each connected to 2GB of HBM2 (or HBM2E) memory.

The seven patches posted prepare the EDAC memory driver for the notion of connected non-CPU nodes, recognizing the HBM Gen2 memory type, address translation on Data Fabric version 3.5, and related plumbing. Getting this Linux support squared away in a timely manner is being driven by the dominance of Linux in the HPC space and especially by AMD's increasing supercomputer design wins. Most notably, Aldebaran, and in turn this Linux code, is likely what we will see in the upcoming Frontier exascale supercomputer, which has already been mentioned as having a coherent interconnect between the EPYC CPUs and Radeon Instinct GPUs.

Given the timing of these patches with the Linux 5.14 merge window already open, these amd64_edac additions will likely land for Linux 5.15 unless drawn out by an extended review process.