IBM Telum, the new processor with a frequency above 5 GHz dedicated to AI


IBM Telum

At this year's Hot Chips 33 conference, IBM unveiled its work on the IBM Telum processor that powers the next generation of Z-series mainframe computers, which are the backbone of the world's technology infrastructure. The IBM Z mainframe line uses custom z/Architecture-based processors.

According to IBM, the Telum processor is manufactured with Samsung's 7nm process, a big step up from the 14nm process used for the z15 chip. The chip has 22.5 billion transistors on a massive 530 mm² die. It also runs at a clock speed of over 5 GHz and has 8 CPU cores with SMT2 functionality, allowing for 16 logical cores. Moreover, thanks to the modular design, up to four Telum chips can be combined, for a maximum of 32 cores and 64 threads.

Photo Credits: IBM

The 8 cores are paired with 256MB of "semi-private" cache, divided into eight 32MB L2 caches (320 GB/s). These structures are then combined to create a shared virtual L3 cache with a capacity of 256MB that connects all the cores.
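To make the arithmetic concrete, here is a minimal sketch in Python that works through the figures quoted above; the constant names are illustrative, not IBM terminology:

```python
# Cache and core arithmetic for a single Telum chip, using the figures
# quoted above. Names are illustrative, not IBM terminology.

CORES_PER_CHIP = 8
L2_SLICE_MB = 32            # each core's "semi-private" L2 cache
SMT_THREADS_PER_CORE = 2    # SMT2

# The eight L2 slices combine into one shared virtual L3.
virtual_l3_mb = CORES_PER_CHIP * L2_SLICE_MB
assert virtual_l3_mb == 256

# Four chips combined via the modular design, as described above.
chips = 4
print(f"{chips} chips -> {chips * CORES_PER_CHIP} cores, "
      f"{chips * CORES_PER_CHIP * SMT_THREADS_PER_CORE} threads")
# 4 chips -> 32 cores, 64 threads
```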

The AI accelerator is tightly interconnected with the CPU: the CPU has a specific set of instructions that invoke the accelerator, which delivers a processing throughput of over 6 TFLOPS.
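As a rough illustration of what that throughput buys for inference-heavy workloads, here is a back-of-the-envelope sketch; the per-inference cost is a hypothetical assumption, not an IBM figure:

```python
# Back-of-the-envelope inference budget for the on-chip AI accelerator.
# The 6 TFLOPS figure is IBM's; the model cost below is hypothetical.

ACCEL_TFLOPS = 6.0           # quoted per-chip accelerator throughput
FLOPS_PER_INFERENCE = 10e6   # hypothetical 10-MFLOP fraud-scoring model

peak_inferences_per_s = (ACCEL_TFLOPS * 1e12) / FLOPS_PER_INFERENCE
print(f"~{peak_inferences_per_s:,.0f} inferences/s per chip (peak)")
# ~600,000 inferences/s per chip, ignoring memory and dispatch overhead
```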

IBM will make this processor available both standalone and as a 16-core dual-chip module formed by two 8-core Telum dies, with a total of 512MB of virtual L3 cache.

The most interesting part of the system is perhaps the aforementioned AI accelerator, which shows that AI workloads have finally reached a point where even mainframe computers need dedicated acceleration. In business-critical fraud detection workloads, classic algorithmic detection is often no longer enough: AI-powered software is becoming increasingly necessary, and IBM sees the need to let that software run much faster. For more information, you can take a look at the IBM Z website.





IBM Re-Architects The Mainframe With New Telum Processor

The new IBM Z Telum processor can scale up to 32 chips and 256 CPU cores

IBM

Similar to what it did with the new Power10 processors for cloud systems, IBM started from scratch in designing a new processor for its IBM Z mainframe. The IBM Z has a long history and is unique in that it still uses processors specially designed for enterprise security, reliability, scalability, and performance. The new Telum processor for the next-generation IBM Z enhances all of these aspects and adds embedded acceleration, something most systems accomplish through discrete accelerators. IBM introduced the new Telum processor at the annual Hot Chips technology conference this morning.

The Telum processor

IBM Z Telum processor die photo

IBM

A key to the design of the Telum processor was to put everything on one die for performance and efficiency. The Telum processor features 8 CPU cores, on-chip workload accelerators, and 32MB per core of what IBM calls "semi-private" cache. Each chip module will feature two closely coupled Telum dies for a total of 16 cores per socket. Just to indicate how different this architecture is, the prior z15 processor featured twelve cores and a separate chip for a shared cache. The Telum processor will also be manufactured on Samsung's 7nm process, as opposed to the 14nm process used for the z15 processor.


Besides the processing cores themselves, the most significant change is in the cache structure. Each CPU core has a dedicated L1 cache and 32MB of "semi-private" low-latency L2 cache. The caches are "semi-private" because the L2 caches are used together to build a shared virtual 256MB L3 cache between the cores on the chip. The L2 caches communicate over a bi-directional ring bus capable of over 320 GB/s of bandwidth with an average latency of just 12ns. The L2 caches are also used to build a virtual shared L4 cache between all the chips in a drawer. There are up to four sockets per drawer, and two processors per socket, for a total of up to eight chips and 64 CPU cores with 2GB of shared L4 cache per drawer. That can then be scaled up to four drawers in a rack for up to thirty-two chips and 256 CPU cores.
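A minimal Python sketch of that scaling arithmetic, using only the figures in the paragraph above (identifiers are illustrative, not IBM terminology):

```python
# Scaling arithmetic for the Telum cache hierarchy, using the figures
# in the text; identifiers are illustrative, not IBM terminology.

CORES_PER_CHIP = 8
VIRTUAL_L3_MB_PER_CHIP = 256   # eight 32 MB L2 slices per chip
CHIPS_PER_SOCKET = 2           # dual-chip module
SOCKETS_PER_DRAWER = 4
DRAWERS_PER_RACK = 4

chips_per_drawer = SOCKETS_PER_DRAWER * CHIPS_PER_SOCKET           # 8
cores_per_drawer = chips_per_drawer * CORES_PER_CHIP               # 64
virtual_l4_gb = chips_per_drawer * VIRTUAL_L3_MB_PER_CHIP / 1024   # 2 GB

chips_per_rack = DRAWERS_PER_RACK * chips_per_drawer               # 32
cores_per_rack = chips_per_rack * CORES_PER_CHIP                   # 256

print(f"per drawer: {chips_per_drawer} chips, {cores_per_drawer} cores, "
      f"{virtual_l4_gb:.0f} GB virtual L4")
print(f"per rack:   {chips_per_rack} chips, {cores_per_rack} cores")
```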

The IBM Z Telum processor dual-die chip module and four-chip drawer configuration

IBM

The cache architecture was matched with improvements in the CPU cores and accelerators. The Telum CPU cores are an out-of-order design with SMT2 (two-way simultaneous multithreading) that can operate at or above a 5GHz base frequency. The CPU cores also feature, among other things, enhanced branch prediction for large-footprint and diverse enterprise workloads. The Telum processor also features encrypted memory and improvements to the trusted execution environment for enhanced security, plus dedicated on-chip accelerators for sorting, compression, cryptography, and artificial intelligence (AI) that scale with the workload.

Why is Acceleration so important?

One of the key dynamics of the electronics industry today is accelerated computing. Everything from smartphones to cloud servers is using custom or programmable processing blocks to perform tasks more efficiently than general-purpose CPUs. This is occurring for two reasons. The first is that as certain tasks mature, it becomes more efficient to perform them in dedicated hardware than in software. And even for tasks that remain on a programmable processing engine, there are many engines, such as DSPs, GPUs, NPUs, and FPGAs, that can perform certain tasks more efficiently than CPUs due to the nature of the workload and/or the design of the processing cores.


The second reason for the rise in accelerators is the slowing of Moore's Law. As it becomes more difficult to improve CPU performance and efficiency through semiconductor manufacturing technology alone, the industry is shifting toward heterogeneous architectural improvements. By designing more efficient processing cores, whether dedicated to a specific function or optimized around a specific type of workload or execution, significantly better performance and efficiency can be achieved in the same or a similar amount of space. As a result, the direction going forward is accelerated computing. Even innovative technologies like quantum and neuromorphic computing, two areas where IBM Research is leading the industry, are really forms of accelerated computing that will enhance traditional computing platforms.


AI is one of the most common workloads being accelerated, and there is a wide variety of processors and accelerators under development for both AI training and inference processing. The benefits of each depend on how efficiently the accelerator processes particular workloads. For servers, most AI accelerators are discrete chips. While this offers more silicon area for higher peak performance, it also increases cost, power consumption, latency, and variability in performing AI tasks. IBM's approach of putting the AI accelerator on the chip, interfacing it directly with the CPU cores, and sharing memory allows for secure real-time or near-real-time processing of AI models while increasing overall system efficiency. And because the processor is aimed at enterprise-class workloads, as opposed to large research workloads like scientific or financial modeling, the demands are likely to be spread across multiple AI models with low-latency requirements. The AI accelerator was designed for business workloads like fraud detection, as well as system and infrastructure management tasks like workload placement, database query planning, and anomaly detection.
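To illustrate why the attach point matters for these low-latency models, here is a toy latency model; every number in it is hypothetical, chosen only to show the shape of the tradeoff between on-chip and discrete accelerators:

```python
# Toy latency model contrasting an on-chip accelerator with a discrete
# (e.g., PCIe-attached) one. ALL numbers here are hypothetical, chosen
# only to illustrate why attach-point latency dominates small inferences.

def inference_latency_us(attach_overhead_us: float, compute_us: float) -> float:
    """Total time to score one transaction: dispatch/transfer + compute."""
    return attach_overhead_us + compute_us

compute_us = 50.0   # hypothetical per-inference compute time
on_chip_us = inference_latency_us(attach_overhead_us=1.0, compute_us=compute_us)
discrete_us = inference_latency_us(attach_overhead_us=200.0, compute_us=compute_us)

print(f"on-chip:  {on_chip_us:.0f} us per transaction")
print(f"discrete: {discrete_us:.0f} us per transaction")
# For millisecond-budget fraud checks inside a transaction path, the
# round-trip to a discrete card can dwarf the compute itself.
```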


The AI accelerator features a matrix array with 128 processing tiles designed for 8-way FP16 SIMD operations and an activation array with thirty-two tiles designed for 8-way FP16/FP32 SIMD operations. The reason for the two arrays is to divide the work between more straightforward matrix multiplication and convolution functions on one side and more complex functions like sigmoid or softmax on the other, while optimizing the execution of each. The two arrays are connected through an Intelligent Data Mover and Formatter capable of 600 GB/s of bandwidth internally, and have programmable prefetchers and write-back engines connected to the on-chip caches with more than 120 GB/s of bandwidth. According to IBM, the AI accelerator multiplexes AI workloads from the various CPU cores, has an aggregate performance of over 6 TFLOPS per chip, and is anticipated to exceed 200 TFLOPS for a fully populated rack. The AI accelerator also works with the AI tools designed for other IBM platforms, from the IBM Deep Learning Compiler for porting and optimizing trained models to the Snap ML model libraries.
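A quick cross-check of those figures in Python; counting each SIMD lane as one fused multiply-add (2 FLOPs per cycle) is our assumption, not an IBM statement:

```python
# Cross-checking the accelerator figures quoted above. Treating each
# 8-way FP16 SIMD lane as one fused multiply-add (2 FLOPs/cycle) is an
# assumption on our part, not an IBM statement.

MATRIX_TILES = 128
SIMD_WIDTH = 8
FLOPS_PER_LANE_PER_CYCLE = 2     # assumed FMA

flops_per_cycle = MATRIX_TILES * SIMD_WIDTH * FLOPS_PER_LANE_PER_CYCLE
print(f"matrix array: {flops_per_cycle} FLOPs/cycle")    # 2048

# Rack-level aggregate from the per-chip floor figure and 32 chips/rack.
per_chip_tflops = 6.0            # "over 6 TFLOPS" per chip
chips_per_rack = 32
print(f"rack aggregate: >{per_chip_tflops * chips_per_rack:.0f} TFLOPS")
# >192 TFLOPS from the floor figures; IBM's own estimate is over 200.
```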

IBM Z Telum low-latency AI performance scales with the number of chips

IBM

The Result

According to IBM, the new cache structure has resulted in an estimated 40% increase in per-socket performance. This is impressive for a platform that has evolved into a scalable mainframe optimized across the entire stack, all the way down to the processor. Ironically, just as Moore's Law once drove the industry away from customized processors and systems, its slowing is now driving the industry back to customization in the era of accelerated computing. While IBM's revenue is driven by software and services, having expertise that spans everything from semiconductor manufacturing to custom chips and systems gives IBM a competitive advantage in this new accelerated world focused on workload optimization.