Accelerate PyTorch with IPEX and oneDNN using Intel BF16 Technology


Introduction

Authors: Jiong Gong at Intel, Vitaly Fedyunin at Facebook, Nikita Shustrov at Intel

Intel and Facebook previously collaborated to enable BF16 as a first-class data type in PyTorch. It supports basic math and tensor operations and adds CPU optimization with multi-threading, vectorization, and neural network kernels from the oneAPI Deep Neural Network Library (oneDNN, formerly known as MKL-DNN). The related work was published in an earlier blog during the launch of 3rd Gen Intel® Xeon® Scalable processors (formerly codenamed Cooper Lake). In that blog, we introduced the HW advancement for native BF16 support in Cooper Lake: BF16->FP32 fused multiply-add (FMA) Intel® Advanced Vector Extensions-512 (Intel® AVX-512) instructions that double the theoretical compute throughput over FP32 FMA. Based on that HW advancement and the SW optimizations from Intel and Facebook, we showcased a 1.40x-1.64x performance boost of PyTorch BF16 training over FP32 on DLRM, ResNet-50 and ResNext-101–32x4d, representative deep learning (DL) models for recommendation and computer vision tasks.

In this blog, we introduce two recent SW advancements added to Intel Extension for PyTorch (IPEX) on top of PyTorch and oneDNN for PyTorch BF16 CPU optimization:

1. An ease-of-use user-facing API with BF16 auto-mixed precision, which enables BF16 and oneDNN optimizations with minimal changes to user code (described in the API section below).

2. Graph fusion optimization to further boost BF16 inference performance with oneDNN fusion kernels. Graph-level model optimization becomes increasingly important for maximizing DL workload performance once individual compute-intensive ops are well optimized and as new varieties of DL topology patterns emerge. PyTorch has supported graph mode since the 1.0 release, and that support has recently matured for DL inference. oneDNN provides fusion kernels for common DL fusion patterns such as conv and matmul with element-wise post-ops. We added a graph-rewrite pass to IPEX based on the PyTorch graph intermediate representation. The optimization pass recognizes these oneDNN-supported fusion patterns and replaces the corresponding sub-graphs with calls to oneDNN fusion kernels, as illustrated in the sketch below.
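As a brief illustration of how the rewrite pass fits into the graph-mode workflow, the sketch below traces a conv2d+relu model with TorchScript. The module name intel_pytorch_extension is an assumption based on the IPEX 1.x generation this post describes; ipex.DEVICE is the device introduced later in this post.

```python
import torch
import torch.nn as nn

# Assumed IPEX 1.x module name; importing it registers the graph-rewrite pass.
import intel_pytorch_extension as ipex

model = nn.Sequential(nn.Conv2d(3, 64, kernel_size=3), nn.ReLU())
model = model.to(ipex.DEVICE).eval()
x = torch.randn(1, 3, 224, 224).to(ipex.DEVICE)

with torch.no_grad():
    # Graph mode: the rewrite pass runs over the TorchScript IR of the traced model.
    traced = torch.jit.trace(model, x)
    # The conv2d+relu sub-graph should now appear as a single fused oneDNN kernel call.
    print(traced.graph)
    y = traced(x)
```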

The rest of this blog is organized as follows. We first introduce IPEX and oneDNN as background. Then we introduce the ease-of-use IPEX API with examples and briefly explain the internal implementation that supports the API. After that, we show the performance results of PyTorch BF16 training and inference with IPEX and oneDNN on a few representative DL models. We conclude with remarks on next steps.

IPEX brings the following key features:

- Optimized op kernels for performance-critical deep neural network operations, registered on the “ipex.DEVICE” device
- Automatic conversion between blocked and strided tensor layouts for oneDNN-optimized ops
- BF16 auto-mixed precision
- Graph fusion optimization based on oneDNN fusion kernels

Intel collaborates with Facebook to continuously upstream most of the optimizations from IPEX to PyTorch proper¹, to better serve the PyTorch community with the latest Intel HW and SW advancements.

oneDNN is an open-source, cross-platform performance library of basic building blocks for deep learning applications. The library is optimized for Intel Architecture processors, Intel Processor Graphics and Xe architecture-based graphics, and supports computation in a variety of data types, including FP32, BF16 and INT8.

oneDNN is the performance library that IPEX relies on to optimize performance-critical deep neural network operations (convolution, matrix multiplication, pooling, batch normalization, non-linearities, etc.). oneDNN also provides fusion kernels as computation primitives for common DL patterns, such as convolution or matrix multiplication followed by a sequence of element-wise post-operations. IPEX uses these primitives to implement graph fusion optimization.

IPEX brings a “3-step” user-facing API to enable BF16 and oneDNN optimizations on CPU:

1. Import the IPEX Python module.
2. Convert the model (and, for inference, the input tensors) to the “ipex.DEVICE” device.
3. Enable BF16 auto-mixed precision.²

Below, we provide examples in pseudo Python code for PyTorch BF16 inference and training with the IPEX API. The three steps are marked with comments; they are the only changes to the original user code.
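The following is a minimal sketch of both flows. The module name intel_pytorch_extension and the enable_auto_mixed_precision(mixed_dtype=...) call are assumptions based on the IPEX 1.x API generation this post describes; check the IPEX release you use for the exact names.

```python
import torch
import torch.nn as nn

import intel_pytorch_extension as ipex          # Step 1: import the IPEX module

# Step 3: turn on BF16 auto-mixed precision (skip this line to run FP32; see footnote 2).
ipex.enable_auto_mixed_precision(mixed_dtype=torch.bfloat16)

model = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3), nn.ReLU(),
    nn.Flatten(), nn.Linear(64 * 30 * 30, 10),
)
model = model.to(ipex.DEVICE)                   # Step 2: move the model to ipex.DEVICE
x = torch.randn(8, 3, 32, 32).to(ipex.DEVICE)   # Step 2: move input tensors to ipex.DEVICE

# Inference: run the model as usual; IPEX handles layout and dtype conversions.
model.eval()
with torch.no_grad():
    logits = model(x)

# Training: the same three steps apply; the training loop itself is unchanged.
model.train()
target = torch.randint(0, 10, (8,)).to(ipex.DEVICE)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
optimizer.zero_grad()
loss = nn.functional.cross_entropy(model(x), target)
loss.backward()
optimizer.step()
```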

Under the hood, when the user imports the IPEX Python module, IPEX registers optimized op kernels on “ipex.DEVICE” via the “torch::RegisterOperators” PyTorch extension API and registers the graph fusion pass via the “torch::jit::RegisterPass” PyTorch extension API.

IPEX automatically converts input tensors to blocked layout for PyTorch ops optimized with oneDNN kernels, and converts back to strided layout on demand only when non-oneDNN ops are encountered. This maximizes oneDNN kernel efficiency without introducing unnecessary layout conversions. When the user turns on BF16 auto-mixed precision, IPEX automatically inserts type conversions between BF16 and FP32 according to the list of supported BF16 ops. IPEX applies graph fusion for patterns supported by oneDNN; examples include “conv2d+relu”, “conv2d+swish”, “conv2d+add+relu”, “linear+relu” and “linear+gelu”. These are common patterns in popular DL models such as DLRM, BERT and ResNet.
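For instance, a residual connection written in the usual way already contains the “conv2d+add+relu” pattern; the sketch below (same assumed module name as above) shows that no manual change is needed for the pass to find it after tracing:

```python
import torch
import torch.nn as nn

import intel_pytorch_extension as ipex  # assumed IPEX 1.x module name


class ResidualBlock(nn.Module):
    """Contains the conv2d+add+relu pattern that the fusion pass recognizes."""

    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(64, 64, kernel_size=3, padding=1)

    def forward(self, x):
        return torch.relu(self.conv(x) + x)  # conv2d + add + relu


model = ResidualBlock().to(ipex.DEVICE).eval()
x = torch.randn(1, 64, 56, 56).to(ipex.DEVICE)

with torch.no_grad():
    # After tracing, the pass can rewrite the whole pattern into one fused oneDNN
    # kernel, keeping intermediate tensors in blocked layout throughout.
    traced = torch.jit.trace(model, x)
    y = traced(x)
```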

We evaluated the performance boost of PyTorch BF16 training and inference with IPEX and oneDNN on DLRM, BERT-Large and ResNext-101–32x4d, covering three representative DL tasks: recommendation, NLP and CV, respectively. The speed-up is measured against FP32 on PyTorch proper and comes from the BF16 support, layout conversion optimization and graph fusion optimization in IPEX.

Table 1. Single-instance BF16 training performance gains over baseline (FP32 with Intel® Math Kernel Library for DLRM and BERT-Large, FP32 with Intel® oneDNN for ResNext-101–32x4d), measured on a single socket of the Intel® Xeon® Platinum 8380H processor with 28 cores. DLRM uses a 2K mini-batch size with the Criteo terabyte dataset, with hyper-parameters from the MLPerf configuration. BERT-Large uses a mini-batch size of 24 with the WikiText dataset. ResNext-101–32x4d uses a mini-batch size of 128 with the ILSVRC2012 dataset.

The speed-up ratio of ResNext-101–32x4d is larger than that of the other two models because it benefits more from the layout conversion and graph fusion optimizations, e.g. batch-normalization folding, conv fused with relu, and conv fused with add and relu.

In this blog, we introduced recent SW advancements in both user experience and performance for PyTorch CPU BF16 support using IPEX and oneDNN. With these advancements, we demonstrated the ease of use of the IPEX API and showcased a 1.55x-2.42x speed-up with IPEX BF16 training and a 1.40x-4.26x speed-up with IPEX BF16 inference over FP32 with PyTorch proper. Both IPEX and oneDNN are available as open-source projects: oneDNN is released as part of the oneAPI optimization libraries, and IPEX is released as part of the oneAPI-powered Intel AI Analytics Toolkit.

¹ Some pull requests have not been merged yet.
² The model runs with FP32 optimizations if this step is skipped; auto-layout conversion still applies for FP32.

Configuration details:

Hardware: Intel(R) Xeon(R) Platinum 8380H processor, 8 sockets, 28 cores per socket, HT on, turbo on, total memory 1536 GB (48 slots / 32 GB / 3200 MHz)

BIOS: WLYDCRB1.SYS.0017.P06.2008230904 (ucode: 0x700001e)

OS: Ubuntu 20.04.1 LTS, kernel 5.4.0–48-generic

GCC: 7.5.0

DLRM (Criteo terabyte dataset):

Training batch size (FP32/BF16): 2K/instance, 1 instance on a CPU socket

Inference batch size (FP32/BF16): 64/instance, 28 instances sharing model weights and running in a single process per socket, 224 instances on 8 CPU sockets in total

BERT-Large (WikiText dataset):

Training batch size (FP32/BF16): 24/instance, 1 instance on a CPU socket

Inference batch size (FP32/BF16): 1/instance, 1 instance per process, 224 instances on 8 CPU sockets in total

ResNext-101–32x4d (ILSVRC2012 dataset):

Training batch size (FP32/BF16): 128/instance, 1 instance on a CPU socket

Inference batch size (FP32/BF16): 1/instance, 1 instance per process, 224 instances on 8 CPU sockets in total

Tested by Intel as of 11/4/2020.

Notices & Disclaimers

Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.

Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates. See backup for configuration details. No product or component can be absolutely secure.

Your costs and results may vary.

Intel technologies may require enabled hardware, software or service activation.

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.
