

# MVAPICH2-GDR: Pushing the Frontier of HPC and Deep Learning

Talk at Mellanox booth (SC '19)

by

Dhabaleswar K. (DK) Panda

The Ohio State University

E-mail: panda@cse.ohio-state.edu

http://www.cse.ohio-state.edu/~panda

### Outline

- Overview of the MVAPICH2 Project
- MVAPICH2-GPU with GPUDirect-RDMA (GDR)
- What's new with MVAPICH2-GDR
- High-Performance Deep Learning (HiDL) with MVAPICH2-GDR
- Conclusions

### **Overview of the MVAPICH2 Project**

- High Performance open-source MPI Library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)
  - MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.1), Started in 2001, First version available in 2002 (Supercomputing 2002)
  - MVAPICH2-X (MPI + PGAS), Available since 2011
  - Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), Available since 2014
  - Support for Virtualization (MVAPICH2-Virt), Available since 2015
  - Support for Energy-Awareness (MVAPICH2-EA), Available since 2015
  - Support for InfiniBand Network Analysis and Monitoring (OSU INAM) since 2015
  - Used by more than 3,050 organizations in 89 countries
  - More than 615,000 (> 0.6 million) downloads from the OSU site directly
  - Empowering many TOP500 clusters (Nov '19 ranking)
    - 3<sup>rd</sup>, 10,649,600-core (Sunway TaihuLight) at National Supercomputing Center in Wuxi, China
    - 5<sup>th</sup>, 448,448 cores (Frontera) at TACC
    - 8<sup>th</sup>, 391,680 cores (ABCI) in Japan
    - 14<sup>th</sup>, 570,020 cores (Nurion) in South Korea and many others
  - Available with software stacks of many vendors and Linux Distros (RedHat, SuSE, and OpenHPC)
  - <u>http://mvapich.cse.ohio-state.edu</u>
- Empowering Top500 systems for over a decade




Years & **Counting!** (2001-2019)

Partner in the 5<sup>th</sup> ranked TACC Frontera System

### **MVAPICH2** Release Timeline and Downloads



Network Based Computing Laboratory

## Architecture of MVAPICH2 Software Family (HPC and DL)

| High Performance Parallel Programming Models |                                   |                                           |
|----------------------------------------------|-----------------------------------|-------------------------------------------|
| Message Passing Interface (MPI)              | PGAS (UPC, OpenSHMEM, CAF, UPC++) | Hybrid MPI + X (MPI + PGAS + OpenMP/Cilk) |



#### <sup>\*</sup> Upcoming

### **MVAPICH2 Software Family**

| Requirements                                                                            | Library       |
|-----------------------------------------------------------------------------------------|---------------|
| MPI with IB, iWARP, Omni-Path, and RoCE                                                 | MVAPICH2      |
| Advanced MPI Features/Support, OSU INAM, PGAS and MPI+PGAS with IB, Omni-Path, and RoCE | MVAPICH2-X    |
| MPI with IB, RoCE & GPU and Support for Deep Learning                                   | MVAPICH2-GDR  |
| HPC Cloud with MPI & IB                                                                 | MVAPICH2-Virt |
| Energy-aware MPI with IB, iWARP and RoCE                                                | MVAPICH2-EA   |
| MPI Energy Monitoring Tool                                                              | OEMT          |
| InfiniBand Network Analysis and Monitoring                                              | OSU INAM      |
| Microbenchmarks for Measuring MPI and PGAS Performance                                  | OMB           |

### **MVAPICH2-GDR: Optimizing MPI Data Movement on GPU Clusters**

Connected as PCIe devices – Flexibility but Complexity



Memory buffers:

1. Intra-GPU
2. Intra-Socket GPU-GPU
3. Inter-Socket GPU-GPU
4. Inter-Node GPU-GPU
5. Intra-Socket GPU-Host
6. Inter-Socket GPU-Host
7. Inter-Node GPU-Host
8. Inter-Node GPU-GPU with IB adapter on remote socket

and more ...

- For each path different schemes: Shared\_mem, IPC, GPUDirect RDMA, pipeline ...
- Critical for runtimes to optimize data movement while hiding the complexity

## **GPU-Aware (CUDA-Aware) MPI Library: MVAPICH2-GPU**

- Standard MPI interfaces used for unified data movement
- Takes advantage of Unified Virtual Addressing (>= CUDA 4.0)
- Overlaps data movement from GPU with RDMA transfers



## CUDA-Aware MPI: MVAPICH2-GDR 1.8-2.3 Releases

- Support for MPI communication from NVIDIA GPU device memory
- High performance RDMA-based inter-node point-to-point communication (GPU-GPU, GPU-Host and Host-GPU)
- High performance intra-node point-to-point communication for multi-GPU adapters/node (GPU-GPU, GPU-Host and Host-GPU)
- Taking advantage of CUDA IPC (available since CUDA 4.1) in intra-node communication for multiple GPU adapters/node
- Optimized and tuned collectives for GPU device buffers
- MPI datatype support for point-to-point and collective communication from GPU device buffers
- Unified memory

### MVAPICH2-GDR 2.3.2

- Released on 08/08/2019
- Major Features and Enhancements
  - Based on MVAPICH2 2.3.1
  - Support for CUDA 10.1
  - Support for PGI 19.x
  - Enhanced intra-node and inter-node point-to-point performance
  - Enhanced MPI\_Allreduce performance for DGX-2 system
  - Enhanced GPU communication support in MPI\_THREAD\_MULTIPLE mode
  - Enhanced performance of datatype support for GPU-resident data
    - Zero-copy transfer when P2P access is available between GPUs through NVLink/PCIe
  - Enhanced GPU-based point-to-point and collective tuning
    - OpenPOWER systems such as ORNL Summit and LLNL Sierra, ABCI system @AIST, Owens and Pitzer systems @Ohio Supercomputer Center
  - Scaled Allreduce to 24,576 Volta GPUs on Summit
  - Enhanced intra-node and inter-node point-to-point performance for DGX-2 and IBM POWER8 and IBM POWER9 systems
  - Enhanced Allreduce performance for DGX-2 and IBM POWER8/POWER9 systems
  - Enhanced small message performance for CUDA-Aware MPI\_Put and MPI\_Get
  - Flexible support for running TensorFlow (Horovod) jobs

## **Optimized MVAPICH2-GDR Design**




### **Application-Level Evaluation (HOOMD-blue)**

Figure panels: 64K Particles and 256K Particles



- Platform: Wilkes (Intel Ivy Bridge + NVIDIA Tesla K20c + Mellanox Connect-IB)
- HOOMD-blue version 1.0.5
  - GDRCOPY enabled: MV2\_USE\_CUDA=1 MV2\_IBA\_HCA=mlx5\_0 MV2\_IBA\_EAGER\_THRESHOLD=32768 MV2\_VBUF\_TOTAL\_SIZE=32768 MV2\_USE\_GPUDIRECT\_LOOPBACK\_LIMIT=32768 MV2\_USE\_GPUDIRECT\_GDRCOPY=1 MV2\_USE\_GPUDIRECT\_GDRCOPY\_LIMIT=16384

### Application-Level Evaluation (Cosmo) and Weather Forecasting in Switzerland





- 2X improvement on 32 GPU nodes
- 30% improvement on 96 GPU nodes (8 GPUs/node)

Cosmo model: <u>http://www2.cosmo-model.org/content/tasks/operational/meteoSwiss/</u>

#### On-going collaboration with CSCS and MeteoSwiss (Switzerland) in co-designing MV2-GDR and Cosmo Application

C. Chu, K. Hamidouche, A. Venkatesh, D. Banerjee, H. Subramoni, and D. K. Panda, Exploiting Maximal Overlap for Non-Contiguous Data Movement Processing on Modern GPU-enabled Systems, IPDPS'16


### Outline

- Overview of the MVAPICH2 Project
- MVAPICH2-GPU with GPUDirect-RDMA (GDR)
- What's new with MVAPICH2-GDR
  - Multi-stream Communication for IPC
  - CMA-based Intra-node Communication Support
  - Support for OpenPower and NVLink with GDRCOPY2
  - Maximal overlap in MPI Datatype Processing
- High-Performance Deep Learning (HiDL) with MVAPICH2-GDR
- Conclusions

### Multi-stream Communication using CUDA IPC on OpenPOWER and DGX-1

- Up to **16% higher** Device-to-Device (D2D) bandwidth on OpenPOWER + NVLink interconnect
- Up to **30% higher** D2D bandwidth on DGX-1 with NVLink

Figure: "Benefits of Multi-stream CUDA IPC Design" (two panels, one per platform). Pt-to-pt (D-D) bandwidth in million bytes (MB)/second vs. message size (bytes), 1-stream vs. 4-streams.

Available since MVAPICH2-GDR-2.3a

### **CMA-based Intra-node Communication Support**

- Up to **30% lower** Host-to-Host (H2H) latency and **30% higher** H2H bandwidth



#### **INTRA-NODE Pt-to-Pt (H2H) BANDWIDTH**

Platform: Intel Broadwell (E5-2680 v4 @ 2.40 GHz) node with 28 cores, NVIDIA Tesla K-80 GPU, and Mellanox Connect-X4 EDR HCA; CUDA 8.0, Mellanox OFED 4.0 with GPU-Direct-RDMA


### Scalable Host-based Collectives on OpenPOWER (Intra-node Reduce & AlltoAll)



### D-to-D Performance on OpenPOWER w/ GDRCopy (NVLink2 + Volta)





Inter-node Latency: 2.18 us (with GDRCopy 2.0)





Inter-node Bandwidth: 23 GB/sec for 4MB (via 2-port EDR)

Platform: OpenPOWER (POWER9-ppc64le) nodes equipped with a dual-socket CPU, 4 Volta V100 GPUs, and 2-port EDR InfiniBand Interconnect


### D-to-H & H-to-D Performance on OpenPOWER w/ GDRCopy (NVLink2 + Volta)



Figure panels: H-D intra-node latency for small and large messages, MV2-GDR vs. Spectrum MPI; latency (us) vs. message size (bytes).

- Intra-node D-H Latency: 0.49 us (with GDRCopy)
- Intra-node H-D Latency: 0.49 us (with GDRCopy 2.0)
- Intra-node H-D Bandwidth: 26.09 GB/sec for 2MB (via NVLINK2)

Platform: OpenPOWER (POWER9-ppc64le) nodes equipped with a dual-socket CPU, 4 Volta V100 GPUs, and 2-port EDR InfiniBand Interconnect

## Managed Memory Performance (OpenPOWER Intra-node)




### **MVAPICH2 with SHARP Support (Preliminary Results)**




### Non-contiguous Data Exchange



### Halo data exchange

- Multi-dimensional data
  - Row based organization
  - Contiguous on one dimension
  - Non-contiguous on other dimensions
- Halo data exchange
  - Duplicate the boundary
  - Exchange the boundary in each iteration
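The halo pattern above can be sketched in a few lines of Python. This is only an in-process illustration under stated assumptions: two side-by-side subdomains stand in for two MPI ranks, and the helper names (`make_subdomain`, `boundary_column`, `fill_ghost_column`, `exchange_halo`) are hypothetical. It shows that the exchanged boundary column takes one element from every row, which is what makes it non-contiguous in a row-based layout.

```python
ROWS, COLS = 3, 4  # interior size of each subdomain (hypothetical values)


def make_subdomain(fill):
    """Row-major 2-D grid with one ghost column on each side of the interior."""
    return [[None] + [fill] * COLS + [None] for _ in range(ROWS)]


def boundary_column(sub, c):
    """Gather column c: one element per row, so strided in row-major memory."""
    return [row[c] for row in sub]


def fill_ghost_column(sub, c, values):
    """Scatter received boundary values into ghost column c."""
    for row, value in zip(sub, values):
        row[c] = value


def exchange_halo(left, right):
    """Duplicate each neighbour's boundary column into the other's ghost column."""
    to_right = boundary_column(left, COLS)   # last interior column of `left`
    to_left = boundary_column(right, 1)      # first interior column of `right`
    fill_ghost_column(right, 0, to_right)
    fill_ghost_column(left, COLS + 1, to_left)


left = make_subdomain("L")
right = make_subdomain("R")
exchange_halo(left, right)  # an iterative solver would repeat this every iteration
```

In a real application the two `fill_ghost_column` calls would be replaced by sends and receives between neighbouring ranks.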

## **MPI Datatype support in MVAPICH2**

- Datatypes support in MPI
  - Operate on customized datatypes to improve productivity
  - Enable MPI library to optimize non-contiguous data

#### At Sender:

```
MPI_Type_vector(n_blocks, n_elements, stride, old_type, &new_type);
MPI_Type_commit(&new_type);
...
MPI_Send(s_buf, size, new_type, dest, tag, MPI_COMM_WORLD);
```

- Inside MVAPICH2
  - Use datatype specific CUDA Kernels to pack data in chunks
  - Efficiently move data between nodes using RDMA
  - In progress: currently optimizes vector and hindexed datatypes
  - Transparent to the user
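To make the vector datatype concrete, here is a minimal Python model of the layout that `MPI_Type_vector` describes: `n_blocks` blocks of `n_elements` elements each, with block starts `stride` elements apart. The helpers `vector_offsets`, `pack`, and `unpack` are hypothetical names used only for illustration. The sketch models the datatype's layout semantics on the host; it is not the CUDA packing kernel used inside MVAPICH2.

```python
def vector_offsets(n_blocks, n_elements, stride):
    """Element offsets selected by a vector type: block b starts at b * stride."""
    return [b * stride + e for b in range(n_blocks) for e in range(n_elements)]


def pack(buf, offsets):
    """Gather the selected elements into a contiguous staging buffer."""
    return [buf[i] for i in offsets]


def unpack(packed, offsets, out):
    """Scatter a contiguous buffer back into the strided layout."""
    for value, i in zip(packed, offsets):
        out[i] = value
    return out


# A 3 x 4 row-major matrix; its first column is 3 blocks of 1 element, stride 4.
s_buf = list(range(12))
column = vector_offsets(n_blocks=3, n_elements=1, stride=4)
packed = pack(s_buf, column)
r_buf = unpack(packed, column, [None] * 12)
```

Packing on one side and unpacking on the other is the work that the library hides from the user when a committed datatype is passed to a send or receive call.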

H. Wang, S. Potluri, D. Bureddy, C. Rosales and D. K. Panda, GPU-aware MPI on RDMA-Enabled Clusters: Design, Implementation and Evaluation, IEEE Transactions on Parallel and Distributed Systems, Vol. 25, No. 10, pp. 2595-2605, Oct 2014.


### **MPI Datatype Processing (Computation Optimization)**

- Comprehensive support
  - Targeted kernels for regular datatypes: vector, subarray, indexed\_block
  - Generic kernels for all other irregular datatypes
- Separate non-blocking stream for kernels launched by MPI library
  - Avoids stream conflicts with application kernels
- Flexible set of parameters for users to tune kernels
  - Vector
    - MV2\_CUDA\_KERNEL\_VECTOR\_TIDBLK\_SIZE
    - MV2\_CUDA\_KERNEL\_VECTOR\_YSIZE
  - Subarray
    - MV2\_CUDA\_KERNEL\_SUBARR\_TIDBLK\_SIZE
    - MV2\_CUDA\_KERNEL\_SUBARR\_XDIM
    - MV2\_CUDA\_KERNEL\_SUBARR\_YDIM
    - MV2\_CUDA\_KERNEL\_SUBARR\_ZDIM
  - Indexed\_block
    - MV2\_CUDA\_KERNEL\_IDXBLK\_XDIM

### **MPI Datatype Processing (Communication Optimization)**

### Common Scenario

MPI\_Isend (A,.. Datatype,...)

MPI\_Isend (B,.. Datatype,...)

MPI\_Isend (C,.. Datatype,...)

MPI\_Isend (D,.. Datatype,...)

MPI\_Waitall (...);

...

\*A, B...contain non-contiguous MPI Datatype

### Waste of computing resources on CPU and GPU



## **Application: COMB**

Run scripts pushed to the COMB GitHub repo: https://github.com/LLNL/Comb/pull/2

16 GPUs on POWER9 system (test Comm mpi Mesh cuda Device Buffers mpi_type)

| Library                       | pre-comm | post-recv | post-send | wait-recv | wait-send | post-comm | start-up | test-comm | bench-comm |
|-------------------------------|----------|-----------|-----------|-----------|-----------|-----------|----------|-----------|------------|
| Spectrum MPI 10.3             | 0.0001   | 0.0000    | 1.6021    | 1.7204    | 0.0112    | 0.0001    | 0.0004   | 7.7383    | 83.6229    |
| MVAPICH2-GDR 2.3.2            | 0.0001   | 0.0000    | 0.0862    | 0.0871    | 0.0018    | 0.0001    | 0.0009   | 0.3558    | 4.4396     |
| MVAPICH2-GDR 2.3.3 (Upcoming) | 0.0001   | 0.0000    | 0.0030    | 0.0032    | 0.0001    | 0.0001    | 0.0009   | 0.0133    | 0.1602     |

Slide callouts: **18×** and 27×, which match the bench-comm ratios between successive rows (Spectrum MPI 10.3 to MVAPICH2-GDR 2.3.2, and MVAPICH2-GDR 2.3.2 to MVAPICH2-GDR 2.3.3).

Improvements due to enhanced support for GPU-kernel based packing/unpacking routines
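As a quick check on the table, the bench-comm column can be turned into speedup ratios. The sketch below uses only the three bench-comm values reported above; the `speedup` helper is a hypothetical name, and the slide does not state the unit of the timings, so only ratios are computed.

```python
# bench-comm values copied from the COMB table above (unit not stated on the slide).
bench_comm = {
    "Spectrum MPI 10.3": 83.6229,
    "MVAPICH2-GDR 2.3.2": 4.4396,
    "MVAPICH2-GDR 2.3.3 (Upcoming)": 0.1602,
}


def speedup(baseline, improved):
    """Ratio of baseline time to improved time; larger means faster."""
    return bench_comm[baseline] / bench_comm[improved]


spectrum_to_232 = speedup("Spectrum MPI 10.3", "MVAPICH2-GDR 2.3.2")
v232_to_233 = speedup("MVAPICH2-GDR 2.3.2", "MVAPICH2-GDR 2.3.3 (Upcoming)")
spectrum_to_233 = speedup("Spectrum MPI 10.3", "MVAPICH2-GDR 2.3.3 (Upcoming)")

for label, value in [
    ("Spectrum MPI 10.3 -> MVAPICH2-GDR 2.3.2", spectrum_to_232),
    ("MVAPICH2-GDR 2.3.2 -> MVAPICH2-GDR 2.3.3", v232_to_233),
    ("Spectrum MPI 10.3 -> MVAPICH2-GDR 2.3.3", spectrum_to_233),
]:
    print(f"{label}: {value:.1f}x")
```

The first two ratios come out between 18 and 19 and between 27 and 28 respectively, and the end-to-end ratio is a little over 500.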


## **Application: HYPRE - BoomerAMG**




#### RUN MVAPICH2-GDR 2.3.2:

export MV2\_USE\_CUDA=1 MV2\_USE\_GDRCOPY=0 MV2\_USE\_RDMA\_CM=0

export MV2\_USE\_GPUDIRECT\_LOOPBACK=0 MV2\_HYBRID\_BINDING\_POLICY=spread MV2\_IBA\_HCA=mlx5\_0:mlx5\_3

OMP\_NUM\_THREADS=20 lrun -n 128 -N 32 mpibind ./ij -P 8 4 4 -n 50 50 50 -pmis -Pmx 8 -keepT 1 -rlx 18

#### RUN Spectrum-MPI 10.3.0.1:

OMP\_NUM\_THREADS=20 lrun -n 128 -N 32 --smpiargs "-gpu --disable\_gdr" mpibind ./ij -P 8 4 4 -n 50 50 50 -pmis -Pmx 8 -keepT 1 -rlx 18

### Outline

- Overview of the MVAPICH2 Project
- MVAPICH2-GPU with GPUDirect-RDMA (GDR)
- What's new with MVAPICH2-GDR
- High-Performance Deep Learning (HiDL) with MVAPICH2-GDR
  - Benefits of CUDA-Aware MPI with TensorFlow
  - Optimized Collectives for Deep Learning
  - Out-of-core DNN Training
- Conclusions

## Data Parallel Training with TensorFlow (TF)

- Need to understand several options currently available
- gRPC (official support)
  - Open-source: can be enhanced by others
  - Accelerated gRPC (add RDMA to gRPC)
- gRPC+X
  - Use gRPC for bootstrap and rendezvous
  - Actual communication is in "X"
  - $X \rightarrow$  MPI, Verbs, GPUDirect RDMA (GDR), etc.
- No-gRPC
  - Baidu: the first to use MPI collectives for TF
  - Horovod: uses NCCL, MPI, or any other future library (e.g., IBM DDL support recently added)


A. A. Awan, J. Bedorf, C.-H. Chu, H. Subramoni and D. K. Panda, "Scalable Distributed DNN Training using TensorFlow and CUDA-Aware MPI: Characterization, Designs, and Performance Evaluation", CCGrid '19. https://arxiv.org/abs/1810.11112



### **Exploiting CUDA-Aware MPI for TensorFlow (Horovod)**

- MVAPICH2-GDR offers excellent performance via advanced designs for MPI\_Allreduce.
- Up to 11% better performance on the RI2 cluster (16 GPUs)
- Near-ideal 98% scaling efficiency
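Scaling efficiency here is presumably the usual ratio of measured throughput to ideal linear throughput; the slide does not spell out the formula, so that definition is an assumption. A minimal sketch follows, with a hypothetical function name and hypothetical throughput numbers chosen only so that the result matches the 98% figure.

```python
def scaling_efficiency(throughput_n, throughput_1, n):
    """Measured throughput on n GPUs divided by n times the single-GPU throughput."""
    return throughput_n / (n * throughput_1)


# Hypothetical numbers for illustration only (not measurements from the talk):
single_gpu_images_per_s = 100.0
sixteen_gpu_images_per_s = 1568.0

efficiency = scaling_efficiency(sixteen_gpu_images_per_s, single_gpu_images_per_s, 16)
print(f"scaling efficiency on 16 GPUs: {efficiency:.0%}")
```

Under this definition, an efficiency of 1.0 corresponds to the "Ideal" line in a scaling plot.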



Figure legend: Horovod-MPI, Horovod-NCCL2, Horovod-MPI-Opt (Proposed), Ideal

A. A. Awan et al., "Scalable Distributed DNN Training using TensorFlow and CUDA-Aware MPI: Characterization, Designs, and Performance Evaluation", CCGrid '19, <u>https://arxiv.org/abs/1810.11112</u>


## **MVAPICH2-GDR vs. NCCL2: Allreduce Operation**

- Optimized designs in MVAPICH2-GDR 2.3 offer better/comparable performance for most cases
- MPI\_Allreduce (MVAPICH2-GDR) vs. ncclAllreduce (NCCL2) on 16 GPUs
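For readers unfamiliar with the operation being compared, the sketch below simulates an allreduce (elementwise sum, with the result delivered to every rank) in plain Python. It uses a ring algorithm (reduce-scatter followed by allgather) as one common textbook approach; the slides do not say which algorithms MVAPICH2-GDR or NCCL2 use, so this is an illustration of the semantics, not of either library's design. The function name `ring_allreduce` is hypothetical.

```python
def ring_allreduce(vectors):
    """Simulate a sum-allreduce over len(vectors) ranks arranged in a ring."""
    p = len(vectors)
    n = len(vectors[0])
    data = [list(v) for v in vectors]  # per-rank working buffers
    bounds = [(k * n // p, (k + 1) * n // p) for k in range(p)]  # p chunks

    # Phase 1, reduce-scatter: in step s, rank r sends chunk (r - s) % p to
    # rank (r + 1) % p, which adds it to its own copy of that chunk.
    for s in range(p - 1):
        sends = []
        for r in range(p):
            lo, hi = bounds[(r - s) % p]
            sends.append(((r + 1) % p, lo, data[r][lo:hi]))
        for dst, lo, payload in sends:
            for i, value in enumerate(payload):
                data[dst][lo + i] += value

    # Rank r now holds the fully reduced chunk (r + 1) % p.
    # Phase 2, allgather: in step s, rank r forwards chunk (r + 1 - s) % p.
    for s in range(p - 1):
        sends = []
        for r in range(p):
            lo, hi = bounds[(r + 1 - s) % p]
            sends.append(((r + 1) % p, lo, data[r][lo:hi]))
        for dst, lo, payload in sends:
            data[dst][lo:lo + len(payload)] = payload

    return data


# Three simulated ranks, each holding a local gradient vector.
gradients = [[1, 2, 3, 4], [10, 20, 30, 40], [100, 200, 300, 400]]
reduced = ring_allreduce(gradients)
averaged = [[x / len(gradients) for x in rank_buf] for rank_buf in reduced]
```

In data-parallel training the summed gradients are typically divided by the number of ranks, as in the last line, before the optimizer step.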



Platform: Intel Xeon (Broadwell) nodes equipped with a dual-socket CPU, 1 K-80 GPUs, and EDR InfiniBand Inter-connect

## **MVAPICH2-GDR vs. NCCL2: Allreduce Optimization (DGX-2)**

- Optimized designs in upcoming MVAPICH2-GDR offer better performance for most cases
- MPI\_Allreduce (MVAPICH2-GDR) vs. ncclAllreduce (NCCL2) on a DGX-2 machine



Platform: Nvidia DGX-2 system @ PSC (16 Nvidia Volta GPUs connected with NVSwitch), CUDA 9.2

## **MVAPICH2-GDR: MPI\_Allreduce (Device Buffers) on Summit**

- Optimized designs in MVAPICH2-GDR offer better performance for most cases
- MPI\_Allreduce (MVAPICH2-GDR) vs. ncclAllreduce (NCCL2) up to 1,536 GPUs



Platform: Dual-socket IBM POWER9 CPU, 6 NVIDIA Volta V100 GPUs, and 2-port InfiniBand EDR Interconnect

## Distributed Training with TensorFlow and MVAPICH2-GDR on Summit

- ResNet-50 training using the TensorFlow benchmark on Summit with 1,536 Volta GPUs
- 1,281,167 (about 1.28 million) images
- Time/epoch = 3.6 seconds
- Total time (90 epochs) = 332 seconds = **5.5 minutes** (the rounded per-epoch time gives 3.6 x 90 = 324 seconds)
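The arithmetic on this slide can be checked with a few lines of Python. The inputs are the figures quoted above; the variable names are hypothetical. Note that the rounded 3.6 seconds per epoch multiplies out to 324 seconds, so the reported 332-second total implies a per-epoch time slightly above 3.6 seconds.

```python
IMAGES_PER_EPOCH = 1_281_167   # images, from the slide
TIME_PER_EPOCH_S = 3.6         # seconds per epoch, as reported (rounded)
EPOCHS = 90
REPORTED_TOTAL_S = 332         # total seconds, as reported

total_from_rounded_epoch_s = TIME_PER_EPOCH_S * EPOCHS
implied_time_per_epoch_s = REPORTED_TOTAL_S / EPOCHS
reported_total_min = REPORTED_TOTAL_S / 60
images_per_second = IMAGES_PER_EPOCH / TIME_PER_EPOCH_S

print(f"3.6 s x 90 epochs      = {total_from_rounded_epoch_s:.0f} s")
print(f"332 s / 90 epochs      = {implied_time_per_epoch_s:.2f} s per epoch")
print(f"332 s in minutes       = {reported_total_min:.1f} min")
print(f"aggregate throughput   = {images_per_second:,.0f} images/s")
```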



\*We observed errors for NCCL2 beyond 96 GPUs

*Platform: The Summit Supercomputer (#1 on Top500.org) – 6 NVIDIA Volta GPUs per node connected with NVLink, CUDA 9.2* 

## New Benchmark for Image Segmentation on Summit

- Near-linear scaling may be achieved by tuning Horovod/MPI
  - Optimizing MPI/Horovod towards large message sizes for high-resolution images
- Develop a generic Image Segmentation benchmark
- Tuned DeepLabV3+ model using the benchmark and Horovod, up to 1.3X better than default



\*Anthony et al., "Scaling Semantic Image Segmentation using Tensorflow and MVAPICH2-GDR on HPC Systems" (Submission under review)

## **OSU-Caffe: Scalable Deep Learning**

- Caffe: a flexible and layered Deep Learning framework
- Benefits and Weaknesses
  - Multi-GPU Training within a single node
  - Performance degradation for GPUs across different sockets
  - Limited Scale-out
- OSU-Caffe: MPI-based Parallel Training
  - Enable Scale-up (within a node) and Scale-out (across multi-GPU nodes)
  - Scale-out on 64 GPUs for training CIFAR-10 network on CIFAR-10 dataset
  - Scale-out on 128 GPUs for training GoogLeNet network on ImageNet dataset

## OSU-Caffe publicly available from http://hidl.cse.ohio-state.edu/

### GoogLeNet (ImageNet) on 128 GPUs




# Scalability and Large (Out-of-core) Models?

- Large DNNs cannot be trained on GPUs due to memory limitation!
  - ResNet-50 for image recognition: current frameworks can only go up to a small batch size of 45
  - Next-generation models: Neural Machine Translation (NMT)
    - Ridiculously large (billions of parameters)
    - Will require even more memory!
  - Can we exploit new software features in CUDA 8/9 and hardware mechanisms in Pascal/Volta GPUs?
- General intuition is that managed allocations "will be" slow!
  - The proposed framework called OC-Caffe (Out-of-Core Caffe) shows the potential of managed memory designs that can provide performance with negligible/no overhead.
- OC-Caffe-Opt: up to 80% better than Intel-optimized CPU Caffe for ResNet-50 training on the Volta V100 GPU with CUDA9 and CUDNN7





A. A. Awan, C.-H. Chu, H. Subramoni, X. Lu, and D. K. Panda, OC-DNN: Exploiting Advanced Unified Memory Capabilities in CUDA 9 and Volta GPUs for Out-of-Core DNN Training, HiPC '18

## HyPar-Flow (HF): Hybrid Parallelism for TensorFlow

- CPU based results
  - AMD EPYC
  - Intel Xeon
- Excellent speedups for
  - VGG-19
  - ResNet-110
  - ResNet-1000 (1k layers)
- Able to train "future" models
  - E.g., ResNet-5000 (a synthetic 5000-layer model we benchmarked)



#### 110x speedup on 128 Intel Xeon Skylake nodes (TACC Stampede2 Cluster)

\*Awan et al., "HyPar-Flow: Exploiting MPI and Keras for Hybrid Parallel Training of TensorFlow models", arXiv '19. https://arxiv.org/pdf/1911.05146.pdf

### Outline

- Overview of the MVAPICH2 Project
- MVAPICH2-GPU with GPUDirect-RDMA (GDR)
- What's new with MVAPICH2-GDR
- High-Performance Deep Learning (HiDL) with MVAPICH2-GDR
- Conclusions

### **Conclusions**

- MVAPICH2-GDR Library provides optimized MPI communication on InfiniBand and RoCE clusters with GPUs
- Supports both X86 and OpenPower with NVLink
- Takes advantage of CUDA features like IPC and GPUDirect RDMA families
- Allows flexible solutions for streaming applications with GPUs
- Provides optimized solutions (scale-up and scale-out) for High-Performance Deep Learning

## **Commercial Support for MVAPICH2, HiBD, and HiDL Libraries**

- Supported through X-ScaleSolutions (<u>http://x-scalesolutions.com</u>)
- Benefits:
  - Help and guidance with installation of the library
  - Platform-specific optimizations and tuning
  - Timely support for operational issues encountered with the library
  - Web portal interface to submit issues and track their progress
  - Advanced debugging techniques
  - Application-specific optimizations and tuning
  - Obtaining guidelines on best practices
  - Periodic information on major fixes and updates
  - Information on major releases
  - Help with upgrading to the latest release
  - Flexible Service Level Agreements
- Support provided to Lawrence Livermore National Laboratory (LLNL) for the last two years



## Multiple Events at SC '19

- Presentations at OSU and X-Scale Booth (#2094)
  - Members of the MVAPICH, HiBD, and HiDL projects
  - External speakers
- Presentations at SC main program (Tutorials, Workshops, BoFs, Posters, and Doctoral Showcase)
- Presentations at many other booths (Mellanox, Intel, Microsoft, and AWS) and satellite events
- Complete details available at http://mvapich.cse.ohio-state.edu/conference/752/talks/

### **Funding Acknowledgments**

**Funding Support by** 



### **Personnel Acknowledgments**

#### Current Students (Graduate)

- A. Awan (Ph.D.)
- M. Bayatpour (Ph.D.)
- C.-H. Chu (Ph.D.)
- J. Hashmi (Ph.D.)
- A. Jain (Ph.D.)
- K. S. Kandadi (M.S.)
- Kamal Raj (M.S.)
- K. S. Khorassani (Ph.D.)
- P. Kousha (Ph.D.)
- A. Quentin (Ph.D.)
- B. Ramesh (M.S.)
- S. Xu (M.S.)
- Q. Zhou (Ph.D.)

#### Current Students (Undergraduate)

- V. Gangal (B.S.)
- N. Sarkauskas (B.S.)

#### Current Research Scientist

- H. Subramoni

#### Current Post-doc

- M. S. Ghazimeersaeed
- A. Ruhela
- K. Manian

#### **Current Research Specialist**

- J. Smith

#### Past Students

- A. Augustine (M.S.)
- P. Balaji (Ph.D.)
- R. Biswas (M.S.)
- S. Bhagvat (M.S.)
- A. Bhat (M.S.)
- D. Buntinas (Ph.D.)
- L. Chai (Ph.D.)
- B. Chandrasekharan (M.S.)
- S. Chakraborthy (Ph.D.)
- N. Dandapanthula (M.S.)
- V. Dhanraj (M.S.)
- T. Gangadharappa (M.S.)
- K. Gopalakrishnan (M.S.)
- W. Huang (Ph.D.)
- W. Jiang (M.S.)
- J. Jose (Ph.D.)
- S. Kini (M.S.)
- M. Koop (Ph.D.)
- K. Kulkarni (M.S.)
- R. Kumar (M.S.)
- S. Krishnamoorthy (M.S.)
- K. Kandalla (Ph.D.)
- M. Li (Ph.D.)
- P. Lai (M.S.)
- J. Liu (Ph.D.)
- M. Luo (Ph.D.)
- A. Mamidala (Ph.D.)
- G. Marsh (M.S.)
- V. Meshram (M.S.)
- A. Moody (M.S.)
- S. Naravula (Ph.D.)
- R. Noronha (Ph.D.)
- X. Ouyang (Ph.D.)
- S. Pai (M.S.)
- S. Potluri (Ph.D.)
- R. Rajachandrasekar (Ph.D.)
- D. Shankar (Ph.D.)
- G. Santhanaraman (Ph.D.)
- A. Singh (Ph.D.)
- J. Sridhar (M.S.)
- S. Sur (Ph.D.)
- H. Subramoni (Ph.D.)
- K. Vaidyanathan (Ph.D.)
- A. Vishnu (Ph.D.)
- J. Wu (Ph.D.)
- W. Yu (Ph.D.)
- J. Zhang (Ph.D.)

#### Past Post-Docs

- D. Banerjee
- X. Besseron
- H.-W. Jin
- J. Lin
- M. Luo
- E. Mancini
- S. Marcarelli
- J. Vienne
- H. Wang

#### Past Research Scientist

- K. Hamidouche
- S. Sur
- X. Lu

#### Past Programmers

- D. Bureddy
- J. Perkins

#### Past Research Specialist

- M. Arnold

# **Thank You!**

panda@cse.ohio-state.edu





https://twitter.com/mvapich

Network-Based Computing Laboratory http://nowlab.cse.ohio-state.edu/



The High-Performance MPI/PGAS Project <u>http://mvapich.cse.ohio-state.edu/</u>




The High-Performance Big Data Project http://hibd.cse.ohio-state.edu/



The High-Performance Deep Learning Project <u>http://hidl.cse.ohio-state.edu/</u>
