1st Floor, Room 106 | 8:30 am - 3:15 pm
Limited Seating (20 registrants)
Course Skill Level: 25% basic content, 25% intermediate content, and 50% advanced content
Speakers:
- John Mellor-Crummey, Professor of Computer Science and of Electrical and Computer Engineering, Rice University
- Sameer Shende, Research Professor and Director of the Performance Research Laboratory, University of Oregon
Schedule:
- 8:30 - 9:00 am: Check-in + Breakfast
- 9:00 - 9:30 am: Set up ParaTools Pro for E4S on Cloud Platforms
- 9:30 - 11:30 am: HPCToolkit
- 11:30 am - 12:30 pm: Lunch
- 12:30 - 2:30 pm: TAU
- 2:30 - 2:45 pm: Break
- 2:45 - 3:15 pm: ParaTools Pro for E4S and Conclusion
Materials: Attendees will need to bring their laptop to access materials during the workshop. There will be power, but please charge in advance as some outlets may need to be shared.
Abstract: This hands-on workshop will present two performance evaluation tools, HPCToolkit and TAU, for evaluating and optimizing the performance of GPU-accelerated HPC and AI applications.
HPCToolkit (https://hpctoolkit.org) is an integrated suite of tools for profiling and tracing parallel programs on computers ranging from multicore desktop systems to GPU-accelerated supercomputers and cloud platforms. HPCToolkit can measure and analyze executions of fully optimized, dynamically linked parallel applications on tens of thousands of CPU cores and GPUs. It supports multi-lingual codes with external binary-only libraries. It collects sampling-based measurements of CPU codes with controllable overhead. On GPUs, it uses vendor APIs to collect fine-grained measurements through PC sampling or instrumentation, and it monitors asynchronous GPU operations using activity APIs. HPCToolkit can attribute performance measurements to rich dynamic calling contexts containing procedures, inlined functions, loop nests, and source lines on both CPUs and GPUs.
The TAU Performance System [http://tau.uoregon.edu] is a versatile performance evaluation toolkit supporting both profiling and tracing modes of measurement. It supports performance evaluation of applications running on CPUs and GPUs, and it supports runtime preloading of a Dynamic Shared Object (DSO) that allows users to measure performance without modifying the source code or binary. This tutorial will describe how TAU may be used with MVAPICH and how it supports advanced performance introspection capabilities at the runtime layer. TAU's support for tracking the idle time spent in implicit barriers within collective operations will be demonstrated. TAU also supports event-based sampling at the function, file, and statement level. TAU's support for runtime systems such as CUDA (for NVIDIA GPUs), Level Zero (for Intel oneAPI DPC++/SYCL), ROCm (for AMD GPUs), OpenMP with support for OMPT and Target Offload directives, Kokkos, and MPI allows instrumentation at the runtime system layer while using sampling to evaluate statement-level performance data.
HPCToolkit and TAU will be demonstrated on AWS using the ParaTools Pro for E4S(TM) image. The Extreme-scale Scientific Software Stack (E4S) [https://e4s.io] is a curated, Spack-based software distribution of 100+ HPC and AI/ML packages. The Spack package manager is a core component of E4S, and it is a platform for product integration and deployment of performance evaluation tools such as HPCToolkit, TAU, DyninstAPI, and PAPI; it supports both bare-metal and containerized deployment for CPU and GPU platforms. E4S provides a Spack binary cache and a set of base and full-featured container images with vendor runtimes to support GPU architectures from NVIDIA, Intel, and AMD. E4S is a community effort to provide open-source software packages for developing, deploying, and running scientific applications and tools on HPC platforms.