In the paper we provide thorough benchmarking of deep neural network (DNN) training on modern multi- and many-core Intel processors in order to assess performance differences for various deep learning and parallel computing parameters. We present the performance of DNN training for Alexnet, Googlenet, Googlenet_v2 and Resnet_50 for various engines used by the deep learning framework and for various batch sizes. Furthermore, we measured results for various numbers of threads, with ranges depending on the given processor(s), as well as for compact and scatter affinities. Based on the results, we formulate conclusions with respect to optimal parameters and relative performance which can serve as hints for researchers training similar networks on modern processors.
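The compact and scatter affinities compared above differ in how consecutive threads are placed on cores. A minimal sketch of the two placement policies (illustrative helper functions with assumed socket/core counts, not the code used in the benchmarks):

```c
#include <stdio.h>

/* "Compact" affinity: fill consecutive cores on one socket before
 * moving to the next, so thread t lands on core t. */
static int compact_core(int t, int cores_per_socket, int sockets) {
    (void)cores_per_socket;
    (void)sockets;
    return t;
}

/* "Scatter" affinity: round-robin threads across sockets first,
 * spreading load and memory bandwidth across the machine. */
static int scatter_core(int t, int cores_per_socket, int sockets) {
    int socket = t % sockets;   /* alternate sockets first      */
    int slot   = t / sockets;   /* then fill cores in a socket  */
    return socket * cores_per_socket + slot;
}
```

For a hypothetical 2-socket machine with 4 cores per socket, threads 0..3 under scatter land on cores 0, 4, 1, 5, while under compact they land on cores 0, 1, 2, 3.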
This paper provides a review of contemporary methodologies and APIs for parallel programming, with representative technologies selected in terms of target system type (shared memory, distributed, and hybrid), communication patterns (one-sided and two-sided), and programming abstraction level. We analyze representatives in terms of many aspects including programming model, languages, supported platforms, license, optimization goals, ease of programming, debugging, deployment, portability, level of parallelism, constructs enabling parallelism and synchronization, features introduced in recent versions indicating trends, support for hybridity in parallel execution, and disadvantages. Such detailed analysis has led us to the identification of trends in high-performance computing and of the challenges to be addressed in the near future. It can help to shape future versions of programming standards, select technologies best matching programmers’ needs, and avoid potential difficulties while using high-performance computing systems.
The paper presents the state of the art of energy-aware high-performance computing (HPC), in particular the identification and classification of approaches by system and device types, optimization metrics, and energy/power control methods. System types include single devices, clusters, grids, and clouds, while considered device types include CPUs, GPUs, multiprocessor, and hybrid systems. Optimization goals include various combinations of metrics such as execution time, energy consumption, and temperature, with consideration of imposed power limits. Control methods include scheduling, DVFS/DFS/DCT, power capping with programmatic APIs such as Intel RAPL and NVIDIA NVML, as well as application optimizations and hybrid methods. We discuss tools and APIs for energy/power management as well as tools and environments for prediction and/or simulation of energy/power consumption in modern HPC systems. Finally, programming examples, i.e., applications and benchmarks used in particular works, are discussed. Based on our review, we identify a set of open areas and important up-to-date problems concerning methods and tools for modern HPC systems allowing energy-aware processing.
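APIs such as Intel RAPL expose energy as monotonically increasing counters that wrap around at a platform-specific maximum, so any measurement tool differencing two readings must handle the wrap. A minimal sketch (the wraparound range is a parameter here; real values are platform-specific):

```c
#include <stdint.h>

/* Energy delta in microjoules between two RAPL-style counter readings,
 * accounting for at most one counter wraparound between the samples. */
static uint64_t energy_delta_uj(uint64_t before, uint64_t after,
                                uint64_t max_range_uj) {
    if (after >= before)
        return after - before;
    /* counter wrapped: add the range remaining before the wrap */
    return (max_range_uj - before) + after;
}
```

Sampling too infrequently risks missing more than one wrap, which this simple scheme cannot detect; measurement tools therefore poll well within the counter's rollover period.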
In the paper we present an investigation of
performance-energy trade-offs under power capping using
modern processors. The results are presented for systems
targeted at both server and client markets and were collected
from Intel Xeon E5 and Intel Xeon Phi server processors as well
as from desktop and mobile Intel Core i7 processors. The results,
when using power capping, show that we can find various
interesting combinations of energy savings and performance
drops as well as non-trivial minima of the energy-execution
time product. We performed this analysis for a subset of the NAS
Parallel Benchmarks: BT, CG, EP and FT, and for various
sizes of the computational problem (classes A, B, C, D). We
can observe that the energy characteristics visualized by a
prototype of our new tool EnergyProfiler do not depend on the
size of a computational problem. Consequently, the proposed
tool can potentially support quick energy/performance trade-off
estimation for codes similar to the tested, well-recognized benchmarks.
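The non-trivial minima of the energy-execution time product mentioned above can be located by a simple scan over measured samples, one per tested power cap. A sketch of that scan (the test data below is illustrative, not measurements from the paper):

```c
#include <stddef.h>

/* Index of the power-cap sample minimizing energy * time
 * (an energy-delay-product-style metric). Arrays hold one
 * measured (energy, time) pair per tested power cap. */
static size_t min_energy_time_product(const double *energy_j,
                                      const double *time_s, size_t n) {
    size_t best = 0;
    for (size_t i = 1; i < n; i++)
        if (energy_j[i] * time_s[i] < energy_j[best] * time_s[best])
            best = i;
    return best;
}
```

The minimum is non-trivial exactly when `best` points neither at the default (highest) cap nor at the lowest allowed cap.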
This paper presents the architecture, main components and performance results for a parallel and modular agent-based environment aimed at crowd simulation. The environment makes it possible to simulate thousands or more agents on maps of square kilometers or more, features a modular design and incorporates non-volatile RAM (NVRAM) with a fail-safe mode that can be activated to allow computations to continue from a recently analyzed state in case of a failure. We show results for an evacuation scenario for an area of up to 6 km² in a district of Gdansk, Poland, performed on two clusters, one with hardware simulation of NVRAM. We have shown a very small overhead of using NVRAM compared to the RAM-only solution, and an overhead of 20% with the fail-safe mode on using NVRAM, for up to 30 000 agents and up to 25 000 iterations of the simulation. We also show the benefit of using NVRAM for file synchronization, with a slow growth of the execution time while increasing the map size. We then present how the frequency of visualization affects execution time, and very good scaling of the proposed solution in a cluster environment for more than 650 processes and 60 000 agents.
The paper presents an assessment of Unified Memory performance with data prefetching
and memory oversubscription. Several versions of the code are compared: standard
memory management, standard Unified Memory, and Unified Memory optimized
with programmer-assisted data prefetching. Evaluation of execution times is provided
for four applications: Sobel and image rotation filters, stream image processing
and computational fluid dynamics simulation, performed on Pascal and Volta
architecture GPUs—NVIDIA GTX 1080 and NVIDIA V100 cards. Furthermore,
we evaluate the possibility of allocating more memory than available on GPUs
and assess performance of codes using the three aforementioned implementations,
including memory oversubscription available in CUDA. Results serve as recommendations
and hints for other similar codes regarding expected performance on modern
and already widely available GPUs.
In the paper we present an approach and results from applying the modern power capping mechanism available for NVIDIA GPUs to benchmarks such as NAS Parallel Benchmarks BT, SP and LU as well as the cublasgemm-benchmark, which are widely used for assessment of high performance computing systems' performance. Depending on the benchmark, various power cap configurations are best for a desired trade-off of performance and energy consumption. We present two metrics: energy savings and performance drops for the same power caps, as well as a normalized performance-energy consumption product. It is important that optimal configurations are often non-trivial, i.e. they are obtained for power caps smaller than the default and larger than the minimal allowed limits. Tests have been performed for two modern GPUs of the Pascal and Turing generations, i.e. NVIDIA GTX 1070 and NVIDIA RTX 2080 respectively, and thus the results can be useful for many applications with profiles similar to the benchmarks, executed on modern GPU-based systems.
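A performance-energy product of the kind mentioned above can be normalized against the default (uncapped) run so that values above 1 indicate a net win. The exact formula used in the paper is not reproduced here; the sketch below shows one plausible normalization as an assumption:

```c
/* Performance-energy product for a power-capped run, normalized to the
 * default (no power limit) run. perf_norm > 1 means the capped run is
 * faster; energy_norm > 1 means it uses less energy. A product above 1
 * indicates the cap improves the combined metric. */
static double normalized_perf_energy(double time_capped_s,
                                     double energy_capped_j,
                                     double time_default_s,
                                     double energy_default_j) {
    double perf_norm   = time_default_s / time_capped_s;
    double energy_norm = energy_default_j / energy_capped_j;
    return perf_norm * energy_norm;
}
```

For example, a cap that doubles execution time while quartering energy use still scores above 1, which is why optimal caps can lie well below the default limit.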
The paper presents a new approach to parallel image processing using byte-addressable, non-volatile memory (NVRAM). We show that our custom-built MPI I/O implementation of selected functions, using a distributed cache that incorporates NVRAM located in cluster nodes, can be used for efficient processing of large images. We demonstrate performance benefits of such a solution compared to a traditional implementation without NVRAM for various sizes of the buffers used to read image parts, process them and write results back to storage. We also show that our implementation benefits from overlapping the reading of subsequent images with processing of already loaded ones. We present results obtained in a cluster environment for three parallel implementations: blur, multi-pass blur and Sobel filters, for various NVRAM parameters such as latency and bandwidth values.
In the paper we present extensive results from analyzing energy/performance trade-offs with power capping observed on four different modern CPUs, for three parallel applications: 2D heat distribution, numerical integration and Fast Fourier Transform. The CPUs tested represent both multi-core CPUs, such as Intel® Xeon® E5 and desktop and mobile Core i7, and the many-core Intel® Xeon Phi™ x200, covering server, desktop and mobile solutions widely used nowadays. We show that using enforced power caps we can find points of lower-than-default energy consumption, mostly for desktop and mobile solutions, at the cost of increased execution time. We show with concrete numbers how energy consumed, power consumption and execution time change for the point of minimum energy used versus the default configuration with no power limit, for each application and each tested CPU.
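The comparison above, i.e. the minimum-energy point versus the default configuration, amounts to scanning the measured samples for the lowest energy and reporting the relative changes. A sketch of that bookkeeping (illustrative data in the usage, not the paper's measurements):

```c
#include <stddef.h>

struct tradeoff {
    size_t idx;               /* index of the minimum-energy sample   */
    double energy_change_pct; /* negative = energy saved vs default   */
    double time_change_pct;   /* positive = slowdown vs default       */
};

/* Locate the minimum-energy power cap among measured samples and report
 * how energy and execution time change there relative to the default
 * configuration, assumed to be stored at index 0 (no power limit). */
static struct tradeoff min_energy_vs_default(const double *energy_j,
                                             const double *time_s, size_t n) {
    size_t best = 0;
    for (size_t i = 1; i < n; i++)
        if (energy_j[i] < energy_j[best])
            best = i;
    struct tradeoff r;
    r.idx = best;
    r.energy_change_pct = 100.0 * (energy_j[best] - energy_j[0]) / energy_j[0];
    r.time_change_pct   = 100.0 * (time_s[best]   - time_s[0])   / time_s[0];
    return r;
}
```

For example, samples with energies {1000, 850, 900} J and times {10, 12, 15} s yield the middle cap as the minimum-energy point: 15% energy saved for a 20% slowdown.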
The paper presents benchmarking of a multi-stream
application processing a set of input data arrays. Tests have
been performed and execution times measured for various
numbers of streams and various compute intensities measured
as the ratio of kernel compute time and data transfer time.
As such, the application and benchmarking are representative of
frequently used operations such as vector weighted sum, matrix
multiplication, etc. The paper shows benefits of using multiple
data streams for various compute intensities compared to one
stream, benchmarked for 4 GPUs: professional NVIDIA Tesla
V100, Tesla K20m, desktop GTX 1060 and mobile GeForce
940MX. Additionally, relative performances are shown for various
numbers of kernel computations for these GPUs.
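The benefit of multiple streams can be approximated with a simple pipeline model: when the input is split into n chunks processed by n streams, the copy-in, kernel, and copy-out stages of successive chunks overlap, so total time approaches the dominant stage plus pipeline fill/drain. A sketch of such an analytical estimate (a deliberate simplification, not the benchmark code):

```c
/* Estimated total time when data is split into n equal chunks processed
 * by n streams, with host-to-device copy, kernel, and device-to-host
 * copy stages overlapping like a 3-stage pipeline. The stage times are
 * for the whole (unsplit) data set; n == 1 reduces to the serial sum. */
static double pipelined_time(double t_in, double t_kernel, double t_out, int n) {
    double max_stage = t_in;
    if (t_kernel > max_stage) max_stage = t_kernel;
    if (t_out > max_stage)    max_stage = t_out;
    /* first chunk fills the pipeline; the remaining n-1 chunks are
     * paced by the slowest stage */
    return (t_in + t_kernel + t_out) / n + (n - 1) * max_stage / n;
}
```

As compute intensity (the kernel-to-transfer time ratio) grows above 1, the estimate approaches the kernel time alone, matching the intuition that with enough streams the transfers hide behind computation.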