Title: Optimized HPL for AMD GPU and multi-core CPU usage
Author: Bach, Matthias; Kretz, Matthias; Lindenstruth, Volker; Rohr, David
Subject: Heterogeneous computing; Linpack; HPL; DGEMM; CALDGEMM; GPGPU
Description:
The installation of the LOEWE-CSC ( http://csc.uni-frankfurt.de/csc/?51 ) supercomputer at the Goethe University in Frankfurt led to the development of a Linpack implementation which can fully utilize the installed AMD Cypress GPUs. At its core, a fast DGEMM for combined GPU and CPU usage was created. The DGEMM library is tuned to hide all DMA transfer times and thus maximize the GPU load. A work-stealing scheduler was implemented to add the remaining CPU resources to the DGEMM. On the GPU, the DGEMM achieves 497 GFlop/s (90.9% of the theoretical peak). Combined with the 24-core Magny-Cours CPUs, 623 GFlop/s (83.6% of the peak) are achieved. The HPL benchmark was modified to perform well with one MPI process per node. The modifications include multi-threading, vectorization, use of the GPU DGEMM, cache optimizations, and a new Lookahead algorithm. A Linpack performance of 70% of the theoretical peak is achieved, and this performance scales linearly to hundreds of nodes.
Is part of:
Computer Science - Research and Development, 2011, Vol.26(3), pp.153-164
1865-2034 (ISSN); 1865-2042 (E-ISSN); 10.1007/s00450-011-0161-5 (DOI)
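The work-stealing idea in the abstract above can be sketched minimally: the GPU acts as one consumer of DGEMM tiles from a shared queue, and CPU workers drain whatever work remains so no resource sits idle. This is an illustrative sketch only; the worker names, `run_work_stealing`, and the tile representation are assumptions, not the CALDGEMM implementation.

```python
# Hypothetical sketch of work-stealing DGEMM scheduling: one "gpu" worker
# and several "cpu" workers pull tiles from a shared queue until it is empty.
from collections import deque
from threading import Thread, Lock

def run_work_stealing(tiles, n_cpu_workers=3):
    queue = deque(tiles)   # shared queue of DGEMM tiles (illustrative)
    lock = Lock()
    done = []              # (worker, tile) records, to show who did what

    def worker(name):
        while True:
            with lock:
                if not queue:
                    return          # no work left to steal
                tile = queue.popleft()
            # ... the real code would run a DGEMM kernel on `tile` here ...
            done.append((name, tile))

    workers = [Thread(target=worker, args=("gpu",))]
    workers += [Thread(target=worker, args=(f"cpu{i}",))
                for i in range(n_cpu_workers)]
    for t in workers:
        t.start()
    for t in workers:
        t.join()
    return done
```

Every tile is processed exactly once regardless of how the GPU and CPU workers interleave, which is the property the scheduler relies on.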
Title: ALICE HLT TPC tracking of Pb-Pb events on GPUs
Author: Rohr, David; Gorbunov, Sergey; Szostak, Artur; Kretz, Matthias; Kollegger, Thorsten; Breitner, Timo; Alt, Torsten
Subject: Physics
Description:
The online event reconstruction for the ALICE experiment at CERN requires the capability to process central Pb-Pb collisions at a rate of more than 200 Hz, corresponding to an input data rate of about 25 GB/s. The reconstruction of particle trajectories in the Time Projection Chamber (TPC) is the most compute-intensive step. The TPC online tracker implementation combines the principle of the cellular automaton and the Kalman filter. It has been accelerated by the use of graphics cards (GPUs). Pipelined processing allows the tracking on the GPU to run in parallel with the data transfer and the preprocessing on the CPU. In order for CPU pre- and postprocessing to keep pace with the GPU, the pipeline uses multiple threads. Splitting the tracking into multiple phases, which first search for short local track segments, improves data locality and makes the algorithm well suited to run on a GPU. Thanks to special optimizations, this approach is not inferior to a global one. Because floating-point arithmetic is non-associative, a bitwise comparison of the GPU and CPU trackers is infeasible. A track-by-track and cluster-by-cluster comparison shows a concordance of 99.999%. With current hardware, the GPU tracker outperforms the CPU version by about a factor of three, leaving the processor available for other tasks.
Is part of:
Journal of Physics: Conference Series, 2012, Vol.396(1), p.012044 (8pp)
1742-6588 (ISSN); 1742-6596 (E-ISSN); 10.1088/1742-6596/396/1/012044 (DOI)
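The pipelining described in the abstract above can be sketched as three stages, CPU preprocessing, GPU tracking, and CPU postprocessing, connected by queues so that different events occupy different stages at the same time. The stage bodies, queue layout, and names here are assumptions for illustration, not the ALICE HLT code.

```python
# Minimal sketch of a three-stage event pipeline: each stage runs in its
# own thread and passes events downstream through a queue; None marks
# end-of-stream. Stage work is a placeholder.
from queue import Queue
from threading import Thread

def run_pipeline(events):
    pre_q, post_q, out = Queue(), Queue(), []

    def preprocess():                      # CPU stage 1
        for ev in events:
            pre_q.put(ev)                  # real code would prepare GPU input
        pre_q.put(None)                    # end-of-stream marker

    def gpu_track():                       # GPU stage (placeholder)
        while (ev := pre_q.get()) is not None:
            post_q.put(("tracked", ev))    # real code would run the tracker
        post_q.put(None)

    def postprocess():                     # CPU stage 2
        while (item := post_q.get()) is not None:
            out.append(item)

    stages = [Thread(target=f) for f in (preprocess, gpu_track, postprocess)]
    for t in stages:
        t.start()
    for t in stages:
        t.join()
    return out
```

Because each queue has a single producer and a single consumer, event order is preserved end to end, while the three stages overlap in time, which is the point of the pipeline.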