A4 Article in conference proceedings
Implementation of multioperations in thick control flow processors

List of Authors: Martti Forsell, Jussi Roivainen, Ville Leppänen, Jesper Larsson Träff
Publisher: Institute of Electrical and Electronics Engineers Inc.
Publication year: 2018
Book title: 2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
Journal name in source: Proceedings - 2018 IEEE 32nd International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2018
ISBN: 978-1-5386-5556-6
eISBN: 978-1-5386-5555-9


Multioperations are primitives of parallel computation in which
processors perform a reduction, e.g. addition, on values provided by
multiple threads into a single value in a constant number of steps.
Algorithmically, multioperations can speed up execution by a logarithmic
factor over their single operation counterparts. In this paper, we
propose an architectural technique for realizing multioperations in
thick control flow processors. Thick control flows (TCFs) are
computational constructs that simplify parallel programming by bundling a
number of homogeneous threads following the same control path into
universalized vector-like entities. The elements of TCFs are called
fibers to distinguish them from ordinary threads having their own
individual control. Processors designed for executing TCFs feature a
unique frontend-backend structure to provide low-latency processing of
TCF-common computations and high-throughput execution of data parallel
fibers. Our proposal relies on step caches and equally sized
multioperation scratchpads, while on the memory side, we make use of
active memory modules. The idea is to compute partial results in backend
units to reduce the traffic to the referenced shared memory location. The
final result is then computed in the active memory unit of the target
memory module. In our evaluation, a TCF-aware processor equipped with
multioperation scratchpads and active memory units indeed executes
certain algorithms on N data elements log N times faster than the
baseline processor. We also give a preliminary evaluation of the
implementation cost.
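
The two-stage combining described in the abstract can be illustrated
with a small software model. This is only a sketch under stated
assumptions, not the hardware design from the paper: the function name
multiadd, the parameter fibers_per_backend, and the use of dictionaries
to stand in for the multioperation scratchpads and the active memory
unit are all hypothetical, and the code runs sequentially, so it models
the reduction in memory traffic rather than the constant-step timing.

```python
from collections import defaultdict


def multiadd(requests, fibers_per_backend):
    """Software model of a two-stage multi-add (illustrative only).

    requests: list of (address, value) pairs, one per fiber, in fiber order.
    fibers_per_backend: hypothetical number of fibers served by one backend unit.
    Returns (memory, messages): the final sums per address and the number of
    partial results sent to the memory side.
    """
    # Stage 1: each backend unit combines the values of its own fibers per
    # target address. The dict stands in for a multioperation scratchpad.
    partials = []
    for start in range(0, len(requests), fibers_per_backend):
        scratchpad = defaultdict(int)
        for address, value in requests[start:start + fibers_per_backend]:
            scratchpad[address] += value
        partials.append(scratchpad)

    # Stage 2: the memory side folds the partial results into the target
    # location. This loop stands in for the active memory unit.
    memory = defaultdict(int)
    messages = 0
    for scratchpad in partials:
        for address, partial in scratchpad.items():
            memory[address] += partial
            messages += 1
    return dict(memory), messages


if __name__ == "__main__":
    # 16 fibers all add their value to address 0, with 4 fibers per backend.
    requests = [(0, v) for v in range(1, 17)]
    print(multiadd(requests, fibers_per_backend=4))
```

In this model, 16 fibers targeting one address with 4 fibers per backend
unit send 4 partial results to the memory side instead of 16 individual
values, which is the traffic reduction the abstract refers to.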
