Implementation of multioperations in thick control flow processors - UTU Research Information System

A4 Peer-reviewed article in conference proceedings

Implementation of multioperations in thick control flow processors

Authors: Martti Forsell, Jussi Roivainen, Ville Leppänen, Jesper Larsson Träff

Established conference name: IEEE International Parallel and Distributed Processing Symposium

Publisher: Institute of Electrical and Electronics Engineers Inc.

Publication year: 2018

Book title: 2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)

Journal name in source: Proceedings - 2018 IEEE 32nd International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2018

First page: 744

Last page: 752

ISBN: 978-1-5386-5556-6

eISBN: 978-1-5386-5555-9

DOI: https://doi.org/10.1109/IPDPSW.2018.00121

Abstract

Multioperations are primitives of parallel computation in which
processors reduce values provided by multiple threads, e.g. by addition,
into a single value in a constant number of steps.
Algorithmically, multioperations can speed up execution by a logarithmic
factor over their single operation counterparts. In this paper, we
propose an architectural technique for realizing multioperations in
thick control flow processors. Thick control flows (TCF) are
computational constructs that simplify parallel programming by bundling a
number of homogeneous threads following the same control path into
universalized vector-like entities. The elements of TCFs are called
fibers to distinguish them from ordinary threads having their own
individual control. Processors designed for executing TCFs feature a
unique frontend-backend structure to provide low-latency processing of
TCF-common computations and high-throughput execution of data parallel
fibers. Our proposal relies on step caches and equally sized
multioperation scratchpads, while on the memory side, we make use of
active memory modules. The idea is to compute partial results in backend
units to reduce the traffic to the referred shared memory location. The
final result is then computed in the active memory unit of the target
memory module. In an evaluation, our TCF-aware processor equipped with
multioperation scratchpads and active memory units indeed executes
certain algorithms on N data elements log N times faster than the
baseline processor. The cost of the implementation is preliminarily
evaluated.
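
The two-level idea described in the abstract (partial results in the backend units, final result in the active memory unit) can be illustrated with a small sequential sketch. This is not the authors' implementation: the function names `backend_partial`, `active_memory_combine` and `multioperation`, the round-robin assignment of fibers to backend units, and the use of Python dictionaries in place of scratchpads and memory modules are all assumptions made for illustration. The sketch models only the data flow and the reduction in memory traffic, not timing or hardware structure.

```python
from operator import add


def backend_partial(refs, op=add):
    """Combine one backend unit's references per target address.

    `refs` is a list of (address, value) pairs issued by the fibers of a
    hypothetical backend unit; the returned dict stands in for its
    multioperation scratchpad.
    """
    scratchpad = {}
    for addr, value in refs:
        if addr in scratchpad:
            scratchpad[addr] = op(scratchpad[addr], value)
        else:
            scratchpad[addr] = value
    return scratchpad


def active_memory_combine(memory, partials, op=add):
    """Apply the partial results to memory, as an active memory unit might."""
    for scratchpad in partials:
        for addr, partial in scratchpad.items():
            memory[addr] = op(memory[addr], partial)
    return memory


def multioperation(memory, fiber_refs, num_backends, op=add):
    """Run a two-level multioperation and return the number of messages
    sent to memory (one per backend unit and distinct address)."""
    # Assumption: fibers are spread round-robin over the backend units.
    groups = [fiber_refs[i::num_backends] for i in range(num_backends)]
    partials = [backend_partial(group, op) for group in groups]
    messages = sum(len(p) for p in partials)
    active_memory_combine(memory, partials, op)
    return messages


if __name__ == "__main__":
    memory = {0: 0}
    refs = [(0, v) for v in range(1, 17)]  # 16 fibers add to address 0
    sent = multioperation(memory, refs, num_backends=4)
    print(memory[0], sent)
```

In this sketch, 16 fibers targeting the same address produce only 4 messages to memory instead of 16, which is the traffic reduction the abstract refers to; the constant-step behaviour itself depends on the hardware and is not modelled here.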