DAI-NET: Toward communication-aware collaborative training for the industrial edge - UTU Research Information System

A1 Peer-reviewed original article in a scientific journal

DAI-NET: Toward communication-aware collaborative training for the industrial edge

Authors: Mwase Christine, Jin Yi, Westerlund Tomi, Tenhunen Hannu, Zou Zhuo

Publisher: Elsevier BV

Publication year: 2024

Journal: Future Generation Computer Systems

Journal name in database: Future Generation Computer Systems

Volume: 155

First page: 193

Last page: 203

ISSN: 0167-739X

eISSN: 1872-7115

DOI: https://doi.org/10.1016/j.future.2024.01.027

Web address: https://doi.org/10.1016/j.future.2024.01.027

Abstract

The industrial edge generates an abundance of spatially distributed and dynamic data that needs to remain on-site for privacy and security reasons. Collaborative training at the edge can leverage this data to refine pre-trained models locally for specific industrial tasks and environments and have them adapt to local changes for enhanced performance, agility, and resilience. However, communication between the devices during training is a key bottleneck and is not modelled by existing frameworks such as MXNet, PyTorch and TensorFlow. This paper introduces DAI-NET, a co-simulation framework for examining communication and its associated costs, and provides results from an implementation using Python, OMNeT++ and INET. To validate it and showcase its utility, the developed platform is applied in the analysis of (i) the performance and cost of collaboratively training a Multilayer Perceptron model, and (ii) the influence of computational heterogeneity. Communication costs generated during the training are captured at the device and system levels. In computationally heterogeneous clusters, the root cause of stragglers is exposed. In addition, the key performance contributors are identified to be a cluster’s computation capability and the variation in the relative computation capabilities of its devices. This study is particularly useful for artificial intelligence of things (AIoT) systems, whose bandwidth and energy resources are limited. It paves the way for more practical research on communication-efficient algorithms, network protocols and architectures for the AIoT edge.