A1 Peer-reviewed original article in a scientific journal
DAI-NET: Toward communication-aware collaborative training for the industrial edge
Authors: Mwase Christine, Jin Yi, Westerlund Tomi, Tenhunen Hannu, Zou Zhuo
Publisher: Elsevier BV
Publication year: 2024
Journal: Future Generation Computer Systems
Journal name in database: Future Generation Computer Systems
Volume: 155
First page: 193
Last page: 203
ISSN: 0167-739X
eISSN: 1872-7115
DOI: https://doi.org/10.1016/j.future.2024.01.027
Web address: https://doi.org/10.1016/j.future.2024.01.027
Abstract
The industrial edge generates an abundance of spatially distributed and dynamic data that needs to remain on-site for privacy and security reasons. Collaborative training at the edge can leverage this data to refine pre-trained models locally for specific industrial tasks and environments and have them adapt to local changes for enhanced performance, agility, and resilience. However, communication between the devices during training is a key bottleneck and is not modelled by existing frameworks such as MXNet, PyTorch and TensorFlow. This paper introduces DAI-NET, a co-simulation framework for examining communication and its associated costs, and provides results from an implementation using Python, OMNeT++ and INET. To validate it and showcase its utility, the developed platform is applied in the analysis of (i) the performance and cost of collaboratively training a Multilayer Perceptron model, and (ii) the influence of computational heterogeneity. Communication costs generated during the training are captured at the device and system levels. In computationally heterogeneous clusters, the root cause of stragglers is exposed. In addition, the key performance contributors are identified as a cluster’s computation capability and the variation in the relative computation capabilities of its devices. This study is particularly useful for artificial intelligence of things (AIoT) systems, whose bandwidth and energy resources are limited. It paves the way for more practical research on communication-efficient algorithms, network protocols and architectures for the AIoT edge.