Efficient run-time systems for edge AI inference - UTU Research Information System

G5 Article-based doctoral dissertation

Efficient run-time systems for edge AI inference

Authors: Taufique, Zain

Place of publication: Turku

Publication year: 2026

Series name: Annales Universitatis Turkuensis F

Number in series: 88

ISBN: 978-952-02-0753-3

eISBN: 978-952-02-0754-0

ISSN: 2736-9390

eISSN: 2736-9684

Open access status of the publication at the time of recording: Openly available

Open access status of the publication channel: Fully open publication channel

Web address: https://urn.fi/URN:ISBN:978-952-02-0754-0

Abstract

Efficient orchestration of AI inference on heterogeneous edge platforms is crucial for meeting the stringent latency, energy, and accuracy requirements of edge AI applications. Edge devices integrate heterogeneous compute clusters, including CPUs, GPUs, and Neural Processing Units, each exhibiting asymmetric energy-performance characteristics. Modern AI workloads handle diverse runtime inference requests that arrive both continuously and in response to user prompts, each with distinct latency, accuracy, and priority requirements. Despite the availability of heterogeneous computational resources, existing scheduling mechanisms remain primarily conservative, failing to utilize these resources and violating workload and system constraints. This dissertation addresses the scheduling challenges of running inference for multiple AI workloads on resource-constrained edge platforms.

This dissertation develops an approximation-aware runtime resource management approach that jointly optimizes power, performance, and accuracy under strict energy constraints on heterogeneous edge platforms. The proposed TangoX framework targets systems with asymmetric CPU and GPU clusters, using a reinforcement learning-based orchestrator to coordinate cluster assignment, DVFS, and model precision to reduce latency while constraining accuracy degradation. To support compound AI workloads, Twill extends runtime coordination across heterogeneous CPU, GPU, and DLA clusters, integrating vision, transformer, and language models while enforcing task prioritization and cluster affinity to satisfy latency requirements within power budgets. A distributed edge framework further enables adaptive workload distribution by partitioning data across devices, balancing latency and accuracy under dynamic conditions. Complementing this, HiDP introduces hierarchical DNN partitioning across global and local edge layers, enabling scalable and coordinated inference.
All frameworks are implemented on heterogeneous edge hardware and evaluated using diverse workloads. Results demonstrate consistent reductions in inference latency and energy consumption compared with state-of-the-art techniques, with minimal impact on model accuracy. Collectively, these contributions advance the design of efficient, adaptive, and scalable edge AI systems.