Empirical evaluation of amplifying privacy by subsampling for GANs to create differentially private synthetic tabular data - UTU Research Information System

A4 Peer-reviewed article in conference proceedings

Authors: Nieminen Valtteri A., Pahikkala Tapio, Airola Antti

Editors: Jussi Kasurinen, Tero Päivärinta

Established conference name: Annual Symposium for Computer Science

Publisher: CEUR-WS

Publication year: 2023

Journal: CEUR Workshop Proceedings

Title of the compiled work: TKTP 2023: Annual Symposium for Computer Science 2023: Proceedings of the 40th Anniversary Symposium of the Finnish Society for Computer Science

Journal name in the database: CEUR Workshop Proceedings

Series title: CEUR Workshop Proceedings

Volume: 3506

First page: 72

Last page: 81

ISSN: 1613-0073

Open access status at the time of recording: Openly available

Open access status of the publication channel: Fully open publication channel

Web address: https://ceur-ws.org/Vol-3506/

Self-archived copy's address: https://research.utu.fi/converis/portal/detail/Publication/181712336

Self-archived copy's license: CC BY

Self-archived copy's version: Publisher's version

Abstract

Privacy concerns often limit the sharing of sensitive data collected from individuals. One proposed solution for enabling secondary use is privacy-preserving synthetic data that attempts to mimic real data. Owing to their success on non-private tasks, generative adversarial networks (GANs) trained with differentially private stochastic gradient descent (DPSGD) have been popular for generating differentially private (DP) synthetic data. In recent years, a prominent approach to obtaining better differential privacy guarantees has been to train ensembles of discriminator networks with DPSGD on mutually exclusive subsets of the data, taking advantage of the synergy between GANs and privacy amplification by subsampling. However, this research has been done almost exclusively on images, and empirical evaluations of this strategy on other types of data are lacking. This work focuses on the effects of subsampling in creating DP synthetic tabular data with GANs. We evaluate synthetic data utility by training classification models on synthetic data and testing them on real data at varying subsampling rates. Further, we complement the evaluation with a qualitative examination of the generated data. Our findings show that while subsampling does bring benefits with tabular data in terms of the prediction performance of classifiers trained on synthetic data, the resulting samples can be very distorted compared to the original real data. The results suggest that the benefits obtainable via this method of training DP GANs can differ significantly based on the type of data used.
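The idea of privacy amplification by subsampling mentioned in the abstract can be illustrated with a small numeric sketch. It assumes the standard textbook bound for a mechanism that satisfies (ε, δ)-DP and is run on a random subsample in which each record is included independently with probability q: the subsampled mechanism then satisfies (log(1 + q(e^ε − 1)), qδ)-DP. This is not the privacy accounting used in the paper, which this record does not describe; the function names `amplified_epsilon` and `amplified_delta` are hypothetical and chosen only for this illustration.

```python
import math


def amplified_epsilon(epsilon: float, q: float) -> float:
    """Epsilon after subsampling with inclusion probability q.

    Assumes the base mechanism is (epsilon, delta)-DP and that each record
    enters the subsample independently with probability q (an assumption of
    this sketch, not a detail taken from the paper).
    """
    if epsilon < 0:
        raise ValueError("epsilon must be non-negative")
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must lie in [0, 1]")
    # log(1 + q * (exp(epsilon) - 1)), written with log1p/expm1 for
    # numerical stability when epsilon or q is small.
    return math.log1p(q * math.expm1(epsilon))


def amplified_delta(delta: float, q: float) -> float:
    """Delta after subsampling: it scales linearly with q under the same assumptions."""
    if not 0.0 <= delta <= 1.0:
        raise ValueError("delta must lie in [0, 1]")
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must lie in [0, 1]")
    return q * delta


if __name__ == "__main__":
    base_epsilon, base_delta = 1.0, 1e-5  # illustrative values only
    for q in (1.0, 0.5, 0.1, 0.01):
        eps = amplified_epsilon(base_epsilon, q)
        dlt = amplified_delta(base_delta, q)
        print(f"q={q:<5} epsilon'={eps:.4f} delta'={dlt:.2e}")
```

With q = 1 the guarantee is unchanged, and smaller subsampling rates give a smaller effective ε, which is the effect the ensemble-of-discriminators strategy tries to exploit.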

Downloadable publication

This is an electronic reprint of the original article.
This reprint may differ from the original in pagination and typographic detail. Please cite the original version.

paper06.pdf