Lost in Translation: Analyzing Non-English Cybercrime Forums - UTU Research Portal

A4 Refereed article in a conference publication

Lost in Translation: Analyzing Non-English Cybercrime Forums

Authors: Mischinger, Mariella; Hughes, Jack; Vitiugin, Fedor; Pastrana, Sergio; Hutchings, Alice; Suarez-Tangil, Guillermo

Editors: N/A

Conference name: APWG Symposium on Electronic Crime Research

Publication year: 2025

Book title: 2025 APWG Symposium on Electronic Crime Research (eCrime)

ISBN: 979-8-3315-8970-7

eISBN: 979-8-3315-8969-1

DOI: https://doi.org/10.1109/eCrime66972.2025.11327989

Publication's open availability at the time of reporting: No Open Access

Publication channel's open availability: No Open Access publication channel

Web address: https://ieeexplore.ieee.org/document/11327989

Abstract

Cybercrime analysis and Cyber Threat Intelligence are crucial for understanding and defending against cyber threats, with online underground communities serving as a key source of information. Classification tasks are popular but demand significant manual effort and language-specific expertise. Prior work focuses on English-language forums, as non-English languages require fluent domain experts. We evaluate machine translation tools for their suitability in preserving contextual information in posts and find that GPT-4 is the most reliable. We leverage existing underground forum post classification pipelines to compare their performance on translated text and original language text. We find that classification performed on translated underground forum data is as effective as on original language text, enabling researchers to reuse existing pipelines. We then investigate fully machine-generated few-shot and zero-shot classification to reduce reliance on manual labeling, followed by a two-step machine-based classification that combines machine-generated labels with the existing classification pipeline. We find that machine-based labeling causes errors to propagate downstream. For tasks requiring high-quality label creation, human expertise remains essential. Finally, we provide a qualitative evaluation of disagreements in annotator labels of the original language and the translations, as well as disagreements between annotators and machine labeling.

Funding information in the publication:
We are grateful to Andrew Caines for organizing the Spanish and German annotations (supported by the Economic and Social Research Council (ESRC) (grant number ES/T008466/1)), as well as Anh V. Vu and Medhi Benatallah for the Vietnamese and Arabic annotations. Medhi Benatallah was supported by the King’s College Summer Research Programme. This work was supported by the project PID2022-143304OB-I00 funded by MICIU/AEI/10.13039/501100011033/ and by the ERDF, EU. Guillermo Suarez-Tangil is a 2020 RyC fellow RYC2020-029401-I, funded by MCIU/AEI/10.13039/501100011033 and the ESF Investing in your future. The same grant has funded Mariella Mischinger’s work. Jack Hughes and Alice Hutchings are supported by the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No 949127). Sergio Pastrana was supported by grant PID2023-150310OB-I00 (MORE4AIO) of the Spanish AEI. ChatGPT was used to support coding for data analysis and visualization, as well as to improve the writing style and grammar and syntax checks.