Novel Textual Data Cleaning Techniques for Cybersecurity Recommendation Extraction and Prioritization using Local LLMs
Authors: Adeseye, Aisvarya; Isoaho, Jouni; Virtanen, Seppo; Mohammad, Tahir
Conference: International Conference on AI in Cybersecurity
Year: 2026
Published in: 2026 IEEE 5th International Conference on AI in Cybersecurity (ICAIC)
Pages: 1-6
ISBN: 978-1-6654-7762-8
ISBN: 978-1-6654-7761-1
DOI: https://doi.org/10.1109/ICAIC67076.2026.11395706
URL: https://ieeexplore.ieee.org/document/11395706
Understanding how different users perceive the security risks of digital services is important for improving user protection and system security. Interviews elicit rich information that captures user perspectives, but traditional qualitative analysis of interview data is slow and labor-intensive. Large Language Models (LLMs) offer a faster alternative: cybersecurity analysts can quickly gain useful insights with little expertise in qualitative methods. However, while expert interviews contain clear technical terms, non-experts rely on simple, non-technical language, which makes their responses unclear, noisy, and difficult to interpret for both human analysts and LLMs. To improve the extraction and prioritization of cybersecurity recommendations from qualitative transcripts, this study proposes nine novel textual data cleaning techniques rooted in digital signal processing (DSP). The resulting systematic cleaning pipeline reduces textual noise, structures interview data, and disambiguates unclear language for more consistent and accurate analysis. The impact of the pipeline was evaluated on an interview dataset of 82 participants (28 cybersecurity experts and 54 non-experts from diverse organizational sectors) using both software-assisted manual analysis (NVivo) and local LLM-based analysis with LLaMA v3.1 (8B) for theme extraction, recommendation extraction, and impact-based prioritization. Pipeline performance was measured using F1-score, Precision, False Positive Rate (FPR), Spearman's correlation (ρ), and Rank Hallucination Rate (RHR). The results showed improved accuracy and significantly fewer hallucinations for both evaluation methods, with the strongest improvements observed in the LLM output.
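Of the metrics listed, Spearman's correlation (ρ) quantifies how well a pair of priority rankings agree. A minimal sketch of that computation, assuming rankings without ties; the example rankings here are hypothetical and not taken from the paper's dataset:

```python
# Hedged sketch: Spearman's rank correlation (rho) between two rankings
# of the same n recommendations, e.g. a manually derived priority order
# vs. an LLM-derived one. Uses the standard formula for untied ranks:
#   rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1))
# where d_i is the rank difference for item i.

def spearman_rho(rank_a, rank_b):
    """Spearman's rho for two rankings of the same n items (no ties)."""
    n = len(rank_a)
    d_sq = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - (6 * d_sq) / (n * (n ** 2 - 1))

# Two illustrative priority orders over five recommendations (ranks 1..5):
# the LLM swaps items ranked 2 and 3, leaving the rest unchanged.
manual_ranks = [1, 2, 3, 4, 5]
llm_ranks = [1, 3, 2, 4, 5]

print(round(spearman_rho(manual_ranks, llm_ranks), 2))  # → 0.9
```

A ρ near 1 indicates that the LLM-based prioritization preserves the manually derived order; values near 0 or below indicate weak or inverted agreement.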