Modern artificial intelligence models protection methods from adversarial attacks

Authors

S. Bondarovets, T. Okhrimenko

DOI:

https://doi.org/10.18372/2225-5036.31.21160

Keywords:

artificial intelligence, cybersecurity, artificial intelligence protection, large language models, adversarial attacks

Abstract

This article provides a comprehensive overview of the current state of artificial intelligence (AI) model security, systematizing vectors of targeted attacks and corresponding defense methods. The evolution of the threat landscape is analyzed, from classic attacks on machine learning (ML) models to specific vulnerabilities inherent in modern large language models (LLMs). The introduction outlines the problem's relevance in the context of AI's deep integration into critical infrastructure and business processes, emphasizing the shift from reactive vulnerability patching to proactive risk management, as reflected in industry standards like the NIST AI Risk Management Framework. The main body of the research begins with a detailed classification of modality-agnostic attacks, including adversarial examples, data poisoning, backdoors, model stealing, membership inference attacks, and model inversion. It then analyzes the most critical threats to modern systems based on criteria of prevalence, potential damage, and detection difficulty, focusing on supply chain attacks and data leakage. General-purpose defense methods are systematized according to the model lifecycle stages: at the data level (sanitization, differential privacy), during training (adversarial training, robust optimization), and at the inference stage (monitoring, security policies). A separate section is dedicated to the paradigm shift caused by LLMs. LLM-specific threats are examined in detail: prompt injection, including direct (jailbreaks) and indirect attacks in RAG systems, insecure output handling, and risks associated with fine-tuning and the use of external tools. Accordingly, multi-layered defense strategies for LLMs are analyzed, such as prompt hardening, the implementation of guardrails, red-teaming, and secure design of RAG systems and tools. 
An analytical synthesis summarizes the strengths and weaknesses of the reviewed approaches, assesses the risks of false positives and false negatives (FP/FN), and discusses their application conditions. The conclusion summarizes key findings and identifies promising directions for future research, including the development of formal security guarantees for LLMs, the standardization of benchmarks, and the advancement of comprehensive monitoring systems.

Published

2025-12-25

How to Cite

Bondarovets, S., & Okhrimenko, T. (2025). Modern artificial intelligence models protection methods from adversarial attacks. Ukrainian Scientific Journal of Information Security, 31(3), 142–150. https://doi.org/10.18372/2225-5036.31.21160

Issue

Vol. 31, No. 3 (2025)

Section

Cybersecurity & Critical Information Infrastructure Protection (CIIP)