LLM Security Threats

Module 05 · LLM Security

Large Language Models introduce an entirely new attack surface. Understanding these threats is the first step to deploying AI systems that don't become your biggest liability.

From prompt injection to model extraction, this is the OWASP Top 10 for LLMs — and what your security program needs to address right now.

Attack Surface

The LLM attack surface

LLMs are not traditional software. Every layer — from training data to user input to output consumption — is a potential attack vector.

Input layer

Prompt injection

Malicious instructions embedded in user input or retrieved context that override system prompts and intended behavior.

Model layer

Training & weights

Poisoned training data, backdoors in fine-tuned models, stolen weights, and supply chain attacks on model artifacts.

Output layer

Generated content

Hallucinated facts, leaked training data, harmful content generation, and downstream code execution from LLM output.

OWASP Top 10 for LLM Applications maps these into ten critical risk categories that every security team must understand.

OWASP LLM01

Prompt injection

Direct injection

User → Model

Attacker crafts input that overrides system instructions. "Ignore previous instructions and..." is the simplest form. Jailbreak prompts, role-play attacks, and encoding tricks (Base64, ROT13) bypass naive filters. The model treats all text as instructions — there is no separation between code and data.

Indirect injection

Context → Model

Malicious instructions hidden in external data the LLM retrieves: web pages, emails, documents, database records. When the model processes this context via RAG or tool use, it executes the embedded instructions. The user never sees the attack payload.

Prompt injection is the SQL injection of the AI era — a fundamental architectural flaw, not just a missing filter. No production-ready solution fully eliminates it today.

Training Attacks

Data poisoning & training data attacks

Attack	Method	Impact
Data poisoning	Inject malicious samples into training data	Model learns attacker-chosen behaviors
Backdoor attacks	Trigger phrases activate hidden behaviors	Model appears normal until trigger is present
Training data extraction	Prompt model to regurgitate training data	PII, copyrighted content, credentials leaked
Fine-tuning attacks	Malicious fine-tuning datasets on shared platforms	Safety guardrails silently removed
Supply chain compromise	Tampered model files on Hugging Face, etc.	Backdoored models deployed into production

Verify model provenance. Use signed model cards and hash verification. Audit fine-tuning datasets the same way you audit code dependencies.

Model Extraction

Stealing the model itself

Technique

Query-based extraction

Attacker sends thousands of carefully crafted queries and uses the responses to train a clone model that replicates the target's behavior. Works against any model exposed via API. The clone may retain proprietary knowledge, fine-tuning, and system prompts.

Technique

Side-channel attacks

Timing analysis, token probability extraction, and logit lens techniques reveal model architecture and parameters. Even API-only access leaks information about hidden layers, vocabulary, and training distribution through response patterns.

Model extraction threatens intellectual property, enables offline attack development, and can expose proprietary training data. Rate limiting alone is not sufficient — implement output perturbation and query anomaly detection.

Jailbreaking & Hallucinations

Breaking guardrails, fabricating facts

Jailbreaking

Bypassing safety alignment

DAN ("Do Anything Now") prompts, persona attacks, multi-turn manipulation, and hypothetical framing systematically defeat safety training. New jailbreaks surface daily. Automated red-teaming tools like Garak generate novel bypass techniques faster than vendors can patch.

Hallucinations

Confident fabrication

LLMs generate plausible but false information with high confidence. In security contexts this means fabricated CVE numbers, nonexistent compliance requirements, invented API endpoints, and false incident attribution. Automated pipelines that trust LLM output without verification create real-world risk.

Hallucinations in security tooling are not just wrong — they're dangerously misleading. A fabricated CVE in an automated scan report wastes incident response time and erodes trust in AI-assisted workflows.

Defensive Measures

Building LLM defenses

Layer 1

Input guardrails

Classifier-based prompt injection detection, input sanitization, length limits, and semantic analysis of user queries before they reach the model.

Layer 2

Output filtering

Scan LLM responses for PII leakage, harmful content, prompt leakage, and policy violations before returning to users.

Layer 3

Red teaming

Continuous adversarial testing with automated tools (Garak, PyRIT) and manual red team exercises against deployed LLM systems.

Layer 4

Architecture

Least-privilege tool access, human-in-the-loop for high-risk actions, sandboxed execution, and deterministic validation layers.

Defense in depth applies to AI systems too. No single guardrail is sufficient — layer your defenses and assume every layer will be bypassed.

Real-World Incidents

When LLM security fails

Bing Chat

Indirect prompt injection via web pages. Researchers embedded hidden instructions in web pages that Bing Chat retrieved and executed, exfiltrating conversation history to attacker-controlled servers.

Samsung

Source code leaked via ChatGPT. Engineers pasted proprietary semiconductor source code into ChatGPT for debugging. The data became part of OpenAI's training pipeline. Samsung subsequently banned internal LLM use.

Chevrolet

Dealership chatbot jailbroken. Users convinced a Chevrolet dealer's ChatGPT-powered bot to agree to sell a Tahoe for $1 and to recommend competitor vehicles. The "binding agreement" went viral.

Air Canada

Chatbot hallucinated refund policy. An AI chatbot fabricated a bereavement fare discount that didn't exist. A tribunal ruled Air Canada liable for the hallucinated policy, setting legal precedent for AI-generated commitments.

Key lesson

Every incident above was predictable and preventable. The common thread: organizations deployed LLMs without threat modeling the AI-specific attack surface.

1 / 8