Module 05 · LLM Security

LLM Security Threats

Large Language Models introduce an entirely new attack surface. Understanding these threats is the first step to deploying AI systems that don't become your biggest liability.

From prompt injection to model extraction, this is the OWASP Top 10 for LLMs — and what your security program needs to address right now.

Attack Surface

The LLM attack surface

LLMs are not traditional software. Every layer — from training data to user input to output consumption — is a potential attack vector.

Input layer
Prompt injection
Malicious instructions embedded in user input or retrieved context that override system prompts and intended behavior.
Model layer
Training & weights
Poisoned training data, backdoors in fine-tuned models, stolen weights, and supply chain attacks on model artifacts.
Output layer
Generated content
Hallucinated facts, leaked training data, harmful content generation, and downstream code execution from LLM output.

OWASP Top 10 for LLM Applications maps these into ten critical risk categories that every security team must understand.

OWASP LLM01

Prompt injection

Direct injection
User → Model
Attacker crafts input that overrides system instructions. "Ignore previous instructions and..." is the simplest form. Jailbreak prompts, role-play attacks, and encoding tricks (Base64, ROT13) bypass naive filters. The model treats all text as instructions — there is no separation between code and data.
Indirect injection
Context → Model
Malicious instructions hidden in external data the LLM retrieves: web pages, emails, documents, database records. When the model processes this context via RAG or tool use, it executes the embedded instructions. The user never sees the attack payload.

Prompt injection is the SQL injection of the AI era — a fundamental architectural flaw, not just a missing filter. No production-ready solution fully eliminates it today.

Training Attacks

Data poisoning & training data attacks

AttackMethodImpact
Data poisoningInject malicious samples into training dataModel learns attacker-chosen behaviors
Backdoor attacksTrigger phrases activate hidden behaviorsModel appears normal until trigger is present
Training data extractionPrompt model to regurgitate training dataPII, copyrighted content, credentials leaked
Fine-tuning attacksMalicious fine-tuning datasets on shared platformsSafety guardrails silently removed
Supply chain compromiseTampered model files on Hugging Face, etc.Backdoored models deployed into production

Verify model provenance. Use signed model cards and hash verification. Audit fine-tuning datasets the same way you audit code dependencies.

Model Extraction

Stealing the model itself

Technique
Query-based extraction
Attacker sends thousands of carefully crafted queries and uses the responses to train a clone model that replicates the target's behavior. Works against any model exposed via API. The clone may retain proprietary knowledge, fine-tuning, and system prompts.
Technique
Side-channel attacks
Timing analysis, token probability extraction, and logit lens techniques reveal model architecture and parameters. Even API-only access leaks information about hidden layers, vocabulary, and training distribution through response patterns.

Model extraction threatens intellectual property, enables offline attack development, and can expose proprietary training data. Rate limiting alone is not sufficient — implement output perturbation and query anomaly detection.

Jailbreaking & Hallucinations

Breaking guardrails, fabricating facts

Jailbreaking
Bypassing safety alignment
DAN ("Do Anything Now") prompts, persona attacks, multi-turn manipulation, and hypothetical framing systematically defeat safety training. New jailbreaks surface daily. Automated red-teaming tools like Garak generate novel bypass techniques faster than vendors can patch.
Hallucinations
Confident fabrication
LLMs generate plausible but false information with high confidence. In security contexts this means fabricated CVE numbers, nonexistent compliance requirements, invented API endpoints, and false incident attribution. Automated pipelines that trust LLM output without verification create real-world risk.

Hallucinations in security tooling are not just wrong — they're dangerously misleading. A fabricated CVE in an automated scan report wastes incident response time and erodes trust in AI-assisted workflows.

Defensive Measures

Building LLM defenses

Layer 1
Input guardrails
Classifier-based prompt injection detection, input sanitization, length limits, and semantic analysis of user queries before they reach the model.
Layer 2
Output filtering
Scan LLM responses for PII leakage, harmful content, prompt leakage, and policy violations before returning to users.
Layer 3
Red teaming
Continuous adversarial testing with automated tools (Garak, PyRIT) and manual red team exercises against deployed LLM systems.
Layer 4
Architecture
Least-privilege tool access, human-in-the-loop for high-risk actions, sandboxed execution, and deterministic validation layers.

Defense in depth applies to AI systems too. No single guardrail is sufficient — layer your defenses and assume every layer will be bypassed.

Real-World Incidents

When LLM security fails

Bing Chat
Indirect prompt injection via web pages. Researchers embedded hidden instructions in web pages that Bing Chat retrieved and executed, exfiltrating conversation history to attacker-controlled servers.
Samsung
Source code leaked via ChatGPT. Engineers pasted proprietary semiconductor source code into ChatGPT for debugging. The data became part of OpenAI's training pipeline. Samsung subsequently banned internal LLM use.
Chevrolet
Dealership chatbot jailbroken. Users convinced a Chevrolet dealer's ChatGPT-powered bot to agree to sell a Tahoe for $1 and to recommend competitor vehicles. The "binding agreement" went viral.
Air Canada
Chatbot hallucinated refund policy. An AI chatbot fabricated a bereavement fare discount that didn't exist. A tribunal ruled Air Canada liable for the hallucinated policy, setting legal precedent for AI-generated commitments.
Key lesson

Every incident above was predictable and preventable. The common thread: organizations deployed LLMs without threat modeling the AI-specific attack surface.

1 / 8