
LLM Bug Detection

At a Glance

| Attribute | Detail |
| --- | --- |
| Category | LLM Bug Detection |
| Maturity | Emerging |
| Key Idea | Use large language models and transformer-based architectures to identify vulnerabilities in source code through pattern recognition, semantic understanding, and learned representations of insecure code |
| Representative Approaches | General-purpose LLMs (GPT-4, Claude), fine-tuned models (VulBERTa, LineVul, CodeBERT), prompt engineering strategies |
| Primary Targets | Source code in mainstream languages (C, C++, Java, Python, JavaScript) |

Overview

Large language models have demonstrated surprising capability at understanding and reasoning about source code. Trained on billions of lines of public code and natural-language documentation, models like GPT-4 and Claude can identify common vulnerability patterns, explain why code is insecure, and suggest fixes, all through natural-language interaction. Simultaneously, a parallel line of research has produced smaller, specialized transformer models fine-tuned specifically on vulnerability datasets, trading general reasoning ability for focused accuracy on narrow detection tasks.

The appeal of LLM-based bug detection is clear: traditional static analysis tools require extensive rule engineering by domain experts, operate on rigid syntactic patterns, and produce findings that are difficult for non-specialists to interpret. LLMs promise to lower these barriers by offering flexible, language-agnostic analysis with human-readable explanations. A developer can paste a code snippet into a chat interface and receive a plain-English assessment of potential vulnerabilities: no tool configuration, no build integration, no false-positive triage workflow.

However, the reality is more nuanced than the promise. LLMs operate on statistical pattern matching, not formal program analysis. They cannot execute code, track data flow across compilation units, or guarantee that their findings are sound. The field is evolving rapidly, with significant advances appearing quarterly, but practitioners should approach LLM-based vulnerability detection as a complement to (not a replacement for) established static and dynamic analysis tools.

This page surveys three categories of LLM-based bug detection: general-purpose LLMs used for ad-hoc code review, fine-tuned transformer models trained on vulnerability datasets, and prompt engineering techniques that improve detection quality.

Approach Profiles

General-Purpose LLMs for Code Review

General-purpose large language models (including OpenAI's GPT-4, Anthropic's Claude, Google's Gemini, and open-weight models like Meta's Llama and Mistral) were not trained specifically for vulnerability detection, but their broad code understanding enables them to identify many categories of security issues when prompted appropriately.

Capabilities. In controlled evaluations, general-purpose LLMs can detect several recurring categories of vulnerabilities:

  • Memory safety issues: buffer overflows, use-after-free, double-free, and null pointer dereferences in C/C++ code. LLMs can trace pointer lifecycles through moderate-complexity functions and identify missing bounds checks.
  • Injection vulnerabilities: SQL injection, command injection, XSS, and path traversal in web application code. These pattern-based vulnerabilities align well with the statistical patterns LLMs learn from training data.
  • Authentication and authorization flaws: missing access checks, insecure session handling, hardcoded credentials. LLMs can reason about the semantic intent of code and flag logic that deviates from common security patterns.
  • Cryptographic misuse: use of deprecated algorithms, weak key generation, improper IV handling. Well-known anti-patterns appear frequently in training data.
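
Injection flaws illustrate why such patterns suit statistical detection: the vulnerable and fixed forms differ in a small, highly regular way. A minimal Python/sqlite3 illustration (constructed for this page, not drawn from any benchmark) of the kind of pattern an LLM typically flags:

```python
import sqlite3

def find_user_vulnerable(conn, username):
    # SQL injection: user input is concatenated directly into the query,
    # so an input like "x' OR '1'='1" rewrites the query's logic.
    cur = conn.execute("SELECT id FROM users WHERE name = '" + username + "'")
    return cur.fetchall()

def find_user_safe(conn, username):
    # Parameterized query: the driver treats the input strictly as data.
    cur = conn.execute("SELECT id FROM users WHERE name = ?", (username,))
    return cur.fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'alice'), (2, 'bob')")

# The injected input returns every row from the vulnerable version...
assert len(find_user_vulnerable(conn, "x' OR '1'='1")) == 2
# ...but no rows from the parameterized version.
assert find_user_safe(conn, "x' OR '1'='1") == []
```

The contrast between the two functions is exactly the surface-level regularity that makes injection bugs among the most reliably flagged classes.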

Natural language explanations are a distinctive advantage. When an LLM identifies a potential vulnerability, it can explain the attack scenario, describe the conditions under which it is exploitable, and suggest a remediation, all in plain language that a developer without security expertise can understand. This contrasts with traditional SAST tools, whose findings often require security domain knowledge to interpret.

Limitations in practice. General-purpose LLMs face several fundamental constraints when used for vulnerability detection:

  • Context window limits restrict analysis to individual files or functions. Real vulnerabilities often span multiple files, modules, or even repositories. An LLM analyzing a single function cannot detect a use-after-free where allocation and deallocation occur in different compilation units.
  • No execution or data-flow tracking. LLMs simulate reasoning about code but do not perform actual symbolic execution, taint analysis, or pointer analysis. Their "analysis" is pattern matching on syntactic structure, which misses vulnerabilities that require semantic reasoning beyond what the model has learned.
  • Inconsistency. The same code snippet may receive different assessments in different conversations or with different phrasing. This non-determinism makes LLMs unsuitable as a sole gating mechanism in security workflows.
  • Training data contamination. LLMs trained on public code may have memorized known CVEs and their patches, leading to inflated performance on benchmarks that use historical vulnerabilities.

Practical Usage Pattern

General-purpose LLMs are most valuable as a first-pass review tool: a "second pair of eyes" that can flag areas of concern for deeper investigation with formal analysis tools. They are particularly useful during code review, where a developer can ask the model to assess a diff for security implications before merging.
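
A first-pass review of a diff might be driven by a prompt like the following sketch. The helper name and wording are illustrative; the actual LLM call is provider-specific and omitted:

```python
# Hypothetical helper that frames a diff for first-pass LLM security review.
# The prompt text is an illustrative convention, not a fixed standard.
def build_review_prompt(diff: str) -> str:
    return (
        "You are reviewing a code change for security implications.\n"
        "Flag only areas that warrant deeper investigation with dedicated "
        "SAST/DAST tools, and state your confidence for each finding.\n\n"
        "Diff:\n" + diff
    )

prompt = build_review_prompt(
    "- strcpy(buf, input);\n+ strncpy(buf, input, sizeof(buf));"
)
```

The resulting string would then be sent to whichever model the team uses; the key design point is asking for triage-oriented, confidence-qualified output rather than a verdict.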

Fine-Tuned Models: VulBERTa, LineVul, and CodeBERT

While general-purpose LLMs offer broad reasoning, a parallel research track has produced smaller transformer models fine-tuned specifically on vulnerability datasets. These models sacrifice general language understanding for focused performance on vulnerability classification and localization.

CodeBERT (Feng et al., EMNLP 2020) is a bimodal pre-trained model for programming languages and natural language. Developed by Microsoft Research, it is trained on six programming languages using both code and associated documentation. While not vulnerability-specific, CodeBERT serves as a foundation model that downstream vulnerability detection systems fine-tune for their specific tasks. Its bidirectional transformer architecture captures both local syntax and cross-function dependencies within its context window.

VulBERTa (Hanif and Maffeis, AAAI 2022 Workshop) adapts the RoBERTa architecture specifically for vulnerability detection in C/C++ code. It is pre-trained on a large corpus of C/C++ functions and then fine-tuned on labeled vulnerability datasets. VulBERTa introduces custom tokenization that preserves code-specific tokens (pointer operators, preprocessor directives) that generic NLP tokenizers would split or discard. On benchmark datasets, VulBERTa achieves competitive F1 scores with significantly fewer parameters than general-purpose LLMs.
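
The tokenization idea can be sketched with a small regex lexer. This illustrates the general approach of keeping code-specific tokens intact; it is not VulBERTa's actual tokenizer or vocabulary:

```python
import re

# Code-aware tokenization sketch: preprocessor directives and multi-character
# C/C++ operators are kept as single tokens instead of being split apart
# the way a generic NLP tokenizer would split them.
TOKEN_RE = re.compile(
    r"#\s*\w+"                                  # directives: #include, #define
    r"|->|\+\+|--|<<|>>|==|!=|<=|>=|&&|\|\|"    # multi-character operators
    r"|[A-Za-z_]\w*"                            # identifiers and keywords
    r"|0[xX][0-9a-fA-F]+|\d+"                   # integer literals
    r"|[^\s\w]"                                 # remaining single punctuation
)

def tokenize(code: str) -> list[str]:
    return TOKEN_RE.findall(code)

tokens = tokenize("#include <stdio.h>\nif (p->len >= cap) return;")
# "#include", "->", and ">=" each survive as one token.
assert "->" in tokens and ">=" in tokens and "#include" in tokens
```

Ordering the multi-character operators before the single-character fallback is what prevents `->` from degrading into `-` and `>`, which is the failure mode the paper's custom tokenizer avoids.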

LineVul (Fu and Tantithamthavorn, MSR 2022) extends the fine-tuned model approach by adding line-level vulnerability localization. Rather than classifying an entire function as vulnerable or not, LineVul identifies the specific lines of code that contribute to the vulnerability. This is achieved through an attention-based mechanism that highlights which tokens in the input receive the highest attention weights during the vulnerability classification decision. LineVul uses a fine-tuned CodeBERT backbone and was evaluated on the Big-Vul dataset, achieving strong results on both function-level detection and line-level localization tasks.
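
The localization step can be sketched as aggregating token-level attention into per-line scores and ranking lines. The tokens and weights below are invented for illustration; LineVul derives them from its fine-tuned CodeBERT backbone:

```python
# Attention-based line localization sketch in the spirit of LineVul:
# sum each token's attention weight into its source line, then rank lines.
def rank_lines(tokens, weights, line_of_token):
    scores = {}
    for tok, w, line in zip(tokens, weights, line_of_token):
        scores[line] = scores.get(line, 0.0) + w
    return sorted(scores, key=scores.get, reverse=True)

# Invented example: a strcpy call on line 3, an unrelated assignment on line 4.
tokens  = ["strcpy", "(", "buf", ",", "src", ")", "len", "=", "0"]
weights = [0.30, 0.05, 0.20, 0.02, 0.25, 0.03, 0.05, 0.05, 0.05]
lines   = [3, 3, 3, 3, 3, 3, 4, 4, 4]

ranked = rank_lines(tokens, weights, lines)
# Line 3 (the strcpy call) accumulates the most attention mass.
assert ranked[0] == 3
```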

Strengths:

  • Smaller model size enables deployment on standard hardware without GPU clusters or API costs
  • Task-specific fine-tuning achieves higher precision on narrow vulnerability categories than general-purpose models
  • Deterministic inference; the same input always produces the same output, enabling reproducible workflows
  • LineVul's localization capability provides actionable, line-specific findings similar to traditional SAST tools

Weaknesses:

  • Limited to vulnerability patterns present in training data; novel vulnerability classes are missed
  • Training datasets (Big-Vul, Devign, D2A) contain known label-noise issues that propagate into model behavior
  • Smaller context windows than general-purpose LLMs (typically 512 tokens) severely limit the code span that can be analyzed
  • Require retraining to support new languages or vulnerability classes
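
The ~512-token input limit is commonly worked around by sliding-window chunking, at the cost of losing cross-chunk context. A minimal sketch (window and overlap sizes are illustrative):

```python
# Sliding-window chunking sketch for models with a fixed input limit:
# overlapping windows preserve some context at chunk boundaries.
def chunk_tokens(tokens, window=512, overlap=64):
    step = window - overlap
    chunks = []
    for start in range(0, max(len(tokens) - overlap, 1), step):
        chunks.append(tokens[start:start + window])
    return chunks

chunks = chunk_tokens(list(range(1000)), window=512, overlap=64)
# 1000 tokens -> 3 windows starting at 0, 448, and 896; adjacent windows
# share 64 tokens, and every token appears in at least one window.
assert len(chunks) == 3
assert chunks[0][0] == 0 and chunks[-1][-1] == 999
```

A vulnerability whose cause and effect land in different windows is still missed, which is why chunking mitigates but does not remove the context limitation.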

Prompt Engineering for Security Analysis

The effectiveness of general-purpose LLMs for vulnerability detection depends heavily on how they are prompted. Naive prompts ("Is this code secure?") produce superficial, unreliable responses. Research and practitioner experience have identified several prompting strategies that significantly improve detection quality.

Chain-of-thought prompting asks the model to reason step-by-step about potential vulnerabilities before stating a conclusion. For example, a prompt might instruct the model: "First, identify all external inputs. Then, trace each input through the function to determine if it reaches a security-sensitive operation without validation. Finally, assess whether a vulnerability exists." This structured reasoning forces the model to engage with the code's data flow rather than pattern-matching on surface syntax, and produces more reliable findings.

Few-shot examples provide the model with examples of vulnerable and non-vulnerable code before asking it to analyze new code. By showing the model what a buffer overflow looks like alongside a correctly bounds-checked version, the prompt establishes a calibration baseline that improves classification accuracy. Few-shot examples are particularly valuable for domain-specific vulnerability classes that may be underrepresented in the model's training data.

Role-based prompting instructs the model to adopt the perspective of a security auditor or attacker. A typical prompt reads: "You are a senior security researcher performing a code audit. Identify all potential vulnerabilities in the following code, rate their severity using CVSS criteria, and explain each finding with an attack scenario." This framing activates the model's security-domain knowledge and produces more thorough analysis than generic code-review prompts.

Structured output formats request findings in a consistent schema (CWE classification, severity, affected lines, remediation), improving the actionability of LLM findings and enabling integration with vulnerability management workflows.
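
Structured output is typically requested as JSON so downstream tooling can parse it. The schema below (cwe / severity / lines / remediation) is an assumed convention for illustration, not a standard; real prompts must spell out whatever schema the workflow expects:

```python
import json

# Parsing structured LLM output into records under an assumed schema.
raw = """[
  {"cwe": "CWE-89", "severity": "high", "lines": [12, 14],
   "remediation": "Use parameterized queries."}
]"""

findings = json.loads(raw)
for f in findings:
    # Validate that each finding carries the fields the workflow needs.
    assert {"cwe", "severity", "lines", "remediation"} <= f.keys()
```

In practice a validation layer like this also catches the common failure mode of the model emitting prose around, or instead of, the requested JSON.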

Effective Security Prompt Structure

A well-structured security analysis prompt typically includes: (1) a role definition, (2) specific vulnerability categories to check, (3) a request for step-by-step reasoning, (4) an output format specification, and (5) an instruction to state confidence levels and flag uncertain findings explicitly.
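
The five parts can be assembled into one prompt as in this sketch; the category list and output fields are illustrative choices, not a fixed standard:

```python
# Assembling the five parts of a security analysis prompt.
def security_prompt(code: str) -> str:
    return "\n".join([
        # (1) role definition
        "You are a senior security researcher performing a code audit.",
        # (2) vulnerability categories to check
        "Check for: injection, memory safety, auth flaws, crypto misuse.",
        # (3) step-by-step reasoning
        "Reason step by step: identify external inputs, trace each to "
        "security-sensitive operations, then assess exploitability.",
        # (4) output format specification
        "Report each finding as: CWE ID, severity, affected lines, fix.",
        # (5) confidence levels
        "State a confidence level per finding and flag uncertain ones.",
        "",
        "Code:",
        code,
    ])

p = security_prompt("system(user_input);")
```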

Honest Assessment of Limitations

LLM-based vulnerability detection is a powerful new tool, but the field suffers from hype that can mislead practitioners into over-reliance. A clear-eyed assessment of current limitations is essential.

Hallucinations and false positives. LLMs can and do fabricate vulnerabilities that do not exist. A model may report a buffer overflow in code that correctly bounds-checks all accesses, confusing a superficially similar pattern for a genuinely vulnerable one. Unlike traditional SAST tools whose false-positive patterns are well-characterized, LLM false positives are unpredictable and context-dependent. In evaluations, false-positive rates for general-purpose LLMs on vulnerability detection tasks can exceed 50%, making unassisted use impractical for automated gating.

False negatives and blind spots. Equally concerning, LLMs miss real vulnerabilities, particularly those involving multi-step logic, complex control flow, or domain-specific semantics. A model that correctly identifies a simple SQL injection may miss a time-of-check-time-of-use (TOCTOU) race condition in the same codebase. There is no way to know what an LLM has missed, which means negative results ("the model found no vulnerabilities") provide no meaningful assurance.

Context window constraints. Even models with 100K+ token context windows can only analyze a fraction of a real codebase at once. Vulnerabilities that span multiple files, depend on configuration, or emerge from the interaction of separately-safe components are fundamentally invisible to single-pass LLM analysis. Techniques like retrieval-augmented generation (RAG) can partially mitigate this, but add complexity and latency.
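
RAG-style context assembly can be sketched as retrieving the repository snippets most related to the function under analysis and prepending them to the prompt. Real systems use embedding similarity; plain identifier overlap is used here for brevity, and the snippets are invented:

```python
import re

# Minimal retrieval sketch: rank repository snippets by how many
# identifiers they share with the code under analysis.
def identifiers(code):
    return set(re.findall(r"[A-Za-z_]\w*", code))

def retrieve(query_code, snippets, k=2):
    q = identifiers(query_code)
    ranked = sorted(snippets, key=lambda s: len(q & identifiers(s)),
                    reverse=True)
    return ranked[:k]

repo = [
    "void free_buf(struct buf *b) { free(b->data); }",
    "int parse_header(char *h);",
    "struct buf { char *data; size_t len; };",
]
# The definition and the free routine for `struct buf` rank highest,
# giving the model the cross-file context a single function lacks.
ctx = retrieve("void use(struct buf *b) { b->data[0] = 0; }", repo, k=2)
assert "free_buf" in ctx[0]
```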

Inability to execute code. LLMs reason about code textually; they cannot compile, execute, or dynamically test it. This means they cannot verify whether a suspected vulnerability is actually triggerable, cannot generate proof-of-concept exploits with high reliability, and cannot distinguish between theoretical and practical exploitability.

Reproducibility Concerns

LLM outputs are sensitive to model version, temperature setting, system prompt, and conversation history. A vulnerability assessment performed with GPT-4 in March 2025 may produce different results than the same analysis in June 2025 due to model updates. This creates audit and compliance challenges for organizations that need reproducible security assessments.
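
One partial mitigation is recording every input that influences the output alongside each assessment. A sketch of such an audit record (field names and the model string are illustrative; pinning a version alias still does not guarantee identical outputs if the provider updates the model behind it):

```python
import hashlib

# Audit record sketch for reproducible security assessments: pin the model
# identifier and temperature, and fingerprint the exact prompt used.
def audit_record(model, temperature, system_prompt, code):
    return {
        "model": model,          # illustrative pinned version string
        "temperature": temperature,
        "prompt_sha256": hashlib.sha256(
            (system_prompt + "\n" + code).encode()).hexdigest(),
    }

rec = audit_record("example-model-2025-03", 0.0,
                   "You are a security auditor.", "x = 1")
```

Temperature 0 and a pinned model version reduce, but do not eliminate, run-to-run variance.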

Benchmark contamination. Published evaluations of LLM vulnerability detection often use datasets derived from public CVEs. Since LLMs are trained on public data, they may have memorized specific vulnerability patterns and their fixes, leading to inflated benchmark performance that does not generalize to novel, unpublished vulnerabilities.

Comparison with Traditional SAST Tools

LLM-based and traditional static analysis approaches have complementary strengths:

| Dimension | Traditional SAST | General-Purpose LLMs | Fine-Tuned Models |
| --- | --- | --- | --- |
| Soundness | Rule-dependent; some tools offer formal guarantees | No formal guarantees | No formal guarantees |
| Precision | Moderate; well-characterized false-positive patterns | Low-moderate; unpredictable false positives | Moderate-high on trained categories |
| Scalability | Whole-program analysis (millions of LOC) | Limited by context window (single files/functions) | Limited by context window (single functions) |
| Language Support | Configured per language; broad coverage | Any language in training data | Trained language(s) only |
| Explanation Quality | Terse rule IDs; require domain expertise | Natural-language; accessible to non-specialists | Classification labels; limited explanation |
| Setup Cost | Build integration, rule configuration | API key or chat interface | Model hosting, fine-tuning pipeline |
| Novel Vulnerability Classes | Requires new rules | Can reason about unfamiliar patterns | Requires retraining |

Complementary, Not Competitive

The practical path forward is to use LLMs alongside traditional SAST, not as a replacement. SAST tools provide systematic, reproducible whole-program analysis. LLMs provide flexible, natural-language-explained findings that catch patterns SAST rules may not cover. Together, they offer broader coverage than either approach alone.
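
A combined workflow typically merges findings from both sources and deduplicates them. A sketch, with illustrative schemas not tied to any particular tool:

```python
# Merge SAST and LLM findings: deduplicate on (file, line, cwe) while
# tracking which source reported each finding, so overlapping reports
# (higher confidence) can be prioritized over single-source ones.
def merge_findings(sast, llm):
    merged = {}
    for source, findings in (("sast", sast), ("llm", llm)):
        for f in findings:
            key = (f["file"], f["line"], f["cwe"])
            merged.setdefault(key, set()).add(source)
    return merged

merged = merge_findings(
    [{"file": "db.py", "line": 12, "cwe": "CWE-89"}],
    [{"file": "db.py", "line": 12, "cwe": "CWE-89"},
     {"file": "auth.py", "line": 40, "cwe": "CWE-798"}],
)
# One overlapping finding (reported by both) plus one LLM-only finding.
assert merged[("db.py", 12, "CWE-89")] == {"sast", "llm"}
assert len(merged) == 2
```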




Glossary

| Term | Definition |
| --- | --- |
| Abstract interpretation | Mathematical framework for approximating program behavior using abstract domains |
| AEG | Automatic Exploit Generation, automated creation of working exploits from vulnerability information |
| AFL | American Fuzzy Lop, coverage-guided fuzzer |
| AFL++ | Community-maintained successor to AFL, the de facto standard coverage-guided fuzzer |
| ANTLR | ANother Tool for Language Recognition, parser generator used by grammar-aware fuzzers like Superion |
| ASan | AddressSanitizer, memory error detector |
| AST | Abstract Syntax Tree, tree representation of source code structure used by static analyzers |
| BOF | Buffer Overflow, writing data beyond allocated memory bounds, a common memory safety vulnerability |
| CFG | Control Flow Graph, directed graph representing all possible execution paths through a program |
| CGC | Cyber Grand Challenge, DARPA competition for autonomous vulnerability detection and patching |
| ClusterFuzz | Google's distributed fuzzing infrastructure that powers OSS-Fuzz |
| CodeQL | GitHub's query-based static analysis engine that treats code as a queryable database |
| Concolic | Concrete + Symbolic, execution that runs concrete values while tracking symbolic constraints |
| Corpus | Collection of seed inputs used by a coverage-guided fuzzer as the basis for mutation |
| Coverity | Synopsys commercial static analysis platform with deep interprocedural analysis |
| CPG | Code Property Graph, unified representation combining AST, CFG, and data-flow graph, used by Joern |
| CVE | Common Vulnerabilities and Exposures |
| CVSS | Common Vulnerability Scoring System, standard for rating vulnerability severity |
| CWE | Common Weakness Enumeration, categorization of software weakness types |
| DAST | Dynamic Application Security Testing, testing running applications for vulnerabilities |
| Dataflow analysis | Tracking how values propagate through a program to detect bugs like taint violations |
| DBI | Dynamic Binary Instrumentation, modifying program behavior at runtime without recompilation |
| DFG | Data Flow Graph, graph representing how data values propagate through a program |
| DPA | Differential Power Analysis, extracting cryptographic keys by analyzing power consumption variations |
| Fine-tuning | Adapting a pre-trained ML model to a specific task using additional training data |
| Frida | Dynamic instrumentation toolkit for injecting scripts into running processes |
| Harness | Glue code connecting a fuzzer to its target, defining how fuzzed input is delivered |
| HWASAN | Hardware-assisted AddressSanitizer, ARM-based variant of ASan with lower overhead |
| IAST | Interactive Application Security Testing, combines elements of SAST and DAST during testing |
| Infer | Meta's open-source static analyzer based on separation logic and bi-abduction |
| KLEE | Symbolic execution engine built on LLVM for automatic test generation |
| LLM | Large Language Model, neural network trained on text/code, used for bug detection and code generation |
| LSAN | LeakSanitizer, detector for memory leaks, often used alongside AddressSanitizer |
| Meltdown | CPU vulnerability exploiting out-of-order execution to read kernel memory from user space |
| MITRE | Non-profit organization that maintains CVE, CWE, and ATT&CK frameworks |
| MSan | MemorySanitizer, detector for reads of uninitialized memory |
| NIST | National Institute of Standards and Technology, US agency maintaining security standards and NVD |
| NVD | National Vulnerability Database, NIST-maintained repository of vulnerability data |
| OSS-Fuzz | Google's free continuous fuzzing service for open-source software |
| OWASP | Open Worldwide Application Security Project, community producing security guides and tools |
| RCE | Remote Code Execution, vulnerability allowing an attacker to run arbitrary code on a target system |
| RL | Reinforcement Learning, ML paradigm where agents learn through reward-based feedback |
| S2E | Selective Symbolic Execution, whole-system analysis platform combining QEMU with KLEE |
| SARIF | Static Analysis Results Interchange Format, standard for exchanging static analysis findings |
| SAST | Static Application Security Testing, analyzing source code for vulnerabilities without execution |
| SCA | Software Composition Analysis, identifying known vulnerabilities in third-party dependencies |
| Seed | Initial input provided to a fuzzer as the starting point for mutation |
| Semgrep | Lightweight open-source static analysis tool using pattern-matching rules |
| Side-channel | Attack vector exploiting physical implementation artifacts rather than algorithmic flaws |
| SMT | Satisfiability Modulo Theories, solver used by symbolic execution to find inputs satisfying path constraints |
| Spectre | Family of CPU vulnerabilities exploiting speculative execution to leak data across security boundaries |
| SQLi | SQL Injection, injecting malicious SQL into queries via unsanitized user input |
| SSRF | Server-Side Request Forgery, tricking a server into making requests to unintended destinations |
| SymCC | Compilation-based symbolic execution tool that is two to three orders of magnitude faster than KLEE |
| Taint analysis | Tracking the flow of untrusted data from sources to security-sensitive sinks |
| TOCTOU | Time-of-Check-Time-of-Use, race condition between validating a resource and using it |
| TSan | ThreadSanitizer, detector for data races in multithreaded programs |
| UAF | Use-After-Free, accessing memory after it has been deallocated |
| UBSan | UndefinedBehaviorSanitizer, detector for undefined behavior in C/C++ |
| Valgrind | Dynamic binary instrumentation framework for memory debugging and profiling |
| XSS | Cross-Site Scripting, injecting malicious scripts into web pages viewed by other users |