
AI/ML-Guided Fuzzing

At a Glance

| Attribute | Detail |
| --- | --- |
| Category | AI/ML-Guided Fuzzing |
| Maturity | Emerging / Research |
| Key Idea | Replace or augment hand-crafted heuristics in fuzzers with learned models that predict useful mutations, schedule seeds, or generate structured inputs |
| Representative Tools | NEUZZ, MTFuzz, FuzzGPT, TitanFuzz, ChatAFL |
| Primary Targets | C/C++ binaries, protocol implementations, API surfaces |

Overview

Traditional coverage-guided fuzzers like AFL and libFuzzer rely on evolutionary algorithms and manually designed mutation operators to explore program state spaces. While remarkably effective (Google's OSS-Fuzz has found tens of thousands of bugs with these techniques), they share a fundamental limitation: their exploration strategies are static heuristics that cannot adapt to the structure of the program under test.

Machine learning offers a path beyond this ceiling. By training models on execution traces, code structure, or input-output relationships, ML-guided fuzzers can learn which mutations are most likely to reach new code, how to schedule seeds for maximum exploration, and even how to generate syntactically valid inputs from scratch. The core insight is that program behavior is not random; it exhibits learnable patterns that a model can exploit to make fuzzing more efficient.

Three broad strategies have emerged for integrating ML into the fuzzing loop:

  1. Learned mutation strategies: neural networks predict which byte positions to mutate and what values to substitute, replacing or supplementing random bit flips and dictionary-based mutations.
  2. Coverage prediction: models estimate which inputs are likely to reach uncovered code regions, allowing the fuzzer to focus effort on high-potential seeds.
  3. Seed scheduling: rather than using simple heuristics like queue cycling or energy-based scheduling, ML models learn to prioritize seeds that are most likely to yield new coverage or trigger bugs.
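The third strategy can be made concrete with a small sketch. Here a hypothetical model score replaces a heuristic energy assignment: seeds are sampled with probability proportional to their predicted coverage gain. The `toy_model` stand-in is purely illustrative; a real system would score seeds with a trained network over execution features.

```python
import math
import random

def softmax(scores, temperature=1.0):
    """Convert model scores into a sampling distribution over seeds."""
    exps = [math.exp(s / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def schedule_seed(seeds, predict_gain, rng=random):
    """Pick the next seed to fuzz, weighted by predicted coverage gain.

    predict_gain stands in for a learned model; higher scores mean the
    model expects more new coverage from mutating that seed.
    """
    probs = softmax([predict_gain(s) for s in seeds])
    return rng.choices(seeds, weights=probs, k=1)[0]

# Toy stand-in model (hypothetical): favor shorter seeds.
corpus = [b"GET /", b"GET / HTTP/1.1\r\n", b"\x00\xff"]
toy_model = lambda seed: 1.0 / len(seed)
next_seed = schedule_seed(corpus, toy_model)
```

Sampling rather than always picking the top-scored seed preserves exploration, mirroring the exploration/exploitation balance that energy-based schedulers approximate heuristically.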

The maturity of these approaches varies widely. Reinforcement-learning and neural-network-guided mutation have the strongest empirical backing, with multiple peer-reviewed evaluations on standard benchmarks. LLM-guided fuzzing is newer and less thoroughly benchmarked but shows significant promise for structured input domains. All approaches share a common challenge: the overhead of model inference must be offset by sufficiently large gains in bug-finding efficiency to justify the added complexity.

ML-in-the-Loop Fuzzing Architecture

The following diagram illustrates how a machine learning component integrates into a standard fuzzing feedback loop:

```mermaid
graph LR
    A[Seed Corpus] --> B[Seed Scheduler]
    B --> C[ML Model]
    C --> D[Mutation Engine]
    D --> E[Test Harness]
    E --> F[Execution Monitor]
    F --> G{New Coverage?}
    G -- Yes --> H[Corpus Update]
    G -- No --> B
    H --> B
    F --> I[Crash Triage]
    C <--> J[Training Data<br/>Coverage Maps<br/>Execution Traces]
    F --> J

    style C fill:#1a7a6d,color:#fff
    style J fill:#1a7a6d,color:#fff
```

The ML model (highlighted) sits between the seed scheduler and the mutation engine. It consumes execution feedback (coverage maps, branch hit counts, execution traces) and uses that data to guide mutation decisions. The model may be pre-trained offline or updated online during the fuzzing campaign.
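The feedback loop in the diagram can be sketched in a few lines of Python. Everything here is schematic: `model`, `run_target`, and the retraining cadence are hypothetical stand-ins for the real components, not any particular tool's API.

```python
def fuzz_loop(corpus, run_target, model, iterations=1000):
    """Schematic ML-in-the-loop fuzzer following the diagram above.

    run_target(data) -> (coverage_set, crashed) stands in for the
    instrumented test harness plus execution monitor; model guides
    seed selection and mutation, and is retrained periodically from
    accumulated (input, coverage) pairs.
    """
    global_coverage, crashes, history = set(), [], []
    for _ in range(iterations):
        seed = model.pick_seed(corpus)              # seed scheduler
        candidate = model.mutate(seed)              # ML-guided mutation
        coverage, crashed = run_target(candidate)   # harness + monitor
        history.append((candidate, coverage))
        if crashed:
            crashes.append(candidate)               # crash triage
        if coverage - global_coverage:              # new coverage?
            global_coverage |= coverage
            corpus.append(candidate)                # corpus update
        if len(history) % 100 == 0:
            model.update(history)                   # periodic retraining
    return corpus, crashes
```

The key structural difference from a classic AFL-style loop is the `model.update` call: execution feedback flows back into the model, not just into the corpus.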

Approach Profiles

Reinforcement Learning Approaches: NEUZZ and MTFuzz

Reinforcement learning (RL) and neural-network-guided mutation represent the most mature category of ML-assisted fuzzing. The central idea is to train a neural network that approximates the relationship between input bytes and program branches, then use gradient information from this model to guide mutations toward unexplored code.

NEUZZ (She et al., IEEE S&P 2019) pioneered this approach by training a feedforward neural network to model the program's branching behavior as a smooth, differentiable function. Given an input and its observed coverage bitmap, NEUZZ trains a surrogate model that predicts which branches will be taken. It then computes gradients of this model with respect to input bytes (identifying which byte positions have the highest influence on uncovered branches) and applies gradient-guided mutations to maximize the probability of reaching new code.

The key innovation is neural program smoothing: real programs have discrete, discontinuous branching behavior that is hostile to gradient-based optimization. NEUZZ's surrogate model smooths this landscape, creating a differentiable approximation that gradient descent can navigate. In evaluations on the LAVA-M benchmark and real-world programs, NEUZZ found significantly more bugs than AFL in the same time budget, demonstrating that learned mutation strategies can outperform random mutation on structured programs.
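The gradient-guided mutation step can be illustrated without a real neural network. Assuming per-byte gradient magnitudes have already been computed from a surrogate model (the `grad` array below is a made-up example), a NEUZZ-style mutator perturbs the highest-influence bytes first rather than flipping bits at random. This is a simplified sketch, not NEUZZ's actual mutation schedule.

```python
import numpy as np

def gradient_guided_mutations(seed: bytes, grad: np.ndarray, top_k=2):
    """Yield mutants perturbing the bytes a surrogate model marks as
    most influential on an uncovered branch (NEUZZ-style, simplified).

    grad[i] approximates |d(branch output)/d(byte i)| taken from the
    smoothed surrogate; gradient sign handling is omitted for brevity.
    """
    influential = np.argsort(-grad)[:top_k]        # highest-gradient bytes
    for pos in influential:
        for delta in (1, -1, 16, -16, 127):        # step sizes along the gradient
            mutant = bytearray(seed)
            mutant[pos] = (mutant[pos] + delta) % 256
            yield bytes(mutant)

# Hypothetical gradients: byte 3 dominates the target branch.
seed = b"HEAD"
grad = np.array([0.01, 0.02, 0.05, 0.9])
mutants = list(gradient_guided_mutations(seed, grad))
```

With random mutation, all four byte positions are equally likely targets; here the budget concentrates on the two bytes the surrogate says actually influence the branch.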

MTFuzz (She et al., ESEC/FSE 2020) extends NEUZZ's approach with multi-task learning. Rather than training a single model, MTFuzz trains a shared neural network across multiple related fuzzing tasks simultaneously. This allows the model to transfer knowledge between programs; patterns learned from fuzzing one library can accelerate exploration of another with similar structure. MTFuzz also improves the training efficiency of the surrogate model by using a more sophisticated loss function that better captures branch reachability.

Strengths:

  • Gradient-guided mutations are more targeted than random bit flips, concentrating effort on input bytes that influence control flow
  • Neural program smoothing converts a discrete optimization problem into a continuous one amenable to well-understood gradient descent
  • Demonstrated improvements over AFL on standard benchmarks (LAVA-M, FuzzBench programs)
  • MTFuzz's multi-task approach enables cross-program knowledge transfer

Weaknesses:

  • Model training introduces overhead; the fuzzer must periodically pause mutation to retrain the surrogate model on new coverage data
  • The surrogate model's approximation degrades as the program's branching behavior becomes more complex or stateful
  • Gradient-guided mutations are most effective for flat input formats (e.g., binary blobs) and less naturally suited to highly structured inputs like parsers or protocol messages
  • Scalability to very large programs (millions of branches) remains an open question

Use Cases:

  • Fuzzing C/C++ libraries where traditional coverage-guided fuzzing has plateaued
  • Programs with complex input-to-branch relationships that random mutation struggles to navigate
  • Research environments with compute budget to absorb model training overhead

LLM-Guided Approaches: FuzzGPT and TitanFuzz

Large language models bring a fundamentally different capability to fuzzing: they can understand code semantics, generate syntactically valid programs, and reason about edge cases in ways that mutation-based approaches cannot. Rather than learning byte-level mutation strategies, LLM-guided fuzzers leverage the model's pre-trained understanding of programming languages to generate meaningful test inputs.

TitanFuzz (Deng et al., ISSTA 2023) targets deep learning library APIs (PyTorch, TensorFlow) by using LLMs to generate valid API call sequences. The key challenge in fuzzing DL libraries is that valid inputs are not random byte strings; they are structured Python programs that must correctly compose API calls with type-compatible arguments. TitanFuzz uses a two-phase approach:

  1. Generation phase: an LLM (Codex or StarCoder) generates seed programs that exercise target API functions, using the model's knowledge of common usage patterns.
  2. Mutation phase: the LLM mutates existing programs by applying semantics-aware transformations: swapping equivalent API calls, modifying tensor shapes, or introducing edge-case argument values.
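The mutation phase can be sketched without a live LLM. The operators below are deterministic, simplified stand-ins for the semantics-aware transformations described above; the equivalence table and the shape-tweaking rule are illustrative inventions, not TitanFuzz's actual rules.

```python
import re
import random

# Illustrative table of roughly interchangeable PyTorch calls.
EQUIVALENT_APIS = {"torch.relu": "torch.nn.functional.relu"}

def swap_equivalent_api(program: str) -> str:
    """Replace one API call with a semantically similar one."""
    for old, new in EQUIVALENT_APIS.items():
        if old in program:
            return program.replace(old, new, 1)
    return program

def perturb_tensor_shape(program: str, rng=random) -> str:
    """Nudge one dimension of a torch.randn(...) shape toward an
    edge case (zero, one, or a very large size)."""
    def tweak(match):
        dims = [int(d) for d in match.group(1).split(",")]
        dims[rng.randrange(len(dims))] = rng.choice([0, 1, 2**16])
        return "torch.randn(%s)" % ", ".join(map(str, dims))
    return re.sub(r"torch\.randn\(([\d, ]+)\)", tweak, program, count=1)

seed_program = "x = torch.randn(3, 4)\ny = torch.relu(x)"
mutant = perturb_tensor_shape(swap_equivalent_api(seed_program))
```

In TitanFuzz itself these transformations are produced by prompting the LLM with masked spans of the seed program; the point of the sketch is only that mutations operate on program structure, not raw bytes.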

TitanFuzz discovered over 60 previously unknown bugs in PyTorch and TensorFlow, including several crashes and silent correctness errors. Its success demonstrates that LLMs can serve as effective test-case generators for APIs with complex preconditions that traditional fuzzers cannot easily satisfy.

FuzzGPT (Deng et al., ICSE 2024) builds on TitanFuzz's foundation by fine-tuning LLMs specifically on historical bug-triggering programs. The insight is that bugs tend to cluster around unusual or edge-case usage patterns that differ from typical code. FuzzGPT fine-tunes models on programs extracted from past bug reports, teaching the model to generate "unusual" code that is more likely to trigger errors. It also introduces an auto-prompting technique where the model iteratively refines its own prompts to produce more diverse test cases.

FuzzGPT found additional bugs in DL libraries beyond what TitanFuzz discovered, and its fine-tuning approach improved the bug-finding rate compared to using general-purpose LLMs with standard prompts.

Strengths:

  • LLMs can generate structurally valid, semantically meaningful test inputs without explicit grammar specifications
  • Pre-trained code understanding allows the model to target edge cases and unusual API usage patterns
  • Particularly effective for API fuzzing where valid inputs are structured programs rather than byte strings
  • FuzzGPT's fine-tuning on historical bugs focuses generation on high-value input regions

Weaknesses:

  • LLM inference is computationally expensive: generating one test case may take seconds, compared with the thousands of mutated inputs per second a traditional fuzzer executes
  • Model outputs are non-deterministic and may produce syntactically invalid programs that waste execution cycles
  • Limited to domains where the LLM has sufficient pre-training data (popular libraries and languages)
  • Difficult to integrate into continuous fuzzing pipelines due to cost and latency

Use Cases:

  • API fuzzing for complex libraries (deep learning frameworks, compilers, database engines)
  • Generating test cases for language interpreters and runtimes
  • Augmenting traditional fuzz corpora with semantically diverse seed programs

ChatAFL: Protocol Fuzzing with LLM Assistance

ChatAFL (Meng et al., NDSS 2024) applies LLMs to a domain where traditional fuzzing struggles most: stateful network protocol implementations. Protocols like TLS, SMTP, and FTP require a specific sequence of messages in the correct order, with each message depending on prior state. Random mutation overwhelmingly produces invalid message sequences that are rejected in early parsing stages.

ChatAFL enriches a protocol fuzzer's state machine using ChatGPT. Given an RFC or protocol specification, the LLM extracts message types, valid state transitions, and field constraints. During fuzzing, the LLM helps the fuzzer:

  • Enrich the state machine by inferring valid message sequences that the fuzzer's initial state model may have missed
  • Generate protocol-aware mutations that modify message fields while maintaining structural validity
  • Suggest state transitions that reach deeper protocol states where bugs are more likely to lurk
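A protocol state machine of the kind ChatAFL enriches can be represented as a simple transition table. The RTSP-like transitions below are hand-written illustrations; in ChatAFL the enrichment (here, the PAUSE transition) would be proposed by LLM queries against the RFC rather than written by hand.

```python
# Minimal stateful-protocol model: (state, message) -> next state.
# Transitions are illustrative RTSP-like examples, not ChatAFL's model.
BASE_TRANSITIONS = {
    ("INIT", "DESCRIBE"): "INIT",
    ("INIT", "SETUP"): "READY",
    ("READY", "PLAY"): "PLAYING",
    ("PLAYING", "TEARDOWN"): "INIT",
}

# Hypothetical LLM-suggested enrichment: PAUSE is valid while PLAYING.
LLM_ENRICHED = {("PLAYING", "PAUSE"): "READY"}

def run_sequence(messages, transitions, start="INIT"):
    """Replay a message sequence; return (final_state, depth_reached)."""
    state, depth = start, 0
    for msg in messages:
        nxt = transitions.get((state, msg))
        if nxt is None:          # invalid message: rejected early,
            break                # as most random mutations are
        state, depth = nxt, depth + 1
    return state, depth

seq = ["SETUP", "PLAY", "PAUSE", "PLAY"]
base = run_sequence(seq, BASE_TRANSITIONS)
enriched = run_sequence(seq, {**BASE_TRANSITIONS, **LLM_ENRICHED})
```

With the base model the sequence is cut short at the unknown PAUSE message; with the enriched model the same sequence replays in full and reaches deeper protocol states.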

In evaluations on protocol implementations including live555 (RTSP), OpenSSL (TLS), and Exim (SMTP), ChatAFL achieved higher code coverage and found more unique bugs than the baseline protocol fuzzer AFLNet. The approach effectively bridges the gap between the fuzzer's limited protocol model and the complexity of real-world protocol implementations.

Strengths:

  • LLM-extracted protocol knowledge reduces the manual effort of writing protocol grammars
  • State machine enrichment enables deeper exploration of protocol state spaces
  • Combines well with existing protocol fuzzers as an augmentation layer

Weaknesses:

  • Depends on the LLM having accurate knowledge of the target protocol; obscure or proprietary protocols may yield poor results
  • LLM queries add latency to each fuzzing iteration
  • Protocol state machines extracted by LLMs may contain hallucinated transitions that waste fuzzing effort

Use Cases:

  • Fuzzing network protocol implementations (TLS, HTTP/2, DNS, MQTT)
  • Rapid prototyping of protocol fuzzers without writing custom grammars
  • Augmenting existing stateful fuzzers like AFLNet or StateAFL

Neural Program Smoothing

Neural program smoothing is the theoretical foundation underlying approaches like NEUZZ and is worth understanding as a standalone concept. Real programs make hard binary decisions at branch points; a comparison such as if (x > 42) creates a discontinuity that gradient-based optimization cannot traverse. Neural program smoothing replaces the actual program with a neural network approximation that maps inputs to coverage outcomes through smooth, differentiable functions.

The surrogate model is trained on observed input-coverage pairs and learns to interpolate between them. Where the real program has a cliff edge at x = 42, the neural model produces a smooth sigmoid-like transition. Gradient descent on this smooth landscape can identify input modifications that cross the real program's branch boundaries, guiding mutations toward new coverage.

This technique is most effective for programs with moderate branching complexity. For programs with very deep or highly stateful control flow, the surrogate model's approximation becomes increasingly inaccurate, and the gradients it produces may mislead the fuzzer.
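The cliff-edge example can be made concrete. Below, the hard branch if (x > 42) is approximated by a sigmoid surrogate (the temperature and step size are arbitrary choices for illustration); gradient ascent on the smooth surrogate walks an input across the real branch boundary, even though the true branch function has zero gradient almost everywhere.

```python
import math

def real_branch(x):
    """The actual program: a hard, non-differentiable decision."""
    return 1.0 if x > 42 else 0.0

def surrogate(x, temperature=5.0):
    """Smooth sigmoid approximation learned around the boundary."""
    return 1.0 / (1.0 + math.exp(-(x - 42) / temperature))

def surrogate_grad(x, temperature=5.0):
    """d(surrogate)/dx: nonzero everywhere, unlike the real branch."""
    s = surrogate(x, temperature)
    return s * (1 - s) / temperature

# Gradient ascent from x = 30, where the real branch gives no signal.
x = 30.0
for _ in range(200):
    x += 50.0 * surrogate_grad(x)   # step toward the uncovered branch
```

After the loop, x has crossed 42 and the real program takes the previously uncovered branch. In NEUZZ the surrogate is a neural network over whole input byte vectors and the gradient is taken per byte, but the mechanism is the same: smoothing supplies a usable search direction where the real program supplies none.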

Comparison Table

| Technique | Input Domain | Training Data | Overhead | Maturity | Notable Results |
| --- | --- | --- | --- | --- | --- |
| NEUZZ (neural-guided) | Binary / flat formats | Coverage bitmaps | Moderate (periodic retraining) | Research (2019) | Outperformed AFL on LAVA-M |
| MTFuzz (multi-task) | Binary / flat formats | Coverage bitmaps (multi-program) | Moderate | Research (2020) | Cross-program transfer learning |
| TitanFuzz (LLM generation) | API call sequences | Pre-trained LLM | High (LLM inference) | Research (2023) | 60+ bugs in PyTorch/TensorFlow |
| FuzzGPT (fine-tuned LLM) | API call sequences | Bug-triggering programs | High (LLM inference) | Research (2024) | Improved on TitanFuzz results |
| ChatAFL (LLM + protocol) | Network protocols | RFC / protocol specs | Moderate-High | Research (2024) | Deeper coverage on TLS, SMTP |
| Neural smoothing | Binary / flat formats | Coverage bitmaps | Moderate | Foundational | Enables gradient-guided mutation |

Current Limitations and Open Problems

Despite promising results, ML-guided fuzzing faces several challenges that limit its practical adoption:

Overhead versus throughput trade-off. Traditional fuzzers achieve their effectiveness through sheer speed; AFL can execute tens of thousands of test cases per second on simple targets. ML-guided fuzzers sacrifice throughput for smarter mutations, and the break-even point is not always favorable. For targets where random mutation can quickly achieve high coverage, the overhead of model training and inference may not be justified. The community lacks clear guidelines for when ML-guided fuzzing is worth the investment.

Benchmark representativeness. Many ML-guided fuzzers are evaluated on artificial benchmarks like LAVA-M, which inject synthetic bugs into programs. Performance on LAVA-M does not always predict effectiveness on real-world targets. More recent evaluations use FuzzBench and real-world CVEs, but the benchmark gap remains a concern.

Integration with existing workflows. ML-guided fuzzers are overwhelmingly research prototypes. They typically lack the engineering polish, documentation, and CI/CD integration that practitioners expect. Integrating a neural-network-guided fuzzer into a continuous fuzzing pipeline requires ML infrastructure (GPU access, model serving) that most software teams do not have.

Reproducibility and determinism. Neural network training introduces non-determinism that makes fuzzing campaigns harder to reproduce. Two runs with the same seed corpus may explore different paths due to stochastic model initialization and training dynamics. For security teams that need reproducible results, this is a significant concern.

Scalability to large programs. Most ML-guided fuzzing evaluations target programs with tens of thousands of branches. Real-world targets like web browsers, operating system kernels, and database engines have millions of branches. Whether surrogate models and gradient-guided mutations scale to this complexity is an open question with limited empirical evidence.

Hybrid Approaches

The most promising near-term direction may be hybrid systems that use ML models for seed scheduling and corpus distillation while relying on traditional mutation for throughput-critical inner loops. This architecture captures the benefits of learned strategies without sacrificing the raw speed that makes fuzzing effective.




Glossary

| Term | Definition |
| --- | --- |
| AFL | American Fuzzy Lop, coverage-guided fuzzer |
| ASan | AddressSanitizer, memory error detector |
| CVE | Common Vulnerabilities and Exposures |
| AFL++ | Community-maintained successor to AFL, the de facto standard coverage-guided fuzzer |
| AEG | Automatic Exploit Generation, automated creation of working exploits from vulnerability information |
| ANTLR | ANother Tool for Language Recognition, parser generator used by grammar-aware fuzzers like Superion |
| AST | Abstract Syntax Tree, tree representation of source code structure used by static analyzers |
| BOF | Buffer Overflow, writing data beyond allocated memory bounds, a common memory safety vulnerability |
| CFG | Control Flow Graph, directed graph representing all possible execution paths through a program |
| CGC | Cyber Grand Challenge, DARPA competition for autonomous vulnerability detection and patching |
| ClusterFuzz | Google's distributed fuzzing infrastructure that powers OSS-Fuzz |
| CodeQL | GitHub's query-based static analysis engine that treats code as a queryable database |
| Concolic | Concrete + symbolic, execution that runs concrete values while tracking symbolic constraints |
| Corpus | Collection of seed inputs used by a coverage-guided fuzzer as the basis for mutation |
| Coverity | Synopsys commercial static analysis platform with deep interprocedural analysis |
| CPG | Code Property Graph, unified representation combining AST, CFG, and data-flow graph, used by Joern |
| CVSS | Common Vulnerability Scoring System, standard for rating vulnerability severity |
| CWE | Common Weakness Enumeration, categorization of software weakness types |
| DAST | Dynamic Application Security Testing, testing running applications for vulnerabilities |
| DBI | Dynamic Binary Instrumentation, modifying program behavior at runtime without recompilation |
| DFG | Data Flow Graph, graph representing how data values propagate through a program |
| DPA | Differential Power Analysis, extracting cryptographic keys by analyzing power consumption variations |
| Frida | Dynamic instrumentation toolkit for injecting scripts into running processes |
| Harness | Glue code connecting a fuzzer to its target, defining how fuzzed input is delivered |
| HWASAN | Hardware-assisted AddressSanitizer, ARM-based variant of ASan with lower overhead |
| IAST | Interactive Application Security Testing, combines elements of SAST and DAST during testing |
| Infer | Meta's open-source static analyzer based on separation logic and bi-abduction |
| KLEE | Symbolic execution engine built on LLVM for automatic test generation |
| LLM | Large Language Model, neural network trained on text/code, used for bug detection and code generation |
| LSAN | LeakSanitizer, detector for memory leaks, often used alongside AddressSanitizer |
| Meltdown | CPU vulnerability exploiting out-of-order execution to read kernel memory from user space |
| MITRE | Non-profit organization that maintains CVE, CWE, and ATT&CK frameworks |
| MSan | MemorySanitizer, detector for reads of uninitialized memory |
| NVD | National Vulnerability Database, NIST-maintained repository of vulnerability data |
| NIST | National Institute of Standards and Technology, US agency maintaining security standards and NVD |
| OSS-Fuzz | Google's free continuous fuzzing service for open-source software |
| OWASP | Open Worldwide Application Security Project, community producing security guides and tools |
| RCE | Remote Code Execution, vulnerability allowing an attacker to run arbitrary code on a target system |
| RL | Reinforcement Learning, ML paradigm where agents learn through reward-based feedback |
| S2E | Selective Symbolic Execution, whole-system analysis platform combining QEMU with KLEE |
| SARIF | Static Analysis Results Interchange Format, standard for exchanging static analysis findings |
| SAST | Static Application Security Testing, analyzing source code for vulnerabilities without execution |
| SCA | Software Composition Analysis, identifying known vulnerabilities in third-party dependencies |
| Seed | Initial input provided to a fuzzer as the starting point for mutation |
| Semgrep | Lightweight open-source static analysis tool using pattern-matching rules |
| Side-channel | Attack vector exploiting physical implementation artifacts rather than algorithmic flaws |
| SMT | Satisfiability Modulo Theories, solver used by symbolic execution to find inputs satisfying path constraints |
| Spectre | Family of CPU vulnerabilities exploiting speculative execution to leak data across security boundaries |
| SQLi | SQL Injection, injecting malicious SQL into queries via unsanitized user input |
| SSRF | Server-Side Request Forgery, tricking a server into making requests to unintended destinations |
| SymCC | Compilation-based symbolic execution tool, two to three orders of magnitude faster than KLEE |
| Taint analysis | Tracking the flow of untrusted data from sources to security-sensitive sinks |
| TOCTOU | Time-of-Check-Time-of-Use, race condition between validating a resource and using it |
| TSan | ThreadSanitizer, detector for data races in multithreaded programs |
| UAF | Use-After-Free, accessing memory after it has been deallocated |
| UBSan | UndefinedBehaviorSanitizer, detector for undefined behavior in C/C++ |
| Valgrind | Dynamic binary instrumentation framework for memory debugging and profiling |
| XSS | Cross-Site Scripting, injecting malicious scripts into web pages viewed by other users |
| Fine-tuning | Adapting a pre-trained ML model to a specific task using additional training data |
| Abstract interpretation | Mathematical framework for approximating program behavior using abstract domains |
| Dataflow analysis | Tracking how values propagate through a program to detect bugs like taint violations |