Static Analysis¶
At a Glance

- Category: Static Analysis
- Key Tools: CodeQL, Coverity, Infer, Semgrep, Clang Static Analyzer, Checkmarx
- Maturity: Mature
- Core Value: Find bugs in source code without executing it; scalable, repeatable, and automatable in CI/CD pipelines.
Overview¶
Static analysis examines source code, bytecode, or binaries without executing them, using mathematical and logical reasoning to identify potential defects, security vulnerabilities, and code quality issues. The field has matured considerably since the early days of lint-style checkers, and modern tools employ a range of sophisticated techniques.
AST-based analysis operates on the Abstract Syntax Tree representation of source code. Tools parse source into a tree structure and then apply pattern-matching rules to detect known anti-patterns; for example, identifying unchecked return values or insecure API usage. This approach is fast and straightforward, but it is limited to syntactic patterns: it cannot reason about data flowing through a program.
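To make the idea concrete, here is a toy AST checker built on Python's standard `ast` module. It flags calls to `eval`, a classic insecure-API pattern; the `SOURCE` snippet is invented for illustration.

```python
import ast

SOURCE = """
def load(cfg):
    return eval(cfg)       # insecure: evaluates arbitrary code

def parse(cfg):
    import json
    return json.loads(cfg)
"""

def find_eval_calls(source: str) -> list:
    """Walk the AST and report line numbers of calls to eval()."""
    tree = ast.parse(source)
    findings = []
    for node in ast.walk(tree):
        # Match the syntactic pattern: a Call whose callee is the bare name `eval`.
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id == "eval"):
            findings.append(node.lineno)
    return findings

print(find_eval_calls(SOURCE))  # flags the line calling eval
```

Real AST-based tools ship hundreds of such rules, but each has the same shape: parse, walk, match a syntactic pattern.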
Dataflow analysis tracks how values propagate through a program. Taint analysis (a specialized form of dataflow analysis) follows user-controlled input from sources (e.g., network reads, environment variables) to sensitive sinks (e.g., SQL queries, system calls). This enables detection of injection vulnerabilities, buffer overflows, and information leaks. Dataflow analysis can be intraprocedural (within a single function) or interprocedural (across function boundaries), with the latter being more precise but computationally expensive.
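A minimal sketch of intraprocedural taint tracking over an invented three-address mini-IR (the instruction format and the example program are illustrative, not any real tool's representation):

```python
# Toy intraprocedural taint analysis over a small three-address IR.
# ("input", dst) marks dst tainted (a source); ("assign", dst, src) propagates
# taint; ("concat", dst, a, b) joins; ("sink", var) is a sensitive use.

def taint_analysis(instructions):
    tainted = set()
    findings = []
    for i, instr in enumerate(instructions):
        op = instr[0]
        if op == "input":                      # source: user-controlled data enters
            tainted.add(instr[1])
        elif op == "assign":
            dst, src = instr[1], instr[2]
            if src in tainted:
                tainted.add(dst)
            else:
                tainted.discard(dst)           # strong update: overwritten with clean data
        elif op == "concat":
            dst, a, b = instr[1], instr[2], instr[3]
            if a in tainted or b in tainted:
                tainted.add(dst)
            else:
                tainted.discard(dst)
        elif op == "sink":                     # flag tainted data reaching a sink
            if instr[1] in tainted:
                findings.append((i, instr[1]))
    return findings

program = [
    ("input",  "user"),                        # user = read_from_network()
    ("assign", "name", "user"),                # name = user
    ("concat", "query", "prefix", "name"),     # query = "SELECT ..." + name
    ("sink",   "query"),                       # execute(query)  <-- injection!
    ("assign", "name", "constant"),            # name = "admin"  (clean)
    ("concat", "query", "prefix", "name"),
    ("sink",   "query"),                       # clean
]
print(taint_analysis(program))                 # only the first sink is flagged
```

Interprocedural versions do the same bookkeeping across call boundaries, which is where the cost grows.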
Abstract interpretation provides a mathematically rigorous framework for approximating program behavior. Rather than tracking concrete values, it reasons over abstract domains; for instance, tracking whether an integer is positive, negative, or zero rather than its exact value. This approach can provide soundness guarantees (no false negatives for the modeled properties) but at the cost of potential over-approximation (false positives).
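The classic sign domain can be sketched in a few lines. This is a toy Python evaluator over hand-built expression trees; real abstract interpreters also handle loops via fixpoints and widening, which this omits.

```python
# Sign-domain abstract interpreter: values are abstracted to NEG, ZERO, POS,
# or TOP (unknown) instead of concrete numbers.
NEG, ZERO, POS, TOP = "neg", "zero", "pos", "top"

def sign_of(n):
    return ZERO if n == 0 else (POS if n > 0 else NEG)

def abs_mul(a, b):
    if ZERO in (a, b):
        return ZERO                     # 0 * anything = 0, even if the other side is unknown
    if TOP in (a, b):
        return TOP
    return POS if a == b else NEG

def abs_add(a, b):
    if a == ZERO:
        return b
    if b == ZERO:
        return a
    if a == b:
        return a                        # pos+pos = pos, neg+neg = neg
    return TOP                          # pos+neg: sign unknown -> over-approximate

# Expressions: ("const", n) | ("var", name) | ("+", l, r) | ("*", l, r)
def abs_eval(expr, env):
    kind = expr[0]
    if kind == "const":
        return sign_of(expr[1])
    if kind == "var":
        return env.get(expr[1], TOP)
    l, r = abs_eval(expr[1], env), abs_eval(expr[2], env)
    return abs_add(l, r) if kind == "+" else abs_mul(l, r)

env = {"x": POS}                        # x > 0 is known; y is not in env -> TOP
expr_pos = ("+", ("*", ("var", "x"), ("var", "x")), ("const", 1))   # x*x + 1
expr_unk = ("+", ("var", "x"), ("var", "y"))                        # x + y
print(abs_eval(expr_pos, env), abs_eval(expr_unk, env))
```

Note the trade-off in action: `x*x + 1` is provably positive, while `x + y` collapses to "unknown"; the analysis never wrongly claims a sign (soundness for this property), but it loses precision whenever the domain cannot express the answer.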
Query-based analysis treats code as data, building a relational database from the program's structure and semantics, then allowing users to write queries against that database. This paradigm, pioneered by Semmle's QL language (now CodeQL), offers remarkable flexibility, enabling security researchers to express complex vulnerability patterns as declarative queries.
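A greatly simplified analog of the code-as-data idea, using Python's `ast` module as the "extractor" and SQLite as the database (the schema and the example source are invented for illustration):

```python
# "Code as data": extract a call relation from source into SQLite, then find
# dangerous calls with a declarative SQL query -- a much simplified analog of
# how query-based engines work.
import ast
import sqlite3

SOURCE = """
import subprocess, shlex

def run(cmd):
    subprocess.call(cmd, shell=True)

def safe_run(cmd):
    subprocess.call(shlex.split(cmd))
"""

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE calls (func TEXT, arg_count INT, line INT, shell_true INT)")

# "Extraction" phase: populate the relational database from the AST.
for node in ast.walk(ast.parse(SOURCE)):
    if isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute):
        shell_true = any(
            kw.arg == "shell" and isinstance(kw.value, ast.Constant) and kw.value.value is True
            for kw in node.keywords
        )
        db.execute("INSERT INTO calls VALUES (?, ?, ?, ?)",
                   (node.func.attr, len(node.args), node.lineno, int(shell_true)))

# "Query" phase: declarative, independent of how extraction was done.
rows = db.execute(
    "SELECT func, line FROM calls WHERE func = 'call' AND shell_true = 1"
).fetchall()
print(rows)   # only the shell=True call is reported
```

The separation matters: once the database exists, new vulnerability patterns are just new queries, with no re-extraction needed.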
```mermaid
flowchart LR
    A["Source Code"] --> B["Parse & Build Model"]
    B --> C["AST / IR / Database"]
    C --> D["Apply Analysis Rules"]
    D --> E["Generate Findings"]
    E --> F["Triage & Report"]
    F --> G["Developer Fix"]
    style A fill:#1a1a2e,stroke:#16213e,color:#e0e0e0
    style C fill:#0f3460,stroke:#16213e,color:#e0e0e0
    style E fill:#533483,stroke:#16213e,color:#e0e0e0
    style G fill:#0a6847,stroke:#16213e,color:#e0e0e0
```

Key Tools¶
CodeQL¶
CodeQL is a semantic code analysis engine developed by GitHub (acquired from Semmle in 2019). It represents a paradigm shift in static analysis: rather than encoding bug patterns as procedural checker code, CodeQL treats the entire codebase as a queryable relational database.
How It Works. CodeQL's workflow has three phases. First, a specialized compiler-like process called the extractor runs alongside the normal build process, capturing a comprehensive snapshot of the codebase (ASTs, type information, control flow graphs, and dataflow edges) into a relational database. Second, security researchers write queries in QL, a declarative, object-oriented query language with Datalog-like semantics. Third, the CodeQL engine evaluates these queries against the database, producing structured results that can be displayed in IDEs, CI dashboards, or GitHub code scanning.
The QL Language. QL is a logic programming language designed specifically for code analysis. It supports classes, predicates, and recursive query patterns. A taint-tracking query, for instance, defines a source (where untrusted data enters), a sink (where it reaches a security-sensitive operation), and lets the engine compute all paths between them. The language's declarative nature means that researchers specify what to find, not how to traverse the code.
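The declarative source/sink/path structure can be sketched as plain graph reachability. The dataflow graph below is hand-written and its node names are illustrative; a real engine derives the edges from the code itself.

```python
# Declarative flavor of a taint-tracking query: specify sources and sinks,
# and let a generic engine compute all paths between them.

EDGES = {  # dataflow edges: the value at each key flows into each listed node
    "request.args": ["user_id"],
    "user_id":      ["query_str", "log_msg"],
    "query_str":    ["db.execute"],
    "log_msg":      ["logger.info"],
}
SOURCES = {"request.args"}        # where untrusted data enters
SINKS = {"db.execute"}            # security-sensitive operations

def paths(node, path):
    """Yield every source-to-sink witness path reachable from `node`."""
    path = path + [node]
    if node in SINKS:
        yield path
    for nxt in EDGES.get(node, []):
        yield from paths(nxt, path)

results = [p for src in SOURCES for p in paths(src, [])]
print(results)   # the full witness path, much as a path query reports it
```

The "query" here is just the `SOURCES`/`SINKS` declaration; the traversal is generic, which mirrors how QL lets researchers say what to find while the engine decides how.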
Community and Ecosystem. GitHub maintains an extensive set of default query suites covering the OWASP Top 10, CWE/SANS Top 25, and language-specific vulnerability classes. The open-source query repository contains thousands of queries contributed by both GitHub's security lab and the broader community. CodeQL is free for open-source projects on GitHub and powers the code scanning feature used by millions of repositories.
Strengths:
- Extremely expressive query language allows modeling complex vulnerability patterns
- Deep interprocedural and inter-file dataflow analysis
- First-class GitHub integration with automatic PR scanning
- Supports C/C++, Java, C#, Python, JavaScript/TypeScript, Go, Ruby, Swift, and Kotlin
- Large community-maintained query library
Weaknesses:
- Database creation requires a successful build (for compiled languages), which can be difficult for complex projects
- Query writing has a steep learning curve
- Analysis can be slow on very large codebases (millions of lines of code)
- Commercial use beyond GitHub requires licensing
Use Cases: Variant analysis (finding all instances of a known vulnerability pattern), supply chain security audits, large-scale vulnerability research across open-source ecosystems, CI/CD integration for automated security gates.
Community & Maintenance: Actively developed by GitHub with a dedicated security research team. Regular releases, active GitHub Discussions, and a growing body of academic research building on the QL framework. CodeQL CTF competitions further build community expertise.
Coverity¶
Coverity (now part of Synopsys) is one of the most established commercial static analysis platforms, with roots in academic research at Stanford University: the metacompilation (MC) checking project led by Dawson Engler.
Interprocedural Analysis. Coverity's core strength is its deep whole-program analysis. It builds a comprehensive model of program behavior across function boundaries, tracking data flow, control flow, and resource state through complex call chains. This interprocedural approach enables detection of bugs that simpler tools miss; for example, a memory allocation in one function that is never freed in any calling context.
Enterprise Focus. Coverity is designed for integration into enterprise development workflows. It supports incremental analysis (analyzing only changed code against the existing whole-program model), provides a web-based defect management interface for triage, and offers compliance reporting aligned with standards such as MISRA, CERT C/C++, and OWASP. The platform includes role-based access control, audit trails, and integrations with major CI/CD systems (Jenkins, Azure DevOps, GitHub Actions).
False Positive Rates. Coverity has historically invested heavily in reducing false positive rates, recognizing that developer trust is critical for adoption. Synopsys reports a false positive rate under 15% for its default checker configurations, though this varies by checker category and codebase characteristics. The triage workflow allows teams to classify and suppress findings, building an institutional knowledge base over time.
Strengths:
- Deep interprocedural analysis with low false positive rates
- Mature enterprise features (triage, compliance, auditing)
- Broad language support (C/C++, Java, C#, JavaScript, Python, and more)
- Incremental analysis for faster CI/CD integration
- Strong track record in safety-critical industries (automotive, aerospace, medical devices)
Weaknesses:
- Commercial license cost can be significant for smaller organizations
- Proprietary; no visibility into analysis internals
- Requires infrastructure investment (build integration, results server)
- Customization of checker rules is limited compared to query-based tools
Use Cases: Enterprise SDLC security, safety-critical software development (DO-178C, ISO 26262 compliance), large-scale C/C++ codebases, regulated industries requiring auditable security testing.
Community & Maintenance: Backed by Synopsys with dedicated engineering and support teams. Coverity Scan provides free analysis for open-source projects and has been used by projects such as the Linux kernel, FreeBSD, and CPython.
Infer¶
Infer is an open-source static analysis tool developed by Meta (formerly Facebook). It is notable for its foundation in formal methods, specifically separation logic and the technique of bi-abduction, which enables it to reason about memory safety and concurrency with mathematical rigor.
Compositional Analysis. Infer's key innovation is compositional (or modular) analysis. Rather than analyzing an entire program at once, Infer analyzes each function independently, producing a summary of that function's behavior in terms of pre-conditions and post-conditions expressed in separation logic. These summaries are then composed at call sites. This approach scales to very large codebases (Meta runs Infer on codebases with tens of millions of lines of code) because adding or changing a function only requires re-analyzing that function and its callers, not the entire program.
Bi-Abduction. The bi-abduction technique, introduced in a POPL 2011 paper by Calcagno et al., allows Infer to automatically infer function specifications. Given a function body, bi-abduction simultaneously discovers what the function requires of its callers (the pre-condition) and what it guarantees to them (the post-condition). This eliminates the need for manual annotation, making the tool practical for real-world codebases where developers rarely write formal specifications.
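The summary-and-compose idea can be sketched as follows. This is a toy "may return None" analysis over invented function descriptions, not Infer's actual separation-logic summaries; the point is that each function is analyzed once and its cached summary is reused at every call site.

```python
# Toy compositional analysis: per-function summaries composed at call sites.
# Function "bodies" are toy descriptions, not real code:
# ("returns_none",) | ("returns_value",) | ("returns_call", callee)
FUNCTIONS = {
    "lookup":  ("returns_none",),          # may return None (dict.get-style)
    "default": ("returns_value",),         # always returns a real value
    "fetch":   ("returns_call", "lookup"), # returns whatever lookup returns
    "render":  ("returns_call", "default"),
}

def summarize(name, cache):
    """Compute (and memoize) a summary: True = may return None."""
    if name in cache:
        return cache[name]                 # reuse: no re-analysis of callees
    body = FUNCTIONS[name]
    if body[0] == "returns_none":
        result = True
    elif body[0] == "returns_value":
        result = False
    else:                                  # compose the callee's summary
        result = summarize(body[1], cache)
    cache[name] = result
    return result

cache = {}
summaries = {f: summarize(f, cache) for f in FUNCTIONS}
print(summaries)   # fetch inherits lookup's "may return None"; render does not
```

Changing one function only invalidates its own summary and those of its callers, which is exactly why this style of analysis scales to huge codebases.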
Checker Suite. Infer includes checkers for null pointer dereferences, memory leaks, use-after-free, data races (via its RacerD checker), and performance issues (via Litho lifecycle analysis for Android). The tool has been deployed at scale inside Meta, where it runs on every diff submitted to the monorepo and has reportedly caught tens of thousands of bugs before they reached production.
Strengths:
- Scales to massive codebases through compositional analysis
- Mathematically grounded in separation logic (soundness properties)
- Fast incremental analysis (only re-analyzes changed code)
- Open-source (MIT license)
- Proven at scale inside Meta
Weaknesses:
- Narrower scope of bug types compared to general-purpose SAST tools
- Can produce false positives, especially for complex aliasing patterns
- Documentation and community support are less extensive than commercial tools
- Primarily focused on C/C++, Java, and Objective-C; limited support for other languages
Use Cases: Memory safety analysis for C/C++ and Java, concurrency bug detection, CI/CD integration for large codebases, academic research on program analysis.
Community & Maintenance: Developed by Meta's programming languages and analysis team. Open-source on GitHub with contributions from both Meta engineers and external researchers. Active development with regular releases.
Semgrep¶
Semgrep is a lightweight, open-source static analysis tool that emphasizes speed, ease of use, and developer experience. Originally developed at r2c (now Semgrep, Inc.), it takes a pattern-matching approach that makes writing custom rules accessible to developers who are not static analysis experts.
Pattern Matching. Semgrep rules are written in YAML and use a syntax that closely mirrors the target language's code. A rule to find SQL injection in Python, for instance, looks almost exactly like the vulnerable code pattern itself, with metavariables (e.g., $USER_INPUT) standing in for arbitrary expressions. This low barrier to entry means that developers can write project-specific rules in minutes rather than days.
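As a rough illustration of what such a pattern matches, here is a hand-rolled Python check for the shape `$CURSOR.execute($QUERY % $ARGS)`. The matching logic is hard-coded for this one pattern and the snippet is invented; real Semgrep rules are declarative YAML and far more general.

```python
import ast

SOURCE = """
cursor.execute("SELECT * FROM users WHERE id = %s" % user_id)
cursor.execute("SELECT * FROM users WHERE id = ?", (user_id,))
"""

def matches_pattern(node):
    # A Call whose callee is `<obj>.execute` and whose first argument is a
    # %-formatting BinOp -- the shape that metavariables $QUERY % $ARGS match.
    return (isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and node.func.attr == "execute"
            and node.args
            and isinstance(node.args[0], ast.BinOp)
            and isinstance(node.args[0].op, ast.Mod))

findings = [n.lineno for n in ast.walk(ast.parse(SOURCE)) if matches_pattern(n)]
print(findings)   # only the %-formatted query is flagged; the parameterized one passes
```

Semgrep's contribution is making this kind of check writable in a few lines of rule, with the language-aware parsing and metavariable binding handled for you.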
Community Registry. The Semgrep Registry contains thousands of community-contributed rules organized by language, framework, and vulnerability type. Semgrep also offers pro rules (via Semgrep Supply Chain and Semgrep Code) that add cross-file dataflow analysis and supply chain vulnerability detection.
Strengths:
- Very fast analysis (typically seconds to minutes, even on large codebases)
- Easy rule authoring with language-aware pattern matching
- Supports 30+ languages with a single tool
- Open-source core (LGPL 2.1) with commercial extensions
- Excellent CI/CD integration and developer experience
Weaknesses:
- Single-file analysis in the open-source version limits detection of cross-function bugs
- Pattern matching is less powerful than full dataflow analysis for complex vulnerability patterns
- Pro features (cross-file analysis, secrets detection) require a commercial license
Use Cases: Enforcing coding standards, finding known anti-patterns, lightweight security scanning in CI/CD, custom rule creation for project-specific vulnerabilities.
Community & Maintenance: Actively developed by Semgrep, Inc. Large and growing open-source community. Regular releases with expanding language support.
Clang Static Analyzer¶
The Clang Static Analyzer is a source-code analysis tool built on the LLVM/Clang compiler infrastructure. It performs path-sensitive, interprocedural analysis of C, C++, and Objective-C code.
Path-Sensitive Analysis. Unlike simpler AST-based checkers, the Clang Static Analyzer symbolically executes paths through a function, tracking the state of variables along each path. This enables it to detect bugs that only manifest under specific conditions; for example, a null pointer dereference that only occurs when a particular branch is taken. The analysis is sound along each explored path but may not explore all paths in complex functions (it uses heuristics to bound analysis time).
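The per-path idea can be sketched like this. It is a toy model where the effect of each branch is given directly; a real analyzer derives these states by symbolically executing the actual code.

```python
# Toy path-sensitive null-dereference check: explore each branch separately,
# tracking whether `p` may be None along that path. A flow-insensitive checker
# that merges the branches would either miss the bug or flag the safe path too.

def explore(branch_assigns, deref_on):
    """branch_assigns: {branch: value assigned to p, None meaning null}.
    deref_on: set of branches that dereference p.
    Returns the branches where a null dereference is possible."""
    bugs = []
    for branch, p_value in branch_assigns.items():   # one symbolic state per path
        p_may_be_none = p_value is None
        if branch in deref_on and p_may_be_none:
            bugs.append(branch)
    return bugs

# Models: if cond: p = None  else: p = make_obj(); both branches then call p.use()
print(explore({"then": None, "else": "obj"}, deref_on={"then", "else"}))
# only the "then" path is a bug; the "else" path is provably safe
```

Because path counts explode combinatorially, real path-sensitive analyzers bound exploration with heuristics, as noted above.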
Clang-Tidy Integration. Clang-Tidy is a complementary tool that provides AST-based lint checks and can apply automatic fixes. While the Static Analyzer focuses on bug finding through path exploration, Clang-Tidy focuses on style enforcement, modernization (e.g., migrating to C++17 idioms), and simpler bug patterns. Together, they cover a broad spectrum of code quality concerns.
Strengths:
- Free and open-source, integrated with the LLVM ecosystem
- Path-sensitive analysis catches bugs that simpler tools miss
- Excellent for C/C++/Objective-C codebases already using Clang
- Can be run as part of the normal build process (scan-build)
- Active LLVM community with regular improvements
Weaknesses:
- Limited to C-family languages
- Analysis can be slow for large translation units
- False positive rate can be higher than commercial tools for some checker categories
- Less suitable for whole-program analysis across many translation units
Use Cases: C/C++ development workflows using LLVM/Clang, pre-commit bug finding, integration with IDEs (Xcode uses it natively), open-source project quality assurance.
Community & Maintenance: Part of the LLVM project with a large contributor base. Maintained alongside the Clang compiler with regular releases.
Checkmarx¶
Checkmarx is an enterprise application security testing (AST) platform with a strong focus on SAST. It supports a broad range of languages and frameworks and is positioned primarily for organizations with compliance and governance requirements.
Checkmarx provides configurable scan policies, integration with ticketing systems (Jira, ServiceNow), and compliance dashboards aligned with standards such as PCI DSS, HIPAA, and SOC 2. Its query language allows customization of detection rules, though this is typically managed by dedicated security teams rather than developers. The platform includes SCA (Software Composition Analysis) and DAST capabilities alongside its core SAST engine.
Strengths:
- Broad language and framework coverage
- Strong compliance and governance features
- Unified AST platform (SAST, SCA, DAST)
- Enterprise integrations and support
Weaknesses:
- Commercial-only with significant licensing costs
- Can produce a high volume of findings that require triage effort
- Less transparent analysis methodology compared to open-source tools
Use Cases: Enterprise application security programs, regulatory compliance, DevSecOps integration in large organizations.
Community & Maintenance: Commercial product backed by Checkmarx (acquired by Hellman & Friedman in 2020). Regular product updates and dedicated customer support.
Comparison Matrix¶
| Tool | Technique | Languages | CI/CD Integration | License | False Positive Rate |
|---|---|---|---|---|---|
| CodeQL | Query-based dataflow | C/C++, Java, C#, Python, JS/TS, Go, Ruby, Swift, Kotlin | GitHub Actions, CLI | Free for OSS; commercial license | Low–Medium |
| Coverity | Interprocedural dataflow | C/C++, Java, C#, JS, Python, and more | Jenkins, Azure DevOps, GitHub | Commercial (free for OSS via Scan) | Low |
| Infer | Compositional / separation logic | C/C++, Java, Objective-C | CLI, Buck/Gradle | MIT | Medium |
| Semgrep | Pattern matching (+ dataflow in Pro) | 30+ languages | GitHub, GitLab, CLI | LGPL 2.1 (core); commercial (Pro) | Low–Medium |
| Clang Static Analyzer | Path-sensitive symbolic execution | C/C++, Objective-C | scan-build, CMake | Apache 2.0 (LLVM) | Medium |
| Checkmarx | Dataflow, pattern matching | 30+ languages | Jenkins, Azure DevOps, GitHub, GitLab | Commercial | Medium–High |
When to Use What¶
Selecting the right static analysis tool depends on your context: language ecosystem, team size, budget, and the types of bugs you need to find.
For open-source C/C++ projects, the Clang Static Analyzer is a natural starting point: it is free, integrates seamlessly with Clang-based build systems, and requires no additional infrastructure. Complement it with Infer for memory safety analysis, particularly if your codebase is large enough to benefit from Infer's incremental compositional approach.
For security-focused vulnerability hunting, CodeQL stands out. Its query language enables researchers to express complex vulnerability patterns that would be difficult or impossible to capture with simpler tools. If you are hunting for variants of a known CVE across a codebase, CodeQL's variant analysis workflow is purpose-built for this task. The free tier for open-source projects on GitHub makes it accessible to security researchers.
For developer-facing CI/CD integration, Semgrep offers the best balance of speed, ease of use, and breadth. Its fast scan times (typically under a minute) mean it can run on every commit without slowing down development. Custom rules can encode project-specific security policies, and the community registry provides a strong baseline.
For enterprise and compliance-driven environments, Coverity and Checkmarx provide the governance features, audit trails, and compliance reporting that regulated industries require. Coverity's lower false positive rate tends to make it more popular with development teams, while Checkmarx's broader AST platform may appeal to organizations seeking a single vendor for SAST, SCA, and DAST.
Defense in Depth
No single static analysis tool catches everything. Many organizations layer tools; for example, running Semgrep for fast feedback on every PR and CodeQL for deeper analysis on nightly builds. This layered approach maximizes coverage while keeping developer friction low.
Research Landscape¶
Static analysis has deep roots in programming language theory and formal methods, and several of the tools discussed here originated in academic research.
Infer and Separation Logic. Infer's theoretical foundation lies in the work of Peter O'Hearn, John Reynolds, and others on separation logic, a formal system for reasoning about programs that manipulate heap memory. The key breakthrough enabling Infer's practical deployment was the bi-abduction technique, published at POPL 2011. O'Hearn's work on "continuous reasoning" (applying formal methods incrementally in CI/CD) was recognized with the CAV Award and influenced how the industry thinks about deploying analysis at scale.
CodeQL and the QL Lineage. CodeQL descends from the QL language developed at Semmle, a company founded in 2006 by Oxford researchers Oege de Moor and others. The underlying ideas draw on Datalog and logic programming, adapted for the specific needs of code analysis. Academic publications on the QL framework have appeared at venues including OOPSLA and ECOOP. GitHub's acquisition of Semmle in 2019 brought these academic ideas to a massive user base.
Ongoing Research Directions. Current academic work in static analysis focuses on several frontiers: improving scalability through demand-driven analysis (analyzing only code relevant to a specific query), combining machine learning with traditional analysis to reduce false positives, and extending formal verification techniques to new domains such as smart contracts and distributed systems. The intersection of static analysis with LLM-based code understanding is an active and rapidly evolving area; see Emerging Tech for more on this trend.
Research-to-Practice Pipeline
The static analysis field has one of the strongest research-to-practice pipelines in software engineering. Tools like Infer and CodeQL demonstrate that deep theoretical work (separation logic, Datalog-based querying) can translate into tools used daily by millions of developers. This pipeline continues to produce new techniques ripe for commercialization.
Related Pages¶
- Dynamic Analysis: runtime approaches that complement static analysis
- Hybrid Approaches: combining static and dynamic techniques
- Fuzzing Tools Overview: dynamic testing through automated input generation
Glossary¶
| Term | Definition |
|---|---|
| AFL | American Fuzzy Lop, coverage-guided fuzzer |
| ASan | AddressSanitizer, memory error detector |
| CVE | Common Vulnerabilities and Exposures |
| AFL++ | Community-maintained successor to AFL, the de facto standard coverage-guided fuzzer |
| AEG | Automatic Exploit Generation, automated creation of working exploits from vulnerability information |
| ANTLR | ANother Tool for Language Recognition, parser generator used by grammar-aware fuzzers like Superion |
| AST | Abstract Syntax Tree, tree representation of source code structure used by static analyzers |
| BOF | Buffer Overflow, writing data beyond allocated memory bounds, a common memory safety vulnerability |
| CFG | Control Flow Graph, directed graph representing all possible execution paths through a program |
| CGC | Cyber Grand Challenge, DARPA competition for autonomous vulnerability detection and patching |
| ClusterFuzz | Google's distributed fuzzing infrastructure that powers OSS-Fuzz |
| CodeQL | GitHub's query-based static analysis engine that treats code as a queryable database |
| Concolic | Concrete + Symbolic, execution that runs concrete values while tracking symbolic constraints |
| Corpus | Collection of seed inputs used by a coverage-guided fuzzer as the basis for mutation |
| Coverity | Synopsys commercial static analysis platform with deep interprocedural analysis |
| CPG | Code Property Graph, unified representation combining AST, CFG, and data-flow graph, used by Joern |
| CVSS | Common Vulnerability Scoring System, standard for rating vulnerability severity |
| CWE | Common Weakness Enumeration, categorization of software weakness types |
| DAST | Dynamic Application Security Testing, testing running applications for vulnerabilities |
| DBI | Dynamic Binary Instrumentation, modifying program behavior at runtime without recompilation |
| DFG | Data Flow Graph, graph representing how data values propagate through a program |
| DPA | Differential Power Analysis, extracting cryptographic keys by analyzing power consumption variations |
| Frida | Dynamic instrumentation toolkit for injecting scripts into running processes |
| Harness | Glue code connecting a fuzzer to its target, defining how fuzzed input is delivered |
| HWASAN | Hardware-assisted AddressSanitizer, ARM-based variant of ASan with lower overhead |
| IAST | Interactive Application Security Testing, combines elements of SAST and DAST during testing |
| Infer | Meta's open-source static analyzer based on separation logic and bi-abduction |
| KLEE | Symbolic execution engine built on LLVM for automatic test generation |
| LLM | Large Language Model, neural network trained on text/code, used for bug detection and code generation |
| LSAN | LeakSanitizer, detector for memory leaks, often used alongside AddressSanitizer |
| Meltdown | CPU vulnerability exploiting out-of-order execution to read kernel memory from user space |
| MITRE | Non-profit organization that maintains CVE, CWE, and ATT&CK frameworks |
| MSan | MemorySanitizer, detector for reads of uninitialized memory |
| NVD | National Vulnerability Database, NIST-maintained repository of vulnerability data |
| NIST | National Institute of Standards and Technology, US agency maintaining security standards and NVD |
| OSS-Fuzz | Google's free continuous fuzzing service for open-source software |
| OWASP | Open Worldwide Application Security Project, community producing security guides and tools |
| RCE | Remote Code Execution, vulnerability allowing an attacker to run arbitrary code on a target system |
| RL | Reinforcement Learning, ML paradigm where agents learn through reward-based feedback |
| S2E | Selective Symbolic Execution, whole-system analysis platform combining QEMU with KLEE |
| SARIF | Static Analysis Results Interchange Format, standard for exchanging static analysis findings |
| SAST | Static Application Security Testing, analyzing source code for vulnerabilities without execution |
| SCA | Software Composition Analysis, identifying known vulnerabilities in third-party dependencies |
| Seed | Initial input provided to a fuzzer as the starting point for mutation |
| Semgrep | Lightweight open-source static analysis tool using pattern-matching rules |
| Side-channel | Attack vector exploiting physical implementation artifacts rather than algorithmic flaws |
| SMT | Satisfiability Modulo Theories, solver used by symbolic execution to find inputs satisfying path constraints |
| Spectre | Family of CPU vulnerabilities exploiting speculative execution to leak data across security boundaries |
| SQLi | SQL Injection, injecting malicious SQL into queries via unsanitized user input |
| SSRF | Server-Side Request Forgery, tricking a server into making requests to unintended destinations |
| SymCC | Compilation-based symbolic execution tool that is 2–3 orders of magnitude faster than KLEE |
| Taint analysis | Tracking the flow of untrusted data from sources to security-sensitive sinks |
| TOCTOU | Time-of-Check-Time-of-Use, race condition between validating a resource and using it |
| TSan | ThreadSanitizer, detector for data races in multithreaded programs |
| UAF | Use-After-Free, accessing memory after it has been deallocated |
| UBSan | UndefinedBehaviorSanitizer, detector for undefined behavior in C/C++ |
| Valgrind | Dynamic binary instrumentation framework for memory debugging and profiling |
| XSS | Cross-Site Scripting, injecting malicious scripts into web pages viewed by other users |
| Fine-tuning | Adapting a pre-trained ML model to a specific task using additional training data |
| Abstract interpretation | Mathematical framework for approximating program behavior using abstract domains |
| Dataflow analysis | Tracking how values propagate through a program to detect bugs like taint violations |