Introduction: The AI Code Audit Landscape

In the current cybersecurity ecosystem, virtually every security company and independent researcher has embraced artificial intelligence as a critical assistant for code auditing and vulnerability discovery. This transformation represents a fundamental shift in how security professionals approach their work—codifying audit methodologies into structured skills that guide large language models through systematic security analysis.

However, practitioners in this space have encountered a significant and persistent challenge: when large language models process substantial codebases, their limited context window capacity often leads to hallucinations or critical information loss. To mitigate this issue, skill designs have constrained how models access code—requiring them to first analyze business workflows and then retrieve only the function code invoked within those specific flows.

Yet during subsequent testing phases, an intriguing pattern emerged. This particular skill configuration demonstrated notably fewer vulnerability discoveries on the Claude Code platform (powered by Claude Opus 4.6 with 1M context), while producing significantly better results on the Codex platform (running GPT-5.4 xhigh). This performance discrepancy prompted a deeper investigation into how these two model platforms approach the same auditing tasks.

Fundamental Differences in Analytical Approach and Execution

Execution Architecture and Cognitive Style

The underlying architectural differences between these two models fundamentally shape their cognitive styles when facing audit tasks.

Codex: The "Complete Tasks Until They Pass" Loop

According to OpenAI's documentation on "Unrolling the Codex Agent Loop," Codex operates through a continuous agent loop: analyze intent → invoke tools → obtain results → feed back to context → reason again, repeating until task completion. Codex is built on the o3 series models, which have undergone reinforcement learning training with an objective function centered on "iteratively running tests until they pass."

This training methodology produces a critical characteristic: Codex naturally tends to treat each analysis step as a test case requiring "pass/fail" verification. When the skill's audit methodology states "enumerate each hypothesis," Codex's reinforcement learning training makes it more inclined to actually enumerate and verify them one by one, because it has been trained that skipping a step that causes the final result to fail earns a negative reward.
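The loop described above can be sketched in a few lines of Python. This is an illustrative reconstruction, not Codex's actual implementation; the step/check structure and all names are assumptions made for the sketch:

```python
# Hypothetical sketch of the "run until it passes" agent loop described above.
# Names (run, check, agent_loop) are illustrative, not part of any real Codex API.

def agent_loop(steps, max_retries=3):
    """Execute each step and re-attempt until its check passes, like a test loop."""
    results = []
    for step in steps:
        for _attempt in range(max_retries):
            outcome = step["run"]()          # invoke the tool / perform the action
            if step["check"](outcome):       # treat the step as a pass/fail test case
                results.append((step["name"], outcome))
                break
        else:
            # the step never passed: in RL terms, a negative signal
            results.append((step["name"], "FAILED"))
    return results
```

The point of the sketch is the inner retry loop: every step carries its own acceptance criterion, so "skipping" is structurally impossible.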

Claude: Long-Context Reasoning + Interactive Dialogue + Extended Thinking

Claude Opus 4.6 excels at "structured, multi-step reasoning"—understanding architectures, tracking dependency chains, and reasoning across multiple files. Its extended thinking mechanism enables it to maintain coherence across long reasoning chains.

However, this brings a notable side effect: Claude is better at "understanding how systems work" rather than "exhaustively enumerating how systems can be broken." It constructs a coherent mental model ("depositToken has balance check → the protocol has fee-on-transfer protection"), then makes inferences based on this model. If the model is correct, the inferences are correct; if the model misses a path, all inferences based on it will deviate.

Comparative Summary

  • Codex functions more as an "execution-first" agent. Its default strengths lie in: rapidly translating tasks into shell commands, tests, patches, and validation loops—particularly suitable for fix/refactor/CLI/environment debugging tasks where "objectives are clear and acceptance criteria are well-defined."
  • Claude Code operates more as an "analysis-first" agent. Its default strengths include: consuming large repository contexts, coordinating multi-point edits, and maintaining semantic consistency over extended periods—especially suitable for multi-file feature development, complex code understanding, long-chain planning, and semantic consistency repairs.

However, this is not an absolute distinction between GPT-5.4 and Claude Opus 4.6 as "model differences." A significant portion actually stems from "product execution method differences" between Codex and Claude Code themselves.

Practical Application Scenarios

  • For tasks involving "finding bugs, running commands, fixing tests, repeated verification," public evidence more strongly supports Codex as the preferable choice.
  • For tasks requiring "consuming large repository contexts, performing multi-point coordinated edits, maintaining semantic consistency over extended periods," public evidence more strongly supports Claude Code.

Direct Project Manifestation

In one specific case study, this difference manifested quite directly.

Taking the ammHandleTransfer fund flow analysis as an example:

  • Claude constructed a global mental model: "This protocol performs balance delta verification on all incoming funds"—then used this model to judge all paths as secure. In reality, the fillOrder path lacked this verification, but this path was not included in the model.

Restricting code context so that function code is retrieved on demand per business flow may cause Claude to extend conclusions drawn from one business flow to the entire protocol. Constraining Claude's context may thus limit exactly the global thinking that is its strength.

  • Codex did not construct a global model—it checked path by path, marking ammHandleTransfer's inflow as "assumed to already be sitting in the handler." It doesn't care whether other paths have protection, only whether the current path does.
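The balance check at issue can be illustrated with a minimal Python mock contrasting "trust the nominal amount" with "measure the balance delta." The real protocol is Solidity; the token and function names here are hypothetical:

```python
# Minimal mock contrasting nominal booking with balance-delta measurement,
# the protection one path was assumed to have. Names are illustrative.

class FeeOnTransferToken:
    """Token that burns a 1% fee on every transfer."""
    def __init__(self):
        self.balances = {}

    def transfer(self, frm, to, amount):
        received = amount * 99 // 100          # 1% fee is lost in transit
        self.balances[frm] = self.balances.get(frm, 0) - amount
        self.balances[to] = self.balances.get(to, 0) + received

def book_nominal(token, user, handler, amount):
    """Credits the nominal parameter: over-credits for fee-on-transfer tokens."""
    token.transfer(user, handler, amount)
    return amount

def book_measured(token, user, handler, amount):
    """Credits only what actually arrived, measured as a balance delta."""
    before = token.balances.get(handler, 0)
    token.transfer(user, handler, amount)
    return token.balances.get(handler, 0) - before
```

A path using `book_nominal` is exactly the kind of gap a global "this protocol measures balance deltas" model will paper over, while path-by-path checking catches it.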

Cognitive Models of the Two Systems

Claude: Mental Model-Driven

Claude's reasoning approach involves modeling first, then inferring. When facing a set of contracts, it constructs a coherent model based on context—"what this protocol does, how parts collaborate, where security mechanisms exist"—then judges whether each function is secure based on this model.

Advantages and Trade-offs of This Strategy:

  • Advantage: Cross-module insights. The model can naturally see cross-boundary relationships like "Contract A's output is Contract B's input" because it treats the entire protocol as a whole in its mental model.
  • Trade-off: Inference substitutes verification. Once the mental model labels "this protocol has X protection," all paths involving X will be inferred as secure—even if a particular path lacks this protection. Mental models are compressions of reality, and compression inevitably loses details.

EVMbench data confirms: Claude scores highest on Detect tasks (45.6%) because detection requires global understanding; but on Exploit tasks requiring precise path tracking, it performs weaker than Codex.

Codex: Instruction Execution-Driven

Codex (GPT-5.4) reasons through step-by-step execution and verification. Its reinforcement learning training objective is "execute operation sequences until task completion" (see Unrolling the Codex Agent Loop). Facing the same set of contracts, it doesn't first build a global model, but analyzes each function step-by-step according to instructions, producing results for each checkpoint.

  • Advantage: Execution discipline. If your methodology states "check each parameter for signature coverage," Codex is more likely to actually check parameter by parameter. It won't skip steps because "it looks safe enough."
  • Trade-off: Tunnel vision. When executing step 3, Codex doesn't actively recall findings from step 1 to make associations, unless you explicitly establish these connections in your instructions.

EVMbench data confirms: Codex scores highest on Exploit tasks (72%+) because exploitation requires precise path construction and iterative debugging—exactly what RL training optimizes.

Exploit tasks that can be "verified" align more closely with Codex's working style.

Supplementary Evidence from Semgrep Research

Semgrep's comparative study provides complementary perspective:

  • Claude reported 2x more findings (46 vs 21), but with higher false positive rates (86% vs 82%)
  • Claude performed stronger on IDOR (requiring business semantic understanding); Codex performed stronger on Path Traversal (requiring step-by-step data flow tracking)
  • Codex showed greater variability across multiple runs (higher non-determinism)

This aligns with the cognitive models described above: global understanding → high recall, high false positives; step-by-step execution → low recall, high precision.

Summary: Different Cognitive Strategies Have Respective Strengths

| Characteristic | Claude Code (Claude Opus 4.6 1M) | Codex (GPT-5.4 xhigh) |
| --- | --- | --- |
| Core capability | Global understanding + long-chain reasoning | Step-by-step execution + verification loop |
| Methodology | Understanding as guide, autonomous judgment on execution | Understanding as instruction, mechanical step-by-step execution |
| Assumptions | Builds a global model for inference | Tends to verify one by one |

Claude excels at "discovering potential problems" (higher recall), while Codex excels at "verifying problems actually exist" (higher precision). In audit skill scenarios requiring every step to be executed thoroughly, Codex's execution discipline becomes a key advantage.

Skill Design Guidelines

Based on the cognitive differences outlined above, the core philosophy of skill design can be distilled into two principles:

  • Designing for Claude = Guidance: prevent it from skipping steps, prevent it from substituting inference for verification, prevent it from being distracted by distractors.
  • Designing for Codex = Connection: Help it see associations between steps, help it combine isolated findings into attack chains, help it understand global context.

Project Pattern Recognition

In the pattern of performing project type recognition first, then conducting vulnerability mining, the documents produced by business analysis become the "worldview" for all subsequent vulnerability mining agents. It's not just information—it's a framework that determines what questions agents will ask and what they will ignore in subsequent analysis.

Key Insight: If the business analysis phase describes an unverified assumption as fact, all subsequent agents will never question it.

In A/B experiments, for the same callback function's fund inflow:

  • One side described it as "the protocol receives X tokens" (declarative mood) → all agents treated it as verified
  • The other side described it as "assuming X tokens have arrived" (questioning mood) → agents actively verified the actual arrival amount

This single wording difference led to the discovery or non-discovery of a High-level vulnerability.

  • Design implication for Claude: Claude naturally tends to describe systems in the declarative mood (because it builds coherent models), so your Phase 1 instructions must force it to use the questioning mood. For every assumption not explicitly verified in code, require it to be labeled as an assumption rather than a fact.
  • Design implication for Codex: Codex doesn't need questioning mood as much (it doesn't build global models), but it needs business analysis phases to produce explicit connection information—which functions share state, which functions constitute transaction sequences, which invariants must be maintained across functions. Otherwise, each function will be analyzed in isolation.
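One way to make both implications concrete is a phase-1 output schema that tags every claim as verified or assumed and records the connection information Codex needs. This is a hypothetical sketch; the field names and example claims are illustrative, not part of any real skill format:

```python
# Hypothetical schema for a phase-1 business-analysis record that enforces the
# "questioning mood": every claim is tagged, and shared-state links are explicit.

from dataclasses import dataclass, field

@dataclass
class Claim:
    text: str
    status: str              # "VERIFIED" (with a code citation) or "ASSUMPTION"
    evidence: str = ""       # e.g. file:line reference when VERIFIED

@dataclass
class FlowAnalysis:
    flow: str
    claims: list
    # connection information for Codex-style agents: shared state, sequences, invariants
    shared_state: list = field(default_factory=list)

def unverified(analysis):
    """Claims that downstream agents must actively verify instead of trusting."""
    return [c.text for c in analysis.claims if c.status == "ASSUMPTION"]
```

With this shape, "the protocol receives X tokens" can only enter the worldview as a `VERIFIED` claim with evidence; otherwise it stays a question every downstream agent is required to re-check.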

Analysis Granularity Must Match the Model's Natural Stopping Layer

Each model has a "natural stopping layer" when analyzing—the level at which it tends to draw conclusions.

  • Claude's natural stopping layer is the conceptual level. It draws conclusions like "has EIP-712 signature protection," "has reentrancy guard," "has balance check," then stops diving deeper.
  • Codex's natural stopping layer is the task completion level. It completes the steps required by the instructions, then stops. If the instructions only require "check if there's signature verification," it will answer "yes" and stop, without actively asking the follow-up "which parameters does the signature cover?"

Design Implication: Your skill must push analysis requirements below the natural stopping layer.

  • For Claude: Provide clear thinking paths and questioning angles to help it analyze deeply. Instead of asking "is there signature protection," ask "which parameters does the signature cover, and which parameters affecting economic outcomes are not included in the signature." Instead of asking "is there balance check," ask "on this specific path, what is the data source for the booking amount—is it actual balance measurement or nominal parameters."

This explains why Claude performs so well when paired with senior code auditors in collaborative audits. The auditors supply the analytical framing and the probing questions, while Claude serves as an extension of their thinking and actions, helping them work through problems quickly and efficiently.

  • For Codex: Provide a detailed executable task list. Don't ask "analyze this function," decompose it into clear subtasks, each with clear output formats. Codex will precisely execute each subtask you define, so your subtask decomposition granularity determines analysis depth.

When pairing with Codex in collaborative audits, the feeling is very much "it moves when kicked": it simply completes the tasks the auditor feeds it, without extending or expanding beyond them.
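A decomposition of the earlier signature-coverage question into Codex-style subtasks might look like the following. The structure is a hypothetical sketch; the questions themselves come from the text above:

```python
# Illustrative decomposition of a vague instruction into checkable subtasks,
# pushing the requirement below the "natural stopping layer".

def decompose_signature_check(function_name):
    """Turn 'check if there's signature verification' into per-parameter subtasks."""
    return [
        {"task": f"List every parameter of {function_name}",
         "output": "parameter list"},
        {"task": "For each parameter, state whether it is included in the signed digest",
         "output": "param -> covered: yes/no"},
        {"task": "For each uncovered parameter, state whether it affects economic outcomes",
         "output": "param -> economic impact: yes/no"},
        {"task": "Flag any parameter that is uncovered AND economically significant",
         "output": "finding or SAFE, with reasoning"},
    ]
```

Each subtask has a defined output format, so "yes, there is signature verification" is no longer an acceptable stopping point.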

Execution Instructions

When designing a set of smart contract audit skills, we need to establish a framework and process, then write execution instructions within that framework to guide the model's analysis. So how should those execution instructions be designed?

Claude needs "have to do," Codex needs "know what to do"

This is the fundamental difference between these two models when facing methodologies.

  • Claude understands methodologies but executes selectively. If your dimensional methodology says "construct repeated attacks to calculate cumulative errors," Claude will understand this requirement, but if it judges that single-step errors are small, it will autonomously decide to skip this step. It treats methodologies as guidelines, filtering which steps are "worth doing" with its own judgment.
  • Codex executes methodologies but doesn't understand intent. If your methodology says "analyze this function," Codex will execute, but it may not know what to focus on. If your methodology says "check if parameter A is signature-covered," Codex will check precisely, but it won't actively ask the follow-up "what other parameters should also be checked?"

If one can design a rigorous, concretely executable audit process and methodology, execution may be more efficient and controllable on Codex.

Multi-Agent Information Sharing

If your skill uses multi-agent parallel mining, the division of labor among agents should adapt to model characteristics.

Principle: Put information requiring association into the same agent's context.

  • Claude excels at association, so you can give each agent fewer functions but more analysis dimensions—letting it freely associate across dimensions within one contract. The topology tends to group by contract.
  • Codex is not good at spontaneous association, so let each agent focus on one analysis dimension but see all contracts—it can compare how different contracts handle the same dimension ("Contract A did this check, did Contract B do it?"). The topology tends to group by dimension.

When writing contract audit Skills, the vulnerability type matching or checklist matching methods we commonly use seem more suitable for Codex's execution habits.

This is not an absolute rule, but a default preference. If two contracts in your protocol have extremely strong state coupling, even for Codex you should put them in the same agent.
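The two topologies can be sketched as simple partition functions. Contract and dimension names are placeholders, and the agent-assignment shape is an assumption for illustration:

```python
# Sketch of the two agent topologies: group-by-contract (suits Claude) vs
# group-by-dimension (suits Codex).

def group_by_contract(contracts, dimensions):
    """One agent per contract, seeing all dimensions: free cross-dimension association."""
    return [{"agent": c, "code": [c], "checks": list(dimensions)} for c in contracts]

def group_by_dimension(contracts, dimensions):
    """One agent per dimension, seeing all contracts: cross-contract comparison."""
    return [{"agent": d, "code": list(contracts), "checks": [d]} for d in dimensions]
```

In the second topology, the agent responsible for, say, balance checks sees every contract at once and can ask "Contract A did this check, did Contract B?" without any spontaneous association being required.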

Vulnerability PoC Verification

  • Claude's natural advantage is causal reasoning. It excels at explaining "why this is a bug" and "what happens if there is/isn't this bug." Leverage this: require Claude's PoC to include both "proving the vulnerability exists" and "proving correct behavior" tests—it excels at constructing such causal contrasts.

One article's core finding: AI programming agents discover only 13% of production-grade critical bugs when self-reviewing code, yet when you directly ask "what could go wrong," they can provide correct diagnoses 100% of the time.

  • Codex's natural advantage is iterative debugging. Its reinforcement training makes it excel at the "run → fail → diagnose → fix → retry" loop. Leverage this: design PoC verification as a test loop, allowing multiple fix attempts. Further: let Codex write a test for every SAFE conclusion—if it thinks "there's balance check protection here," have it write a test using a fee-on-transfer mock to prove this protection really exists. If the test fails, the SAFE conclusion is automatically overturned.

This is Codex's unique advantage: transforming static analysis into dynamic verification. Claude finds this difficult because its thinking mode is "reasoning to conclusions" rather than "testing to verify conclusions."
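The "test for every SAFE conclusion" loop might be sketched as follows. The function and its retry policy are illustrative assumptions, not an existing tool:

```python
# Hypothetical sketch: a SAFE verdict stands only if a proving test passes.
# The retry loop mirrors the run -> fail -> diagnose -> fix -> retry style.

def verify_safe_conclusion(conclusion, test, fix=None, max_attempts=3):
    """Run the proving test; on failure, optionally apply a fix and retry.
    If the test never passes, the SAFE conclusion is overturned."""
    for _ in range(max_attempts):
        if test():
            return {"conclusion": conclusion, "status": "CONFIRMED"}
        if fix is None:
            break
        fix()                         # e.g. repair the test harness and retry
    return {"conclusion": conclusion, "status": "OVERTURNED"}
```

For the fee-on-transfer example, `test` would run the mock-based check that the balance protection actually fires; a test that never passes automatically overturns the "SAFE" label instead of letting it stand on inference alone.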

Conclusion

As AI capabilities update by the day and the AI-ization of business processes sweeps in like a tide, we all find ourselves inside this wave, using AI to replace our own work while worrying about being replaced by AI. This understandably causes anxiety, resistance, and confusion. But precisely when you don't know where the future path lies, you should maintain a mindset ready to adapt at any moment. When the wave hits, what we should do is throw ourselves into it and surf with it, rather than cling desperately to the reef.