OpenAI Codex Security in Practice: A One-Month Evaluation on TorMap
1. Context and motivation
I have been closely following research on LLM-based vulnerability detection since my first project in this area, vuldra, in January 2024. Vuldra aimed at source-code-based vulnerability detection, combining traditional SAST tooling with OpenAI's GPT models of the time. In June 2024 I started a second research project, AutoPentest, which focuses on black-box exploitation, and later published a paper on it.
The results reported in this post reflect a single evaluation on one open-source codebase by a single reviewer (me). No ground-truth vulnerability set exists for TorMap, so only precision can be measured. All figures should be interpreted accordingly, as a directional signal rather than a rigorous benchmark.
Rise of LLM-based code security scanning tools
From October/November 2025 onward, coding agents (Claude Code, GitHub Copilot, OpenAI Codex, Gemini CLI) and their
underlying LLMs became increasingly good at autonomously completing real-world software-engineering
tasks (SWE-rebench).

Shortly after, major frontier AI labs started announcing that they were working on LLM-based code security scanning tools.
These labs all seem to have come to a similar conclusion of what makes such a security tool useful (frontier LLM agent(s) + analysis tools + automated patches + human review). Looking at the earlier results in code generation, tools like Codex Security, Claude Code Security and CodeMender are essentially the existing agent paradigm pointed directly at application security.
We also see frontier AI models getting increasingly good at vulnerability analysis on the
well-recognized CyberGym benchmark.

I was eager to try out one of these tools, so I chose the one that currently seemed to have the least friction for an individual to get access to: OpenAI Codex Security. I got access to the research preview at the end of April 2026 and have been using it for a month now.
The target project: TorMap
As the target codebase for evaluation, I chose one of my own open source projects, TorMap. The Tor network currently consists of thousands of relays which route anonymous internet traffic daily. TorMap is a world map displaying approximate locations where Tor relays are being hosted. With the website you can group, filter and analyze Tor relays. The historic state of the network can be viewed for any day between October 2007 and today.

The application is implemented using a backend Spring Boot webserver in Kotlin and a React frontend in TypeScript.
Languages and LoC (tokei)
| Language | Files | Lines | Code |
|---|---|---|---|
| Java | 96 | 19299 | 11630 |
| Kotlin | 59 | 3431 | 2783 |
| TSX | 33 | 2603 | 2223 |
| TypeScript | 13 | 979 | 782 |
| JSON | 7 | 53648 | 53648 |
| Markdown | 6 | 1126 | 0 |
| YAML | 5 | 167 | 154 |
| SQL | 3 | 89 | 79 |
| Plain Text | 3 | 679 | 0 |
| XML | 2 | 61 | 50 |
| Batch | 1 | 94 | 73 |
| CSS | 1 | 56 | 50 |
| Dockerfile | 1 | 16 | 11 |
| HTML | 1 | 47 | 33 |
| JavaScript | 1 | 53 | 47 |
| Shell | 1 | 251 | 106 |
| SVG | 1 | 72 | 72 |
| Total | 234 | 82673 | 71742 |
2. Methodology
Create Scan
After completing the short onboarding process for Codex Security, I connected it to my GitHub repository containing the
TorMap codebase. I then created the first scan, providing brief scope guidance for the threat model.

Configure Codex Environment
Next, I configured the Codex environment in which it scans the code and tests patches. I left this
at the default configuration, although a custom config would likely allow you to optimize the execution environment.

Start Scan
Next, I started the scan and waited for the results to come in. This can take several hours, as Codex first generates a threat model and then goes through the history commit by commit to check whether any commit introduced a vulnerability. During scan creation, it reports that it will scan the commit history of the past 2 months and may also scan earlier commits beyond that until the 1K cap is reached. Scans are capped at 10K commits.
Triage and Fix Workflow
Once results came in, I followed this workflow for each finding:
- Review finding description
- Review generated patch (request one if not generated yet)
- If false positive / already fixed / won't fix, close finding with a comment
- If true positive, let Codex create a pull request or create one manually for complex cases
- Let GitHub Copilot review the pull request and suggest improvements
- Perform manual review ironing out any issues and ensuring CI tests pass, then merge the PR
I considered a finding a true positive if it represents exploitable or realistically harmful behavior under the defined threat model. This criterion is inherently judgment-dependent: a different security practitioner reviewing the same findings might classify borderline cases differently.
The Codex Security UI presents you with a filterable list of findings on the left side, with details for the selected
finding on the right. The details include severity, related commit, summary, validation, evidence with vulnerable code
snippets and attack-path analysis. Available actions are viewing the suggested patch, requesting a patch if none is available yet
and creating a pull request with the patch in GitHub. You can also close the finding or chat with Codex about it,
which lets you instruct it to run further tests.

3. Results: Threat model generated by Codex Security
Let's take a look at the threat model that Codex generated for the TorMap codebase:
Intro
TorMap is a public data platform that ingests Tor relay descriptors and exposes read-only APIs for map visualizations
and relay/family detail lookups. The backend (backend/) is the security-critical component: a Spring Boot service that
continuously downloads descriptor archives from Tor Project Collector, parses and enriches them, persists results in
PostgreSQL, and serves public HTTP endpoints consumed by the React frontend.
From a security perspective, TorMap is unusual: there is little classic user-account logic, but high exposure to untrusted data and availability pressure. Most of the backend’s inputs are either:
- Internet client API traffic (direct attacker-controlled), or
- descriptor content originating from Tor relay operators (indirectly attacker-controlled via Tor network publication, then rehosted by Collector).
Primary assets are
- Service availability (API responsiveness, scheduler stability, DB health).
- Data integrity (correct relay/family/location mapping).
- Admin plane integrity (actuator endpoints and admin credentials).
- Host and filesystem safety (descriptor download directory, password file).
- Operational secrets (admin password, DB credentials, New Relic ingest key).
Code areas with highest security impact include
- Auth/admin exposure: backend/src/main/kotlin/org/tormap/config/SecurityConfig.kt, backend/src/main/resources/application.yml
- Public API handling: backend/src/main/kotlin/org/tormap/adapter/controller/
- Descriptor ingestion and parsing: backend/src/main/kotlin/org/tormap/service/Descriptor*, backend/src/main/java/org/torproject/descriptor/index/, backend/src/main/java/org/torproject/descriptor/impl/DescriptorReaderImpl.java
- Scheduling/concurrency/cache: backend/src/main/kotlin/org/tormap/service/SchedulingService.kt, CacheService.kt
- Deployment defaults: backend/production.yml
System trust boundaries
| Boundary | Inputs crossing boundary | Control level | Security impact |
|---|---|---|---|
| Public HTTP API (/relay/**) | Path params (day, id), JSON arrays of IDs | Attacker-controlled | DoS, query amplification, cache thrash, malformed input handling |
| Descriptor ingestion from Collector | index.json, descriptor files/tarballs, metadata (path, file names, sizes, modified times) | Semi-trusted / adversarial if upstream compromised | Path traversal/file overwrite risk, resource exhaustion, data poisoning |
| Tor relay-published descriptor fields | Nicknames, contact, family entries, protocol/platform metadata | Indirect attacker-controlled | Parser edge-case handling, DB/storage pressure, UI data safety |
| Admin plane (/login, /actuator/**) | Auth attempts, actuator requests | Attacker-controlled externally; operator-controlled credentials | Admin takeover, sensitive diagnostics exposure, service disruption |
| Local config and runtime env | application.yml, env vars, password file path, datasource URL/creds | Operator-controlled | Misconfiguration can dominate practical risk |
| CI/build inputs | Dependencies, workflow actions, gradle/yarn tooling | Developer-controlled | Supply-chain compromise at build/deploy time |
Assumptions used for risk calibration
- The API is intentionally public and mostly read-only; confidentiality of relay data is low compared to availability/integrity.
- Deployment likely places backend behind TLS termination/reverse proxy, but the app itself does not enforce HTTPS by default.
- Tor Collector is treated as trusted in normal operation, but compromise/DNS tampering must be considered for high-impact scenarios.
- Single-instance scheduling is implied; no distributed locking exists for multi-instance deployments.
- Frontend users are untrusted, but frontend does not directly allow arbitrary HTML/script injection in normal rendering paths.
Existing controls observed
- Actuator paths are explicitly set to authenticated in SecurityConfig.kt.
- Random admin password generation on first start with persistence to file (spring.security.user.passwordFile).
- DB uniqueness/indexing controls in migrations (V1__Create_tables.sql) reduce duplicate corruption and support query performance.
- Parameterized JPQL/native query binding reduces SQL injection risk (RelayDetailsRepositoryImpl.kt, RelayLocationRepositoryImpl.kt).
- Caching and cache warming/eviction logic reduce DB pressure under normal usage.
- CI uses pinned GitHub Action SHAs and Gradle wrapper validation, improving supply-chain robustness.
[...]
Attacker stories (realistic scenarios)
- Unauthenticated API flooder sends massive POST bodies and rapid day queries to exhaust memory/DB and degrade public API latency.
- Malicious relay operator publishes descriptor edge-case values that trigger repeated parser/storage errors, causing repeated processing and scheduler backlog.
- Compromised upstream Collector/index serves crafted paths/metadata or huge compressed files, aiming at filesystem impact and process memory exhaustion.
- Credential exposure attacker obtains admin password from logs/password file and abuses actuator endpoints (heapdump, logfile) for sensitive runtime data and service disruption.
- Misconfiguration opportunist scans deployments using default compose credentials/open DB port and takes over PostgreSQL.
Vulnerabilities that are less critical in this repo’s context
- Classic horizontal privilege escalation is less relevant: there are no normal user accounts/tenants.
- Many confidentiality leaks of relay metadata are lower impact because data is intentionally public.
- SQL injection is less likely due parameterized repository patterns; injection would require introducing new raw concatenated SQL paths.
Criticality calibration
| Severity | Definition |
|---|---|
| Critical | Vulnerabilities that enable full admin/host compromise, severe data integrity loss, or sustained service outage with low attacker effort. [...] |
| High | Vulnerabilities with major availability/integrity impact, but requiring more conditions or sustained effort. [...] |
| Medium | Vulnerabilities with bounded impact, partial preconditions, or mostly operational abuse. [...] |
| Low | Issues with limited exploitability or minor security consequences in this public-data context. [...] |
The generated threat model is comprehensive and aligns well with my own understanding of the TorMap codebase. The risk calibration also seems reasonable given the context.
4. Results: Summary
Now we take a look at the results of Codex Security scans over the past month and my manual triaging efforts.
Codex identified 0 critical, 12 high, 25 medium and 8 low severity issues across the codebase after
scanning 1015/1318 historical commits on branch master. In my case, the 1015 scanned commits were very close to the
1K cap that Codex reportedly applies beyond the last 2 months of commits. This means that code not touched by
these commits is out of coverage and will likely not produce findings.

Triaging Results
After triaging all findings, I classified 32/37 (~86 %) as true positives needing a fix. For some of these, Codex detected the implemented fix quickly and auto-closed the finding, while for others I had to close the finding manually after fixing it.
For 4/37 (~11 %) true positives I decided not to apply a fix, as they do not pose a significant risk in the threat model of the TorMap project and I accept the remaining risk. These can be seen as low-severity issues that Codex initially rated too high.
Only 1 of 37 findings (~3 %) was a clear false positive. In this case, Codex suggested turning off the scheduled async enrichment of data to mitigate the risk of overlapping tasks. In reality the tasks could not overlap, because Spring's @Scheduled annotation was used with a fixed delay.
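To illustrate why a fixed delay rules out overlapping runs, here is a minimal sketch that simulates the scheduling semantics in plain Python. It is not Spring code and not TorMap's scheduler; the function names and the run durations are hypothetical. It only assumes the documented fixed-delay behavior, where the delay is measured from the end of one invocation to the start of the next.

```python
def fixed_delay_runs(durations, delay):
    """Return (start, end) times for runs scheduled with a fixed delay.

    Each run starts `delay` time units after the previous run has finished,
    which models fixed-delay scheduling (delay counted from completion).
    """
    runs = []
    clock = 0.0
    for duration in durations:
        start = clock
        end = start + duration
        runs.append((start, end))
        clock = end + delay  # next start is measured from completion
    return runs


def has_overlap(runs):
    """True if any run starts before the previous run has ended."""
    return any(nxt[0] < prev[1] for prev, nxt in zip(runs, runs[1:]))


# Hypothetical run durations in seconds, including one very slow run.
runs = fixed_delay_runs([2.0, 30.0, 1.5], delay=5.0)
print(runs)
print("overlap:", has_overlap(runs))
```

Even the slow 30-second run only pushes the next start further out; it never causes two runs to be active at the same time.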
Observed precision on this sample is 36/37 = ~97.3%, subject to small-sample bias (n=37).

In my prior experience with traditional SAST tools on similar JVM projects, out-of-the-box precision tended to be lower without rule tuning. Results can of course vary significantly by tool and codebase.
No conclusions about recall or detection coverage of Codex can be drawn, as no ground-truth vulnerability set exists for TorMap. But the tool does not scan the whole commit history, limiting coverage by design.
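To make the small-sample caveat on the observed precision (36/37) more concrete, the following sketch computes a 95 % Wilson score interval. The choice of the Wilson interval and the helper name wilson_interval are my own assumptions for illustration; this is not part of how Codex Security reports results.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 ~ 95 %)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Counts from the triage above: 32 fixed + 4 accepted-risk true positives.
true_positives = 32 + 4
total = 37

low, high = wilson_interval(true_positives, total)
print(f"precision = {true_positives / total:.3f}")
print(f"95% Wilson interval = [{low:.3f}, {high:.3f}]")
```

With n=37 the interval is fairly wide (roughly 0.86 to 0.995), which is why the precision figure should be read as a directional signal only.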
Pull Requests
In total, I merged 17 PRs over the course of ~1 month to fix the critical/high/medium issues raised by Codex.
Using Codex's patch suggestions and PR generation capabilities, I opened 10 PRs directly via Codex. I was able to merge 3 PRs generated by Codex without modification, while the rest required some adjustments after GitHub Copilot and manual review.
I manually opened 7 PRs as I was not satisfied with Codex's suggested patch or it required more complex refactoring.
Overall, ~59% of fixes involved Codex-generated patches, though most required refinement:
- Codex-generated and merged unchanged: 3/17 PRs (~18%)
- Codex-generated, human/AI-refined: 7/17 PRs (~41%)
- Fully manual: 7/17 PRs (~41%)
5. Results: Example Finding
High: Ref-name deploy check allows tag-based prod deployment bypass
The following finding is a nice example where Codex recognized an unintended live deployment that could be triggered
by a GitHub workflow when pushing a tag named master on any branch. It generated a PR that I could merge without
modification:
https://github.com/TorMap/tormap/pull/458
Although the repo has only a small circle of collaborators who could run this workflow with secrets, it is always good to reduce the attack surface.
Summary by Codex:
- Verified the vulnerability is still present in HEAD (in this repo it appears in .github/workflows/frontend-push.yml): deployment steps were gated only by github.ref_name, which allows tag/branch short-name confusion on push events. I then applied a minimal fix by requiring branch refs explicitly in both deploy guards.
- Updated dev deploy condition from github.ref_name == 'dev' to github.ref_type == 'branch' && github.ref_name == 'dev'.
- Updated production deploy condition from github.ref_name == 'master' to github.ref_type == 'branch' && github.ref_name == 'master'.
- Committed the change on the current branch (ab275c9) and created a PR message via the make_pr tool, as requested.
Patch by Codex (simplified):
Replaced:
if: github.ref_name == 'master'
With:
if: github.ref_type == 'branch' && github.ref_name == 'master'
I wanted to show this particular finding, as it is a good example of an issue I could quickly understand, manually reproduce and fix with the generated patch (unmodified).
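The short-name confusion behind this finding can be sketched in a few lines of Python. This is a simplified model, not GitHub's implementation: it assumes that on push events the full ref is either refs/heads/<name> or refs/tags/<name>, that github.ref_name is the short name and that github.ref_type is branch or tag. The function names parse_ref, old_guard and new_guard are hypothetical.

```python
def parse_ref(ref):
    """Split a full Git ref into (ref_type, ref_name), simplified model."""
    if ref.startswith("refs/heads/"):
        return "branch", ref[len("refs/heads/"):]
    if ref.startswith("refs/tags/"):
        return "tag", ref[len("refs/tags/"):]
    raise ValueError(f"unsupported ref: {ref}")


def old_guard(ref):
    """Models: if: github.ref_name == 'master'"""
    _, ref_name = parse_ref(ref)
    return ref_name == "master"


def new_guard(ref):
    """Models: if: github.ref_type == 'branch' && github.ref_name == 'master'"""
    ref_type, ref_name = parse_ref(ref)
    return ref_type == "branch" and ref_name == "master"


for ref in ("refs/heads/master", "refs/tags/master", "refs/heads/dev"):
    print(f"{ref}: old={old_guard(ref)} new={new_guard(ref)}")
```

In this model, a tag named master passes the old guard but is rejected by the new one, while a push to the master branch deploys in both cases.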
6. Limitations and threats to validity
My single codebase is not representative of the broader open source ecosystem. Triaging findings as a human is not always objective, and different experts might triage slightly differently. Therefore, all results of this experiment should be taken with a grain of salt.
I identified the following limitations when working with Codex Security for my specific codebase:
One noticeable issue is the latency and user experience. When setting up a scan for the first time, there can be multiple hours of wait before the first findings appear. This delays feedback and slows down the development cycle. In addition, the user interface is not always stable or consistent. Errors occurred for me when viewing or setting up scans, and actions such as closing findings sometimes behaved unpredictably. This makes it harder to trust the tool and adds friction to routine tasks.
The tool did not automatically detect how to use my project's build/test/runtime setup, which means it relied heavily on static code analysis. As a result, it could not easily validate whether vulnerabilities are actually exploitable in practice or whether fixes work correctly at runtime. This might have improved if I had spent more time on configuring the Codex environment (custom setup script, preinstalled container packages, etc.).
The quality and robustness of automatically generated fixes is also limited. While the tool can generate pull requests for security issues, these fixes tend to be minimal and narrowly focused on resolving the reported vulnerability. They do not account for potential side effects, such as breaking existing functionality, failing tests, or introducing new issues. Developers still need to carefully review and often revise these changes before merging.
There are also challenges related to repository synchronization and workflow integration. Generated pull requests are often based on branches that are not kept up to date with the target branch. This can quickly lead to merge conflicts or outdated fixes. Additionally, the tool often does not automatically detect when a pull request has been merged fast enough. As a result, developers must manually close findings, which adds unnecessary manual work and increases the risk of inconsistencies between the tool and the repository state. I also observed a few duplicate findings.
The tool also has limitations in how it models risk and project context. It tended to assume a relatively high-risk threat model without sufficiently considering project-specific factors. This led to some overstated risks or less relevant findings. Providing more context through the scan config could improve this.
Another issue is the way the tool uses historical data. It analyzes part of the commit history to identify when vulnerabilities were introduced, but this can cause confusion when older architectural elements are referenced that are no longer part of the current system. For example, it may flag issues related to components like an old database that is no longer in use, which reduces the practical usefulness of the findings.
Finally, the tool’s automated patch suggestion is limited. Patches are suggested automatically only for critical and high-severity issues, and only after some time. For medium, low, or informational findings, developers must manually trigger patch suggestions, so valuable context is missing during initial triage.
7. Conclusion and future work
Codex Security delivered high-precision results with meaningful remediation support. For me the most impactful things about the product are the comprehensive finding description/evidence and how fast you go from initial finding triage to fix with the suggested patch. I will likely use it again in the future for personal projects and would recommend others to try it out on their codebase.
In the future I would like to see the most improvements in patch suggestions and continuous finding/PR sync with git HEAD.
For me, the experiment highlighted that automated AI-based discovery of code security issues worked well on this codebase, while triaging and remediation still required a considerable amount of human work.
I currently view this type of tool as an addition to traditional SAST, but not a replacement. I am interested to try out Claude Code Security, Google CodeMender and Microsoft MDASH on personal projects.