
Add data extension prompts, templates, and barrier/barrierGuard support #42

Open
felickz wants to merge 2 commits into advanced-security:main from forks-felickz:feat/data-extension-prompts

Contributor

@felickz felickz commented Apr 21, 2026

Summary

Add comprehensive CodeQL data extension (Models as Data) development guidance as Copilot prompts, issue template, and PR template.

Sample MaD models created

see usage

see usage

What's included

10 new files:

| File | Description |
| --- | --- |
| `.github/prompts/data_extensions_development.prompt.md` | Common guidance: core principles, threat models, model formats (API Graph vs MaD), CLI references |
| `.github/prompts/cpp_data_extension_development.prompt.md` | C/C++ MaD format (9-10 col tuples), pointer indirection (`Argument[*n]`), namespace-based identification |
| `.github/prompts/csharp_data_extension_development.prompt.md` | C# MaD format, fully qualified signatures, property getter/setter naming (`get_`/`set_`) |
| `.github/prompts/go_data_extension_development.prompt.md` | Go MaD format, package versioning, package grouping, `Argument[receiver]` |
| `.github/prompts/java_data_extension_development.prompt.md` | Java/Kotlin MaD format, subtypes flag, VS Code model editor reference |
| `.github/prompts/javascript_data_extension_development.prompt.md` | JS/TS API Graph format (3-5 col tuples), Fuzzy, GuardedRouteHandler, typeModel |
| `.github/prompts/python_data_extension_development.prompt.md` | Python API Graph format, API graph verification queries, builtins type |
| `.github/prompts/ruby_data_extension_development.prompt.md` | Ruby API Graph format, `Method[]` access paths, `!` suffix for class references |
| `.github/ISSUE_TEMPLATE/data-extension-create.yml` | GitHub issue template for requesting new data extensions |
| `.github/PULL_REQUEST_TEMPLATE/data-extension-create.md` | PR template for data extension contributions |
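
For readers unfamiliar with the two formats named in the table, the sketch below contrasts a MaD-style row with an API Graph-style row. Both rows are modeled on examples from the public CodeQL documentation, not taken from the prompt files in this PR, so treat the column annotations as something to confirm against the per-language docs.

```yaml
# MaD format (Java/Kotlin): a 9-column sinkModel tuple.
# Columns: package, type, subtypes, name, signature, ext, input, kind, provenance
extensions:
  - addsTo:
      pack: codeql/java-all
      extensible: sinkModel
    data:
      - ["java.sql", "Statement", True, "execute", "(String)", "", "Argument[0]", "sql-injection", "manual"]
```

```yaml
# API Graph format (JavaScript/TypeScript): a 3-column sinkModel tuple.
# Columns: type, access path, kind
extensions:
  - addsTo:
      pack: codeql/javascript-all
      extensible: sinkModel
    data:
      - ["execa", "Member[shell].Argument[0]", "command-injection"]
```

The MaD row identifies a callable by package, type, and signature, while the API Graph row starts from a package or type name and walks an access path to the relevant value.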

Barrier and Barrier Guard support (CodeQL 2.25.2+)

All prompts include the new barrierModel (sanitizers) and barrierGuardModel (validators) extensible predicates announced in the April 21, 2026 changelog:

  • barrierModel: Stops taint flow at the modeled element for a specified query kind (e.g., HTML-escaping prevents XSS)
  • barrierGuardModel: Stops taint flow when a conditional check returns an expected boolean value (e.g., URL validation prevents open redirects)
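
A rough sketch of what the two predicates might look like for the Java examples named later in this description. The column order is an assumption (sink-like columns, with an extra accepted-value column for the guard) and is not confirmed by this PR; check the CodeQL 2.25.2 documentation before relying on it.

```yaml
extensions:
  # Assumed layout: package, type, subtypes, name, signature, ext, output, kind, provenance
  - addsTo:
      pack: codeql/java-all
      extensible: barrierModel
    data:
      # File.getName() returns only the last path segment, so its result is
      # treated as safe for path-injection queries.
      - ["java.io", "File", True, "getName", "()", "", "ReturnValue", "path-injection", "manual"]
  # Assumed layout: as above, with the accepted boolean value placed before kind.
  - addsTo:
      pack: codeql/java-all
      extensible: barrierGuardModel
    data:
      # Assumed reading: when isAbsolute() returns false, the checked URI is
      # treated as safe for request-forgery queries.
      - ["java.net", "URI", True, "isAbsolute", "()", "", "Argument[this]", "false", "request-forgery", "manual"]
```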

Each language prompt includes barrier/barrier guard examples from the official CodeQL docs:

  • C++: mysql_real_escape_string (SQL injection barrier), is_safe (barrier guard)
  • C#: HttpRequest.RawUrl (URL redirection barrier), Uri.IsAbsoluteUri (barrier guard)
  • Go: beego Htmlquote (HTML injection barrier), IsSafe (barrier guard)
  • Java: File.getName() (path injection barrier), URI.isAbsolute() (request forgery barrier guard)
  • JavaScript: encodeURIComponent (HTML injection barrier), isValid (barrier guard)
  • Python: html.escape (HTML injection barrier), Django url_has_allowed_host_and_scheme (barrier guard)
  • Ruby: Mysql2::Client#escape (SQL injection barrier), Validator.is_safe (barrier guard)

References

Add comprehensive CodeQL data extension development guidance:
- Common prompt with core principles, threat models, and CLI references
- Language-specific prompts for C++, C#, Go, Java/Kotlin, JS/TS, Python, Ruby
- Issue template and PR template for data extension workflow
- barrierModel (sanitizers) and barrierGuardModel (validators) support across all languages (CodeQL 2.25.2+)
@felickz felickz requested review from a team, data-douser and enyil as code owners April 21, 2026 17:12
@felickz felickz requested a review from coadaflorin April 21, 2026 17:29
@felickz felickz added this pull request to the merge queue Apr 21, 2026
@felickz felickz removed this pull request from the merge queue due to a manual request Apr 21, 2026
@felickz felickz mentioned this pull request Apr 21, 2026
Collaborator

@data-douser data-douser left a comment

Great work — dedicated, language-specific models-as-data guidance with barrier/barrierGuard coverage aligned to CodeQL 2.25.2 is exactly what this repo needs. The YAML examples, API Graph vs MaD format documentation, and real-world samples (HTTP4k, Apache Camel, Databricks, Undertow) are excellent.

Key concern: Several places in the prompts and templates use language that implies the goal is to write a new CodeQL query (.ql file), when the primary value of models-as-data is that you only need simple YAML. This framing risks misleading LLMs — especially Copilot Cloud Agent — into scaffolding QL code when they should be creating/updating .model.yml files and/or publishing model packs.

The three primary use cases that need better coverage:

  1. Creating a new .model.yml for an unmodeled library (partially covered; needs an end-to-end procedural workflow including both the repo-level .github/codeql/extensions/ path and the model pack path)
  2. Updating an existing .model.yml — adding new sinks/sources/barriers to an already-modeled library (not covered at all)
  3. Publishing a model pack to GHCR for org-wide Default Setup (referenced but not walked through as a workflow; see org-level model packs and extending coverage for all repos in an org)
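
To make the first two use cases concrete, here is a minimal sketch of a repo-level model file. The file name and the `com.example.http` library are hypothetical, invented only for illustration; the tuple layout follows the documented Java MaD format.

```yaml
# Hypothetical file: .github/codeql/extensions/example-http.model.yml
# "com.example.http" is an invented library, not a real dependency.
extensions:
  - addsTo:
      pack: codeql/java-all
      extensible: sourceModel
    data:
      # Use case 1: row written when the library was first modeled.
      - ["com.example.http", "Request", True, "getParam", "(String)", "", "ReturnValue", "remote", "manual"]
      # Use case 2: row appended later to cover an additional API.
      - ["com.example.http", "Request", True, "getHeader", "(String)", "", "ReturnValue", "remote", "manual"]
  - addsTo:
      pack: codeql/java-all
      extensible: sinkModel
    data:
      - ["com.example.http", "Response", True, "redirect", "(String)", "", "Argument[0]", "url-redirection", "manual"]
```

Neither use case involves a `.ql` file: creating a model means adding a file like this one, and updating a model means appending rows to an existing `data` list.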

Opened #44 to track adding .github/skills/{create,publish}-model-pack/ agent skills as a follow-up to provide the procedural workflows for these use cases.

See inline comments for specifics and typo fixes.

@@ -0,0 +1,143 @@
name: Request new CodeQL Data Exension
description: Request a new CodeQL query for detecting specific code patterns
title: "[Data Extension Create]: "

This description says "Request a new CodeQL query" — but the whole point of data extensions is that you don't write a new CodeQL query. An LLM (especially Copilot Cloud Agent) reading this will anchor on "new CodeQL query" and may attempt to scaffold a .ql file instead of a .model.yml file.

Suggest:

description: Request a new CodeQL data extension (models-as-data) for an unmodeled library or framework

@@ -0,0 +1,143 @@
name: Request new CodeQL Data Exension

Typo: "Exension" → "Extension"

description: Which programming language should this query target?
options:
- actions
- cpp

The actions language is listed as an option, but there's no corresponding actions_data_extension_development.prompt.md in this PR. If Actions doesn't support models-as-data, remove it from this dropdown to avoid confusing agents. If it does, it needs a prompt file.

This prompt provides common guidance for developing CodeQL data extensions across all supported languages, while language-specific prompts reference this common guidance and add language-specific details.

## Product Documentation


The prompts are heavily oriented toward creating a brand new model from scratch, but the most common real-world workflows aren't well represented. Consider adding a "## Common Workflows" section covering:

  1. Creating a new .model.yml file — end-to-end: identify library → create YAML → test with --additional-packs → validate results
  2. Updating an existing .model.yml file — adding rows to an already-modeled library (where to find existing models, how to add without breaking, re-testing)
  3. Publishing updates to an existing model pack — versioning, codeql pack publish, and configuring the pack for Default Setup across an org

These three use cases are the primary value proposition of models-as-data, and an agent needs explicit procedural guidance for each.
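
As a starting point for the publishing workflow, a model pack's `qlpack.yml` might look like the sketch below. The pack name, version, and `models/` directory are placeholders, and the keys shown are the ones the CodeQL pack documentation describes for model packs; verify them against the current docs before publishing.

```yaml
# Hypothetical qlpack.yml for a model pack; name and version are placeholders.
name: my-org/java-example-models
version: 0.0.1            # bump before each new publish
library: true
extensionTargets:
  codeql/java-all: "*"    # the pack whose extensible predicates are extended
dataExtensions:
  - models/**/*.yml       # data extension files shipped in this pack
```

With a file like this in place, `codeql pack publish` uploads the pack to the container registry, after which it can be referenced from the organization's Default Setup configuration.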

#### Default behavior

By default, only the **`remote`** threat model is enabled. This means only sources marked with `kind: "remote"` are active. To include local sources, you must explicitly enable additional threat models via `--threat-model` on the CLI or in the code scanning configuration.
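
A short sketch of how the threat model setting and the `kind` column interact. The first fragment is a code scanning configuration that enables local sources; the second shows two source rows, one remote and one local. The `com.example.config` row is hypothetical, and the `threat-models` key should be checked against the current code scanning docs.

```yaml
# CodeQL configuration file fragment: enable local sources in addition to remote.
threat-models: local
```

```yaml
extensions:
  - addsTo:
      pack: codeql/java-all
      extensible: sourceModel
    data:
      # Active by default: kind "remote".
      - ["java.net", "Socket", False, "getInputStream", "()", "", "ReturnValue", "remote", "manual"]
      # Active only when a local threat model is enabled: kind "file".
      # Hypothetical library used for illustration.
      - ["com.example.config", "ConfigLoader", True, "readFile", "(String)", "", "ReturnValue", "file", "manual"]
```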


Typos: "organizaiton" → "organization", and in the Development section further down, "easilly" → "easily".


### Threat Models

Threat models control which `sourceModel` entries are active during analysis. The `kind` column of a `sourceModel` determines its threat model category.

This header says "Query Quality Criteria" but the body talks about models/extensions. Should be "Model Quality Criteria" or "Extension Quality Criteria" to avoid reinforcing query-writing framing for agents.

## Product Documentation

- [Extending coverage for a repository](https://docs.github.com/en/code-security/how-tos/scan-code-for-vulnerabilities/manage-your-configuration/editing-your-configuration-of-default-setup#extending-coverage-for-a-repository) - `.github/codeql/extensions directory` for local model pack refrences (does not need a qlpack.yml)
- [Extending coverage for all repositories in an organization](https://docs.github.com/en/code-security/how-tos/scan-code-for-vulnerabilities/manage-your-configuration/editing-your-configuration-of-default-setup#extending-coverage-for-all-repositories-in-an-organization) - publishing model packs and referencing them globally (must be done click button in UI)

Typo: "refrences" → "references"

Model packs can be used to expand code scanning analysis at scale. Model packs use data extensions, which are implemented as YAML and describe how to add data for new dependencies. When a model pack is specified, the data extensions in that pack will be added to the code scanning analysis automatically.

Generally each language will allow customization of the following extensible prdicates:


Typo: "prdicates" → "predicates"


For general CodeQL data extension model development guidance, see [Common Data Extension Development](./data_extensions_development.prompt.md).
For general CodeQL query development guidance, see [Common Query Development](./query_development.prompt.md).


The cross-reference to query_development.prompt.md is prominently placed as the second line of every language prompt. For a data extension task, the agent should not need query development guidance — and this framing may cause an LLM to treat QL query writing as part of the expected workflow.

Consider moving this to the bottom under "Additional References" (where it already appears), or qualifying it: "If you need to write a custom CodeQL query instead of a data extension, see..." — making it clear data extensions are the primary path and QL queries are a fallback.

(Same feedback applies to all seven language-specific prompts.)

### Python Documentation

- [Customizing Library Models for Python](https://codeql.github.com/docs/codeql-language-guides/customizing-library-models-for-python/)
- Can also be found at [Customizing Library Models for Python Docs](https://github.com/github/codeql/blob/main/docs/codeql/codeql-language-guides/customizing-library-models-for-python.rst)

Typo: "acess" → "access"

