Skip to content

feat: add LLM semantic security review for SKILL.md scanner (QEMETRICS-2415)#43

Open
Benkapner wants to merge 1 commit into
mainfrom
QEMETRICS-2415/security-llm-review
Open

feat: add LLM semantic security review for SKILL.md scanner (QEMETRICS-2415)#43
Benkapner wants to merge 1 commit into
mainfrom
QEMETRICS-2415/security-llm-review

Conversation

@Benkapner

Copy link
Copy Markdown
Collaborator

Summary

Adds LLM-based semantic security review to the SKILL.md scanner, complementing the deterministic regex checks merged in PR #42. Adapted from harness-eval-lab's security review rubric.

What the LLM checks

Three checks that catch attacks regex cannot detect:

Check What it catches
Anti-jailbreak Self-declared safety claims, evaluation bypass attempts
Semantic attacks Polite reframings of jailbreaks, gradual escalation, split-instruction attacks, conditional triggers
Description-behavior mismatch Stated purpose vs actual instruction behavior

How it works

  • Deterministic scan runs first (always)
  • LLM review runs after, appending findings to the same JSON with source: \"llm\"
  • Uses llm_client.chat_completion() (same as quality review)
  • On by default, matching Cisco's use-llm=true convention
  • Disable with --no-llm flag
  • If the LLM call fails, deterministic results are preserved

Changes

File What
abevalflow/security/skillmd_scanner.py Added SECURITY_REVIEW_PROMPT, _extract_json(), llm_security_review()
abevalflow/gates/security/base.py Added .strip() to severity parsing to prevent block bypass
scripts/skillmd_security_scan.py Added --no-llm flag, LLM review runs by default
pipeline/tasks/phases/test.yaml Updated step 3 with LLM env vars and use-llm flag
tests/test_skillmd_scanner.py Added 4 mock tests for LLM review path

Changes since previous review (PR #39)

This replaces PR #39 (moved from fork to org branch). All review feedback from Guy has been addressed:

Must Fix:

  1. Rebased on latest main (includes merged PR feat: add SKILL.md security scanner gate (QEMETRICS-2414) #42 and lint cleanup fix: resolve ruff linting issues across codebase #41)
  2. Added _extract_json() to strip markdown fences from LLM JSON response before parsing
  3. Added .strip() to severity parsing in base.py to prevent block bypass from trailing whitespace
  4. Added 4 mock tests: valid findings, fenced JSON, LLM failure, invalid JSON

Should Fix:
5. Added content size limit (MAX_TOTAL_CHARS = 40_000) with truncation warning for large submissions
6. Expanded pattern allowlists: added Italian/Arabic/Hindi/Russian/Dutch/Swedish/Turkish to translate evasion; added shields.io and raw.githubusercontent to markdown image

Test plan

  • 76 tests pass (46 deterministic + 4 LLM mock + 26 existing gate registry)
  • ruff lint + format clean
  • Run pipeline test script (./scripts/misc/trigger_test_runs.sh)

Ref: QEMETRICS-2415

Replaces PR #39 (moved from fork to org branch per review feedback).">

…S-2415)

Add 3 LLM-based security checks adapted from harness-eval-lab rubric:
- Anti-jailbreak: self-declared safety claims, evaluation bypass attempts
- Semantic attacks: polite reframings, gradual escalation, split-instruction
- Description-behavior mismatch: stated purpose vs actual instruction

Addresses review feedback:
- Strip markdown fences from LLM JSON response
- Normalize severity values with .strip() to prevent block bypass
- Add content size limit (40k chars) for large submissions
- Expand pattern allowlists (Italian/Arabic/Hindi, shields.io, raw.githubusercontent)
- Add 4 mock tests for LLM review path (valid, fenced, failure, invalid JSON)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants