An AI-Written Retrospective on AI Penetration Testing

Based on a fully automated penetration test of the HTB Reactor machine by the Cairn AI Agent (2026-05-24). Small consolation, at least.

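1. Overview: The Full Picture

The test took 44 minutes, executed 19 Intents, and produced 14 Facts, and it did capture both flags. Along the way it exposed six classes of systemic problems:

#	Problem	Severity	Time wasted
P1	SSH brute-force loop (direction fixation)	Severe	~18 min
P2	Explore tasks almost always time out (execution model flaw)	Severe	+5-60s each time
P3	Reason JSON parse failure	Medium	~1-2 min per occurrence
P4	Heavily redundant output from parallel Intents	Medium	Wasted parallel slots
P5	Low value density in Bootstrap	Low	~4 min
P6	Accumulating Fact noise	Low	Higher Reason analysis cost

Unless P1 and P2 are fixed, even the easiest machine can take more than 30 minutes. Each problem is analyzed below.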

2. P1: Direction Fixation - The Agent Gets Stuck in an SSH Loop

Symptoms

After finishing web reconnaissance (f003/f004/f005/f006, 17:12 - 17:21), every subsequent Reason cycle revolved around SSH brute-forcing:

19:17:50 → i006: SSH brute-force with personnel names
19:23:58 → i009: SSH brute-force via Hydra again (expanded wordlists)
19:28:18 → Reason triggered → still proposed SSH brute-force intent

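Only at 19:28:28, when the Hint "SSH brute-force is WRONG, focus on Next.js CVEs" was injected, was the direction corrected. The Agent could not break out of this loop on its own.

Root cause

Reason lacks staleness (stalemate) detection. The current Reason prompt contains this line:

“If many intents have been explored but no new facts have been gained, reflect on whether the approach has drifted.”

But the trigger threshold of this rule is too high. The Agent did produce new Facts (f007/f009/f012, all progress reports on the SSH brute-force), and although they contained no real breakthrough, they still counted toward the facts:N->N+1 trigger condition.
Fact quality is conflated with Fact quantity. The Facts produced by SSH brute-forcing are pure progress reports ("prepared wordlists", "Hydra launched", "1,856 combinations, zero success"), not breakthroughs. As these stale progress facts made up a growing share of the graph, they reinforced the mistaken belief that SSH was a viable attack vector.
The initial wrong conclusion was amplified. f004/f005/f006 all state explicitly "SSH is the only remaining foothold vector". Once that conclusion hardened, later Reason cycles treated it as confirmed information rather than a hypothesis to verify.

Proposed improvements

A. Upgrade Reason stalemate detection (prompt level)

Add explicit loop-detection logic to the Reason prompt: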

## Stalemate Detection (NEW)
- Count how many consecutive intents target the SAME service/port/vector.
  If >= 3 intents targeting the same vector produced zero credential or shell facts,
  you MUST declare that vector EXHAUSTED and pivot to a completely different approach.
- "Progress facts" (wordlist preparation, tool setup, "attempted but not completed")
  do NOT count as positive signal. Only credential disclosure, shell access, or
  flag capture qualify as positive signal.
- If you find yourself proposing an intent that is substantially similar to a
  previously EXHAUSTED intent, you MUST explain why circumstances changed.
  If circumstances have NOT changed, do NOT propose it.
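The same rule could also be checked outside the prompt, so that it does not depend on the model following instructions. The following is only a minimal sketch under stated assumptions: the `IntentRecord` type, its `vector` label, and its `positive_signal` flag are hypothetical and not part of Cairn's real schema, and the streak is counted per vector over a chronologically ordered history.

```python
from dataclasses import dataclass


@dataclass
class IntentRecord:
    """Hypothetical record of a finished intent (not Cairn's real schema)."""
    intent_id: str
    vector: str            # assumed label for the targeted service, e.g. "ssh"
    positive_signal: bool  # True if a credential, shell, or flag fact resulted


def exhausted_vectors(history: list[IntentRecord], threshold: int = 3) -> set[str]:
    """Return vectors whose current failing streak is at least `threshold`.

    `history` is assumed to be in chronological order. A positive signal on
    a vector resets that vector's streak to zero.
    """
    streaks: dict[str, int] = {}
    for record in history:
        if record.positive_signal:
            streaks[record.vector] = 0
        else:
            streaks[record.vector] = streaks.get(record.vector, 0) + 1
    return {vector for vector, streak in streaks.items() if streak >= threshold}
```

A scheduler could pass the resulting set into the Reason prompt as an explicit list of EXHAUSTED vectors instead of relying on the model to count by itself.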

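B. Server-side intent deduplication (code level)

Add a dedup check in scheduler/loop.py or tasks/reason.py:

# Pseudocode: Intent and embedding_similarity() are placeholders, not defined here
def is_duplicate_intent(new_intent: str, history: list[Intent]) -> bool:
    """Return True if new_intent substantially repeats an already completed Intent."""
    for intent in history:
        if intent.to is not None:  # completed
            similarity = embedding_similarity(new_intent, intent.description)
            if similarity > 0.85:
                return True
    return False

For Easy machines, deduplication may be the single most effective speed-up.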

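C. Fact tagging (architecture level)

Add semantic tags to Facts to distinguish "progress" from "breakthrough":

facts:
  - id: f007
    type: progress        # SSH brute-force progress update
  - id: f014
    type: breakthrough    # CVE-2025-66478 RCE shell

When making decisions, Reason can then track only the change in the number of breakthrough facts, instead of being misled by growth in the number of progress facts.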
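As an illustration of how such tags might be consumed, here is a small sketch that compares two graph snapshots and reports only the change in breakthrough facts. The function name `breakthrough_delta` and the dict-based fact shape are assumptions for this example, and untagged facts are treated as progress.

```python
from collections import Counter


def breakthrough_delta(before: list[dict], after: list[dict]) -> int:
    """Change in the number of breakthrough facts between two snapshots.

    Each fact is assumed to be a dict with an optional "type" key;
    facts without a tag are counted as "progress".
    """
    def count(facts: list[dict]) -> int:
        tags = Counter(fact.get("type", "progress") for fact in facts)
        return tags["breakthrough"]

    return count(after) - count(before)
```

With this signal, three new progress facts yield a delta of 0, so a decision rule keyed on it would not mistake them for forward movement.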

3. P2: Explore Tasks Almost Always Time Out - A Wasteful Execution Model

Data

19:10:43  Bootstrap timed out          (240s)
19:16:26  i003/i004/i005 timed out     (240s × 3)
19:21:51  i006/i007 timed out          (240s × 2)
19:28:01  i009 timed out               (240s × 1)
19:33:46  i010/i012 timed out          (240s × 2)
19:36:53  i013/i014/i015 timed out     (240s × 3)
19:42:34  i016/i017/i018 timed out     (240s × 3)

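Summary: of the 19 Explore tasks, only 2 finished before the timeout (i002: 75s, i008: 201s). The other 17 (89%) all timed out and triggered the conclude fallback.

Root cause

stdout_preview= is empty in every case. This suggests that, within the 240s, the LLM (DeepSeek V4-Pro through an Anthropic-compatible API) goes through the following:

Receiving the context (prompt + graph snapshot): ~10-30s
Making tool calls (nmap/curl/hydra, etc.): the remaining time
The tool calls may have produced output, but because the model was still streaming tool calls, nothing was written back to stdout
When the 240s limit is reached, the dispatcher kills the process outright

The conclude fallback ended up producing all of the useful output: nearly all of f003-f019 were generated during the conclude phase (60s). In other words, the main execution phase was spent mostly on tool calls, and the structured summary was only produced in conclude.

Proposed improvements

A. Shorten the execute timeout and lengthen the conclude timeout (config level)

Current configuration: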

explore:
  timeout: 240
  conclude_timeout: 60

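Suggested change:

explore:
  timeout: 120      # tool-call phase: 2 minutes is enough for most operations
  conclude_timeout: 120  # more room for summarizing: gives the LLM longer to analyze output

Rationale:

120s of tool-call time reduces waste (if nothing has come back after 2 minutes, the direction is most likely wrong)
120s of conclude time gives the LLM enough time to analyze tool output and identify key findings
The total budget drops from 300s (240 + 60) to 240s (120 + 120) and is split more sensibly

B. Dynamic timeouts (architecture level)

# Adjust the timeout dynamically based on task type and current state.
# intent_is_complex() and has_credential() are placeholder helpers.
def dynamic_timeout(intent_type: str, facts: list[Fact]) -> int:
    if intent_is_complex(intent_type):  # e.g. "enumerate all ports"
        return 180
    if has_credential(facts):  # credentials already in hand, next step is likely quick
        return 90
    return 120

C. Streaming output capture (code level)

The current implementation captures only the final stdout. If partial results (a progress report) were written back right after each tool call, the dispatcher would already have useful content before the timeout:

# Add streaming support in tasks/common.py
# After each tool call, the Worker writes a partial fact to the server
# On timeout, the Dispatcher can use the last partial fact directly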
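To make the idea concrete, here is a minimal in-memory sketch of that write-back pattern. It is an assumption-laden illustration, not Cairn code: `PartialResultBuffer`, `record`, and `on_timeout` are hypothetical names, and a real implementation would persist entries on the server rather than in a Python list.

```python
import time
from dataclasses import dataclass, field


@dataclass
class PartialResultBuffer:
    """Hypothetical in-memory store for per-tool-call partial results."""
    entries: list[tuple[float, str, str]] = field(default_factory=list)

    def record(self, tool: str, summary: str) -> None:
        """Worker side: call right after each tool call returns."""
        self.entries.append((time.monotonic(), tool, summary))

    def on_timeout(self, max_entries: int = 5) -> str:
        """Dispatcher side: return the most recent results as text."""
        recent = self.entries[-max_entries:]
        return "\n".join(f"[{tool}] {summary}" for _, tool, summary in recent)
```

The point of the design is that a killed worker still leaves behind whatever it recorded, so the conclude phase does not start from an empty context.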

4. P3: Reason JSON Parse Failure

Symptoms

19:11:50  WARNING  reason parse failed
          error=no JSON object found in output
          stdout_preview=The JSON response has been output above...

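The LLM emitted explanatory text before the JSON ("The JSON response has been output above…"), so JSON extraction in output_parser.py failed. This triggered one full retry (53s wasted).

Root cause

The current parser probably uses a simple regex extraction (for example, matching from the first { to the last }), but the LLM emitted preamble text outside the JSON block. This is especially common with the new DeepSeek V4-Pro: the model tends to output "thinking" text before giving the JSON.

Proposed improvements

A. Parser hardening (code level)

import re
import json

def extract_json(text: str) -> dict | None:
    """Extract JSON with multiple fallback levels."""
    # Level 1: parse the whole text directly
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass

    # Level 2: extract a ```json ... ``` block
    m = re.search(r'```json\s*(.*?)\s*```', text, re.DOTALL)
    if m:
        try:
            return json.loads(m.group(1))
        except json.JSONDecodeError:
            pass

    # Level 3: extract the outermost { ... } (greedy: first "{" to last "}")
    m = re.search(r'\{.*\}', text, re.DOTALL)
    if m:
        try:
            return json.loads(m.group(0))
        except json.JSONDecodeError:
            pass

    # Level 4: try to repair common errors (e.g. trailing commas, single quotes)
    # ...

    return None

B. Prompt hardening

Add a stricter output constraint at the end of the prompt: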

# CRITICAL OUTPUT RULE
Your ENTIRE response must be ONLY the JSON object. No preamble, no markdown code fences,
no "Here is my analysis" or "The JSON has been output above".
The FIRST character of your response must be '{', the LAST character must be '}'.
If you output ANY text outside the JSON, the task will FAIL and must be retried.

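5. P4: Redundant Parallel Intents - The Web Analysis Quartet

Symptoms

The first Reason round (19:12:23) created 4 Intents, 3 of which targeted web analysis:

Intent	Task	Conclusion
i003	Web directory brute-forcing	No hidden endpoints
i004	Manual web exploration	Static decoy
i005	Vulnerability search	No public PoC

The Facts produced by these three Intents (f003/f004/f005) reach exactly the same core conclusion: "the web application is a static decoy with no exploitation value".

Root cause

The max_intents: 4 setting pushes Reason to generate 4 exploration directions, but when the attack surface has only 2 services (SSH + Web), the Agent can only split one web analysis into 3 slightly different angles (directory brute-forcing, manual exploration, vulnerability search).

Proposed improvements

A. Lower max_intents (config level)

# For Easy machines
reason:
  max_intents: 2  # down from 4 to 2, less redundancy

B. Deduplication in the Reason phase (prompt level)

Add a coverage check to the Reason prompt: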

- Before proposing a new intent, check:
  1. Does this direction overlap >50% with an existing Open or recently completed Intent?
  2. If so, incorporate only the non-overlapping portion, or skip.
- Each intent MUST have a clearly distinguishable outcome from all other active intents.

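C. Smart merging (architecture level)

When 3 Facts are detected to describe the same conclusion from different angles, merge them automatically into one composite Fact to reduce noise.

6. P5: Inefficient Bootstrap

Symptoms

The 240s Bootstrap phase produced a single Fact (f001), the port scan result. A reasonable execution time for that amount of information is about 30-60s.

Proposed improvements

bootstrap:
  timeout: 120     # down from 240 to 120
  conclude_timeout: 60

For Easy machines, Bootstrap does not need deep analysis, only a basic port scan plus service identification. Deep analysis belongs in the Explore phase.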

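7. P6: Accumulating Fact Noise

Symptoms

Of the 14 Facts, only 6 contain substantive breakthrough information:

Substantive	Noise
f001 (port scan)	f002 (DNS, nothing found)
f014 (React2Shell RCE)	f003-f006 (web decoy × 4)
f016 (credentials + user flag)	f007/f009/f012 (SSH progress × 3)
f019 (Inspector RCE + root flag)	f008 (UDP, nothing found)
	f013/f015 (CVE, nothing found × 2)

Noise rate = 8/14 = 57%. Noise Facts do more than take up storage; more importantly, they muddy Reason's judgment. Every facts:N->N+1 change triggers a Reason re-evaluation, even when the Fact is only "SSH brute-force: 1856 attempts, zero success".

Proposed improvements

A. Fact merging strategy

Facts that stem from the same conclusion should be merged:

f003 + f004 + f005 + f006 → a single Fact "Web application analysis: static decoy, no exploitation value (verified from 4 directions)"
f007 + f009 + f012 → a single Fact "SSH brute-force: 1856 attempts, zero success, recommend abandoning"

B. Noise filtering

During evaluation, Reason skips Facts tagged type: progress or type: exhausted (see the Fact tagging scheme in P1-C).

8. Consolidated Optimized Configuration

Based on the analysis above, the recommended configuration for Easy machines: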

server: "http://0.0.0.0:8000"

common_env:
  TSEC_SERVER_HOST: "0.0.0.0:8000"
  TSEC_AGENT_TOKEN: "xxx"

runtime:
  interval: 2
  max_workers: 8        # Easy machines do not need 16
  max_running_projects: 4
  max_project_workers: 4

tasks:
  bootstrap:
    timeout: 120          # 240 → 120
    conclude_timeout: 60
  reason:
    timeout: 90           # 120 → 90
    max_intents: 2        # 4 → 2 (less redundancy)
  explore:
    timeout: 120          # 240 → 120 (tool calls)
    conclude_timeout: 120 # 60 → 120 (output analysis)

container:
  image: "cairn-worker-htb:latest"
  network_mode: "host"
  completed_action: "stop"

workers:
  - name: "claudecode_deepseek-v4-pro"
    type: "claudecode"
    task_types: [bootstrap, reason, explore]
    max_running: 4
    priority: 0
    env:
      ANTHROPIC_MODEL: "deepseek-v4-pro"
      ANTHROPIC_BASE_URL: "https://api.deepseek.com/anthropic"
      ANTHROPIC_AUTH_TOKEN: "sk-xxx"

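Estimated impact

Metric	Current	After optimization (estimated)
Bootstrap	240s	120-180s
Exploration rounds	6	3-4 (dedup)
Parallel Intents per round	3-4	2 (less redundancy)
Timeout waste	15 × 30s avg	5 × 15s avg
Loop detection	None (manual Hint needed)	Vector abandoned automatically after 3 attempts
Estimated total time	44 min	15-20 min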

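9. Implementation Priority

Priority	Change	Effort	Benefit
P0	Parser hardening (multi-layer JSON extraction)	30 lines of Python	Eliminates parse retries
P0	Reason prompt stalemate detection	10 lines of prompt	Eliminates the SSH loop
P1	explore timeout 120s + conclude 120s	Config change	Less timeout waste
P1	max_intents 4 → 2	Config change	Less redundant output
P2	bootstrap timeout 120s	Config change	Faster startup
P2	Fact tagging mechanism	Architecture change	Long-term benefit
P3	Streaming output capture	Architecture change	Smaller losses on timeout
P3	Smart Fact merging	Architecture change	Less noise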

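10. Takeaways for the Cairn Architecture

The core tension this test exposed is between the Agent's "explicit rationality" (reasoning from Facts) and its "implicit limits" (the model cannot step outside the narrative frame it built itself).

Specifically:

Once the Agent itself wrote the Fact "SSH is the only foothold vector", later Reason cycles treated that inference as a confirmed premise rather than a hypothesis to verify
The Hint mechanism worked perfectly, which shows that a small external nudge is sometimes enough to break the Agent's self-reinforcing loop
But without an external Hint, the Agent has no built-in "self-doubt" mechanism

Possible directions for the architecture:

Devil's Advocate Agent: an independent "challenger" agent dedicated to questioning the current consensus
Backtracking: when 3 zero-output results are detected, automatically backtrack and re-label earlier conclusions as assumptions rather than facts
Fact confidence: every Fact carries a confidence score, and Facts on an inference chain are re-verified periodically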

11. Implementation Status (2026-05-24)

Priority	Change	File	Status
P0	Multi-layer JSON extraction parser	`output_parser.py`	✅ Done
P0	Reason prompt stalemate detection + dedup	`prompts/htb/reason.md`	✅ Done
P1	explore timeout 120s + conclude 120s	`dispatch_htb.yaml`	✅ Done
P1	max_intents 4 → 2	`dispatch_htb.yaml`	✅ Done
P1	Stricter output constraint (all prompts)	`reason.md`, `explore.md`, `bootstrap.md`	✅ Done
P2	bootstrap timeout 120s	`dispatch_htb.yaml`	✅ Done
P1	Server-side Intent dedup	`tasks/dedup.py`, `tasks/reason.py`	✅ Done
P1	Breakthrough Fact detection + smarter Reason triggering	`tasks/fact_quality.py`, `scheduler/loop.py`	✅ Done
P1	Graph YAML Fact quality annotations	`tasks/reason.py`	✅ Done
P2	Conclude prompt receives raw output	`tasks/common.py`, `tasks/explore.py`, `tasks/bootstrap.py`	✅ Done
P3	Streaming output capture	Architecture change	⏳ Pending

Change details

1. output_parser.py - add Layer 3 balanced-brace fallback

Original problem: the LLM emits preamble text before the JSON, causing parse failures
Change: added the PRE_JSON_BOUNDARY_RE regex and the _balanced_brace_span() function, which extract the first balanced {...} block from arbitrary leading text
Effect: eliminates "no JSON object found" parse retries, saving ~53s per occurrence

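2. prompts/htb/reason.md - stalemate detection + dedup + output constraint

Original problem: SSH brute-force loop (18 minutes, 1856 attempts, zero success)
Change A: new Stalemate Detection rule: 3 consecutive attempts on the same vector with zero breakthrough → EXHAUSTED, must switch direction
Change B: strengthened Deduplication Check: completed Intents with a negative conclusion must not be proposed again; web scenarios are steered toward CVE research
Change C: output constraint: FIRST char {, LAST char }, no preamble or trailing explanations

3. prompts/htb/explore.md & bootstrap.md - output constraint alignment

Change: use the same strict output constraint as reason.md (FIRST { / LAST })
Effect: all three prompts follow the same output spec, reducing the risk of JSON parse failures across all task types

4. dispatch_htb.yaml - timeout + parallelism tuning

max_workers: 16 → 8, max_running_projects: 8 → 4, max_project_workers: 8 → 4
bootstrap timeout: 240 → 120
reason timeout: 120 → 90, max_intents: 4 → 2
explore timeout: 240 → 120, conclude_timeout: 60 → 120
Effect: less time lost to timeouts (the timeout rate was 89% before) and fewer redundant parallel Intents

5. tasks/dedup.py + tasks/reason.py - server-side Intent dedup

Original problem: multiple Intents re-explore the same direction (e.g. SSH brute-force variants, the web analysis quartet)
Change: added is_duplicate_intent(), based on token-signature similarity (threshold 0.4), which filters out duplicate Intents before the Reason task creates them
Similarity computation: tokenization plus special port_XXX tokens, Jaccard similarity
Effect: avoids wasting parallel slots on repeated exploration; when Reason ends up creating no Intents, "everything was deduplicated" is distinguished from "genuine failure"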
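A token-signature check of this kind might look like the sketch below. Only the Jaccard measure, the port_XXX token idea, and the 0.4 threshold come from the description above; the tokenization regex, the port-matching pattern, the `>=` comparison, and the function signatures are assumptions made for illustration and may differ from tasks/dedup.py.

```python
import re

# Assumed patterns for port mentions such as "port 22" or "22/tcp".
PORT_RE = re.compile(r"\bport\s+(\d{1,5})\b|\b(\d{1,5})/(?:tcp|udp)\b")


def token_signature(description: str) -> set[str]:
    """Lowercased word tokens plus a port_<n> token for each port mention."""
    text = description.lower()
    tokens = set(re.findall(r"[a-z][a-z0-9]+", text))
    for match in PORT_RE.finditer(text):
        tokens.add(f"port_{match.group(1) or match.group(2)}")
    return tokens


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity |a & b| / |a | b|; defined as 0.0 for two empty sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def is_duplicate_intent(new_description: str, completed: list[str],
                        threshold: float = 0.4) -> bool:
    """True if the new intent is close to any completed intent description."""
    new_sig = token_signature(new_description)
    return any(jaccard(new_sig, token_signature(old)) >= threshold
               for old in completed)
```

Token overlap is cheap and needs no embedding model, but it is sensitive to wording: two intents that describe the same attack with different vocabulary can fall below the threshold, so the threshold likely needs tuning against real intent descriptions.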

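6. tasks/fact_quality.py + scheduler/loop.py - breakthrough Fact detection + smarter Reason triggering

Original problem: accumulating progress Facts (wordlist prepared, Hydra launched) trigger pointless Reason re-evaluations, and the new Facts each of those Reason cycles produces are all noise
Change A: added classify_fact_quality(), a three-level classification into breakthrough / negative / progress; breakthrough covers credentials/Shell/Flag/CVE, negative covers "no found"/decoy/blocked
Change B: enhanced _reason_trigger(): Reason triggers immediately when the new Facts are all breakthroughs; otherwise it triggers only after 3 or more consecutive non-breakthrough Facts (accumulation detection); consecutive breakthroughs reset the noise counter
Effect: fewer useless Reason cycles triggered by noise Facts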
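The classification and trigger logic can be approximated with plain keyword matching, as in the sketch below. It is a simplified illustration rather than the real tasks/fact_quality.py: the keyword tuples mirror the categories listed above, negative keywords are checked first so that a fruitless CVE search is not counted as a breakthrough, and the hypothetical `ReasonTrigger` class fires on any breakthrough in a batch, which is a simplification of the rule described.

```python
BREAKTHROUGH_KEYWORDS = ("credential", "shell", "flag", "cve")
NEGATIVE_KEYWORDS = ("no found", "decoy", "blocked")


def classify_fact_quality(text: str) -> str:
    """Classify a fact as 'negative', 'breakthrough', or 'progress'."""
    lowered = text.lower()
    if any(keyword in lowered for keyword in NEGATIVE_KEYWORDS):
        return "negative"
    if any(keyword in lowered for keyword in BREAKTHROUGH_KEYWORDS):
        return "breakthrough"
    return "progress"


class ReasonTrigger:
    """Hypothetical trigger: fire on a breakthrough or on accumulated noise."""

    def __init__(self, noise_limit: int = 3) -> None:
        self.noise_limit = noise_limit
        self.noise_count = 0

    def should_trigger(self, new_fact_texts: list[str]) -> bool:
        qualities = [classify_fact_quality(text) for text in new_fact_texts]
        if "breakthrough" in qualities:
            self.noise_count = 0  # a breakthrough resets the noise counter
            return True
        self.noise_count += len(qualities)
        if self.noise_count >= self.noise_limit:
            self.noise_count = 0  # re-evaluate once, then start counting again
            return True
        return False
```

Keyword matching is crude and will misclassify facts whose wording falls outside the lists, so any real deployment would need to validate the keyword sets against actual Fact text.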

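7. tasks/reason.py + prompts/htb/reason.md - Graph YAML quality annotations

Change A: _annotate_graph_yaml() adds a _quality: breakthrough | negative field to Facts
Change B: the Reason prompt gains a Fact quality annotations section, steering the Agent to prioritize breakthrough facts and treat negative facts as a signal to change direction
Effect: lets the Reason Agent tell key findings from noise when analyzing the graph

8. tasks/common.py + tasks/explore.py + tasks/bootstrap.py - Conclude receives raw output

Original problem: after an Explore timeout, the conclude phase cannot see the raw stdout of the earlier tool calls and can only summarize from an empty context
Change: on timeout or parse failure, the conclude fallback receives raw_stdout, and format_raw_output_context() injects its last 3000 characters into the conclude prompt
New placeholder: {raw_output_context}, injected into the conclude prompts explore_conclude.md / bootstrap_conclude.md
Effect: the conclude phase has raw output to refer to, so a timeout no longer loses every finding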

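Expected results

Metric	Before	After (estimated)
user.txt	~20 min	< 5 min
root.txt	~44 min	< 10 min
Reason parse failure rate	~25%	< 2%
SSH loop	Manual Hint needed	Abandoned automatically after 3 attempts
Redundant Intents per round	3-4	≤ 2
Server-side Intent duplicates	No detection	Token-similarity filtering
Useless Reason triggers	Triggered on every Fact change	Breakthrough first, noise accumulation detection
Output lost on conclude timeout	0 output → empty summary	Last 3000 characters of raw stdout passed along

Analysis based on the Cairn dispatcher log and the complete Fact Graph data of project proj_005.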
