03 - Data Flow in Detail

What is the AI actually doing? How does a request get from "an operator clicks an alert" to "a diagnostic report comes out"?

Data flow overview

Operator action              What the backend does               Where the data goes
───────────────              ─────────────────────               ───────────────────
[pick alert / free text] ──→ alert envelope or clue parser ──→ context (resolver_ip + window)
                             │
                             ↓
        [_context_window validates the time window]
                             │
                             ↓
              ┌──────────────┴───────────────┐
              │ deterministic seed plan      │ ← Phase 68.C: incident_window fallback
              │ (round 1)                    │
              └──────────────┬───────────────┘
                             │
                             ↓
           [tool_plan_validator validates args]
                             │
                             ↓
              ┌──────────────┴───────────────┐
              │ run ES aggs (8 DNS / 6 DHCP) │ → ES :9200 returns the aggregate numbers
              │ tool_orchestrator            │
              └──────────────┬───────────────┘
                             │
                             ↓
                   [tool_trace recorded]
                             │
                             ↓
              ┌──────────────┴───────────────┐
              │ multiround: round 2 / 3      │
              │ model planner (LLM) picks    │ ← real LLM call
              │ the next tools to run        │
              └──────────────┬───────────────┘
                             │
                             ↓
          [agent_stop_policy early-stop decision]
                             │
                             ↓
              ┌──────────────┴───────────────┐
              │ build_multiround_report      │ ← Phase 68.D culprit: hypotheses are generated here
              │ builds root_cause_hypotheses │
              └──────────────┬───────────────┘
                             │
                             ↓
              ┌──────────────┴───────────────┐
              │ reasoner_factory re-attaches │ ← Phase 68.D adds a post-build retry hook
              │ the ai_reasoning field       │
              └──────────────┬───────────────┘
                             │
                             ↓
              ┌──────────────┴───────────────┐
              │ asset_registry lookup        │ ← Phase 68.B/F: asset_context.matched
              │ → business_impact            │   12 business systems identified
              └──────────────┬───────────────┘
                             │
                             ↓
              ┌──────────────┴───────────────┐
              │ recommended_actions builder  │
              │ (3 tiers: emergency /        │
              │ mitigation / long_term)      │
              └──────────────┬───────────────┘
                             │
                             ↓
              ┌──────────────┴───────────────┐
              │ report_status_invariants     │ ← 15+ guards, the last check before the response leaves
              │ + sanitizers + safety_flags  │   (phase68.E sanitize CJK regex)
              └──────────────┬───────────────┘
                             │
                             ↓ JSON
                  [Frontend Vue renders]
                             │
                             ↓
           [operator sees the 3-section output]

4 endpoints × 3 output sections: actual values

Endpoint A: DNS Resolver-Chain Triage (single-pass)

Request:

POST http://127.0.0.1:5000/api/v1/log-assistant/dns/resolver-chain-triage
Authorization: Bearer <token>
Content-Type: application/json

{
  "resolver_ip": "172.16.1.42",
  "incident_window": {"start": "2026-06-04T15:30:46+08:00", "end": "2026-06-04T15:45:46+08:00"},
  "baseline_window": "auto",
  "sample_limit": 10
}
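
The request body above can be assembled programmatically. The following is a minimal sketch, not the project's client code: `build_triage_request` and `CST` are hypothetical names introduced for illustration, and the 15-minute default only mirrors the sample window shown above.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the sample timestamps use a fixed +08:00 offset.
CST = timezone(timedelta(hours=8))


def build_triage_request(resolver_ip: str, end: datetime,
                         minutes: int = 15, sample_limit: int = 10) -> dict:
    """Build a resolver-chain-triage body for a window of `minutes` ending at `end`."""
    if end.tzinfo is None:
        raise ValueError("end must be timezone-aware")
    start = end - timedelta(minutes=minutes)
    return {
        "resolver_ip": resolver_ip,
        "incident_window": {
            "start": start.isoformat(timespec="seconds"),
            "end": end.isoformat(timespec="seconds"),
        },
        "baseline_window": "auto",
        "sample_limit": sample_limit,
    }


payload = build_triage_request(
    "172.16.1.42", datetime(2026, 6, 4, 15, 45, 46, tzinfo=CST)
)
```

Requiring a timezone-aware `end` keeps the offset explicit in both window bounds, which matches the format of the sample request.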

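What the backend does:

1. Check that resolver_ip is in asset_registry (after the Phase 68.B fix it returns owner="网络中心", i.e. the Network Center)
2. Resolve the baseline window (auto = the same time slot, looking back 7 days)
3. Run 8 DNS aggregation queries and compare incident vs baseline
4. Call the LLM reasoner to generate hypotheses (single round, no multiround)
5. asset_registry → business_impact: the 12 business systems are identified
6. recommended_actions are split into tiers

Response (3 sections):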

{
  "report": {
    "status": "ok",
    "headline": "172.16.1.42 上游响应缺口 2.76%, 较 baseline 微幅上升",
    "ai_reasoning": {
      "available": false,
      "fallback_reason": "deterministic_meta_reasoning_only",
      "reasoning_summary": "..."
    },
    "business_impact": {
      "asset_context": {
        "matched": true,
        "resolver_ip": "172.16.1.42",
        "resolver_label": "校内主递归DNS",
        "zone": "教学区",
        "owner": "网络中心",
        "confidence": "high"
      },
      "affected_business_systems": [
        {"system": "统一认证系统", "domain": "...", "confidence": "high"},
        ... (12 entries)
      ],
      "affected_zones": [{"cidr": "172.16.1.0/24", "zone": "教学区核心"}]
    },
    "recommended_actions": [],
    "raw_log_scan": false,
    "preview_only": true,
    "execution_enabled": false
  }
}

Response time: ~3-5s (single round + a single reasoner LLM call).

Endpoint B: DNS Multi-Round Triage

Request:

POST http://127.0.0.1:5000/api/v1/log-assistant/ai/multiround-triage
{
  "question": "DNS 失败聚合, 给我多轮排障",
  "protocol": "dns",
  "context": {
    "client_ip": "192.0.2.10",
    "resolver_ip": "172.16.1.42",
    "incident_window": {"start": "...", "end": "..."}
  },
  "max_rounds": 3,
  "max_tools": 8,
  "model_planning_enabled": true
}

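What the backend does:

1. Round 1: deterministic_seed plan (query_baseline_window + query_resolver_chain_gap)
2. tool_plan_validator validates args (after the Phase 68.C fix it accepts context.incident_window)
3. ES runs the 2 tools
4. Round 2: the LLM model planner decides which tools to run (based on round 1 evidence)
5. Round 3: the LLM decides, or the run stops early (sufficient_evidence)
6. build_multiround_report generates the hypotheses
7. New in Phase 68.D: post-build reasoner retry (the reasoner is called once more with the final hypotheses present)
8. ai_reasoning is attached back to the report, together with business_impact and recommended_actions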
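
The round structure above can be sketched as a small loop. This is an illustration under stated assumptions, not the actual tool_orchestrator: `run_multiround` and its four callables are hypothetical, and only the `max_rounds` / `max_tools` parameters and the `sufficient_evidence` / `tool_budget_exhausted` literals come from this document.

```python
def run_multiround(seed_plan, plan_next, run_tool, has_enough_evidence,
                   max_rounds=3, max_tools=8):
    """Run up to max_rounds of tools, stopping early on evidence or budget."""
    trace = []
    plan = list(seed_plan)      # round 1 uses the deterministic seed plan
    stop_reason = ""            # stays empty when all rounds simply complete
    round_count = 0
    for round_no in range(1, max_rounds + 1):
        if not plan:
            break
        round_count = round_no
        for tool in plan:
            if len(trace) >= max_tools:
                stop_reason = "tool_budget_exhausted"
                break
            trace.append({"round": round_no, "tool": tool,
                          "result": run_tool(tool)})
        if stop_reason:
            break
        if has_enough_evidence(trace):
            stop_reason = "sufficient_evidence"
            break
        if round_no < max_rounds:
            plan = plan_next(trace)   # rounds 2+ are chosen by the planner
    return {"round_count": round_count, "stop_reason": stop_reason,
            "tool_trace": trace}
```

Passing the planner and the stop policy as callables keeps the loop testable with stubs, so the budget and early-stop behaviour can be checked without ES or an LLM.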

Response:

{
  "report": {
    "status": "completed",
    "round_count": 3,
    "stop_reason": "",
    "findings": [...6 entries...],
    "root_cause_hypotheses": [
      {
        "category": "dns_failure_semantic_concentration",
        "label": "...",
        "confidence": "medium",
        "supporting_evidence": [...],
        "weakening_evidence": [...]
      }
    ],
    "ai_reasoning": {
      "available": true,
      "hypotheses": [{"hypothesis_id": "dns_failure_semantic_concentration", "confidence_delta": "up_one"}],
      "model_name": "dns_resolver_v1",
      "model_provider": "campus_ai_ops",
      "reasoning_summary": "聚合证据显示 DNS 总失败 860989 次 (15.62%)..."
    },
    ...
  }
}

Response time: ~30-90s (3 rounds × ES + 2 LLM calls).

Endpoint C: DHCP AI Tool Triage (single-pass, 5 tools)

What the backend does:

1. Extract client_ip / client_mac / server_ip / relay_ip from context
2. Run 5 DHCP aggregation queries (concurrent):
   - query_dhcp_message_summary
   - query_dhcp_top_failed_clients
   - query_dhcp_top_relays
   - query_dhcp_top_servers
   - query_dhcp_failure_drilldown
3. LLM reasoner (single-round, working directly from the 5 tool results)
4. asset_registry lookup (Phase 68.F: _dhcp_asset_context is wired up, so ai_reasoning.asset_context.matched=true)
5. business_impact (note: business_impact.asset_context.matched still returns false; a candidate for phase69.G)
6. recommended_actions are split into tiers
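
The concurrent step above could be sketched with a thread pool. This is a hedged illustration rather than the backend's implementation: `run_dhcp_tools` and the `run_query(name, context)` callable are hypothetical, and only the five tool names come from this document.

```python
from concurrent.futures import ThreadPoolExecutor

DHCP_TOOLS = [
    "query_dhcp_message_summary",
    "query_dhcp_top_failed_clients",
    "query_dhcp_top_relays",
    "query_dhcp_top_servers",
    "query_dhcp_failure_drilldown",
]


def run_dhcp_tools(run_query, context):
    """Run the 5 DHCP aggregation tools in parallel and collect per-tool results."""
    results = {}
    with ThreadPoolExecutor(max_workers=len(DHCP_TOOLS)) as pool:
        futures = {name: pool.submit(run_query, name, context)
                   for name in DHCP_TOOLS}
        for name, future in futures.items():
            try:
                results[name] = {"ok": True, "data": future.result()}
            except Exception as exc:  # one failing tool must not sink the rest
                results[name] = {"ok": False, "error": str(exc)}
    return results
```

Capturing the exception per tool means a single failed aggregation still leaves the other four results available to the reasoner.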

Response time: ~30s.

Endpoint D: DHCP Multi-Round

Same flow as B, but with protocol=dhcp. Response time: ~30s.

Clue path: `POST /api/v1/log-assistant/clue/diagnose`

Request:

POST /api/v1/log-assistant/clue/diagnose
{
  "text": "5/28 全天 huawei-wlan-controller 100% 失败, 是什么原因?",
  "session_id": "..."
}

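What the backend does (a completely different flow from the 4 endpoints). The sample text asks, roughly, "huawei-wlan-controller failed 100% all day on 5/28, what is the cause?":

1. clue_parser_v2 extracts parameters with the LLM (Phase 69.A expanded the keyword set; Phase 69.A.1 checked in fixtures):
   - time window 2026-05-28T00:00:00+08 → 2026-05-29T00:00:00+08 (24h)
   - suspect domain huawei-wlan-controller
   - protocol auto detect → dns
2. live_query_guard checks that the window does not exceed LIVE_MAX_WINDOW (Phase 69.C raised it from 6h to 24h; Phase 69.D raised the hard cap to 168h)
3. clue_qa_planner runs a 15-step deterministic plan (14 ES aggregation queries)
4. The aggregate results go through the main dns_diagnostics / dhcp_diagnostics path → ai_reasoning + business_impact + recommended_actions
5. New in Phase 69.B: a top-level data_freshness_warning (the warning fires on the silent-fallback open_dataset path; as of Phase 69.E the frontend renders it as a red banner)

Response time: ~30-120s (LLM parameter extraction + 14 tools + LLM reasoner).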
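
The window check performed by live_query_guard can be pictured as follows. This is a sketch under assumptions: the function name `check_live_window`, the rejection strings, and the choice to clamp the configured limit by the hard cap are all hypothetical; only the 24h and 168h figures come from the text above.

```python
from datetime import datetime, timedelta

LIVE_MAX_WINDOW = timedelta(hours=24)   # Phase 69.C value quoted above
HARD_CAP = timedelta(hours=168)         # Phase 69.D hard cap quoted above


def check_live_window(start_iso, end_iso, max_window=LIVE_MAX_WINDOW):
    """Return (ok, reason) for an ISO-8601 window; reason is '' when ok."""
    start = datetime.fromisoformat(start_iso)
    end = datetime.fromisoformat(end_iso)
    if end <= start:
        return (False, "window_not_positive")   # hypothetical reason string
    limit = min(max_window, HARD_CAP)  # assumed: config can never exceed the cap
    if end - start > limit:
        return (False, "window_too_long")        # hypothetical reason string
    return (True, "")
```

For example, `check_live_window("2026-05-28T00:00:00+08:00", "2026-05-29T00:00:00+08:00")` covers exactly 24 hours and passes under the default limit.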

Safety flags (3 flags, physically locked)

The top level of every API response, and every sub-report section, contains:

{
  "raw_log_scan": false,
  "preview_only": true,
  "execution_enabled": false
}

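Flag	Physical meaning	Enforced by
`raw_log_scan=false`	The backend never reads raw ES log documents; it only runs aggregate queries (count / sum / cardinality). All 14 queries in the tool registry are aggs-type and do not fetch the source field.	`dns_tool_registry.py` / `dhcp_tool_registry.py`, verified against the actual ES DSL
`preview_only=true`	Every recommended_actions field is marked as a preview; the frontend renders it as "suggested action", not "execute now".	`report_status_invariants.py` checks that the output contains `preview_only=true`
`execution_enabled=false`	The backend contains no code that modifies network configuration (no SSH client, no device config, no SNMP write). There is physically no interface to call.	At the code level: repo-wide `grep -rE “ssh

No phase that changes the backend is allowed to touch these 3 flags (feedback_bitter_lessons_r10_r37.md, red line 2).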
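
A response can be spot-checked against the three flags with a small recursive walk. The sketch below is illustrative and is not `report_status_invariants.py`; `find_flag_violations` is a hypothetical helper, and it only checks flags that are present (it does not detect a section that omits them).

```python
EXPECTED_FLAGS = {
    "raw_log_scan": False,
    "preview_only": True,
    "execution_enabled": False,
}


def find_flag_violations(node, path="$"):
    """Return the JSON paths where a safety flag has an unexpected value."""
    violations = []
    if isinstance(node, dict):
        for flag, expected in EXPECTED_FLAGS.items():
            if flag in node and node[flag] is not expected:
                violations.append(f"{path}.{flag}")
        for key, value in node.items():
            violations.extend(find_flag_violations(value, f"{path}.{key}"))
    elif isinstance(node, list):
        for index, item in enumerate(node):
            violations.extend(find_flag_violations(item, f"{path}[{index}]"))
    return violations
```

An empty list means every flag found in the payload, at the top level and in nested sub-reports, carries the locked value.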

LLM call chain (actual values)

Primary path: local LLM first

# .env.local
OPENCLAW_REASONING_PRIMARY=local-onprem
OPENCLAW_LOCAL_URL=http://10.10.x.x:8000/v1/chat/completions
OPENCLAW_LOCAL_MODEL=Qwen2.5-72B
OPENCLAW_ALLOW_CLOUD_FALLBACK=true

llm_client.call_reasoning_model_with_fallback:

1. Call the local-onprem URL first (default timeout 8s, configurable via OPENCLAW_REASONER_TIMEOUT_SECONDS)
2. On failure (ConnectionError / Timeout / 5xx) → fall back to DeepSeek
3. If DeepSeek also fails → return {"available": false, "fallback_reason": "model_unavailable"}
4. The backend receives available=false → the ai_reasoning section is marked degraded, and reasoning_summary uses deterministic text
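
The fallback order above can be sketched with two injected callables. This is not the body of `llm_client.call_reasoning_model_with_fallback`: `call_with_fallback`, `call_local`, and `call_cloud` are hypothetical, and the sketch assumes each callable raises `ConnectionError` or `TimeoutError` on failure (including a 5xx response).

```python
def call_with_fallback(call_local, call_cloud, allow_cloud_fallback=True):
    """Try the local model first, then the cloud model, else report unavailable."""
    providers = [("campus_ai_ops", call_local)]
    if allow_cloud_fallback:
        providers.append(("deepseek_cloud", call_cloud))
    for provider, call in providers:
        try:
            result = call()
        except (ConnectionError, TimeoutError):
            continue  # move on to the next provider, if there is one
        return {"available": True, "model_provider": provider, **result}
    return {"available": False, "fallback_reason": "model_unavailable"}
```

The provider labels reuse the `model_provider` values listed in the field reference, so a caller can tell which path produced the answer.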

Fallback: DeepSeek cloud

OPENCLAW_DEEPSEEK_URL=https://api.deepseek.com
OPENCLAW_DEEPSEEK_API_KEY=sk-...
OPENCLAW_DEEPSEEK_MODEL=deepseek-v4-flash   # do not use deepseek-chat (deprecated 2026/07/24)

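How many LLM calls are actually made?

Endpoint	LLM calls	Average latency
A (DNS resolver-chain)	0-1 (none on the deterministic_meta path)	3-5s total
B (DNS multi-round)	1-4 (1 reasoner + 1-3 model planner)	30-90s total
C (DHCP single)	1 (reasoner)	30s total
D (DHCP multi)	1-4 (same as B)	30s total
Clue	2-3 (parser + reasoner + occasionally planner)	30-120s total

Every LLM call goes through schema validation. A non-conforming response is downgraded straight to the deterministic path and does not affect the other output fields.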
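
The validate-then-degrade behaviour can be pictured with a small checker. This is a sketch, not the project's schema: `validate_reasoning` is hypothetical, the checked keys are taken from the sample `ai_reasoning` response, and the `schema_validation_failed` literal is an assumed placeholder that does not appear in the documented `fallback_reason` values.

```python
def validate_reasoning(raw, deterministic_summary):
    """Accept a well-formed LLM reply or degrade to the deterministic summary."""
    def degraded():
        return {
            "available": False,
            "fallback_reason": "schema_validation_failed",  # assumed literal
            "reasoning_summary": deterministic_summary,
        }

    if not isinstance(raw, dict):
        return degraded()
    summary = raw.get("reasoning_summary")
    hypotheses = raw.get("hypotheses")
    if not isinstance(summary, str) or not summary.strip():
        return degraded()
    if not isinstance(hypotheses, list):
        return degraded()
    for item in hypotheses:
        if not isinstance(item, dict):
            return degraded()
        for key in ("hypothesis_id", "confidence_delta"):
            if not isinstance(item.get(key), str):
                return degraded()
    return {"available": True, "hypotheses": hypotheses,
            "reasoning_summary": summary}
```

Because the function only ever returns the `ai_reasoning` section, a rejected reply leaves `business_impact` and `recommended_actions` untouched.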

Field reference (important: for troubleshooting and extension)

`ai_reasoning` fields (top level)

Field	Type	Meaning	When present
`available`	bool	Whether the LLM actually generated hypotheses	Always
`model_name`	str	Which model was used (`dns_resolver_v1` / `dhcp_resolver_v1`)	When available=true
`model_provider`	str	`campus_ai_ops` (local) / `deepseek_cloud` (fallback)	When available=true
`reasoning_summary`	str	The LLM's natural-language diagnosis in Chinese (300-500 characters)	Always, but may be the deterministic fallback
`hypotheses`	list	Root-cause ranking from the LLM (hypothesis_id + confidence_delta + reason)	When available=true
`fallback_reason`	str	`no_hypotheses_to_validate` / `model_unavailable` / `deterministic_meta_reasoning_only`, etc.	When available=false
`asset_context`	dict	Asset identification wired up in Phase 68.B/F (matched/resolver_ip/resolver_label/zone/owner/subnet_zone)	Wired up for DNS B + DHCP C/D
`global_caveats`	list	Cross-field warnings (e.g. "this conclusion was generated by deterministic meta-reasoning")	Varies

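`business_impact` fields

Field	Type	Meaning
`asset_context.matched`	bool	Whether resolver_ip hit an entry in asset_registry (wired up in Phase 68.B)
`affected_business_systems`	list	The 12 real business systems (unified authentication / campus portal / OA / academic affairs / …)
`affected_zones`	list	Affected zones (cidr + zone label)
`headline_summary`	str	One-sentence summary of the business impact
`time_window_context`	dict	`is_business_hours` + `reason` (class hours or off-peak)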

`recommended_actions` fields

Field	Type	Meaning
`tier`	str	`emergency` / `mitigation` / `long_term`
`title`	str	One-sentence action title
`body`	str	Detailed execution steps
`recommended_actions_status`	str (top-level sibling field)	`actionable` / `awaiting_evidence` / `no_action_needed`

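New in Phase 69.B: `data_freshness_warning` (top level)

Field	Type	Meaning
`severity`	str	`critical` (when the silent fallback is triggered)
`headline`	str	"⚠ The current answer is based on historical sample data…"
`detail`	str	Detailed explanation of why ES is unavailable
`affected_tools`	list	Which tools went through the open_dataset fallback
`fallback_reasons`	list	List of the underlying causes

As of Phase 69.E the frontend renders this as a red banner. See 05-operations-manual.md §What to do when the AI misbehaves and 06-troubleshooting.md.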

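Degraded reason table (required reading for operators who troubleshoot)

`degraded_reason` literal	Meaning	How to investigate
`es_query_failed`	ES returned nothing / timed out / 5xx	Check ES health: `curl http://10.10.1.147:9200/_cluster/health`
`deterministic_seed_rejected`	tool_plan_validator rejected the round 1 plan (missing fields)	Check rejected_tools[*].reason; usually args are missing
`tool_validator_rejected`	Same as above, but the rejected plan came from the LLM in round 2+	Check what the LLM actually proposed and whether it referenced a tool outside the whitelist
`no_root_cause_ranked`	Tools ran but the hypothesis builder produced no ranking	Check the findings values; there may not be enough data
`tool_budget_exhausted`	8 tools ran without reaching a conclusion	A genuinely complex case; max_tools can be raised, but this usually means the problem is beyond what the system can handle
`data_source_unavailable`	ES is completely unavailable (even the health check fails)	ES is down / the VPN dropped / network isolation
`model_unavailable`	Both the primary LLM and the fallback are down	grep the backend log for `OPENCLAW_*`
`no_hypotheses_to_validate`	(Legitimate OK state) the hypothesis builder returned nothing (meaning the data is normal, with no anomaly)	Not an error
`disabled_by_config`	The reasoner is disabled via env	Check whether `OPENCLAW_REASONING_PRIMARY` is actually set
`deterministic_meta_reasoning_only`	(Legitimate OK state) endpoint A by design does not call the LLM and uses deterministic meta-reasoning	Not an error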
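
For dashboards or alert routing, the table above can be reduced to a lookup that separates the two legitimate OK states from real failures. This is a minimal sketch; `classify_reason` and the `ok` / `error` / `unknown` labels are hypothetical, while the reason literals are copied from the table.

```python
OK_STATES = {
    "no_hypotheses_to_validate",
    "deterministic_meta_reasoning_only",
}

ERROR_STATES = {
    "es_query_failed",
    "deterministic_seed_rejected",
    "tool_validator_rejected",
    "no_root_cause_ranked",
    "tool_budget_exhausted",
    "data_source_unavailable",
    "model_unavailable",
    "disabled_by_config",
}


def classify_reason(reason: str) -> str:
    """Map a degraded/fallback reason literal to 'ok', 'error', or 'unknown'."""
    if reason in OK_STATES:
        return "ok"
    if reason in ERROR_STATES:
        return "error"
    return "unknown"  # a literal this sketch has not been taught about
```

Returning `unknown` for unlisted literals avoids silently treating a newly added reason as either healthy or broken.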

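Data isolation (compliance)

Data category	Where it goes	Who can see it
Raw ES logs (real production domains/MACs/IPs)	Stay in ES; the backend never reads them directly	ES administrators; not exposed by this system
ES aggregate results (real production domains appear in some metric labels)	Go into the backend log + frontend response	Operators with the admin role
`.env.local` secrets (DeepSeek key / password hash / token secret)	Stay in a file on the single host; never committed to git	Deployment administrator
Backend fixtures (R(N) capture)	Committed to git after sanitization by sanitize_fixture.py	Everyone
Backend logs (real production domains appear in ES result samples)	Local `logs/backend-verify.log`; never committed to git	Operators with the admin role; quarterly grep audit
Feedback form (filled in by operators)	SQLite `feedback` table	Operators with the admin role

See 05-operations-manual.md §Security audit.

Next: 04-deployment.md (installation SOP + full ENV table + start/stop commands)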
