04 - Installation SOP

From git clone to a setup that operators can actually use takes 30-60 minutes. If something goes wrong, consult 06-troubleshooting.md.

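System requirements

| Item | Requirement | Check |
|---|---|---|
| OS | Linux, Ubuntu 22.04+ (what this project runs on); other distros should work in theory | lsb_release -a |
| Python | 3.10+ (this project runs on 3.10; older versions are not supported) | python3 --version |
| Node.js | 18.x+ (required by Vue 3 + Vite 8) | node --version |
| npm | 9.x+ | npm --version |
| Git | 2.30+ (required by filter-repo) | git --version |
| Disk | 5 GB+ (node_modules 127 MB + venv 55 MB + accumulating logs) | df -h |
| RAM | 2 GB+ (Flask backend 500 MB + Vite frontend 200 MB) | free -m |
| Network | Campus network plus outbound internet access (the DeepSeek API fallback depends on it) | curl -I https://api.deepseek.com |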
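The version checks in the table can be automated. The following is a minimal sketch, not part of the project: the function names are hypothetical, and the minimum versions are simply copied from the table above. It runs each tool with --version, extracts the first dotted version number, and compares it against the minimum.

```python
import re
import shutil
import subprocess

# Minimum versions copied from the requirements table above.
MINIMUMS = {
    "python3": (3, 10),
    "node": (18, 0),
    "npm": (9, 0),
    "git": (2, 30),
}


def parse_version(text):
    """Extract the first dotted version number from a tool's --version output."""
    match = re.search(r"(\d+(?:\.\d+)+)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def meets_minimum(found, minimum):
    """Compare version tuples element by element, so (3, 10) beats (3, 9)."""
    return found >= minimum


def check_tool(name):
    """Run `<name> --version` and return (ok, detail) without raising."""
    if shutil.which(name) is None:
        return False, "not installed"
    result = subprocess.run([name, "--version"], capture_output=True, text=True)
    output = result.stdout or result.stderr
    try:
        found = parse_version(output)
    except ValueError:
        return False, f"unparseable output: {output.strip()!r}"
    return meets_minimum(found, MINIMUMS[name]), ".".join(map(str, found))


if __name__ == "__main__":
    for tool in MINIMUMS:
        ok, detail = check_tool(tool)
        print(f"{'OK  ' if ok else 'FAIL'} {tool}: {detail}")
```

Comparing integer tuples instead of strings avoids the classic mistake where "3.9" sorts after "3.10". Disk, RAM, and network checks are left to the commands in the table.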

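Installation in 4 steps

Unless noted otherwise, the cd commands in steps 2-4 are relative to the repository root (cs-26spring-final-project).

Step 1: clone the code + create the venv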

# 1. Clone
cd ~/code/project
git clone <repo-url> cs-26spring-final-project
cd cs-26spring-final-project

# 2. Create the venv
cd workspace/system_monitor
python3 -m venv backend/.venv
source backend/.venv/bin/activate
pip install --upgrade pip
pip install -r backend/requirements.txt

# Verify (flask.__version__ is deprecated in Flask 3.x, so read the package metadata instead)
backend/.venv/bin/python -c "from importlib.metadata import version; print(version('flask'))"
# Expected: 3.1.3 or later (this project runs on 3.1.3)

Step 2: frontend dependencies

cd workspace/system_monitor/frontend
npm install
# Typical duration: ~30 s (campus mirror) / 5 min (official registry)

# Verify
npm run build 2>&1 | tail -3
# Expected: ✓ built in N.NNs

Step 3: configure .env.local (secrets, never committed to git)

cd workspace/system_monitor
cp .env.local.example .env.local
vim .env.local

Required settings (without them the backend exits with sys.exit(1)):

# === Auth secrets ===
MONITOR_AUTH_USER=admin
MONITOR_AUTH_PASSWORD=<set a strong password, do not keep the default>
MONITOR_AUTH_TOKEN_SECRET=<generate with: openssl rand -hex 32>

# === ES connection ===
OPENCLAW_ES_URL=http://10.10.1.147:9200
# If a VPN is required, install the SSL VPN at the OS level first (vpn.gbu.edu.cn)
OPENCLAW_ES_TIMEOUT_SECONDS=60   # added in Phase 69.D, default 60s (a 24h aggregation takes ~30s, so 60s is enough)

# === Primary LLM (local first) ===
OPENCLAW_REASONING_PRIMARY=local-onprem
OPENCLAW_LOCAL_URL=http://10.10.x.x:8000/v1/chat/completions
OPENCLAW_LOCAL_MODEL=Qwen2.5-72B   # change to the model name actually deployed
OPENCLAW_LOCAL_API_KEY=             # usually not needed for a local model, leave empty

# === LLM fallback (DeepSeek cloud) ===
OPENCLAW_ALLOW_CLOUD_FALLBACK=true
OPENCLAW_DEEPSEEK_URL=https://api.deepseek.com
OPENCLAW_DEEPSEEK_API_KEY=sk-<your real key>
OPENCLAW_DEEPSEEK_MODEL=deepseek-v4-flash  # note v4; do not use deepseek-chat (deprecated 2026/07/24)

# === Backend capture flow (used for R(N) evaluation) ===
PHASE67A1_RESOLVER_IP=172.16.1.42         # primary campus recursive DNS as registered in asset_registry (wired up in Phase 68.B)
PHASE67A1_CLIENT_IP=192.0.2.10            # RFC 5737 TEST-NET-1 placeholder
PHASE67A1_INCIDENT_A_START=2026-06-04T15:30:46+08:00   # pinned in evaluator-query-design.md
PHASE67A1_INCIDENT_A_END=2026-06-04T15:45:46+08:00
PHASE67A1_INCIDENT_BCD_START=2026-06-04T03:36:05+00:00
PHASE67A1_INCIDENT_BCD_END=2026-06-04T07:33:27+00:00

# === Live mode (used by the clue path) ===
LIVE_MAX_WINDOW=24h        # raised from 6h to 24h in Phase 69.C; Phase 69.D added a 168h hard cap
ALLOW_WIDE_WINDOW=1         # added in Phase 69.D, allows the 24h-168h range

# === source_mode (must be live in production) ===
source_mode=live
window=6

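Optional settings:

# === Multi-account RBAC (recommended for production, unnecessary for a trial) ===
# Single-quote the JSON so that `source .env.local` keeps it as one value
MONITOR_ACCOUNTS_JSON='{"admin": {"password_hash": "...", "role": "admin"}, "operator": {...}}'
# Generate the hash with: python -m backend.security_admin hash <plain>

# === LLM timeout tuning (for a slow cloud endpoint) ===
OPENCLAW_REASONER_TIMEOUT_SECONDS=30   # default 8s; DeepSeek measured at 9-32s, 30s recommended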
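Because a missing required setting makes the backend exit at startup, it can help to lint .env.local before starting anything. The sketch below is an illustration under stated assumptions: the helper names are hypothetical, the parser only understands the simple KEY=VALUE lines shown above (not full shell syntax), and REQUIRED_KEYS lists only the two settings the ENV reference marks as required. The backend's own startup check remains the authority.

```python
import re

# Assumption: only the keys the ENV reference marks as required; extend as needed.
REQUIRED_KEYS = ("MONITOR_AUTH_PASSWORD", "MONITOR_AUTH_TOKEN_SECRET")


def parse_env_lines(lines):
    """Parse KEY=VALUE lines, skipping blank lines, comment lines, and inline comments."""
    values = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        value = value.lstrip()
        if value[:1] in ("'", '"'):
            # Quoted value: keep everything up to the matching closing quote.
            end = value.find(value[0], 1)
            value = value[1:end] if end != -1 else value[1:]
        else:
            # Unquoted value: cut at a '#' that starts the value or follows whitespace.
            value = re.split(r"(?:^|\s)#", value, maxsplit=1)[0].strip()
        values[key.strip()] = value
    return values


def missing_required(values, required=REQUIRED_KEYS):
    """Return required keys that are absent, empty, or still a <placeholder>."""
    return [
        key for key in required
        if not values.get(key) or values[key].startswith("<")
    ]


if __name__ == "__main__":
    try:
        with open(".env.local", encoding="utf-8") as handle:
            problems = missing_required(parse_env_lines(handle))
    except FileNotFoundError:
        print(".env.local not found; run this from workspace/system_monitor")
    else:
        print("missing: " + ", ".join(problems) if problems else "required settings present")
```

Treating a value that still starts with "<" as missing catches the common case where the template placeholder was never replaced.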

Step 4: start + verify

cd workspace/system_monitor
set -a; source .env.local; set +a
mkdir -p logs   # the redirects below fail if logs/ does not exist

# Backend
nohup ./backend/.venv/bin/python backend/app.py > logs/backend-verify.log 2>&1 &
BACKEND_PID=$!
echo "backend pid=$BACKEND_PID"
sleep 6

# Verify
curl -s -o /dev/null -w "HTTP %{http_code}\n" --max-time 5 http://127.0.0.1:5000/api/v1/healthz
# Expected: 200 (healthy) / 401 (healthy, auth required) / 404 (healthy, but the path has changed)
# Must not be 000 (connection refused)

# Frontend (dev mode; in production, serve static files through nginx)
cd frontend
nohup npm run dev > ../logs/frontend-dev.log 2>&1 &
sleep 5
curl -s -o /dev/null -w "HTTP %{http_code}\n" --max-time 3 http://127.0.0.1:3002
# Expected: 200

# Verify the backend test baseline (~30-60s)
cd ..
PYTHONPATH=. ./backend/.venv/bin/python -m unittest discover test_backup 2>&1 | tail -3
# Expected: Ran 1850 tests in NN.NNs OK (skipped=1)

Open http://127.0.0.1:3002 in a browser (or the URL behind the nginx proxy). If you see the login page, can log in as admin with your MONITOR_AUTH_PASSWORD, and then see the alert queue, the installation succeeded.
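The fixed sleep 6 before the health check can be too short on a slow machine and needlessly long on a fast one. A polling loop is one alternative; the sketch below assumes the same healthz URL as the curl command above, and its function names are hypothetical. Following the rule stated in step 4, any real HTTP status counts as "the process is answering", while 0 stands in for curl's 000.

```python
import time
import urllib.error
import urllib.request

HEALTH_URL = "http://127.0.0.1:5000/api/v1/healthz"


def probe_status(url=HEALTH_URL, timeout=5.0):
    """Return the HTTP status code, or 0 when no connection could be made."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # 401/404 still prove the server is answering
    except (urllib.error.URLError, OSError):
        return 0  # curl reports this case as 000


def is_listening(status):
    """Any real HTTP status means the process is up; 0 means nothing answered."""
    return status != 0


def wait_until_up(probe, attempts=10, delay=1.0, sleep=time.sleep):
    """Poll `probe` until it returns a non-zero status or attempts run out."""
    for attempt in range(attempts):
        status = probe()
        if is_listening(status):
            return status
        if attempt < attempts - 1:
            sleep(delay)
    return 0


if __name__ == "__main__":
    print(f"HTTP {wait_until_up(probe_status):03d}")
```

Passing the probe and the sleep function as parameters keeps the retry logic testable without a running backend.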

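Full ENV table (complete reference)

| ENV | Default | Purpose | Phase |
|---|---|---|---|
| MONITOR_AUTH_USER | admin | Login username | |
| MONITOR_AUTH_PASSWORD | (none) | Login password, required | |
| MONITOR_AUTH_TOKEN_SECRET | (none) | JWT signing secret, required | |
| MONITOR_ACCOUNTS_JSON | (empty) | Multi-account RBAC JSON | |
| OPENCLAW_ES_URL | http://10.10.1.147:9200 | ES address | 15+ |
| OPENCLAW_ES_TIMEOUT_SECONDS | 60 | ES query timeout (a 24h aggregation takes ~30s) | 69.D (new) |
| OPENCLAW_REASONING_PRIMARY | local-onprem | Primary LLM | 34 |
| OPENCLAW_LOCAL_URL | (none) | Local LLM URL | 34 |
| OPENCLAW_LOCAL_MODEL | (none) | Local LLM model name | 34 |
| OPENCLAW_LOCAL_API_KEY | (empty) | Local LLM auth (usually not needed) | 34 |
| OPENCLAW_ALLOW_CLOUD_FALLBACK | true | Allow fallback to DeepSeek | 34 |
| OPENCLAW_DEEPSEEK_URL | https://api.deepseek.com | DeepSeek endpoint | 32 |
| OPENCLAW_DEEPSEEK_API_KEY | (none) | DeepSeek key; fallback only works when it is set | 32 |
| OPENCLAW_DEEPSEEK_MODEL | deepseek-v4-flash | DeepSeek model name (do not use the old name) | 32 |
| OPENCLAW_REASONER_TIMEOUT_SECONDS | 8 | LLM reasoner timeout; 30 recommended when the cloud endpoint is slow | 32 |
| PHASE67A1_RESOLVER_IP | 172.16.1.42 | Default resolver IP for capture (aligned with asset_registry in Phase 68.B) | 67.A.1 + 68.B |
| PHASE67A1_CLIENT_IP | 192.0.2.10 | RFC 5737 placeholder | 67.A.1 |
| PHASE67A1_INCIDENT_A_START | 2026-06-04T15:30:46+08:00 | Start of the pinned incident window for endpoint A | 67.A.6, red line 22 |
| PHASE67A1_INCIDENT_A_END | 2026-06-04T15:45:46+08:00 | End of the incident window for endpoint A | 67.A.6 |
| PHASE67A1_INCIDENT_BCD_START | 2026-06-04T03:36:05+00:00 | Start of the window for endpoints B/C/D (UTC) | 67.A.6 |
| PHASE67A1_INCIDENT_BCD_END | 2026-06-04T07:33:27+00:00 | End of the window for endpoints B/C/D | 67.A.6 |
| LIVE_MAX_WINDOW | 24h (raised in Phase 69.C) | Maximum live query window allowed on the clue path | 69.C |
| ALLOW_WIDE_WINDOW | 1 (added in Phase 69.D) | Allows the 24h-168h range (above 168h the hard cap raises) | 69.D |
| source_mode | live | Data source mode (live = real ES / sample = historical samples) | |
| window | 6 | Default time window (hours) | |

Changing any ENV value requires a backend restart, because Flask reads the environment at startup. See the config-change SOP section of 05-operations-manual.md.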
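The interaction between LIVE_MAX_WINDOW, ALLOW_WIDE_WINDOW, and the 168h hard cap is easier to see in code. The sketch below is a reading of the table, not the backend's actual implementation: the function names are hypothetical, and it assumes that a request above LIVE_MAX_WINDOW is rejected (rather than clamped) unless ALLOW_WIDE_WINDOW=1.

```python
HARD_CAP_HOURS = 168  # Phase 69.D hard cap from the table above


def parse_hours(text):
    """Parse a window such as '24h' or '6' into whole hours."""
    hours = int(text.strip().lower().removesuffix("h"))
    if hours <= 0:
        raise ValueError(f"window must be positive, got {text!r}")
    return hours


def check_window(requested_hours, env):
    """Validate a requested live window against the documented limits.

    Assumption: over-limit requests raise instead of being clamped.
    """
    max_hours = parse_hours(env.get("LIVE_MAX_WINDOW", "24h"))
    allow_wide = env.get("ALLOW_WIDE_WINDOW", "0") == "1"
    if requested_hours > HARD_CAP_HOURS:
        raise ValueError(f"{requested_hours}h exceeds the {HARD_CAP_HOURS}h hard cap")
    if requested_hours > max_hours and not allow_wide:
        raise ValueError(
            f"{requested_hours}h exceeds LIVE_MAX_WINDOW={max_hours}h "
            "and ALLOW_WIDE_WINDOW is not 1"
        )
    return requested_hours
```

Under these assumptions a 72h request passes only with ALLOW_WIDE_WINDOW=1, and a 200h request fails regardless of either setting.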

Production deployment (for a future move to production; not needed during the trial)

1. Serve the frontend as static files with nginx

# /etc/nginx/sites-available/log-assistant.conf

server {
    listen 443 ssl;
    server_name log-assistant.gbu.edu.cn;

    ssl_certificate /etc/letsencrypt/live/log-assistant.gbu.edu.cn/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/log-assistant.gbu.edu.cn/privkey.pem;

    # frontend static files
    root /var/www/log-assistant/dist;
    index index.html;
    try_files $uri /index.html;

    # reverse proxy to the backend API
    location /api/ {
        proxy_pass http://127.0.0.1:5000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_read_timeout 180s;   # LLM responses take 30-120s; leave headroom
        proxy_connect_timeout 10s;
    }
}

Build the frontend once with npm run build (output goes to dist/), then scp it to /var/www/log-assistant/dist/. Later updates only need npm run build + scp.

2. Run the backend under systemd

# /etc/systemd/system/log-assistant-backend.service

[Unit]
Description=Log Assistant Backend (Flask)
After=network.target

[Service]
Type=simple
User=log-assistant
WorkingDirectory=/opt/log-assistant/workspace/system_monitor
EnvironmentFile=/opt/log-assistant/workspace/system_monitor/.env.local
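# Bind to loopback only: nginx proxies /api/ to 127.0.0.1:5000, so the backend does not need to listen on external interfaces
ExecStart=/opt/log-assistant/workspace/system_monitor/backend/.venv/bin/gunicorn \
    -w 4 -b 127.0.0.1:5000 \
    --timeout 180 \
    --access-logfile /var/log/log-assistant/access.log \
    --error-logfile /var/log/log-assistant/error.log \
    backend.app:app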
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target

# Enable and start the service (shell commands, not part of the unit file)
sudo cp log-assistant-backend.service /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable log-assistant-backend
sudo systemctl start log-assistant-backend

# Check status
sudo systemctl status log-assistant-backend
sudo journalctl -u log-assistant-backend -f

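3. SQLite → PostgreSQL (when alert volume is high)

Trial and small-to-medium production: SQLite is sufficient.

Sustained alert volume above 1k/day: switch to PostgreSQL.

# Install the postgres adapter
pip install psycopg2-binary

# Changing the backend SQL connection (in the assistant/*_store.py files) takes engineering work and is out of scope for this doc
# Kept as a phase70.X source-change candidate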
4. HTTPS via Let's Encrypt

sudo apt install certbot python3-certbot-nginx
sudo certbot --nginx -d log-assistant.gbu.edu.cn
# Certificates are valid for 90 days; the certbot package's timer renews them automatically before expiry

5. Log rotation

# /etc/logrotate.d/log-assistant
/var/log/log-assistant/*.log {
    daily
    rotate 30
    compress
    delaycompress
    notifempty
    create 0640 log-assistant log-assistant
    sharedscripts
    postrotate
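        # The unit above defines no ExecReload, so `systemctl reload` would fail; gunicorn reopens its log files on SIGUSR1
        systemctl kill -s USR1 log-assistant-backend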
    endscript
}

Upgrade / update SOP

Pull new code

cd ~/code/project/cs-26spring-final-project
git pull origin <branch>

# Check whether the backend changed (restart needed) or the frontend changed (rebuild needed)
git diff HEAD@{1} --name-only | head -10
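Deciding what to do after a pull follows a simple mapping from file type to action, which the sections below spell out. As a minimal sketch (the function names and action labels are hypothetical, and the rules cover only the file types this SOP mentions), the mapping could be automated like this:

```python
import subprocess


def changed_files(old="HEAD@{1}", new="HEAD"):
    """List files that differ between two revisions (close to the git diff command above)."""
    result = subprocess.run(
        ["git", "diff", "--name-only", old, new],
        capture_output=True, text=True, check=True,
    )
    return [line for line in result.stdout.splitlines() if line]


def plan_actions(paths):
    """Map changed paths to follow-up actions from this SOP, sorted for stable output."""
    actions = set()
    for path in paths:
        name = path.rsplit("/", 1)[-1]
        if name == "requirements.txt":
            actions.add("pip install")
        elif name == "package.json":
            actions.add("npm install")
        elif path.endswith(".py"):
            actions.add("restart backend")
        elif path.endswith((".vue", ".js")):
            actions.add("rebuild frontend")
    return sorted(actions)


if __name__ == "__main__":
    for action in plan_actions(changed_files()):
        print(action)
```

Changes to .env.local do not appear here because the file is not tracked by git; they always require a backend restart.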

Backend changed (.py changes): restart required + verify mtime

cd workspace/system_monitor

# 1. Kill the old backend (-a prints the command line so the grep filter has something to match; plain pgrep prints PIDs only)
OLD_PID=$(pgrep -af "backend/app.py" | grep -v "bash -c" | awk '{print $1}' | head -1)
[ -n "$OLD_PID" ] && kill "$OLD_PID" && sleep 3

# 2. (Optional; Phase 69.A lesson 11) Clear the .pyc cache to avoid stale bytecode
find backend/ -name "__pycache__" -type d -exec rm -rf {} + 2>/dev/null

# 3. Restart
set -a; source .env.local; set +a
nohup ./backend/.venv/bin/python backend/app.py > logs/backend-verify.log 2>&1 &
NEW_PID=$!
sleep 6

# 4. Verify mtime (the change is applied only if the backend start time ≥ the latest source mtime)
# ps reports the real process start time; the mtime of /proc/<pid> is not a reliable substitute
BACKEND_TS=$(date -d "$(ps -o lstart= -p "$NEW_PID")" +%s)
SRC_TS=$(stat -c %Y backend/assistant/<the file you changed>.py)
[ "$BACKEND_TS" -ge "$SRC_TS" ] && echo "✅ applied" || echo "❌ restart failed"

# 5. Verify health
curl -s -o /dev/null -w "HTTP %{http_code}\n" http://127.0.0.1:5000/api/v1/healthz

# 6. Run the unit tests
PYTHONPATH=. ./backend/.venv/bin/python -m unittest discover test_backup 2>&1 | tail -3
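The mtime check in step 4 looks at a single file. To compare the backend's start time against every source file at once, something like the following sketch could be used. It is Linux-only, the helper names are hypothetical, and it assumes the start time can be derived from /proc/<pid>/stat (field 22, in clock ticks since boot) plus the btime line of /proc/stat.

```python
import os
from pathlib import Path


def process_start_ts(pid):
    """Return the process start time as a Unix timestamp (Linux /proc only)."""
    with open(f"/proc/{pid}/stat") as handle:
        # The command name (field 2) may contain spaces, so split after the last ')'.
        fields = handle.read().rsplit(")", 1)[1].split()
    start_ticks = int(fields[19])  # field 22 overall; fields[0] is field 3
    with open("/proc/stat") as handle:
        boot_ts = next(int(line.split()[1]) for line in handle if line.startswith("btime"))
    return boot_ts + start_ticks / os.sysconf("SC_CLK_TCK")


def source_mtimes(root, pattern="*.py"):
    """Collect modification times for every matching file under root."""
    return {str(path): path.stat().st_mtime for path in Path(root).rglob(pattern)}


def newer_than(start_ts, mtimes):
    """Return the paths modified after the process started, i.e. not yet loaded."""
    return sorted(path for path, mtime in mtimes.items() if mtime > start_ts)
```

A usage example, assuming NEW_PID from step 3 is passed in as an integer: newer_than(process_start_ts(pid), source_mtimes("backend")) returns an empty list when the running backend started after the last source change.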

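Frontend changed (.vue / .js changes)

cd workspace/system_monitor/frontend

# Dev mode: Vite HMR reloads automatically, no manual step needed
# The browser refreshes by itself

# Production (if deployed behind nginx)
npm run build
# scp -r dist/* server:/var/www/log-assistant/dist/   (-r so that subdirectories of dist/ are copied too)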

.env.local changed: backend restart required

# Same as the backend restart SOP above
# Flask reads the environment once at startup; a change without a restart has no effect

Dependencies changed (requirements.txt / package.json)

# Python
cd workspace/system_monitor
backend/.venv/bin/pip install -r backend/requirements.txt

# Node
cd workspace/system_monitor/frontend
npm install

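One-command script wrappers (recommended)

scripts/run_round_evaluation.sh already wraps the 6-step R(N) evaluation pipeline. See the evaluation-round section of 05-operations-manual.md.

One-command scripts such as scripts/restart-all.sh and scripts/health-check.sh could be added later; they remain phase70.X candidates.


Next: 05-operations-manual.md (daily operations SOP)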