04 - Installation SOP
From git clone to an instance ops can actually use: 30-60 minutes. If something breaks, see 06-troubleshooting.md.
System requirements
| Item | Requirement | Check |
|---|---|---|
| OS | Linux, Ubuntu 22.04+ (what this project runs on); other distros should work in theory | lsb_release -a |
| Python | 3.10+ (this project runs 3.10; older versions are not supported) | python3 --version |
| Node.js | 18.x+ (required by Vue 3 + Vite 8) | node --version |
| npm | 9.x+ | npm --version |
| Git | 2.30+ (required by filter-repo) | git --version |
| Disk | 5 GB+ (covers node_modules 127 MB, venv 55 MB, and accumulating logs) | df -h |
| RAM | 2 GB+ (Flask backend 500 MB + Vite frontend 200 MB) | free -m |
| Network | Campus network plus outbound internet (the DeepSeek API fallback depends on it) | curl -I https://api.deepseek.com |
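The version checks in the table can be scripted. The sketch below is a minimal, hypothetical helper (not part of the repo): it runs each tool's `--version`, parses the first dotted number, and compares it with the minimums from the table as integer tuples. The function names are illustrative assumptions.

```python
import re
import shutil
import subprocess

# Minimum versions copied from the requirements table.
MINIMUMS = {"python3": (3, 10), "node": (18,), "npm": (9,), "git": (2, 30)}


def parse_version(text):
    """Return the first dotted number in text, e.g. 'git version 2.34.1' -> (2, 34, 1)."""
    match = re.search(r"\d+(?:\.\d+)*", text)
    if match is None:
        raise ValueError(f"no version number in {text!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(version_text, minimum):
    """Compare as integer tuples so that 3.10 sorts above 3.9."""
    return parse_version(version_text) >= tuple(minimum)


def check_tools():
    """Run `<tool> --version` for each tool; report ok / too old / missing / unknown."""
    report = {}
    for tool, minimum in MINIMUMS.items():
        if shutil.which(tool) is None:
            report[tool] = "missing"
            continue
        done = subprocess.run([tool, "--version"], capture_output=True, text=True)
        try:
            ok = meets_minimum(done.stdout or done.stderr, minimum)
        except ValueError:
            report[tool] = "unknown"
            continue
        report[tool] = "ok" if ok else "too old"
    return report


if __name__ == "__main__":
    for tool, status in check_tools().items():
        print(f"{tool}: {status}")
    free_gb = shutil.disk_usage(".").free / 1e9
    print(f"free disk: {free_gb:.1f} GB (the table asks for 5 GB+)")
```

Tuple comparison is used instead of string comparison because "3.9" sorts above "3.10" as text.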
Installation in 4 steps (steps 2-4 start from the repo root)
Step 1: clone the code + venv
# 1. clone
cd ~/code/project
git clone <repo-url> cs-26spring-final-project
cd cs-26spring-final-project
# 2. Create the venv
cd workspace/system_monitor
python3 -m venv backend/.venv
source backend/.venv/bin/activate
pip install --upgrade pip
pip install -r backend/requirements.txt
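# Verify (flask.__version__ is deprecated in Flask 3.x, so read the version from package metadata)
backend/.venv/bin/python -c "import importlib.metadata as m; print(m.version('flask'))"
# Expected: 3.1.3+ (this project runs 3.1.3)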
Step 2: frontend dependencies
cd workspace/system_monitor/frontend
npm install
# Takes ~30s (campus mirror) / ~5 min (official registry)
# Verify
npm run build 2>&1 | tail -3
# Expected: ✓ built in N.NNs
Step 3: configure .env.local (secrets, never committed to git)
cd workspace/system_monitor
cp .env.local.example .env.local
vim .env.local
Required (without these the backend exits via sys.exit(1)):
# === Auth secrets ===
MONITOR_AUTH_USER=admin
MONITOR_AUTH_PASSWORD=<set a strong password; do not keep a default>
MONITOR_AUTH_TOKEN_SECRET=<generate with: openssl rand -hex 32>
# === ES connection ===
OPENCLAW_ES_URL=http://10.10.1.147:9200
# If a VPN is required, install the SSL VPN (vpn.gbu.edu.cn) at the OS level beforehand
OPENCLAW_ES_TIMEOUT_SECONDS=60 # added in Phase 69.D; default 60s (a 24h aggregation takes ~30s, so this is enough)
# === Primary LLM (local first) ===
OPENCLAW_REASONING_PRIMARY=local-onprem
OPENCLAW_LOCAL_URL=http://10.10.x.x:8000/v1/chat/completions
OPENCLAW_LOCAL_MODEL=Qwen2.5-72B # replace with the model name actually deployed
OPENCLAW_LOCAL_API_KEY= # usually not needed locally; leave empty
# === LLM fallback (DeepSeek cloud) ===
OPENCLAW_ALLOW_CLOUD_FALLBACK=true
OPENCLAW_DEEPSEEK_URL=https://api.deepseek.com
OPENCLAW_DEEPSEEK_API_KEY=sk-<your real key>
OPENCLAW_DEEPSEEK_MODEL=deepseek-v4-flash # note: v4; do not use deepseek-chat (deprecated 2026/07/24)
# === Backend capture flow (used by R(N) evaluation) ===
PHASE67A1_RESOLVER_IP=172.16.1.42 # primary campus recursive DNS as registered in asset_registry (wired up in Phase 68.B)
PHASE67A1_CLIENT_IP=192.0.2.10 # RFC 5737 TEST-NET-1 placeholder
PHASE67A1_INCIDENT_A_START=2026-06-04T15:30:46+08:00 # pinned in evaluator-query-design.md
PHASE67A1_INCIDENT_A_END=2026-06-04T15:45:46+08:00
PHASE67A1_INCIDENT_BCD_START=2026-06-04T03:36:05+00:00
PHASE67A1_INCIDENT_BCD_END=2026-06-04T07:33:27+00:00
# === Live mode (used by the clue path) ===
LIVE_MAX_WINDOW=24h # Phase 69.C raised 6h→24h; Phase 69.D added a 168h hard cap
ALLOW_WIDE_WINDOW=1 # added in Phase 69.D; allows windows in the 24h-168h range
# === source_mode (production must use live) ===
source_mode=live
window=6
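Since a missing required key makes the backend exit at startup, it can help to lint .env.local first. The sketch below is a hypothetical pre-flight check, not the backend's own validation. It assumes the required set is the two keys the ENV table marks as required with no default (the backend may check more), and it does not handle quoted values.

```python
import re
import secrets

# Assumption: only the keys the ENV table marks as required with no default.
REQUIRED_KEYS = ("MONITOR_AUTH_PASSWORD", "MONITOR_AUTH_TOKEN_SECRET")


def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank/comment lines and trailing comments."""
    env = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # A '#' at the start of the value or after whitespace begins a comment.
        value = re.sub(r"(^|\s)#.*$", "", value).strip()
        env[key.strip()] = value
    return env


def find_problems(env, required=REQUIRED_KEYS):
    """List required keys that are missing, empty, or still a <placeholder>."""
    problems = []
    for key in required:
        value = env.get(key, "")
        if not value:
            problems.append(f"{key}: missing or empty")
        elif value.startswith("<") and value.endswith(">"):
            problems.append(f"{key}: still a placeholder")
    return problems


sample = """\
MONITOR_AUTH_USER=admin
MONITOR_AUTH_PASSWORD=<set a strong password>
OPENCLAW_ES_TIMEOUT_SECONDS=60 # default 60s
OPENCLAW_LOCAL_API_KEY= # usually empty
"""
env = parse_env(sample)
for problem in find_problems(env):
    print(problem)
# Same shape as `openssl rand -hex 32`: 32 random bytes as 64 hex characters.
print("suggested MONITOR_AUTH_TOKEN_SECRET:", secrets.token_hex(32))
```

For a real run, replace `sample` with the contents of .env.local, for example `parse_env(open(".env.local").read())`.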
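Optional:
# === Multi-account RBAC (recommended for production, not needed for a trial) ===
# Single-quoted so that `source .env.local` keeps the JSON intact
MONITOR_ACCOUNTS_JSON='{"admin": {"password_hash": "...", "role": "admin"}, "operator": {...}}'
# Generate the hash with: python -m backend.security_admin hash <plain>
# === LLM timeout tuning (when the cloud endpoint is slow) ===
OPENCLAW_REASONER_TIMEOUT_SECONDS=30 # default 8s; DeepSeek measured at 9-32s; 30s recommended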
Step 4: start and verify
cd workspace/system_monitor
set -a; source .env.local; set +a
# Backend
mkdir -p logs  # make sure the log directory exists before redirecting into it
nohup ./backend/.venv/bin/python backend/app.py > logs/backend-verify.log 2>&1 &
BACKEND_PID=$!
echo "backend pid=$BACKEND_PID"
sleep 6
# Verify
curl -s -o /dev/null -w "HTTP %{http_code}\n" --max-time 5 http://127.0.0.1:5000/api/v1/healthz
# Expected: 200 (healthy) / 401 (healthy, auth required) / 404 (up, but the path has changed)
# Must not be 000 (connection refused)
# Frontend (dev mode; in production switch to static hosting behind nginx)
cd frontend
nohup npm run dev > ../logs/frontend-dev.log 2>&1 &
sleep 5
curl -s -o /dev/null -w "HTTP %{http_code}\n" --max-time 3 http://127.0.0.1:3002
# Expected: 200
# Run the backend test baseline (~30-60s)
cd ..
PYTHONPATH=. ./backend/.venv/bin/python -m unittest discover test_backup 2>&1 | tail -3
# Expected: Ran 1850 tests in NN.NNs OK (skipped=1)
Open http://127.0.0.1:3002 in a browser (or the nginx-proxied URL) → the login page appears → log in as admin with your MONITOR_AUTH_PASSWORD → the alert queue appears = installation succeeded.
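The curl checks in step 4 treat 200, 401 and 404 as "the process is up" and only a failed connection as down. A minimal sketch of the same rule using only the standard library follows; `classify` and `probe` are hypothetical names, and the URLs are the ones used in step 4.

```python
import urllib.error
import urllib.request


def classify(status):
    """Map an HTTP status (None = no connection) to the verdicts listed in step 4."""
    if status is None:
        return "down (connection failed)"
    if status == 200:
        return "healthy"
    if status == 401:
        return "healthy, auth required"
    if status == 404:
        return "up, but the path has changed"
    return f"unexpected HTTP {status}"


def probe(url, timeout=5.0):
    """Return the HTTP status code, or None when no connection could be made."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:  # 4xx/5xx still prove the server answered
        return err.code
    except (urllib.error.URLError, OSError):  # refused, timed out, DNS failure
        return None


if __name__ == "__main__":
    for url in (
        "http://127.0.0.1:5000/api/v1/healthz",
        "http://127.0.0.1:3002",
    ):
        print(url, "->", classify(probe(url)))
```

`HTTPError` is caught before `URLError` because it is a subclass; reversing the order would report a 401 as a failed connection.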
Full ENV table (complete reference)
| ENV | Default | Purpose | Affected phase |
|---|---|---|---|
| MONITOR_AUTH_USER | admin | Login username | – |
| MONITOR_AUTH_PASSWORD | (none) | Login password; required | – |
| MONITOR_AUTH_TOKEN_SECRET | (none) | JWT signing secret; required | – |
| MONITOR_ACCOUNTS_JSON | (empty) | Multi-account RBAC JSON | – |
| OPENCLAW_ES_URL | http://10.10.1.147:9200 | ES address | 15+ |
| OPENCLAW_ES_TIMEOUT_SECONDS | 60 | ES query timeout (a 24h aggregation takes ~30s) | 69.D (new) |
| OPENCLAW_REASONING_PRIMARY | local-onprem | Primary LLM choice | 34 |
| OPENCLAW_LOCAL_URL | (none) | Local LLM URL | 34 |
| OPENCLAW_LOCAL_MODEL | (none) | Local LLM model name | 34 |
| OPENCLAW_LOCAL_API_KEY | (empty) | Local LLM auth (usually not needed) | 34 |
| OPENCLAW_ALLOW_CLOUD_FALLBACK | true | Allow fallback to DeepSeek | 34 |
| OPENCLAW_DEEPSEEK_URL | https://api.deepseek.com | DeepSeek endpoint | 32 |
| OPENCLAW_DEEPSEEK_API_KEY | (none) | DeepSeek key; fallback only works when it is set | 32 |
| OPENCLAW_DEEPSEEK_MODEL | deepseek-v4-flash | DeepSeek model name (do not use the old name!) | 32 |
| OPENCLAW_REASONER_TIMEOUT_SECONDS | 8 | LLM reasoner timeout; 30 recommended when the cloud is slow | 32 |
| PHASE67A1_RESOLVER_IP | 172.16.1.42 | Default resolver IP for capture (aligned with asset_registry in Phase 68.B) | 67.A.1 + 68.B |
| PHASE67A1_CLIENT_IP | 192.0.2.10 | RFC 5737 placeholder | 67.A.1 |
| PHASE67A1_INCIDENT_A_START | 2026-06-04T15:30:46+08:00 | Start of the pinned incident window for endpoint A | 67.A.6, red line 22 |
| PHASE67A1_INCIDENT_A_END | 2026-06-04T15:45:46+08:00 | End of the incident window for endpoint A | 67.A.6 |
| PHASE67A1_INCIDENT_BCD_START | 2026-06-04T03:36:05+00:00 | Start of the window for endpoints B/C/D (UTC) | 67.A.6 |
| PHASE67A1_INCIDENT_BCD_END | 2026-06-04T07:33:27+00:00 | End of the window for endpoints B/C/D | 67.A.6 |
| LIVE_MAX_WINDOW | 24h (raised in Phase 69.C) | Maximum live query window allowed on the clue path | 69.C |
| ALLOW_WIDE_WINDOW | 1 (raised in Phase 69.D) | Allows the 24h-168h range (anything above 168h hits the hard cap and raises) | 69.D |
| source_mode | live | Data source mode (live = real ES / sample = historical samples) | – |
| window | 6 | Default time window (hours) | – |
Changing any ENV value requires a backend restart (Flask reads the environment once at startup). See 05-operations-manual.md §Config change SOP.
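The PHASE67A1_INCIDENT_* values mix offsets: the A window is written in +08:00 while the B/C/D window is in UTC. When comparing them against ES timestamps it is easy to slip by eight hours, so the sketch below normalizes both to UTC and prints each window's length. Only the four timestamps come from this document; the helper names are illustrative.

```python
from datetime import datetime, timezone

# Values copied from the ENV table.
WINDOWS = {
    "A": ("2026-06-04T15:30:46+08:00", "2026-06-04T15:45:46+08:00"),
    "BCD": ("2026-06-04T03:36:05+00:00", "2026-06-04T07:33:27+00:00"),
}


def to_utc(stamp):
    """Parse an ISO 8601 timestamp with offset and convert it to UTC."""
    return datetime.fromisoformat(stamp).astimezone(timezone.utc)


def summarize(start, end):
    """Return (start in UTC, end in UTC, duration in seconds)."""
    start_utc, end_utc = to_utc(start), to_utc(end)
    seconds = int((end_utc - start_utc).total_seconds())
    return start_utc.isoformat(), end_utc.isoformat(), seconds


for name, (start, end) in WINDOWS.items():
    start_utc, end_utc, seconds = summarize(start, end)
    print(f"{name}: {start_utc} -> {end_utc} ({seconds} s)")
# A:   2026-06-04T07:30:46+00:00 -> 2026-06-04T07:45:46+00:00 (900 s)
# BCD: 2026-06-04T03:36:05+00:00 -> 2026-06-04T07:33:27+00:00 (14242 s)
```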
Production deployment (for a future move to production; not needed during the trial)
1. Serve the frontend as static files with nginx
# /etc/nginx/sites-available/log-assistant.conf
server {
    listen 443 ssl;
    server_name log-assistant.gbu.edu.cn;
    ssl_certificate /etc/letsencrypt/live/log-assistant.gbu.edu.cn/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/log-assistant.gbu.edu.cn/privkey.pem;
    # frontend static files
    root /var/www/log-assistant/dist;
    index index.html;
    try_files $uri /index.html;
    # reverse proxy for the backend API
    location /api/ {
        proxy_pass http://127.0.0.1:5000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_read_timeout 180s; # LLM responses take 30-120s; leave headroom
        proxy_connect_timeout 10s;
    }
}
For the frontend, run npm run build once to produce dist/, then scp it to /var/www/log-assistant/dist/. Later updates only need npm run build + scp again.
2. Run the backend under systemd
# /etc/systemd/system/log-assistant-backend.service
[Unit]
Description=Log Assistant Backend (Flask)
After=network.target
[Service]
Type=simple
User=log-assistant
WorkingDirectory=/opt/log-assistant/workspace/system_monitor
EnvironmentFile=/opt/log-assistant/workspace/system_monitor/.env.local
ExecStart=/opt/log-assistant/workspace/system_monitor/backend/.venv/bin/gunicorn \
-w 4 -b 0.0.0.0:5000 \
--timeout 180 \
--access-logfile /var/log/log-assistant/access.log \
--error-logfile /var/log/log-assistant/error.log \
backend.app:app
Restart=always
RestartSec=10
[Install]
WantedBy=multi-user.target
# Enable and start
sudo cp log-assistant-backend.service /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable log-assistant-backend
sudo systemctl start log-assistant-backend
# Check status
sudo systemctl status log-assistant-backend
sudo journalctl -u log-assistant-backend -f
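One caveat when pointing EnvironmentFile= at the same .env.local: the file from step 3 carries trailing `# ...` comments after values. A shell ignores those when sourcing the file. As far as I know, systemd only treats lines that begin with `#` or `;` as comments, so the trailing text may end up inside the value. The sketch below is a hypothetical converter (not part of the repo) that strips trailing comments from unquoted values and leaves everything else alone. It assumes quoted values carry no trailing comment.

```python
import re


def to_systemd_env(text):
    """Rewrite KEY=VALUE # note lines as KEY=VALUE; keep full-line comments."""
    out = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")) or "=" not in line:
            out.append(line)
            continue
        key, _, value = line.partition("=")
        # Quoted values are passed through untouched (assumed comment-free).
        if value.lstrip()[:1] not in ("'", '"'):
            value = re.sub(r"(^|\s)#.*$", "", value).strip()
        out.append(f"{key.strip()}={value}")
    return "\n".join(out) + "\n"


sample = "A=60 # note\n# full-line comment\nB= # empty\nC='{\"k\": 1}'\n"
converted = to_systemd_env(sample)
print(converted, end="")
```

One way to use it is to write the result to a separate file and point EnvironmentFile= there, keeping the annotated .env.local for shell use.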
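3. SQLite → PostgreSQL (when alert volume grows)
Trial and small-to-medium production: SQLite is sufficient.
Sustained alert volume above 1k/day: switch to PostgreSQL.
# Install the postgres adapter
pip install psycopg2-binary
# Change the backend SQL connection (in the assistant/*_store.py files): this takes real engineering work and is out of scope for this doc
# Candidate source change for phase70.X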
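To judge whether the "sustained above 1k/day" threshold has been reached, the daily alert count can be read straight from SQLite. The sketch below runs against an in-memory database. The table name `alerts`, the column `created_at`, and the 7-day definition of "sustained" are all assumptions for illustration; the real schema lives in the assistant/*_store.py files.

```python
import sqlite3


def alerts_per_day(conn):
    """Count rows per calendar day; `alerts` and `created_at` are hypothetical names."""
    rows = conn.execute(
        "SELECT date(created_at) AS day, COUNT(*) "
        "FROM alerts GROUP BY day ORDER BY day"
    )
    return [(day, count) for day, count in rows]


def sustained_above(counts, threshold=1000, days=7):
    """True when each of the last `days` days exceeded `threshold` alerts."""
    recent = [count for _, count in counts[-days:]]
    return len(recent) == days and all(count > threshold for count in recent)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE alerts (id INTEGER PRIMARY KEY, created_at TEXT)")
conn.executemany(
    "INSERT INTO alerts (created_at) VALUES (?)",
    [("2026-06-04 08:00:00",), ("2026-06-04 09:30:00",), ("2026-06-05 10:00:00",)],
)
counts = alerts_per_day(conn)
print(counts)
print("consider PostgreSQL:", sustained_above(counts))
```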
4. HTTPS via Let’s Encrypt
sudo apt install certbot python3-certbot-nginx
sudo certbot --nginx -d log-assistant.gbu.edu.cn
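# Certificates are valid for 90 days and certbot renews them automatically; check with: sudo certbot renew --dry-run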
5. Log rotation
# /etc/logrotate.d/log-assistant
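/var/log/log-assistant/*.log {
    daily
    rotate 30
    compress
    delaycompress
    notifempty
    create 0640 log-assistant log-assistant
    sharedscripts
    postrotate
        # The unit in section 2 defines no ExecReload=, so `systemctl reload` would fail;
        # gunicorn reopens its log files on SIGUSR1
        systemctl kill -s USR1 log-assistant-backend
    endscript
}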
Upgrade / update SOP
Pull new code
cd ~/code/project/cs-26spring-final-project
git pull origin <branch>
# Check whether the backend changed (needs a restart) or the frontend changed (needs a rebuild)
git diff HEAD@{1} --name-only | head -10
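The decision that follows from the diff (restart, rebuild, or reinstall) can be derived mechanically from the changed paths. The sketch below is a hypothetical helper that applies the rules from this SOP: .py means restart the backend, .vue/.js means rebuild the frontend, requirements.txt or package.json means reinstall dependencies. .env.local never shows up here because it is not in git.

```python
import subprocess


def changed_files(rev="HEAD@{1}"):
    """Paths changed since `rev`, as printed by `git diff --name-only`."""
    done = subprocess.run(
        ["git", "diff", "--name-only", rev],
        capture_output=True, text=True, check=True,
    )
    return done.stdout.splitlines()


def actions_for(paths):
    """Map changed paths to the follow-up actions described in this SOP."""
    actions = set()
    for path in paths:
        name = path.rsplit("/", 1)[-1]
        if name == "requirements.txt":
            actions.add("pip install")
        elif name == "package.json":
            actions.add("npm install")
        elif path.endswith(".py"):
            actions.add("restart backend")
        elif path.endswith((".vue", ".js")):
            actions.add("rebuild frontend")
    return sorted(actions)


example = [
    "workspace/system_monitor/backend/app.py",
    "workspace/system_monitor/frontend/src/App.vue",
    "workspace/system_monitor/backend/requirements.txt",
]
print(actions_for(example))
```

After a pull, `actions_for(changed_files())` gives the list for the real diff, without the `head -10` truncation.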
Backend changed (.py edits): restart required, then verify mtime
cd workspace/system_monitor
# 1. Kill the old backend (pgrep prints bare PIDs, so no further filtering is needed)
OLD_PID=$(pgrep -f "backend/app.py" | head -1)
[ -n "$OLD_PID" ] && kill "$OLD_PID" && sleep 3
# 2. (Optional; Phase 69.A lesson 11) clear the .pyc cache to avoid stale bytecode
find backend/ -name "__pycache__" -type d -exec rm -rf {} + 2>/dev/null
# 3. Restart
set -a; source .env.local; set +a
nohup ./backend/.venv/bin/python backend/app.py > logs/backend-verify.log 2>&1 &
NEW_PID=$!
sleep 6
# 4. Verify mtime (the change is live only if the backend start timestamp ≥ the latest source mtime)
BACKEND_TS=$(stat -c %Y "/proc/$NEW_PID")
SRC_TS=$(stat -c %Y backend/assistant/<the file you changed>.py)
[ "$BACKEND_TS" -ge "$SRC_TS" ] && echo "✅ applied" || echo "❌ restart failed"
# 5. Verify health
curl -s -o /dev/null -w "HTTP %{http_code}\n" http://127.0.0.1:5000/api/v1/healthz
# 6. Run unittest
PYTHONPATH=. ./backend/.venv/bin/python -m unittest discover test_backup 2>&1 | tail -3
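The mtime check in step 4 can be extended to every source file at once. The sketch below is a hypothetical variant, not the script's method: it reads the process start time from /proc/<pid>/stat plus the boot time in /proc/stat (Linux only), because the mtime of the /proc/<pid> directory is, as far as I can tell, only an approximation of the start time. It then lists files modified after that moment.

```python
import os
import tempfile


def process_start_ts(pid):
    """Start time of `pid` as a Unix timestamp, read from /proc (Linux only)."""
    with open(f"/proc/{pid}/stat") as handle:
        # The command name may contain spaces or ')', so split after the last ')'.
        fields = handle.read().rsplit(")", 1)[1].split()
    start_ticks = int(fields[19])  # field 22 (starttime); fields[0] is field 3
    with open("/proc/stat") as handle:
        boot_ts = next(
            int(line.split()[1]) for line in handle if line.startswith("btime")
        )
    return boot_ts + start_ticks / os.sysconf("SC_CLK_TCK")


def newer_than(start_ts, paths):
    """Files whose mtime is later than start_ts, i.e. not loaded by that process."""
    return [path for path in paths if os.path.getmtime(path) > start_ts]


# Demo with fixed timestamps so the result does not depend on the clock.
with tempfile.TemporaryDirectory() as tmp:
    old_file = os.path.join(tmp, "old.py")
    new_file = os.path.join(tmp, "new.py")
    for path, mtime in ((old_file, 1_000), (new_file, 3_000)):
        open(path, "w").close()
        os.utime(path, (mtime, mtime))
    stale = newer_than(2_000, [old_file, new_file])
print([os.path.basename(path) for path in stale])
```

Against a running backend this would be `newer_than(process_start_ts(NEW_PID), paths)`; an empty list means every listed file predates the restart.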
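Frontend changed (.vue / .js edits)
cd workspace/system_monitor/frontend
# Dev mode: Vite HMR reloads automatically, no manual step
# The browser refreshes on its own
# Production (if deployed behind nginx)
npm run build
# scp -r dist/* server:/var/www/log-assistant/dist/   (-r so subdirectories are copied too)
.env.local changed: backend restart required
# Same as the backend restart SOP above
# Flask reads the env once at startup; a change without a restart has no effect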
Dependencies changed (requirements.txt / package.json)
# Python
cd workspace/system_monitor
backend/.venv/bin/pip install -r backend/requirements.txt
# Node
cd workspace/system_monitor/frontend
npm install
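One-command script wrappers (recommended)
scripts/run_round_evaluation.sh already wraps the 6-step R(N) evaluation pipeline. See 05-operations-manual.md §Running an evaluation round.
One-command scripts such as scripts/restart-all.sh and scripts/health-check.sh could be added later; they are candidates for phase70.X.
Next: 05-operations-manual.md (day-to-day operations SOP)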