fix: update

2026-03-03 17:27:55 +08:00
parent 29c921f498
commit 1764a44eb3
22 changed files with 818 additions and 30 deletions
--- a/crawler/README.md
+++ b/crawler/README.md
@@ -54,6 +54,86 @@ pip install -r requirements.txt

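 **If the event timeline does not update**: most likely `npm run gdelt` has not been started. With only `npm run api` running, the event timeline is empty or shows cached data only.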

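+## How to check that the crawler is working
+
+Follow the steps below in order to confirm that the whole chain (crawler → database → Node reload → API/WebSocket) is working.
+
+### 1. One-step verification (recommended)
+
+Start the API first, then run the verification script (starting the crawler along with it is optional):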
+
+```bash
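+# Terminal 1: required
+npm run api
+
+# Terminal 2: run the verification (does not start the crawler; only checks the current state)
+./scripts/verify-pipeline.sh
+
+# Or: also start the crawler and verify after the first fetch completes
+./scripts/verify-pipeline.sh --start-crawler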
+```
+
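+The script checks: API health, that the situation data contains `lastUpdated`, whether the crawler service is reachable, the `news_content`/`situation_update` tables, the combat-loss fields, and whether `POST /api/crawler/notify` works.
+
+### 2. Quick manual checks
+
+| Step | Command / action | Expected result |
+|-----|-------------|----------|
+| Is the API running | `curl -s http://localhost:3001/api/health` | Returns `{"ok":true}` |
+| Is the situation readable | `curl -s http://localhost:3001/api/situation \| head -c 300` | Contains `lastUpdated`, `usForces`, `recentUpdates` |
+| Can RSS be fetched | `npm run crawler:test` | Prints `RSS 抓取: N 条` (RSS fetched: N items); N>0 means there were hits |
+| Crawler service (gdelt) | `curl -s http://localhost:8000/crawler/status` | Returns JSON containing `db_path`, `db_exists`, etc. |
+| Is there crawler data in the DB | `sqlite3 server/data.db "SELECT COUNT(*) FROM situation_update; SELECT COUNT(*) FROM news_content;"` or open `http://localhost:3001/api/db/dashboard` | Row counts for situation_update and news_content are > 0 (after the pipeline has run) |
+| Does notify trigger a reload | After writing to the DB the crawler POSTs `/api/crawler/notify`; Node runs `reloadFromFile` and then broadcasts | `lastUpdated` and the content in the frontend and `/api/situation` update |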
+
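+### 3. Run the pipeline once (when the crawler is not running as a service)
+
+Without starting gdelt, you can run the full pipeline a single time (fetch → dedupe → write tables → notify):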
+
+```bash
+npm run api   # keep this running (in a separate terminal)
+cd crawler && python3 -c "
+from pipeline import run_full_pipeline
+from config import DB_PATH, API_BASE
+n_fetched, n_news, n_panel = run_full_pipeline(db_path=DB_PATH, api_base=API_BASE, notify=True)
+print('fetched:', n_fetched, 'new after dedupe:', n_news, 'panel writes:', n_panel)
+"
+```
+
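+With network access and keyword hits, you should see non-zero numbers; then check whether new data appears in `curl -s http://localhost:3001/api/situation` or in the frontend event timeline.
+
+### 4. Test the extraction logic only (no DB writes)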
+
+```bash
+npm run crawler:test:extraction   # rules / db_merge tests
+# Or follow the "Quick self-test commands" section of the README and call extract_from_news on sample text to inspect combat_losses_delta / key_location_updates
+```
+
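+**Common symptoms**: 0 items fetched → the network or RSS feed is blocked, or no keyword matched; situation_update is empty → the pipeline has not run, or nothing new remained after dedupe; frontend does not refresh → `npm run api` or the crawler (gdelt) is not running.
+
+### 5. Is the crawler connected to the panel
+
+This specifically checks whether what the crawler writes to the DB matches what the panel displays: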
+
+```bash
+./scripts/check-crawler-panel-connectivity.sh
+```
+
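+It compares the `situation_update` row count on the crawler side with the number of `recentUpdates` returned by the panel API, and explains why combat losses, bases, and similar fields do not necessarily change with every news item.
+
+## How crawler data feeds the panel
+
+| Panel display | Data source (table/endpoint) | Updated by crawler? | Notes |
+|----------|---------------------|----------------|------|
+| **Event timeline** (recentUpdates) | situation_update → getSituation() | ✅ Yes | Every deduplicated news item is written to situation_update; after receiving notify, Node reloads the DB and broadcasts |
+| **Map conflict points** (conflictEvents) | gdelt_events, or RSS→gdelt backfill | ✅ Yes | Comes from GDELT; when GDELT is disabled, situation_update is synced into gdelt_events |
+| **Combat losses / equipment damage** (combatLosses) | combat_losses | ⚠️ Conditional | merge writes an increment only when the AI or rules extract a number from the news (e.g. "2 US soldiers killed") |
+| **Base / location status** (keyLocations) | key_location | ⚠️ Conditional | Updated only when key_location_updates is extracted (e.g. a base is attacked) |
+| **Force summary / index / assets** (summary, powerIndex, assets) | force_summary, power_index, force_asset | ❌ No | Initialized by seed only; the crawler does not write them |
+| **Wall Street / retaliation sentiment** (wallStreet, retaliation) | wall_street_trend, retaliation_* | ⚠️ Conditional | Updated only when the extractor outputs the corresponding fields |
+
+So **lots of news but unchanged combat-loss or base numbers** is normal: most headlines contain no parseable casualty or base figures, and only the event timeline (recentUpdates) and the map conflict points grow with every news item. If **the event timeline does not update either**, check whether the Node terminal prints `[crawler/notify] DB 已重载` (DB reloaded) after each crawl round; if it does not, check that the crawler's `API_BASE` points to the current API (default `http://localhost:3001`).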
+
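 ## DB write pipeline (consistent with section 5 of server/README)

 Both RSS and the main entry point go through the unified pipeline `pipeline.run_full_pipeline`:
@@ -80,6 +160,7 @@ RSS → fetch → clean → dedupe → write news_content / situation_update /

 - `DB_PATH`: SQLite path, default `../server/data.db`
 - `API_BASE`: Node API address, default `http://localhost:3001`
+- **`DASHSCOPE_API_KEY`**: Alibaba Cloud Tongyi (DashScope) API key. **When set, the commercial model is used throughout and no local Ollama install is needed** (useful when the macOS version is too old to run Ollama). To obtain one: [Alibaba Cloud Bailian / DashScope](https://dashscope.console.aliyun.com/) → create an API-KEY, then copy it into an environment variable or into `.env` in the project root as `DASHSCOPE_API_KEY=sk-xxx`. Summarization, classification, and combat-loss/base extraction all go through Tongyi.
 - `GDELT_QUERY`: search keywords, default `United States Iran military`
 - `GDELT_MAX_RECORDS`: maximum number of records, default 30
 - `GDELT_TIMESPAN`: time range, `1h` / `1d` / `1week`, default `1d` (recent news)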
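
Review note: the first two manual checks in section 2 of the README above could also be scripted. The following is a minimal sketch and is not part of this commit. `check_pipeline` and `http_get_json` are hypothetical names; only the `/api/health` and `/api/situation` endpoints and the three field names come from the README table.

```python
import json
from typing import Callable, Dict
from urllib.request import urlopen

# Field names listed in the README table for /api/situation.
REQUIRED_SITUATION_KEYS = ("lastUpdated", "usForces", "recentUpdates")


def http_get_json(url: str, timeout: float = 5.0) -> dict:
    """Fetch a URL and decode the body as JSON (hypothetical helper)."""
    with urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def check_pipeline(
    api_base: str, fetch: Callable[[str], dict] = http_get_json
) -> Dict[str, bool]:
    """Run the health and situation checks; any error counts as a failed check."""
    base = api_base.rstrip("/")
    results: Dict[str, bool] = {}
    try:
        results["api_health"] = fetch(f"{base}/api/health").get("ok") is True
    except Exception:
        results["api_health"] = False
    try:
        situation = fetch(f"{base}/api/situation")
        results["situation_fields"] = all(
            key in situation for key in REQUIRED_SITUATION_KEYS
        )
    except Exception:
        results["situation_fields"] = False
    return results
```

Calling `check_pipeline("http://localhost:3001")` against a running API returns one boolean per check. The `fetch` parameter is injectable so the logic can be exercised without a live server.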
--- a/crawler/__pycache__/cleaner_ai.cpython-39.pyc
+++ b/crawler/__pycache__/cleaner_ai.cpython-39.pyc
--- a/crawler/__pycache__/parser_ai.cpython-39.pyc
+++ b/crawler/__pycache__/parser_ai.cpython-39.pyc
--- a/crawler/__pycache__/pipeline.cpython-39.pyc
+++ b/crawler/__pycache__/pipeline.cpython-39.pyc
--- a/crawler/__pycache__/realtime_conflict_service.cpython-39.pyc
+++ b/crawler/__pycache__/realtime_conflict_service.cpython-39.pyc
--- a/crawler/cleaner_ai.py
+++ b/crawler/cleaner_ai.py
@@ -2,6 +2,7 @@
 """
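 AI cleaning of news data; output strictly follows the panel field constraints
 The EventTimelinePanel needs: summary (≤120 characters), category (enum), severity (enum)
+Prefers DASHSCOPE_API_KEY (Tongyi, no Ollama needed), otherwise Ollama, with rules as the final fallback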
 """
 import os
 import re
@@ -9,6 +10,7 @@ from typing import Optional

 CLEANER_AI_DISABLED = os.environ.get("CLEANER_AI_DISABLED", "0") == "1"
 OLLAMA_MODEL = os.environ.get("OLLAMA_MODEL", "llama3.1")
+DASHSCOPE_API_KEY = os.environ.get("DASHSCOPE_API_KEY", "").strip()

 # Panel schema: must match EventTimelinePanel / SituationUpdate
 SUMMARY_MAX_LEN = 120  # the panel displays it with line-clamp-2
@@ -30,6 +32,38 @@ def _rule_clean(text: str, max_len: int = SUMMARY_MAX_LEN) -> str:
    return _sanitize_summary(text, max_len)


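+def _call_dashscope_summary(text: str, max_len: int, timeout: int = 8) -> Optional[str]:
+    """Call Alibaba Cloud Tongyi (DashScope) to produce a summary; no Ollama needed. Requires DASHSCOPE_API_KEY. `timeout` is accepted but not currently passed to the SDK call."""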
+    if not DASHSCOPE_API_KEY or CLEANER_AI_DISABLED or not text or len(str(text).strip()) < 5:
+        return None
+    try:
+        import dashscope
+        from http import HTTPStatus
+        dashscope.api_key = DASHSCOPE_API_KEY
+        prompt = f"""将新闻提炼为1-2句简洁中文事实，直接输出纯文本，不要标号、引号、解释。限{max_len}字内。
+
+原文：{str(text)[:350]}
+
+输出："""
+        r = dashscope.Generation.call(
+            model="qwen-turbo",
+            messages=[{"role": "user", "content": prompt}],
+            result_format="message",
+            max_tokens=150,
+        )
+        if r.status_code != HTTPStatus.OK:
+            return None
+        out = (r.output.get("choices", [{}])[0].get("message", {}).get("content", "") or "").strip()
+        out = re.sub(r"^[\d\.\-\*\s]+", "", out)
+        out = re.sub(r"^['\"\s]+|['\"\s]+$", "", out)
+        out = _sanitize_summary(out, max_len)
+        if out and len(out) > 3:
+            return out
+        return None
+    except Exception:
+        return None
+
+
 def _call_ollama_summary(text: str, max_len: int, timeout: int = 6) -> Optional[str]:
    """Call Ollama to produce a summary; the output must be plain text of at most max_len characters"""
    if CLEANER_AI_DISABLED or not text or len(str(text).strip()) < 5:
@@ -71,7 +105,11 @@ def clean_news_for_panel(text: str, max_len: int = SUMMARY_MAX_LEN) -> str:
    t = str(text).strip()
    if not t:
        return ""
-    res = _call_ollama_summary(t, max_len, timeout=6)
+    # Prefer the commercial model (Tongyi), otherwise Ollama, with rules as the final fallback
+    if DASHSCOPE_API_KEY:
+        res = _call_dashscope_summary(t, max_len, timeout=8)
+    else:
+        res = _call_ollama_summary(t, max_len, timeout=6)
    if res:
        return res
    return _rule_clean(t, max_len)
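
Review note: as written, `clean_news_for_panel` picks exactly one AI backend per call. When `DASHSCOPE_API_KEY` is set and the DashScope call returns nothing, control falls through to the rule cleaner and Ollama is not tried. The sketch below mirrors that selection with injected stand-ins so the behavior can be checked in isolation; `clean_with_fallback` and its parameters are hypothetical names, not code from this commit.

```python
from typing import Callable, Optional

# A summarizer takes (text, max_len) and returns a summary, or None on failure.
Summarizer = Callable[[str, int], Optional[str]]


def clean_with_fallback(
    text: str,
    max_len: int,
    use_dashscope: bool,
    dashscope: Summarizer,
    ollama: Summarizer,
    rule_clean: Callable[[str, int], str],
) -> str:
    """Pick one AI backend, then fall back to the rule cleaner if it yields nothing."""
    t = str(text).strip()
    if not t:
        return ""
    # Exactly one AI backend is attempted, matching the if/else in the diff.
    primary = dashscope if use_dashscope else ollama
    res = primary(t, max_len)
    return res if res else rule_clean(t, max_len)
```

If trying Ollama after a failed DashScope call is desired, the `if`/`else` would need to become a sequence of attempts. That would be a behavior change and is not what the commit implements.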
--- a/crawler/parser_ai.py
+++ b/crawler/parser_ai.py
@@ -1,7 +1,7 @@
 # -*- coding: utf-8 -*-
 """
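 AI news classification and severity rating
-Prefers the local Ollama model (free); falls back to rules on failure
+Prefers DASHSCOPE_API_KEY (Tongyi, no Ollama needed), otherwise Ollama, with rules as the final fallback
 Set PARSER_AI_DISABLED=1 to use rules only (faster)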
 """
 import os
@@ -11,7 +11,8 @@ Category = Literal["deployment", "alert", "intel", "diplomatic", "other"]
 Severity = Literal["low", "medium", "high", "critical"]

 PARSER_AI_DISABLED = os.environ.get("PARSER_AI_DISABLED", "0") == "1"
-OLLAMA_MODEL = os.environ.get("OLLAMA_MODEL", "llama3.1")  # or qwen2.5:7b
+OLLAMA_MODEL = os.environ.get("OLLAMA_MODEL", "llama3.1")
+DASHSCOPE_API_KEY = os.environ.get("DASHSCOPE_API_KEY", "").strip()

 _CATEGORIES = ("deployment", "alert", "intel", "diplomatic", "other")
 _SEVERITIES = ("low", "medium", "high", "critical")
@@ -32,8 +33,37 @@ def _parse_ai_response(text: str) -> Tuple[Category, Severity]:
    return cat, sev  # type: ignore


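+def _call_dashscope(text: str, timeout: int = 6) -> Optional[Tuple[Category, Severity]]:
+    """Call Alibaba Cloud Tongyi (DashScope) for classification; no Ollama needed. Requires DASHSCOPE_API_KEY. `timeout` is accepted but not currently passed to the SDK call."""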
+    if not DASHSCOPE_API_KEY or PARSER_AI_DISABLED:
+        return None
+    try:
+        import dashscope
+        from http import HTTPStatus
+        dashscope.api_key = DASHSCOPE_API_KEY
+        prompt = f"""Classify this news about US-Iran/middle east (one line only):
+- category: deployment|alert|intel|diplomatic|other
+- severity: low|medium|high|critical
+
+News: {text[:300]}
+
+Reply format: category:severity (e.g. alert:high)"""
+        r = dashscope.Generation.call(
+            model="qwen-turbo",
+            messages=[{"role": "user", "content": prompt}],
+            result_format="message",
+            max_tokens=32,
+        )
+        if r.status_code != HTTPStatus.OK:
+            return None
+        out = r.output.get("choices", [{}])[0].get("message", {}).get("content", "")
+        return _parse_ai_response(out)
+    except Exception:
+        return None
+
+
 def _call_ollama(text: str, timeout: int = 5) -> Optional[Tuple[Category, Severity]]:
-    """Call the local Ollama model. Run ollama run llama3.1 or qwen2.5:7b first"""
+    """Call the local Ollama model. Run ollama run llama3.1 first"""
    if PARSER_AI_DISABLED:
        return None
    try:
@@ -73,9 +103,16 @@ def _rule_severity(text: str, category: Category) -> Severity:
    return severity(text, category)


+def _call_ai(text: str) -> Optional[Tuple[Category, Severity]]:
+    """Prefer Tongyi, otherwise Ollama"""
+    if DASHSCOPE_API_KEY:
+        return _call_dashscope(text)
+    return _call_ollama(text)
+
+
 def classify(text: str) -> Category:
    """Classify. Falls back to rules when the AI call fails"""
-    res = _call_ollama(text)
+    res = _call_ai(text)
    if res:
        return res[0]
    return _rule_classify(text)
@@ -83,7 +120,7 @@ def classify(text: str) -> Category:

 def severity(text: str, category: Category) -> Severity:
    """Severity. Falls back to rules when the AI call fails"""
-    res = _call_ollama(text)
+    res = _call_ai(text)
    if res:
        return res[1]
    return _rule_severity(text, category)
@@ -95,7 +132,7 @@ def classify_and_severity(text: str) -> Tuple[Category, Severity]:
        from parser import classify, severity
        c = classify(text)
        return c, severity(text, c)
-    res = _call_ollama(text)
+    res = _call_ai(text)
    if res:
        return res
    return _rule_classify(text), _rule_severity(text, _rule_classify(text))
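
Review note: both `_call_dashscope` and `_call_ollama` hand the raw model reply to `_parse_ai_response`, whose body is outside this diff. The sketch below shows one way a `category:severity` reply could be validated against the two enums. `parse_label` and its `other`/`low` defaults are assumptions for illustration, not the project's implementation.

```python
import re
from typing import Tuple

# Enum values copied from parser_ai.py.
CATEGORIES = ("deployment", "alert", "intel", "diplomatic", "other")
SEVERITIES = ("low", "medium", "high", "critical")

# Matches e.g. "alert:high", tolerating spaces around the colon.
_LABEL_RE = re.compile(
    r"\b(" + "|".join(CATEGORIES) + r")\s*:\s*(" + "|".join(SEVERITIES) + r")\b"
)


def parse_label(reply: str) -> Tuple[str, str]:
    """Extract (category, severity) from a model reply.

    Falls back to ("other", "low") when no valid pair is found;
    that default is an assumption of this sketch.
    """
    match = _LABEL_RE.search(str(reply).lower())
    if match is None:
        return "other", "low"
    return match.group(1), match.group(2)
```

Restricting the pattern to the known enum values means a chatty reply such as `Reply: Diplomatic : MEDIUM.` still parses, while free text without a valid pair yields the default.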
--- a/crawler/pipeline.py
+++ b/crawler/pipeline.py
@@ -14,10 +14,14 @@ def _notify_api(api_base: str) -> bool:
    """Call the Node API to trigger an immediate broadcast"""
    try:
        import urllib.request
+        token = os.environ.get("API_CRAWLER_TOKEN", "").strip()
        req = urllib.request.Request(
            f"{api_base.rstrip('/')}/api/crawler/notify",
            method="POST",
-            headers={"Content-Type": "application/json"},
+            headers={
+                "Content-Type": "application/json",
+                **({"X-Crawler-Token": token} if token else {}),
+            },
        )
        with urllib.request.urlopen(req, timeout=5) as resp:
            return resp.status == 200
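
Review note: the change above attaches `X-Crawler-Token` only when `API_CRAWLER_TOKEN` is non-empty. A minimal sketch of the same request construction, pulled into a helper so it can be inspected without sending anything, is shown below. `build_notify_request` is a hypothetical name; the endpoint, header name, and environment variable come from the diff.

```python
import os
import urllib.request
from typing import Optional


def build_notify_request(
    api_base: str, token: Optional[str] = None
) -> urllib.request.Request:
    """Build the POST request for /api/crawler/notify without sending it."""
    if token is None:
        # Same source as in the diff: empty or unset means "no token header".
        token = os.environ.get("API_CRAWLER_TOKEN", "").strip()
    headers = {"Content-Type": "application/json"}
    if token:
        headers["X-Crawler-Token"] = token
    return urllib.request.Request(
        f"{api_base.rstrip('/')}/api/crawler/notify",
        method="POST",
        headers=headers,
    )
```

`urllib.request.Request` stores header names in capitalized form (for example `X-crawler-token`). HTTP header names are case-insensitive, so the server still sees the header; the stored form only matters when inspecting the request object directly.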
--- a/crawler/realtime_conflict_service.py
+++ b/crawler/realtime_conflict_service.py
@@ -242,7 +242,16 @@ def _write_to_db(events: List[dict]) -> None:

 def _notify_node() -> None:
    try:
-        r = requests.post(f"{API_BASE}/api/crawler/notify", timeout=5, proxies={"http": None, "https": None})
+        headers = {}
+        token = os.environ.get("API_CRAWLER_TOKEN", "").strip()
+        if token:
+            headers["X-Crawler-Token"] = token
+        r = requests.post(
+            f"{API_BASE}/api/crawler/notify",
+            timeout=5,
+            headers=headers,
+            proxies={"http": None, "https": None},
+        )
        if r.status_code != 200:
            print("  [warn] notify API failed")
    except Exception as e:
@@ -340,7 +349,10 @@ def crawler_backfill():
        return {"ok": False, "error": "db not found"}
    try:
        from db_merge import merge
-        if os.environ.get("CLEANER_AI_DISABLED", "0") == "1":
+        use_dashscope = bool(os.environ.get("DASHSCOPE_API_KEY", "").strip())
+        if use_dashscope:
+            from extractor_dashscope import extract_from_news
+        elif os.environ.get("CLEANER_AI_DISABLED", "0") == "1":
            from extractor_rules import extract_from_news
        else:
            from extractor_ai import extract_from_news
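
Review note: the extractor selection in `crawler_backfill` now checks `DASHSCOPE_API_KEY` first, so a set key takes precedence even when `CLEANER_AI_DISABLED=1`. The sketch below restates that ordering as a pure function over an environment mapping. `pick_extractor_module` is a hypothetical name; the module names and variables are those in the diff.

```python
from typing import Mapping


def pick_extractor_module(env: Mapping[str, str]) -> str:
    """Return the name of the extractor module the backfill would import."""
    # 1. A non-blank DashScope key wins, regardless of CLEANER_AI_DISABLED.
    if env.get("DASHSCOPE_API_KEY", "").strip():
        return "extractor_dashscope"
    # 2. Otherwise, rules only when AI cleaning is explicitly disabled.
    if env.get("CLEANER_AI_DISABLED", "0") == "1":
        return "extractor_rules"
    # 3. Default: the existing AI extractor.
    return "extractor_ai"
```

For example, `pick_extractor_module(os.environ)` (after `import os`) reports which extractor a given shell environment would select. Whether DashScope should override `CLEANER_AI_DISABLED=1` is a design question this commit answers implicitly with "yes".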