fix: update

This commit is contained in:
Daniel
2026-03-03 17:27:55 +08:00
parent 29c921f498
commit 1764a44eb3
22 changed files with 818 additions and 30 deletions

View File

@@ -54,6 +54,86 @@ pip install -r requirements.txt
**事件脉络不更新时**:多半是未启动 `npm run gdelt`。只跑 `npm run api` 时,事件脉络会显示空或仅有缓存。
## 如何检查爬虫是否工作正常
按下面顺序做即可确认整条链路(爬虫 → 数据库 → Node 重载 → API/WebSocket正常。
### 1. 一键验证(推荐)
先启动 API再执行验证脚本可选是否顺带启动爬虫
```bash
# 终端 1必须
npm run api
# 终端 2执行验证不启动爬虫只检查当前状态
./scripts/verify-pipeline.sh
# 或:顺带启动爬虫并等首次抓取后再验证
./scripts/verify-pipeline.sh --start-crawler
```
脚本会检查API 健康、态势数据含 `lastUpdated`、爬虫服务是否可达、`news_content`/situation_update、战损字段、`POST /api/crawler/notify` 是否可用。
### 2. 手动快速检查
| 步骤 | 命令 / 操作 | 正常表现 |
|-----|-------------|----------|
| API 是否在跑 | `curl -s http://localhost:3001/api/health` | 返回 `{"ok":true}` |
| 态势是否可读 | `curl -s http://localhost:3001/api/situation \| head -c 300` | 含 `lastUpdated``usForces``recentUpdates` |
| RSS 能否抓到 | `npm run crawler:test` | 输出「RSS 抓取: N 条」N>0 表示有命中 |
| 爬虫服务gdelt | `curl -s http://localhost:8000/crawler/status` | 返回 JSON`db_path`/`db_exists` 等 |
| 库里有无爬虫数据 | `sqlite3 server/data.db "SELECT COUNT(*) FROM situation_update; SELECT COUNT(*) FROM news_content;"` 或访问 `http://localhost:3001/api/db/dashboard` | situation_update、news_content 条数 > 0跑过流水线后 |
| 通知后是否重载 | 爬虫写库后会 POST `/api/crawler/notify`Node 会 `reloadFromFile` 再广播 | 前端/`/api/situation``lastUpdated` 和内容会更新 |
### 3. 跑一轮流水线(不常驻爬虫时)
不启动 gdelt 时,可单次跑完整流水线(抓取 → 去重 → 写表 → notify
```bash
npm run api # 保持运行
cd crawler && python3 -c "
from pipeline import run_full_pipeline
from config import DB_PATH, API_BASE
n_fetched, n_news, n_panel = run_full_pipeline(db_path=DB_PATH, api_base=API_BASE, notify=True)
print('抓取:', n_fetched, '去重新增:', n_news, '面板写入:', n_panel)
"
```
有网络且有关键词命中时,应看到非零数字;再查 `curl -s http://localhost:3001/api/situation` 或前端事件脉络是否出现新数据。
### 4. 仅测提取逻辑(不写库)
```bash
npm run crawler:test:extraction # 规则/db_merge 测试
# 或按 README「快速自测命令」用示例文本调 extract_from_news 看 combat_losses_delta / key_location_updates
```
**常见现象**:抓取 0 条 → 网络/RSS 被墙或关键词未命中situation_update 为空 → 未跑流水线或去重后无新增;前端不刷新 → 未开 `npm run api` 或未开爬虫gdelt
### 5. 爬虫与面板是否联通
专门检查「爬虫写库」与「面板展示」是否一致:
```bash
./scripts/check-crawler-panel-connectivity.sh
```
会对比:爬虫侧的 `situation_update` 条数 vs 面板 API 返回的 `recentUpdates` 条数,并说明为何战损/基地等不一定随每条新闻变化。
## 爬虫与面板数据联动说明
| 面板展示 | 数据来源(表/接口) | 是否由爬虫更新 | 说明 |
|----------|---------------------|----------------|------|
| **事件脉络** (recentUpdates) | situation_update → getSituation() | ✅ 是 | 每条去重后的新闻会写入 situation_updateNode 收到 notify 后重载 DB 再广播 |
| **地图冲突点** (conflictEvents) | gdelt_events 或 RSS→gdelt 回填 | ✅ 是 | GDELT 或 GDELT 禁用时由 situation_update 同步到 gdelt_events |
| **战损/装备毁伤** (combatLosses) | combat_losses | ⚠️ 有条件 | 仅当 AI/规则从新闻中提取到数字如「2 名美军死亡」merge 才写入增量 |
| **基地/地点状态** (keyLocations) | key_location | ⚠️ 有条件 | 仅当提取到 key_location_updates如某基地遭袭时更新 |
| **力量摘要/指数/资产** (summary, powerIndex, assets) | force_summary, power_index, force_asset | ❌ 否 | 仅 seed 初始化,爬虫不写 |
| **华尔街/报复情绪** (wallStreet, retaliation) | wall_street_trend, retaliation_* | ⚠️ 有条件 | 仅当提取器输出对应字段时更新 |
因此:**新闻很多、但战损/基地数字不动**是正常现象——多数标题不含可解析的伤亡/基地数字只有事件脉络recentUpdates和地图冲突点会随每条新闻增加。若**事件脉络也不更新**,请确认 Node 终端在爬虫每轮抓取后是否出现 `[crawler/notify] DB 已重载`;若无,检查爬虫的 `API_BASE` 是否指向当前 API默认 `http://localhost:3001`)。
## 写库流水线(与 server/README 第五节一致)
RSS 与主入口均走统一流水线 `pipeline.run_full_pipeline`
@@ -80,6 +160,7 @@ RSS → 抓取 → 清洗 → 去重 → 写 news_content / situation_update /
- `DB_PATH`: SQLite 路径,默认 `../server/data.db`
- `API_BASE`: Node API 地址,默认 `http://localhost:3001`
- **`DASHSCOPE_API_KEY`**阿里云通义DashScopeAPI Key。**设置后全程使用商业模型,无需本机安装 Ollama**(适合 Mac 版本较低无法跑 Ollama 的情况)。获取: [阿里云百炼 / DashScope](https://dashscope.console.aliyun.com/) → 创建 API-KEY复制到环境变量或项目根目录 `.env``DASHSCOPE_API_KEY=sk-xxx`。摘要、分类、战损/基地提取均走通义。
- `GDELT_QUERY`: 搜索关键词,默认 `United States Iran military`
- `GDELT_MAX_RECORDS`: 最大条数,默认 30
- `GDELT_TIMESPAN`: 时间范围,`1h` / `1d` / `1week`,默认 `1d`(近日资讯)

View File

@@ -2,6 +2,7 @@
"""
AI 清洗新闻数据,严格按面板字段约束输出
面板 EventTimelinePanel 所需summary(≤120字)、category(枚举)、severity(枚举)
优先使用 DASHSCOPE_API_KEY通义无需 Ollama否则 Ollama最后规则兜底
"""
import os
import re
@@ -9,6 +10,7 @@ from typing import Optional
CLEANER_AI_DISABLED = os.environ.get("CLEANER_AI_DISABLED", "0") == "1"
OLLAMA_MODEL = os.environ.get("OLLAMA_MODEL", "llama3.1")
DASHSCOPE_API_KEY = os.environ.get("DASHSCOPE_API_KEY", "").strip()
# 面板 schema必须与 EventTimelinePanel / SituationUpdate 一致
SUMMARY_MAX_LEN = 120 # 面板 line-clamp-2 展示
@@ -30,6 +32,38 @@ def _rule_clean(text: str, max_len: int = SUMMARY_MAX_LEN) -> str:
return _sanitize_summary(text, max_len)
def _call_dashscope_summary(text: str, max_len: int, timeout: int = 8) -> Optional[str]:
"""调用阿里云通义DashScope提炼摘要无需 Ollama。需设置 DASHSCOPE_API_KEY"""
if not DASHSCOPE_API_KEY or CLEANER_AI_DISABLED or not text or len(str(text).strip()) < 5:
return None
try:
import dashscope
from http import HTTPStatus
dashscope.api_key = DASHSCOPE_API_KEY
prompt = f"""将新闻提炼为1-2句简洁中文事实直接输出纯文本不要标号、引号、解释。限{max_len}字内。
原文:{str(text)[:350]}
输出:"""
r = dashscope.Generation.call(
model="qwen-turbo",
messages=[{"role": "user", "content": prompt}],
result_format="message",
max_tokens=150,
)
if r.status_code != HTTPStatus.OK:
return None
out = (r.output.get("choices", [{}])[0].get("message", {}).get("content", "") or "").strip()
out = re.sub(r"^[\d\.\-\*\s]+", "", out)
out = re.sub(r"^['\"\s]+|['\"\s]+$", "", out)
out = _sanitize_summary(out, max_len)
if out and len(out) > 3:
return out
return None
except Exception:
return None
def _call_ollama_summary(text: str, max_len: int, timeout: int = 6) -> Optional[str]:
"""调用 Ollama 提炼摘要输出须为纯文本、≤max_len 字"""
if CLEANER_AI_DISABLED or not text or len(str(text).strip()) < 5:
@@ -71,7 +105,11 @@ def clean_news_for_panel(text: str, max_len: int = SUMMARY_MAX_LEN) -> str:
t = str(text).strip()
if not t:
return ""
res = _call_ollama_summary(t, max_len, timeout=6)
# 优先商业模型(通义),再 Ollama最后规则
if DASHSCOPE_API_KEY:
res = _call_dashscope_summary(t, max_len, timeout=8)
else:
res = _call_ollama_summary(t, max_len, timeout=6)
if res:
return res
return _rule_clean(t, max_len)

View File

@@ -1,7 +1,7 @@
# -*- coding: utf-8 -*-
"""
AI 新闻分类与严重度判定
优先使用 Ollama 本地模型(免费),失败则回退到规则
优先 DASHSCOPE_API_KEY通义无需 Ollama否则 Ollama最后规则
设置 PARSER_AI_DISABLED=1 可只用规则(更快)
"""
import os
@@ -11,7 +11,8 @@ Category = Literal["deployment", "alert", "intel", "diplomatic", "other"]
Severity = Literal["low", "medium", "high", "critical"]
PARSER_AI_DISABLED = os.environ.get("PARSER_AI_DISABLED", "0") == "1"
OLLAMA_MODEL = os.environ.get("OLLAMA_MODEL", "llama3.1") # 或 qwen2.5:7b
OLLAMA_MODEL = os.environ.get("OLLAMA_MODEL", "llama3.1")
DASHSCOPE_API_KEY = os.environ.get("DASHSCOPE_API_KEY", "").strip()
_CATEGORIES = ("deployment", "alert", "intel", "diplomatic", "other")
_SEVERITIES = ("low", "medium", "high", "critical")
@@ -32,8 +33,37 @@ def _parse_ai_response(text: str) -> Tuple[Category, Severity]:
return cat, sev # type: ignore
def _call_dashscope(text: str, timeout: int = 6) -> Optional[Tuple[Category, Severity]]:
"""调用阿里云通义DashScope分类无需 Ollama。需设置 DASHSCOPE_API_KEY"""
if not DASHSCOPE_API_KEY or PARSER_AI_DISABLED:
return None
try:
import dashscope
from http import HTTPStatus
dashscope.api_key = DASHSCOPE_API_KEY
prompt = f"""Classify this news about US-Iran/middle east (one line only):
- category: deployment|alert|intel|diplomatic|other
- severity: low|medium|high|critical
News: {text[:300]}
Reply format: category:severity (e.g. alert:high)"""
r = dashscope.Generation.call(
model="qwen-turbo",
messages=[{"role": "user", "content": prompt}],
result_format="message",
max_tokens=32,
)
if r.status_code != HTTPStatus.OK:
return None
out = r.output.get("choices", [{}])[0].get("message", {}).get("content", "")
return _parse_ai_response(out)
except Exception:
return None
def _call_ollama(text: str, timeout: int = 5) -> Optional[Tuple[Category, Severity]]:
"""调用 Ollama 本地模型。需先运行 ollama run llama3.1 或 qwen2.5:7b"""
"""调用 Ollama 本地模型。需先运行 ollama run llama3.1"""
if PARSER_AI_DISABLED:
return None
try:
@@ -73,9 +103,16 @@ def _rule_severity(text: str, category: Category) -> Severity:
return severity(text, category)
def _call_ai(text: str) -> Optional[Tuple[Category, Severity]]:
"""优先通义,再 Ollama"""
if DASHSCOPE_API_KEY:
return _call_dashscope(text)
return _call_ollama(text)
def classify(text: str) -> Category:
"""分类。AI 失败时回退规则"""
res = _call_ollama(text)
res = _call_ai(text)
if res:
return res[0]
return _rule_classify(text)
@@ -83,7 +120,7 @@ def classify(text: str) -> Category:
def severity(text: str, category: Category) -> Severity:
"""严重度。AI 失败时回退规则"""
res = _call_ollama(text)
res = _call_ai(text)
if res:
return res[1]
return _rule_severity(text, category)
@@ -95,7 +132,7 @@ def classify_and_severity(text: str) -> Tuple[Category, Severity]:
from parser import classify, severity
c = classify(text)
return c, severity(text, c)
res = _call_ollama(text)
res = _call_ai(text)
if res:
return res
return _rule_classify(text), _rule_severity(text, _rule_classify(text))

View File

@@ -14,10 +14,14 @@ def _notify_api(api_base: str) -> bool:
"""调用 Node API 触发立即广播"""
try:
import urllib.request
token = os.environ.get("API_CRAWLER_TOKEN", "").strip()
req = urllib.request.Request(
f"{api_base.rstrip('/')}/api/crawler/notify",
method="POST",
headers={"Content-Type": "application/json"},
headers={
"Content-Type": "application/json",
**({"X-Crawler-Token": token} if token else {}),
},
)
with urllib.request.urlopen(req, timeout=5) as resp:
return resp.status == 200

View File

@@ -242,7 +242,16 @@ def _write_to_db(events: List[dict]) -> None:
def _notify_node() -> None:
try:
r = requests.post(f"{API_BASE}/api/crawler/notify", timeout=5, proxies={"http": None, "https": None})
headers = {}
token = os.environ.get("API_CRAWLER_TOKEN", "").strip()
if token:
headers["X-Crawler-Token"] = token
r = requests.post(
f"{API_BASE}/api/crawler/notify",
timeout=5,
headers=headers,
proxies={"http": None, "https": None},
)
if r.status_code != 200:
print(" [warn] notify API 失败")
except Exception as e:
@@ -340,7 +349,10 @@ def crawler_backfill():
return {"ok": False, "error": "db not found"}
try:
from db_merge import merge
if os.environ.get("CLEANER_AI_DISABLED", "0") == "1":
use_dashscope = bool(os.environ.get("DASHSCOPE_API_KEY", "").strip())
if use_dashscope:
from extractor_dashscope import extract_from_news
elif os.environ.get("CLEANER_AI_DISABLED", "0") == "1":
from extractor_rules import extract_from_news
else:
from extractor_ai import extract_from_news