fix: update
This commit is contained in:
@@ -54,6 +54,86 @@ pip install -r requirements.txt
|
||||
|
||||
**事件脉络不更新时**:多半是未启动 `npm run gdelt`。只跑 `npm run api` 时,事件脉络会显示空或仅有缓存。
|
||||
|
||||
## 如何检查爬虫是否工作正常
|
||||
|
||||
按下面顺序做即可确认整条链路(爬虫 → 数据库 → Node 重载 → API/WebSocket)正常。
|
||||
|
||||
### 1. 一键验证(推荐)
|
||||
|
||||
先启动 API,再执行验证脚本(可选是否顺带启动爬虫):
|
||||
|
||||
```bash
|
||||
# 终端 1:必须
|
||||
npm run api
|
||||
|
||||
# 终端 2:执行验证(不启动爬虫,只检查当前状态)
|
||||
./scripts/verify-pipeline.sh
|
||||
|
||||
# 或:顺带启动爬虫并等首次抓取后再验证
|
||||
./scripts/verify-pipeline.sh --start-crawler
|
||||
```
|
||||
|
||||
脚本会检查:API 健康、态势数据含 `lastUpdated`、爬虫服务是否可达、`news_content`/situation_update、战损字段、`POST /api/crawler/notify` 是否可用。
|
||||
|
||||
### 2. 手动快速检查
|
||||
|
||||
| 步骤 | 命令 / 操作 | 正常表现 |
|
||||
|-----|-------------|----------|
|
||||
| API 是否在跑 | `curl -s http://localhost:3001/api/health` | 返回 `{"ok":true}` |
|
||||
| 态势是否可读 | `curl -s http://localhost:3001/api/situation \| head -c 300` | 含 `lastUpdated`、`usForces`、`recentUpdates` |
|
||||
| RSS 能否抓到 | `npm run crawler:test` | 输出「RSS 抓取: N 条」,N>0 表示有命中 |
|
||||
| 爬虫服务(gdelt) | `curl -s http://localhost:8000/crawler/status` | 返回 JSON,含 `db_path`/`db_exists` 等 |
|
||||
| 库里有无爬虫数据 | `sqlite3 server/data.db "SELECT COUNT(*) FROM situation_update; SELECT COUNT(*) FROM news_content;"` 或访问 `http://localhost:3001/api/db/dashboard` | situation_update、news_content 条数 > 0(跑过流水线后) |
|
||||
| 通知后是否重载 | 爬虫写库后会 POST `/api/crawler/notify`,Node 会 `reloadFromFile` 再广播 | 前端/`/api/situation` 的 `lastUpdated` 和内容会更新 |
|
||||
|
||||
### 3. 跑一轮流水线(不常驻爬虫时)
|
||||
|
||||
不启动 gdelt 时,可单次跑完整流水线(抓取 → 去重 → 写表 → notify):
|
||||
|
||||
```bash
|
||||
npm run api # 保持运行
|
||||
cd crawler && python3 -c "
|
||||
from pipeline import run_full_pipeline
|
||||
from config import DB_PATH, API_BASE
|
||||
n_fetched, n_news, n_panel = run_full_pipeline(db_path=DB_PATH, api_base=API_BASE, notify=True)
|
||||
print('抓取:', n_fetched, '去重新增:', n_news, '面板写入:', n_panel)
|
||||
"
|
||||
```
|
||||
|
||||
有网络且有关键词命中时,应看到非零数字;再查 `curl -s http://localhost:3001/api/situation` 或前端事件脉络是否出现新数据。
|
||||
|
||||
### 4. 仅测提取逻辑(不写库)
|
||||
|
||||
```bash
|
||||
npm run crawler:test:extraction # 规则/db_merge 测试
|
||||
# 或按 README「快速自测命令」用示例文本调 extract_from_news 看 combat_losses_delta / key_location_updates
|
||||
```
|
||||
|
||||
**常见现象**:抓取 0 条 → 网络/RSS 被墙或关键词未命中;situation_update 为空 → 未跑流水线或去重后无新增;前端不刷新 → 未开 `npm run api` 或未开爬虫(gdelt)。
|
||||
|
||||
### 5. 爬虫与面板是否联通
|
||||
|
||||
专门检查「爬虫写库」与「面板展示」是否一致:
|
||||
|
||||
```bash
|
||||
./scripts/check-crawler-panel-connectivity.sh
|
||||
```
|
||||
|
||||
会对比:爬虫侧的 `situation_update` 条数 vs 面板 API 返回的 `recentUpdates` 条数,并说明为何战损/基地等不一定随每条新闻变化。
|
||||
|
||||
## 爬虫与面板数据联动说明
|
||||
|
||||
| 面板展示 | 数据来源(表/接口) | 是否由爬虫更新 | 说明 |
|
||||
|----------|---------------------|----------------|------|
|
||||
| **事件脉络** (recentUpdates) | situation_update → getSituation() | ✅ 是 | 每条去重后的新闻会写入 situation_update,Node 收到 notify 后重载 DB 再广播 |
|
||||
| **地图冲突点** (conflictEvents) | gdelt_events 或 RSS→gdelt 回填 | ✅ 是 | GDELT 或 GDELT 禁用时由 situation_update 同步到 gdelt_events |
|
||||
| **战损/装备毁伤** (combatLosses) | combat_losses | ⚠️ 有条件 | 仅当 AI/规则从新闻中提取到数字(如「2 名美军死亡」)时,merge 才写入增量 |
|
||||
| **基地/地点状态** (keyLocations) | key_location | ⚠️ 有条件 | 仅当提取到 key_location_updates(如某基地遭袭)时更新 |
|
||||
| **力量摘要/指数/资产** (summary, powerIndex, assets) | force_summary, power_index, force_asset | ❌ 否 | 仅 seed 初始化,爬虫不写 |
|
||||
| **华尔街/报复情绪** (wallStreet, retaliation) | wall_street_trend, retaliation_* | ⚠️ 有条件 | 仅当提取器输出对应字段时更新 |
|
||||
|
||||
因此:**新闻很多、但战损/基地数字不动**是正常现象——多数标题不含可解析的伤亡/基地数字,只有事件脉络(recentUpdates)和地图冲突点会随每条新闻增加。若**事件脉络也不更新**,请确认 Node 终端在爬虫每轮抓取后是否出现 `[crawler/notify] DB 已重载`;若无,检查爬虫的 `API_BASE` 是否指向当前 API(默认 `http://localhost:3001`)。
|
||||
|
||||
## 写库流水线(与 server/README 第五节一致)
|
||||
|
||||
RSS 与主入口均走统一流水线 `pipeline.run_full_pipeline`:
|
||||
@@ -80,6 +160,7 @@ RSS → 抓取 → 清洗 → 去重 → 写 news_content / situation_update /
|
||||
|
||||
- `DB_PATH`: SQLite 路径,默认 `../server/data.db`
|
||||
- `API_BASE`: Node API 地址,默认 `http://localhost:3001`
|
||||
- **`DASHSCOPE_API_KEY`**:阿里云通义(DashScope)API Key。**设置后全程使用商业模型,无需本机安装 Ollama**(适合 Mac 版本较低无法跑 Ollama 的情况)。获取: [阿里云百炼 / DashScope](https://dashscope.console.aliyun.com/) → 创建 API-KEY,复制到环境变量或项目根目录 `.env` 中 `DASHSCOPE_API_KEY=sk-xxx`。摘要、分类、战损/基地提取均走通义。
|
||||
- `GDELT_QUERY`: 搜索关键词,默认 `United States Iran military`
|
||||
- `GDELT_MAX_RECORDS`: 最大条数,默认 30
|
||||
- `GDELT_TIMESPAN`: 时间范围,`1h` / `1d` / `1week`,默认 `1d`(近日资讯)
|
||||
|
||||
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
@@ -2,6 +2,7 @@
|
||||
"""
|
||||
AI 清洗新闻数据,严格按面板字段约束输出
|
||||
面板 EventTimelinePanel 所需:summary(≤120字)、category(枚举)、severity(枚举)
|
||||
优先使用 DASHSCOPE_API_KEY(通义,无需 Ollama),否则 Ollama,最后规则兜底
|
||||
"""
|
||||
import os
|
||||
import re
|
||||
@@ -9,6 +10,7 @@ from typing import Optional
|
||||
|
||||
CLEANER_AI_DISABLED = os.environ.get("CLEANER_AI_DISABLED", "0") == "1"
|
||||
OLLAMA_MODEL = os.environ.get("OLLAMA_MODEL", "llama3.1")
|
||||
DASHSCOPE_API_KEY = os.environ.get("DASHSCOPE_API_KEY", "").strip()
|
||||
|
||||
# 面板 schema:必须与 EventTimelinePanel / SituationUpdate 一致
|
||||
SUMMARY_MAX_LEN = 120 # 面板 line-clamp-2 展示
|
||||
@@ -30,6 +32,38 @@ def _rule_clean(text: str, max_len: int = SUMMARY_MAX_LEN) -> str:
|
||||
return _sanitize_summary(text, max_len)
|
||||
|
||||
|
||||
def _call_dashscope_summary(text: str, max_len: int, timeout: int = 8) -> Optional[str]:
|
||||
"""调用阿里云通义(DashScope)提炼摘要,无需 Ollama。需设置 DASHSCOPE_API_KEY"""
|
||||
if not DASHSCOPE_API_KEY or CLEANER_AI_DISABLED or not text or len(str(text).strip()) < 5:
|
||||
return None
|
||||
try:
|
||||
import dashscope
|
||||
from http import HTTPStatus
|
||||
dashscope.api_key = DASHSCOPE_API_KEY
|
||||
prompt = f"""将新闻提炼为1-2句简洁中文事实,直接输出纯文本,不要标号、引号、解释。限{max_len}字内。
|
||||
|
||||
原文:{str(text)[:350]}
|
||||
|
||||
输出:"""
|
||||
r = dashscope.Generation.call(
|
||||
model="qwen-turbo",
|
||||
messages=[{"role": "user", "content": prompt}],
|
||||
result_format="message",
|
||||
max_tokens=150,
|
||||
)
|
||||
if r.status_code != HTTPStatus.OK:
|
||||
return None
|
||||
out = (r.output.get("choices", [{}])[0].get("message", {}).get("content", "") or "").strip()
|
||||
out = re.sub(r"^[\d\.\-\*\s]+", "", out)
|
||||
out = re.sub(r"^['\"\s]+|['\"\s]+$", "", out)
|
||||
out = _sanitize_summary(out, max_len)
|
||||
if out and len(out) > 3:
|
||||
return out
|
||||
return None
|
||||
except Exception:
|
||||
return None
|
||||
|
||||
|
||||
def _call_ollama_summary(text: str, max_len: int, timeout: int = 6) -> Optional[str]:
|
||||
"""调用 Ollama 提炼摘要,输出须为纯文本、≤max_len 字"""
|
||||
if CLEANER_AI_DISABLED or not text or len(str(text).strip()) < 5:
|
||||
@@ -71,7 +105,11 @@ def clean_news_for_panel(text: str, max_len: int = SUMMARY_MAX_LEN) -> str:
|
||||
t = str(text).strip()
|
||||
if not t:
|
||||
return ""
|
||||
res = _call_ollama_summary(t, max_len, timeout=6)
|
||||
# 优先商业模型(通义),再 Ollama,最后规则
|
||||
if DASHSCOPE_API_KEY:
|
||||
res = _call_dashscope_summary(t, max_len, timeout=8)
|
||||
else:
|
||||
res = _call_ollama_summary(t, max_len, timeout=6)
|
||||
if res:
|
||||
return res
|
||||
return _rule_clean(t, max_len)
|
||||
|
||||
@@ -1,7 +1,7 @@
|
||||
# -*- coding: utf-8 -*-
|
||||
"""
|
||||
AI 新闻分类与严重度判定
|
||||
优先使用 Ollama 本地模型(免费),失败则回退到规则
|
||||
优先 DASHSCOPE_API_KEY(通义,无需 Ollama),否则 Ollama,最后规则
|
||||
设置 PARSER_AI_DISABLED=1 可只用规则(更快)
|
||||
"""
|
||||
import os
|
||||
@@ -11,7 +11,8 @@ Category = Literal["deployment", "alert", "intel", "diplomatic", "other"]
|
||||
Severity = Literal["low", "medium", "high", "critical"]
|
||||
|
||||
PARSER_AI_DISABLED = os.environ.get("PARSER_AI_DISABLED", "0") == "1"
|
||||
OLLAMA_MODEL = os.environ.get("OLLAMA_MODEL", "llama3.1") # 或 qwen2.5:7b
|
||||
OLLAMA_MODEL = os.environ.get("OLLAMA_MODEL", "llama3.1")
|
||||
DASHSCOPE_API_KEY = os.environ.get("DASHSCOPE_API_KEY", "").strip()
|
||||
|
||||
_CATEGORIES = ("deployment", "alert", "intel", "diplomatic", "other")
|
||||
_SEVERITIES = ("low", "medium", "high", "critical")
|
||||
@@ -32,8 +33,37 @@ def _parse_ai_response(text: str) -> Tuple[Category, Severity]:
|
||||
return cat, sev # type: ignore
|
||||
|
||||
|
||||
def _call_dashscope(text: str, timeout: int = 6) -> Optional[Tuple[Category, Severity]]:
|
||||
"""调用阿里云通义(DashScope)分类,无需 Ollama。需设置 DASHSCOPE_API_KEY"""
|
||||
if not DASHSCOPE_API_KEY or PARSER_AI_DISABLED:
|
||||
return None
|
||||
try:
|
||||
import dashscope
|
||||
from http import HTTPStatus
|
||||
dashscope.api_key = DASHSCOPE_API_KEY
|
||||
prompt = f"""Classify this news about US-Iran/middle east (one line only):
|
||||
- category: deployment|alert|intel|diplomatic|other
|
||||
- severity: low|medium|high|critical
|
||||
|
||||
News: {text[:300]}
|
||||
|
||||
Reply format: category:severity (e.g. alert:high)"""
|
||||
r = dashscope.Generation.call(
|
||||
model="qwen-turbo",
|
||||
messages=[{"role": "user", "content": prompt}],
|
||||
result_format="message",
|
||||
max_tokens=32,
|
||||
)
|
||||
if r.status_code != HTTPStatus.OK:
|
||||
return None
|
||||
out = r.output.get("choices", [{}])[0].get("message", {}).get("content", "")
|
||||
return _parse_ai_response(out)
|
||||
except Exception:
|
||||
return None
|
||||
|
||||
|
||||
def _call_ollama(text: str, timeout: int = 5) -> Optional[Tuple[Category, Severity]]:
|
||||
"""调用 Ollama 本地模型。需先运行 ollama run llama3.1 或 qwen2.5:7b"""
|
||||
"""调用 Ollama 本地模型。需先运行 ollama run llama3.1"""
|
||||
if PARSER_AI_DISABLED:
|
||||
return None
|
||||
try:
|
||||
@@ -73,9 +103,16 @@ def _rule_severity(text: str, category: Category) -> Severity:
|
||||
return severity(text, category)
|
||||
|
||||
|
||||
def _call_ai(text: str) -> Optional[Tuple[Category, Severity]]:
|
||||
"""优先通义,再 Ollama"""
|
||||
if DASHSCOPE_API_KEY:
|
||||
return _call_dashscope(text)
|
||||
return _call_ollama(text)
|
||||
|
||||
|
||||
def classify(text: str) -> Category:
|
||||
"""分类。AI 失败时回退规则"""
|
||||
res = _call_ollama(text)
|
||||
res = _call_ai(text)
|
||||
if res:
|
||||
return res[0]
|
||||
return _rule_classify(text)
|
||||
@@ -83,7 +120,7 @@ def classify(text: str) -> Category:
|
||||
|
||||
def severity(text: str, category: Category) -> Severity:
|
||||
"""严重度。AI 失败时回退规则"""
|
||||
res = _call_ollama(text)
|
||||
res = _call_ai(text)
|
||||
if res:
|
||||
return res[1]
|
||||
return _rule_severity(text, category)
|
||||
@@ -95,7 +132,7 @@ def classify_and_severity(text: str) -> Tuple[Category, Severity]:
|
||||
from parser import classify, severity
|
||||
c = classify(text)
|
||||
return c, severity(text, c)
|
||||
res = _call_ollama(text)
|
||||
res = _call_ai(text)
|
||||
if res:
|
||||
return res
|
||||
return _rule_classify(text), _rule_severity(text, _rule_classify(text))
|
||||
|
||||
@@ -14,10 +14,14 @@ def _notify_api(api_base: str) -> bool:
|
||||
"""调用 Node API 触发立即广播"""
|
||||
try:
|
||||
import urllib.request
|
||||
token = os.environ.get("API_CRAWLER_TOKEN", "").strip()
|
||||
req = urllib.request.Request(
|
||||
f"{api_base.rstrip('/')}/api/crawler/notify",
|
||||
method="POST",
|
||||
headers={"Content-Type": "application/json"},
|
||||
headers={
|
||||
"Content-Type": "application/json",
|
||||
**({"X-Crawler-Token": token} if token else {}),
|
||||
},
|
||||
)
|
||||
with urllib.request.urlopen(req, timeout=5) as resp:
|
||||
return resp.status == 200
|
||||
|
||||
@@ -242,7 +242,16 @@ def _write_to_db(events: List[dict]) -> None:
|
||||
|
||||
def _notify_node() -> None:
|
||||
try:
|
||||
r = requests.post(f"{API_BASE}/api/crawler/notify", timeout=5, proxies={"http": None, "https": None})
|
||||
headers = {}
|
||||
token = os.environ.get("API_CRAWLER_TOKEN", "").strip()
|
||||
if token:
|
||||
headers["X-Crawler-Token"] = token
|
||||
r = requests.post(
|
||||
f"{API_BASE}/api/crawler/notify",
|
||||
timeout=5,
|
||||
headers=headers,
|
||||
proxies={"http": None, "https": None},
|
||||
)
|
||||
if r.status_code != 200:
|
||||
print(" [warn] notify API 失败")
|
||||
except Exception as e:
|
||||
@@ -340,7 +349,10 @@ def crawler_backfill():
|
||||
return {"ok": False, "error": "db not found"}
|
||||
try:
|
||||
from db_merge import merge
|
||||
if os.environ.get("CLEANER_AI_DISABLED", "0") == "1":
|
||||
use_dashscope = bool(os.environ.get("DASHSCOPE_API_KEY", "").strip())
|
||||
if use_dashscope:
|
||||
from extractor_dashscope import extract_from_news
|
||||
elif os.environ.get("CLEANER_AI_DISABLED", "0") == "1":
|
||||
from extractor_rules import extract_from_news
|
||||
else:
|
||||
from extractor_ai import extract_from_news
|
||||
|
||||
Reference in New Issue
Block a user