usa/docs/CRAWLER_PIPELINE.md
2026-03-02 17:20:31 +08:00

# Crawler Data Pipeline
## Data Flow
```
RSS fetch
↓ translate, clean
↓ news_storage.save_and_dedup() → dedup against history
news_content: standalone news table for downstream consumers
↓ fresh (deduplicated) items
situation_update: feeds the dashboard panel
↓ AI extraction (Alibaba Cloud DashScope)
combat_losses / retaliation / key_location / wall_street_trend
↓ notify node
Frontend: WebSocket + polling
```
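The stages above can be sketched as a single pass. This is an illustrative outline only: apart from the `save_and_dedup()` name, every function here is a placeholder, not the project's real API.

```python
import hashlib

# Placeholder: a real crawler would parse the RSS feed at `url`.
def fetch_rss(url):
    return [{"title": "Example headline", "summary": "demo", "url": url}]

# Placeholder: translation + cleanup (here just whitespace trimming).
def translate_and_clean(item):
    item["title"] = item["title"].strip()
    return item

_seen_hashes = set()

# Sketch of news_storage.save_and_dedup(): keep only items whose
# content hash has not been seen before.
def save_and_dedup(items):
    fresh = []
    for item in items:
        key = hashlib.sha256(
            (item["title"] + item["summary"] + item["url"]).encode("utf-8")
        ).hexdigest()
        if key not in _seen_hashes:
            _seen_hashes.add(key)
            fresh.append(item)
    return fresh

def run_once(feed_urls):
    items = [translate_and_clean(i) for u in feed_urls for i in fetch_rss(u)]
    # Downstream (not shown): AI extraction, notify node, frontend push.
    return save_and_dedup(items)
```

Running the same feed twice returns fresh items the first time and an empty list the second, mirroring the "dedup against history" step.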
## Alibaba Cloud DashScope API Key
With the `DASHSCOPE_API_KEY` environment variable set, the crawler uses Alibaba Cloud's Qwen (Tongyi Qianwen) for AI extraction. Without it, the crawler falls back to rule-based extraction (`extractor_rules`) or Ollama (if available).
```bash
# Local
export DASHSCOPE_API_KEY=sk-xxx
# Docker (Compose reads variables from the shell environment)
DASHSCOPE_API_KEY=sk-xxx docker compose up -d
# Or put DASHSCOPE_API_KEY=sk-xxx in .env
```
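The fallback order described above can be expressed as a small selector. This is a sketch: the function name and the rules-before-Ollama priority are assumptions, since the document does not state which fallback wins when both are available.

```python
import os

# Pick the extraction backend: DashScope when the API key is set,
# otherwise Ollama when reachable, otherwise rule-based extraction.
def choose_extractor(ollama_available=False):
    if os.environ.get("DASHSCOPE_API_KEY"):
        return "dashscope"
    if ollama_available:
        return "ollama"
    return "extractor_rules"
```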
## Table Reference
| Table | Purpose |
|----|------|
| `news_content` | Standalone storage for raw news; deduplicated via `content_hash`; consumed downstream |
| `situation_update` | Backs the dashboard's "Recent Updates" panel |
| `combat_losses` | Combat-loss data (AI/rule extraction) |
| `key_location` | Base status |
| `gdelt_events` | Map conflict points |
## Deduplication Logic
Items are keyed by `content_hash = sha256(normalize(title) + normalize(summary) + url)`; content that is identical or highly similar is treated as a duplicate and not stored.
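A minimal sketch of the hash formula above. The exact `normalize()` used by the project is not shown in this document; lowercasing plus whitespace collapsing is an assumption.

```python
import hashlib
import re

# Assumed normalization: trim, lowercase, collapse runs of whitespace.
def normalize(text):
    return re.sub(r"\s+", " ", text.strip().lower())

# content_hash = sha256(normalize(title) + normalize(summary) + url)
def content_hash(title, summary, url):
    raw = normalize(title) + normalize(summary) + url
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()
```

With this normalization, casing and spacing differences hash to the same value, while a different `url` always produces a different hash.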
## Consuming the News
- HTTP: `GET /api/news?limit=50`
- Debugging: inspect `news_content` in the `/db` panel
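A minimal HTTP consumer for the endpoint above. The base URL is an assumption about your deployment; only the `/api/news` path and the `limit` parameter come from this document.

```python
import json
import urllib.request

# Build the query URL for GET /api/news.
def build_news_url(base_url, limit=50):
    return f"{base_url}/api/news?limit={limit}"

# Fetch and decode the JSON payload.
def fetch_news(base_url, limit=50):
    with urllib.request.urlopen(build_news_url(base_url, limit), timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))
```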
## Pipeline Verification
Run the script to check the whole pipeline in one step:
```bash
./scripts/verify-pipeline.sh
```
Environment-variable overrides are supported: `API_URL` and `CRAWLER_URL`.