fix: optimize crawler config and run it in a standalone Docker container
Dockerfile.crawler
@@ -1,4 +1,4 @@
-# Python crawler service
+# Python 3.11+ crawler service (kept in sync with requirements.txt / pyproject.toml)
+# On servers in mainland China, add --build-arg REGISTRY=docker.m.daocloud.io/library/
+ARG REGISTRY=
+FROM ${REGISTRY}python:3.11-slim
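For reference, a minimal build sketch against this Dockerfile; the image tag matches the one used by `scripts/production-start.sh` below, and the mirror value assumes the trailing slash that `${REGISTRY}python:3.11-slim` requires:

```bash
# Default build (Docker Hub):
docker build -t usa-dashboard-crawler:latest -f Dockerfile.crawler .

# Via a mainland-China mirror; REGISTRY must end with "/" because the
# Dockerfile interpolates ${REGISTRY}python:3.11-slim:
docker build --build-arg REGISTRY=docker.m.daocloud.io/library/ \
  -t usa-dashboard-crawler:latest -f Dockerfile.crawler .
```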
@@ -39,11 +39,16 @@
 ## Dependencies

 - **Python 3.11+** (3.11 or 3.12 recommended)

 ```bash
 pip install -r requirements.txt
 ```

-Added `deep-translator`: GDELT and RSS news items are translated to Chinese automatically before being stored.
+Or use the pyproject: `pip install -e crawler/` (from the project root).
+
+- `deep-translator`: translates GDELT and RSS news items to Chinese before they are stored.
+- `dashscope`: optional; set `DASHSCOPE_API_KEY` to enable Tongyi (Qwen) extraction/cleaning.

 ## Run (all 3 services must be started together)
crawler/pyproject.toml (new file, 20 lines)
@@ -0,0 +1,20 @@
[project]
name = "usa-crawler"
version = "1.0.0"
description = "GDELT + RSS crawlers and realtime conflict service"
readme = "README.md"
requires-python = ">=3.11"
dependencies = [
    "requests>=2.32.0",
    "feedparser>=6.0.10",
    "beautifulsoup4>=4.12.0",
    "pytest>=8.0.0",
    "fastapi>=0.115.0",
    "uvicorn[standard]>=0.32.0",
    "deep-translator>=1.11.0",
    "dashscope>=1.20.0",
]

[tool.pytest.ini_options]
testpaths = ["tests"]
python_files = ["test_*.py"]
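A quick sketch of how this new pyproject is meant to be consumed; the editable-install command comes from the README change above, and running pytest from inside `crawler/` is an assumption based on `testpaths`:

```bash
# From the project root: an editable install pulls in all dependencies listed above
pip install -e crawler/

# testpaths = ["tests"] resolves relative to crawler/, so run pytest from there
(cd crawler && pytest)
```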
crawler/requirements-py36.txt (deleted, 10 lines)
@@ -1,10 +0,0 @@
-# Python 3.6 / legacy pip index compatibility (use this file when the production host's pip cannot install requests 2.31+)
-# Install: pip3 install --user -r crawler/requirements-py36.txt
-requests>=2.24.0,<2.29
-feedparser>=6.0.0
-beautifulsoup4>=4.9.0
-pytest>=6.0.0
-fastapi>=0.68.0,<0.100
-uvicorn>=0.15.0,<0.23
-deep-translator>=1.5.0
-dashscope>=1.14.0
crawler/requirements.txt
@@ -1,8 +1,10 @@
-requests>=2.31.0
-feedparser>=6.0.0
+# Python 3.11+ crawler dependencies (current latest compatible versions)
+# Install: pip install -r crawler/requirements.txt
+requests>=2.32.0
+feedparser>=6.0.10
 beautifulsoup4>=4.12.0
-pytest>=7.0.0
-fastapi>=0.109.0
-uvicorn>=0.27.0
+pytest>=8.0.0
+fastapi>=0.115.0
+uvicorn[standard]>=0.32.0
 deep-translator>=1.11.0
 dashscope>=1.20.0
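If an existing virtualenv was installed from the old pins, a plain re-install brings it up to the new minimums; a minimal sketch, with a version print as a sanity check:

```bash
pip install --upgrade -r crawler/requirements.txt
# Spot-check that the upgraded pins took effect:
python -c "import requests, fastapi; print(requests.__version__, fastapi.__version__)"
```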
docker-compose.yml
@@ -19,6 +19,8 @@ services:
     build:
       context: .
       dockerfile: Dockerfile.crawler
     ports:
       - "8000:8000"
     environment:
+      - DB_PATH=/data/data.db
+      - API_BASE=http://api:3001
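To confirm the two new variables actually reach the container (the service name `crawler` is an assumption; the hunk header does not show it):

```bash
docker compose exec crawler env | grep -E '^(DB_PATH|API_BASE)='
# Expected:
#   DB_PATH=/data/data.db
#   API_BASE=http://api:3001
```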
docs/PRODUCTION.md (new file, 68 lines)
@@ -0,0 +1,68 @@
# Production Deployment and Data Alignment

## 1. Can the project run standalone in Docker?

- **Yes.** The crawler image `Dockerfile.crawler` is self-contained: Python 3.11 plus `crawler/requirements.txt` (including dashscope), with no dependency on the host's Python version.
- **Two common setups**:
  - **Everything via docker-compose**: API and crawler both run in containers and share one named volume, `app-data`, so they are aligned by construction.
  - **Crawler in Docker, API on the host**: the crawler container mounts the **same** `server/data.db` from the host and sets `API_BASE` to point at the host API; it then runs standalone with consistent data.
## 2. Data Alignment (must hold)

| Role | DB path used (example) | Notes |
|------|------------------------|-------|
| Node API | `process.env.DB_PATH` or `server/data.db` | see `server/db.js`, `docker-entrypoint.sh` |
| Crawler (inside Docker) | `DB_PATH=/data/data.db`, with `/data/data.db` bind-mounted from the same host file | see `Dockerfile.crawler`, `crawler/config.py` |

**Principle**: the API and the crawler must read and write the **same SQLite file**. Otherwise you get "the crawler wrote to the DB but the API can't see it", or the reverse.

- **All-container docker-compose**: both sides use the volume `app-data` with the path `/data/data.db`; alignment is automatic.
- **API on host + crawler in Docker**: point the host API's `DB_PATH` at e.g. `$PROJECT/server/data.db`, and start the crawler with `-v $PROJECT/server/data.db:/data/data.db` and `-e DB_PATH=/data/data.db`; they are then aligned (the inode check below verifies this).
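A quick way to verify alignment in the "API on host + crawler in Docker" setup is to compare inodes, since a bind mount makes the container path the very same file; the container name `usa-crawler` is the standalone script's default:

```bash
# The two inode numbers must match, otherwise the mount points at a different file:
stat -c %i "$PROJECT/server/data.db"
docker exec usa-crawler stat -c %i /data/data.db
```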
## 3. Production Scripts and Usage

### 3.1 Crawler alone in Docker (API on the host, e.g. under PM2)

```bash
# First run: build the image and start the crawler container (reads DASHSCOPE_API_KEY from .env)
./scripts/production-start.sh

# Or step by step:
docker build -t usa-dashboard-crawler:latest -f Dockerfile.crawler .
./scripts/run-crawler-docker-standalone.sh
```

Tunable environment variables (export them before running the script, or put them in .env); an override sketch follows the list:

- `PROJECT_ROOT`: project root directory, defaults to the current directory; used to resolve `server/data.db`.
- `DB_FILE`: absolute path of the DB on the host, defaults to `$PROJECT_ROOT/server/data.db`.
- `API_BASE`: address the crawler uses to notify the API, defaults to `http://host.docker.internal:3001` (on Linux the script adds `--add-host=host.docker.internal:host-gateway` automatically).
- `DASHSCOPE_API_KEY`: Alibaba Cloud DashScope key; enables AI cleaning (optional).
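For example, overriding the defaults for a non-standard deploy location (the paths here are illustrative):

```bash
export PROJECT_ROOT=/srv/usa            # illustrative path
export DB_FILE=/srv/usa/server/data.db  # must be the same file the host API uses
export API_BASE=http://host.docker.internal:3001
DASHSCOPE_API_KEY=sk-xxx ./scripts/run-crawler-docker-standalone.sh
```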
### 3.2 Full stack via docker-compose (API + crawler both in containers)

```bash
# Start
docker compose up -d
# Or pass DASHSCOPE_API_KEY
DASHSCOPE_API_KEY=sk-xxx docker compose up -d

# Stop
docker compose down
```

API and crawler then share the volume `app-data`, with the DB at `/data/data.db` on both sides; no extra alignment is needed.
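To double-check inside a running compose stack that both containers really see the same file (the service names `api` and `crawler` are assumptions; adjust them to the actual compose file):

```bash
docker compose exec api     ls -l /data/data.db
docker compose exec crawler ls -l /data/data.db
# Same size and mtime on both sides; both paths live in the shared app-data volume.
```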
### 3.3 DB path used by the host API (PM2)

Make sure the DB the PM2-managed API uses is the same file the crawler mounts, for example (a PM2 sketch follows the list):

- Set it in the ecosystem file or the start command: `DB_PATH=/www/wwwroot/www.airtep.com2/usa/server/data.db`
- Or, when the project root is the deploy directory, leave it unset; it then defaults to `server/data.db` (the relative path resolves against the process cwd).
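As a sketch, the first variant under PM2 might look like this; the entry file `server/index.js` and the process name are assumptions, not taken from the repo:

```bash
cd /www/wwwroot/www.airtep.com2/usa
# Explicit DB_PATH on the start command (entry file server/index.js is hypothetical):
DB_PATH="$PWD/server/data.db" pm2 start server/index.js --name usa-api
```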
## 4. Checklist

- [ ] API and crawler use the **same DB file** (see the table above).
- [ ] The crawler can reach the API: `API_BASE` points at the host in the "crawler alone in Docker" setup (e.g. `http://host.docker.internal:3001`) and at `http://api:3001` under compose.
- [ ] If AI cleaning is needed: set `DASHSCOPE_API_KEY` on the crawler side (compose, or the standalone script's .env/environment).
- [ ] On first deploy, or when no DB exists yet: create and initialize the DB first (e.g. `DB_PATH=server/data.db node server/seed.js`), then start the crawler container (see the command sketch below).
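Putting the last checklist item into commands, the first-deploy order would be (both commands appear elsewhere in this doc):

```bash
# 1) Create and initialize the DB on the host:
DB_PATH=server/data.db node server/seed.js
# 2) Then build the image and start the crawler container:
./scripts/production-start.sh
```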
scripts/production-start.sh (new file, 21 lines)
@@ -0,0 +1,21 @@
#!/usr/bin/env bash
# Production one-shot: build the crawler image + start it in "crawler-only Docker, API on host" mode, then print data-alignment notes.
# Prerequisites: the API is already running on host port 3001 (e.g. under PM2) and server/data.db exists (or run npm run api:seed first).
set -e
cd "$(dirname "$0")/.."
PROJECT_ROOT="${PROJECT_ROOT:-$(pwd)}"
REGISTRY="${REGISTRY:-}"

echo "==> Building crawler image..."
docker build -t usa-dashboard-crawler:latest \
  ${REGISTRY:+--build-arg REGISTRY="$REGISTRY"} \
  -f Dockerfile.crawler .

echo ""
./scripts/run-crawler-docker-standalone.sh

echo ""
echo "==> Data alignment (production)"
echo "    API (host) DB_PATH = $PROJECT_ROOT/server/data.db (or env DB_PATH)"
echo "    Crawler /data/data.db = bind-mounted from that same file"
echo "    Both must point at the same SQLite file, or the frontend/API and the crawler will diverge."
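Since the script forwards `REGISTRY` as a build arg, a mirror build is a one-liner; note the trailing slash required by `FROM ${REGISTRY}python:3.11-slim`:

```bash
REGISTRY=docker.m.daocloud.io/library/ ./scripts/production-start.sh
```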
scripts/run-crawler-docker-standalone.sh (new file, 55 lines)
@@ -0,0 +1,55 @@
#!/usr/bin/env bash
# Production: run only the crawler in Docker while the API runs on the host (e.g. under PM2).
# Ensures the crawler and the API use the same SQLite file (data alignment).
set -e
SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
PROJECT_ROOT="${PROJECT_ROOT:-$(cd "$SCRIPT_DIR/.." && pwd)}"
DB_FILE="${DB_FILE:-$PROJECT_ROOT/server/data.db}"
API_BASE="${API_BASE:-http://host.docker.internal:3001}"
CRAWLER_IMAGE="${CRAWLER_IMAGE:-usa-dashboard-crawler:latest}"
CONTAINER_NAME="${CONTAINER_NAME:-usa-crawler}"

# Optional: load DASHSCOPE_API_KEY etc. from .env
if [ -f "$PROJECT_ROOT/.env" ]; then
  set -a
  # shellcheck source=../.env
  . "$PROJECT_ROOT/.env"
  set +a
fi

# The host DB must exist (API already initialized, or seed it first)
if [ ! -f "$DB_FILE" ]; then
  echo "ERROR: DB file not found: $DB_FILE"
  echo "       Create it first: DB_PATH=$DB_FILE node server/seed.js"
  exit 1
fi

# Docker on Linux has no host.docker.internal by default; add it explicitly
DOCKER_EXTRA=()
if [ "$(uname -s)" = "Linux" ]; then
  DOCKER_EXTRA+=(--add-host=host.docker.internal:host-gateway)
fi

# Remove any existing container with the same name first
docker rm -f "$CONTAINER_NAME" 2>/dev/null || true

echo "==> Starting crawler container (standalone)"
echo "    DB:       $DB_FILE -> /data/data.db"
echo "    API_BASE: $API_BASE"
echo "    Image:    $CRAWLER_IMAGE"
docker run -d \
  --name "$CONTAINER_NAME" \
  --restart unless-stopped \
  -p 8000:8000 \
  -v "$DB_FILE:/data/data.db" \
  -e DB_PATH=/data/data.db \
  -e API_BASE="$API_BASE" \
  -e GDELT_DISABLED=1 \
  -e RSS_INTERVAL_SEC=60 \
  ${DASHSCOPE_API_KEY:+-e DASHSCOPE_API_KEY="$DASHSCOPE_API_KEY"} \
  "${DOCKER_EXTRA[@]}" \
  "$CRAWLER_IMAGE"

echo "    Container: $CONTAINER_NAME"
echo "    Logs:      docker logs -f $CONTAINER_NAME"
echo "    Status:    curl -s http://localhost:8000/crawler/status | jq ."
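Because the script sources `$PROJECT_ROOT/.env` with `set -a`, the DashScope key can simply live there; the `sk-xxx` placeholder mirrors the compose example above:

```bash
echo 'DASHSCOPE_API_KEY=sk-xxx' >> .env
./scripts/run-crawler-docker-standalone.sh   # picks the key up automatically
```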