init
.idea/.gitignore (generated, vendored): new file, 8 lines
@@ -0,0 +1,8 @@
# Default ignored files
/shelf/
/workspace.xml
# Editor-based HTTP Client requests
/httpRequests/
# Datasource local storage ignored files
/dataSources/
/dataSources.local.xml

.idea/inspectionProfiles/profiles_settings.xml (generated): new file, 6 lines
@@ -0,0 +1,6 @@
<component name="InspectionProjectProfileManager">
  <settings>
    <option name="USE_PROJECT_PROFILE" value="false" />
    <version value="1.0" />
  </settings>
</component>

.idea/misc.xml (generated): new file, 7 lines
@@ -0,0 +1,7 @@
<?xml version="1.0" encoding="UTF-8"?>
<project version="4">
  <component name="Black">
    <option name="sdkName" value="/opt/homebrew/Caskroom/miniconda/base" />
  </component>
  <component name="ProjectRootManager" version="2" project-jdk-name="Python 3.13" project-jdk-type="Python SDK" />
</project>

.idea/modules.xml (generated): new file, 8 lines
@@ -0,0 +1,8 @@
<?xml version="1.0" encoding="UTF-8"?>
<project version="4">
  <component name="ProjectModuleManager">
    <modules>
      <module fileurl="file://$PROJECT_DIR$/.idea/xspider.iml" filepath="$PROJECT_DIR$/.idea/xspider.iml" />
    </modules>
  </component>
</project>

.idea/vcs.xml (generated): new file, 6 lines
@@ -0,0 +1,6 @@
<?xml version="1.0" encoding="UTF-8"?>
<project version="4">
  <component name="VcsDirectoryMappings">
    <mapping directory="$PROJECT_DIR$" vcs="Git" />
  </component>
</project>

.idea/xspider.iml (generated): new file, 8 lines
@@ -0,0 +1,8 @@
<?xml version="1.0" encoding="UTF-8"?>
<module type="PYTHON_MODULE" version="4">
  <component name="NewModuleRootManager">
    <content url="file://$MODULE_DIR$" />
    <orderEntry type="jdk" jdkName="Python 3.13" jdkType="Python SDK" />
    <orderEntry type="sourceFolder" forTests="false" />
  </component>
</module>

README.md: new file, 79 lines
@@ -0,0 +1,79 @@
# xspider Template Crawler

An XML-configuration-driven crawler engine: it pulls template URLs from a Redis list, drives a DrissionPage browser through the login and business flows described by each template, and stores the extracted data in MongoDB.

## Dependencies

- Python 3.10+
- [DrissionPage](https://github.com/g1879/DrissionPage)
- `redis`, `requests`, `pymongo`, `lxml`, `cssselect`

Install with pip:

```bash
pip install drissionpage redis requests pymongo lxml cssselect
```

Capabilities such as captcha recognition and the variable service are external; implement them to suit your own business.

## Environment variables

| Variable | Default | Description |
| --- | --- | --- |
| `XSPIDER_REDIS_URL` | `redis://localhost:6379/0` | Redis connection string |
| `XSPIDER_REDIS_LIST_KEY` | `xspider:config` | Redis `list` key holding pending template URLs |
| `XSPIDER_REDIS_BLOCK_TIMEOUT` | `30` | `BLPOP` blocking timeout in seconds |
| `XSPIDER_MONGO_URI` | `mongodb://localhost:27017` | MongoDB connection string |
| `XSPIDER_MONGO_DB` | `xspider` | MongoDB database name |
| `XSPIDER_VARIABLE_SERVICE` | `None` | Variable service endpoint; GET reads a value, POST writes one |

The variable service contract:

- `GET {base}?name=<variable>&...` returns JSON containing a `value` field.
- `POST {base}` accepts JSON `{name, value, ...}`.
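For reference, a minimal compatible service might look like the sketch below (Flask and the in-memory store are illustrative assumptions, not part of this repository):

```python
# Hypothetical variable service behind XSPIDER_VARIABLE_SERVICE; illustration only.
from flask import Flask, jsonify, request

app = Flask(__name__)
_store: dict[str, str] = {}  # in-memory; a real service would persist values

@app.get("/")
def get_variable():
    name = request.args["name"]  # extra query params (site_id, ...) arrive here too
    if name not in _store:
        return jsonify({"error": f"unknown variable {name}"}), 404
    return jsonify({"value": _store[name]})

@app.post("/")
def set_variable():
    payload = request.get_json(force=True)  # {"name": ..., "value": ...} plus extras
    _store[payload["name"]] = str(payload["value"])
    return jsonify({"ok": True})
```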
## Running

```bash
python main.py
```

The program blocks on the Redis list waiting for XML template URLs to be pushed, then downloads each template and executes its flows.
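To trigger a run, push a template URL onto the configured list, for example (the template URL is illustrative):

```python
import redis

client = redis.Redis.from_url("redis://localhost:6379/0")
# The crawler pops with BLPOP from the head, so RPUSH preserves FIFO order.
client.rpush("xspider:config", "https://example.com/templates/example.xml")
```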
## XML templates

A reference template:

```xml
<site id="example" base="https://example.com">
  <config enable_proxy="false" rotate_ua="false" retry="3">
    <header name="Accept-Language" value="zh-CN,zh;q=0.9"/>
  </config>

  <login url="https://example.com/login" selector="div.dashboard" mode="css">
    <action type="wait_dom_show" selector="form#login" mode="css"/>
    <action type="type" selector="//input[@name='username']" text="${account}" mode="xpath"/>
    <action type="type" selector="//input[@name='password']" text="${password}" mode="xpath"/>
    <action type="click" selector="button.submit" mode="css"/>
  </login>

  <flows>
    <flow id="orders" entry="/orders" data_type="sales" unique_keys="custom" columns="order_id">
      <action type="wait_dom_show" selector="table.data-list" mode="css"/>
      <extract record_css="table.data-list tbody tr">
        <field name="order_id" selector="td:nth-child(1)" mode="css"/>
        <field name="customer" selector="td:nth-child(2)" mode="css"/>
      </extract>
      <paginate selector="button.next" mode="css" max_pages="10"/>
    </flow>
  </flows>
</site>
```

Note that actions default to `mode="xpath"`, so CSS selectors must say `mode="css"` explicitly.
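The `${account}` and `${password}` placeholders are `${name}` variables: each name is fetched once from the variable service and then cached by `VariableResolver`. Roughly (the service URL is hypothetical):

```python
from xspider.variables import VariableResolver, VariableService

resolver = VariableResolver(VariableService("http://127.0.0.1:5000/"))
# Issues GET /?name=account&site_id=example and substitutes the returned value.
print(resolver.resolve("${account}", {"site_id": "example"}))
```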
The supported `action` types live in `xspider/actions/builtin.py`; to add your own, subclass `BaseAction` and register it with `ActionRegistry`, as in the sketch below.
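A minimal custom action (the `scroll_bottom` type is hypothetical):

```python
from xspider.actions.base import ActionContext, BaseAction
from xspider.actions.registry import ActionRegistry


class ScrollBottomAction(BaseAction):
    """Hypothetical example: scroll the page to the bottom."""

    type_name = "scroll_bottom"

    def _execute(self, ctx: ActionContext) -> None:
        ctx.session.run_js("window.scrollTo(0, document.body.scrollHeight);")


# After registration, <action type="scroll_bottom"/> becomes usable in templates.
ActionRegistry.register(ScrollBottomAction)
```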
## Important notes

- `CaptchaAction` automatically screenshots the target element (or the full page) and POSTs `{image, type}` to `https://captcha.lfei007s.workers.dev`, with `image` as a `data:image/png;base64,...` URL. The `captcha_config` attribute (a JSON string) can override the `url` and request `timeout`; other keys are currently ignored. See the snippet after this list.
- Download monitoring and complex pagination scenarios need extensions tailored to the target site.
- For maintainability, every action logs its execution and participates in variable resolution. To build on the framework, extend the Action, Extractor, and Runner layers directly.
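A captcha step in a template might look like this (the selectors and variable name are illustrative, and storing the result requires a configured variable service):

```xml
<action type="captcha" selector="img.captcha" mode="css"
        variable="example:captcha_code"
        captcha_config='{"url": "https://captcha.example.com", "timeout": 60}'/>
<action type="type" selector="//input[@name='captcha']"
        text="${example:captcha_code}" mode="xpath"/>
```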
main.py: new file, 11 lines
@@ -0,0 +1,11 @@
from xspider.app import TemplateCrawlerApp, configure_logging


def main() -> None:
    configure_logging()
    app = TemplateCrawlerApp()
    app.run_forever()


if __name__ == "__main__":
    main()

xspider/__init__.py: new file, 7 lines
@@ -0,0 +1,7 @@
"""Template-driven crawling framework."""

__all__ = [
    "TemplateCrawlerApp",
]

from .app import TemplateCrawlerApp

xspider/actions/__init__.py: new file, 3 lines
@@ -0,0 +1,3 @@
from .registry import ActionRegistry

__all__ = ["ActionRegistry"]

xspider/actions/base.py: new file, 38 lines
@@ -0,0 +1,38 @@
from __future__ import annotations

import time
from abc import ABC, abstractmethod
from typing import Any, Dict, Optional

from ..browser import BrowserSession
from ..models import ActionConfig
from ..variables import VariableResolver


class ActionContext:
    def __init__(
        self,
        session: BrowserSession,
        resolver: VariableResolver,
        site_context: Dict[str, str],
    ) -> None:
        self.session = session
        self.resolver = resolver
        self.site_context = site_context


class BaseAction(ABC):
    type_name: str

    def __init__(self, config: ActionConfig) -> None:
        self.config = config

    def execute(self, ctx: ActionContext) -> Optional[Any]:
        result = self._execute(ctx)
        if self.config.after_wait:
            time.sleep(self.config.after_wait / 1000.0)
        return result

    @abstractmethod
    def _execute(self, ctx: ActionContext) -> Optional[Any]:
        raise NotImplementedError

xspider/actions/builtin.py: new file, 308 lines
@@ -0,0 +1,308 @@
from __future__ import annotations

import json
import logging
import time
from typing import Any, Dict, Optional

import requests

from ..models import ActionConfig
from .base import ActionContext, BaseAction

logger = logging.getLogger(__name__)


def _timeout_seconds(action: ActionConfig) -> float:
    return max(action.timeout_ms / 1000.0, 0.1)


class GotoAction(BaseAction):
    type_name = "goto"

    def _execute(self, ctx: ActionContext) -> None:
        url = ctx.resolver.resolve(
            self.config.params.get("url") or self.config.selector,
            ctx.site_context,
        )
        if not url:
            url = ctx.site_context.get("entry_url") or ctx.site_context.get("base_url")
        if not url:
            raise ValueError("goto action requires a URL or entry/base context")
        ctx.session.goto(url)


class ClickAction(BaseAction):
    type_name = "click"

    def _execute(self, ctx: ActionContext) -> None:
        if not self.config.selector:
            raise ValueError("click action requires selector")
        button_raw = self.config.params.get("button")
        button = (
            ctx.resolver.resolve(button_raw, ctx.site_context)
            if button_raw
            else None
        )
        ctx.session.click(
            selector=ctx.resolver.resolve(self.config.selector, ctx.site_context),
            mode=self.config.mode,
            timeout=_timeout_seconds(self.config),
            button=button,
        )
        download_name = self.config.params.get("download_filename")
        if download_name:
            ctx.session.download(
                ctx.resolver.resolve(download_name, ctx.site_context) or download_name
            )


class TypeAction(BaseAction):
    type_name = "type"

    def _execute(self, ctx: ActionContext) -> None:
        if not self.config.selector:
            raise ValueError("type action requires selector")
        text = ctx.resolver.resolve(
            self.config.params.get("text"),
            ctx.site_context,
        )
        if text is None:
            raise ValueError("type action missing 'text'")
        ctx.session.type(
            selector=ctx.resolver.resolve(self.config.selector, ctx.site_context),
            mode=self.config.mode,
            text=text,
            timeout=_timeout_seconds(self.config),
        )


class WaitDomShowAction(BaseAction):
    type_name = "wait_dom_show"

    def _execute(self, ctx: ActionContext) -> Optional[object]:
        if not self.config.selector:
            raise ValueError("wait_dom_show action requires selector")
        return ctx.session.wait_dom_show(
            selector=ctx.resolver.resolve(self.config.selector, ctx.site_context),
            mode=self.config.mode,
            timeout=_timeout_seconds(self.config),
        )


class WaitDomGoneAction(BaseAction):
    type_name = "wait_dom_gone"

    def _execute(self, ctx: ActionContext) -> None:
        if not self.config.selector:
            raise ValueError("wait_dom_gone action requires selector")
        ctx.session.wait_dom_gone(
            selector=ctx.resolver.resolve(self.config.selector, ctx.site_context),
            mode=self.config.mode,
            timeout=_timeout_seconds(self.config),
        )


class WaitDomHideAction(BaseAction):
    type_name = "wait_dom_hide"

    def _execute(self, ctx: ActionContext) -> None:
        if not self.config.selector:
            raise ValueError("wait_dom_hide action requires selector")
        ctx.session.wait_dom_hide(
            selector=ctx.resolver.resolve(self.config.selector, ctx.site_context),
            mode=self.config.mode,
            timeout=_timeout_seconds(self.config),
        )


class WaitTimeAction(BaseAction):
    type_name = "wait_time"

    def _execute(self, ctx: ActionContext) -> None:
        timeout_raw = self.config.params.get("timeout_ms", str(self.config.timeout_ms))
        timeout_resolved = ctx.resolver.resolve(timeout_raw, ctx.site_context)
        try:
            timeout_ms = (
                int(timeout_resolved) if timeout_resolved else self.config.timeout_ms
            )
        except ValueError as exc:
            raise ValueError(f"Invalid timeout_ms value: {timeout_resolved}") from exc
        time.sleep(max(timeout_ms, 0) / 1000.0)


class RunJsAction(BaseAction):
    type_name = "run_js"

    def _execute(self, ctx: ActionContext) -> object:
        script = self.config.params.get("script") or self.config.params.get("text")
        script = ctx.resolver.resolve(script, ctx.site_context)
        if not script:
            raise ValueError("run_js action requires script")
        return ctx.session.run_js(script)


class SetHeaderAction(BaseAction):
    type_name = "set_header"

    def _execute(self, ctx: ActionContext) -> None:
        header_name = self.config.params.get("header_name")
        header_value = self.config.params.get("header_value")
        if not header_name or header_value is None:
            raise ValueError("set_header requires header_name and header_value")
        ctx.session.set_header(
            ctx.resolver.resolve(header_name, ctx.site_context) or header_name,
            ctx.resolver.resolve(header_value, ctx.site_context) or header_value,
        )


class SetAttrAction(BaseAction):
    type_name = "set_attr"

    def _execute(self, ctx: ActionContext) -> None:
        selector = self.config.selector
        attr_name = self.config.params.get("attr_name")
        attr_value = self.config.params.get("attr_value")
        if not selector or not attr_name:
            raise ValueError("set_attr requires selector and attr_name")
        resolved_selector = ctx.resolver.resolve(selector, ctx.site_context) or selector
        ctx.session.set_attr(
            selector=resolved_selector,
            mode=self.config.mode,
            attr=ctx.resolver.resolve(attr_name, ctx.site_context) or attr_name,
            value=ctx.resolver.resolve(attr_value, ctx.site_context)
            if attr_value
            else "",
            timeout=_timeout_seconds(self.config),
        )


class SetVarAction(BaseAction):
    type_name = "set_var"

    def _execute(self, ctx: ActionContext) -> None:
        var_name = self.config.params.get("var_name")
        var_value = self.config.params.get("var_value")
        if not var_name or var_value is None:
            raise ValueError("set_var requires var_name and var_value")
        resolved_name = ctx.resolver.resolve(var_name, ctx.site_context) or var_name
        resolved_value = (
            ctx.resolver.resolve(var_value, ctx.site_context) or var_value
        )
        payload = {
            "scope": ctx.resolver.resolve(
                self.config.params.get("var_scope"), ctx.site_context
            )
            if self.config.params.get("var_scope")
            else None,
            "ttl": ctx.resolver.resolve(
                self.config.params.get("var_ttl"), ctx.site_context
            )
            if self.config.params.get("var_ttl")
            else None,
            "single_use": ctx.resolver.resolve(
                self.config.params.get("var_single_use"), ctx.site_context
            )
            if self.config.params.get("var_single_use")
            else None,
        }
        payload = {k: v for k, v in payload.items() if v is not None}
        ctx.resolver.set(
            resolved_name,
            resolved_value,
            {**ctx.site_context, **payload},
        )


class CaptchaAction(BaseAction):
    type_name = "captcha"

    DEFAULT_ENDPOINT = "https://captcha.lfei007s.workers.dev"
    _session = requests.Session()

    def _execute(self, ctx: ActionContext) -> None:
        config = self._load_config(ctx)
        api_url = (
            ctx.resolver.resolve(
                self.config.params.get("captcha_url"),
                ctx.site_context,
            )
            or config.get("url")
            or self.DEFAULT_ENDPOINT
        )

        captcha_type = ctx.resolver.resolve(
            self.config.params.get("captcha_type"),
            ctx.site_context,
        )

        image_source = self._resolve_image(ctx)

        payload: Dict[str, Any] = {
            "image": image_source,
            "type": captcha_type,
        }
        timeout = config.get("timeout", 30)

        logger.debug("Submitting captcha to %s", api_url)
        response = self._session.post(
            api_url,
            json=payload,
            timeout=timeout,
        )
        response.raise_for_status()
        result_payload = response.json()
        logger.info("Captcha result: %s", result_payload)
        solution = self._extract_solution(result_payload)
        logger.info("Captcha recognized successfully.")

        variable_name = (
            ctx.resolver.resolve(
                self.config.params.get("variable"),
                ctx.site_context,
            )
            or f"{ctx.site_context.get('site_id', 'site')}:captcha_result"
        )

        ctx.resolver.set(variable_name, solution, ctx.site_context)
        ctx.site_context[variable_name] = solution

    def _load_config(self, ctx: ActionContext) -> Dict[str, Any]:
        raw_config = self.config.params.get("captcha_config")
        if not raw_config:
            return {}
        resolved = ctx.resolver.resolve(raw_config, ctx.site_context)
        if not resolved:
            return {}
        try:
            parsed = json.loads(resolved)
            if isinstance(parsed, dict):
                return parsed
        except json.JSONDecodeError as exc:
            raise ValueError("Invalid JSON in captcha_config") from exc
        return {}

    def _resolve_image(self, ctx: ActionContext) -> str:
        direct_image = ctx.resolver.resolve(
            self.config.params.get("image"),
            ctx.site_context,
        )
        if direct_image:
            return direct_image

        selector = self.config.selector
        mode = self.config.mode
        timeout = max(self.config.timeout_ms / 1000.0, 1.0)
        screenshot_b64 = ctx.session.screenshot(selector, mode, timeout)
        if screenshot_b64.startswith("data:"):
            return screenshot_b64
        return f"data:image/png;base64,{screenshot_b64}"

    def _extract_solution(self, payload: Dict[str, Any]) -> str:
        for key in ("result", "text", "value", "code", "data"):
            value = payload.get(key)
            if value:
                return str(value)
        raise ValueError(
            f"Captcha service response missing solution field: {payload}"
        )

xspider/actions/registry.py: new file, 54 lines
@@ -0,0 +1,54 @@
from __future__ import annotations

from typing import Dict, Type

from .base import BaseAction
from .builtin import (
    CaptchaAction,
    ClickAction,
    GotoAction,
    RunJsAction,
    SetAttrAction,
    SetHeaderAction,
    SetVarAction,
    TypeAction,
    WaitDomGoneAction,
    WaitDomHideAction,
    WaitDomShowAction,
    WaitTimeAction,
)


class ActionRegistry:
    _registry: Dict[str, Type[BaseAction]] = {}

    @classmethod
    def register(cls, action_cls: Type[BaseAction]) -> None:
        cls._registry[action_cls.type_name] = action_cls

    @classmethod
    def get(cls, type_name: str) -> Type[BaseAction]:
        if type_name not in cls._registry:
            raise KeyError(f"Unknown action type '{type_name}'")
        return cls._registry[type_name]

    @classmethod
    def register_builtin(cls) -> None:
        for action_cls in (
            GotoAction,
            ClickAction,
            TypeAction,
            WaitDomShowAction,
            WaitDomGoneAction,
            WaitDomHideAction,
            WaitTimeAction,
            RunJsAction,
            SetHeaderAction,
            SetAttrAction,
            SetVarAction,
            CaptchaAction,
        ):
            cls.register(action_cls)


ActionRegistry.register_builtin()

xspider/app.py: new file, 73 lines
@@ -0,0 +1,73 @@
from __future__ import annotations

import logging
import sys
import time
from typing import Optional

import requests

from .redis_queue import RedisConfigQueue
from .runner import FlowRunner
from .settings import Settings
from .storage import MongoRepository
from .variables import VariableService
from .xml_parser import XMLSiteParser

logger = logging.getLogger(__name__)


class TemplateCrawlerApp:
    def __init__(self, settings: Optional[Settings] = None) -> None:
        self.settings = settings or Settings.from_env()
        self.queue = RedisConfigQueue(
            self.settings.redis_url,
            self.settings.redis_list_key,
            timeout=self.settings.redis_block_timeout,
        )
        self.mongo = MongoRepository(
            self.settings.mongo_uri,
            self.settings.mongo_database,
        )
        self.variable_service = VariableService(self.settings.variable_service_url)
        self.parser = XMLSiteParser()
        self.runner = FlowRunner(
            storage=self.mongo,
            variable_service=self.variable_service,
        )
        self.http = requests.Session()

    def run_forever(self) -> None:
        logger.info("Template crawler started. Waiting for XML configurations...")
        while True:
            try:
                self._iterate()
            except KeyboardInterrupt:
                logger.info("Received interrupt; shutting down.")
                break
            except Exception:  # noqa: BLE001
                logger.exception("Unexpected error during iteration.")
                time.sleep(3)

    def _iterate(self) -> None:
        xml_location = self.queue.fetch()
        if not xml_location:
            return
        logger.info("Fetched XML location: %s", xml_location)
        xml_payload = self._load_xml(xml_location)
        site = self.parser.parse(xml_payload)
        self.runner.run_site(site)

    def _load_xml(self, location: str) -> str:
        logger.debug("Downloading XML from %s", location)
        response = self.http.get(location, timeout=30)
        response.raise_for_status()
        return response.text


def configure_logging() -> None:
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s [%(levelname)s] %(name)s: %(message)s",
        stream=sys.stdout,
    )

xspider/browser.py: new file, 230 lines
@@ -0,0 +1,230 @@
from __future__ import annotations

import base64
import logging
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Any, Optional

from .models import SelectorMode, SiteConfig

logger = logging.getLogger(__name__)


class BrowserError(Exception):
    """Wrap driver-specific exceptions."""


@dataclass
class BrowserSession:
    site: SiteConfig
    page: Any

    def goto(self, url: str, wait: Optional[int] = None) -> None:
        logger.debug("Navigating to %s", url)
        try:
            self.page.get(url)
            if wait:
                self.page.wait.load_start(wait / 1000.0)
        except Exception as exc:  # noqa: BLE001
            raise BrowserError(f"Failed to navigate to {url}") from exc

    def find(self, selector: str, mode: SelectorMode) -> Any:
        try:
            # The mode is carried by the locator prefix ("css:"/"xpath:"),
            # so ele() needs no separate mode argument.
            return self.page.ele(self._prefixed_selector(selector, mode))
        except Exception as exc:  # noqa: BLE001
            raise BrowserError(f"Failed to locate element {selector}") from exc

    def click(self, selector: str, mode: SelectorMode, timeout: float, button: Optional[str] = None) -> None:
        ele = self._wait_ele(selector, mode, timeout)
        try:
            if button:
                ele.click(button=button)
            else:
                ele.click()
        except Exception as exc:  # noqa: BLE001
            raise BrowserError(f"Failed to click {selector}") from exc

    def type(self, selector: str, mode: SelectorMode, text: str, timeout: float) -> None:
        ele = self._wait_ele(selector, mode, timeout)
        try:
            ele.clear()
            ele.input(text)
        except Exception as exc:  # noqa: BLE001
            raise BrowserError(f"Failed to type into {selector}") from exc

    def run_js(self, script: str) -> Any:
        try:
            return self.page.run_js(script)
        except Exception as exc:  # noqa: BLE001
            raise BrowserError("Failed to execute script") from exc

    def wait_dom_show(self, selector: str, mode: SelectorMode, timeout: float) -> Any:
        return self._wait_ele(selector, mode, timeout)

    def wait_dom_gone(self, selector: str, mode: SelectorMode, timeout: float) -> None:
        try:
            prefixed_selector = self._prefixed_selector(selector, mode)
            self.page.wait.ele_deleted(prefixed_selector, timeout=timeout)
        except Exception as exc:  # noqa: BLE001
            raise BrowserError(f"Element {selector} did not disappear") from exc

    def wait_dom_hide(self, selector: str, mode: SelectorMode, timeout: float) -> None:
        ele = self._wait_ele(selector, mode, timeout)
        try:
            self.page.wait.ele_hidden(ele, timeout=timeout)
        except Exception as exc:  # noqa: BLE001
            raise BrowserError(f"Element {selector} did not hide") from exc

    def set_header(self, name: str, value: str) -> None:
        try:
            self.page.set_extra_headers({name: value})
        except Exception as exc:  # noqa: BLE001
            raise BrowserError(f"Failed to set header {name}") from exc

    def set_attr(self, selector: str, mode: SelectorMode, attr: str, value: str, timeout: float) -> None:
        ele = self._wait_ele(selector, mode, timeout)
        try:
            # run_js() forwards extra positional args as arguments[n].
            self.page.run_js(
                "arguments[0].setAttribute(arguments[1], arguments[2]);",
                ele,
                attr,
                value,
            )
        except Exception as exc:  # noqa: BLE001
            raise BrowserError(f"Failed to set attribute {attr} on {selector}") from exc

    def download(self, filename: str) -> None:
        # Placeholder for download handling.
        logger.info("Download requested for %s", filename)

    def html(self) -> str:
        try:
            if callable(getattr(self.page, "html", None)):
                return self.page.html()
            return self.page.html
        except Exception as exc:  # noqa: BLE001
            raise BrowserError("Failed to retrieve page HTML") from exc

    def _prefixed_selector(self, selector: str, mode: SelectorMode) -> str:
        # DrissionPage locator strings use "css:..." / "xpath:..." prefixes;
        # "mode=" forms are normalised to the colon form.
        stripped = selector.strip()
        if not stripped:
            return stripped
        expected_prefix = f"{mode.value}:"
        lowered = stripped.lower()
        if lowered.startswith(expected_prefix):
            return stripped
        alt_prefix = f"{mode.value}="
        if lowered.startswith(alt_prefix):
            return f"{expected_prefix}{stripped[len(alt_prefix):]}"
        return f"{expected_prefix}{stripped}"

    def _wait_ele(self, selector: str, mode: SelectorMode, timeout: float) -> Any:
        try:
            prefixed_selector = self._prefixed_selector(selector, mode)
            # ele() waits up to `timeout` and returns a falsy NoneElement on miss.
            ele = self.page.ele(prefixed_selector, timeout=timeout)
        except Exception as exc:  # noqa: BLE001
            raise BrowserError(f"Timeout locating element {selector}") from exc
        if not ele:
            raise BrowserError(f"Timeout locating element {selector}")
        return ele

    def screenshot(
        self,
        selector: Optional[str] = None,
        mode: SelectorMode = SelectorMode.css,
        timeout: float = 5.0,
    ) -> str:
        if selector:
            element = self._wait_ele(selector, mode, timeout)
            return self._screenshot_element(element)
        return self._screenshot_page()

    def _screenshot_element(self, element: Any) -> str:
        # Try a few common DrissionPage/Selenium patterns.
        candidates = [
            ("screenshot", {"base64": True}),
            ("screenshot", {"as_base64": True}),
            ("screenshot_as_base64", {}),
            ("get_screenshot", {"as_base64": True}),
            ("screenshot", {}),
        ]
        for method_name, kwargs in candidates:
            method = getattr(element, method_name, None)
            if not callable(method):
                continue
            try:
                result = method(**kwargs)
                return self._ensure_base64(result)
            except TypeError:
                # Retry without kwargs if not supported.
                try:
                    result = method()
                    return self._ensure_base64(result)
                except Exception:  # noqa: BLE001
                    continue
            except Exception:  # noqa: BLE001
                continue
        raise BrowserError("Failed to capture captcha element screenshot.")

    def _screenshot_page(self) -> str:
        candidates = [
            ("get_screenshot", {"as_base64": True}),
            ("screenshot", {"as_base64": True}),
            ("screenshot", {}),
        ]
        for method_name, kwargs in candidates:
            method = getattr(self.page, method_name, None)
            if not callable(method):
                continue
            try:
                result = method(**kwargs)
                return self._ensure_base64(result)
            except TypeError:
                try:
                    result = method()
                    return self._ensure_base64(result)
                except Exception:  # noqa: BLE001
                    continue
            except Exception:  # noqa: BLE001
                continue
        raise BrowserError("Failed to capture page screenshot.")

    def _ensure_base64(self, content: Any) -> str:
        if isinstance(content, str):
            return content
        if isinstance(content, bytes):
            return base64.b64encode(content).decode("utf-8")
        raise BrowserError("Unsupported screenshot content type.")


@dataclass
class BrowserFactory:
    site: SiteConfig
    options: Any = None
    _page_kwargs: dict = field(default_factory=dict)

    def create(self) -> BrowserSession:
        try:
            from DrissionPage import ChromiumOptions, ChromiumPage
        except ImportError as exc:
            raise RuntimeError(
                "DrissionPage is required for BrowserFactory. Install with `pip install drissionpage`."
            ) from exc

        chromium_options = self.options or ChromiumOptions()
        page = ChromiumPage(addr_or_opts=chromium_options, **self._page_kwargs)

        for header in self.site.settings.headers:
            page.set_extra_headers({header.name: header.value})

        return BrowserSession(site=self.site, page=page)

    @contextmanager
    def session(self) -> BrowserSession:
        browser = self.create()
        try:
            yield browser
        finally:
            try:
                browser.page.close()
            except Exception:  # noqa: BLE001
                logger.debug("Failed to close browser cleanly", exc_info=True)

xspider/extraction.py: new file, 74 lines
@@ -0,0 +1,74 @@
from __future__ import annotations

import datetime as dt
import logging
from typing import Any, Dict, List

from lxml import html

from .models import ExtractConfig, FieldConfig, SelectorMode

logger = logging.getLogger(__name__)


class Extractor:
    def extract(self, page_html: str, config: ExtractConfig) -> List[Dict[str, Any]]:
        doc = html.fromstring(page_html)
        if config.record_mode is SelectorMode.css:
            nodes = doc.cssselect(config.record_selector)
        else:
            nodes = doc.xpath(config.record_selector)

        records: List[Dict[str, Any]] = []
        for node in nodes:
            record: Dict[str, Any] = {}
            for field in config.fields:
                record[field.name] = self._extract_field(node, field)
            records.append(record)
        return records

    def _extract_field(self, node: html.HtmlElement, field: FieldConfig) -> Any:
        raw_value = ""
        try:
            if field.mode is SelectorMode.css:
                matches = node.cssselect(field.selector)
            else:
                matches = node.xpath(field.selector)
            if matches:
                if hasattr(matches[0], "text_content"):
                    raw_value = matches[0].text_content().strip()
                else:
                    raw_value = str(matches[0]).strip()
        except Exception:  # noqa: BLE001
            logger.debug("Failed to extract field %s", field.name, exc_info=True)
            raw_value = ""
        return self._transform_value(raw_value, field.value_type)

    def _transform_value(self, value: str, value_type: str | None) -> Any:
        if not value_type or value == "":
            return value
        if value_type == "string_lower":
            return value.lower()
        if value_type == "string_upper":
            return value.upper()
        if value_type == "int":
            try:
                return int(value)
            except ValueError:
                return value
        if value_type == "float":
            try:
                return float(value)
            except ValueError:
                return value
        if value_type == "date":
            for fmt in ("%Y-%m-%d", "%Y/%m/%d", "%Y-%m-%d %H:%M:%S"):
                try:
                    return dt.datetime.strptime(value, fmt)
                except ValueError:
                    continue
            return value
        return value

xspider/models.py: new file, 92 lines
@@ -0,0 +1,92 @@
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Dict, List, Optional


class SelectorMode(str, Enum):
    css = "css"
    xpath = "xpath"


@dataclass
class HeaderConfig:
    name: str
    value: str


@dataclass
class SiteSettings:
    enable_proxy: bool = False
    rotate_ua: bool = False
    retry: int = 3
    headers: List[HeaderConfig] = field(default_factory=list)


@dataclass
class ActionConfig:
    type: str
    selector: Optional[str] = None
    mode: SelectorMode = SelectorMode.xpath
    timeout_ms: int = 10_000
    after_wait: int = 0
    params: Dict[str, Any] = field(default_factory=dict)


@dataclass
class FieldConfig:
    name: str
    selector: str
    mode: SelectorMode = SelectorMode.css
    value_type: Optional[str] = None


@dataclass
class ExtractConfig:
    record_selector: str
    record_mode: SelectorMode = SelectorMode.css
    fields: List[FieldConfig] = field(default_factory=list)
    download: Optional[Dict[str, Any]] = None


@dataclass
class ExcelExtractConfig:
    file_pattern: str
    directory: Optional[str] = None


@dataclass
class PaginateConfig:
    mode: SelectorMode = SelectorMode.xpath
    selector: Optional[str] = None
    max_pages: Optional[int] = None


class UniqueKeyMode(str, Enum):
    all = "all"
    custom = "custom"
    none = "null"


@dataclass
class FlowConfig:
    flow_id: str
    entry: Optional[str]
    data_type: Optional[str]
    unique_keys: UniqueKeyMode = UniqueKeyMode.all
    unique_columns: List[str] = field(default_factory=list)
    actions: List[ActionConfig] = field(default_factory=list)
    extract: Optional[ExtractConfig] = None
    excel_extract: Optional[ExcelExtractConfig] = None
    paginate: Optional[PaginateConfig] = None
    metadata: Dict[str, Any] = field(default_factory=dict)


@dataclass
class SiteConfig:
    site_id: str
    base: Optional[str]
    settings: SiteSettings = field(default_factory=SiteSettings)
    login: Optional[FlowConfig] = None
    flows: List[FlowConfig] = field(default_factory=list)

xspider/redis_queue.py: new file, 40 lines
@@ -0,0 +1,40 @@
from __future__ import annotations

import logging
from typing import Optional

import redis

logger = logging.getLogger(__name__)


class RedisConfigQueue:
    def __init__(self, redis_url: str, list_key: str, timeout: int = 30) -> None:
        self._client = redis.Redis.from_url(redis_url)
        self._list_key = list_key
        self._timeout = timeout

    def fetch(self) -> Optional[str]:
        # BLPOP pairs with rpush() below so the queue is FIFO, matching the
        # BLPOP behaviour documented in the README.
        item = self._client.blpop(self._list_key, timeout=self._timeout)
        if not item:
            return None
        _, value = item
        raw = value.decode("utf-8").strip()
        if not raw:
            return None
        if raw[0] == raw[-1] and raw[0] in {'"', "'"}:
            raw = raw[1:-1].strip()
        return raw or None

    def push(self, value: str) -> None:
        self._client.rpush(self._list_key, value)


if __name__ == "__main__":
    # Manual smoke test against a local Redis; adjust the URL and key as needed.
    queue = RedisConfigQueue(
        "redis://localhost:6379/0",
        "xspider:config",
        timeout=30,
    )
    print(queue.fetch())

xspider/runner.py: new file, 170 lines
@@ -0,0 +1,170 @@
from __future__ import annotations

import logging
import time
from typing import Any, Dict, Optional
from urllib.parse import urljoin

from .actions.base import ActionContext
from .actions.registry import ActionRegistry
from .browser import BrowserError, BrowserFactory, BrowserSession
from .extraction import Extractor
from .models import FlowConfig, SelectorMode, SiteConfig
from .storage import MongoRepository
from .utils.selectors import is_xpath_selector
from .variables import VariableResolver, VariableService

logger = logging.getLogger(__name__)


class FlowRunner:
    def __init__(
        self,
        storage: MongoRepository,
        variable_service: VariableService,
        extractor: Optional[Extractor] = None,
    ) -> None:
        self.storage = storage
        self.variable_service = variable_service
        self.extractor = extractor or Extractor()

    def run_site(self, site: SiteConfig) -> None:
        factory = BrowserFactory(site)
        resolver = VariableResolver(self.variable_service)

        with factory.session() as session:
            if site.login:
                logger.info("Executing login flow for site %s", site.site_id)
                self._run_flow(site, site.login, session, resolver, is_login=True)
            for flow in site.flows:
                logger.info("Executing flow %s for site %s", flow.flow_id, site.site_id)
                self._run_flow(site, flow, session, resolver, is_login=False)

    def _run_flow(
        self,
        site: SiteConfig,
        flow: FlowConfig,
        session: BrowserSession,
        resolver: VariableResolver,
        is_login: bool,
    ) -> None:
        entry_url = self._resolve_entry(site, flow)
        site_context = self._build_context(site, flow, entry_url)
        action_context = ActionContext(session, resolver, site_context)

        if entry_url:
            resolved_entry = resolver.resolve(entry_url, action_context.site_context) or entry_url
            action_context.site_context["entry_url"] = resolved_entry
            logger.debug("Flow %s navigating to %s", flow.flow_id, resolved_entry)
            session.goto(resolved_entry)

        for action_config in flow.actions:
            action_cls = ActionRegistry.get(action_config.type)
            action = action_cls(action_config)
            logger.debug("Executing action %s", action_config.type)
            action.execute(action_context)

        if is_login:
            selector = flow.metadata.get("selector")
            if selector:
                selector_mode = self._resolve_selector_mode(selector, flow.metadata)
                timeout = int(flow.metadata.get("timeout_ms", 10_000)) / 1000.0
                try:
                    session.wait_dom_show(
                        resolver.resolve(selector, action_context.site_context) or selector,
                        selector_mode,
                        timeout,
                    )
                except BrowserError as exc:
                    raise RuntimeError("Login verification selector not found.") from exc
            return

        if not flow.extract:
            if flow.excel_extract:
                self._handle_excel_extract(site, flow, session)
            else:
                logger.info("Flow %s has no extract step; skipping storage.", flow.flow_id)
            return

        if flow.paginate and flow.paginate.selector:
            self._run_paginated(site, flow, session)
        else:
            records = self.extractor.extract(session.html(), flow.extract)
            self.storage.save_records(site, flow, records)

    def _run_paginated(
        self,
        site: SiteConfig,
        flow: FlowConfig,
        session: BrowserSession,
    ) -> None:
        page = 0
        while True:
            page += 1
            records = self.extractor.extract(session.html(), flow.extract)
            self.storage.save_records(site, flow, records)

            if flow.paginate.max_pages and page >= flow.paginate.max_pages:
                break
            selector = flow.paginate.selector
            if not selector:
                break
            try:
                session.click(selector, flow.paginate.mode, timeout=5)
            except BrowserError:
                logger.info("Pagination stopped at page %s", page)
                break
            time.sleep(1)

    def _resolve_entry(self, site: SiteConfig, flow: FlowConfig) -> Optional[str]:
        if flow.entry:
            if site.base:
                return urljoin(site.base, flow.entry)
            return flow.entry
        return site.base

    def _build_context(
        self,
        site: SiteConfig,
        flow: FlowConfig,
        entry_url: Optional[str],
    ) -> Dict[str, str]:
        context = {
            "site_id": site.site_id,
            "flow_id": flow.flow_id,
        }
        if site.base:
            context["base_url"] = site.base
        if entry_url:
            context["entry_url"] = entry_url
        return context

    def _resolve_selector_mode(self, selector: str, metadata: Dict[str, Any]) -> SelectorMode:
        mode_value = metadata.get("mode")
        if mode_value:
            try:
                mode = SelectorMode(mode_value)
            except ValueError as exc:
                raise ValueError(f"Unsupported selector mode {mode_value}") from exc
            if mode is SelectorMode.css and is_xpath_selector(selector):
                raise ValueError(
                    f"Selector '{selector}' looks like XPath but mode='css' specified."
                )
            return mode
        return SelectorMode.xpath

    def _handle_excel_extract(
        self,
        site: SiteConfig,
        flow: FlowConfig,
        session: BrowserSession,
    ) -> None:
        config = flow.excel_extract
        if not config:
            return
        logger.warning(
            "Excel extraction for %s/%s not yet implemented. Expected file pattern %s.",
            site.site_id,
            flow.flow_id,
            config.file_pattern,
        )

xspider/settings.py: new file, 32 lines
@@ -0,0 +1,32 @@
from __future__ import annotations

import os
from dataclasses import dataclass
from typing import Optional


@dataclass
class Settings:
    redis_url: str
    redis_list_key: str
    mongo_uri: str
    mongo_database: str
    variable_service_url: Optional[str]
    redis_block_timeout: int

    @classmethod
    def from_env(cls) -> "Settings":
        # Defaults mirror the README; real deployments override via environment.
        redis_url = os.getenv("XSPIDER_REDIS_URL", "redis://localhost:6379/0")
        redis_list_key = os.getenv("XSPIDER_REDIS_LIST_KEY", "xspider:config")
        mongo_uri = os.getenv("XSPIDER_MONGO_URI", "mongodb://localhost:27017")
        mongo_database = os.getenv("XSPIDER_MONGO_DB", "xspider")
        variable_service_url = os.getenv("XSPIDER_VARIABLE_SERVICE")
        redis_block_timeout = int(os.getenv("XSPIDER_REDIS_BLOCK_TIMEOUT", "30"))
        return cls(
            redis_url=redis_url,
            redis_list_key=redis_list_key,
            mongo_uri=mongo_uri,
            mongo_database=mongo_database,
            variable_service_url=variable_service_url,
            redis_block_timeout=redis_block_timeout,
        )

xspider/storage.py: new file, 75 lines
@@ -0,0 +1,75 @@
from __future__ import annotations

import hashlib
import json
import logging
from typing import Any, Dict, Iterable, List

from pymongo import MongoClient, UpdateOne

from .models import FlowConfig, SiteConfig, UniqueKeyMode

logger = logging.getLogger(__name__)


class MongoRepository:
    def __init__(self, mongo_uri: str, database: str) -> None:
        self._client = MongoClient(mongo_uri)
        self._db = self._client[database]

    def save_records(
        self,
        site: SiteConfig,
        flow: FlowConfig,
        records: Iterable[Dict[str, Any]],
    ) -> None:
        records_list = list(records)
        if not records_list:
            logger.info("Flow %s yielded no records", flow.flow_id)
            return

        collection_name = f"{site.site_id}_{flow.flow_id}"
        collection = self._db[collection_name]

        if flow.unique_keys is UniqueKeyMode.none:
            collection.insert_many(records_list)
            return

        operations: List[UpdateOne] = []
        for record in records_list:
            unique_key = self._unique_key(flow, record)
            operations.append(
                UpdateOne(
                    {"_unique": unique_key},
                    {
                        "$set": {**record, "_unique": unique_key},
                        "$setOnInsert": {"_site": site.site_id, "_flow": flow.flow_id},
                    },
                    upsert=True,
                )
            )

        if operations:
            result = collection.bulk_write(operations, ordered=False)
            logger.info(
                "Saved records for %s/%s: matched=%s upserted=%s",
                site.site_id,
                flow.flow_id,
                result.matched_count,
                len(result.upserted_ids or []),
            )

    def _unique_key(self, flow: FlowConfig, record: Dict[str, Any]) -> str:
        if flow.unique_keys is UniqueKeyMode.custom and flow.unique_columns:
            payload = {key: record.get(key) for key in flow.unique_columns}
        else:
            payload = record
        serialized = json.dumps(payload, sort_keys=True, default=self._json_default)
        return hashlib.md5(serialized.encode("utf-8")).hexdigest()

    def _json_default(self, obj: Any) -> Any:
        if isinstance(obj, (set, tuple)):
            return list(obj)
        if hasattr(obj, "isoformat"):
            return obj.isoformat()
        return str(obj)

xspider/utils/selectors.py: new file, 29 lines
@@ -0,0 +1,29 @@
from __future__ import annotations

from typing import Optional


def is_xpath_selector(selector: Optional[str]) -> bool:
    """Heuristically determine if the selector string looks like XPath."""
    if not selector:
        return False

    stripped = selector.strip()
    if not stripped:
        return False

    lowered = stripped.lower()
    if lowered.startswith(("xpath=", "xpath:")):
        return True

    if stripped.startswith(("//", ".//", "/")):
        return True

    if stripped.startswith("(") and "//" in stripped:
        return True

    if stripped.startswith("@") or stripped.startswith("text()"):
        return True

    xpath_tokens = ("::", "[@", "]", " and ", " or ", "/@")
    return any(token in stripped for token in xpath_tokens)

xspider/variables.py: new file, 79 lines
@@ -0,0 +1,79 @@
from __future__ import annotations

import logging
import re
from dataclasses import dataclass, field
from typing import Dict, Optional

import requests

logger = logging.getLogger(__name__)

VAR_PATTERN = re.compile(r"\$\{(?P<name>[a-zA-Z0-9_:\-\.]+)\}")


@dataclass
class VariableService:
    base_url: Optional[str] = None
    session: requests.Session = field(default_factory=requests.Session)

    def fetch(self, name: str, context: Optional[Dict[str, str]] = None) -> str:
        if not self.base_url:
            raise RuntimeError(
                f"Variable {name} requested but VARIABLE_SERVICE_URL not configured."
            )
        params = {"name": name}
        if context:
            params.update(context)
        response = self.session.get(self.base_url, params=params, timeout=10)
        response.raise_for_status()
        payload = response.json()
        if "value" not in payload:
            raise KeyError(f"Variable service response missing 'value': {payload}")
        return str(payload["value"])

    def set(self, name: str, value: str, context: Optional[Dict[str, str]] = None) -> None:
        if not self.base_url:
            raise RuntimeError("VARIABLE_SERVICE_URL not configured for set_var action.")
        payload: Dict[str, str] = {"name": name, "value": value}
        if context:
            payload.update(context)
        response = self.session.post(self.base_url, json=payload, timeout=10)
        response.raise_for_status()


class VariableResolver:
    def __init__(
        self,
        service: VariableService,
    ) -> None:
        self._service = service
        self._cache: Dict[str, str] = {}

    def resolve(self, value: Optional[str], context: Optional[Dict[str, str]] = None) -> Optional[str]:
        if not value:
            return value
        matches = list(VAR_PATTERN.finditer(value))
        if not matches:
            return value

        result = value
        for match in matches:
            name = match.group("name")
            replacement = self._cache.get(name)
            if replacement is None:
                try:
                    replacement = self._service.fetch(name, context)
                    self._cache[name] = replacement
                except Exception:  # noqa: BLE001
                    logger.exception("Failed to resolve variable %s", name)
                    raise
            result = result.replace(match.group(0), replacement)
        return result

    def resolve_dict(self, payload: Dict[str, str], context: Optional[Dict[str, str]] = None) -> Dict[str, str]:
        return {key: self.resolve(value, context) for key, value in payload.items()}

    def set(self, name: str, value: str, context: Optional[Dict[str, str]] = None) -> None:
        self._cache[name] = value
        self._service.set(name, value, context)

xspider/xml_parser.py: new file, 284 lines
@@ -0,0 +1,284 @@
from __future__ import annotations

import logging
from typing import Any, Dict, List, Optional
from xml.etree import ElementTree as ET

from .models import (
    ActionConfig,
    ExtractConfig,
    ExcelExtractConfig,
    FieldConfig,
    FlowConfig,
    HeaderConfig,
    PaginateConfig,
    SelectorMode,
    SiteConfig,
    SiteSettings,
    UniqueKeyMode,
)
from .utils.selectors import is_xpath_selector

logger = logging.getLogger(__name__)


def _as_bool(value: Optional[str], default: bool = False) -> bool:
    if value is None:
        return default
    return value.lower() in {"true", "1", "yes", "on"}


def _as_int(value: Optional[str], default: int) -> int:
    if value is None:
        return default
    try:
        return int(value)
    except ValueError as exc:
        raise ValueError(f"Invalid integer value: {value}") from exc


def _selector_mode(value: Optional[str], default: SelectorMode = SelectorMode.css) -> SelectorMode:
    if value is None:
        return default
    try:
        return SelectorMode(value)
    except ValueError as exc:
        raise ValueError(f"Unsupported selector mode: {value}") from exc


class XMLSiteParser:
    def parse(self, xml_payload: str) -> SiteConfig:
        root = ET.fromstring(xml_payload)
        self._strip_namespace(root)
        if root.tag != "site":
            raise ValueError("Root element must be <site>")

        site_id = root.attrib.get("id")
        base = root.attrib.get("base")

        if not site_id:
            raise ValueError("<site> missing required attribute 'id'")

        settings = self._parse_settings(root.find("config"))
        login = self._parse_flow(root.find("login"), allow_missing_extract=True)

        flows_node = root.find("flows")
        flows: List[FlowConfig] = []
        if flows_node is not None:
            for flow_node in flows_node.findall("flow"):
                flows.append(self._parse_flow(flow_node))

        if not flows:
            logger.warning("Site %s has no flows defined", site_id)

        return SiteConfig(
            site_id=site_id,
            base=base,
            settings=settings,
            login=login,
            flows=flows,
        )

    def _strip_namespace(self, element: ET.Element) -> None:
        """Remove XML namespaces in-place for easier processing."""
        for el in element.iter():
            if "}" in el.tag:
                el.tag = el.tag.split("}", 1)[1]

    def _parse_settings(self, node: Optional[ET.Element]) -> SiteSettings:
        if node is None:
            return SiteSettings()

        attrs = node.attrib
        enable_proxy = _as_bool(attrs.get("enable_proxy"), False)
        rotate_ua = _as_bool(attrs.get("rotate_ua"), False)
        retry = _as_int(attrs.get("retry"), 3)

        headers: List[HeaderConfig] = []
        for header in node.findall("header"):
            name = header.attrib.get("name")
            value = header.attrib.get("value", "")
            if not name:
                raise ValueError("<header> missing required attribute 'name'")
            headers.append(HeaderConfig(name=name, value=value))

        return SiteSettings(
            enable_proxy=enable_proxy,
            rotate_ua=rotate_ua,
            retry=retry,
            headers=headers,
        )

    def _parse_flow(
        self,
        node: Optional[ET.Element],
        allow_missing_extract: bool = False,
    ) -> Optional[FlowConfig]:
        if node is None:
            return None

        attrs = node.attrib
        flow_id = attrs.get("id") or node.tag
        entry = attrs.get("entry") or attrs.get("url")
        data_type = attrs.get("data_type")

        unique_keys_attr = attrs.get("unique_keys", UniqueKeyMode.all.value)
        try:
            unique_keys = UniqueKeyMode(unique_keys_attr)
        except ValueError as exc:
            raise ValueError(f"Invalid unique_keys value: {unique_keys_attr}") from exc

        columns_attr = attrs.get("columns", "")
        unique_columns = [col.strip() for col in columns_attr.split(",") if col.strip()]

        actions = [self._parse_action(action) for action in node.findall("action")]

        extract_node = node.find("extract")
        extract = self._parse_extract(extract_node) if extract_node is not None else None

        excel_node = node.find("excel_extract")
        excel_extract = (
            self._parse_excel_extract(excel_node) if excel_node is not None else None
        )

        if (
            not allow_missing_extract
            and extract is None
            and excel_extract is None
        ):
            raise ValueError(f"<flow id='{flow_id}'> requires an extract section.")

        paginate_node = node.find("paginate")
        paginate = (
            self._parse_paginate(paginate_node) if paginate_node is not None else None
        )

        metadata = {
            key: value
            for key, value in attrs.items()
            if key
            not in {
                "id",
                "entry",
                "url",
                "data_type",
                "unique_keys",
                "columns",
            }
        }

        return FlowConfig(
            flow_id=flow_id,
            entry=entry,
            data_type=data_type,
            unique_keys=unique_keys,
            unique_columns=unique_columns,
            actions=actions,
            extract=extract,
            excel_extract=excel_extract,
            paginate=paginate,
            metadata=metadata,
        )

    def _parse_action(self, node: ET.Element) -> ActionConfig:
        attrs = node.attrib
        action_type = attrs.get("type")
        if not action_type:
            raise ValueError("<action> missing required attribute 'type'")

        mode_attr = attrs.get("mode")
        mode = _selector_mode(mode_attr, default=SelectorMode.xpath)
        selector = attrs.get("selector")
        if selector and mode is SelectorMode.css and is_xpath_selector(selector):
            raise ValueError(
                f"Selector '{selector}' looks like XPath but mode='css' specified."
            )
        timeout_ms = _as_int(attrs.get("timeout_ms"), 10_000)
        after_wait = _as_int(attrs.get("after_wait"), 0)

        params = {
            key: value
            for key, value in attrs.items()
            if key
            not in {"type", "selector", "mode", "timeout_ms", "after_wait"}
        }

        # Support inline script text for run_js, etc.
        if node.text and node.text.strip():
            params.setdefault("text", node.text.strip())

        return ActionConfig(
            type=action_type,
            selector=selector,
            mode=mode,
            timeout_ms=timeout_ms,
            after_wait=after_wait,
            params=params,
        )

    def _parse_extract(self, node: ET.Element) -> ExtractConfig:
        attrs = node.attrib
        record_selector = attrs.get("record_css") or attrs.get("record_xpath")
        if not record_selector:
            raise ValueError("<extract> requires record_css or record_xpath")
        record_mode = (
            SelectorMode.css if "record_css" in attrs else SelectorMode.xpath
        )

        fields = [self._parse_field(field) for field in node.findall("field")]

        download = None
        download_node = node.find("download")
        if download_node is not None:
            download = download_node.attrib.copy()

        return ExtractConfig(
            record_selector=record_selector,
            record_mode=record_mode,
            fields=fields,
            download=download,
        )

    def _parse_field(self, node: ET.Element) -> FieldConfig:
        attrs = node.attrib
        name = attrs.get("name")
        selector = attrs.get("selector")
        if not name or not selector:
            raise ValueError("<field> requires 'name' and 'selector'")

        mode_attr = attrs.get("mode")
        mode = _selector_mode(mode_attr)
        value_type = attrs.get("value_type")
        if mode is SelectorMode.css and is_xpath_selector(selector):
            raise ValueError(
                f"Field selector '{selector}' looks like XPath but mode='css' specified."
            )

        return FieldConfig(
            name=name,
            selector=selector,
            mode=mode,
            value_type=value_type,
        )

    def _parse_paginate(self, node: ET.Element) -> PaginateConfig:
        attrs = node.attrib
        selector = attrs.get("selector") or attrs.get("css")
        mode_attr = attrs.get("mode")
        mode = _selector_mode(mode_attr, default=SelectorMode.xpath)
        if selector and mode is SelectorMode.css and is_xpath_selector(selector):
            raise ValueError(
                f"Paginate selector '{selector}' looks like XPath but mode='css' specified."
            )
        max_pages = None
        if "max_pages" in attrs:
            max_pages = _as_int(attrs.get("max_pages"), 0)
        return PaginateConfig(mode=mode, selector=selector, max_pages=max_pages)

    def _parse_excel_extract(self, node: ET.Element) -> ExcelExtractConfig:
        attrs = node.attrib
        file_pattern = attrs.get("file_pattern") or attrs.get("pattern")
        if not file_pattern:
            raise ValueError("<excel_extract> requires file_pattern attribute")
        directory = attrs.get("directory")
        return ExcelExtractConfig(file_pattern=file_pattern, directory=directory)