# pdf2word

**Repository Path**: git.fccfc.com/pdf2word

## Basic Information

- **Project Name**: pdf2word
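- **Description**: A command-line tool that converts PDFs and images to Word. It supports multiple conversion engines and preserves document structure such as tables, images, and headings.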
- **Primary Language**: Unknown
- **License**: Apache-2.0
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2026-05-12
- **Last Updated**: 2026-06-16

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# pdf2word

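A command-line tool that converts PDFs and images to Word. It supports multiple conversion engines and preserves document structure such as tables, images, and headings.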

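## Features

- **Multiple input formats**: PDF and common image formats (JPG, PNG, BMP, WebP, TIFF)
- **Multiple engines**: the conversion engine is selected automatically or specified manually
- **Table preservation**: tables in the document are detected and rebuilt as Word tables
- **Smart image handling**: pure images are kept as-is; images that contain text are run through OCR and converted to editable text
- **Heading detection**: heading levels are detected automatically and mapped to Word heading styles
- **OCR**: text in scanned documents and images is recognized and extracted automatically
- **Mixed documents**: within a single PDF, text pages and scanned pages are each dispatched to the most suitable engine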

## Installation

```bash
pip install -r requirements.txt
```

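Main dependencies:

| Library | Purpose |
|---|---|
| PyMuPDF | Reading PDFs and images, page rendering, region cropping |
| pdf2docx | Direct conversion of text PDFs (layout preserved) |
| paddleocr | Local OCR, PPStructure layout analysis, and table recognition |
| glmocr | GLM-OCR cloud layout parsing (API key required) |
| python-docx | Word document generation |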

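## Quick Start

### PDF to Word

```bash
# Auto mode (default): picks the most suitable engine from each page's characteristics
pdf2word input.pdf

# Specify the output path
pdf2word input.pdf -o output.docx

# Convert only the given pages
pdf2word input.pdf --pages 1-5,8,10-12
```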
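
The `--pages` value combines single pages and inclusive ranges separated by commas. As a rough illustration of how such a spec can be expanded, here is a minimal sketch. The function name `parse_page_range` is hypothetical, and the choices to treat pages as 1-based, to de-duplicate, and to reject reversed ranges are assumptions; the tool's actual parser is not shown in this README and may behave differently.

```python
def parse_page_range(spec: str) -> list[int]:
    """Expand a spec such as "1-5,8,10-12" into sorted, unique page numbers.

    Hypothetical helper: assumes 1-based pages and inclusive ranges.
    """
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas such as "1,,3"
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        if start < 1 or end < start:
            raise ValueError(f"invalid page range: {part!r}")
        pages.update(range(start, end + 1))
    return sorted(pages)


print(parse_page_range("1-5,8,10-12"))  # [1, 2, 3, 4, 5, 8, 10, 11, 12]
```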

### Image to Word

```bash
# JPG, PNG, BMP, WebP, and TIFF are supported
pdf2word photo.jpg -o output.docx
pdf2word scan.png --engine paddleocr
pdf2word doc.jpg --engine glm-ocr
```
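
Because the same command accepts both PDFs and images, something has to decide which kind of input it was given. One simple way to do that is to look at the file suffix, as in the sketch below. The names `IMAGE_SUFFIXES` and `detect_input_type` are hypothetical, and the suffix list (including the `.jpeg` and `.tif` variants) is an assumption derived from the formats listed above; the project may instead inspect file contents.

```python
from pathlib import Path

# Assumed suffixes for the image formats this README lists.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".bmp", ".webp", ".tif", ".tiff"}


def detect_input_type(path: str) -> str:
    """Return "pdf" or "image" based on the file suffix (case-insensitive)."""
    suffix = Path(path).suffix.lower()
    if suffix == ".pdf":
        return "pdf"
    if suffix in IMAGE_SUFFIXES:
        return "image"
    raise ValueError(f"unsupported input type: {suffix or '(no suffix)'}")
```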

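### Choosing an engine

```bash
# Automatic selection (default)
pdf2word input.pdf --engine auto

# Direct conversion of text PDFs; highest quality
pdf2word input.pdf --engine pdf2docx

# Local OCR: free, works offline, supports table and heading recognition
pdf2word input.pdf --engine paddleocr

# GLM-OCR cloud engine; works best on complex documents
pdf2word input.pdf --engine glm-ocr
```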

### OCR language

```bash
# Chinese (default)
pdf2word input.pdf --engine paddleocr --lang ch

# English
pdf2word input.pdf --engine paddleocr --lang en

# Mixed Chinese and English
pdf2word input.pdf --engine paddleocr --lang ch_en
```

### Verbose logging

```bash
pdf2word input.pdf -v
```

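## Engine Comparison

| Feature | pdf2docx | paddleocr | glm-ocr |
|---|---|---|---|
| Best for | Digital PDFs (with a text layer) | Scanned documents, mixed documents | Complex documents (many tables or images) |
| Image input | Not supported | Supported | Supported |
| Network | Offline | Offline | Network required |
| Cost | Free | Free | Billed per API call |
| Table recognition | Preserved directly | Local recognition via PPStructure | Cloud layout analysis |
| Image handling | Preserved directly | Second-pass OCR check (text → editable text; pure image → kept) | GLM's own judgment (text → editable text; pure image → kept) |
| Heading detection | Preserved directly | Detected automatically by layout analysis | Detected automatically by layout analysis |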

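### How auto mode selects an engine

```
PDF input:
  Page has a text layer and no complex layout        →  pdf2docx
  Page has a text layer and contains tables/images   →  glm-ocr
  Page has no text layer (scanned)                   →  paddleocr

Image input:
  paddleocr by default (local, free)
```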
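
The selection rules above amount to a small decision function. The following is a minimal sketch of that logic only. `PageFeatures` and `choose_engine` are hypothetical names, and the two boolean fields are an assumed summary of whatever the page classifier actually measures; how the project detects a text layer or a "complex layout" is not described in this README.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PageFeatures:
    """Hypothetical per-page summary produced by a page classifier."""
    has_text_layer: bool
    has_tables_or_images: bool


def choose_engine(input_type: str, page: Optional[PageFeatures] = None) -> str:
    """Map an input (and, for PDFs, one page's features) to an engine name."""
    if input_type == "image":
        return "paddleocr"  # image input defaults to the local engine
    if page is None:
        raise ValueError("PDF input requires page features")
    if not page.has_text_layer:
        return "paddleocr"  # scanned page: needs OCR
    # Text layer present: complex pages go to glm-ocr, simple ones to pdf2docx.
    return "glm-ocr" if page.has_tables_or_images else "pdf2docx"
```

Applying this per page, rather than per file, is what would let a mixed PDF send its text pages and scanned pages to different engines.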

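### Deciding how to handle figure regions

```
PaddleOCR engine:
  figure region → crop → second-pass PaddleOCR check
    text found (≥ 5 characters, confidence ≥ 0.7) → converted to editable text
    no text                                       → original image kept

GLM-OCR engine:
  figure region → inspect the content returned by GLM-OCR
    content has text → converted to editable text (already recognized by GLM)
    content is empty → region cropped and kept as the original image
```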
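
The two branches above can be sketched as a pair of small functions that return the recognized text, or `None` when the figure should be kept as an image. This is an illustration under stated assumptions, not the project's code: `OcrLine`, `paddle_figure_text`, and `glm_figure_text` are hypothetical names, the confidence threshold is assumed to apply per recognized line, whitespace is assumed not to count toward the 5-character minimum, and results that fall below either threshold are assumed to mean "keep the image".

```python
from dataclasses import dataclass
from typing import Optional

MIN_CHARS = 5          # threshold stated in this README
MIN_CONFIDENCE = 0.7   # threshold stated in this README


@dataclass
class OcrLine:
    """Hypothetical container for one recognized line and its score."""
    text: str
    confidence: float


def paddle_figure_text(lines: list[OcrLine]) -> Optional[str]:
    """Return editable text for a figure crop, or None to keep the image."""
    # Assumption: the confidence threshold is applied to each line.
    confident = [ln.text.strip() for ln in lines if ln.confidence >= MIN_CONFIDENCE]
    text = "\n".join(t for t in confident if t)
    # Assumption: whitespace does not count toward the character minimum.
    char_count = sum(1 for ch in text if not ch.isspace())
    return text if char_count >= MIN_CHARS else None


def glm_figure_text(content: Optional[str]) -> Optional[str]:
    """Return the content GLM-OCR already recognized, or None if it is empty."""
    text = (content or "").strip()
    return text or None
```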

## GLM-OCR Configuration

The `glm-ocr` engine requires a Zhipu AI API key.

**Option 1: environment variable**

```bash
export ZHIPU_API_KEY=your_api_key
pdf2word input.pdf --engine glm-ocr
```

**Option 2: .env file**

Create a `.env` file in the project root:

```
ZHIPU_API_KEY=your_api_key
```

To obtain an API key, see the [Zhipu AI Open Platform](https://open.bigmodel.cn/).
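
For readers who want to see how the two configuration options can coexist, here is a minimal standard-library sketch of a key lookup. The function name `load_api_key` is hypothetical, and the precedence (environment variable first, `.env` file as fallback) is an assumption; this README does not say which source wins or how the project reads the file.

```python
import os
from pathlib import Path
from typing import Optional


def load_api_key(env_file: str = ".env", name: str = "ZHIPU_API_KEY") -> Optional[str]:
    """Return the key from the environment, else from a NAME=value line in env_file."""
    value = os.environ.get(name)
    if value:
        return value  # assumed precedence: the environment variable wins
    path = Path(env_file)
    if not path.is_file():
        return None
    for raw_line in path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blanks, comments, and lines that are not assignments.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, val = line.partition("=")
        if key.strip() == name:
            # Drop surrounding whitespace and optional quotes.
            return val.strip().strip("'\"") or None
    return None
```

Keeping the key out of source code this way also makes it easy to add `.env` to `.gitignore` so the key is not committed.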

## Using as a Python Library

```python
from pdf2word.converter import convert_pdf_to_word, convert_to_word

# The input type (PDF or image) is detected automatically
convert_to_word("input.pdf", "output.docx")
convert_to_word("photo.jpg", "output.docx")

# Specify the engine
convert_to_word("scan.pdf", "output.docx", engine="paddleocr")
convert_to_word("doc.jpg", "output.docx", engine="glm-ocr")

# PDF only: specify a page range
convert_pdf_to_word("input.pdf", "output.docx", page_range="1-5,8", engine="glm-ocr")
```

## Project Structure

```
pdf2word/
├── cli.py                  # CLI entry point (accepts PDF and image input)
├── converter.py            # Main orchestrator: detects input type, dispatches engines
├── models.py               # Data models
├── engines/
│   ├── base.py             # Engine Protocol interface
│   ├── pdf2docx_engine.py  # Text PDF engine
│   ├── paddleocr_engine.py # PaddleOCR + PPStructure engine (with second-pass figure check)
│   └── glm_ocr_engine.py   # GLM-OCR cloud engine (with smart figure handling)
├── builders/
│   └── docx_builder.py     # Word document assembly
└── utils/
    ├── page_classifier.py  # Page classifier
    └── table_parser.py     # Table parser (HTML/Markdown)
```
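
The tree lists a table parser that handles HTML and Markdown, presumably because OCR engines return tables as markup that must become rows and cells before a Word table can be built. As an illustration of the Markdown half only, here is a minimal sketch. `parse_markdown_table` is a hypothetical name and not the API of `table_parser.py`; it does not handle escaped pipes, and it treats any row made up only of `-`, `:`, and spaces as the header separator.

```python
def parse_markdown_table(markdown: str) -> list[list[str]]:
    """Split a pipe-delimited Markdown table into rows of trimmed cell strings."""
    rows: list[list[str]] = []
    for raw_line in markdown.strip().splitlines():
        line = raw_line.strip()
        if not line.startswith("|"):
            continue  # ignore lines that are not table rows
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        # Skip the header separator row, e.g. |---|:---:|
        if cells and all(cell and set(cell) <= set("-: ") for cell in cells):
            continue
        rows.append(cells)
    return rows


table = """
| Name  | Qty |
|-------|----:|
| Apple | 3   |
| Pear  | 10  |
"""
print(parse_markdown_table(table))
# [['Name', 'Qty'], ['Apple', '3'], ['Pear', '10']]
```

The resulting list of rows maps directly onto a grid of the same dimensions, which is the shape a Word table builder needs.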

## License

MIT