A hands-on guide to extracting text from PDFs with Python and converting it to Markdown

ytkz · 2024-07-23 · 2024-09-09

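An earlier post covered how to get the text of a PDF with pdfplumber.

This post builds on that one and goes a step further: a hands-on walkthrough of extracting text from PDFs with Python and converting it to Markdown.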

A Step-by-Step Guide to Parsing PDFs using the pdfplumber Library In Python | by Azhar Sayyad | Medium

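Background

I have 12 PDFs of past comprehensive exam papers for the Registered Surveyor exam, but they are full of ads, and I want to get rid of them. For a handful of PDFs, paying for a WPS membership and cleaning them by hand would probably do the job. But these 12 PDFs cover 12 years of exam papers, and each PDF has 100 pages with one question per page. Removing the ads by hand would be a lot of work.

The plan for a single PDF is to convert it to text first and then clean that text, which yields ad-free text.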

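Approach

This is also how I usually approach problems when I write code.

Solve the simple problem first, extract what is common, then solve the complex problem, keeping a firm grip on inputs and outputs throughout.

That sentence can be read from many angles.

In this case: first convert a single PDF to text, then clean the data. After that, move on to converting and cleaning the PDFs in batch.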

Converting a single PDF to text
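
A minimal sketch of the single-file step, assuming pdfplumber is installed. `pdfplumber.open`, `pdf.pages` and `page.extract_text()` are the same calls used in the full code further down; the helper names `join_pages` and `read_pdf_text` are made up for illustration, and skipping pages without text is a defensive assumption rather than something these PDFs are known to need.

```python
def join_pages(page_texts):
    """Join per-page text into one string, ending each page with a newline.

    Pages with no extractable text (None or '') are skipped.
    """
    return "".join(text + "\n" for text in page_texts if text)


def read_pdf_text(pdf_path):
    """Return the text of every page in the PDF at pdf_path."""
    import pdfplumber  # third-party: pip install pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        # The generator is consumed inside the `with` block,
        # so the PDF is still open while pages are read.
        return join_pages(page.extract_text() for page in pdf.pages)
```

Keeping the joining logic separate from the file access makes it possible to check the text handling without a PDF at hand.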

Data cleaning

Because the ads in these PDFs follow a fixed pattern and never change, the cleaning is technically simple. Python's built-in string methods are enough.

original_pdf_text = read_pdf(pdf_file)  # convert the PDF to text
modified_string = original_pdf_text.replace(
    r'仅允许加其中一个群！！', '')  # data cleaning: remove the ad text
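
An exact `str.replace` only matches when the ad appears in the extracted text character for character. If a long ad wraps across lines during extraction, the spaces in the search string will not match the newlines in the text. The sketch below tolerates any whitespace between the pieces of the ad; `remove_ad` is a hypothetical helper, and whether such wrapping actually happens in these PDFs is an assumption.

```python
import re


def remove_ad(text, ad):
    """Remove every occurrence of `ad` from `text`.

    Wherever `ad` contains whitespace, any run of whitespace
    (spaces, newlines, or none at all) is accepted in `text`.
    """
    parts = [re.escape(part) for part in ad.split()]
    pattern = r"\s*".join(parts)
    return re.sub(pattern, "", text)


# Made-up sample: the ad is split over two lines in the extracted text.
sample = "Q1\nAD LINE\nONE\nA. x"
cleaned = remove_ad(sample, "AD LINE ONE")
```

The plain `replace` call is still the simpler choice when the ad always comes out on one line.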

Batch processing

That completes the single-file case. From here, batch processing is mostly a matter of controlling the input parameters.
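
One way to sketch the batch step is to first pair every input PDF with its output path, then run the single-file routine over the pairs. The version below uses `pathlib` from the standard library; `plan_outputs` is a hypothetical name, and matching the extension case-insensitively is an assumption that goes slightly beyond the full code.

```python
from pathlib import Path


def plan_outputs(in_dir, out_dir):
    """Pair each .pdf directly inside in_dir with a .md path in out_dir.

    Subdirectories are not searched.
    """
    in_dir, out_dir = Path(in_dir), Path(out_dir)
    pairs = []
    for pdf_path in sorted(in_dir.iterdir()):
        if pdf_path.is_file() and pdf_path.suffix.lower() == ".pdf":
            pairs.append((pdf_path, out_dir / (pdf_path.stem + ".md")))
    return pairs
```

Each `(pdf_path, md_path)` pair can then be handed to the convert-and-clean code for a single file.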

Full code

Putting all of the above together:

import pdfplumber
import os

def get_file_name(file_dir, ext):
    """
    Return the paths of the files in file_dir whose extension is ext.
    Subdirectories are not searched. ext may be given with or
    without the leading dot ('.pdf' or 'pdf').
    """
    if ext and not ext.startswith('.'):
        ext = '.' + ext
    matched_files = []
    for file in os.listdir(file_dir):
        if os.path.splitext(file)[1] == ext:
            matched_files.append(os.path.join(file_dir, file))
    return matched_files


def read_pdf(pdf_path):
    """Extract the text of every page and join answer options onto one line."""
    content = ''
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            # extract_text() returns the text of the page; fall back to ''
            # in case a page yields no text
            page_content = page.extract_text() or ''
            content = content + page_content + '\n'

    # Pull options B, C and D up onto the preceding line.
    # Note: this joins ANY line that starts with B, C or D.
    modified_string = content.replace("\nB", " B")
    modified_string = modified_string.replace("\nC", " C")
    modified_string = modified_string.replace("\nD", " D")
    return modified_string


def comprehensive_cleaning(path, outpath):
    """
    Batch-convert the comprehensive exam PDFs to Markdown and clean them.
    """
    os.makedirs(outpath, exist_ok=True)
    pdf_files_list = get_file_name(path, '.pdf')

    i = 0
    for pdf_file in pdf_files_list:
        original_pdf_text = read_pdf(pdf_file)
        # Remove the fixed ad text
        modified_string = original_pdf_text.replace(
            r'【路过讲堂测绘QQ群】 1群：517983234 2群：542530736 3群：397037429 4群：581154049 5群：158463229 注意：仅允许加其中一个群！！',
            '')
        # Write the cleaned text to a .md file named after the PDF
        outfile = os.path.join(outpath, os.path.splitext(os.path.basename(pdf_file))[0] + '.md')
        with open(outfile, 'w', encoding='utf-8') as f:
            f.write(modified_string)
        i += 1
        # Redraw a 50-character progress bar on the same console line
        print("\rConverting PDF to Markdown: [{0:50s}] {1:.1f}%".format('#' * int(i / (len(pdf_files_list)) * 50),
                                                                 i / len(pdf_files_list) * 100), end="",
              flush=True)

if __name__ == '__main__':
    path = r'D:\Registered_Surveyor\pdf'
    outpath = r'D:\Registered_Surveyor\markdown\test2022'
    comprehensive_cleaning(path, outpath)
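
The three `replace` calls in `read_pdf` join any line that starts with B, C or D onto the previous line, including ordinary sentences that happen to begin with one of those letters. A narrower sketch is shown below. It assumes each option label is a letter followed by `.`, `．` or `、`, which should be checked against the actual extracted text; `merge_options` is a hypothetical helper, not part of the code above.

```python
import re

# Assumed option format: a letter B-D followed by '.', '．' or '、'.
# The lookahead leaves the label itself in place; only the newline is replaced.
OPTION_START = re.compile(r"\n(?=[BCD][.．、])")


def merge_options(text):
    """Join lines that start with an option label B-D onto the previous line."""
    return OPTION_START.sub(" ", text)


# Made-up sample: the last line starts with 'B' but is not an option.
sample = "1. Q\nA. x\nB. y\nC. z\nD. w\nBecause"
merged = merge_options(sample)
```

With this pattern the final line of the sample stays on its own line, while the three options are pulled up next to option A.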