PDF Document Structure Information Extraction System

Mobile reading devices now come in many forms, such as mobile phones, e-readers, and GPS units. Displaying electronic documents adaptively across these platforms requires reflowing the document content and adjusting it to the screen so that reading remains comfortable. To meet the readability requirements of mobile reading and allow the display of document content to be readjusted, the key is to obtain the physical and logical structure of the original document together with its sequence information. In addition, extracting structured information from electronic documents directly affects progress in application areas such as information retrieval, text mining, search engines, machine translation, and information storage and management.

For these reasons, the research team developed a prototype of the "PDF Document Structure Information Extraction System". The prototype can extract text, image, chart, and formula information from structured PDF documents.