Unlock the content in your PDFs

Presentation Abstract
Scenario: I’ve been creating documents for years, but only in the last three years have I been creating my new documents in XML. Now I would like to convert the old documents, most of which exist only as PDFs, into XML so that I can use that XML to create web content and ePubs, reuse it in my new documents, or simply update and republish the previous documents. What are my options? I can rekey the content, try to extract it from the PDF binary myself, or use software tools to do the conversion. This presentation will focus on how software tools can do the conversion in two steps: first, by analyzing the PDF and converting it to a very rich generic XML format that identifies the building blocks of the page, the typographic properties of the text (typeface, point size, color, weight, etc.), and the X and Y coordinates of each item on the page; and second, by using XSLT to convert the rich XML into the desired output (XML, HTML, text, or another format). From the very rich XML you can identify the intelligence of the document, that is, its logical structure, in order to apply the XML schema you are using.
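To make the idea of a "very rich generic XML format" concrete, here is a minimal sketch in Python. It assumes the PDF analysis has already produced one record per content block; the element and attribute names (`page`, `block`, `x`, `y`, `font`, `size`, `weight`, `color`) and the sample values are hypothetical, chosen only for illustration, and do not represent the format of any particular product.

```python
import xml.etree.ElementTree as ET


def blocks_to_rich_xml(page_number, width, height, blocks):
    """Serialize analyzed content blocks into a generic, layout-rich XML page.

    All element and attribute names are hypothetical. Each block keeps its
    text plus the typographic and positional facts found on the page.
    """
    page = ET.Element(
        "page", number=str(page_number), width=str(width), height=str(height)
    )
    for b in blocks:
        block = ET.SubElement(
            page,
            "block",
            x=str(b["x"]),
            y=str(b["y"]),
            font=b["font"],
            size=str(b["size"]),
            weight=b["weight"],
            color=b["color"],
        )
        block.text = b["text"]
    return page


# Invented sample data standing in for the output of a PDF analysis step.
sample_blocks = [
    {"x": 72, "y": 90, "font": "Helvetica", "size": 18,
     "weight": "bold", "color": "#000000", "text": "Introduction"},
    {"x": 72, "y": 130, "font": "Times", "size": 11,
     "weight": "normal", "color": "#000000", "text": "Body text of the first paragraph."},
]

rich_page = blocks_to_rich_xml(1, 612, 792, sample_blocks)
print(ET.tostring(rich_page, encoding="unicode"))
```

Nothing in this intermediate format says "title" or "paragraph" yet. It records only what is physically on the page, which is what lets a later transformation step decide how to tag each block.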

What can attendees expect to learn?

Attendees will learn, by examining the technical components that make up the solution, that automated conversion of PDF to XML is possible for a large collection of documents that share the same physical layout. The approach is not one that would be used for PDFs with arbitrary, inconsistent layouts. A page can be thought of as a set of content blocks, each of which is constructed from lines of text and images. By analyzing the content blocks it is possible to identify the type of content (paragraphs, tables, images, columns, etc.) and then save that information, along with the content and its physical location on the page, to a very rich XML format. From the very rich XML it can then be determined whether content is a title, list, paragraph, caption, table, page heading, page footing, etc., and the appropriate XML tagging can be applied. This approach can also be used to extract content from PDFs for database, search, and other applications.
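The tagging step can be sketched as a handful of layout rules. The example below is a Python stand-in for the XSLT transformation that the presentation describes, used here only so the sketch is self-contained. The rich XML vocabulary, the thresholds, the output tag names, and the assumption that `y` is measured from the top of the page are all hypothetical; rules like these only work because every document in the collection shares the same layout.

```python
import xml.etree.ElementTree as ET

# Hypothetical rich XML for one page; y is assumed to be measured from the top.
RICH_XML = """<page number="1" width="612" height="792">
  <block x="72" y="30" size="9" weight="normal">Annual Report</block>
  <block x="72" y="90" size="18" weight="bold">Introduction</block>
  <block x="72" y="130" size="11" weight="normal">Body text of the first paragraph.</block>
  <block x="72" y="760" size="9" weight="normal">Page 1</block>
</page>"""


def classify(block, page_height):
    """Pick a structural tag for a block from its position and typography."""
    y = float(block.get("y"))
    size = float(block.get("size"))
    if y < 0.06 * page_height:       # very top of the page: running header
        return "page-heading"
    if y > 0.94 * page_height:       # very bottom of the page: running footer
        return "page-footing"
    if size >= 16 and block.get("weight") == "bold":
        return "title"               # large bold text in the body area
    return "para"                    # everything else is treated as a paragraph


def tag_page(rich_xml):
    """Convert one rich XML page into structurally tagged XML."""
    page = ET.fromstring(rich_xml)
    height = float(page.get("height"))
    document = ET.Element("document")
    for block in page.findall("block"):
        element = ET.SubElement(document, classify(block, height))
        element.text = block.text
    return document


tagged = tag_page(RICH_XML)
print(ET.tostring(tagged, encoding="unicode"))
```

In a real conversion the same decisions would be expressed as XSLT templates matching on the rich XML attributes, with output element names taken from the target schema rather than the placeholder names used here.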

Meet the Presenter

Michael Miller is Vice President of Antenna House. For over 30 years, he has been involved in high-end document formatting and document management and has an extensive background with structured data, including SGML, XML, S1000D, and DITA. He has worked extensively in Europe and North America and has been involved in the implementation of hundreds of systems for automated document formatting.
