gross - Best Practices Conference

Automation Techniques to Create XML from Paper and Page Images

Everything’s digital, so why talk about paper and images? Not so fast. There is still so much content trapped in paper or scanned images—documents of record, training materials, policy manuals, safety data—all kinds of stuff. Some of it is quite voluminous. Often it’s useful to get it into a form that you can re-use, incorporate, search, find, distribute, etc. While OCR is quite good these days, once you get into technical materials with illustrations, math, chemistry and other such artifacts, things can quickly get ugly.

Drawing on his experience working with the US Patent Office, where DCL currently creates XML from page images at the rate of two million pages a month, Mr. Gross will outline the issues, some successful approaches, and how they might apply to technical and training documentation.

What can attendees expect to learn?

During Mark’s discussion you will learn what happens to OCR accuracy in complex documents and how DCL has improved accuracy on such documents. Mark will also offer answers to two big questions: Is automated processing feasible? When is less than perfect good enough?

Meet the Presenter

Mark Gross, founder, President, and CEO of Data Conversion Laboratory (DCL), is a recognized authority on XML implementation and document conversion. He also serves as Project Executive, with overall responsibility for resource management and planning.

Prior to founding DCL in 1981, Mark was with the consulting practice of Arthur Young & Co. Mark has a BS in Engineering from Columbia University and an MBA from New York University. He has also taught at the New York University Graduate School of Business, the New School, and Pace University. He is a frequent speaker on the topic of automated conversions to XML and SGML.
