OTM 2013 - LNCS 8185-8186

Cooperative and Fast-Learning Information Extraction from Business Documents for Document Archiving

Daniel Esser

Technical University Dresden, Computer Networks Group, 01062 Dresden, Germany
daniel.esser@tu-dresden.de

Abstract. Automatic information extraction from scanned business documents is especially valuable in the application domain of document management and archiving. Although current solutions for document classification and extraction work well, they still require considerable on-site configuration effort from domain experts or administrators. Small office/home office (SOHO) users and private individuals in particular often do not use such systems because of the need for configuration and the long training periods required to reach acceptable extraction rates. We therefore present a solution for information extraction from scanned business documents that meets the requirements of these users. Our approach is highly adaptable to new document types and index fields and needs only a minimal number of training documents to reach extraction rates comparable to those of related work and of manual document indexing. By providing a cooperative extraction system that allows participants to share extraction knowledge, we further aim to minimize the amount of user feedback required and to increase the acceptance of such a system.

A first evaluation of our solution on a set of 12,500 documents with 10 commonly used fields shows competitive results, with an F1-measure above 85%. An F1-measure above 75% is already reached with a minimal training set of only one document per template.
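For readers unfamiliar with the metric, the F1-measure reported above is conventionally the harmonic mean of precision and recall. The abstract does not state how correct and incorrect extractions are counted per field, so the following is only a minimal sketch of the standard definition; the function name `f1_measure` and the example counts are hypothetical and are not taken from the paper.

```python
def f1_measure(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Standard F1: harmonic mean of precision and recall.

    Returns 0.0 when precision or recall is undefined or both are zero.
    How extractions are counted as correct is an assumption here;
    the abstract does not specify it.
    """
    predicted = true_positives + false_positives   # values the system extracted
    actual = true_positives + false_negatives      # values present in the ground truth
    if predicted == 0 or actual == 0:
        return 0.0
    precision = true_positives / predicted
    recall = true_positives / actual
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for a single index field (not results from the paper):
# 80 correct extractions, 10 wrong extractions, 20 missed values.
score = f1_measure(true_positives=80, false_positives=10, false_negatives=20)
print(f"F1 = {score:.3f}")
```

With these illustrative counts, precision is 80/90 and recall is 80/100, which gives an F1 of roughly 0.842.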

Keywords: Document Layout Analysis, Information Extraction, Cooperative Extraction, Few-Exemplar-Learning

LNCS 8186, p. 22 ff.
