Cooperative and Fast-Learning Information Extraction from Business Documents for Document Archiving

Daniel Esser
Technical University Dresden, Computer Networks Group, 01062 Dresden, Germany
daniel.esser@tu-dresden.de

Abstract. Automatic information extraction from scanned business documents is especially valuable in the application domain of document management and archiving. Although current solutions for document classification and extraction work well, they still require considerable on-site configuration effort by domain experts or administrators. Small office/home office (SOHO) users and private individuals in particular often do not use such systems because of the configuration they need and the long training periods required to reach acceptable extraction rates. We therefore present a solution for information extraction from scanned business documents that fits the requirements of these users. Our approach is highly adaptable to new document types and index fields and needs only a minimum of training documents to reach extraction rates comparable to related work and to manual document indexing. By providing a cooperative extraction system that allows participants to share extraction knowledge, we furthermore aim to minimize the amount of user feedback required and to increase the acceptance of such a system. A first evaluation of our solution on a set of 12,500 documents with 10 commonly used fields shows competitive results above 85% F1-measure. Results above 75% F1-measure are already reached with a minimal training set of only one document per template.

Keywords: Document Layout Analysis, Information Extraction, Cooperative Extraction, Few-Exemplar-Learning

LNCS 8186, p. 22 ff. lncs@springer.com