deepdoctection

May 28, 2024 | seedling, permanent

tags: Deep Learning

Summary

deepdoctection is a Document AI package: a Python library that orchestrates document extraction and document layout analysis tasks using deep learning models. It does not implement models itself. Instead, it lets you build pipelines from widely used libraries for object detection, OCR, and selected NLP tasks, and it provides an integrated framework for fine-tuning, evaluating, and running models. For more specialized text processing tasks, use one of the many other NLP libraries.

deepdoctection focuses on applications and is made for those who want to solve real-world problems related to document extraction from PDFs or from scans in various image formats.

A demo of a document layout analysis pipeline with OCR is available on 🤗 Hugging Face Spaces.

deepdoctection provides model wrappers for the supported libraries so that models for various tasks can be integrated into pipelines. Its core functionality does not depend on any specific deep learning library.
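The wrapper idea can be illustrated with a minimal, library-free sketch. Every name in it (`Page`, `Component`, `Pipeline`, `DummyLayoutDetector`, `DummyOcr`) is hypothetical and is not deepdoctection's API. The only point is that a pipeline can depend on a small interface instead of a specific deep learning framework.

```python
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class Page:
    """Hypothetical container that is passed from component to component."""
    image: bytes
    annotations: list = field(default_factory=list)


class Component(Protocol):
    """Anything with a `serve` method can act as a pipeline component."""

    def serve(self, page: Page) -> Page: ...


class DummyLayoutDetector:
    """Stand-in for a wrapped object detection model."""

    def serve(self, page: Page) -> Page:
        page.annotations.append({"type": "text_block", "box": (0, 0, 100, 50)})
        return page


class DummyOcr:
    """Stand-in for a wrapped OCR engine."""

    def serve(self, page: Page) -> Page:
        page.annotations.append({"type": "word", "text": "hello", "box": (5, 5, 40, 20)})
        return page


class Pipeline:
    """Runs components in order; it knows nothing about the models behind them."""

    def __init__(self, components: list) -> None:
        self.components = components

    def run(self, page: Page) -> Page:
        for component in self.components:
            page = component.serve(page)
        return page


pipeline = Pipeline([DummyLayoutDetector(), DummyOcr()])
result = pipeline.run(Page(image=b""))
print([a["type"] for a in result.annotations])  # ['text_block', 'word']
```

Swapping the OCR engine or the detection framework in such a design only means providing another class with the same `serve` method.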

Selected models for the following tasks are currently supported:

Document layout analysis, including table recognition, in TensorFlow with Tensorpack or in PyTorch with Detectron2.
OCR with support for Tesseract and docTR (TensorFlow and PyTorch implementations available), plus a wrapper for the API of a commercial solution.
Text mining for native PDFs with pdfplumber.
Language detection with fastText.
Deskewing and rotating images with jdeskew.
Document and token classification with all LayoutLM models provided by the Transformers library. (Yes, you can use any LayoutLM model with any of the provided OCR or pdfplumber tools straight away!)
Table detection and table structure recognition with table-transformer.
A small dataset for token classification is available, along with several new tutorials that show how to train and evaluate on this dataset using LayoutLMv1, LayoutLMv2, LayoutXLM and LayoutLMv3.
Comprehensive configuration of the analyzer, such as choosing different models, output parsing, and OCR selection. Check this notebook or the docs for more information.
Document layout analysis and table recognition now also run with TorchScript (CPU), so Detectron2 is no longer required for basic inference.
[new] More angle predictors, based on Tesseract and docTR, for determining the rotation of a document (not contained in the built-in Analyzer).
[new] Token classification with LiLT via transformers. We have added a model wrapper for token classification with LiLT and added some LiLT models to the model catalog that look promising, especially if you want to train a model on non-English data. The training script for LayoutLM can be used for LiLT as well, and we will be providing a notebook on how to train a model on a custom dataset soon.

On top of that, deepdoctection provides methods for pre-processing model inputs, such as cropping or resizing, and for post-processing results, such as validating duplicate outputs, relating words to detected layout segments, or ordering words into contiguous text. The output is in JSON format, which you can customize further.
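Two of these post-processing steps, relating words to layout segments and ordering words into text, can be sketched in plain Python. This is an illustration under stated assumptions, not deepdoctection's implementation: the helper names (`coverage`, `assign_words`, `to_text`), the coverage threshold, the line-bucketing heuristic, and all sample data are made up for the example.

```python
import json

# Boxes are (x0, y0, x1, y1) in pixels; all values are invented sample data.
segments = [
    {"id": "title", "box": (0, 0, 200, 40)},
    {"id": "paragraph", "box": (0, 50, 200, 120)},
]
words = [
    {"text": "world", "box": (60, 60, 100, 75)},
    {"text": "Report", "box": (10, 10, 70, 30)},
    {"text": "Hello", "box": (10, 60, 50, 75)},
    {"text": "again", "box": (10, 90, 50, 105)},
]


def coverage(word_box, segment_box):
    """Fraction of the word box area that lies inside the segment box."""
    wx0, wy0, wx1, wy1 = word_box
    sx0, sy0, sx1, sy1 = segment_box
    inter_w = max(0, min(wx1, sx1) - max(wx0, sx0))
    inter_h = max(0, min(wy1, sy1) - max(wy0, sy0))
    word_area = (wx1 - wx0) * (wy1 - wy0)
    return inter_w * inter_h / word_area if word_area else 0.0


def assign_words(words, segments, threshold=0.5):
    """Attach each word to the segment that covers most of it (assumed threshold)."""
    grouped = {seg["id"]: [] for seg in segments}
    for word in words:
        best = max(segments, key=lambda seg: coverage(word["box"], seg["box"]))
        if coverage(word["box"], best["box"]) >= threshold:
            grouped[best["id"]].append(word)
    return grouped


def to_text(segment_words, line_height=10):
    """Crude reading order: bucket words into lines by y0, then sort by x0."""
    ordered = sorted(
        segment_words,
        key=lambda w: (w["box"][1] // line_height, w["box"][0]),
    )
    return " ".join(w["text"] for w in ordered)


grouped = assign_words(words, segments)
output = {seg_id: to_text(ws) for seg_id, ws in grouped.items()}
print(json.dumps(output, indent=2))
```

For the sample data this yields `"Report"` for the title segment and `"Hello world again"` for the paragraph segment. Real documents need a more robust line-grouping step than the fixed `line_height` bucket used here.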
