unstructured

May 25, 2024 | seedling, permanent

tags :

Python Apps #

URL github

We get your data LLM-ready

80% of enterprise data exists in difficult-to-use formats like HTML, PDF, CSV, PNG, PPTX, and more (figure attributed to Gartner). Unstructured effortlessly extracts and transforms complex data for use with every major vector database and LLM framework.

It’s all we do, and we’re the only ones who do it.

Used in LangChain to extract data.

Dependencies #

  • libmagic-dev (filetype detection)
  • poppler-utils (images and PDFs)
  • tesseract-ocr (images and PDFs; install tesseract-lang for additional language support). Arabic OCR can be done with this package.
  • libreoffice (MS Office docs)
  • pandoc (EPUBs, RTFs and Open Office docs)
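A quick way to see which of these system dependencies are present is to probe for them from Python. This is only a minimal sketch: the executable names (`pdftoppm`, `tesseract`, `soffice`, `pandoc`) and the library name `magic` are assumptions about what each package usually installs, and they may differ between distributions.

```python
import ctypes.util
import shutil

# Assumed mapping from package name to an executable it usually provides.
# These executable names are assumptions and may differ by distribution.
EXECUTABLES = {
    "poppler-utils": "pdftoppm",
    "tesseract-ocr": "tesseract",
    "libreoffice": "soffice",
    "pandoc": "pandoc",
}


def check_dependencies() -> dict[str, bool]:
    """Return a mapping of dependency name to whether it was found."""
    found = {pkg: shutil.which(exe) is not None for pkg, exe in EXECUTABLES.items()}
    # libmagic is a shared library rather than a command, so look it up by name.
    found["libmagic"] = ctypes.util.find_library("magic") is not None
    return found


if __name__ == "__main__":
    for name, ok in check_dependencies().items():
        print(f"{name}: {'found' if ok else 'MISSING'}")
```

A missing entry only means the probe did not find it on this machine; it does not prove that the package is absent.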

Strategies #

ref Available options:

auto (default strategy) #

The “auto” strategy will choose the partitioning strategy based on document characteristics and the function kwargs.

fast #

The “fast” strategy will leverage traditional NLP extraction techniques to quickly pull all the text elements. It is not a good fit for image-based file types.

hi_res #

  • The “hi_res” strategy will identify the layout of the document using Detectron2. The advantage of “hi_res” is that it uses the document layout to gain additional information about document elements.
  • We recommend using this strategy if your use case is highly sensitive to correct classifications for document elements.

ocr_only #

The “ocr_only” strategy leverages Optical Character Recognition to extract text from image-based files.
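The four strategies above are selected with the `strategy` keyword of the library's `partition` function. The wrapper below is a small sketch, not part of the library: `partition_with_strategy` and `VALID_STRATEGIES` are hypothetical names introduced here, and the file name in the usage lines is a placeholder.

```python
from typing import Any

# The four strategy names listed in this note.
VALID_STRATEGIES = ("auto", "fast", "hi_res", "ocr_only")


def partition_with_strategy(filename: str, strategy: str = "auto") -> list[Any]:
    """Validate the strategy name, then hand the file to unstructured."""
    if strategy not in VALID_STRATEGIES:
        raise ValueError(
            f"Unknown strategy {strategy!r}; expected one of {VALID_STRATEGIES}"
        )
    # Imported lazily so the check above runs even if unstructured is not installed.
    from unstructured.partition.auto import partition

    return partition(filename=filename, strategy=strategy)


if __name__ == "__main__":
    # "example.pdf" is a placeholder path, not a file referenced in this note.
    for element in partition_with_strategy("example.pdf", strategy="fast"):
        print(type(element).__name__, "|", str(element)[:80])
```

Validating the name up front gives a clear error for a typo such as "hires" before any heavy model loading starts.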

Paddle OCR #

export OCR_AGENT="paddle"

When last tried, on <2024-04-11 Thu>, it threw this error (github, ref):

```
et_ocr_agent
    raise ValueError(
ValueError: ('Environment variable OCR_AGENT', ' must be set to an existing ocr agent module, not unstructured.partition.utils.ocr_models.paddle_ocr.OCRAgentPaddle.')
```

This happened even though the module “unstructured.partition.utils.ocr_models.paddle_ocr” and the class OCRAgentPaddle inside it both existed.
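One way to dig into an error like this is to import the dotted path by hand and look at the underlying exception. A module file can exist and still fail to import, for example when one of its own imports is missing; whether that was the cause here is not confirmed by this note. The helper `try_load` below is a hypothetical name and uses only the standard library.

```python
import importlib


def try_load(dotted_path: str):
    """Import `package.module.ClassName` and return (object, error)."""
    module_name, _, attr_name = dotted_path.rpartition(".")
    try:
        module = importlib.import_module(module_name)
        return getattr(module, attr_name), None
    except (ImportError, AttributeError, ValueError) as exc:
        # ImportError: the module or one of its own imports is missing.
        # AttributeError: the module loaded but has no such attribute.
        # ValueError: the path had no module part at all.
        return None, exc


if __name__ == "__main__":
    path = "unstructured.partition.utils.ocr_models.paddle_ocr.OCRAgentPaddle"
    obj, err = try_load(path)
    print(obj if err is None else f"{type(err).__name__}: {err}")
```

If the printed error names a different package than the one in the path, that package is the likelier thing to install or fix.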

OCR of Images #

2024-04-11_19-36-18_screenshot.png #

class OCRAgentPaddle(OCRAgent): def load_agent(self, language: str == DEFAULT_PADDLE, LANG): import paddle from unstructured.paddledcr import PaddleOCR I ""Loads the PaddleOCR agent as a global variable to ensure that we only load it once. u

