llmsherpa

May 21, 2024 | seedling, permanent

tags :

Python Apps #

GitHub: Developer APIs to Accelerate LLM Projects

LayoutPDFReader #

Most PDF-to-text parsers do not provide layout information. Often, even sentences are split with arbitrary CR/LFs, making it very difficult to find paragraph boundaries. This poses various challenges for chunking and for adding long-running contextual information, such as section headers, to passages while indexing/vectorizing PDFs for LLM applications such as retrieval-augmented generation (RAG).

LayoutPDFReader solves this problem by parsing PDFs along with hierarchical layout information such as:

  • Sections and subsections along with their levels.
  • Paragraphs - combines lines.
  • Links between sections and paragraphs.
  • Tables along with the section the tables are found in.
  • Lists and nested lists.
  • Join content spread across pages.
  • Removal of repeating headers and footers.
  • Watermark removal.
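To illustrate why the section hierarchy matters for chunking, here is a minimal sketch in plain Python. It does not call llmsherpa: the `Block` class and the `context_text` helper are hypothetical names that only mimic the idea of prefixing a paragraph with its enclosing section headers before indexing. In llmsherpa itself, a similar result is, as far as I know, available through the chunk method `to_context_text()`.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Block:
    """Hypothetical layout node: a section header or a paragraph."""
    tag: str   # "header" or "para"
    text: str
    # Exclude the back-reference from repr/eq to avoid infinite recursion.
    parent: Optional["Block"] = field(default=None, repr=False, compare=False)
    children: List["Block"] = field(default_factory=list)

    def add(self, child: "Block") -> "Block":
        """Attach a child block and return it, so calls can be chained."""
        child.parent = self
        self.children.append(child)
        return child


def context_text(block: Block) -> str:
    """Return the block text prefixed by all of its ancestor section headers."""
    headers = []
    node = block.parent
    while node is not None:
        if node.tag == "header":
            headers.append(node.text)
        node = node.parent
    # Ancestors were collected innermost first; reverse to outermost first.
    prefix = " > ".join(reversed(headers))
    return f"{prefix}\n{block.text}" if prefix else block.text


# Made-up document: one section, one subsection, one paragraph.
root = Block("header", "Annual Report")
risks = root.add(Block("header", "Risk Factors"))
para = risks.add(Block("para", "Currency fluctuations may affect revenue."))

chunk = context_text(para)
```

A chunk built this way carries its section path into the vector index, so a retrieved passage still says which part of the document it came from.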

Supported chunk types #

ref

  • para
  • list_item
  • table
  • header
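As a rough illustration of working with these four types, the sketch below groups parsed blocks by their tag. The `sample_blocks` list is made up for this example, and the assumption that each block is a dict with a `tag` key is mine; check the actual parser output before relying on it.

```python
from collections import defaultdict

# The four chunk types listed above.
CHUNK_TYPES = ("para", "list_item", "table", "header")


def group_by_tag(blocks):
    """Group block dicts by their 'tag', ignoring unknown or missing tags."""
    grouped = defaultdict(list)
    for block in blocks:
        tag = block.get("tag")
        if tag in CHUNK_TYPES:
            grouped[tag].append(block)
    return dict(grouped)


# Made-up sample data (assumed shape, not real parser output).
sample_blocks = [
    {"tag": "header", "text": "Introduction"},
    {"tag": "para", "text": "First paragraph."},
    {"tag": "para", "text": "Second paragraph."},
    {"tag": "list_item", "text": "A bullet point."},
    {"tag": "figure", "text": "Not one of the listed types."},
]

grouped = group_by_tag(sample_blocks)
```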

Backend service #

The llmsherpa backend service is now fully open sourced under the Apache 2.0 License. See https://github.com/nlmatics/nlm-ingestor

  • You can now run your own servers using a Docker image.
  • Support for different file formats: DOCX, PPTX, HTML, TXT, XML.
  • OCR support is built in.
  • Blocks now have coordinates: use the bbox property of blocks such as sections.
  • A new indent parser better aligns all headings in a document to their corresponding level.
  • The free server and paid server are not updated with the latest code; users are requested to spawn their own servers using the instructions in nlm-ingestor.
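A minimal sketch of pointing llmsherpa at a self-hosted server follows. The host `localhost`, the port `5010`, and the `renderFormat=all` value are assumptions for illustration, since they depend on how the container is started. `build_parser_url` and `read_pdf` are hypothetical helper names; `LayoutPDFReader` and its `read_pdf` method come from the llmsherpa package, which has to be installed separately.

```python
from urllib.parse import urlencode, urlunsplit


def build_parser_url(host="localhost", port=5010, render_format="all"):
    """Build the parseDocument endpoint URL for a self-hosted server.

    Host, port and render format are assumptions; adjust to your setup.
    """
    query = urlencode({"renderFormat": render_format})
    return urlunsplit(("http", f"{host}:{port}", "/api/parseDocument", query, ""))


def read_pdf(path_or_url, api_url):
    """Parse a PDF through the given server and return the document object."""
    # Imported lazily so this file loads even without llmsherpa installed.
    from llmsherpa.readers import LayoutPDFReader

    return LayoutPDFReader(api_url).read_pdf(path_or_url)


api_url = build_parser_url()
```

With a server running, `read_pdf("my.pdf", api_url)` should return the parsed document; `my.pdf` is a placeholder path.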

AzmX backend service #

http://193.122.83.21:5001/api/parseDocument?renderFormat=allk

