llmsherpa

May 21, 2024 | seedling, permanent

tags :

Python Apps #

github Developer APIs to Accelerate LLM Projects

LayoutPDFReader #

Most PDF-to-text parsers do not provide layout information. Often, even sentences are split by arbitrary CR/LFs, which makes it very difficult to find paragraph boundaries. This poses various challenges for chunking and for adding long-running contextual information, such as the section header, to passages while indexing/vectorizing PDFs for LLM applications such as retrieval-augmented generation (RAG).

LayoutPDFReader solves this problem by parsing PDFs along with hierarchical layout information such as:

Sections and subsections along with their levels.
Paragraphs - combines lines.
Links between sections and paragraphs.
Tables along with the section the tables are found in.
Lists and nested lists.
Join content spread across pages.
Removal of repeating headers and footers.
Watermark removal.
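
Why the section hierarchy matters for RAG can be sketched without the library. This is a minimal illustration, not llmsherpa's implementation: it assumes a simplified, hypothetical block shape (dicts with `tag`, `level` and `text` keys) and prefixes each paragraph with the chain of section headers above it.

```python
from typing import Dict, List


def contextualize(blocks: List[Dict]) -> List[str]:
    """Prefix each paragraph with the headers of its enclosing sections.

    `blocks` is an assumed, simplified shape: dicts with a `tag`
    ("header" or "para"), a `level` (0 = top-level section) and `text`.
    """
    header_stack: List[Dict] = []  # currently open sections, outermost first
    chunks: List[str] = []
    for block in blocks:
        if block["tag"] == "header":
            # A new header closes every open section at the same or a deeper level.
            while header_stack and header_stack[-1]["level"] >= block["level"]:
                header_stack.pop()
            header_stack.append(block)
        elif block["tag"] == "para":
            context = " > ".join(h["text"] for h in header_stack)
            chunks.append(f"{context}\n{block['text']}" if context else block["text"])
    return chunks


# Hypothetical sample data, standing in for parser output.
sample_blocks = [
    {"tag": "header", "level": 0, "text": "Methods"},
    {"tag": "header", "level": 1, "text": "Data"},
    {"tag": "para", "level": 2, "text": "We collected 100 PDFs."},
    {"tag": "header", "level": 1, "text": "Model"},
    {"tag": "para", "level": 2, "text": "We used a small encoder."},
]

for chunk in contextualize(sample_blocks):
    print(chunk)
    print("---")
```

Each chunk carries its section path ("Methods > Data"), so a retrieved passage still makes sense when it is read out of context.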

Supported chunk types #

ref

para
list_item
table
header
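
A small sketch of routing parsed blocks by chunk type. It assumes each block is a dict with a `tag` key holding one of the four values above. The `sample` data and the `other` bucket are illustrative, not part of llmsherpa.

```python
from typing import Dict, List

# The four chunk types listed in this note.
CHUNK_TYPES = ("para", "list_item", "table", "header")


def group_by_tag(blocks: List[Dict]) -> Dict[str, List[Dict]]:
    """Bucket blocks by their `tag`; unknown tags go to "other"."""
    grouped: Dict[str, List[Dict]] = {tag: [] for tag in CHUNK_TYPES}
    grouped["other"] = []
    for block in blocks:
        tag = block.get("tag")
        grouped[tag if tag in CHUNK_TYPES else "other"].append(block)
    return grouped


# Hypothetical sample data, standing in for parser output.
sample = [
    {"tag": "header", "text": "Results"},
    {"tag": "para", "text": "Accuracy improved."},
    {"tag": "list_item", "text": "First finding"},
    {"tag": "table", "text": "metric | value"},
    {"tag": "para", "text": "See the table above."},
]

groups = group_by_tag(sample)
print({tag: len(items) for tag, items in groups.items()})
```

Grouping like this makes it easy to handle tables differently from running text, for example by embedding them separately.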

Backend service #

The llmsherpa backend service is now fully open sourced under the Apache 2.0 license. See https://github.com/nlmatics/nlm-ingestor

You can now run your own servers using a Docker image. Additional file formats are supported: DOCX, PPTX, HTML, TXT, XML

OCR Support is built in
Blocks now have coordinates - use the bbox property of blocks such as sections
A new indent parser to better align all headings in a document to their corresponding level
The free server and the paid server are not updated with the latest code, so users are asked to spawn their own servers using the instructions in nlm-ingestor
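
A minimal sketch of pointing the reader at a self-hosted server. The host, port 5010 and the `applyOcr=yes` parameter are assumptions based on the nlm-ingestor and llmsherpa READMEs as I recall them, so check both against your own deployment. `build_parse_url` and `read_document` are hypothetical helper names.

```python
from urllib.parse import urlencode


def build_parse_url(host: str = "localhost", port: int = 5010,
                    render_format: str = "all", apply_ocr: bool = False) -> str:
    """Build the parseDocument URL for a self-hosted server (assumed layout)."""
    params = {"renderFormat": render_format}
    if apply_ocr:
        # Assumed query parameter for enabling OCR; verify against your server.
        params["applyOcr"] = "yes"
    return f"http://{host}:{port}/api/parseDocument?{urlencode(params)}"


def read_document(path_or_url: str, api_url: str):
    """Parse a document through the server; requires `pip install llmsherpa`."""
    # Imported lazily so the URL helper works without llmsherpa installed.
    from llmsherpa.readers import LayoutPDFReader

    reader = LayoutPDFReader(api_url)
    return reader.read_pdf(path_or_url)


print(build_parse_url())
print(build_parse_url(apply_ocr=True))
```

With a server running, `doc = read_document("paper.pdf", build_parse_url())` followed by `doc.chunks()` iterates over the parsed chunks.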

AzmX backend service #

http://193.122.83.21:5001/api/parseDocument?renderFormat=allk

Links to this note

Chunking for creating Embeddings