llmsherpa
Python Apps #
GitHub: Developer APIs to Accelerate LLM Projects
LayoutPDFReader #
Most PDF-to-text parsers do not provide layout information. Often, even sentences are split with arbitrary CR/LFs, making it very difficult to find paragraph boundaries. This makes it hard to chunk the text and to attach long-running contextual information, such as section headers, to passages when indexing or vectorizing PDFs for LLM applications such as retrieval-augmented generation (RAG).
LayoutPDFReader solves this problem by parsing PDFs along with hierarchical layout information such as:
- Sections and subsections along with their levels.
- Paragraphs, with their lines combined.
- Links between sections and paragraphs.
- Tables along with the section the tables are found in.
- Lists and nested lists.
- Content joined across pages.
- Removal of repeating headers and footers.
- Watermark removal.
Supported chunk types #
- para
- list_item
- table
- header
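As a rough illustration of why these tags matter for chunking, the sketch below attaches the chain of enclosing section headers to each non-header block. It is not the library's implementation: the `Block` class, its `level` field, and the `chunks_with_context` helper are hypothetical names introduced here, and the " > " separator is an arbitrary choice.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """Hypothetical stand-in for a parsed block; not the llmsherpa class."""
    tag: str        # one of: "para", "list_item", "table", "header"
    text: str
    level: int = 0  # heading depth; only used for "header" blocks in this sketch


def chunks_with_context(blocks):
    """Prefix each non-header block with the chain of headers above it."""
    header_stack = []  # currently open headers, outermost first
    out = []
    for block in blocks:
        if block.tag == "header":
            # Close headers at the same or a deeper level before opening this one.
            while header_stack and header_stack[-1].level >= block.level:
                header_stack.pop()
            header_stack.append(block)
        else:
            context = " > ".join(h.text for h in header_stack)
            out.append(f"{context}\n{block.text}" if context else block.text)
    return out


blocks = [
    Block("header", "Installation", level=0),
    Block("para", "Install the package with pip."),
    Block("header", "Requirements", level=1),
    Block("list_item", "Python 3.8 or newer"),
    Block("header", "Usage", level=0),
    Block("para", "Create a reader and parse a file."),
]
result = chunks_with_context(blocks)
```

Each entry in `result` carries its section path on the first line, so a passage retrieved on its own still says which section it came from.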
Backend service #
The llmsherpa backend service is now fully open source under the Apache 2.0 License. See https://github.com/nlmatics/nlm-ingestor
You can now run your own servers using a Docker image!
- Support for different file formats: DOCX, PPTX, HTML, TXT, XML
- OCR Support is built in
- Blocks now have coordinates: use the bbox property of blocks such as sections
- A new indent parser to better align all headings in a document to their corresponding level
- The free and paid servers are not updated with the latest code; users are asked to run their own servers using the instructions in nlm-ingestor
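A minimal sketch of pointing LayoutPDFReader at a self-hosted server might look like the following. The host `localhost` and port `5010` are assumptions for illustration and depend on how the container was started. The helper names `build_parser_url` and `read_chunks` are hypothetical. The `/api/parseDocument` path and `renderFormat` parameter follow the URL pattern used on this page.

```python
from urllib.parse import urlencode


def build_parser_url(host="localhost", port=5010, render_format="all"):
    """Build the parseDocument URL for a self-hosted server (host/port assumed)."""
    query = urlencode({"renderFormat": render_format})
    return f"http://{host}:{port}/api/parseDocument?{query}"


def read_chunks(pdf_path_or_url, parser_url):
    """Parse a PDF through the given server and return chunk texts with context."""
    # Imported lazily so the URL helper works without llmsherpa installed.
    from llmsherpa.readers import LayoutPDFReader

    reader = LayoutPDFReader(parser_url)
    doc = reader.read_pdf(pdf_path_or_url)
    # to_context_text() includes the parent section information with each chunk.
    return [chunk.to_context_text() for chunk in doc.chunks()]


parser_url = build_parser_url()
```

Calling `read_chunks("report.pdf", parser_url)` requires the server to be running and the `llmsherpa` package to be installed; `report.pdf` is a placeholder file name.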
AzmX backend service #
http://193.122.83.21:5001/api/parseDocument?renderFormat=allk