Apache Tika #
The Apache Tika toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF). It is written in Java.
Features of Apache Tika #
Medium article explaining Tika
Unified parser Interface #
Tika wraps different third-party parser libraries behind a single parser interface. With this feature, the user no longer needs to select the correct parser library according to the file type.
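To make the idea concrete, here is a minimal sketch of a single entry point that dispatches to format-specific backends. The `DocumentParser` and `UnifiedParser` types are hypothetical and exist only for illustration; they are not Tika classes. In Tika itself, `AutoDetectParser` plays this role and selects the backend from the detected media type, whereas this sketch assumes a simpler lookup by file extension.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Illustration only: a hypothetical backend contract, not a Tika interface.
interface DocumentParser {
    String extractText(byte[] content);
}

// Illustration only: one entry point in front of several backends.
final class UnifiedParser {
    // One format-specific parser per (lower-cased) file extension.
    private final Map<String, DocumentParser> parsersByExtension = new HashMap<>();

    void register(String extension, DocumentParser parser) {
        parsersByExtension.put(extension.toLowerCase(Locale.ROOT), parser);
    }

    // Callers always use this method; the lookup decides which backend runs.
    String parse(String fileName, byte[] content) {
        int dot = fileName.lastIndexOf('.');
        String extension = dot < 0 ? "" : fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        DocumentParser parser = parsersByExtension.get(extension);
        if (parser == null) {
            throw new IllegalArgumentException("No parser registered for: " + fileName);
        }
        return parser.extractText(content);
    }

    public static void main(String[] args) {
        UnifiedParser unified = new UnifiedParser();
        unified.register("txt", bytes -> new String(bytes, StandardCharsets.UTF_8));
        // Stand-in for a markup-aware backend: strips anything that looks like a tag.
        unified.register("html", bytes ->
                new String(bytes, StandardCharsets.UTF_8).replaceAll("<[^>]*>", "").trim());

        System.out.println(unified.parse("note.txt", "plain text".getBytes(StandardCharsets.UTF_8)));
        System.out.println(unified.parse("page.HTML", "<p>hello</p>".getBytes(StandardCharsets.UTF_8)));
    }
}
```

The calling code never names a backend; adding support for a new format only requires another `register` call.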
Low memory usage #
Tika has a small memory footprint, which makes it easy to embed in Java applications.
Fast processing #
Tika comes with fast content detection and extraction that applications can call directly.
Flexible metadata #
Tika understands the metadata models that are used to describe files.
Parser integration #
Tika can use the various parser libraries available for each document type within a single application.
MIME type detection #
Tika can detect, and extract content from, all the media types included in the MIME standards.
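One common building block of media type detection is matching the first bytes of a file against known signatures ("magic bytes"). The sketch below shows only that technique; the `MagicDetector` class and its small signature table are hypothetical and written for illustration. Tika's own detection, exposed through the `detect` methods of the `Tika` facade class, covers far more types and combines several kinds of hints.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Illustration only: a hypothetical, deliberately tiny detector (not Tika code).
final class MagicDetector {
    // A byte prefix and the media type it indicates.
    private static final class Signature {
        final byte[] prefix;
        final String mediaType;

        Signature(byte[] prefix, String mediaType) {
            this.prefix = prefix;
            this.mediaType = mediaType;
        }
    }

    private static final List<Signature> SIGNATURES = new ArrayList<>();

    static {
        SIGNATURES.add(new Signature("%PDF-".getBytes(StandardCharsets.US_ASCII), "application/pdf"));
        SIGNATURES.add(new Signature(new byte[] {(byte) 0x89, 'P', 'N', 'G'}, "image/png"));
        SIGNATURES.add(new Signature(new byte[] {'P', 'K', 3, 4}, "application/zip"));
    }

    // Returns the first matching media type, or a generic fallback.
    static String detect(byte[] content) {
        for (Signature signature : SIGNATURES) {
            if (startsWith(content, signature.prefix)) {
                return signature.mediaType;
            }
        }
        return "application/octet-stream";
    }

    private static boolean startsWith(byte[] content, byte[] prefix) {
        if (content.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (content[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] pdfHeader = "%PDF-1.7\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect(pdfHeader));            // application/pdf
        System.out.println(detect(new byte[] {1, 2, 3})); // application/octet-stream
    }
}
```

Detecting from content rather than from the file name means a renamed or extension-less file is still classified correctly, as long as its signature is in the table.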
Language detection #
Tika includes a language identification feature, so documents can be handled according to the language they are written in.
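As a rough illustration of what language identification involves, the toy sketch below guesses a language by counting common stopwords. The `StopwordLanguageGuesser` class, its word lists, and the `"unknown"` fallback are all assumptions made for this example; this is not how Tika's language detectors are implemented, and a real detector is considerably more robust.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Illustration only: a hypothetical stopword counter (not Tika code).
final class StopwordLanguageGuesser {
    // A handful of very common words per language code.
    private static final Map<String, Set<String>> STOPWORDS = new LinkedHashMap<>();

    static {
        STOPWORDS.put("en", new HashSet<>(Arrays.asList("the", "and", "is", "of", "to", "in")));
        STOPWORDS.put("de", new HashSet<>(Arrays.asList("der", "die", "und", "ist", "das", "nicht")));
        STOPWORDS.put("fr", new HashSet<>(Arrays.asList("le", "la", "et", "est", "les", "des")));
    }

    // Returns the language whose stopwords occur most often, or "unknown" if none occur.
    static String guess(String text) {
        // Split on anything that is not a letter.
        String[] words = text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+");
        String best = "unknown";
        int bestHits = 0;
        for (Map.Entry<String, Set<String>> entry : STOPWORDS.entrySet()) {
            int hits = 0;
            for (String word : words) {
                if (entry.getValue().contains(word)) {
                    hits++;
                }
            }
            if (hits > bestHits) {
                bestHits = hits;
                best = entry.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(guess("The cat is in the garden"));
        System.out.println(guess("Der Hund ist nicht im Haus und das ist gut"));
    }
}
```

Once each document carries a language code, an application can route it accordingly, for example to a language-specific index or analyzer.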