Text extraction

This module provides a framework for extracting text content from various file formats, such as PDFs and Office documents. The extracted content can be used programmatically or made available directly on your File objects.

Installation

composer require silverstripe/textextraction

GitHub repository

https://github.com/silverstripe/silverstripe-textextraction

Configuration

Configuration options, including enabling extraction for DataObjects, managing cached content length, swapping cache backends, and configuring PDF text extraction.

Usage

Methods for extracting text from a file path or a File object, and through the FileTextExtractable extension.

Apache Solr

Apache Solr's role in text extraction through Apache Tika, along with its configuration and content indexing.

Tika

Using Apache Tika for text extraction, in either a CLI or a REST server configuration.
