Version 5 supported
This version of Silverstripe CMS is still supported, though it will not receive any additional features.

Text extraction

This module provides a framework for extracting text content from various file formats, such as PDFs and Office documents. The extracted content can be used programmatically or made available directly on your File objects.

Installation

composer require silverstripe/textextraction

GitHub repository

https://github.com/silverstripe/silverstripe-textextraction

Configuration
Configuration options, including enabling extraction for DataObjects, managing cached content length, swapping cache backends, and configuring PDF text extraction

Usage
Various methods for text extraction, including extraction via file path or File object, and using the FileTextExtractable extension

Apache Solr
Apache Solr's role in text extraction using Apache Tika, its configuration, and content indexing

Tika
Using Apache Tika for text extraction, using either CLI or REST server configurations