Issue
While indexing files, you receive this error:
Caused by: org.apache.tika.exception.TikaException: Unable to extract PDF content
Resolution
You can check what Solr sees by running the Tika extractor manually:
- Install Java
- Download https://archive.apache.org/dist/tika/tika-app-0.10.jar
- Download the PDF in question.
- Run the following command:
java -jar tika-app-0.10.jar {filename-to-test}
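If several files are failing, the same manual check can be wrapped in a small script. The following is only a sketch: `check_pdf` and `TIKA_CMD` are hypothetical names introduced for this example, and it assumes a failed extraction either exits with a non-zero status or prints `TikaException` to standard error. If your Tika version behaves differently, inspect the log file directly.

```shell
#!/bin/sh
# Sketch: run the manual Tika check over one or more PDFs.
# TIKA_CMD and check_pdf are hypothetical names for this example.
TIKA_CMD="${TIKA_CMD:-java -jar tika-app-0.10.jar}"

check_pdf() {
    # Keep Tika's error output next to the PDF for later inspection.
    log="$1.tika.log"
    # Discard the extracted text; treat a non-zero exit status or a
    # logged TikaException as a failure (assumed behavior, see above).
    if $TIKA_CMD "$1" >/dev/null 2>"$log" && ! grep -q "TikaException" "$log"; then
        echo "OK   $1"
    else
        echo "FAIL $1 (see $log)"
    fi
}

# Check every file named on the command line.
for f in "$@"; do
    check_pdf "$f"
done
```

Saved as, for example, check-pdfs.sh, it could be run as `sh check-pdfs.sh *.pdf`. Each FAIL line points to a log that should contain the full stack trace.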
Cause
Tika can raise this error for many reasons. Here are a few common ones:
- The PDF could be password-protected.
- The file could be too large for Tika to process.
- It could use a PDF version or format that your version of Tika does not support.
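A quick way to narrow down which of these causes applies is to look at the file itself. The sketch below is a heuristic, not a definitive test: `inspect_pdf` is a hypothetical helper name, and it assumes that `head -c` and `grep -a` are available (they are on GNU and BSD systems). It relies on two properties of the PDF format: the file normally starts with a `%PDF-M.N` version header, and encrypted files carry an `/Encrypt` entry in their trailer.

```shell
#!/bin/sh
# Sketch: print basic facts about a PDF that failed extraction.
# inspect_pdf is a hypothetical helper name for this example.
inspect_pdf() {
    # File size in bytes (very large files may be a problem).
    size=$(wc -c < "$1" | tr -d ' ')
    echo "size_bytes=$size"

    # The first 8 bytes normally hold the version header, e.g. %PDF-1.4.
    # Non-printable bytes are dropped in case the file is not a PDF.
    header=$(head -c 8 "$1" | tr -cd '[:print:]')
    echo "header=$header"

    # Heuristic: an /Encrypt entry suggests password protection.
    if grep -a -q "/Encrypt" "$1"; then
        echo "encrypted=likely"
    else
        echo "encrypted=no-marker-found"
    fi
}

# Inspect every file named on the command line.
for f in "$@"; do
    inspect_pdf "$f"
done
```

A header that reports a recent PDF version points toward a version incompatibility, while `encrypted=likely` suggests trying a copy of the file with the password removed.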
To rule out a version incompatibility, you can convert the PDF file that is generating the error to an earlier PDF version. You can use something like this sample Ghostscript (https://www.ghostscript.com/) command to achieve this:
$ gs \
-sDEVICE=pdfwrite \
-dCompatibilityLevel=1.5 \
-o output.pdf \
input.pdf