
To go with option 1 for OCR'ing PDFs (run OCR against inline images), you need to specify configurations for the PDFParser like so:

    curl -T testOCR.pdf --header "X-Tika-PDFextractInlineImages: true"

To go with option 2 (render each page and then run OCR on that rendered image), you need to specify the OCR strategy:

    curl -T testOCR.pdf --header "X-Tika-PDFOcrStrategy: ocr_only"
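The two PDF options above differ only in the header sent with the request, so they can be wrapped in a small helper. This is a minimal sketch, not part of Tika: the function names `pdf_ocr_header` and `ocr_pdf` are hypothetical, and the server address in `TIKA_URL` is an assumption that you should adjust to wherever your Tika server is listening.

```shell
#!/bin/sh
# Hypothetical helper: map a PDF OCR option (1 or 2) to the header named in the text.
pdf_ocr_header() {
    case "$1" in
        1) printf '%s\n' 'X-Tika-PDFextractInlineImages: true' ;;  # OCR inline images
        2) printf '%s\n' 'X-Tika-PDFOcrStrategy: ocr_only' ;;      # render page, then OCR
        *) echo "unknown option: $1" >&2; return 1 ;;
    esac
}

# Assumption: the Tika server address; override by exporting TIKA_URL.
TIKA_URL="${TIKA_URL:-http://localhost:9998/tika}"

# Hypothetical wrapper: upload a PDF ($1) using the header for option $2.
ocr_pdf() {
    header=$(pdf_ocr_header "$2") || return 1
    curl -T "$1" "$TIKA_URL" --header "$header"
}
```

With a server running, `ocr_pdf testOCR.pdf 2` would send the file with the `ocr_only` strategy header.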
Once you have Tesseract and a fresh build of Tika 1.7-SNAPSHOT (including Tika server), you can easily use Tika-Server with Tesseract. For example, to post a TIFF file to the server and get back its OCR extracted text, run the following commands.

In one window, start Tika server:

    java -jar /path/to/

In another window, issue a cURL request:

    curl -T /path/to/tiff/image.tiff --header "Content-type: image/tiff"

Overriding the configured language as part of your request

Different requests may need processing using different language models. These can be specified for specific requests using the X-Tika-OCRLanguage custom header. An example of this is shown below:

    curl -T /path/to/tiff/image.jpg --header "X-Tika-OCRLanguage: eng"
    curl -T /path/to/tiff/image.jpg --header "X-Tika-OCRLanguage: eng+fra"

Overriding Default Configuration

In Tika 2.x, users can modify configurations via a tika-config.xml. With the exception of the paths, we document the defaults in the following.

See also PDFParser notes for more details on options for performing OCR on PDFs.

Note: With Tika server 1.x, the PDFConfig is generated for each document, so any configurations that you may specify in the tika-config.xml file that you pass to the tika-server on startup are overwritten. This behavior is changed in Tika 2.x, where the PDFConfig remembers settings from tika-config.xml and will only temporarily update custom configs sent via headers.
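The per-request language override via the X-Tika-OCRLanguage header can be scripted so that several language codes are joined with "+", as in the "eng+fra" example. This is a rough sketch under stated assumptions: `ocr_language_header` and `ocr_image` are hypothetical names, and `TIKA_URL` is an assumed server address, not something taken from the text.

```shell
#!/bin/sh
# Hypothetical helper: build the X-Tika-OCRLanguage header from one or more
# language codes, joining them with "+" (for example: eng fra -> eng+fra).
ocr_language_header() {
    if [ "$#" -eq 0 ]; then
        echo "at least one language code is required" >&2
        return 1
    fi
    langs=$(printf '%s+' "$@")   # yields "eng+fra+"
    printf 'X-Tika-OCRLanguage: %s\n' "${langs%+}"   # drop the trailing "+"
}

# Assumption: the Tika server address; override by exporting TIKA_URL.
TIKA_URL="${TIKA_URL:-http://localhost:9998/tika}"

# Hypothetical wrapper: upload an image ($1) with the remaining args as languages.
ocr_image() {
    file=$1
    shift
    header=$(ocr_language_header "$@") || return 1
    curl -T "$file" "$TIKA_URL" --header "$header"
}
```

For instance, `ocr_image /path/to/tiff/image.jpg eng fra` would request OCR with both language models.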

To reinstall Tesseract and leptonica with Brew, run these steps in order:

1. uninstall tesseract: brew uninstall tesseract
2. uninstall leptonica: brew uninstall leptonica
3. install leptonica with tiff support: brew install leptonica --with-libtiff
4. install tesseract: brew install tesseract tesseract-lang

Add "epel" to your yum repositories if it isn't already installed. To add language packs, see what's available:

    yum search tesseract

then install the ones you need.

There's some advice in the Tesseract GitHub issues and wiki on ways to speed it up, e.g. #263 and #1171 and this wiki page.

Once you have Tesseract installed, you should test it to make sure it's working:

    tesseract -psm 3 /path/to/tiff/file.tiff out.txt

You should see the output of the text extraction in out.txt. Look for the text extracted by Tesseract.

Once you have confirmed Tesseract is working, you can simply use the Tika-app, built with 1.7-SNAPSHOT or later, to use Tika OCR. For example, try that same file above with Tika. That's it! You should see the text extracted by Tesseract and flowed through Tika.
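Before running the test command above, it can help to confirm that the tesseract binary is actually on your PATH. The following is a small sketch for that check only; the function names `have_cmd` and `check_tesseract` are hypothetical and are not part of Tesseract or Tika.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the named command is available on PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# Hypothetical check: report whether tesseract is installed, and show its
# version banner when it is.
check_tesseract() {
    if have_cmd tesseract; then
        echo "tesseract found"
        tesseract --version 2>&1 | head -n 1
    else
        echo "tesseract missing"
        return 1
    fi
}
```

Calling `check_tesseract` after installation gives a quick yes/no answer before you try OCR on a real file.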

If you are having trouble getting Tesseract to work with TIFF files, read this link. If you have trouble installing via Brew, you can try installing Tesseract from source.

With TIKA-93 you can now use the awesome Tesseract OCR parser within Tika! First, some instructions on getting it installed.

Mac Installation Instructions

    brew install tesseract tesseract-lang

Issues with Installing via Brew are covered in the troubleshooting notes earlier on this page.
