pytesseract Library
I want to look into the pytesseract library, which is an optical character recognition (OCR) tool for Python.
References
- pytesseract pypi reference
- Google Tesseract OCR
- With the pytesseract Python library, you must be able to invoke the tesseract command as tesseract.
Related
- Optical Character Recognition
- Optical character recognition or optical character reader (OCR) is the electronic or mechanical conversion of images of typed, handwritten or printed text into machine-encoded text, whether from a scanned document, a photo of a document, a scene photo, or from subtitle text superimposed on an image.
- It is a common way of digitizing printed texts so that they can be electronically edited, searched, stored more compactly, displayed online, and used in machine processes such as cognitive computing, machine translation, text-to-speech, key data and text mining. OCR is a field of research in pattern recognition, artificial intelligence, and computer vision.
- Early versions needed to be trained with images of each character, and worked on one font at a time. Advanced systems capable of producing a high degree of accuracy for most fonts are now common, and with support for a variety of image file format inputs.
- Google Tesseract OCR
Tesseract 4 adds a new neural net (LSTM) based OCR engine that focuses on line recognition, but it still supports the legacy OCR engine of Tesseract 3, which works by recognizing character patterns.
Tesseract was originally developed at Hewlett-Packard Laboratories Bristol UK and at Hewlett-Packard Co, Greeley Colorado USA between 1985 and 1994, with some more changes made in 1996 to port to Windows, and some C++izing in 1998. In 2005 Tesseract was open sourced by HP. From 2006 until November 2018 it was developed by Google.
- The latest source code is available from the main branch on GitHub
- You can either install Tesseract via a pre-built binary package or build it from source.
Command Line Usage:
$ tesseract imagename outputbase [-l lang] [--oem ocrenginemode] [--psm pagesegmode] [configfiles...]
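As a rough sketch of how the parts of that usage line fit together, the helper below assembles the argument list and hands it to `subprocess.run`. The names `build_tesseract_command` and `run_tesseract` and their parameters are hypothetical, introduced only for illustration; the flags are the ones from the usage line above, and the code assumes the tesseract executable is on your PATH.

```python
import subprocess


def build_tesseract_command(image, outputbase, lang=None, oem=None, psm=None, configfiles=()):
    """Assemble the argv list for the tesseract usage line.

    Hypothetical helper: it only knows the flags shown in the usage line.
    """
    command = ["tesseract", str(image), str(outputbase)]
    if lang is not None:
        command += ["-l", lang]            # language code, e.g. "eng"
    if oem is not None:
        command += ["--oem", str(oem)]     # OCR engine mode
    if psm is not None:
        command += ["--psm", str(psm)]     # page segmentation mode
    command += list(configfiles)           # optional config file names
    return command


def run_tesseract(image, outputbase, **options):
    """Run tesseract as a subprocess; raises CalledProcessError on failure."""
    return subprocess.run(
        build_tesseract_command(image, outputbase, **options),
        check=True,
        capture_output=True,
        text=True,
    )
```

For example, `build_tesseract_command('test.png', 'out', lang='eng', psm=6)` yields `['tesseract', 'test.png', 'out', '-l', 'eng', '--psm', '6']`.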
Notes
Python-tesseract is an optical character recognition (OCR) tool for Python. That is, it will recognize and "read" the text embedded in images. Python-tesseract is a wrapper for Google's Tesseract-OCR Engine. It is also useful as a stand-alone invocation script to tesseract, as it can read all image types supported by the Pillow and Leptonica imaging libraries [...] Additionally, if used as a script, Python-tesseract will print the recognized text instead of writing it to a file.
from PIL import Image
import pytesseract
# If you don't have tesseract executable in your PATH, include the following:
pytesseract.pytesseract.tesseract_cmd = r'<full_path_to_your_tesseract_executable>'
# Example tesseract_cmd = r'C:\Program Files (x86)\Tesseract-OCR\tesseract'
# In order to bypass the image conversions of pytesseract, just use relative or absolute image path
# NOTE: In this case you should provide tesseract supported images or tesseract will return error
print(pytesseract.image_to_string('test.png'))
# Batch processing with a single file containing the list of multiple image file paths
print(pytesseract.image_to_string('images.txt'))
# Timeout/terminate the tesseract job after a period of time
try:
    print(pytesseract.image_to_string('test.jpg', timeout=2)) # Timeout after 2 seconds
    print(pytesseract.image_to_string('test.jpg', timeout=0.5)) # Timeout after half a second
except RuntimeError as timeout_error:
    # Tesseract processing is terminated
    pass
# Get bounding box estimates
print(pytesseract.image_to_boxes(Image.open('test.png')))
# Get verbose data including boxes, confidences, line and page numbers
print(pytesseract.image_to_data(Image.open('test.png')))
# Get information about orientation and script detection
print(pytesseract.image_to_osd(Image.open('test.png')))
# Get a searchable PDF
pdf = pytesseract.image_to_pdf_or_hocr('test.png', extension='pdf')
with open('test.pdf', 'w+b') as f:
    f.write(pdf) # pdf type is bytes by default
# Get HOCR output
hocr = pytesseract.image_to_pdf_or_hocr('test.png', extension='hocr')
# Get ALTO XML output
xml = pytesseract.image_to_alto_xml('test.png')
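The timeout handling in the example above could be wrapped in a small helper, sketched here under a few assumptions. The function name `ocr_or_none` and its `ocr_func` parameter are hypothetical; `ocr_func` exists only so the helper can be exercised without Tesseract installed. Catching `RuntimeError` is broad, so this may also hide other failures that surface as `RuntimeError`.

```python
def ocr_or_none(image, timeout, ocr_func=None):
    """Return the recognized text, or None if the OCR call raised RuntimeError.

    ocr_func is a hypothetical injection point for testing; when omitted,
    pytesseract.image_to_string is used.
    """
    if ocr_func is None:
        import pytesseract  # imported lazily so the helper loads without it
        ocr_func = pytesseract.image_to_string
    try:
        return ocr_func(image, timeout=timeout)
    except RuntimeError:
        # Raised when Tesseract is terminated after the timeout expires
        return None
```

A caller can then write `text = ocr_or_none('test.jpg', timeout=2)` and treat `None` as "no result in time".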
Functions
get_languages
- Returns all currently supported languages by Tesseract OCR
get_tesseract_version
- Returns the Tesseract version installed in the system
image_to_string
- Returns unmodified output as a string from Tesseract OCR processing
image_to_boxes
- Returns result containing recognized characters and their box boundaries
image_to_data
- Returns result containing box boundaries, confidences, and other information.
image_to_osd
- Returns result containing information about orientation and script detection.
image_to_alto_xml
- Returns result in the form of Tesseract's ALTO XML format
run_and_get_output
- Returns the raw output from Tesseract OCR. Gives a bit more control over the parameters that are sent to tesseract.
run_and_get_multiple_output
- Returns like run_and_get_output but can handle multiple extensions
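To make the "characters and their box boundaries" result of image_to_boxes more concrete, here is a minimal parsing sketch. It assumes the default string output uses the Tesseract box-file layout, one character per line as `char left bottom right top page`, with coordinates measured from the bottom-left corner of the image. The names `CharBox` and `parse_boxes` are hypothetical.

```python
from typing import NamedTuple


class CharBox(NamedTuple):
    """One recognized character and its bounding box (hypothetical type)."""
    char: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_boxes(boxes_text):
    """Parse box-file style text into a list of CharBox tuples."""
    boxes = []
    for line in boxes_text.splitlines():
        if not line.strip():
            continue
        # Split from the right so the character itself is kept intact
        char, *numbers = line.rsplit(" ", 5)
        if len(numbers) != 5:
            continue  # skip lines that do not match the assumed layout
        boxes.append(CharBox(char, *map(int, numbers)))
    return boxes
```

With pytesseract installed, the input would come from something like `parse_boxes(pytesseract.image_to_boxes(Image.open('test.png')))`.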
Parameters
image_to_data(image, lang=None, config='', nice=0, output_type=Output.STRING, timeout=0, pandas_config=None)
image
- Object or string - either PIL Image, NumPy array or file path of the image to be processed by Tesseract. If you pass an object instead of a file path, pytesseract will implicitly convert the image to RGB mode.
lang
- String - Tesseract language code string. Defaults to eng if not specified. Multiple languages example: lang='eng+fra'
config
- String - Any additional custom configuration flags that are not available via the pytesseract function.
nice
- Integer - modifies the processor priority for the Tesseract run.
output_type
- Class attribute - specifies the type of the output, defaults to string
timeout
- Integer or Float - duration in seconds of the OCR processing, after which pytesseract will terminate and raise RuntimeError
pandas_config
- Dict - only for the Output.DATAFRAME type.
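With the default output_type, image_to_data returns its result as a string. The sketch below assumes that string is tab-separated with a header row that includes `conf` and `text` columns, and that rows which are not words carry a confidence of -1. The function name `confident_words` and the default threshold of 60 are hypothetical choices for illustration.

```python
import csv
import io


def confident_words(tsv_text, min_conf=60.0):
    """Return (text, confidence) pairs for words at or above min_conf."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    words = []
    for row in reader:
        text = (row.get("text") or "").strip()
        try:
            conf = float(row.get("conf") or -1)
        except ValueError:
            continue  # skip rows whose confidence is not numeric
        if text and conf >= min_conf:
            words.append((text, conf))
    return words
```

With pytesseract installed, a call might look like `confident_words(pytesseract.image_to_data(Image.open('test.png'), lang='eng', timeout=5))`.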
Prerequisites
- Python 3.6+
- You need pillow installed
- You need Google Tesseract OCR installed, and it must be callable as tesseract from the command line
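A quick way to check these prerequisites from Python is sketched below, using only the standard library. The function name `check_prerequisites` is hypothetical, and the check only confirms that a `tesseract` executable is found on PATH and that the `PIL` package is importable; it does not verify versions of either.

```python
import importlib.util
import shutil
import sys


def check_prerequisites(tesseract_cmd="tesseract"):
    """Report which of the listed prerequisites appear to be satisfied."""
    return {
        # Python 3.6 or newer
        "python": sys.version_info >= (3, 6),
        # Pillow installs the importable package named PIL
        "pillow": importlib.util.find_spec("PIL") is not None,
        # The tesseract executable must be discoverable on PATH
        "tesseract": shutil.which(tesseract_cmd) is not None,
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'ok' if ok else 'missing'}")
```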
Installation
$ pip install pytesseract # Install with pip
$ pip install -U git+https://github.com/madmaze/pytesseract.git
$ # Install from source
$ git clone https://github.com/madmaze/pytesseract.git
$ cd pytesseract && pip install -U .
$ conda install -c conda-forge pytesseract # Install with conda