pytesseract Library

I want to look into pytesseract, an optical character recognition (OCR) library for Python.

References

pytesseract pypi reference
Google Tesseract OCR
- With the pytesseract Python library, you must be able to invoke the tesseract command as tesseract.

Optical Character Recognition
- Optical character recognition or optical character reader (OCR) is the electronic or mechanical conversion of images of typed, handwritten or printed text into machine-encoded text, whether from a scanned document, a photo of a document, a scene photo, or from subtitle text superimposed on an image.
- It is a common way of digitizing printed texts so that they can be electronically edited, searched, stored more compactly, displayed online, and used in machine processes such as cognitive computing, machine translation, text-to-speech, key data and text mining. OCR is a field of research in pattern recognition, artificial intelligence, and computer vision.
- Early versions needed to be trained with images of each character, and worked on one font at a time. Advanced systems capable of producing a high degree of accuracy for most fonts are now common, with support for a variety of image file formats as input.
Google Tesseract OCR

Tesseract 4 adds a new neural net (LSTM) based OCR engine that focuses on line recognition, but it still supports the legacy Tesseract 3 OCR engine, which works by recognizing character patterns.

Tesseract was originally developed at Hewlett-Packard Laboratories Bristol UK and at Hewlett-Packard Co, Greeley Colorado USA between 1985 and 1994, with some more changes made in 1996 to port to Windows, and some C++izing in 1998. In 2005 Tesseract was open sourced by HP. From 2006 until November 2018 it was developed by Google.

The latest source code is available from the main branch on GitHub.
You can either install Tesseract via a pre-built binary package or build it from source.

Command Line Usage:

$ tesseract imagename outputbase [-l lang] [--oem ocrenginemode] [--psm pagesegmode] [configfiles...]
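
To make the usage line above concrete, here is a minimal Python sketch that assembles the same argument list. The helper name build_tesseract_command and its keyword names are hypothetical (they are not part of Tesseract or pytesseract); only the flags shown in the usage line are used, and the command is built but not executed.

```python
from typing import List, Optional, Sequence


def build_tesseract_command(
    imagename: str,
    outputbase: str,
    lang: Optional[str] = None,
    oem: Optional[int] = None,
    psm: Optional[int] = None,
    configfiles: Sequence[str] = (),
) -> List[str]:
    """Assemble the argument list from the usage line (hypothetical helper)."""
    command = ["tesseract", imagename, outputbase]
    if lang is not None:
        command += ["-l", lang]          # e.g. "eng" or "eng+fra"
    if oem is not None:
        command += ["--oem", str(oem)]   # OCR engine mode
    if psm is not None:
        command += ["--psm", str(psm)]   # page segmentation mode
    command += list(configfiles)         # optional config file names come last
    return command


if __name__ == "__main__":
    # Only prints the list; hand it to subprocess.run(...) to actually run Tesseract.
    print(build_tesseract_command("test.png", "out", lang="eng+fra", psm=6))
```

Building a list rather than a single string avoids shell quoting problems when the image path contains spaces.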

Notes

Python-tesseract is an optical character recognition (OCR) tool for python. That is, it will recognize and read the text embedded in images. Python-tesseract is a wrapper for Google's Tesseract-OCR Engine. It is also useful as a stand-alone invocation script to tesseract, as it can read all image types supported by the Pillow and Leptonica imaging libraries [...] Additionally, if used as a script, Python-tesseract will print the recognized text instead of writing it to a file.

from PIL import Image
import pytesseract
# If you don't have tesseract executable in your PATH, include the following:
pytesseract.pytesseract.tesseract_cmd = r'<full_path_to_your_tesseract_executable>'
# Example tesseract_cmd = r'C:\Program Files (x86)\Tesseract-OCR\tesseract'

# In order to bypass the image conversions of pytesseract, just use relative or absolute image path
# NOTE: In this case you should provide tesseract supported images or tesseract will return error
print(pytesseract.image_to_string('test.png'))

# Batch processing with a single file containing the list of multiple image file paths
print(pytesseract.image_to_string('images.txt'))

# Timeout/terminate the tesseract job after a period of time
try:
    print(pytesseract.image_to_string('test.jpg', timeout=2)) # Timeout after 2 seconds
    print(pytesseract.image_to_string('test.jpg', timeout=0.5)) # Timeout after half a second
except RuntimeError as timeout_error:
    # Tesseract processing is terminated
    pass
    
# Get bounding box estimates
print(pytesseract.image_to_boxes(Image.open('test.png')))

# Get verbose data including boxes, confidences, line and page numbers
print(pytesseract.image_to_data(Image.open('test.png')))

# Get information about orientation and script detection
print(pytesseract.image_to_osd(Image.open('test.png')))

# Get a searchable PDF
pdf = pytesseract.image_to_pdf_or_hocr('test.png', extension='pdf')
with open('test.pdf', 'w+b') as f:
    f.write(pdf) # pdf type is bytes by default

# Get HOCR output
hocr = pytesseract.image_to_pdf_or_hocr('test.png', extension='hocr')

# Get ALTO XML output
xml = pytesseract.image_to_alto_xml('test.png')

Functions

get_languages
- Returns all currently supported languages by Tesseract OCR
get_tesseract_version
- Returns the Tesseract version installed in the system
image_to_string
- Returns unmodified output as a string from Tesseract OCR processing
image_to_boxes
- Returns result containing recognized characters and their box boundaries
image_to_data
- Returns result containing box boundaries, confidences, and other information.
image_to_osd
- Returns result containing information about orientation and script detection.
image_to_alto_xml
- Returns result in the form of Tesseract's ALTO XML format
run_and_get_output
- Returns the raw output from Tesseract OCR. Gives a bit more control over the parameters that are sent to tesseract.
run_and_get_multiple_output
- Returns like run_and_get_output but can handle multiple extensions
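
As a rough illustration of what the image_to_boxes result looks like, the sketch below parses its text output into named tuples. It assumes the usual Tesseract box layout of one symbol per line followed by left, bottom, right, top and page number, with the origin at the bottom-left of the image. CharBox and parse_boxes are hypothetical names for this example, not pytesseract functions.

```python
from typing import List, NamedTuple


class CharBox(NamedTuple):
    char: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_boxes(boxes_text: str) -> List[CharBox]:
    """Parse box text (assumed format: 'char left bottom right top page')."""
    boxes = []
    for line in boxes_text.splitlines():
        if not line.strip():
            continue
        # Split from the right so the five numeric fields are always isolated.
        parts = line.rsplit(" ", 5)
        if len(parts) != 6:
            continue  # skip lines that do not match the assumed layout
        char, *numbers = parts
        boxes.append(CharBox(char, *(int(n) for n in numbers)))
    return boxes


if __name__ == "__main__":
    sample = "H 10 20 30 60 0\ni 32 20 40 58 0\n"
    for box in parse_boxes(sample):
        print(box.char, box.right - box.left, box.top - box.bottom)
```

With pytesseract installed, the input would be the string returned by pytesseract.image_to_boxes(Image.open('test.png')).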

Parameters

image_to_data(image, lang=None, config='', nice=0, output_type=Output.STRING, timeout=0, pandas_config=None)

image
- Object or string - either PIL Image, NumPy array or file path of the image to be processed by Tesseract. If you pass object instead of file path, pytesseract will implicitly convert the image to RGB mode.
lang
- String - Tesseract language code string. Defaults to eng if not specified. Multiple languages example: lang='eng+fra'
config
- String - Any additional custom configuration flags that are not available via the pytesseract function.
nice
- Integer - modifies the processor priority for the Tesseract run.
output_type
- Class attribute - specifies the types of the output, defaults to string
timeout
- Integer or Float - duration in seconds of the OCR processing, after which, pytesseract will terminate and raise RuntimeError
pandas_config
- Dict - only for the Output.DATAFRAME type. Dict with custom arguments passed to pandas.read_csv.
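
With the default output_type, image_to_data returns tab-separated text, so the result can be filtered without pandas. The sketch below keeps only words above a confidence threshold. It assumes the TSV header contains the columns left, top, width, height, conf and text; the function name confident_words and the threshold of 60 are arbitrary choices for illustration.

```python
import csv
import io
from typing import Dict, List


def confident_words(tsv_text: str, min_conf: float = 60.0) -> List[Dict[str, object]]:
    """Return word rows whose confidence is at least min_conf (hypothetical helper)."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    words = []
    for row in reader:
        text = (row.get("text") or "").strip()
        conf_raw = row.get("conf")
        if not text or conf_raw is None:
            continue  # structural rows (page, block, line) carry no text
        conf = float(conf_raw)  # conf may be an integer or a decimal string
        if conf >= min_conf:
            words.append(
                {
                    "text": text,
                    "conf": conf,
                    "left": int(row["left"]),
                    "top": int(row["top"]),
                    "width": int(row["width"]),
                    "height": int(row["height"]),
                }
            )
    return words


if __name__ == "__main__":
    header = "level\tpage_num\tblock_num\tpar_num\tline_num\tword_num\tleft\ttop\twidth\theight\tconf\ttext"
    rows = [
        "1\t1\t0\t0\t0\t0\t0\t0\t200\t50\t-1\t",
        "5\t1\t1\t1\t1\t1\t10\t12\t60\t20\t96\tHello",
        "5\t1\t1\t1\t1\t2\t80\t12\t70\t20\t41\tw0rld",
    ]
    print(confident_words("\n".join([header] + rows)))
```

With pytesseract installed, the input would be the string returned by pytesseract.image_to_data(Image.open('test.png')).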

Prerequisites

Python 3.6+
You need Pillow installed
You need Google Tesseract OCR installed, and it must be callable as tesseract from the command line
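
A quick way to check the last prerequisite from Python is to look the executable up on the PATH before calling pytesseract. This is a minimal sketch using only the standard library; find_tesseract and tesseract_version_line are hypothetical helper names, and the 10 second timeout is an arbitrary choice.

```python
import shutil
import subprocess
from typing import Optional


def find_tesseract(command: str = "tesseract") -> Optional[str]:
    """Return the full path of the executable if it is on the PATH, else None."""
    return shutil.which(command)


def tesseract_version_line(command: str = "tesseract") -> Optional[str]:
    """Return the first line printed by '<command> --version', or None if unavailable."""
    path = find_tesseract(command)
    if path is None:
        return None
    result = subprocess.run(
        [path, "--version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # some versions print the banner to stderr
        universal_newlines=True,
        timeout=10,
        check=False,
    )
    output = result.stdout.strip()
    return output.splitlines()[0] if output else None


if __name__ == "__main__":
    if find_tesseract() is None:
        print("tesseract is not on the PATH; set pytesseract.pytesseract.tesseract_cmd instead")
    else:
        print(tesseract_version_line())
```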

Installation

$ pip install pytesseract # Install with pip
$ pip install -U git+https://github.com/madmaze/pytesseract.git # Or install the latest version from GitHub (requires git)
$ git clone https://github.com/madmaze/pytesseract.git # Install from source
$ cd pytesseract && pip install -U .
$ conda install -c conda-forge pytesseract # Install with conda
