pytesseract Library

I want to look into pytesseract, an optical character recognition (OCR) library for Python.

Date Created:
2 510

Related


  • Optical Character Recognition
    • Optical character recognition or optical character reader (OCR) is the electronic or mechanical conversion of images of typed, handwritten or printed text into machine-encoded text, whether from a scanned document, a photo of a document, a scene photo, or from subtitle text superimposed on an image.
    • It is a common way of digitizing printed texts so that they can be electronically edited, searched, stored more compactly, displayed online, and used in machine processes such as cognitive computing, machine translation, text-to-speech, key data and text mining. OCR is a field of research in pattern recognition, artificial intelligence, and computer vision.
    • Early versions needed to be trained with images of each character, and worked on one font at a time. Advanced systems capable of producing a high degree of accuracy for most fonts are now common, and with support for a variety of image file format inputs.
  • Google Tesseract OCR
    • Tesseract 4 adds a new neural net (LSTM) based OCR engine which is focused on line recognition, but it also still supports the legacy Tesseract OCR engine of Tesseract 3, which works by recognizing character patterns.
    • Tesseract was originally developed at Hewlett-Packard Laboratories Bristol UK and at Hewlett-Packard Co, Greeley Colorado USA between 1985 and 1994, with some more changes made in 1996 to port to Windows, and some C++izing in 1998. In 2005 Tesseract was open sourced by HP. From 2006 until November 2018 it was developed by Google.

Command Line Usage:

$ tesseract imagename outputbase [-l lang] [--oem ocrenginemode] [--psm pagesegmode] [configfiles...]
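
The usage line above can be assembled programmatically. The following is a minimal sketch, not part of pytesseract: `build_tesseract_command` is a hypothetical helper, and `test.png` and `out` are placeholder names. It only builds the argument list and does not run Tesseract.

```python
from typing import List, Optional, Sequence


def build_tesseract_command(
    image: str,
    output_base: str,
    lang: Optional[str] = None,
    oem: Optional[int] = None,
    psm: Optional[int] = None,
    config_files: Sequence[str] = (),
) -> List[str]:
    """Assemble the argument list for the usage line above (hypothetical helper)."""
    command = ["tesseract", image, output_base]
    if lang is not None:
        command += ["-l", lang]          # e.g. "eng" or "eng+fra"
    if oem is not None:
        command += ["--oem", str(oem)]   # OCR engine mode
    if psm is not None:
        command += ["--psm", str(psm)]   # page segmentation mode
    command += list(config_files)        # optional config files go last
    return command


if __name__ == "__main__":
    # Placeholder file names; nothing is executed here.
    print(build_tesseract_command("test.png", "out", lang="eng", psm=6))
```

If Tesseract is installed, the returned list could be passed to `subprocess.run(command, check=True)`.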

Notes


Python-tesseract is an optical character recognition (OCR) tool for python. That is, it will recognize and read the text embedded in images. Python-tesseract is a wrapper for Google's Tesseract-OCR Engine. It is also useful as a stand-alone invocation script to tesseract, as it can read all image types supported by the Pillow and Leptonica imaging libraries [...] Additionally, if used as a script, Python-tesseract will print the recognized text instead of writing it to a file.
from PIL import Image
import pytesseract
# If you don't have tesseract executable in your PATH, include the following:
pytesseract.pytesseract.tesseract_cmd = r'<full_path_to_your_tesseract_executable>'
# Example tesseract_cmd = r'C:\Program Files (x86)\Tesseract-OCR\tesseract'

# In order to bypass the image conversions of pytesseract, just use relative or absolute image path
# NOTE: In this case you should provide tesseract supported images or tesseract will return error
print(pytesseract.image_to_string('test.png'))

# Batch processing with a single file containing the list of multiple image file paths
print(pytesseract.image_to_string('images.txt'))

# Timeout/terminate the tesseract job after a period of time
try:
    print(pytesseract.image_to_string('test.jpg', timeout=2)) # Timeout after 2 seconds
    print(pytesseract.image_to_string('test.jpg', timeout=0.5)) # Timeout after half a second
except RuntimeError as timeout_error:
    # Tesseract processing is terminated
    pass

# Get bounding box estimates
print(pytesseract.image_to_boxes(Image.open('test.png')))

# Get verbose data including boxes, confidences, line and page numbers
print(pytesseract.image_to_data(Image.open('test.png')))

# Get information about orientation and script detection
print(pytesseract.image_to_osd(Image.open('test.png')))

# Get a searchable PDF
pdf = pytesseract.image_to_pdf_or_hocr('test.png', extension='pdf')
with open('test.pdf', 'w+b') as f:
    f.write(pdf) # pdf type is bytes by default

# Get HOCR output
hocr = pytesseract.image_to_pdf_or_hocr('test.png', extension='hocr')

# Get ALTO XML output
xml = pytesseract.image_to_alto_xml('test.png')

Functions

  • get_languages
    • Returns all currently supported languages by Tesseract OCR
  • get_tesseract_version
    • Returns the Tesseract version installed in the system
  • image_to_string
    • Returns the unmodified output of Tesseract OCR processing as a string
  • image_to_boxes
    • Returns result containing recognized characters and their box boundaries
  • image_to_data
    • Returns result containing box boundaries, confidences, and other information.
  • image_to_osd
    • Returns result containing information about orientation and script detection.
  • image_to_alto_xml
    • Returns result in the form of Tesseract's ALTO XML format
  • run_and_get_output
    • Returns the raw output from Tesseract OCR. Gives a bit more control over the parameters that are sent to tesseract.
  • run_and_get_multiple_output
    • Returns like run_and_get_output but can handle multiple extensions
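
As an illustration of what the string returned by image_to_boxes can be turned into, here is a minimal parsing sketch. It assumes the usual Tesseract box layout of one character per line followed by left, bottom, right, top and page number, with coordinates measured from the bottom-left corner of the image. `CharBox`, `parse_boxes` and the sample text are hypothetical and not part of pytesseract.

```python
from typing import List, NamedTuple


class CharBox(NamedTuple):
    char: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_boxes(box_text: str) -> List[CharBox]:
    """Parse box-format text (assumed layout: char left bottom right top page)."""
    boxes = []
    for line in box_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Split from the right so the five numeric fields are peeled off first.
        char, left, bottom, right, top, page = line.rsplit(" ", 5)
        boxes.append(
            CharBox(char, int(left), int(bottom), int(right), int(top), int(page))
        )
    return boxes


# Made-up sample in the assumed format, not real OCR output.
SAMPLE_BOXES = "H 10 20 30 50 0\ni 32 20 38 50 0\n"

if __name__ == "__main__":
    for box in parse_boxes(SAMPLE_BOXES):
        print(box.char, box.right - box.left, box.top - box.bottom)
```

With a real image, the input would come from `pytesseract.image_to_boxes(Image.open('test.png'))` instead of the sample string.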

Parameters

image_to_data(image, lang=None, config='', nice=0, output_type=Output.STRING, timeout=0, pandas_config=None)
  • image
    • Object or string - either PIL Image, NumPy array or file path of the image to be processed by Tesseract. If you pass object instead of file path, pytesseract will implicitly convert the image to RGB mode.
  • lang
    • String - Tesseract language code string. Defaults to eng if not specified. Multiple languages example: lang='eng+fra'
  • config
    • String - Any additional custom configuration flags that are not available via the pytesseract function.
  • nice
    • Integer - modifies the processor priority for the Tesseract run.
  • output_type
    • Class attribute - specifies the type of the output, defaults to string. The supported types are the attributes of the pytesseract.Output class.
  • timeout
    • Integer or Float - duration in seconds of the OCR processing, after which, pytesseract will terminate and raise RuntimeError
  • pandas_config
    • Dict - only for the Output.DATAFRAME type. Custom arguments passed to pandas.read_csv when building the DataFrame.
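
With the default output_type, image_to_data returns tab-separated text. The sketch below shows one way such text might be filtered by confidence. It assumes a header row with `conf` and `text` columns, as in Tesseract's TSV output; `confident_words`, the 60.0 threshold and the sample rows are hypothetical.

```python
import csv
import io
from typing import Dict, List


def confident_words(tsv_text: str, min_conf: float = 60.0) -> List[Dict[str, str]]:
    """Keep rows that contain text and meet the confidence threshold."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    words = []
    for row in reader:
        text = (row.get("text") or "").strip()
        # Structural rows (page, block, line) carry no text and a conf of -1.
        if text and float(row["conf"]) >= min_conf:
            words.append(row)
    return words


# Made-up sample in the assumed TSV layout, not real OCR output.
SAMPLE_TSV = (
    "level\tpage_num\tblock_num\tpar_num\tline_num\tword_num"
    "\tleft\ttop\twidth\theight\tconf\ttext\n"
    "1\t1\t0\t0\t0\t0\t0\t0\t200\t60\t-1\t\n"
    "5\t1\t1\t1\t1\t1\t10\t12\t60\t20\t96\tHello\n"
    "5\t1\t1\t1\t1\t2\t80\t12\t70\t20\t41\tw0rld\n"
)

if __name__ == "__main__":
    for word in confident_words(SAMPLE_TSV):
        print(word["text"], word["conf"], word["left"], word["top"])
```

With a real image, the input would come from `pytesseract.image_to_data(Image.open('test.png'))`. Passing `output_type=Output.DICT` or `Output.DATAFRAME` avoids manual parsing altogether.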

Prerequisites

  • Python 3.6+
  • You need Pillow installed
  • You need Google Tesseract OCR installed, and it must be callable as tesseract from the command line
Installation
$ pip install pytesseract # Install with pip
$ pip install -U git+https://github.com/madmaze/pytesseract.git
$ # Install from source
$ git clone https://github.com/madmaze/pytesseract.git
$ cd pytesseract && pip install -U .
$ conda install -c conda-forge pytesseract # Install with conda
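
After installation, the prerequisite that tesseract be callable from the command line can be checked from Python. This is a minimal sketch using only the standard library; `find_tesseract` is a hypothetical helper, not a pytesseract function.

```python
import shutil
from typing import Optional


def find_tesseract(executable: str = "tesseract") -> Optional[str]:
    """Return the full path of the executable if it is on PATH, else None."""
    return shutil.which(executable)


if __name__ == "__main__":
    path = find_tesseract()
    if path is None:
        # Fall back to setting pytesseract.pytesseract.tesseract_cmd by hand.
        print("tesseract was not found on PATH")
    else:
        print(f"tesseract found at {path}")
```

If the check returns None, setting `pytesseract.pytesseract.tesseract_cmd` to the full path of the executable, as in the notes above, is the alternative.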
