Python 3 script to recognize text from an image or screen

This tutorial builds several Python 3 scripts that automate tasks by recognizing text on screen or in an image.

You can download the example code for this tutorial from the repository at https://github.com/al118345/Ejemplo_bot_python/blob/main/ejemplo_lectura_texto_en_imagen.py and read the related article at https://1938.com.es/bot-click-imagen

Installation.

To install this project, you only need the pytesseract and PIL (Pillow) libraries in your environment, plus the Tesseract OCR engine that pytesseract wraps. The Python libraries can be installed with pip: pip install pytesseract pillow

You may also have to install an additional Tesseract language pack. On Mac and Ubuntu this is done through a package manager (Homebrew and apt, respectively).
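A quick way to confirm the setup is a small check script. The sketch below is an illustration, not part of the original repository: the helper names are hypothetical, and it assumes the Tesseract executable is called tesseract and is on PATH. Typical install commands (an assumption, not necessarily the ones the original article listed) are brew install tesseract tesseract-lang on Mac and sudo apt install tesseract-ocr tesseract-ocr-spa on Ubuntu.

```python
import importlib.util
import shutil


def missing_languages(required, installed):
    """Return the required language codes that are not installed, in order."""
    available = set(installed)
    return [code for code in required if code not in available]


def check_environment(required_langs=("eng",)):
    """Collect human-readable problems with the OCR setup (empty list = OK)."""
    problems = []
    # Python packages, normally installed with: pip install pytesseract pillow
    for module in ("pytesseract", "PIL"):
        if importlib.util.find_spec(module) is None:
            problems.append(f"Python module '{module}' is not installed")
    # pytesseract only wraps the Tesseract executable, which must be on PATH.
    if shutil.which("tesseract") is None:
        problems.append("tesseract executable not found on PATH")
    if problems:
        return problems

    import pytesseract  # imported late so the checks above run without it

    installed = pytesseract.get_languages(config="")
    missing = missing_languages(required_langs, installed)
    if missing:
        problems.append("missing language packs: " + ", ".join(missing))
    return problems
```

Calling check_environment(("eng", "spa")) before the first OCR run returns an empty list when everything is in place, or a list of messages describing what is missing.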

Example. Print the text contained in an image.

This example simply prints the text contained in an image to the terminal.

The first part loads the image with the PIL library.

Once the image is loaded, pytesseract reads it and returns its text.
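The two steps described above can be sketched as follows. This is a minimal reconstruction under assumptions, not the repository's exact code: the function names and the clean-up helper are hypothetical, and the imports are placed inside the function only so the pure helper can be used without the OCR stack installed.

```python
def clean_text(raw):
    """Drop the form feed and blank lines that Tesseract often emits."""
    lines = [line.rstrip() for line in raw.replace("\x0c", "").splitlines()]
    return "\n".join(line for line in lines if line)


def read_image_text(path, lang="eng"):
    """Load an image with PIL and return the text pytesseract finds in it."""
    # Late imports: clean_text stays usable without Pillow or pytesseract.
    from PIL import Image
    import pytesseract

    with Image.open(path) as image:  # step 1: load the image
        raw = pytesseract.image_to_string(image, lang=lang)  # step 2: OCR
    return clean_text(raw)
```

For example, print(read_image_text("captura.png", lang="spa")) would print the Spanish text found in a file named captura.png (a placeholder name).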

Example. Show the text contained in a part of the screen

This code is very similar to the previous one: you simply capture the portion of the screen you want to read with PIL's ImageGrab, passing the coordinates of the rectangle.

Once the region is captured, we extract its text in the same way as before.
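A minimal sketch of the screen version, assuming the region is known as a top-left corner plus a width and height. ImageGrab.grab expects a bounding box of (left, top, right, bottom), so the hypothetical helper to_bbox does that conversion. ImageGrab support depends on the operating system and display server, so treat this as an illustration rather than the original code.

```python
def to_bbox(left, top, width, height):
    """Convert an x/y/width/height rectangle to (left, top, right, bottom)."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    return (left, top, left + width, top + height)


def read_screen_text(left, top, width, height, lang="eng"):
    """Grab a screen rectangle and return the text Tesseract reads in it."""
    # Late imports: to_bbox works without Pillow or pytesseract installed.
    from PIL import ImageGrab
    import pytesseract

    region = ImageGrab.grab(bbox=to_bbox(left, top, width, height))
    return pytesseract.image_to_string(region, lang=lang)
```

Passing width and height instead of the right and bottom edges is only a convenience; many screenshot tools report regions that way, and mixing the two conventions is a common cause of empty or shifted crops.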

How to improve OCR accuracy

Tesseract works much better when the input image is clean. Before sending the image to OCR, crop only the useful region, increase contrast, avoid very small fonts and remove visual noise when possible. If the source is a screenshot, capturing a smaller rectangle is usually more reliable than processing the full screen.
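The clean-up steps above could look like the sketch below, which uses Pillow's ImageOps. The function names and the default scale factor of 2 are assumptions chosen for illustration; the right factor depends on how small the fonts are in your images.

```python
def scaled_size(width, height, factor):
    """Return the (width, height) obtained by scaling both sides by factor."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    return (max(1, round(width * factor)), max(1, round(height * factor)))


def preprocess(image, crop_box=None, factor=2):
    """Crop, convert to grayscale, stretch contrast and enlarge a PIL image."""
    from PIL import ImageOps  # late import: scaled_size needs no Pillow

    if crop_box is not None:
        image = image.crop(crop_box)  # (left, top, right, bottom)
    image = ImageOps.grayscale(image)
    image = ImageOps.autocontrast(image)
    return image.resize(scaled_size(image.width, image.height, factor))
```

The returned image can be passed directly to pytesseract.image_to_string. It is worth comparing the output with and without each step on real samples, since preprocessing that helps one screenshot can hurt another.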

Language selection also matters. The lang parameter should match the expected text. If you are reading Spanish text, install and use spa; for English text, use eng. Mixed-language screenshots can work, but they tend to produce more errors and should be tested with real examples.
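As an illustration of the lang parameter, the sketch below builds the value from a list of language codes. Tesseract accepts several languages joined with a plus sign, for example spa+eng; the helper names are hypothetical.

```python
def lang_argument(codes):
    """Join Tesseract language codes, e.g. ["spa", "eng"] -> "spa+eng"."""
    cleaned = [code.strip() for code in codes if code.strip()]
    if not cleaned:
        raise ValueError("at least one language code is required")
    return "+".join(cleaned)


def read_with_languages(image, codes):
    """Run OCR on a PIL image using one or more installed language packs."""
    import pytesseract  # late import: lang_argument needs no OCR stack

    return pytesseract.image_to_string(image, lang=lang_argument(codes))
```

Every code in the list must correspond to an installed language pack, otherwise Tesseract reports an error instead of returning text.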

2026 note for scanned PDFs and historical documents

This tutorial is intentionally simple because it focuses on one image or a screen region. If the source is a scanned PDF, start with a document OCR pipeline instead. OCRmyPDF can add a searchable text layer and apply preprocessing such as page rotation, deskewing and cleaning before OCR. A practical first pass is to generate a searchable PDF plus a sidecar text file; then inspect low-quality pages separately.
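A first pass of that kind could be driven from Python by calling the OCRmyPDF command line, as sketched below. It assumes the ocrmypdf executable is installed and on PATH; the function names and file names are placeholders.

```python
import subprocess


def ocrmypdf_command(input_pdf, output_pdf, sidecar_txt, lang="eng"):
    """Build an OCRmyPDF call that rotates, deskews and writes a sidecar file."""
    return [
        "ocrmypdf",
        "-l", lang,                     # Tesseract language code(s)
        "--rotate-pages",               # fix pages scanned sideways
        "--deskew",                     # straighten slightly tilted pages
        "--sidecar", str(sidecar_txt),  # plain-text copy of the OCR result
        str(input_pdf),
        str(output_pdf),
    ]


def make_searchable(input_pdf, output_pdf, sidecar_txt, lang="eng"):
    """Run OCRmyPDF; raises CalledProcessError if it exits with an error."""
    command = ocrmypdf_command(input_pdf, output_pdf, sidecar_txt, lang)
    subprocess.run(command, check=True)
```

The --clean option is left out on purpose because it relies on the separate unpaper program; it can be added to the list once that tool is installed.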

For difficult pages, run Tesseract page by page with different page segmentation modes and keep TSV or hOCR output. Plain text is enough for a quick read, but TSV/hOCR gives coordinates and confidence values, which helps detect pages that need manual review. Historical documents with tears, stamps, handwriting or uneven typewriting will still need human correction after OCR.
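To show how the confidence values can be used, the sketch below flags low-confidence words from pytesseract's word-level output. The threshold of 60 and the default page segmentation mode 6 are assumptions to tune against real pages, and the function names are hypothetical.

```python
def low_confidence_words(data, threshold=60.0):
    """Return (text, confidence, box) for recognized words below threshold.

    `data` is the dict produced by pytesseract.image_to_data with
    output_type=Output.DICT. Rows with confidence -1 describe layout
    elements (page, block, paragraph, line) rather than words.
    """
    flagged = []
    for i, text in enumerate(data["text"]):
        conf = float(data["conf"][i])  # may arrive as int, float or str
        if conf < 0 or not text.strip():
            continue
        if conf < threshold:
            box = (data["left"][i], data["top"][i],
                   data["width"][i], data["height"][i])
            flagged.append((text, conf, box))
    return flagged


def review_page(image, lang="eng", psm=6, threshold=60.0):
    """OCR one page image and return the words that need manual review."""
    import pytesseract  # late imports: the filter above needs no OCR stack
    from pytesseract import Output

    data = pytesseract.image_to_data(
        image, lang=lang, config=f"--psm {psm}", output_type=Output.DICT
    )
    return low_confidence_words(data, threshold)
```

Running review_page with a few different psm values on the same page and keeping the run with the fewest flagged words is one simple way to compare segmentation modes.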

When OCR is not the right tool

If the text is available in HTML, JSON, CSV or an API response, read that source directly. OCR is useful when the text only exists as pixels: screenshots, scanned documents, legacy software windows or images generated by another system.

For automation flows, OCR pairs naturally with image-click detection. You can first locate a region on screen and then read the text inside it, but keep a manual validation step if the result will trigger payments, messages or irreversible actions.

In real projects, store a few failed screenshots and review them before changing the script. Most OCR errors come from input quality, not from the Python wrapper itself, so a small test set is often more valuable than adding more conditional code.
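A small test set of that kind can be checked automatically. The sketch below is one possible shape, using the standard library's difflib to score how close the OCR output is to the expected text; the names and the 0.9 minimum are assumptions.

```python
from difflib import SequenceMatcher


def similarity(expected, actual):
    """Similarity between 0 and 1, ignoring differences in whitespace."""
    a = " ".join(expected.split())
    b = " ".join(actual.split())
    return SequenceMatcher(None, a, b).ratio()


def failing_cases(cases, recognize, minimum=0.9):
    """Return (name, score) for every case whose OCR output scores too low.

    `cases` maps an image path to its expected text; `recognize` is any
    function that turns an image path into text.
    """
    failures = []
    for name, expected in cases.items():
        score = similarity(expected, recognize(name))
        if score < minimum:
            failures.append((name, round(score, 2)))
    return failures
```

Because recognize is passed in as a function, the same check can compare two preprocessing variants on the stored screenshots before the script is changed.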

Recommended OCR pipeline

A reliable OCR workflow usually has four steps. First, capture the smallest region that contains the text. Second, normalize the image: resize when fonts are too small, convert to grayscale and increase contrast if the text blends into the background. Third, run Tesseract with the correct language and page segmentation mode. Finally, validate the result before using it in another automated action.
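The four steps can be sketched as a single function that receives each step as a callable. This structure is an assumption made for illustration, not the article's original code; it keeps the steps separate so each one can be replaced or tested on its own.

```python
def run_pipeline(capture, normalize, recognize, validate):
    """Run capture -> normalize -> recognize -> validate and report the result.

    Each argument is a function, so the real steps (ImageGrab, ImageOps,
    pytesseract) can be swapped for fakes while testing.
    """
    image = capture()             # 1. smallest region that contains the text
    prepared = normalize(image)   # 2. resize, grayscale, contrast
    text = recognize(prepared)    # 3. Tesseract with language and PSM
    accepted = validate(text)     # 4. check before acting on the value
    return {"text": text, "accepted": bool(accepted)}
```

With the real libraries, capture could wrap ImageGrab.grab(bbox=...) and recognize could wrap pytesseract.image_to_string(img, lang="spa", config="--psm 7"), where page segmentation mode 7 treats the image as a single line of text.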

This last validation step is important for bots. OCR output can contain missing accents, confused characters or line breaks in unexpected places. If the extracted value will be used to click, send a message or update a record, add a confidence check or a human review path.
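One way to implement such a check is to combine a format rule with a confidence threshold, as in the sketch below. The function name, the regular expression passed by the caller and the default threshold of 80 are all assumptions; the confidence value is expected to come from word-level OCR output.

```python
import re


def validate_value(text, pattern, confidence, min_confidence=80.0):
    """Return the cleaned value if it is trustworthy, otherwise None.

    None means: do not act automatically, send the case to human review.
    """
    candidate = " ".join(text.split())  # collapse stray line breaks and spaces
    if confidence < min_confidence:
        return None
    if re.fullmatch(pattern, candidate) is None:
        return None
    return candidate
```

For an amount such as 1234,50 the pattern could be r"\d+,\d{2}", so a misread like 12E4,50 is rejected even when Tesseract reports a high confidence.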

Common errors and how to debug them

Empty output usually means the crop is wrong or the text is too small.
Wrong characters often point to contrast, language pack or font problems.
Slow execution is normally caused by sending images that are larger than necessary.
Different results between machines can come from different Tesseract versions or missing language packages.

For a complete automation flow, combine this article with the Python image-click bot and the keyboard and mouse automation example. OCR reads the screen; those tutorials explain how to act on the result.

Keep the OCR step observable: save the cropped image, the detected language and the raw output while testing, so every wrong decision can be traced back to the exact input that produced it.
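A minimal sketch of that idea, assuming the cropped image is a PIL image (or any object with a compatible save method). The function name, the ocr_debug folder and the JSON layout are hypothetical choices.

```python
import json
import time
from pathlib import Path


def save_debug_artifacts(image, lang, raw_text, directory="ocr_debug"):
    """Save the cropped image plus a JSON record of language and raw output."""
    folder = Path(directory)
    folder.mkdir(parents=True, exist_ok=True)
    stem = str(time.time_ns())  # timestamp keeps image and record paired
    image_path = folder / f"{stem}.png"
    image.save(image_path)
    record = {"image": image_path.name, "lang": lang, "raw_text": raw_text}
    record_path = folder / f"{stem}.json"
    record_path.write_text(
        json.dumps(record, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return record_path
```

Storing the raw output before any clean-up makes it possible to tell whether a wrong value came from Tesseract itself or from later processing.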