Python 3 script to recognize text from an image or screen

The goal of this tutorial is to create several scripts that automate tasks by recognizing text on the screen or in an image using Python 3.

You can download the example code for this tutorial from https://github.com/al118345/Ejemplo_bot_python/blob/main/ejemplo_lectura_texto_en_imagen.py and read the related article at https://1938.com.es/bot-click-imagen

Installation.

To install this project you only need the pytesseract and PIL libraries in your environment. Both can be installed with pip (PIL is distributed as the Pillow package).
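The exact command is not reproduced in this copy of the article, so the one in the comment below is an assumption based on the standard PyPI package names. Keep in mind that pytesseract is only a wrapper: the Tesseract program itself has to be installed separately. A minimal sketch for checking the environment, where missing_requirements is a hypothetical helper name:

```python
# Assumed install command (standard PyPI package names):
#   python3 -m pip install pytesseract pillow
import importlib.util
import shutil


def missing_requirements():
    """Return the names of the OCR requirements that cannot be found."""
    missing = []
    # Pillow is imported as "PIL"; pytesseract wraps the tesseract binary.
    for module in ("PIL", "pytesseract"):
        if importlib.util.find_spec(module) is None:
            missing.append(module)
    # pytesseract calls the "tesseract" executable, so it must be on PATH.
    if shutil.which("tesseract") is None:
        missing.append("tesseract (binary)")
    return missing


print("Missing:", missing_requirements() or "nothing")
```

If the list is empty, the examples in the rest of the tutorial should be able to run.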

You may also need to install an additional language pack for Tesseract. On Mac and Ubuntu this is done from the system package manager.
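The original commands are not reproduced here, so the ones in the comments below are assumptions based on the usual Homebrew and apt package names, with Spanish as the example language. The sketch then asks Tesseract which languages it can see. The helper names are hypothetical:

```python
# Assumed system commands (Spanish pack as an example):
#   macOS (Homebrew):  brew install tesseract tesseract-lang
#   Ubuntu (apt):      sudo apt-get install tesseract-ocr tesseract-ocr-spa
import shutil
import subprocess


def parse_language_list(output):
    """Extract language codes from the text printed by `tesseract --list-langs`."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    # The first line is a header such as 'List of available languages ... (3):'.
    return [line for line in lines if not line.lower().startswith("list of")]


def installed_languages():
    """Return the installed language codes, or [] if tesseract is not on PATH."""
    executable = shutil.which("tesseract")
    if executable is None:
        return []
    result = subprocess.run(
        [executable, "--list-langs"], capture_output=True, text=True, check=False
    )
    # Some older releases may print the list on stderr instead of stdout.
    return parse_language_list(result.stdout or result.stderr)


print(installed_languages())
```

If the code you need (for example spa) is not in the printed list, the language pack is still missing.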

Example. Display on screen the text contained in an image.

The following code simply prints the text contained in an image to the terminal.

The first part loads the image with the PIL library.

Once loaded, we simply use pytesseract to read the image and get its text.
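The two steps described above can be sketched as follows. This is not necessarily the code from the repository. The function names and the file name example.png are placeholders chosen for illustration:

```python
from PIL import Image


def load_image(path):
    """Open an image file with PIL and return it fully loaded in RGB mode."""
    with Image.open(path) as image:
        # convert() reads the pixel data and returns a new image, so the
        # result stays usable after the file is closed.
        return image.convert("RGB")


def read_text_from_image(path, lang="eng"):
    """Return the text that Tesseract recognizes in the image at `path`."""
    # Imported here so load_image() works even before Tesseract is installed.
    import pytesseract

    return pytesseract.image_to_string(load_image(path), lang=lang)


# Usage, with a placeholder file name:
# print(read_text_from_image("example.png", lang="spa"))
```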

Example. Show the text contained in a part of the screen

This code is very similar to the previous one. You only need to capture the portion of the screen you want to read with PIL's ImageGrab, passing the coordinates of the rectangle.

Once the capture is available, we extract the text from the selected region.
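A minimal sketch of this screen example, assuming the rectangle is given as a top-left corner plus a width and height. ImageGrab.grab expects a bbox of (left, top, right, bottom), so a small hypothetical helper does the conversion. Screen capture depends on the operating system and display server, so treat the capture call as something to verify on your machine:

```python
def region_to_bbox(left, top, width, height):
    """Convert a corner plus size into the (left, top, right, bottom) box PIL expects."""
    return (left, top, left + width, top + height)


def read_text_from_screen(left, top, width, height, lang="eng"):
    """Capture a rectangle of the screen and return the text found in it."""
    # Imported here because screen capture is not available on every platform.
    from PIL import ImageGrab
    import pytesseract

    screenshot = ImageGrab.grab(bbox=region_to_bbox(left, top, width, height))
    return pytesseract.image_to_string(screenshot, lang=lang)


# Usage, with placeholder coordinates:
# print(read_text_from_screen(100, 200, 300, 50, lang="spa"))
```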

How to improve OCR accuracy

Tesseract works much better when the input image is clean. Before sending the image to OCR, crop only the useful region, increase contrast, avoid very small fonts and remove visual noise when possible. If the source is a screenshot, capturing a smaller rectangle is usually more reliable than processing the full screen.
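The clean-up steps mentioned above can be sketched with PIL alone. The function name prepare_for_ocr and the default scale factor of 2 are assumptions for illustration. The right values depend on your images:

```python
from PIL import Image, ImageOps


def prepare_for_ocr(image, box=None, scale=2):
    """Crop, convert to grayscale, stretch contrast and enlarge a PIL image."""
    if box is not None:
        # Keep only the useful region: (left, top, right, bottom).
        image = image.crop(box)
    gray = image.convert("L")
    # autocontrast stretches the histogram so dark text stands out more.
    gray = ImageOps.autocontrast(gray)
    if scale != 1:
        # Enlarging helps when the font is very small.
        new_size = (gray.width * scale, gray.height * scale)
        gray = gray.resize(new_size, Image.LANCZOS)
    return gray
```

The returned image can be passed to pytesseract.image_to_string in place of the original one.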

Language selection also matters. The lang parameter should match the expected text. If you are reading Spanish text, install and use spa; for English text, use eng. Mixed-language screenshots can work, but they tend to produce more errors and should be tested with real examples.
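As a small illustration of the lang parameter, the sketch below builds the value to pass to pytesseract and refuses to continue when a language pack is missing. Tesseract accepts several codes joined with a plus sign, such as eng+spa. The helper name lang_argument is hypothetical:

```python
def lang_argument(wanted, available):
    """Build the `lang` string for pytesseract, e.g. ['eng', 'spa'] -> 'eng+spa'."""
    missing = [code for code in wanted if code not in available]
    if missing:
        raise ValueError("Language packs not installed: " + ", ".join(missing))
    # Tesseract accepts several languages joined with '+'.
    return "+".join(wanted)


# Usage sketch (needs Tesseract; `image` is any PIL image):
# import pytesseract
# lang = lang_argument(["spa"], pytesseract.get_languages())
# text = pytesseract.image_to_string(image, lang=lang)
```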

When OCR is not the right tool

If the text is available in HTML, JSON, CSV or an API response, read that source directly. OCR is useful when the text only exists as pixels: screenshots, scanned documents, legacy software windows or images generated by another system.

For automation flows, OCR pairs naturally with image-click detection. You can first locate a region on screen and then read the text inside it, but keep a manual validation step if the result will trigger payments, messages or irreversible actions.

In real projects, store a few failed screenshots and review them before changing the script. Most OCR errors come from input quality, not from the Python wrapper itself, so a small test set is often more valuable than adding more conditional code.
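One possible way to keep such a test set is a folder where each screenshot name.png sits next to a name.txt file with the text a human read from it. That layout and the function name run_test_set are assumptions, not something the original project defines. The OCR function is passed in as a parameter so the check can be rerun after every change:

```python
from pathlib import Path


def run_test_set(folder, ocr):
    """Compare OCR output with the expected text stored next to each screenshot.

    `ocr` is any function that takes an image path and returns a string.
    Returns a list of (file name, expected, actual) for the cases that differ.
    """
    failures = []
    for image_path in sorted(Path(folder).glob("*.png")):
        expected = image_path.with_suffix(".txt").read_text(encoding="utf-8").strip()
        actual = ocr(image_path).strip()
        if actual != expected:
            failures.append((image_path.name, expected, actual))
    return failures
```

An empty list means every stored screenshot is still read correctly.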

Recommended OCR pipeline

A reliable OCR workflow usually has four steps. First, capture the smallest region that contains the text. Second, normalize the image: resize when fonts are too small, convert to grayscale and increase contrast if the text blends into the background. Third, run Tesseract with the correct language and page segmentation mode. Finally, validate the result before using it in another automated action.
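The four steps can be sketched as one function. The names, the default page segmentation mode 7 (a single line of text) and the use of a regular expression for validation are all assumptions for illustration. The OCR call is passed in as a parameter so the flow can be tried without Tesseract:

```python
import re

from PIL import Image, ImageOps


def normalize(image, scale=2):
    """Grayscale, stretch contrast and enlarge the image."""
    gray = ImageOps.autocontrast(image.convert("L"))
    return gray.resize((gray.width * scale, gray.height * scale), Image.LANCZOS)


def tesseract_ocr(image, lang="eng", psm=7):
    """Run Tesseract on a PIL image with a language and page segmentation mode."""
    import pytesseract  # imported here so the rest works without Tesseract

    return pytesseract.image_to_string(image, lang=lang, config=f"--psm {psm}")


def run_pipeline(image, box, ocr, pattern):
    """Return the recognized text, or None when it does not match `pattern`."""
    region = image.crop(box)        # 1. capture the smallest useful region
    prepared = normalize(region)    # 2. normalize the image
    text = ocr(prepared).strip()    # 3. run OCR
    if re.fullmatch(pattern, text) is None:  # 4. validate before using it
        return None
    return text


# Usage sketch: read a price such as "42,50" from a region of a screenshot.
# value = run_pipeline(screenshot, (0, 0, 200, 40), tesseract_ocr, r"\d+,\d{2}")
```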

This last validation step is important for bots. OCR output can contain missing accents, confused characters or line breaks in unexpected places. If the extracted value will be used to click, send a message or update a record, add a confidence check or a human review path.
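A confidence check can be sketched on top of pytesseract.image_to_data, which reports a confidence value per detected word. The threshold of 60 and the function name are assumptions. Tune the threshold on your own screenshots:

```python
def split_by_confidence(data, min_conf=60.0):
    """Split detected words into (accepted, doubtful) lists.

    `data` is the dictionary returned by pytesseract.image_to_data with
    output_type=pytesseract.Output.DICT (it has 'text' and 'conf' lists).
    """
    accepted, doubtful = [], []
    for text, conf in zip(data["text"], data["conf"]):
        word = text.strip()
        score = float(conf)  # depending on the version, conf may be str, int or float
        if not word or score < 0:
            continue  # rows without a word carry a confidence of -1
        (accepted if score >= min_conf else doubtful).append(word)
    return accepted, doubtful


# Usage sketch (needs Tesseract; `image` is any PIL image):
# import pytesseract
# data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)
# accepted, doubtful = split_by_confidence(data)
# if doubtful:
#     print("Needs human review:", doubtful)
```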

Common errors and how to debug them

  • Empty output usually means the crop is wrong or the text is too small.
  • Wrong characters often point to contrast, language pack or font problems.
  • Slow execution is normally caused by sending images that are larger than necessary.
  • Different results between machines can come from different Tesseract versions or missing language packages.

For a complete automation flow, combine this article with the Python image-click bot and the keyboard and mouse automation example. OCR reads the screen; those tutorials explain how to act on the result.

Keep the OCR step observable: save the cropped image, the detected language and the raw output while testing, so every wrong decision can be traced back to the exact input that produced it.
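One way to keep that trace, sketched under the assumption that a folder of PNG files with matching JSON records is enough for your project. The function name and record fields are hypothetical:

```python
import json
import time
from pathlib import Path


def save_ocr_trace(folder, image, lang, raw_text):
    """Store the cropped image plus a JSON record of the language and raw output."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    stem = str(time.time_ns())  # timestamp used as a file name for each trace
    image_path = folder / f"{stem}.png"
    image.save(image_path)  # `image` is the PIL crop that was sent to Tesseract
    record = {"image": image_path.name, "lang": lang, "raw_text": raw_text}
    record_path = folder / f"{stem}.json"
    record_path.write_text(
        json.dumps(record, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return record_path
```

Calling it right after each OCR call while testing leaves one image and one record per decision to review later.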
