Python 3 bot: detect images on screen and click them automatically

This tutorial continues the project at https://1938.com.es/bot-click-python and combines click-based task automation with on-screen image search in Python 3.

You can download the repository with the example code for the tutorial at https://github.com/al118345/Ejemplo_bot_python.

Installation

To install this project, you only need the pynput and screen-search libraries in your environment. Once you have downloaded the project, install both of them, for example with pip.

If you use Linux with Python 3.7 and the error "NOTE: You must install tkinter on Linux to use MouseInfo. Run the following: sudo apt-get install python3-tk python3-dev" appears, also run the command the message suggests: sudo apt-get install python3-tk python3-dev.

Note that if your Python version is 3.8, the package to install is python3.8-tk. If you install a different one, an error will appear.
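Before running the bot, it can help to confirm that the modules are actually importable. The following is a minimal sketch, not part of the original project: the function name missing_modules is hypothetical, and the module names "pynput" and "tkinter" are assumed to be the import names behind the packages mentioned above.

```python
import importlib.util

# Assumed import names: "pynput" for the pynput package and
# "tkinter" for the python3-tk system package.
REQUIRED_MODULES = ("pynput", "tkinter")


def missing_modules(names=REQUIRED_MODULES):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are importable.")
```

If tkinter shows up as missing on Linux, that points to the python3-tk (or python3.8-tk) package described above.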

Find an image on screen and click it

The script searches for an image on the screen and, once it finds it, clicks on it.

It is built around an infinite loop that searches for the image on the screen every 150 seconds.
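As an illustration of that loop, here is a minimal sketch. It is not the repository's code: poll, search and on_found are hypothetical names, and max_iterations and sleep are only parameters so the loop can be exercised without waiting or touching the screen.

```python
import time


def poll(search, on_found, interval_seconds=150, max_iterations=None,
         sleep=time.sleep):
    """Call search() every interval_seconds and on_found(pos) on each match.

    search() is expected to return [-1, -1] when nothing is found.
    max_iterations=None keeps the loop infinite, as in the tutorial.
    """
    iterations = 0
    hits = 0
    while max_iterations is None or iterations < max_iterations:
        pos = search()
        if pos[0] != -1:
            on_found(pos)
            hits += 1
        iterations += 1
        sleep(interval_seconds)  # wait before the next search
    return hits
```

In a real run, search would wrap the image search call and on_found would perform the click.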

Searching the screen is simple: imagesearch() looks for the image you specify. If it does not find it, it returns [-1, -1]; otherwise, it returns the coordinates of the match.
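The [-1, -1] convention is easy to get wrong in calling code, so a small wrapper can make it explicit. This is a sketch under assumptions: locate is a hypothetical helper, and search stands for whatever imagesearch callable the installed library provides. It is passed in as a parameter because the exact import path is not shown in this text.

```python
NOT_FOUND = (-1, -1)


def locate(template_path, search):
    """Return (x, y) if search(template_path) found the image, else None.

    search is assumed to follow the convention described above:
    it returns [-1, -1] when the image is not on screen.
    """
    pos = search(template_path)
    x, y = pos[0], pos[1]
    if (x, y) == NOT_FOUND:
        return None
    return (x, y)


# Usage with a stand-in search function (a real one would scan the screen).
def fake_search(path):
    return [300, 120] if path == "button.png" else [-1, -1]


print(locate("button.png", fake_search))   # coordinates as a tuple
print(locate("missing.png", fake_search))  # None
```

Returning None instead of a sentinel pair lets the caller write a plain "if pos is None" check.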

Finally, we click on the coordinates returned by the search.
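A click at given coordinates can be sketched with pynput's mouse Controller. The helper name click_at and its mouse and button parameters are assumptions added here so the function can be tried without a display. The repository's own function may differ.

```python
def click_at(x, y, mouse=None, button=None):
    """Move the pointer to (x, y) and click once with the left button."""
    if mouse is None or button is None:
        # Imported lazily so the function can be tested without a display.
        from pynput.mouse import Button, Controller
        mouse = mouse if mouse is not None else Controller()
        button = button if button is not None else Button.left
    mouse.position = (int(x), int(y))  # pynput expects integer pixels
    mouse.click(button, 1)             # one single click
    return mouse.position
```

Called as click_at(pos[0], pos[1]) with the result of the search, it moves the real pointer, so try it on a harmless window first.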

Reliability limits

Image-based automation works best when the interface is stable: same resolution, same theme, same zoom level and predictable window position. If the button changes color, if the browser scales the page, or if the operating system applies a different display density, the screenshot search may stop matching.

For that reason, this approach should be treated as a practical desktop automation technique, not as a substitute for a real API. When an API exists, prefer the API. Use image recognition only when the target application does not expose a reliable integration point.

Safety checklist

Add a visible stop condition or keyboard interrupt before leaving the loop running.
Use a reasonable delay to avoid uncontrolled clicking.
Test with harmless windows before automating production software.
Keep screenshots small and specific so false positives are less likely.
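The first two items of the checklist can be sketched as a loop that stops on a flag or on Ctrl+C and always waits between iterations. The names run_until_stopped and step are hypothetical. threading.Event is used here as one possible stop condition.

```python
import threading


def run_until_stopped(step, stop_event, delay_seconds=1.0):
    """Run step() repeatedly until stop_event is set or Ctrl+C is pressed."""
    runs = 0
    try:
        while not stop_event.is_set():
            step()
            runs += 1
            # wait() acts as the delay and returns early if a stop is requested
            stop_event.wait(delay_seconds)
    except KeyboardInterrupt:
        stop_event.set()
    return runs
```

Another thread, for example a keyboard listener, can call stop_event.set() to end the loop cleanly.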

If the next step is reading text from the screen, continue with the OCR tutorial with Python and Tesseract.

A useful rule while testing is to log every detected match with its coordinates and timestamp. If the bot clicks in the wrong place, those logs make it much easier to decide whether the problem was the template image, the screen scale, the wait time or the application state. Keep those logs only for debugging and remove sensitive screenshots when they are no longer needed.
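One possible shape for such a log line, using only the standard library, is sketched below. The logger name "image_bot" and the helpers format_match and log_match are illustrative choices, not something the original project defines.

```python
import logging
from datetime import datetime, timezone

logger = logging.getLogger("image_bot")


def format_match(template, pos, when=None):
    """Build one log line with timestamp, template name and coordinates."""
    when = when or datetime.now(timezone.utc)
    return f"{when.isoformat()} match template={template} x={pos[0]} y={pos[1]}"


def log_match(template, pos, when=None):
    """Log a detected match and return the line that was written."""
    line = format_match(template, pos, when)
    logger.info(line)
    return line


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    log_match("button.png", (300, 120))
```

Logging the template name alongside the coordinates helps when several template images are in use.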

How to make the bot easier to maintain

A practical improvement is to separate the automation into three small functions: capture the screen, decide whether the target image is present and execute the click. When those steps are independent, you can test the recognition logic with saved screenshots before allowing the bot to move the mouse. This reduces the risk of accidental clicks while you are tuning the template image.
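The three-function split can be sketched as follows. To keep the example self-contained, screens and templates are plain nested lists of pixel values and the matching is exact. This is a deliberate simplification standing in for the library's image search, and all function names are hypothetical.

```python
def find_target(screen, template):
    """Decide: return (x, y) of the first exact match of template, else None.

    screen and template are lists of rows, each row a list of pixel values.
    """
    th, tw = len(template), len(template[0])
    for y in range(len(screen) - th + 1):
        for x in range(len(screen[0]) - tw + 1):
            if all(screen[y + dy][x:x + tw] == template[dy] for dy in range(th)):
                return (x, y)
    return None


def run_once(capture, template, click):
    """Capture the screen, look for the template and click only on a match."""
    screen = capture()
    pos = find_target(screen, template)
    if pos is not None:
        click(pos)
    return pos


# Testing the recognition step with a saved "screenshot", no mouse involved.
saved_screen = [
    [0, 0, 0, 0],
    [0, 1, 2, 0],
    [0, 3, 4, 0],
]
print(find_target(saved_screen, [[1, 2], [3, 4]]))
```

Because capture and click are passed in, the decision step can be checked against saved screens before any real click is wired up.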

It is also useful to add a maximum number of attempts and a timeout. Infinite loops are convenient for a first prototype, but production automation needs a clear end state: target found, target not found, user cancelled or unexpected screen. Those states make logs easier to read and prevent the script from running for hours without doing useful work.
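A bounded search with explicit end states might look like the sketch below. The Outcome enum, search_with_limits and the cancelled and screen_ok callbacks are hypothetical names chosen for this example. The clock and sleep parameters exist only to make the function testable.

```python
import time
from enum import Enum


class Outcome(Enum):
    FOUND = "target found"
    NOT_FOUND = "target not found"
    CANCELLED = "user cancelled"
    UNEXPECTED_SCREEN = "unexpected screen"


def search_with_limits(search, max_attempts=10, timeout_seconds=60.0,
                       delay_seconds=1.0, cancelled=lambda: False,
                       screen_ok=lambda: True,
                       clock=time.monotonic, sleep=time.sleep):
    """Search until found, cancelled, out of attempts or out of time."""
    deadline = clock() + timeout_seconds
    for _ in range(max_attempts):
        if cancelled():
            return Outcome.CANCELLED, None
        if not screen_ok():
            return Outcome.UNEXPECTED_SCREEN, None
        pos = search()
        if pos[0] != -1:
            return Outcome.FOUND, (pos[0], pos[1])
        if clock() >= deadline:
            break  # timeout reached before using all attempts
        sleep(delay_seconds)
    return Outcome.NOT_FOUND, None
```

Logging the returned Outcome value gives each run a single, readable final state.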

If the target application changes often, keep several template images and test them with a confidence threshold. A single screenshot can be too brittle when the interface has dark mode, hover states, translations or different DPI settings.
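Trying several templates against a threshold can be sketched like this. The function best_match and the score_template callback, assumed to return a (confidence, (x, y)) pair for a template path, are illustrative. How the confidence is computed depends on the image library in use.

```python
def best_match(templates, score_template, threshold=0.8):
    """Return (path, confidence, pos) for the best template at or above
    threshold, or None if no template is confident enough."""
    best = None
    for path in templates:
        confidence, pos = score_template(path)
        if confidence >= threshold and (best is None or confidence > best[1]):
            best = (path, confidence, pos)
    return best


# Stand-in scores for a light theme, dark theme and hover state template.
scores = {
    "light.png": (0.62, (10, 10)),
    "dark.png": (0.91, (12, 11)),
    "hover.png": (0.85, (12, 11)),
}
print(best_match(list(scores), scores.get, threshold=0.8))
```

Keeping the winning template path in the result makes it clear which variant of the interface was on screen.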

When the bot becomes part of a repeated workflow, add a configuration file for image paths, delays and maximum attempts. Hard-coded values are fine in a first lesson, but configuration makes the same script reusable across environments without editing source code each time the screen changes.
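A configuration file of that kind could be a small JSON document loaded into a dataclass, as in the sketch below. The field names image_paths, delay_seconds and max_attempts and the defaults are assumptions for illustration. The 150-second default mirrors the interval used earlier in this tutorial.

```python
import json
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class BotConfig:
    image_paths: list = field(default_factory=list)
    delay_seconds: float = 150.0
    max_attempts: int = 10


def load_config(text):
    """Parse JSON text into a BotConfig; unknown keys raise TypeError."""
    return BotConfig(**json.loads(text))


def load_config_file(path):
    """Read a JSON file from disk and parse it with load_config."""
    return load_config(Path(path).read_text(encoding="utf-8"))


example = '{"image_paths": ["ok_button.png"], "delay_seconds": 5}'
print(load_config(example))
```

Rejecting unknown keys is a deliberate choice here: a misspelled setting fails loudly instead of being silently ignored.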