Do you have a stack of old, handwritten family recipes lying around that you've been meaning to digitise? Are you curious about what's happening at the intersection of AI agents and computer vision?
In this project, I built an OCR agent that can not only read handwritten recipes but also convert their measurements from imperial to metric. Built on a ReAct agent architecture, the agent has access to a number of tools, including:
- A tool that can extract text from an image using a vision-language model;
- A tool that can preprocess the image, if required, using common techniques such as denoising, deskewing and upscaling;
- A number of tools for converting units, including temperature (Fahrenheit to Celsius), length (inches to cm) and volume-to-weight (cups to grams, which depends on the ingredient's density).
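To give a feel for the conversion tools above, here is a minimal sketch of simplified versions of them. In the project these would be wrapped as LangChain/LangGraph tools so the ReAct agent can call them; the function names, signatures and the density parameter here are my own assumptions, not the repo's actual code.

```python
# Illustrative sketch only: simplified stand-ins for the agent's conversion tools.
# In the real project these would be registered as tools (e.g. via LangChain's
# @tool decorator) so the ReAct agent can invoke them.

def fahrenheit_to_celsius(f: float) -> float:
    """Convert an oven temperature from Fahrenheit to Celsius."""
    return (f - 32) * 5 / 9

def inches_to_cm(inches: float) -> float:
    """Convert a length in inches to centimetres."""
    return inches * 2.54

def cups_to_grams(cups: float, grams_per_cup: float) -> float:
    """Convert a volume in cups to grams. Cups measure volume, so the
    conversion needs an ingredient-specific density (grams per cup)."""
    return cups * grams_per_cup
```

Note that cups-to-grams is the only conversion that is not a fixed formula: two cups of flour and two cups of sugar weigh different amounts, which is why the density is an explicit parameter here.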
As the agent is based on LangGraph, it is also very simple to swap out models. As part of this project, I experimented with proprietary models (GPT-4o as both the vision-language and reasoning model) and open-weight models served through Ollama (Qwen2.5-VL as the vision-language model and Qwen3 as the reasoning model).
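Because a LangGraph agent is handed its chat model at construction time, swapping providers can be as small as changing which model identifiers you instantiate. A hypothetical sketch of how that choice might be parameterised (the repo's actual configuration may differ, and the model identifiers below are examples):

```python
# Hypothetical sketch: selecting vision-language and reasoning models by provider.
# The returned identifiers would then be used to construct the chat model objects
# (e.g. ChatOpenAI or ChatOllama) passed into the LangGraph agent.

def pick_models(provider: str) -> dict:
    """Return vision-language ('vlm') and reasoning model names for a provider."""
    if provider == "openai":
        # Proprietary route: one model handles both vision and reasoning.
        return {"vlm": "gpt-4o", "reasoner": "gpt-4o"}
    if provider == "ollama":
        # Open-weight route: separate vision and reasoning models via Ollama.
        return {"vlm": "qwen2.5-vl", "reasoner": "qwen3"}
    raise ValueError(f"unknown provider: {provider}")
```

The rest of the agent graph stays unchanged either way, which is what makes the swap cheap.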
If you want to check out the full code, you can find it in this repo. I also presented this project as an OpenCV Live episode, where I walk through building everything step by step.
This project was built upon this one in Hugging Face's excellent agents course.
