by Mathias Vogel, Rahul Rade & Dr. Christian Henning

On a chocolate line in a crowded factory, an engineer stands before a monitor showing a tray of freshly molded bars. Before any system can judge whether those bars are intact or misshapen, it must answer a simpler question: where is each bar?

And so begins the part of visual inspection no one celebrates. Box by box, image by image, the engineer highlights every chocolate on the tray, often hundreds at a time. Because most systems require dozens of annotated images before they can reliably detect anything, these hours of tedious setup become a silent tax on the engineer’s day.

The work is brittle, too. A slightly darker ingredient, a small lighting shift, a smudged lens or jostled camera… any of these can invalidate hours of labeling. Many manufacturers walk away from vision systems altogether, measuring yield the old-fashioned way: weigh what went in, weigh what came out, and hope the difference tells a useful story.

Across industries, the specifics change, but the underlying story is the same. Before you can inspect anything, you have to teach the system where to look. And today, that teaching falls on the engineers who already have the least time to spare.

What if you could skip all that?

Now imagine the same engineer on the same chocolate line, but this time, setup takes seconds. They upload a single image, draw one box around one chocolate bar, and the system instantly finds every other bar on the tray. There’s no annotation campaign, no curated dataset, no training workflow. The model doesn’t need 40 examples or even 10. It just needs one.

This is the promise of training-free vision. Instead of teaching a model through repetition, the engineer teaches it by demonstration: this bar, right here. That single gesture becomes the reference the system uses to understand the rest of the scene.

Because the model looks at structure rather than memorizing appearance, it stays stable even as conditions shift. A new mold, a slightly different surface finish, a camera adjusted by a few millimeters… these no longer trigger the dreaded “start over” moment. Setup becomes a repeatable act: draw one box, confirm the suggestions, and move on.

And the impact goes far beyond chocolate. Anything that appears repeatedly on a line must be located before it can be inspected: cans running down a conveyor, connectors on a PCB, dozens of vacuum pump variants moving through production. Training-free vision collapses that effort into a workflow measured in seconds rather than hours, making inspection accessible in places where traditional vision was once too costly to adopt.

How training-free vision works

Training-free vision feels intuitive because it mirrors how people learn. If you show someone a cookie on a tray, you don’t need to define what a cookie “is.” They look for shapes, textures, and patterns that resemble the example. Our model does the same.

Its general visual understanding comes from extensive pre-training on large, generic image datasets made up of broad, non-industrial imagery that captures edges, contours, geometry, and the relationships between shapes. It doesn’t need customer data to learn these fundamentals. Instead, it distills recurring visual structures that appear across many environments.

When an engineer draws a box around an object, that box becomes a visual prompt. In LLMs, prompts are text. In our system, prompts are spatial cues: find things in this scene that correspond to this region. This eliminates the need for predefined classes like “chocolate bar” or “connector,” which rarely map cleanly onto the variability of real production.
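What follows is a minimal sketch of that idea in Python: a frozen, generically pretrained backbone turns the image into a grid of local descriptors, the drawn box is pooled into a query embedding, and detection reduces to a similarity search over the scene. Everything here (the ResNet backbone, the pooling, the cosine matching) is an illustrative stand-in, not our production model, which also reasons over scene-level structure.

```python
import torch
import torch.nn.functional as F
import torchvision

# Frozen, generically pretrained backbone: the heavy learning happens
# before the model ever sees a factory image.
backbone = torchvision.models.resnet18(weights="IMAGENET1K_V1")
extractor = torch.nn.Sequential(*list(backbone.children())[:-2]).eval()

@torch.no_grad()
def dense_features(image):
    """(3, H, W) float tensor -> (C, h, w) grid of local descriptors."""
    return extractor(image.unsqueeze(0))[0]

def box_embedding(feats, box, image_size):
    """Average the descriptors under a prompt box into one query vector."""
    (H, W), (_, h, w) = image_size, feats.shape
    x0, y0, x1, y1 = box
    r0, r1 = int(y0 / H * h), max(int(y1 / H * h), int(y0 / H * h) + 1)
    c0, c1 = int(x0 / W * w), max(int(x1 / W * w), int(x0 / W * w) + 1)
    return feats[:, r0:r1, c0:c1].mean(dim=(1, 2))

def similarity_map(feats, query):
    """Cosine similarity between the query and every location in the scene."""
    return F.cosine_similarity(feats, query[:, None, None], dim=0)

# One drawn box -> one query embedding -> a (h, w) heatmap of lookalikes.
image = torch.rand(3, 224, 224)  # stand-in for a tray photo
feats = dense_features(image)
query = box_embedding(feats, (40, 30, 90, 70), (224, 224))
heatmap = similarity_map(feats, query)
```

The point of the sketch is the prompt-as-query pattern: a single drawn box becomes an embedding, and “detection” becomes the question of where else the scene resembles it.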
Traditional detectors struggle in factories because they’re built to name objects, not to locate every visually similar region in structured industrial scenes. They often discard the spatial context factories depend on, like patterns on trays, repeated layouts, or rows on conveyors.

Our model preserves that context. It analyzes not just the object inside the box but the geometry of the entire scene. That makes it far more resilient to the small but inevitable variations that tend to break brittle traditional systems.

And because the heavy learning happens before the model reaches the factory, adaptation on the line is nearly effortless. Engineers work with defect-free images, highlight a few examples, and fine-tune if needed. There’s no requirement for defect images, long labeling cycles, or specialized ML infrastructure.

The system is hardware-agnostic, too, running in real time on GPUs and sub-second on CPUs, making it flexible enough for high-speed lines or edge environments.

Okay, but what does it look like in action?

To make training-free vision tangible, we built an interactive demo that mirrors how real inspections are created. Think of it as a preview of how quickly and intuitively an inspection can come together, without classes, training, or manual configuration.

The demo has two modes: cross mode and intra mode.

1. Cross mode

Cross mode mirrors live inspection on the line. Once you’ve shown the model one example, it can locate similar regions in every new image. How it works: select examples in one image, and the model finds them in another.

2. Intra mode

Intra mode is designed for single frames containing many identical items. It’s useful for counting and for generating clean labels for later fine-tuning. How it works: mark a few examples, and the model finds all matching regions.

Across both modes, the principle is the same: the model learns by visual analogy, not from predefined classes or text prompts. You show one example, and it applies that understanding immediately, just like an experienced operator would. A sketch of both modes follows below.

The demo runs with zero training. The performance you see is the model’s out-of-the-box capability. In production, fine-tuning can adapt the system for specific products, materials, or environments, but the demo shows what’s possible before any training at all.
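Reusing the dense_features, box_embedding, and similarity_map helpers from the earlier sketch, the two modes differ only in where the query is built and where it is applied. Function names and the threshold are illustrative assumptions, and real box decoding and non-maximum suppression are omitted.

```python
import torch

@torch.no_grad()
def intra_mode(image, prompt_boxes, threshold=0.8):
    """Mark a few examples in a single frame; find all matching locations."""
    H, W = image.shape[1:]
    feats = dense_features(image)
    query = torch.stack(
        [box_embedding(feats, b, (H, W)) for b in prompt_boxes]).mean(dim=0)
    sim = similarity_map(feats, query)
    return torch.nonzero(sim > threshold)   # matching feature-grid cells

@torch.no_grad()
def cross_mode(reference, prompt_boxes, target, threshold=0.8):
    """Embed prompts once on a reference frame; search every new frame."""
    H, W = reference.shape[1:]
    ref_feats = dense_features(reference)
    query = torch.stack(
        [box_embedding(ref_feats, b, (H, W)) for b in prompt_boxes]).mean(dim=0)
    sim = similarity_map(dense_features(target), query)
    return torch.nonzero(sim > threshold)
```

In other words, intra mode prompts and searches within one frame, while cross mode embeds the prompt once and reuses it on every new frame that comes down the line.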
Real factory conditions, real results

The AI world is full of impressive numbers. But as any engineer knows, accuracy only matters if it holds up under real production conditions. In manufacturing, consecutive images tend to look nearly identical, with only minor shifts in lighting, alignment, or position. That’s the environment inspection systems face every day, and it’s the one that matters most to us.

To evaluate the model in a way that reflects real use, we use the widely adopted COCO benchmark, a transparent, reproducible academic standard built from everyday, non-industrial images, and then adapt it to mimic factory conditions. In this “similar-scene” setup, the reference image is simply a lightly augmented version of the target, as sketched below.

Under this more realistic test, our model achieves 53.6 mAP, far outperforming leading open-source detectors like YOLOE-11-L (23.8 mAP) and OWL-ViT (19.4 mAP), despite those models being trained on much larger datasets. (If you’re curious, you can try YOLOE and compare.)
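The exact augmentations behind the similar-scene setup aren’t spelled out here, so the following is a plausible recreation of the pairing logic under assumed transforms and magnitudes, not the benchmark’s actual recipe.

```python
import torchvision.transforms as T

# "Light drift": the kind of frame-to-frame variation a production line
# actually produces. The specific transforms and magnitudes are assumptions.
light_drift = T.Compose([
    T.ColorJitter(brightness=0.2, contrast=0.1),        # lighting shift
    T.RandomAffine(degrees=3, translate=(0.02, 0.02)),  # nudged camera
    T.GaussianBlur(kernel_size=3, sigma=(0.1, 1.0)),    # slight defocus
])

def make_similar_scene_pair(target_image):
    """Return (reference, target): the frame an engineer prompts on, and
    the near-identical frame the model must generalize to."""
    return light_drift(target_image), target_image
```

In a full detection evaluation, any geometric transform would also be applied to the ground-truth boxes on the reference, so the prompts stay aligned with the augmented image.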
The difference comes down to context. Public detectors focus on semantics (identifying what an object is), which strips away the spatial cues that define industrial scenes. They struggle not because they’re “bad models,” but because they weren’t built for the visual consistency and structured layouts of manufacturing.

Our approach pairs a reference image with a visual prompt, giving the model the freedom to extract contextual structure automatically: repetition, layout, and the spatial relationships that remain stable across shifts in lighting or orientation.

We report scores without fine-tuning or calibration to reflect genuine training-free performance, the way the system behaves on day one in a real factory. And under those conditions, it delivers the strongest and most stable results.

A new foundation for visual inspection

Training-free vision marks a step change in how inspections are created, deployed, and maintained across factories. For the engineer who once spent hours drawing boxes, it means inspection that moves at the speed of production rather than slowing it down.

It also clears a barrier that has quietly held back automation for years. When setup is costly, engineers fall back on manual checks. Yield becomes an approximation. Root-cause analysis becomes guesswork. By reducing setup to seconds, training-free vision makes advanced inspection feasible for more lines, products, and variants than ever before.

Most importantly, it fits naturally into how inspection actually runs on the factory floor. The same model that works out of the box can be fine-tuned when needed. It runs in real time on GPUs and returns sub-second results on CPUs. And it works with standard industrial cameras. Whether deployed at the edge or on a dedicated workstation, it’s built for the practical constraints of production: latency, robustness, and the reality that every line is different.

This approach wasn’t developed in isolation. It reflects years spent inside real factories, deploying the Inspector across thousands of products and learning where conventional systems struggle. And because the Inspector is part of the broader EthonAI Industrial AI Platform, the regions it finds don’t just feed a vision system; they become the starting point for understanding quality, tracing root causes, and driving continuous improvement across sites.

For the factories we serve, the impact is clear. Training-free vision turns inspection from a bottleneck into an enabler. It gives engineers reliable quality insight from the first image and brings vision into places where it was once too costly or too complex to adopt.

We built training-free vision to move the line. Now every engineer can spend less time drawing boxes, and more time improving the process they know best.

Try the interactive demo yourself here.

Mathias Vogel, Rahul Rade, and Christian Henning work on research and development of computer vision algorithms at EthonAI. Mathias Vogel is a Research Scientist with a Master’s degree from ETH Zurich in Machine Learning and Signal Processing. Rahul Rade is a Product Owner and also holds a Master’s degree from ETH Zurich in Machine Learning and Signal Processing. Christian Henning is Lead Research Scientist and holds a PhD from ETH Zurich, where he focused on machine learning research.