Accelerating self-driving software development with multimodal search

By Paul Newman
Co-founder and CTO, Oxa

At Oxa, we’re building the future of industrial autonomy with Oxa Driver – configurable self-driving software that has the foundational skills to drive in all industrial spaces.
Whether it’s a port, airport, or industrial facility, our customers typically come to us with a place in mind – a ‘somewhere’ where they want autonomous vehicles doing something useful. But vitally, they want immense confidence that in the places they care about, operations are reliable, safe and explainable.
Before we deploy in a new environment, we need to prove coverage and diversity: showing that our self-driving software has been trained and tested against the full range of expected and unexpected situations it might encounter there.
Search - retrieval and synthesis
This is where the concept of search becomes critical – it is the engine that processes and organises the datasets necessary to train, validate, and assure our software in places of interest.
We collect vast amounts of real-world sensor data every week. This includes images and point clouds of day-to-day scenes featuring everyday operations, and on rare occasions, edge cases – the unpredictable scenarios that test the limits of the software’s understanding. But at this type and volume of data, discovery becomes a challenge, and the rarity of edge cases compounds it.
Search allows us to interrogate our datasets and analyse semantic information to find what’s needed, what’s been captured and what’s missing. In our development toolchain, Oxa Foundry, search means two things: retrieving data that already exists, and, interestingly, automatically synthesising data when it doesn’t.
Retrieval allows us to find precise examples of the environments, behaviours or object interactions we need to train or validate against. If certain events aren’t available or are underrepresented in the data – such as a white forklift moving in dense fog – we use a heady mix of generative AI, simulation and digital twins to create them.
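The gap-finding side of this can be sketched with a toy coverage check: count how often each object/condition combination appears in scene metadata, and flag the underrepresented pairs for synthesis. This is a minimal illustration, not Oxa’s actual pipeline, and the metadata schema here is an assumption.

```python
from collections import Counter
from itertools import product

def coverage_gaps(scenes, objects, conditions, min_count=5):
    """Flag (object, condition) pairs that are missing or underrepresented
    in the collected data, so they can be routed to synthesis."""
    counts = Counter()
    for scene in scenes:
        for obj in scene["objects"]:
            counts[(obj, scene["condition"])] += 1
    return [pair for pair in product(objects, conditions)
            if counts[pair] < min_count]

# Hypothetical scene metadata: objects present, plus a weather condition
scenes = [
    {"objects": ["forklift", "truck"], "condition": "clear"},
    {"objects": ["forklift"], "condition": "fog"},
]
gaps = coverage_gaps(scenes, ["forklift", "truck"], ["clear", "fog"], min_count=1)
# ("truck", "fog") never occurs, so it becomes a candidate for generation
```

In practice the required pairs would come from the deployment site’s operational design domain rather than a hand-written list.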
NVIDIA Cosmos for multimodal search
At Oxa, we’ve developed our own systems for multimodal search, and are leveraging NVIDIA Cosmos Dataset Search (CDS) to make this even more efficient.
CDS combines an intuitive interface with the ability to search by text, image or video. As it’s cloud-based, it scales easily. For teams like ours, which run large volumes of queries in parallel and work with high-throughput datasets, CDS delivers a clear performance benefit. Its filtering and scaling features allow us to execute hundreds of searches simultaneously without bottlenecks.
One example is emergency vehicle detection. Our software needs to be able to distinguish between a fire engine actively responding to an incident and one that isn’t. The presence of a flashing blue light – visible only in video – is a critical cue. CDS’s video search makes it easier to identify and classify these sequences with greater precision.
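To make the “visible only in video” point concrete, here is a toy heuristic for the flashing cue: given the per-frame blue-channel intensity in a candidate light region, count on/off transitions – a steady light produces none, a flashing one produces many. This is an illustrative sketch, not the classifier Oxa or CDS actually uses.

```python
def is_flashing(blue_levels, threshold=0.5, min_cycles=2):
    """Return True if the blue-channel intensity toggles on/off enough
    times to look like a flashing light rather than a steady one.
    Each full flash cycle contributes two threshold transitions."""
    states = [level > threshold for level in blue_levels]
    transitions = sum(1 for a, b in zip(states, states[1:]) if a != b)
    return transitions >= 2 * min_cycles

# A responding fire engine: intensity alternates frame to frame
# A parked one: a steady (or absent) signal, so no transitions
```

A single still image gives only one sample of `blue_levels`, which is why this cue cannot be recovered from image search alone.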
We’ve already run our own tools alongside CDS, and they complement each other well. We’ve been delighted by how CDS enhances our internal systems, accelerating the retrieval of complex examples and expanding our overall search capabilities.
Generating what we can’t find
Driving around and waiting for edge cases to naturally occur in the real world isn’t feasible. Instead, we generate synthetic data using NVIDIA Cosmos World Foundation Models (WFMs) alongside Oxa Sensor Expansion and other tools within our Foundry toolchain. These tools enable us to create high-fidelity virtual scenes based on text and image prompts. They allow us to simulate difficult conditions like poor lighting, obscured hazards or dynamic objects in motion.
Together, they help us automatically create highly targeted datasets, or ‘syllabuses’, for model training and validation in a given place. This means we can respond more quickly to new deployment requests, covering the necessary edge cases efficiently and robustly.
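The retrieve-or-synthesise loop behind such a syllabus can be sketched as follows. The `retrieve` and `synthesize` callables are hypothetical stand-ins for the retrieval and generation back ends; the logic simply tops up underrepresented cases with synthetic examples.

```python
def build_syllabus(required_cases, retrieve, synthesize, min_examples=10):
    """For each required case, keep the real examples we can retrieve and
    fill any shortfall with synthetic ones."""
    syllabus = {}
    for case in required_cases:
        real = retrieve(case)
        shortfall = max(0, min_examples - len(real))
        syllabus[case] = real + synthesize(case, count=shortfall)
    return syllabus

# Stubs standing in for real retrieval and generation services
retrieve = lambda case: ["real"] * (12 if case == "clear day" else 2)
synthesize = lambda case, count: ["synthetic"] * count
syllabus = build_syllabus(["clear day", "white forklift in fog"], retrieve, synthesize)
# Common cases stay fully real; rare ones are topped up synthetically
```

The design choice worth noting is that real data is always preferred: synthesis only fills the gap between what was captured and what the deployment demands.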
A practical route to scalable autonomy
We treat search as a critical capability in our development stack. By combining retrieval and generation, powered by both Oxa Foundry and NVIDIA Cosmos, we can transform vast, unstructured data into structured, actionable training sets.
This collaborative approach helps us deploy with greater speed and confidence. It’s how we ensure Oxa Driver performs safely in the real world.


