October 4, 2022
Dr. Dmitriy Starson, CEO of Passio Inc.
AI breakthroughs come in waves, and today we are riding an exponential wave of generative AI and text-to-image models. Most notably, models like DALL-E enable the creation of photo-realistic images from textual descriptions, but that is not the only exciting use case! At Passio we focus on building the most advanced domain-specific AI and computer vision technology, and we are extremely excited about the applications of text-to-image models in synthetic data generation and in training machine learning models at scale.
In this post we are excited to share some of our early experiments using generative AI in computer vision applications and the key lessons we learned while working with DALL-E and Stable Diffusion.
As the world of computer vision continues to evolve, so does the technology behind it, and this evolution is happening at an unprecedented pace. The technology stack is changing almost daily, with new AI tools appearing overnight to solve challenges faced by earlier generations of AI. It feels as if AI is starting to build AI tools to improve itself.
One of the most recent developments in AI and computer vision is text-to-image generation, most notably represented by models like DALL-E, Stable Diffusion, and Midjourney. This technology creates photo-realistic images from textual descriptions, which can be used to train and improve visual recognition algorithms and to generate synthetic data for training and testing machine learning models at scale.
Text-to-image generation is a relatively new technology that has only recently begun to be used in computer vision applications. The first text-to-image system was developed in the early 2000s by a team of researchers at MIT. This system, called Text2Image, was able to generate simple images from textual descriptions. Since then, there have been a number of advances in text-to-image generation, which have led to the development of more sophisticated systems that can generate photo-realistic images.
Text-to-image generation systems work by using a deep learning algorithm to learn the mapping between textual descriptions and images. The algorithm is trained on a dataset of images and their corresponding textual descriptions. Once the algorithm has been trained, it can then be used to generate images from new textual descriptions. The easiest and fastest way to experiment with text-to-image is via deployed models hosted on websites like https://www.craiyon.com/ and by looking at exciting examples here.
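For readers who want to experiment locally rather than through a hosted demo, here is a minimal sketch of the idea, assuming the open-source Hugging Face diffusers library and a publicly released Stable Diffusion checkpoint; the model ID, prompt, and file name below are illustrative placeholders rather than anything specific to Passio.

```python
# Minimal text-to-image sketch with the open-source diffusers library.
# Assumes `pip install torch diffusers transformers` and a CUDA-capable GPU.
import torch
from diffusers import StableDiffusionPipeline

# Load a publicly released Stable Diffusion checkpoint (illustrative choice).
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# The pipeline maps a textual description to a photo-realistic image.
prompt = "a photo of a bowl of oatmeal with blueberries, studio lighting"
image = pipe(prompt, num_inference_steps=50, guidance_scale=7.5).images[0]
image.save("oatmeal_synthetic.png")
```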
For more inspiration and an excellent overview of the business applications of generative AI, we encourage you to check out the recent post from Sequoia Capital: Generative AI: A Creative New World.
Text-to-image generation has a number of potential applications. One is training visual recognition algorithms: synthetic data can make training more effective, and generated images can also be used to test and debug computer vision applications. But how effective can this approach be? What are its limitations? Can synthetic text-to-image data replace the need for real-world data? Why not use model-to-model transfer and conventional transfer learning instead?
To explore these questions, we tested text-to-image generation on our most advanced use case: recognition and analysis of foods. Over the past four years, our team at Passio has built arguably the most advanced and robust food recognition visual AI dataset, with millions of images representing thousands of classes structured in our unique visual food taxonomy (check out how we did it here), and we put text-to-image to work across several food recognition use cases.
We started by generating training data using text-to-image. Below you can see several examples of the data we generated, which you can compare with real-world data collected by our team.
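To give a concrete sense of how such a dataset can be produced, here is a simplified sketch under our own illustrative assumptions, not our production tooling: the loop generates a fixed number of images per class from templated prompts and saves them into class-named folders. The class list, prompt template, and directory layout are all placeholders.

```python
# Sketch: bulk generation of labeled synthetic food images from text prompts.
# Class names, prompt template, and counts are illustrative placeholders.
import os
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

FOOD_CLASSES = ["caesar salad", "pepperoni pizza", "avocado toast"]  # hypothetical classes
PROMPT_TEMPLATE = "a close-up photo of {} on a plate, natural lighting, high detail"
IMAGES_PER_CLASS = 50

for food in FOOD_CLASSES:
    out_dir = os.path.join("synthetic_data", food.replace(" ", "_"))
    os.makedirs(out_dir, exist_ok=True)
    for i in range(IMAGES_PER_CLASS):
        image = pipe(PROMPT_TEMPLATE.format(food)).images[0]
        image.save(os.path.join(out_dir, f"{i:04d}.png"))
```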
By analyzing this data we can make a number of interesting observations:
Key Facts:
Results:
- Accuracy with DALL-E mini + real data: 94%
- Accuracy with real data: 90%
- Accuracy with DALL-E mini data: 75%
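For context on how a "synthetic + real" setup like the one behind these numbers can be put together, here is a rough sketch under assumptions of our own choosing, not a description of Passio's actual training pipeline: it fine-tunes a standard classifier on the union of a real-image folder and a synthetic-image folder. Paths, backbone, and hyperparameters are placeholders.

```python
# Sketch: fine-tune an image classifier on real + synthetic data combined.
# Directory layout, backbone, and hyperparameters are illustrative assumptions.
import torch
from torch import nn
from torch.utils.data import ConcatDataset, DataLoader
from torchvision import datasets, models, transforms

tf = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

# Both roots use the same class sub-folders so label indices line up.
real_ds = datasets.ImageFolder("data/real", transform=tf)
synth_ds = datasets.ImageFolder("data/synthetic", transform=tf)
loader = DataLoader(ConcatDataset([real_ds, synth_ds]), batch_size=32, shuffle=True)

# Start from an ImageNet-pretrained backbone and replace the classification head.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, len(real_ds.classes))
model = model.to("cuda")

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

model.train()
for epoch in range(5):
    for images, labels in loader:
        images, labels = images.to("cuda"), labels.to("cuda")
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```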
Our general conclusion is that synthetic data generated with text-to-image models can be extremely helpful in building machine learning models. The quality of the data generated with DALL-E and similar models is very high when the prompts are properly constructed. Using this synthetic data still requires a high degree of supervision, especially when the models trained on it are intended for real-world, production-level applications. Synthetic data is likely to become a critical component of dataset development efforts, and we are excited to be integrating text-to-image data generation into the Passio Mobile AI platform.
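One lightweight way to add some of that supervision automatically, offered purely as an illustrative idea rather than a description of our workflow, is to score every generated image against its own prompt with an open-source image-text model such as CLIP and discard low-similarity generations; the checkpoint and threshold below are assumptions, not a tuned recipe.

```python
# Sketch: filter synthetic images by prompt similarity with open-source CLIP.
# The checkpoint and threshold are illustrative assumptions, not a tuned recipe.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def keep_image(image_path: str, prompt: str, threshold: float = 0.25) -> bool:
    """Return True when the generated image is similar enough to its prompt."""
    image = Image.open(image_path)
    inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
        # Cosine similarity between the normalized image and text embeddings.
        img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
        txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
        similarity = (img @ txt.T).item()
    return similarity >= threshold
```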
Quick summary points: