How Meta trained CM3Leon to be the best at generating and captioning images
It handles complex queries with ease and edits images as directed.
Meta has introduced a new artificial intelligence model that can create images from text descriptions and write captions for them. The model is called CM3Leon and, according to the developers, delivers the best image-generation quality among comparable existing models.
CM3Leon differs from most other image generators in that it uses a transformer, a neural network architecture that can process different types of data, such as text and images, within a single model. Transformers allow the model to train faster and better account for the context of the input data. In addition, CM3Leon requires five times less compute and less training data than previous transformer-based methods.
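The unified-token idea behind transformer-based generators can be illustrated with a toy sketch: text tokens and discrete image tokens are placed in one flat sequence, sharing one id space, so a single model can attend across both modalities. Everything below, the vocabulary sizes, the word-hashing "tokenizer", and the special-token ids, is invented for illustration and is not CM3Leon's actual tokenizer.

```python
# Toy sketch of a unified multimodal token stream. All vocabulary sizes
# and special-token ids here are invented for illustration only.

TEXT_VOCAB = 1000       # hypothetical text vocabulary size
IMAGE_VOCAB = 256       # hypothetical image-tokenizer codebook size
BOS, SEP = 1998, 1999   # invented special-token ids outside both ranges

def text_to_ids(words):
    # deterministic toy "tokenizer": map each word into the text id range
    return [sum(ord(c) for c in w) % TEXT_VOCAB for w in words]

def image_to_ids(codes):
    # offset image codes past the text vocabulary so both modalities
    # can share one embedding table without id collisions
    return [TEXT_VOCAB + c for c in codes]

def build_sequence(caption_words, image_codes):
    # one flat sequence a decoder-only transformer can model end to end
    return [BOS] + text_to_ids(caption_words) + [SEP] + image_to_ids(image_codes)

seq = build_sequence(["small", "cactus"], [3, 141, 59])
```

Because caption and image live in the same sequence, the same model can be conditioned in either direction: text before image for generation, image before text for captioning.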
Meta used millions of licensed images from Shutterstock to train CM3Leon. The most powerful version of the model has 7 billion parameters, twice as many as OpenAI’s competing DALL-E 2. Parameters define a model’s skill at a task, such as generating text or images.
One of the key factors in CM3Leon’s success is a technique called SFT (supervised fine-tuning), which involves additionally training the model on specific tasks. The technique has already been used to train text generators such as OpenAI’s ChatGPT, but Meta suggested it could be useful for images as well. Indeed, SFT improved CM3Leon’s performance not only at generating images, but also at writing captions for them, answering questions about images, and editing images from text instructions (for example, “change the color of the sky to bright blue”).
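The core idea of supervised fine-tuning can be shown with a deliberately tiny sketch: start from "pretrained" weights and take gradient steps on task-specific (input, target) pairs. A one-dimensional linear model with squared error stands in for the real network here; all numbers are invented, and none of this is Meta's actual training recipe.

```python
# Generic sketch of supervised fine-tuning (SFT): continue training
# existing weights on labeled task examples. Toy 1-D linear model;
# the data and hyperparameters are invented for illustration.

def sft(weights, pairs, lr=0.1, epochs=200):
    w, b = weights
    for _ in range(epochs):
        for x, y in pairs:
            err = (w * x + b) - y     # prediction error on one example
            w -= lr * err * x         # gradient of squared error wrt w
            b -= lr * err             # gradient of squared error wrt b
    return w, b

pretrained = (0.0, 0.0)               # stand-in for pretrained weights
task_data = [(1.0, 2.0), (2.0, 4.0)]  # toy task: learn y = 2x
w, b = sft(pretrained, task_data)
```

The same loop shape, loss on (instruction, desired output) pairs, is what makes SFT applicable to captioning and instruction-based editing as well as generation: only the pairs change, not the procedure.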
Most image generators struggle with “complex” objects and overly restrictive text prompts. CM3Leon handles them better, or at least fails less often. In several examples compiled by Meta, CM3Leon created images from prompts such as “A small cactus wearing a straw hat and neon sunglasses in the Sahara desert”, “Close-up of a human hand”, “A raccoon anime protagonist preparing for an epic battle with a samurai sword” and “Road sign in fantasy style with the text ‘1991’”. For comparison, I ran the same prompts through DALL-E 2. Some of the results were close, but in my opinion the CM3Leon images were overall more relevant and detailed, especially the road sign.
CM3Leon can also follow instructions for editing existing images. For example, given the prompt “Create a high-quality image of a ‘room with a sink and a mirror’ with a bottle at (199, 130)”, the model can generate something visually coherent and, as Meta puts it, “corresponding to the context”: a room, a sink, a mirror, a bottle and all. DALL-E 2 fails to pick up on the nuances of such prompts, sometimes omitting the specified objects entirely.
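A small helper makes the structure of such coordinate-grounded prompts concrete: the object name and its (x, y) pixel position are embedded directly in the text. This format is modeled on Meta's published example; the exact syntax CM3Leon expects is not public, so both functions below are assumptions for illustration.

```python
import re

# Hypothetical helpers for coordinate-grounded edit prompts, modeled on
# Meta's example "...with a bottle at (199, 130)". The prompt format is
# an assumption, not CM3Leon's documented interface.

def build_edit_prompt(scene, obj, x, y):
    # embed the grounded object and its pixel position in plain text
    return f"Create a high-quality image of '{scene}' with a {obj} at ({x}, {y})"

def parse_edit_prompt(prompt):
    # recover the grounded object and its (x, y) position, if present
    m = re.search(r"with a (\w+) at \((\d+), (\d+)\)", prompt)
    return (m.group(1), int(m.group(2)), int(m.group(3))) if m else None

prompt = build_edit_prompt("room with a sink and a mirror", "bottle", 199, 130)
```

Because the coordinates survive tokenization as ordinary text, a model trained on such pairs can learn to associate them with spatial placement, which is what plain DALL-E 2 prompts lack.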
And, unlike DALL-E 2, CM3Leon can follow a range of prompts to generate short or long captions and answer questions about a given image. In these areas, Meta claims, the model outperformed even dedicated image-captioning models (e.g. Flamingo, OpenFlamingo), despite seeing less text in its training data.
But what about bias? Generative AI models such as DALL-E 2 have been found to amplify societal biases, for example by depicting positions of power, such as “CEO” or “director”, mostly as white men. Meta leaves the question unanswered, saying only that CM3Leon “can reflect any bias present in the training data.”
“As the AI industry continues to evolve, generative models like CM3Leon are becoming more and more advanced,” the company writes. “While the industry is still in the early stages of understanding and addressing these issues, we believe transparency will be key to accelerating progress.”