Artificial intelligence learns to understand relationships between objects: a new method makes it possible to generate realistic and consistent images
A new method improves artificial intelligence's understanding of visual relationships.
Scientists at the University of Twente (Netherlands) have developed a new artificial intelligence method that builds structured representations of scenes from images, which can serve as the basis for generating realistic and consistent images. They recently published their results in the journal IEEE Transactions on Pattern Analysis and Machine Intelligence.
Generative AI models can create images from text prompts. These models work best when depicting single objects; composing complete scenes remains difficult. Michael Ying Yang, a researcher at the ITC faculty of the University of Twente, has developed a new method that builds scene representations from images, which can serve as the basis for generating realistic and consistent images.
Humans are good at identifying relationships between objects. “We can see that the chair is on the floor and the dog is walking down the street. AI models find this challenging,” explains Yang, associate professor in the Scene Understanding Group at the Faculty of Geo-Information Science and Earth Observation (ITC). Improving a computer’s ability to detect and understand visual relationships is essential for image generation, but it can also aid perception in autonomous vehicles and robots.
Methods for building a semantic understanding of an image already exist, but they are slow. These methods use a two-stage approach: first, they detect all the objects in the scene; then a separate neural network examines every possible pair of objects and labels each pair with the correct relation. The number of pairs this second stage must check grows quadratically with the number of objects. “Our model takes just one step. It automatically predicts subjects, objects, and their relationships at the same time,” Yang says.
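To see why the two-stage approach slows down, consider the candidate pairs it must score. The sketch below (an illustration of the general two-stage bottleneck, not the authors' actual pipeline) enumerates every ordered subject-object pair for a list of detected objects:

```python
from itertools import permutations

def candidate_pairs(objects):
    """Return every ordered (subject, object) pair a two-stage
    pipeline would have to classify. The count is n * (n - 1),
    so the workload grows quadratically with the object count."""
    return list(permutations(objects, 2))

detected = ["person", "baseball bat", "floor", "dog"]
pairs = candidate_pairs(detected)
# 4 objects already yield 4 * 3 = 12 ordered pairs to check;
# a cluttered scene with dozens of objects yields thousands.
```

A one-step model sidesteps this enumeration by predicting subjects, objects, and relations jointly, rather than scoring each pair separately.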
For this one-step method, the model looks at the visual characteristics of the objects in the scene and focuses on the most important details to determine relationships, highlighting the regions where objects interact or relate to each other. With these techniques, relatively little training data is enough to identify the most important relationships between objects; the model then only needs to generate a description of how they are related. “The model finds that, in the sample image, a person is very likely interacting with a baseball bat. It then learns to describe the most likely relationship: ‘man-swings-baseball bat’,” Yang says.
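The final step, turning the model's confidence scores into a readable relationship description, can be sketched as below. This is a toy stand-in for the model's joint prediction head; the scores and the helper name are illustrative, not part of the published method:

```python
def most_likely_relation(scores):
    """Given confidence scores for (subject, predicate, object)
    triplets, return the highest-scoring one as a hyphenated
    description like 'man-swings-baseball bat'."""
    triplet, _ = max(scores.items(), key=lambda kv: kv[1])
    return "-".join(triplet)

# Hypothetical scores the model might assign to one image region.
scores = {
    ("man", "swings", "baseball bat"): 0.91,
    ("man", "holds", "baseball bat"): 0.42,
    ("man", "stands on", "floor"): 0.33,
}
print(most_likely_relation(scores))  # -> man-swings-baseball bat
```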