Newswise —

A team of researchers in China has created a generative adversarial network that produces high-quality images from text descriptions. Their approach integrates common-sense knowledge into the network to improve image generation: the common sense gives the network a clearer starting point for image creation and enhances image features at three distinct resolutions. The researchers trained the network on a dataset of bird images and text descriptions, and the resulting bird images achieved competitive scores when compared with other neural network methods.

The group’s research was published Feb. 20 in Intelligent Computing, a Science Partner Journal.

Considering the well-known saying that "a picture is worth a thousand words," it is not surprising that text-to-image frameworks currently available have limitations. For instance, if you aim to create an image of a bird using a computer, you may provide a description that includes various details, such as the bird's size, body color, and beak shape. However, to generate a complete image, the computer needs to determine numerous additional details, including the bird's orientation, the background, and whether its beak should be open or closed.

If computers were equipped with what we consider to be common-sense knowledge, they would be better at making decisions when it comes to depicting unspecified details. For instance, they would know that a bird can stand on either one or two legs, but never on three.

The authors' image generation network was evaluated quantitatively against its predecessors and achieved competitive scores on metrics of image fidelity and of distance from real images. The authors also assessed the generated images qualitatively, describing them as generally consistent, natural, sharp, and vivid.

The research article concludes that the integration of common sense into text-to-image synthesis can significantly facilitate its development.

The authors' neural network for text-to-image generation comprises three modules. The first module enhances the text description used for image generation. To achieve this, the authors used ConceptNet, a data source that represents general knowledge for language processing as a graph of related nodes, to retrieve relevant pieces of common-sense knowledge. They then added a filter to reject unnecessary information and select the most relevant bits of knowledge. To add variability to the generated images, the authors included some statistical noise. The input to the image generator thus includes the original text description, analyzed both as a sentence and as individual words, selected common-sense knowledge from ConceptNet, and noise.
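
To make the retrieval step concrete, here is a minimal Python sketch; it is not the authors' code. It queries ConceptNet's public REST API for facts about a concept and keeps only the strongest edges, with a simple weight threshold standing in for the paper's learned filter. The function name, threshold, and noise size are illustrative assumptions.

    import numpy as np
    import requests

    def fetch_common_sense(concept, min_weight=2.0, limit=20):
        """Pull edges about `concept` from ConceptNet's public API and
        keep the strongest ones (a crude stand-in for the paper's
        learned relevance filter)."""
        url = f"http://api.conceptnet.io/c/en/{concept}"
        edges = requests.get(url, params={"limit": limit}).json()["edges"]
        return [(e["start"]["label"], e["rel"]["label"], e["end"]["label"])
                for e in edges if e["weight"] >= min_weight]

    facts = fetch_common_sense("bird")   # e.g. ("bird", "CapableOf", "fly")
    noise = np.random.normal(size=100)   # statistical noise for variability

In the full system, the selected facts and the encoded text description would be turned into vectors and fed to the generator together with the noise.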

The second module of the authors' neural network generates images through a multi-stage process, where each stage corresponds to an image size starting from 64 x 64 pixels and gradually increasing to 128 x 128 and 256 x 256. To achieve this, the module utilizes the authors' "adaptive entity refinement" unit, which incorporates common-sense knowledge of the details required for each image size.
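
As a rough illustration of this coarse-to-fine design, the following PyTorch sketch is a simplification rather than the authors' architecture: RefineBlock stands in for the adaptive entity refinement unit, injecting a conditioning vector (text plus common-sense knowledge) into the feature map before each upsampling step, and all layer sizes are assumed for the example.

    import torch
    import torch.nn as nn

    class RefineBlock(nn.Module):
        """Simplified stand-in for the paper's 'adaptive entity
        refinement' unit: injects the conditioning vector into the
        features, then upsamples 2x."""
        def __init__(self, channels, cond_dim):
            super().__init__()
            self.project = nn.Linear(cond_dim, channels)
            self.up = nn.Sequential(
                nn.Upsample(scale_factor=2, mode="nearest"),
                nn.Conv2d(channels, channels, 3, padding=1),
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
            )

        def forward(self, feat, cond):
            b, c, h, w = feat.shape
            cond_map = self.project(cond).view(b, c, 1, 1).expand(b, c, h, w)
            return self.up(feat + cond_map)

    class MultiStageGenerator(nn.Module):
        """Coarse-to-fine generation: 64x64 -> 128x128 -> 256x256."""
        def __init__(self, z_dim=100, cond_dim=256, channels=64):
            super().__init__()
            self.channels = channels
            self.fc = nn.Linear(z_dim + cond_dim, channels * 16 * 16)
            self.base = nn.Sequential(          # 16x16 -> 64x64
                nn.Upsample(scale_factor=4, mode="nearest"),
                nn.Conv2d(channels, channels, 3, padding=1),
                nn.ReLU(inplace=True),
            )
            self.refine128 = RefineBlock(channels, cond_dim)
            self.refine256 = RefineBlock(channels, cond_dim)
            self.to_rgb = nn.Conv2d(channels, 3, 3, padding=1)

        def forward(self, z, cond):
            feat = self.fc(torch.cat([z, cond], dim=1))
            feat = feat.view(-1, self.channels, 16, 16)
            feat = self.base(feat)              # coarse 64x64 features
            feat = self.refine128(feat, cond)   # refine at 128x128
            feat = self.refine256(feat, cond)   # refine at 256x256
            return torch.tanh(self.to_rgb(feat))

Multi-stage text-to-image networks typically also produce an intermediate image at each resolution for evaluation; the sketch keeps only the final 256 x 256 output for brevity.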

The third module of the authors' neural network scrutinizes the generated images and discards those that do not correspond to the original text description. It is this third module, trained in opposition to the generator to evaluate its output, that makes the network a "generative adversarial network." The authors named their network CD-GAN because it is driven by common sense.
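
A simplified PyTorch sketch of such an adversarial pairing appears below; it is an illustration rather than the paper's exact module, and the layer sizes and helper names are assumptions. The discriminator scores whether a 64 x 64 image matches its conditioning vector, and its loss rewards high scores for real matched pairs and low scores for generated images.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ConditionalDiscriminator(nn.Module):
        """Scores whether a 64x64 image matches its conditioning
        vector (a simplified design, not the paper's exact module)."""
        def __init__(self, cond_dim=256):
            super().__init__()
            self.encode = nn.Sequential(        # 3x64x64 -> 256x4x4
                nn.Conv2d(3, 64, 4, 2, 1), nn.LeakyReLU(0.2),
                nn.Conv2d(64, 128, 4, 2, 1), nn.LeakyReLU(0.2),
                nn.Conv2d(128, 256, 4, 2, 1), nn.LeakyReLU(0.2),
                nn.Conv2d(256, 256, 4, 2, 1), nn.LeakyReLU(0.2),
            )
            self.judge = nn.Conv2d(256 + cond_dim, 1, 4)

        def forward(self, img, cond):
            feat = self.encode(img)                         # B x 256 x 4 x 4
            cond = cond[:, :, None, None].expand(-1, -1, 4, 4)
            return self.judge(torch.cat([feat, cond], 1)).view(-1)

    def discriminator_loss(D, real_img, fake_img, cond):
        """Real matched pairs should score high; generated images low."""
        ones = torch.ones(real_img.size(0))
        zeros = torch.zeros(fake_img.size(0))
        real = F.binary_cross_entropy_with_logits(D(real_img, cond), ones)
        fake = F.binary_cross_entropy_with_logits(D(fake_img.detach(), cond), zeros)
        return real + fake

Training alternates between this discriminator objective and a generator objective that tries to push the discriminator's scores on generated images upward.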

CD-GAN was trained using the Caltech-UCSD Birds-200-2011 dataset, which catalogs 200 bird species using 11,788 specially annotated images.

Guokai Zhang from Tianjin University conducted the experiments and wrote the paper. Ning Xu from Tianjin University came up with the idea for the study. Chenggang Yan from Hangzhou Dianzi University analyzed the data. Bolun Zheng from Hangzhou Dianzi University and Yulong Duan from the 30th Research Institute of CETC provided valuable contributions to the analysis and writing of the manuscript. Bo Lv from the 30th Research Institute of CETC and An-An Liu from Tianjin University helped with the analysis and provided helpful discussions.

Journal Link: Intelligent Computing