
The Institute of Automation of the Chinese Academy of Sciences and the Northern Institute of Electronic Equipment propose a multi-caption text-to-face synthesis method, with dataset and code open-sourced

2022-02-02 20:13:58 | Heart of the Machine

Text-to-face synthesis aims to generate realistic, natural face images from one or more text descriptions while ensuring that the generated images match those descriptions. It has applications in human-computer interaction, artistic image generation, and producing a suspect's portrait from a victim's description. To address this problem, the Institute of Automation of the Chinese Academy of Sciences and the Northern Institute of Electronic Equipment have proposed a multi-input text-to-face synthesis method (SEA-T2F) and built the first manually annotated large-scale face text-description dataset (CelebAText-HQ). The method is the first to synthesize faces from multiple text inputs, and compared with single-input algorithms its generated images are closer to real faces. The paper, "Multi-caption Text-to-Face Synthesis: Dataset and Algorithm", has been accepted by ACM MM 2021.


  • Paper: https://zhaoj9014.github.io/pub/MM21.pdf

  • Dataset and code: https://github.com/cripac-sjx/SEA-T2F

Figure 1: Text-to-face image generation results of different methods
 
Compared with text-to-natural-image generation, text-to-face generation is a more challenging task. On one hand, faces have finer textures and subtler features, so it is difficult to establish a mapping between face images and natural language. On the other hand, the relevant datasets are either too small or generated automatically from attribute labels; to date there has been no large-scale manually annotated face text-description dataset, which greatly limits the development of this field. Moreover, current text-based face generation methods [1,2,3,4] all rely on a single text input, but one caption is not enough to describe complex facial features. What is more, because text descriptions are subjective, different people's descriptions of the same image may conflict with one another. Face generation from multiple text descriptions is therefore of great research value.

To address this problem, the team proposed a multi-input text-to-face generation algorithm. The algorithm adopts a three-stage generative adversarial network framework that takes randomly sampled Gaussian noise as input. Sentence features from the different captions are embedded into the network through an SFIM module, and in the second and third stages AMC modules fuse the word features of the different captions with the intermediate image features through an attention mechanism, producing finer-grained features. To better learn the attribute information in the text, the team also designed an attribute classifier and introduced an attribute classification loss to optimize the network parameters (illustrative sketches of both the generator and this loss are given below).
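The following PyTorch sketch shows how such a three-stage, multi-caption generator could be wired together. The module names SFIM and AMC follow the paper, but all internals here (feature dimensions, mean-pooling of captions, the cross-attention layout, a fixed feature width across stages) are simplifying assumptions for illustration, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SFIM(nn.Module):
    """Sentence-feature injection (sketch): conditions the image feature map
    on pooled sentence embeddings via a learned scale/shift. The real
    module's internals differ; this is an illustrative assumption."""
    def __init__(self, feat_dim, sent_dim):
        super().__init__()
        self.gamma = nn.Linear(sent_dim, feat_dim)
        self.beta = nn.Linear(sent_dim, feat_dim)

    def forward(self, h, sents):            # h: (B, C, H, W); sents: (B, N, D)
        s = sents.mean(dim=1)               # pool embeddings of the N captions
        g = self.gamma(s)[:, :, None, None]
        b = self.beta(s)[:, :, None, None]
        return h * (1 + g) + b

class AMC(nn.Module):
    """Attention-based multi-caption fusion (sketch): intermediate image
    features attend over word features from all captions (assumed layout)."""
    def __init__(self, feat_dim, word_dim):
        super().__init__()
        self.proj = nn.Linear(word_dim, feat_dim)

    def forward(self, h, words):            # words: (B, N*T, D), all captions
        B, C, H, W = h.shape
        q = h.flatten(2).transpose(1, 2)    # (B, HW, C) image queries
        k = self.proj(words)                # (B, N*T, C) word keys/values
        attn = F.softmax(q @ k.transpose(1, 2) / C ** 0.5, dim=-1)
        ctx = (attn @ k).transpose(1, 2).reshape(B, C, H, W)
        return h + ctx                      # residual fusion

class MultiCaptionGenerator(nn.Module):
    """Three-stage generator: Gaussian noise in, progressively larger face
    images out, conditioned on N captions (SFIM at every stage, AMC in
    stages 2 and 3, matching the description above)."""
    def __init__(self, z_dim=100, feat_dim=64, sent_dim=256, word_dim=256):
        super().__init__()
        self.fc = nn.Linear(z_dim, feat_dim * 4 * 4)
        self.sfim = nn.ModuleList([SFIM(feat_dim, sent_dim) for _ in range(3)])
        self.amc = nn.ModuleList([AMC(feat_dim, word_dim) for _ in range(2)])
        self.up = nn.Upsample(scale_factor=4, mode="nearest")
        self.to_rgb = nn.Conv2d(feat_dim, 3, 3, padding=1)

    def forward(self, z, sents, words):
        h = self.fc(z).view(z.size(0), -1, 4, 4)
        images = []
        for i in range(3):
            h = self.sfim[i](self.up(h), sents)
            if i > 0:                       # AMC only in stages 2 and 3
                h = self.amc[i - 1](h, words)
            images.append(torch.tanh(self.to_rgb(h)))
        return images                       # e.g. 16x16, 64x64, 256x256
```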

Figure 2: Schematic diagram of the model framework
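The attribute classification loss mentioned above can be sketched as follows: a small classifier predicts binary face attributes from the generated image and is trained with binary cross-entropy against attribute labels derived from the captions. The 40-attribute setting mirrors CelebA's label set; the classifier architecture and the loss weighting are assumptions for illustration.

```python
import torch
import torch.nn as nn

class AttributeClassifier(nn.Module):
    """Minimal attribute predictor (assumed architecture, not the paper's)."""
    def __init__(self, num_attrs=40):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, num_attrs),       # one logit per binary attribute
        )

    def forward(self, img):
        return self.net(img)

def attribute_loss(classifier, fake_img, target_attrs, weight=1.0):
    # target_attrs: (B, 40) binary labels derived from the captions;
    # this term is added to the generator's adversarial loss.
    logits = classifier(fake_img)
    return weight * nn.functional.binary_cross_entropy_with_logits(
        logits, target_attrs.float())
```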

In addition, the team built the first large-scale manually annotated dataset of this kind. They first selected 15,010 images from the CelebAMask-HQ dataset; each image was then manually annotated with ten text descriptions by ten annotators, with the ten descriptions covering different parts of the face in coarse-to-fine order.
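A hypothetical loader for such a dataset is sketched below: each image is paired with its ten captions, from which a subset is sampled as the multi-caption input at training time. The file name and JSON layout here are assumptions for illustration only; see the GitHub repository for the released format.

```python
import json
import random

def load_captions(annotation_file="celebatext_hq.json", num_captions=4):
    # Assumed layout: [{"image": "...", "captions": [10 strings]}, ...]
    with open(annotation_file) as f:
        records = json.load(f)
    for rec in records:
        assert len(rec["captions"]) == 10   # ten annotators per image
        # sample a subset of captions as the multi-caption input
        yield rec["image"], random.sample(rec["captions"], num_captions)
```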
 
Experimental results
 
The team compared the proposed method with existing approaches [5,6] both qualitatively and quantitatively. The experimental results show that the method not only generates higher-quality images but also matches the text descriptions more closely.

Figure 3: Comparison of results with different methods

Figure 4: Generation results for different numbers of input captions

Table 1: Quantitative comparison of different methods

Table 2: Ablation results. The first three rows correspond to removing SFIM, AMC, and the attribute classification loss from the network, respectively.


References:

1.     Osaid Rehman Nasir, Shailesh Kumar Jha, Manraj Singh Grover, Yi Yu, Ajit Kumar, and Rajiv Ratn Shah. 2019. Text2FaceGAN: Face generation from fine-grained textual descriptions. In IEEE International Conference on Multimedia Big Data (BigMM). 58–67.
2.     Xiang Chen, Lingbo Qing, Xiaohai He, Xiaodong Luo, and Yining Xu. 2019. FTGAN: A fully-trained generative adversarial network for text to face generation. arXiv preprint arXiv:1904.05729.
3.     David Stap, Maurits Bleeker, Sarah Ibrahimi, and Maartje ter Hoeve. 2020. Conditional image generation and manipulation for user-specified content. arXiv preprint arXiv:2005.04909.
4.     Weihao Xia, Yujiu Yang, Jing-Hao Xue, and Baoyuan Wu. 2021. TediGAN: Text-guided diverse face image generation and manipulation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2256–2265.
5.     Tao Xu, Pengchuan Zhang, Qiuyuan Huang, Han Zhang, Zhe Gan, Xiaolei Huang, and Xiaodong He. 2018. AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 1316–1324.
6.     Bowen Li, Xiaojuan Qi, Thomas Lukasiewicz, and Philip Torr. 2019. Controllable text-to-image generation. In Advances in Neural Information Processing Systems (NeurIPS). 2065–2075.

Copyright notice
Author: Heart of the Machine. Please include the original link when reprinting, thank you.
https://en.fheadline.com/2022/02/202202022013551054.html
