Impersonator++ (Liquid Warping GAN with Attention): A Unified Framework for Human Image Synthesis

Impersonator++ essentially copies human motion from reference videos and pastes it onto source images. Proposed by researchers from ShanghaiTech University, the Chinese Academy of Sciences, and the University of Chinese Academy of Sciences, it handles human motion imitation, appearance transfer, and novel view synthesis within a unified framework.
Motion imitation, appearance transfer, and novel view synthesis all fall under the umbrella of human image synthesis: the generation of plausible, photorealistic images of people. The field has applications in areas such as character animation, reenactment, virtual clothes try-on, and film and game production.

Existing task-specific methods use 2D keypoints to estimate human body structure and express position information, but they cannot effectively characterize different subjects' body shapes or limb rotations. The researchers instead propose a 3D body mesh recovery module that disentangles pose and shape, modeling joint location and rotation while also better capturing "personalized body shape."
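As a rough illustration of what disentangling pose and shape buys: in a statistical body model of the kind the recovery module estimates, shape coefficients deform a template mesh independently of the rotation applied for pose. The sketch below is a toy numpy version with invented array names and sizes, not the paper's actual body model (which uses per-joint rotations and blend skinning):

```python
import numpy as np

def recover_mesh(template, shape_dirs, betas, joint_rot):
    """Toy sketch of pose/shape disentanglement.

    template:   (V, 3) rest-pose vertices
    shape_dirs: (V, 3, B) per-vertex shape blend-shape directions
    betas:      (B,) shape coefficients ("personalized body shape")
    joint_rot:  (3, 3) one global rotation standing in for
                per-joint limb rotations
    """
    # Shape: a linear blend of learned directions, independent of pose.
    shaped = template + shape_dirs @ betas          # (V, 3)
    # Pose: a rigid rotation applied after shaping.
    return shaped @ joint_rot.T

np.random.seed(0)
# A 90-degree rotation about the z-axis stands in for the "pose".
Rz = np.array([[0.0, -1.0, 0.0],
               [1.0,  0.0, 0.0],
               [0.0,  0.0, 1.0]])
V, B = 4, 2
template = np.random.randn(V, 3)
shape_dirs = np.random.randn(V, 3, B)
betas = np.array([0.5, -0.2])

verts = recover_mesh(template, shape_dirs, betas, Rz)
```

Because the rotation is rigid, changing `betas` alters the body's proportions while changing `joint_rot` only re-poses them, which is exactly the separation the module exploits.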

To more accurately preserve source information such as texture, style, color, and face identity, the researchers designed an Attentional Liquid Warping GAN with an Attentional Liquid Warping Block (AttLWB) that propagates the source information in both image and feature spaces to the synthesized reference. AttLWB uses a denoising convolutional auto-encoder to extract useful features and better characterize the source identity.


The researchers say their method also supports more flexible warping from multiple sources. It first trains a model on an extensive training set, then fine-tunes it via one/few-shot learning on unseen images in a self-supervised way to produce high-resolution results. The team also applied one/few-shot adversarial learning to further improve generalization to unseen source images.
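The personalization step can be pictured as ordinary self-supervised fine-tuning: starting from pre-trained weights, minimize a reconstruction loss on the one or few unseen images. The toy linear "generator" below is purely illustrative (the real method fine-tunes a GAN with adversarial and perceptual terms as well):

```python
import numpy as np

def finetune(W, z, target, lr=0.1, steps=200):
    """Fine-tune generator weights W so that W @ z reconstructs
    `target` (the unseen source image, flattened), minimizing an
    L2 self-reconstruction loss. The adversarial losses of the
    actual method are omitted in this sketch.
    """
    for _ in range(steps):
        pred = W @ z
        grad = np.outer(pred - target, z)  # dL/dW for 0.5*||W z - t||^2
        W = W - lr * grad
    return W

np.random.seed(0)
d, k = 16, 8                      # "image" size, latent size
W = np.random.randn(d, k) * 0.1   # stands in for the pre-trained generator
z = np.random.randn(k)            # fixed code for the unseen subject
z /= np.linalg.norm(z)
target = np.random.randn(d)       # the single unseen source image

W_ft = finetune(W, z, target)
err_before = np.linalg.norm(W @ z - target)
err_after = np.linalg.norm(W_ft @ z - target)
```

After a few hundred steps the reconstruction error on the unseen image drops sharply, which is the "personalization" effect the paper measures.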

To evaluate Impersonator++'s performance, the researchers built the Impersonator (iPER) video dataset, featuring diverse styles of clothing. The dataset contains 206 video sequences with 241,564 frames, and covers 30 human subjects of varying shape, height, and gender, wearing more than 100 clothing items and performing random actions.


The researchers used the iPER, MotionSynthetic, FashionVideo, and YouTube-Dancer-18 datasets to evaluate personalization, loss functions, input concatenation, texture warping, and feature warping under one-shot and few-shot settings, and to perform qualitative comparisons.



The results show that the proposed method generates high-fidelity images that preserve the face identity, shape consistency, and clothing details of the source, while 2D pose-guided methods such as PG2, DSC, SHUP, and DIAF struggle to do so. The method also achieves decent results in cross imitation, even with reference images outside the domain of its training dataset.


We tackle human image synthesis, including human motion imitation, appearance transfer, and novel view synthesis, within a unified framework. This means that the model, once trained, can handle all of these tasks. Existing task-specific methods mainly use 2D keypoints (pose) to estimate the human body structure. However, these only express position information, with no ability to characterize the personalized shape of the person or model the limb rotations. In this paper, we propose to use a 3D body mesh recovery module to disentangle the pose and shape. It can not only model the joint location and rotation but also characterize the personalized body shape. To preserve the source information, such as texture, style, color, and face identity, we propose an Attentional Liquid Warping GAN with an Attentional Liquid Warping Block (AttLWB) that propagates the source information in both image and feature spaces to the synthesized reference. Specifically, the source features are extracted by a denoising convolutional auto-encoder to characterize the source identity well. Furthermore, our proposed method can support more flexible warping from multiple sources. To further improve the generalization ability to unseen source images, one/few-shot adversarial learning is applied. In detail, it first trains a model on an extensive training set, then fine-tunes the model on one/few unseen image(s) in a self-supervised way to generate high-resolution (512 × 512 and 1024 × 1024) results. We also build a new dataset, the Impersonator (iPER) dataset, for the evaluation of human motion imitation, appearance transfer, and novel view synthesis. Extensive experiments demonstrate the effectiveness of our methods in terms of preserving face identity, shape consistency, and clothes details.

Framework Overview


The training pipeline of our method. We randomly sample a pair of images from a video, denoting the source and the reference image as I_{s_i} and I_r. (a) A body mesh recovery module estimates the 3D mesh of each image and renders their correspondence maps, C_s and C_t. (b) The flow composition module first calculates the transformation flow T based on the two correspondence maps and their projected vertices in the image space. It then separates the source image I_{s_i} into a foreground image I^{ft}_{s_i} and a masked background I_{bg}. Finally, it warps the source image with the transformation flow T and produces a warped image I_{syn}. (c) In the final GAN module, the generator consists of three streams, which separately generate the background image hat{I}_{bg} via G_{BG}, reconstruct the source image hat{I}_s via G_{SID}, and synthesize the target image hat{I}_t under the reference condition via G_{TSF}. To preserve the details of the source image, we propose a novel LWB and AttLWB, which propagate the source features of G_{SID} into G_{TSF} at several layers and preserve the source information in terms of texture, style, and color.
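The warping step in (b) amounts to sampling the source image at flow-specified coordinates. The sketch below is a toy single-channel, nearest-neighbour version of that operation; the actual pipeline uses a differentiable bilinear sampler over multi-channel features:

```python
import numpy as np

def warp_by_flow(src, flow):
    """Warp a source image with a transformation flow T.

    src:  (H, W) source image (single channel for brevity)
    flow: (H, W, 2) for each target pixel, the (y, x) source
          coordinate to sample (nearest-neighbour here; the
          paper uses a differentiable bilinear sampler).
    """
    H, W = src.shape
    ys = np.clip(np.rint(flow[..., 0]).astype(int), 0, H - 1)
    xs = np.clip(np.rint(flow[..., 1]).astype(int), 0, W - 1)
    return src[ys, xs]

src = np.arange(16.0).reshape(4, 4)
# Identity flow shifted one pixel left in sampling coordinates:
# each target pixel reads its left neighbour, so the image
# content moves one pixel to the right.
yy, xx = np.meshgrid(np.arange(4), np.arange(4), indexing="ij")
flow = np.stack([yy, xx - 1], axis=-1).astype(float)
warped = warp_by_flow(src, flow)
```

In the real method, T comes from the correspondence between the two rendered mesh maps rather than being hand-written, but the sampling mechanics are the same.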



Illustration of our LWB and AttLWB. They share the structure illustrated in (b), but with separate AddWB (illustrated in (a)) or AttWB (illustrated in (c)). (a) is the structure of AddWB: through AddWB, widehat{X}_t^{l} is obtained by aggregating the warped source features and the features from G_{TSF}. (b) is the shared structure of the (Attentional) Liquid Warping Block: {X^{l}_{s_1}, X^{l}_{s_2}, …, X^{l}_{s_n}} are the feature maps of the different sources extracted by G_{SID} at the l-th layer, {T_{s_1 → t}, T_{s_2 → t}, …, T_{s_n → t}} are the transformation flows from the different sources to the target, and X^{l}_t is the feature map of G_{TSF} at the l-th layer. (c) is the architecture of AttWB: through AttWB, the final output features widehat{X}_t^{l} are obtained with SPADE by denormalizing the feature map from G_{TSF} with a weighted combination of the warped source features, produced by a bilinear sampler (BS) with respect to the corresponding flow T_{s_i → t}.
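The fusion described for AttWB can be sketched as follows, assuming the source features are already warped to the target: per-pixel attention weights combine the n warped source feature maps, and the fused result drives a SPADE-style denormalization of the target-stream features. The dot-product scoring and the 1×1 projections below are stand-ins for illustration, not the paper's exact learned operators:

```python
import numpy as np

def softmax(a, axis=0):
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attwb(warped_feats, x_t, gamma_proj, beta_proj):
    """Sketch of the attentional fusion in AttWB.

    warped_feats: (n, C, H, W) source features already warped to
                  the target by their flows T_{s_i -> t}
    x_t:          (C, H, W) target-stream (G_TSF) feature map
    gamma_proj, beta_proj: (C, C) toy 1x1 "convolutions" producing
                  the SPADE-style scale and shift from the fusion
    """
    # Per-pixel attention: score each source by similarity to x_t.
    scores = (warped_feats * x_t[None]).sum(axis=1)   # (n, H, W)
    w = softmax(scores, axis=0)[:, None]              # (n, 1, H, W)
    fused = (w * warped_feats).sum(axis=0)            # (C, H, W)
    # SPADE-style denormalization of the normalized target features.
    x_norm = (x_t - x_t.mean()) / (x_t.std() + 1e-5)
    gamma = np.einsum("dc,chw->dhw", gamma_proj, fused)
    beta = np.einsum("dc,chw->dhw", beta_proj, fused)
    return (1 + gamma) * x_norm + beta

np.random.seed(0)
n, C, H, W = 3, 4, 5, 5
warped = np.random.randn(n, C, H, W)
x_t = np.random.randn(C, H, W)
out = attwb(warped, x_t, 0.1 * np.eye(C), 0.1 * np.eye(C))
```

The attention weights sum to one over the sources at each pixel, so the block can lean on whichever source view best explains each target region.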

Network Architectures


The details of network architectures of our Attentional Liquid Warping GAN, including the generator and the discriminator. Here s represents the stride size in convolution and transposed convolution.

GitHub repo:

Paper:

iPER dataset:


Authors: Wen Liu, Zhixin Piao, Zhi Tu, Wenhan Luo, Lin Ma, Shenghua Gao
