Cartoon World

  • A cartoonizer that turns your real-life images or videos into high-quality cartoons using the power of GANs and the VGG16 architecture
  • INTRODUCTION

    • Cartoons are a very popular art form that has been widely applied in diverse scenes, from publication in printed media to storytelling for children. Some cartoon artwork is created from real-world scenes; however, manually re-creating such scenes is very laborious and requires refined skills.

      The evolution in the field of Machine Learning has expanded the possibilities of creating visual arts. Some famous products have been created by turning real-world photography into usable cartoon scene materials, where the process is called image cartoonization.

      White box cartoonization is a method that reconstructs high-quality real-life pictures into exceptional cartoon images using the GAN framework.

  • FLOWCHART OF WHITE-BOX-CARTOONIZATION MODEL:
  • ARCHITECTURE OF WBC MODEL:

  • The figure above shows the architecture of the generator and discriminator networks. The generator is a fully convolutional U-Net-like network. We use convolution layers with stride 2 for down-sampling and bilinear interpolation layers for up-sampling to avoid checkerboard artefacts. The network consists of only three kinds of layers: convolution, Leaky ReLU (LReLU) and bilinear resize layers. This enables it to be easily embedded in edge devices such as mobile phones. PatchGAN is adopted in the discriminator network, where the last layer is a convolution layer. Each pixel in the output feature map corresponds to a patch in the input image, with the patch size equal to the receptive field, and is used to judge whether the patch belongs to a cartoon image or a generated image. PatchGAN enhances the discriminative ability on details and accelerates training. Spectral normalization is placed after every convolution layer (except the last one) to enforce a Lipschitz constraint on the network and stabilize training.
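To make the layout concrete, here is a minimal PyTorch sketch of the networks described above. It is illustrative only: the channel widths, kernel sizes, and number of stages are assumptions, and the repository itself may use a different framework. It does, however, show the three layer types, the stride-2 down-sampling, the bilinear up-sampling, and spectral normalization on every discriminator convolution except the last.

```python
import torch.nn as nn
import torch.nn.functional as F

class Generator(nn.Module):
    """U-Net-like generator built from only convolution, LReLU and bilinear resize layers."""
    def __init__(self, ch=32):
        super().__init__()
        self.down1 = nn.Sequential(nn.Conv2d(3, ch, 3, stride=2, padding=1), nn.LeakyReLU(0.2))
        self.down2 = nn.Sequential(nn.Conv2d(ch, ch * 2, 3, stride=2, padding=1), nn.LeakyReLU(0.2))
        self.up1 = nn.Sequential(nn.Conv2d(ch * 2, ch, 3, padding=1), nn.LeakyReLU(0.2))
        self.out = nn.Conv2d(ch, 3, 3, padding=1)

    def forward(self, x):                      # x: (N, 3, H, W), H and W divisible by 4
        d1 = self.down1(x)                     # stride-2 convolutions for down-sampling
        d2 = self.down2(d1)
        u1 = F.interpolate(d2, scale_factor=2, mode="bilinear", align_corners=False)
        u1 = self.up1(u1) + d1                 # skip connection (U-Net-like)
        u2 = F.interpolate(u1, scale_factor=2, mode="bilinear", align_corners=False)
        return self.out(u2)                    # cartoonized image, same size as the input

def patch_discriminator(ch=32):
    """PatchGAN discriminator: the last layer is a convolution, so each output pixel
    scores one patch of the input; spectral norm wraps every conv except the last."""
    sn = nn.utils.spectral_norm
    return nn.Sequential(
        sn(nn.Conv2d(3, ch, 3, stride=2, padding=1)), nn.LeakyReLU(0.2),
        sn(nn.Conv2d(ch, ch * 2, 3, stride=2, padding=1)), nn.LeakyReLU(0.2),
        nn.Conv2d(ch * 2, 1, 3, padding=1),    # one real/fake logit per patch
    )
```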
  • METHODOLOGY

    • INTRODUCTION TO GENERATIVE ADVERSARIAL NETWORKS (GANs)
      • Generative adversarial networks (GANs) are an exciting recent innovation in machine learning. GANs are generative models: they create new data instances that resemble your training data. For example, GANs can create images that look like photographs of human faces, even though the faces don't belong to any real person.
      • Generative Adversarial Networks, or GANs for short, are an approach to generative modelling using deep learning methods, such as convolutional neural networks. Generative modelling is an unsupervised learning task in machine learning that involves automatically discovering and learning the regularities or patterns in input data in such a way that the model can be used to generate or output new examples that plausibly could have been drawn from the original dataset.
      • GANs are a clever way of training a generative model by framing the problem as a supervised learning problem with two sub-models: the generator model that we train to generate new examples, and the discriminator model that tries to classify examples as either real (from the domain) or fake (generated). The two models are trained together in an adversarial, zero-sum game until the discriminator model is fooled about half the time, meaning the generator model is generating plausible examples.
      • GANs are an exciting and rapidly changing field, delivering on the promise of generative models in their ability to generate realistic examples across a range of problem domains, most notably in image-to-image translation tasks such as translating photos of summer to winter or day to night, and in generating photorealistic photos of objects, scenes, and people that even humans cannot tell are fake.
    • DEEP CONVOLUTIONAL GENERATIVE ADVERSARIAL NETWORKS (DCGAN) :
      • The deep convolutional generative adversarial network, or DCGAN for short, is an extension of the GAN architecture for using deep convolutional neural networks for both the generator and discriminator models and configurations for the models and training that result in the stable training of a generator model.
      • The DCGAN is important because it suggested the constraints on the model required to effectively develop high-quality generator models in practice. This architecture, in turn, provided the basis for the rapid development of a large number of GAN extensions and applications.
    • ARCHITECTURE OF GAN
          • The architecture of a GAN has two basic elements: the generator network and the discriminator network. Each network can be any neural network, such as an Artificial Neural Network (ANN), a Convolutional Neural Network (CNN), a Recurrent Neural Network (RNN), or a Long Short Term Memory (LSTM). The discriminator has to have fully connected layers with a classifier at the end.
          • A generative adversarial network (GAN) has two parts:
            • The generator learns to generate plausible data. The generated instances become negative training examples for the discriminator.
            • The discriminator learns to distinguish the generator's fake data from real data. The discriminator penalizes the generator for producing implausible results.
        • When training begins, the generator produces fake data, and the discriminator quickly learns to tell that it's fake:
        • As training progresses, the generator gets closer to producing output that can fool the discriminator:
        • Finally, if generator training goes well, the discriminator gets worse at telling the difference between real and fake. It starts to classify fake data as real, and its accuracy decreases.
        • Here's a picture of the whole system:
        • Both the generator and the discriminator are neural networks. The generator output is connected directly to the discriminator input. Through backpropagation, the discriminator's classification provides a signal that the generator uses to update its weights.
        • THE DISCRIMINATOR
            • The discriminator in a GAN is simply a classifier. It tries to distinguish real data from the data created by the generator. It could use any network architecture appropriate to the type of data it's classifying.
            • DISCRIMINATOR TRAINING DATA :
            • The discriminator's training data comes from two sources:
            • Real data instances, such as real pictures of people. The discriminator uses these instances as positive examples during training.
            • Fake data instances created by the generator. The discriminator uses these instances as negative examples during training.
            • In Figure 1, the two "Sample" boxes represent these two data sources feeding into the discriminator. During discriminator training, the generator does not train. Its weights remain constant while it produces examples for the discriminator to train on.

            • TRAINING THE DISCRIMINATOR
            • The discriminator connects to two loss functions. During discriminator training, the discriminator ignores the generator loss and just uses the discriminator loss. We use the generator loss during generator training, as described in the next section.

              During discriminator training:

              1. The discriminator classifies both real data and fake data from the generator.
              2. The discriminator loss penalizes the discriminator for misclassifying a real instance as fake or a fake instance as real.
              3. The discriminator updates its weights through backpropagation from the discriminator loss through the discriminator network.
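A minimal sketch of one discriminator update following the three steps above, written in PyTorch for illustration (the binary cross-entropy loss form and helper names are assumptions, not this repository's training code):

```python
import torch
import torch.nn.functional as F

def train_discriminator_step(G, D, d_optimizer, real_data, generator_input):
    """One discriminator update; the generator's weights stay constant."""
    d_optimizer.zero_grad()

    with torch.no_grad():                      # generator only produces examples here
        fake_data = G(generator_input)

    # 1. Classify both real data and fake data from the generator.
    real_logits = D(real_data)
    fake_logits = D(fake_data)

    # 2. Penalize misclassifying a real instance as fake or a fake instance as real.
    d_loss = (F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits)) +
              F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits)))

    # 3. Backpropagate the discriminator loss and update only the discriminator's weights
    #    (d_optimizer is assumed to hold D.parameters() alone).
    d_loss.backward()
    d_optimizer.step()
    return d_loss.item()
```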
            • THE GENERATOR
                • The generator part of a GAN learns to create fake data by incorporating feedback from the discriminator. It learns to make the discriminator classify its output as real.

                  Generator training requires tighter integration between the generator and the discriminator than discriminator training requires. The portion of the GAN that trains the generator includes:
                    • random input
                    • the generator network, which transforms the random input into a data instance
                    • the discriminator network, which classifies the generated data
                    • the discriminator output
                    • the generator loss, which penalizes the generator for failing to fool the discriminator

                • USING THE DISCRIMINATOR TO TRAIN THE GENERATOR
                    To train a neural net, we alter the net's weights to reduce the error or loss of its output. In our GAN, however, the generator is not directly connected to the loss that we're trying to affect. The generator feeds into the discriminator net, and the discriminator produces the output we're trying to affect. The generator loss penalizes the generator for producing a sample that the discriminator network classifies as fake. Backpropagation adjusts each weight in the right direction by calculating the weight's impact on the output — how the output would change if you changed the weight. But the impact of a generator weight depends on the impact of the discriminator weights it feeds into. So backpropagation starts at the output and flows back through the discriminator into the generator.

                    At the same time, we don't want the discriminator to change during generator training. Trying to hit a moving target would make a hard problem even harder for the generator. So we train the generator with the following procedure:

                    1. Sample random noise.
                    2. Produce generator output from sampled random noise.
                    3. Get discriminator "Real" or "Fake" classification for generator output.
                    4. Calculate loss from discriminator classification.
                    5. Backpropagate through both the discriminator and generator to obtain gradients.
                    6. Use gradients to change only the generator weights.
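The six steps can be sketched in PyTorch as follows (illustrative only, for the generic noise-to-data GAN described in this subsection; g_optimizer is assumed to hold only the generator's parameters, so step 6 leaves the discriminator untouched):

```python
import torch
import torch.nn.functional as F

def train_generator_step(G, D, g_optimizer, batch_size, noise_dim):
    g_optimizer.zero_grad()

    noise = torch.randn(batch_size, noise_dim)     # 1. sample random noise
    fake_data = G(noise)                           # 2. produce generator output
    fake_logits = D(fake_data)                     # 3. discriminator's real/fake verdict

    # 4. Generator loss: large when the discriminator confidently calls the output fake,
    #    so the generator is rewarded for being classified as "real".
    g_loss = F.binary_cross_entropy_with_logits(fake_logits, torch.ones_like(fake_logits))

    g_loss.backward()                              # 5. backpropagate through D and G
    g_optimizer.step()                             # 6. update only the generator weights
    return g_loss.item()
```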

                WHITE-BOX-CARTOONIZATION:

                  We propose to separately identify three white-box representations from images:
                  1. The surface representation
                  2. The structure representation
                  3. The texture representation
                • The surface representation contains a smooth surface of cartoon images.
                • The structure representation refers to the sparse colour blocks and flattens global content in the celluloid style workflow.
                • The texture representation reflects high-frequency texture, contours, and details in cartoon images.
                • A Generative Adversarial Network (GAN) framework is used to learn the extracted representations and to cartoonize images.

                SURFACE REPRESENTATION

              • The surface representation imitates the cartoon painting style in which artists roughly draw drafts with coarse brushes, producing smooth surfaces similar to those of cartoon images.
              • To smooth images while keeping the global semantic structure, a differentiable guided filter is adopted for edge-preserving filtering.
              • Edge-preserving filtering is an image processing technique that smooths away noise or textures while retaining sharp edges. Examples are the median, bilateral, guided, and anisotropic diffusion filters.
              • SURFACE LOSS FORMULA:

              • Lsurface (G, Ds) = log Ds (Fdgf (Ic, Ic)) + log (1 − Ds (Fdgf (G (Ip), G (Ip))))
              • Where,

                  G = Generator, Ds = Discriminator, Ic = Reference Cartoon Image, Ip = Input Photo, Fdgf = Differentiable guided filter: it takes an image I as input and itself as a guide map, and returns the extracted surface representation Fdgf (I, I) with textures and details removed.
                • Note: A discriminator Ds is introduced to judge whether model outputs and reference cartoon images have similar surfaces, and guide generator G to learn the information stored in the extracted surface representation.
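As an illustration of this branch (not the repository's exact code), the sketch below uses cv2.ximgproc.guidedFilter from opencv-contrib-python as a stand-in for the differentiable guided filter Fdgf, with the image acting as its own guide map, and writes the surface loss as the usual binary cross-entropy form of the two log terms above:

```python
import cv2                                   # requires opencv-contrib-python for cv2.ximgproc
import torch
import torch.nn.functional as F

def extract_surface(image, radius=5, eps=1e-2):
    """Fdgf(I, I): guided filtering with the image as its own guide map, removing
    textures and fine details while preserving the global structure.
    image: float32 array in [0, 1] of shape (H, W, 3)."""
    return cv2.ximgproc.guidedFilter(image, image, radius, eps)

def surface_loss(Ds, surface_of_cartoons, surface_of_outputs):
    """Lsurface(G, Ds) = log Ds(Fdgf(Ic, Ic)) + log(1 - Ds(Fdgf(G(Ip), G(Ip)))),
    written as binary cross-entropy over the surface discriminator's logits."""
    real_logits = Ds(surface_of_cartoons)    # surfaces extracted from reference cartoons Ic
    fake_logits = Ds(surface_of_outputs)     # surfaces extracted from generator outputs G(Ip)
    return (F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits)) +
            F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits)))
```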

                STRUCTURE REPRESENTATION

              • We first use the Felzenszwalb algorithm to segment images into separate regions. As superpixel algorithms only consider the similarity of pixels and ignore semantic information, we further introduce selective search to merge segmented regions and extract a sparse segmentation map.
              • Standard superpixel algorithms colour each segmented region with the average of its pixel values. By analysing the processed dataset, we found this lowers global contrast, darkens images, and causes a hazing effect in the final results. We thus propose an adaptive colouring algorithm.
              • Adaptive colouring formula:

                  Si,j = (θ1 ∗ S + θ2 ∗ Š)^µ

                  where
                    (θ1, θ2) = (0, 1)      if σ(S) < γ1
                    (θ1, θ2) = (0.5, 0.5)  if γ1 < σ(S) < γ2
                    (θ1, θ2) = (1, 0)      if γ2 < σ(S)

                We find that γ1 = 20, γ2 = 40 and µ = 1.2 generate good results.
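The structure branch can be sketched with scikit-image's Felzenszwalb segmentation as below. The README does not define S and Š, so this sketch assumes they are the region mean and region median, and it omits the selective-search merging step for brevity; treat it as an illustration of the adaptive colouring rule, not the exact implementation.

```python
import numpy as np
from skimage.segmentation import felzenszwalb

def adaptive_colour(region_pixels, gamma1=20.0, gamma2=40.0, mu=1.2):
    """Pick (theta1, theta2) from the region's standard deviation sigma(S), blend the two
    region statistics, and apply the exponent mu (on values scaled to [0, 1], an assumption).
    region_pixels: float32 array of shape (n_pixels, 3) with values in [0, 255]."""
    sigma = region_pixels.std()
    if sigma < gamma1:
        theta1, theta2 = 0.0, 1.0
    elif sigma < gamma2:
        theta1, theta2 = 0.5, 0.5
    else:
        theta1, theta2 = 1.0, 0.0
    s_bar = region_pixels.mean(axis=0)              # assumed meaning of S  (region mean)
    s_tilde = np.median(region_pixels, axis=0)      # assumed meaning of Š (region median)
    blended = (theta1 * s_bar + theta2 * s_tilde) / 255.0
    return 255.0 * blended ** mu

def extract_structure(image_rgb):
    """Fst: segment the image into regions, then fill each region adaptively.
    image_rgb: uint8 array of shape (H, W, 3)."""
    labels = felzenszwalb(image_rgb, scale=100, sigma=0.8, min_size=50)
    out = np.zeros_like(image_rgb, dtype=np.float32)
    for label in np.unique(labels):
        mask = labels == label
        out[mask] = adaptive_colour(image_rgb[mask].astype(np.float32))
    return np.clip(out, 0, 255).astype(np.uint8)
```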

                STRUCTURE LOSS FORMULA:

                Lstructure= || VGGn (G (Ip)) − VGGn (Fst (G (Ip))) ||

              • Where, G = Generator
              • Ip = Input Photo
              • Fst = Structure Representation Extraction.
              • Note: We use high-level features extracted by a pre-trained VGG16 network to enforce spatial constraints between our results and the extracted structure representation.
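A sketch of the structure loss using torchvision's pre-trained VGG16 (the choice of layer, here up to conv4_3, and the L1 norm are assumptions; the README only specifies "high-level features of a pre-trained VGG16"):

```python
import torch
import torchvision

# Frozen VGG16 feature extractor standing in for VGGn; inputs are assumed to be
# already normalized the way the chosen VGG16 weights expect.
_vgg_features = torchvision.models.vgg16(weights="IMAGENET1K_V1").features[:23].eval()
for p in _vgg_features.parameters():
    p.requires_grad_(False)

def structure_loss(generated, structure_rep):
    """Lstructure = || VGGn(G(Ip)) - VGGn(Fst(G(Ip))) ||
    generated:     G(Ip), generator outputs of shape (N, 3, H, W)
    structure_rep: Fst(G(Ip)), the extracted structure representation of the same shape."""
    return torch.mean(torch.abs(_vgg_features(generated) - _vgg_features(structure_rep)))
```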

                TEXTURE REPRESENTATION:

              • The high-frequency features of cartoon images are key learning objectives, but luminance and colour information make it easy to distinguish between cartoon images and real-world photos. We thus propose a random colour shift algorithm. The random colour shift can generate random intensity maps with luminance and colour information removed.
              • Frcs extracts a single-channel texture representation from colour images, retaining high-frequency textures while decreasing the influence of colour and luminance.
              • Frcs (Irgb) = (1 − α) (β1 ∗ Ir + β2 ∗ Ig + β3 ∗ Ib) + α ∗ Y

                Where Irgb represents 3-channel RGB colour images, Ir, Ig and Ib represent the three colour channels, and Y represents the standard grayscale image converted from the RGB colour image.

              • Note: We set α = 0.8, β1, β2 and β3 ∼ U(−1, 1).
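A NumPy sketch of the random colour shift with the parameters above (α = 0.8, β1, β2, β3 drawn from U(−1, 1)); the BT.601 grayscale weights for Y are an assumption, since the README only says "standard grayscale":

```python
import numpy as np

def random_colour_shift(image_rgb, alpha=0.8):
    """Frcs(Irgb) = (1 - alpha) * (b1*Ir + b2*Ig + b3*Ib) + alpha * Y
    image_rgb: float32 array of shape (H, W, 3); returns a single-channel intensity map."""
    r, g, b = image_rgb[..., 0], image_rgb[..., 1], image_rgb[..., 2]
    beta1, beta2, beta3 = np.random.uniform(-1.0, 1.0, size=3)
    y = 0.299 * r + 0.587 * g + 0.114 * b            # standard grayscale conversion (BT.601)
    return (1.0 - alpha) * (beta1 * r + beta2 * g + beta3 * b) + alpha * y
```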
              • TEXTURE LOSS FORMULA:

                Ltexture (G, Dt) = log Dt (Frcs (Ic)) + log (1 − Dt (Frcs (G (Ip))))

                • Where,

                  G = Generator, Dt = Discriminator, Ic = Reference Cartoon Image, Ip = Input Photo, Frcs = Random colour shift: extracts a single-channel texture representation from colour images, retaining high-frequency textures while decreasing the influence of colour and luminance.

                Demo

                Examples:

                result.1.mp4