Cartoon World

  • A cartoonizer that turns your real-life images or videos into high-quality cartoons using the power of GANs and the VGG16 architecture
  • INTRODUCTION

    • Cartoons are a very popular art form that has been widely applied in diverse scenes, from publication in printed media to storytelling for children. Some cartoon artwork is created from real-world scenes; however, manually re-creating such scenes is very laborious and requires refined skills.

      The evolution in the field of Machine Learning has expanded the possibilities of creating visual arts. Some famous products have been created by turning real-world photography into usable cartoon scene materials, where the process is called image cartoonization.

      White box cartoonization is a method that reconstructs high-quality real-life pictures into exceptional cartoon images using the GAN framework.

  • FLOWCHART OF WHITE-BOX-CARTOONIZATION MODEL:
  • ARCHITECTURE OF WBC MODEL:

  • The figure above shows the architecture of the generator and discriminator networks. The generator is a fully convolutional U-Net-like network. We use convolution layers with stride 2 for down-sampling and bilinear interpolation layers for up-sampling to avoid checkerboard artefacts. The network consists of only three kinds of layers: convolution, Leaky ReLU (LReLU) and bilinear resize layers. This enables it to be easily embedded in edge devices such as mobile phones. PatchGAN is adopted in the discriminator network, where the last layer is a convolution layer. Each pixel in the output feature map corresponds to a patch in the input image, with the patch size equal to the receptive field, and is used to judge whether the patch belongs to a cartoon image or a generated image. PatchGAN enhances the discriminative ability on details and accelerates training. Spectral normalization is placed after every convolution layer (except the last one) to enforce a Lipschitz constraint on the network and stabilize training.
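To make the layout concrete, here is a minimal PyTorch sketch of the networks described above. It is illustrative only: the channel widths, kernel sizes, and number of stages are assumptions, and the repository itself may use a different framework. It does, however, show the three layer types, the stride-2 down-sampling, the bilinear up-sampling, and spectral normalization on every discriminator convolution except the last.

```python
import torch.nn as nn
import torch.nn.functional as F

class Generator(nn.Module):
    """U-Net-like generator built from only convolution, LReLU and bilinear resize layers."""
    def __init__(self, ch=32):
        super().__init__()
        self.down1 = nn.Sequential(nn.Conv2d(3, ch, 3, stride=2, padding=1), nn.LeakyReLU(0.2))
        self.down2 = nn.Sequential(nn.Conv2d(ch, ch * 2, 3, stride=2, padding=1), nn.LeakyReLU(0.2))
        self.up1 = nn.Sequential(nn.Conv2d(ch * 2, ch, 3, padding=1), nn.LeakyReLU(0.2))
        self.out = nn.Conv2d(ch, 3, 3, padding=1)

    def forward(self, x):                      # x: (N, 3, H, W), H and W divisible by 4
        d1 = self.down1(x)                     # stride-2 convolutions for down-sampling
        d2 = self.down2(d1)
        u1 = F.interpolate(d2, scale_factor=2, mode="bilinear", align_corners=False)
        u1 = self.up1(u1) + d1                 # skip connection (U-Net-like)
        u2 = F.interpolate(u1, scale_factor=2, mode="bilinear", align_corners=False)
        return self.out(u2)                    # cartoonized image, same size as the input

def patch_discriminator(ch=32):
    """PatchGAN discriminator: the last layer is a convolution, so each output pixel
    scores one patch of the input; spectral norm wraps every conv except the last."""
    sn = nn.utils.spectral_norm
    return nn.Sequential(
        sn(nn.Conv2d(3, ch, 3, stride=2, padding=1)), nn.LeakyReLU(0.2),
        sn(nn.Conv2d(ch, ch * 2, 3, stride=2, padding=1)), nn.LeakyReLU(0.2),
        nn.Conv2d(ch * 2, 1, 3, padding=1),    # one real/fake logit per patch
    )
```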
  • METHODOLOGY

    • INTRODUCTION TO GENERATIVE ADVERSARIAL NETWORKS (GANs)
      • Generative adversarial networks (GANs) are an exciting recent innovation in machine learning. GANs are generative models: they create new data instances that resemble your training data. For example, GANs can create images that look like photographs of human faces, even though the faces don't belong to any real person.
      • Generative Adversarial Networks, or GANs for short, are an approach to generative modelling using deep learning methods, such as convolutional neural networks. Generative modelling is an unsupervised learning task in machine learning that involves automatically discovering and learning the regularities or patterns in input data in such a way that the model can be used to generate or output new examples that plausibly could have been drawn from the original dataset.
      • GANs are a clever way of training a generative model by framing the problem as a supervised learning problem with two sub-models: the generator model that we train to generate new examples, and the discriminator model that tries to classify examples as either real (from the domain) or fake (generated). The two models are trained together in an adversarial, zero-sum game until the discriminator model is fooled about half the time, meaning the generator model is generating plausible examples.
      • GANs are an exciting and rapidly changing field, delivering on the promise of generative models in their ability to generate realistic examples across a range of problem domains, most notably in image-to-image translation tasks such as translating photos of summer to winter or day to night, and in generating photorealistic photos of objects, scenes, and people that even humans cannot tell are fake.
    • DEEP CONVOLUTIONAL GENERATIVE ADVERSARIAL NETWORKS (DCGAN) :
      • The deep convolutional generative adversarial network, or DCGAN for short, is an extension of the GAN architecture for using deep convolutional neural networks for both the generator and discriminator models and configurations for the models and training that result in the stable training of a generator model.
      • The DCGAN is important because it suggested the constraints on the model required to effectively develop high-quality generator models in practice. This architecture, in turn, provided the basis for the rapid development of a large number of GAN extensions and applications.
    • ARCHITECTURE OF GAN
          • The architecture of a GAN has two basic elements: the generator network and the discriminator network. Each network can be any neural network, such as an Artificial Neural Network (ANN), a Convolutional Neural Network (CNN), a Recurrent Neural Network (RNN), or a Long Short Term Memory (LSTM). The discriminator has to have fully connected layers with a classifier at the end.
          • A generative adversarial network (GAN) has two parts:
            • The generator learns to generate plausible data. The generated instances become negative training examples for the discriminator.
            • The discriminator learns to distinguish the generator's fake data from real data. The discriminator penalizes the generator for producing implausible results.
        • When training begins, the generator produces fake data, and the discriminator quickly learns to tell that it's fake:
        • As training progresses, the generator gets closer to producing output that can fool the discriminator:
        • Finally, if generator training goes well, the discriminator gets worse at telling the difference between real and fake. It starts to classify fake data as real, and its accuracy decreases.
        • Here's a picture of the whole system:
        • Both the generator and the discriminator are neural networks. The generator output is connected directly to the discriminator input. Through backpropagation, the discriminator's classification provides a signal that the generator uses to update its weights.
        • THE DISCRIMINATOR
            • The discriminator in a GAN is simply a classifier. It tries to distinguish real data from the data created by the generator. It could use any network architecture appropriate to the type of data it's classifying.
            • DISCRIMINATOR TRAINING DATA :
            • The discriminator's training data comes from two sources:
            • Real data instances, such as real pictures of people. The discriminator uses these instances as positive examples during training.
            • Fake data instances created by the generator. The discriminator uses these instances as negative examples during training.
            • In Figure 1, the two "Sample" boxes represent these two data sources feeding into the discriminator. During discriminator training, the generator does not train. Its weights remain constant while it produces examples for the discriminator to train on.

            • TRAINING THE DISCRIMINATOR
            • The discriminator connects to two loss functions. During discriminator training, the discriminator ignores the generator loss and just uses the discriminator loss. We use the generator loss during generator training, as described in the next section.

              During discriminator training:

              1. The discriminator classifies both real data and fake data from the generator.
              2. The discriminator loss penalizes the discriminator for misclassifying a real instance as fake or a fake instance as real.
              3. The discriminator updates its weights through backpropagation from the discriminator loss through the discriminator network.
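A minimal sketch of one discriminator update following the three steps above, written in PyTorch for illustration (the binary cross-entropy loss form and helper names are assumptions, not this repository's training code):

```python
import torch
import torch.nn.functional as F

def train_discriminator_step(G, D, d_optimizer, real_data, generator_input):
    """One discriminator update; the generator's weights stay constant."""
    d_optimizer.zero_grad()

    with torch.no_grad():                      # generator only produces examples here
        fake_data = G(generator_input)

    # 1. Classify both real data and fake data from the generator.
    real_logits = D(real_data)
    fake_logits = D(fake_data)

    # 2. Penalize misclassifying a real instance as fake or a fake instance as real.
    d_loss = (F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits)) +
              F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits)))

    # 3. Backpropagate the discriminator loss and update only the discriminator's weights
    #    (d_optimizer is assumed to hold D.parameters() alone).
    d_loss.backward()
    d_optimizer.step()
    return d_loss.item()
```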
            • THE GENERATOR
                • The generator part of a GAN learns to create fake data by incorporating feedback from the discriminator. It learns to make the discriminator classify its output as real.

                  Generator training requires tighter integration between the generator and the discriminator than discriminator training requires. The portion of the GAN that trains the generator includes:
                    • random input
                    • the generator network, which transforms the random input into a data instance
                    • the discriminator network, which classifies the generated data
                    • the discriminator output
                    • the generator loss, which penalizes the generator for failing to fool the discriminator

                • USING THE DISCRIMINATOR TO TRAIN THE GENERATOR
                    To train a neural net, we alter the net's weights to reduce the error or loss of its output. In our GAN, however, the generator is not directly connected to the loss that we're trying to affect. The generator feeds into the discriminator net, and the discriminator produces the output we're trying to affect. The generator loss penalizes the generator for producing a sample that the discriminator network classifies as fake. Backpropagation adjusts each weight in the right direction by calculating the weight's impact on the output — how the output would change if you changed the weight. But the impact of a generator weight depends on the impact of the discriminator weights it feeds into. So backpropagation starts at the output and flows back through the discriminator into the generator.

                    At the same time, we don't want the discriminator to change during generator training. Trying to hit a moving target would make a hard problem even harder for the generator. So we train the generator with the following procedure:

                    1. Sample random noise.
                    2. Produce generator output from sampled random noise.
                    3. Get discriminator "Real" or "Fake" classification for generator output.
                    4. Calculate loss from discriminator classification.
                    5. Backpropagate through both the discriminator and generator to obtain gradients.
                    6. Use gradients to change only the generator weights.
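The six steps can be sketched in PyTorch as follows (illustrative only, for the generic noise-to-data GAN described in this subsection; g_optimizer is assumed to hold only the generator's parameters, so step 6 leaves the discriminator untouched):

```python
import torch
import torch.nn.functional as F

def train_generator_step(G, D, g_optimizer, batch_size, noise_dim):
    g_optimizer.zero_grad()

    noise = torch.randn(batch_size, noise_dim)     # 1. sample random noise
    fake_data = G(noise)                           # 2. produce generator output
    fake_logits = D(fake_data)                     # 3. discriminator's real/fake verdict

    # 4. Generator loss: large when the discriminator confidently calls the output fake,
    #    so the generator is rewarded for being classified as "real".
    g_loss = F.binary_cross_entropy_with_logits(fake_logits, torch.ones_like(fake_logits))

    g_loss.backward()                              # 5. backpropagate through D and G
    g_optimizer.step()                             # 6. update only the generator weights
    return g_loss.item()
```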

                WHITE-BOX-CARTOONIZATION:

                  We propose to separately identify three white-box representations from images:
                  1. The surface representation
                  2. The structure representation
                  3. The texture representation
                • The surface representation contains a smooth surface of cartoon images.
                • The structure representation refers to the sparse colour blocks and flattens global content in the celluloid style workflow.
                • The texture representation reflects high-frequency texture, contours, and details in cartoon images.
                • A Generative Adversarial Network (GAN) framework is used to learn the extracted representations and to cartoonize images.

                SURFACE REPRESENTATION

              • The surface representation imitates the cartoon painting style in which artists roughly draw drafts with coarse brushes, producing smooth surfaces similar to those of cartoon images.
              • To smooth images while keeping the global semantic structure, a differentiable guided filter is adopted for edge-preserving filtering.
              • Edge-preserving filtering is an image processing technique that smooths away noise or textures while retaining sharp edges. Examples are the median, bilateral, guided, and anisotropic diffusion filters.
              • SURFACE LOSS FORMULA:

              • Lsurface (G, Ds) = log Ds (Fdgf (Ic, Ic)) + log (1 − Ds (Fdgf (G (Ip), G (Ip))))
              • Where,

                  G = Generator, Ds = Discriminator, Ic = Reference Cartoon Image, Ip = Input Photo, Fdgf = Differentiable guided filter: it takes an image I as input and itself as a guide map, and returns the extracted surface representation Fdgf (I, I) with textures and details removed.
                • Note: A discriminator Ds is introduced to judge whether model outputs and reference cartoon images have similar surfaces, and guide generator G to learn the information stored in the extracted surface representation.
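As an illustration of this branch (not the repository's exact code), the sketch below uses cv2.ximgproc.guidedFilter from opencv-contrib-python as a stand-in for the differentiable guided filter Fdgf, with the image acting as its own guide map, and writes the surface loss as the usual binary cross-entropy form of the two log terms above:

```python
import cv2                                   # requires opencv-contrib-python for cv2.ximgproc
import torch
import torch.nn.functional as F

def extract_surface(image, radius=5, eps=1e-2):
    """Fdgf(I, I): guided filtering with the image as its own guide map, removing
    textures and fine details while preserving the global structure.
    image: float32 array in [0, 1] of shape (H, W, 3)."""
    return cv2.ximgproc.guidedFilter(image, image, radius, eps)

def surface_loss(Ds, surface_of_cartoons, surface_of_outputs):
    """Lsurface(G, Ds) = log Ds(Fdgf(Ic, Ic)) + log(1 - Ds(Fdgf(G(Ip), G(Ip)))),
    written as binary cross-entropy over the surface discriminator's logits."""
    real_logits = Ds(surface_of_cartoons)    # surfaces extracted from reference cartoons Ic
    fake_logits = Ds(surface_of_outputs)     # surfaces extracted from generator outputs G(Ip)
    return (F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits)) +
            F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits)))
```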

                STRUCTURE REPRESENTATION

              • We first use the Felzenszwalb algorithm to segment images into separate regions. As superpixel algorithms only consider the similarity of pixels and ignore semantic information, we further introduce selective search to merge segmented regions and extract a sparse segmentation map.
              • Standard superpixel algorithms colour each segmented region with the average of its pixel values. By analysing the processed dataset, we found this lowers global contrast, darkens images, and causes a hazing effect in the final results. We thus propose an adaptive colouring algorithm.
              • Adaptive colouring formula:

                  Si,j = (θ1 ∗ S + θ2 ∗ Š)^µ

                  where
                    (θ1, θ2) = (0, 1)      if σ(S) < γ1
                    (θ1, θ2) = (0.5, 0.5)  if γ1 < σ(S) < γ2
                    (θ1, θ2) = (1, 0)      if γ2 < σ(S)

                We find that γ1 = 20, γ2 = 40 and µ = 1.2 generate good results.
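The structure branch can be sketched with scikit-image's Felzenszwalb segmentation as below. The README does not define S and Š, so this sketch assumes they are the region mean and region median, and it omits the selective-search merging step for brevity; treat it as an illustration of the adaptive colouring rule, not the exact implementation.

```python
import numpy as np
from skimage.segmentation import felzenszwalb

def adaptive_colour(region_pixels, gamma1=20.0, gamma2=40.0, mu=1.2):
    """Pick (theta1, theta2) from the region's standard deviation sigma(S), blend the two
    region statistics, and apply the exponent mu (on values scaled to [0, 1], an assumption).
    region_pixels: float32 array of shape (n_pixels, 3) with values in [0, 255]."""
    sigma = region_pixels.std()
    if sigma < gamma1:
        theta1, theta2 = 0.0, 1.0
    elif sigma < gamma2:
        theta1, theta2 = 0.5, 0.5
    else:
        theta1, theta2 = 1.0, 0.0
    s_bar = region_pixels.mean(axis=0)              # assumed meaning of S  (region mean)
    s_tilde = np.median(region_pixels, axis=0)      # assumed meaning of Š (region median)
    blended = (theta1 * s_bar + theta2 * s_tilde) / 255.0
    return 255.0 * blended ** mu

def extract_structure(image_rgb):
    """Fst: segment the image into regions, then fill each region adaptively.
    image_rgb: uint8 array of shape (H, W, 3)."""
    labels = felzenszwalb(image_rgb, scale=100, sigma=0.8, min_size=50)
    out = np.zeros_like(image_rgb, dtype=np.float32)
    for label in np.unique(labels):
        mask = labels == label
        out[mask] = adaptive_colour(image_rgb[mask].astype(np.float32))
    return np.clip(out, 0, 255).astype(np.uint8)
```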

                STRUCTURE LOSS FORMULA:

                Lstructure= || VGGn (G (Ip)) − VGGn (Fst (G (Ip))) ||

              • Where, G = Generator
              • Ip = Input Photo
              • Fst = Structure Representation Extraction.
              • Note: We use high-level features extracted by a pre-trained VGG16 network to enforce spatial constraints between our results and the extracted structure representation.
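A sketch of the structure loss using torchvision's pre-trained VGG16 (the choice of layer, here up to conv4_3, and the L1 norm are assumptions; the README only specifies "high-level features of a pre-trained VGG16"):

```python
import torch
import torchvision

# Frozen VGG16 feature extractor standing in for VGGn; inputs are assumed to be
# already normalized the way the chosen VGG16 weights expect.
_vgg_features = torchvision.models.vgg16(weights="IMAGENET1K_V1").features[:23].eval()
for p in _vgg_features.parameters():
    p.requires_grad_(False)

def structure_loss(generated, structure_rep):
    """Lstructure = || VGGn(G(Ip)) - VGGn(Fst(G(Ip))) ||
    generated:     G(Ip), generator outputs of shape (N, 3, H, W)
    structure_rep: Fst(G(Ip)), the extracted structure representation of the same shape."""
    return torch.mean(torch.abs(_vgg_features(generated) - _vgg_features(structure_rep)))
```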

                TEXTURE REPRESENTATION:

              • The high-frequency features of cartoon images are key learning objectives, but luminance and colour information make it easy to distinguish between cartoon images and real-world photos. We thus propose a random colour shift algorithm. The random colour shift can generate random intensity maps with luminance and colour information removed.
              • Frcs extracts a single-channel texture representation from colour images, retaining high-frequency textures while decreasing the influence of colour and luminance.
              • Frcs (Irgb) = (1 − α) (β1 ∗ Ir + β2 ∗ Ig + β3 ∗ Ib) + α ∗ Y

                Where Irgb represents 3-channel RGB colour images, Ir, Ig and Ib represent the three colour channels, and Y represents the standard grayscale image converted from the RGB colour image.

              • Note: We set α = 0.8, β1, β2 and β3 ∼ U(−1, 1).
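A NumPy sketch of the random colour shift with the parameters above (α = 0.8, β1, β2, β3 drawn from U(−1, 1)); the BT.601 grayscale weights for Y are an assumption, since the README only says "standard grayscale":

```python
import numpy as np

def random_colour_shift(image_rgb, alpha=0.8):
    """Frcs(Irgb) = (1 - alpha) * (b1*Ir + b2*Ig + b3*Ib) + alpha * Y
    image_rgb: float32 array of shape (H, W, 3); returns a single-channel intensity map."""
    r, g, b = image_rgb[..., 0], image_rgb[..., 1], image_rgb[..., 2]
    beta1, beta2, beta3 = np.random.uniform(-1.0, 1.0, size=3)
    y = 0.299 * r + 0.587 * g + 0.114 * b            # standard grayscale conversion (BT.601)
    return (1.0 - alpha) * (beta1 * r + beta2 * g + beta3 * b) + alpha * y
```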
              • TEXTURE LOSS FORMULA:

                Ltexture (G, Dt) = log Dt (Frcs (Ic)) + log (1 − Dt (Frcs (G (Ip))))

                • Where,

                  G = Generator, Dt = Discriminator, Ic = Reference Cartoon Image, Ip = Input Photo, Frcs = Random colour shift: extracts a single-channel texture representation from colour images, retaining high-frequency textures while decreasing the influence of colour and luminance.

                Demo

                Examples:

                result.1.mp4