Introduction
Variational Autoencoders (VAEs) are generative models designed to capture the underlying probability distribution of a dataset and to generate novel samples from it. They use an encoder-decoder architecture: the encoder transforms input data into a latent representation, and the decoder aims to reconstruct the original data from that representation. A VAE is trained to minimize the dissimilarity between the original and reconstructed data, which lets it model the underlying data distribution and generate new samples that conform to it.
One notable advantage of VAEs is their ability to generate new data samples resembling the training data. Because the VAE’s latent space is continuous, the decoder can generate new data points that seamlessly interpolate among the training data points. VAEs find applications in various domains like density estimation and text generation.
This article was published as a part of the Data Science Blogathon.
The Architecture of Variational Autoencoder
A VAE typically has two major components: an encoder network and a decoder network. The encoder network transforms the input data into a low-dimensional latent space; the resulting representation is often called the “latent code”.
Various neural network topologies, such as fully connected or convolutional neural networks, can be used to implement the encoder network. The architecture chosen depends on the characteristics of the data. Rather than a single point, the encoder network outputs the parameters of a distribution, such as the mean and variance of a Gaussian, from which the latent code is sampled.
Similarly, researchers can construct the decoder network using various types of neural networks, and its objective is to reconstruct the original data from the provided latent code.
Figure: Example of VAE architecture
A VAE comprises an encoder network that maps input data to a latent code and a decoder network that performs the inverse operation, translating the latent code back into reconstructed data. Through training, the VAE learns a latent representation that captures the fundamental characteristics of the data, enabling accurate reconstruction.
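The sampling of a latent code from the encoder's mean and variance is commonly implemented with the reparameterization trick, z = mean + sigma * eps with eps drawn from a standard normal. The following is a minimal pure-Python sketch of that idea, not the code used later in this article; the function name `sample_latent` and the example encoder outputs are illustrative assumptions.

```python
import math
import random

def sample_latent(mean, log_var, rng=random):
    """Draw z = mean + sigma * eps element-wise, with eps ~ N(0, 1)."""
    z = []
    for m, lv in zip(mean, log_var):
        sigma = math.exp(0.5 * lv)   # standard deviation from the log-variance
        eps = rng.gauss(0.0, 1.0)    # noise that does not depend on the encoder
        z.append(m + sigma * eps)
    return z

# Hypothetical encoder outputs for a 3-dimensional latent space
mean = [0.5, -1.0, 2.0]
log_var = [0.0, -2.0, -20.0]         # the last dimension has almost zero variance

rng = random.Random(0)               # seeded generator for repeatable draws
z = sample_latent(mean, log_var, rng)
print(z)
```

Because the randomness is isolated in `eps`, the sample is a simple function of the mean and log-variance, which is what allows gradients to flow back into the encoder in a real implementation.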
Intuitions About the Regularization
In addition to the architecture, regularization of the latent code is a vital element of VAEs. It discourages overfitting by encouraging a smooth distribution over latent codes rather than letting the model simply memorize the training data.
The regularization not only helps generate new data samples that interpolate smoothly between training data points but also contributes to the VAE’s ability to generate novel data resembling the training data. Moreover, it discourages the decoder network from perfectly reconstructing the input data, promoting a more general data representation that enhances the VAE’s capacity for generating diverse samples.
Mathematically, the regularization is expressed by adding a Kullback-Leibler (KL) divergence term to the loss function. The encoder network outputs the parameters (e.g., mean and log-variance) of a Gaussian distribution from which the latent code is sampled. The loss function then includes the KL divergence between this learned distribution over the latent variables and a prior distribution, usually a standard normal distribution. This term encourages the distribution of the latent variables to stay close to the prior.
Here is the formula for the KL divergence, where the expectation is taken over z drawn from q(z|x):
KL(q(z|x) || p(z)) = E_{z~q(z|x)}[log q(z|x) − log p(z)]
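When q(z|x) is a Gaussian with independent dimensions (mean μ, log-variance log σ²) and the prior p(z) is a standard normal, this expectation has a well-known closed form: KL = ½ Σ (μ² + σ² − log σ² − 1). The sketch below evaluates that expression in plain Python; the function name `kl_to_standard_normal` is an illustrative assumption.

```python
import math

def kl_to_standard_normal(mean, log_var):
    """KL( N(mean, exp(log_var)) || N(0, 1) ), summed over latent dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0
                     for m, lv in zip(mean, log_var))

# A posterior identical to the prior has zero divergence
print(kl_to_standard_normal([0.0, 0.0], [0.0, 0.0]))  # 0.0

# Shifting one mean by 1 adds 0.5 to the divergence
print(kl_to_standard_normal([1.0, 0.0], [0.0, 0.0]))  # 0.5
```

The divergence grows as the mean moves away from zero or the variance moves away from one, which is how the term pulls the latent distribution toward the prior.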
In summary, the regularization incorporated in VAEs plays a crucial role in enhancing the model’s capacity to generate fresh data samples while mitigating the risk of overfitting the training data.
Mathematical Details of VAEs
Probabilistic Framework and Assumptions
The probabilistic framework of a VAE can be outlined as follows:
Latent Variables
Latent variables are quantities that are not directly observed. Introducing them lets a complex data distribution be represented through a model built from a simpler (typically exponential-family) conditional distribution of the observed variable. The model is characterized by a joint probability distribution over two variables, p(x, z): the variable x is visible in the dataset under consideration, while the variable z is not. The joint distribution factorizes as p(x, z) = p(x|z)p(z).
Observed Variables
We have an observed variable x, which is assumed to follow a likelihood distribution p(x|z) (for example, a Bernoulli distribution).
Likelihood Distribution
The likelihood p(x|z) is a function of two variables. If we fix the value of z, it is a proper probability distribution over x: it sums (or integrates) to 1 over all possible values of x. If we instead fix the observed x and view p(x|z) as a function of z, we obtain the likelihood function, which should not be regarded as a distribution over z. In most cases it does not have the properties of a distribution, such as summing to 1, although in certain scenarios it can formally meet those criteria.
The joint distribution of the latent and observed variables is p(x, z) = p(x|z)p(z). A joint probability distribution describes the probabilities of several random variables taken together.
A central goal of a VAE is to infer the true posterior distribution of the latent variables, denoted p(z|x). Because this posterior is generally intractable, a VAE employs an encoder network to learn an approximation q(z|x) to it.
Posterior Distribution
In Bayesian statistics, a posterior probability is the updated probability of an event in light of newly acquired information. It is obtained by updating the prior probability with Bayes’ theorem: p(z|x) = p(x|z)p(z) / p(x).
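As a small worked example of this update, consider a latent variable z that takes only two values. The sketch below applies Bayes' theorem, p(z|x) = p(x|z)p(z) / p(x), in plain Python; the prior and likelihood values are made up for illustration, and the function name `posterior` is an assumption.

```python
def posterior(prior, likelihood):
    """Return p(z|x) for each value of z, plus the evidence p(x)."""
    joint = [p * l for p, l in zip(prior, likelihood)]   # p(x, z) = p(x|z) p(z)
    evidence = sum(joint)                                # p(x) = sum over z of p(x, z)
    return [j / evidence for j in joint], evidence

prior = [0.7, 0.3]          # illustrative p(z=0), p(z=1)
likelihood_x = [0.2, 0.9]   # illustrative p(x|z=0), p(x|z=1) for one observed x

post, evidence = posterior(prior, likelihood_x)
print(post, evidence)
```

Although z=1 is less likely a priori, it explains the observation much better, so it ends up with the larger posterior probability. In a VAE the sum over z becomes an integral over a continuous latent space, which is what makes the exact posterior intractable.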
The VAE learns the model parameters by maximizing the Evidence Lower Bound (ELBO):
ELBO = E_{z~q(z|x)}[log p(x|z)] − KL(q(z|x) || p(z))
The ELBO consists of two terms. The first is the reconstruction term, which measures how well the VAE recovers the input data. The second is the KL divergence, which measures the difference between the approximate posterior distribution q(z|x) and the prior distribution p(z).
Within this probabilistic framework, a VAE assumes that each data point is generated from a latent variable drawn from a specified prior distribution. Training maximizes a lower bound on the likelihood of the input data, which at the same time brings the approximate posterior closer to the true posterior.
Variational Inference Formulation
The formulation of Variational Inference in a VAE is as follows:
- Approximate posterior distribution: We have an approximation of the posterior distribution q(z|x).
- True posterior distribution: We have the true posterior distribution p(z|x).
The aim is to find a distribution q(z|x) that approximates the true posterior p(z|x) as closely as possible, as measured by the KL divergence.
The KL divergence compares the two probability distributions, q(z|x) and p(z|x), and measures how much they differ.
This KL divergence cannot be computed directly because p(z|x) is unknown. Instead, during VAE training we minimize it indirectly by maximizing the evidence lower bound (ELBO): since log p(x) = ELBO + KL(q(z|x) || p(z|x)) and log p(x) does not depend on q, raising the ELBO lowers the KL divergence to the true posterior. The ELBO itself combines the reconstruction term, which assesses the model’s ability to reconstruct input data, with the KL divergence between the approximate posterior and the prior.
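The link between maximizing the ELBO and minimizing the KL divergence to the true posterior can be checked numerically through the identity log p(x) = ELBO + KL(q(z|x) || p(z|x)). The sketch below does this for a toy model with a two-valued latent variable, where the true posterior can be computed exactly; all probabilities and the function names `kl` and `elbo` are illustrative assumptions.

```python
import math

def kl(q, p):
    """KL(q || p) for two discrete distributions given as lists."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

def elbo(q, prior, likelihood):
    """E_q[log p(x|z)] - KL(q || prior) for one fixed observation x."""
    reconstruction = sum(qi * math.log(li) for qi, li in zip(q, likelihood))
    return reconstruction - kl(q, prior)

prior = [0.7, 0.3]        # illustrative p(z)
likelihood = [0.2, 0.9]   # illustrative p(x|z) for the observed x

evidence = sum(p * l for p, l in zip(prior, likelihood))            # p(x)
true_posterior = [p * l / evidence for p, l in zip(prior, likelihood)]

q = [0.5, 0.5]            # an arbitrary approximate posterior
gap = math.log(evidence) - elbo(q, prior, likelihood)

# The gap between log p(x) and the ELBO equals KL(q || true posterior),
# up to floating-point error
print(gap, kl(q, true_posterior))
```

Setting q equal to the true posterior closes the gap, so the ELBO then equals log p(x).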
Neural Networks in the Model
Neural networks are commonly used to implement VAEs: both the encoder and the decoder are neural networks. During training, the VAE adjusts the parameters of both networks to minimize two key components: the reconstruction error and the KL divergence between the variational distribution q(z|x) and the prior p(z). This optimization is usually carried out with stochastic gradient descent or a related optimization algorithm.
Variational Autoencoder Execution
Before getting into the configuration of a Variational Autoencoder (VAE), it is critical first to understand the fundamental concepts. While VAE implementation can be intricate, we can simplify learning by following a logical and coherent structure.
Our approach will involve gradually introducing the fundamental concepts and progressively delving into implementation details. We will adopt a hands-on approach to enhance comprehension and provide illustrative examples throughout the learning journey.
Data Preparation
The code below loads the MNIST dataset, a widely used dataset for machine learning and computer vision tasks. It comprises 60,000 training and 10,000 test grayscale images of handwritten digits (0-9), each 28×28 pixels, along with labels indicating the digit shown in each image. To prepare the input data for training, the code normalizes the pixel values to the range [0, 1] by dividing by 255. It then flattens each image into a 784-dimensional vector so that it can be fed to the dense layers of the model.
import tensorflow as tf
import numpy as np
# Load the MNIST training and test splits
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
# Normalize the pixel values to [0, 1]
x_train = x_train.astype('float32') / 255.
x_test = x_test.astype('float32') / 255.
# Flatten each 28x28 image into a 784-dimensional vector
x_train = x_train.reshape((-1, 28*28))
x_test = x_test.reshape((-1, 28*28))
Model Definition
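In the model, an encoder and a decoder work together. The encoder maps the input image to the latent space using two dense layers, the first with a ReLU activation function. The decoder takes the latent vector as input and reconstructs the original image using two dense layers. Note that the encoder in this code outputs a single latent vector per image rather than the mean and log-variance of a distribution, so the sampling step and the KL term described earlier are not part of this minimal model.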
input_dim = 28*28
hidden_dim = 512
latent_dim = 128
Encoder Architecture
encoder_input = tf.keras.Input(shape=(input_dim,))
encoder_hidden = tf.keras.layers.Dense(hidden_dim, activation='relu')(encoder_input)
latent = tf.keras.layers.Dense(latent_dim)(encoder_hidden)
encoder = tf.keras.Model(encoder_input, latent)
Decoder Architecture
decoder_input = tf.keras.Input(shape=(latent_dim,))
decoder_hidden = tf.keras.layers.Dense(hidden_dim, activation='relu')(decoder_input)
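# Sigmoid keeps the outputs in [0, 1], matching the normalized pixels and the binary cross-entropy loss
decoder_output = tf.keras.layers.Dense(input_dim, activation='sigmoid')(decoder_hidden)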
decoder = tf.keras.Model(decoder_input, decoder_output)
VAE Architecture
inputs = tf.keras.Input(shape=(input_dim,))
latent = encoder(inputs)
outputs = decoder(latent)
vae = tf.keras.Model(inputs, outputs)
Training the Model
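To train the model, we use the Adam optimizer and the binary cross-entropy loss function. Training is performed in mini-batches: for each batch, the loss is calculated and its gradients are backpropagated to update the weights. This process is repeated for a fixed number of epochs.
loss_fn = tf.keras.losses.BinaryCrossentropy()
optimizer = tf.keras.optimizers.Adam()
num_epochs = 50
batch_size = 128
# Build shuffled mini-batches from the training images
dataset = tf.data.Dataset.from_tensor_slices(tf.cast(x_train, tf.float32))
dataset = dataset.shuffle(buffer_size=1024).batch(batch_size)
for epoch in range(num_epochs):
    for x in dataset:
        with tf.GradientTape() as tape:
            reconstructed = vae(x)
            loss = loss_fn(x, reconstructed)
        # Backpropagate and update the weights once per mini-batch
        grads = tape.gradient(loss, vae.trainable_variables)
        optimizer.apply_gradients(zip(grads, vae.trainable_variables))
    # Report the loss of the last mini-batch in this epoch
    print(f'Epoch {epoch+1}/{num_epochs}, Loss: {loss.numpy():.4f}')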
Output:
Epoch 1: Loss - 0.3559
Epoch 2: Loss - 0.3550
.
.
.
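To make the reconstruction loss more concrete, binary cross-entropy treats each pixel as a probability and penalizes predictions that are far from the target value. The following pure-Python sketch shows the formula averaged over pixels. It illustrates the idea and is not the Keras implementation; the clipping constant `eps` and the example pixel values are assumptions.

```python
import math

def binary_cross_entropy(target, predicted, eps=1e-7):
    """Mean of -(t*log(p) + (1-t)*log(1-p)) over all pixels."""
    total = 0.0
    for t, p in zip(target, predicted):
        p = min(max(p, eps), 1.0 - eps)   # clip to avoid log(0)
        total += -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))
    return total / len(target)

pixels = [0.0, 1.0, 1.0, 0.0]   # a tiny 4-pixel "image"
good = [0.1, 0.9, 0.8, 0.2]     # reconstruction close to the target
bad = [0.9, 0.1, 0.2, 0.8]      # reconstruction far from the target

print(binary_cross_entropy(pixels, good))
print(binary_cross_entropy(pixels, bad))
```

The closer reconstruction yields the smaller loss, which is the quantity the training loop drives down.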
Generate Samples
To generate new images, we draw five random latent vectors from a standard normal distribution, giving latent_samples a shape of (5, latent_dim), and pass them through the decoder. A for loop then iterates five times and uses the subplot function to arrange the generated samples in a grid with one row and five columns.
import matplotlib.pyplot as plt
# Generate samples by decoding random latent vectors
latent_samples = tf.random.normal(shape=(5, latent_dim))
generated_samples = decoder(latent_samples)
# Plot the generated samples in one row of five images
for i in range(5):
    plt.subplot(1, 5, i+1)
    plt.imshow(generated_samples[i].numpy().reshape(28, 28), cmap='gray')
    plt.axis('off')
plt.show()
Output:
When you run this code, it displays a figure with five generated images intended to resemble the handwritten digits in MNIST. The images are arranged in a grid with one row and five columns and are shown in grayscale, using the ‘gray’ color map, with the axes hidden.
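Because the latent space is continuous, one can also move gradually from one latent vector to another and decode each intermediate point to obtain a smooth transition between two generated images. The sketch below only computes the intermediate latent vectors with linear interpolation; the function name `interpolate` and the two-dimensional example vectors are illustrative assumptions, and in practice each vector would be passed to the decoder.

```python
def interpolate(z_start, z_end, steps):
    """Return `steps` latent vectors evenly spaced from z_start to z_end (steps >= 2)."""
    path = []
    for i in range(steps):
        t = i / (steps - 1)   # goes from 0.0 to 1.0
        path.append([(1.0 - t) * a + t * b for a, b in zip(z_start, z_end)])
    return path

z_a = [0.0, 1.0]    # hypothetical latent vector of one image
z_b = [2.0, -1.0]   # hypothetical latent vector of another image

for z in interpolate(z_a, z_b, 5):
    print(z)
```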
Visualization of Latent Space
To gain insights into the latent space of a VAE, you can follow these steps:
- Utilize the VAE to encode the training data points, projecting them into the latent space.
- Employ a dimensionality reduction technique like t-SNE to map the high-dimensional latent space onto a 2D space suitable for visualization.
- Plot the data points in the 2D space, allowing for a visual exploration of the latent space.
By following this process, you can effectively visualize and comprehend the underlying structure and distribution of the latent space in the VAE.
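import matplotlib.pyplot as plt
import tensorflow as tf
from sklearn.manifold import TSNE
# Encode the training images into latent vectors
latent_vectors = encoder(x_train).numpy()
# Project the latent vectors to 2D with t-SNE (this can be slow on the full training set)
latent_2d = TSNE(n_components=2).fit_transform(latent_vectors)
# Plot the latent space, colored by digit label
plt.scatter(latent_2d[:, 0], latent_2d[:, 1], c=y_train, cmap='viridis')
plt.colorbar()
plt.show()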
Output:
Visualizing the latent space of a Variational Autoencoder (VAE) provides insight into the structure and organization of the data it was trained on. This technique offers a valuable means of understanding the underlying patterns and relationships within the data.
Conclusion
A variational autoencoder (VAE) is an enhanced form of an autoencoder that incorporates regularization to mitigate overfitting and to give the latent space properties suited to generating new data. As generative models, VAEs share a similar objective with generative adversarial networks. Like a conventional autoencoder, a VAE comprises an encoder and a decoder, and its training aims to minimize the reconstruction error between the encoded-decoded data and the original input, with the added regularization of the latent space.
Key Takeaways
- Variational autoencoders (VAEs) can learn to reconstruct and generate new samples from a provided dataset.
- By utilizing a latent space, VAEs can represent data continuously and smoothly, facilitating the generation of variations of the input data with smooth transitions.
- The architecture of a VAE consists of an encoder network that maps the input data to the latent space, a decoder network responsible for reconstructing the data from the latent space, and a loss function that combines a reconstruction loss and a regularization term.
- VAEs have demonstrated their utility in image generation, anomaly detection, and semi-supervised learning tasks.
Frequently Asked Questions
A. Variational autoencoders (VAEs) are probabilistic generative models built from two neural networks, an encoder and a decoder. The encoder network maps input data to a distribution over the latent space, and the decoder network reconstructs data from samples of that latent space.
A. One of the main benefits of VAEs is their ability to generate new data samples that closely resemble the training data. This is achieved through a continuous latent space, which enables the decoder to produce new data points that smoothly interpolate between the existing training data points.
A. A notable limitation of variational autoencoders is their tendency to produce blurry and unrealistic outputs. This issue is commonly attributed to the way they model the data distribution and to the form of the reconstruction loss.
A. GANs produce highly realistic images but can be challenging to train and work with. On the other hand, VAEs are generally easier to train but may not always achieve the same level of image quality as GANs.
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.
By Analytics Vidhya, July 26, 2023.