# Variational Autoencoders

Variational Autoencoders (VAEs) consist of two parts: an encoder and a decoder. The encoder maps a data sample to a latent representation, and the decoder maps that representation back to data space, reconstructing the sample. One use of the VAE, shown in [10], is to produce samples from a new distribution given samples from a known one, via a function g that can be learned from data. Figure 1 shows that when x is 2D and normally distributed, g(x) = x/10 + x/||x|| is roughly ring-shaped when a VAE is used.

As stated in [11], the main idea behind VAEs is to maximise the variational lower bound associated with a data point x. The marginal likelihood is a sum over the marginal likelihoods of the individual datapoints, and for any datapoint i it can be written as in (2), where θ is the generative parameter, φ is the variational parameter, q_φ(z|x) is used to obtain z and is viewed as the encoder or approximate inference network, and p_θ(x|z) is viewed as the decoder network:

$$\log p_\theta(x^{(i)}) = D_{KL}\big(q_\phi(z|x^{(i)}) \,\|\, p_\theta(z|x^{(i)})\big) + \mathcal{L}(\theta, \phi; x^{(i)}) \tag{2}$$

The KL divergence on the RHS of (2) measures how far the approximate posterior is from the true posterior, and it is non-negative since the KL divergence is always non-negative. The second term on the RHS is the variational lower bound on the marginal likelihood of datapoint i, so the relation can be reformulated as:

$$\log p_\theta(x^{(i)}) \ge \mathcal{L}(\theta, \phi; x^{(i)}) = \mathbb{E}_{q_\phi(z|x^{(i)})}\big[\log p_\theta(x^{(i)}, z) - \log q_\phi(z|x^{(i)})\big]$$

When the divergence of the approximate from the true posterior distribution is zero, equality holds in the above equation. The task is to differentiate and optimise the variational lower bound with respect to both θ and φ. Optimising with respect to φ is problematic, however, because the naive Monte Carlo gradient estimator has very high variance [11]. As stated in [12], a variational autoencoder trains the encoder to produce the parameters of q. If z is continuous, backpropagation through samples of z can be used to obtain the gradient with respect to φ.
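The ring-shaped transformation from Figure 1 can be checked numerically. The sketch below (a minimal illustration; the sample count and seed are arbitrary assumptions, not from the source) draws 2D standard-normal samples and applies g(x) = x/10 + x/||x||, confirming that the resulting radii concentrate near 1:

```python
import numpy as np

# Draw 2D samples x ~ N(0, I); sample count and seed are illustrative choices.
rng = np.random.default_rng(0)
x = rng.standard_normal((5000, 2))

# Apply the transformation g(x) = x/10 + x/||x|| from the source text.
norms = np.linalg.norm(x, axis=1, keepdims=True)
g = x / 10 + x / norms

# The transformed points are roughly ring-shaped: radii cluster near 1
# with a small spread (||g(x)|| = 1 + ||x||/10, since both terms are
# positive multiples of x).
radii = np.linalg.norm(g, axis=1)
print(radii.mean(), radii.std())
```

Because both terms of g point in the direction of x, the radius of each transformed point is exactly 1 + ||x||/10, so the output forms a thin ring slightly outside the unit circle.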
Learning consists of maximising the variational lower bound, and all of the expectations in it can be approximated using Monte Carlo sampling. Note also that the VAE framework can be extended to the importance weighted autoencoder, which can be formulated similarly to the non-weighted autoencoder [12]. As mentioned in [12], the main drawback of variational autoencoders is that samples from a VAE trained on images tend to be blurry. One possible cause of the blurriness is an intrinsic effect of maximum likelihood training, which minimises the KL divergence. Another possible cause is that the VAEs used in practice generally assume a Gaussian distribution for p. A further drawback of VAEs is that they tend to use only a small subset of the dimensions of z, indicating that the encoder was unable to transform most of the directions in input space into a space where the marginal distribution matches the factorized prior.
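A single-sample Monte Carlo estimate of the lower bound, with gradients enabled by backpropagating through samples of z, can be sketched as follows. This is a toy illustration only: the data point, the pretend encoder output, and the unit-variance Gaussian decoder p_θ(x|z) = N(x; z, I) are all assumptions made here, not details from the source.

```python
import numpy as np

rng = np.random.default_rng(0)

# One toy 2D "data point" and pretend encoder outputs for q_phi(z|x)
# (a diagonal Gaussian with mean mu and log-variance log_var).
x = rng.standard_normal(2)
mu, log_var = 0.5 * x, np.zeros(2)

# Backpropagation through samples of z: write z = mu + sigma * eps with
# eps ~ N(0, I), so z is a differentiable function of mu and log_var.
eps = rng.standard_normal(2)
z = mu + np.exp(0.5 * log_var) * eps

# Closed-form KL(q_phi(z|x) || p(z)) for a Gaussian q and standard-normal
# prior p(z) = N(0, I); this term needs no sampling.
kl = -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var))

# Single-sample Monte Carlo estimate of E_q[log p_theta(x|z)], under the
# assumed decoder p_theta(x|z) = N(x; z, I).
log_px_given_z = -0.5 * np.sum((x - z) ** 2 + np.log(2 * np.pi))

# Variational lower bound estimate for this data point.
elbo = log_px_given_z - kl
print(elbo)
```

In a real implementation the same computation would be written in an autodiff framework so that gradients of the bound flow through z into both θ and φ; the numpy version above only shows the structure of the estimator.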