Mini basics

These are my simplified implementations of the building blocks of ML that I find worth implementing. Some details can only be learned through implementation.

Backlog

  • Forward mode differentiation with torch.func
  • Grouped Query Attention
  • Tokenizer

Done

Tensor parallelism

  • Minimal implementation of an MLP layer with tensor model parallelism in PyTorch
  • Reference: MegatronLM, tutorial
  • We need a different approach from data parallelism in order to fit a model bigger than a single GPU's memory. The idea of tensor parallelism is to split the model in a way that minimizes communication and keeps the GPUs compute-bound.
  • MLP(X) = Dropout(GeLU(XW)W')
  • The first weight is split column-wise. Given $Y = GeLU(XW)$, split W into columns ($W = [W_1, W_2]$) to get $Y = GeLU([XW_1, XW_2]) = [GeLU(XW_1), GeLU(XW_2)]$, which allows computing GeLU without synchronization.
  • The second weight is split row-wise: $YW' = Y_1W'_1 + Y_2W'_2$ where $Y = [Y_1, Y_2]$. The synchronization (an all-reduce) happens once, right before the dropout layer; see the sketch after this list.
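
The sketch below illustrates the idea in PyTorch: a column-parallel first linear, a row-parallel second linear, and a single all-reduce before dropout. It is a minimal illustration that assumes torch.distributed is already initialized; the class and variable names are mine, not this repo's.

```python
# Minimal sketch of a tensor-parallel MLP (illustrative names, not from this repo).
# Assumes torch.distributed is already initialized across `world_size` ranks.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.distributed as dist

class TensorParallelMLP(nn.Module):
    def __init__(self, d_model: int, d_hidden: int, dropout: float = 0.1):
        super().__init__()
        world_size = dist.get_world_size()
        assert d_hidden % world_size == 0
        shard = d_hidden // world_size
        # W split column-wise: this rank holds W_i of shape (d_model, d_hidden / world_size).
        self.w_in = nn.Linear(d_model, shard, bias=False)
        # W' split row-wise: this rank holds W'_i of shape (d_hidden / world_size, d_model).
        self.w_out = nn.Linear(shard, d_model, bias=False)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # GeLU is applied to the local shard only, so no communication is needed here.
        y_local = F.gelu(self.w_in(x))
        # Partial product Y_i W'_i; summing across ranks reconstructs Y W'.
        z_local = self.w_out(y_local)
        # The single synchronization point: all-reduce right before dropout.
        # (A full implementation wraps this in a custom autograd.Function, as Megatron does.)
        dist.all_reduce(z_local, op=dist.ReduceOp.SUM)
        return self.dropout(z_local)
```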

minigrad.py

minidiffusion.py

  • Minimal implementation of the forward and reverse diffusion processes.

  • Reference: CMU Generative AI course, tiny-diffusion, and Huggingface's Diffusers.

  • A diffusion model is a generative model for producing images, which means we need to learn the data distribution p(x).

  • The data generation process is assumed to involve latent variables that give rise to the observations: z -> x.

  • Once we introduce the latent variable, directly optimizing p(x) is intractable as it entails marginalizing out z, so we resort to maximizing a lower bound (the ELBO).

  • The forward and reverse processes are the exact specification of the z -> x relationship; a Markov assumption is made for both.

  • The forward process goes from x to z. It defines q(x_t|x_{t-1}) as adding Gaussian noise to the previous state. The mean and std are set such that the distribution at the end of the forward process approaches a Gaussian with mean 0 and covariance I.

  • The process of going from z to x is called the reverse process. The exact reverse process q(x_{t-1}|x_t) is intractable because it depends on the unknown data distribution; conditioning on the original data x_0 makes it tractable, so we instead work with q(x_{t-1}|x_t, x_0).

  • We then define a learned reverse process p(x_{t-1}|x_t) that behaves like the exact reverse process above but is not conditioned on the original image x_0. It lets us go from random noise z to an image x_0, and is parameterized with a neural network (e.g. a UNet).

  • The ELBO objective can be seen as matching the states of the exact reverse process and the learned reverse process.

  • The training procedure: at a given time step, we analytically compute x_t with the forward process (equivalent to iteratively applying Gaussian noise) and train the learned reverse process to match the exact reverse process.

  • Matching can be defined in several ways: matching the mean, the original image reconstructed from x_t, or the noise that gave rise to x_t.

  • In practice we sample only a few time steps per batch, since gradients from different time steps are highly correlated. A minimal sketch of the forward process and a training step follows below.
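
The sketch below shows the closed-form forward process and a noise-prediction training step. It assumes a DDPM-style linear beta schedule and an arbitrary noise-prediction network `model(x_t, t)`; all names are illustrative, not this repo's.

```python
# Minimal sketch of the DDPM forward process and a noise-prediction training step.
# Assumptions: linear beta schedule with T = 1000 steps; `model(x_t, t)` is any
# network (e.g. a UNet) that predicts the noise added to x_0.
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)          # noise schedule beta_t
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)      # abar_t = prod_{s<=t} alpha_s

def q_sample(x0: torch.Tensor, t: torch.Tensor, noise: torch.Tensor) -> torch.Tensor:
    """Closed-form forward process: x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * noise."""
    a_bar = alpha_bars.to(x0.device)[t].view(-1, *([1] * (x0.dim() - 1)))
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise

def training_step(model, x0: torch.Tensor) -> torch.Tensor:
    """Sample a random time step per example and regress the noise that produced x_t."""
    t = torch.randint(0, T, (x0.shape[0],), device=x0.device)
    noise = torch.randn_like(x0)
    x_t = q_sample(x0, t, noise)
    return F.mse_loss(model(x_t, t), noise)
```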

minirope.py

  • Implementation of RoPE.
  • Reference: Eq24 in the original Rotary Embedding paper, Huggingface implementation, lucidrain's implementation
  • RoPE injects positional information into the attention computation in every layer.
  • It "rotates" the query and key vectors with a matrix multiplication. The rotation matrix is block diagonal, meaning we are actually rotating multiple two-dimensional sub-vectors independently.
  • The block-diagonal matrix-vector product is efficiently implemented with element-wise multiplications.
  • Details: interleaving can be done with torch.stack or einops.rearrange (see the sketch below).
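
Below is a minimal sketch of RoPE applied to a query or key tensor of shape (..., seq_len, dim), using the interleaved pairing convention; the function name and defaults are illustrative.

```python
# Minimal sketch of rotary position embeddings with interleaved 2D pairs.
# Function name and argument defaults are illustrative.
import torch

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate each (even, odd) pair of dimensions of x by a position-dependent angle.

    x: query or key tensor of shape (..., seq_len, dim); dim must be even.
    """
    seq_len, dim = x.shape[-2], x.shape[-1]
    # One frequency per 2D pair: theta_i = base^(-2i / dim).
    inv_freq = base ** (-torch.arange(0, dim, 2, device=x.device, dtype=torch.float32) / dim)
    angles = torch.arange(seq_len, device=x.device, dtype=torch.float32)[:, None] * inv_freq
    cos, sin = angles.cos(), angles.sin()    # each of shape (seq_len, dim / 2)
    x1, x2 = x[..., 0::2], x[..., 1::2]      # the two halves of every 2D pair
    # Block-diagonal rotation as element-wise ops: (x1, x2) -> (x1 cos - x2 sin, x1 sin + x2 cos).
    out1 = x1 * cos - x2 * sin
    out2 = x1 * sin + x2 * cos
    # Interleave the rotated pairs back into the original layout with torch.stack.
    return torch.stack((out1, out2), dim=-1).flatten(-2)
```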
