# Low-rank Adaptation for Fast Text-to-Image Diffusion Fine-tuning
<p align="center">
<img src="contents/alpha_scale.gif">
</p>
> Using LoRA to fine-tune on an illustration dataset: $W = W_0 + \alpha \Delta W$, where $\alpha$ is the merging ratio. The GIF above scales $\alpha$ from 0 to 1: setting $\alpha$ to 0 is the same as using the original model, and setting $\alpha$ to 1 is the same as using the fully fine-tuned model.
<p align="center">
<img src="contents/lora_pti_example.jpg">
</p>
> SD 1.5 PTI on Kiriko, the game character, with various prompts.
<p align="center">
<img src="contents/disney_lora.jpg">
</p>
> `"baby lion in style of <s1><s2>"`, with disney-style LoRA model.
<p align="center">
<img src="contents/pop_art.jpg">
</p>
> `"superman, style of <s1><s2>"`, with pop-art style LoRA model.
## Main Features
- Fine-tune Stable Diffusion models twice as fast as the Dreambooth method, using Low-rank Adaptation
- Get an insanely small end result (1 MB ~ 6 MB), easy to share and download.
- Compatible with `diffusers`
- Support for inpainting
- Sometimes _even better performance_ than full fine-tuning (though extensive comparisons are left as future work)
- Merge checkpoints + Build recipes by merging LoRAs together
- Pipeline to fine-tune CLIP + UNet + token embeddings to gain better results.
- Out-of-the box multi-vector pivotal tuning inversion
# Lengthy Introduction
Thanks to the generous work of Stability AI and Hugging Face, many people have enjoyed fine-tuning Stable Diffusion models to fit their needs and generate higher-fidelity images. **However, the fine-tuning process is very slow, and it is not easy to find a good balance between the number of steps and the quality of the results.**
Also, the final result (a fully fine-tuned model) is very large. Some people work with textual inversion as an alternative, but this is clearly suboptimal: textual inversion only creates a small word embedding, and the final images are not as good as those of a fully fine-tuned model.
Well, what's the alternative? In the domain of LLMs, researchers have developed efficient fine-tuning methods. LoRA, in particular, tackles the very problem the community currently has: end users with the open-sourced Stable Diffusion model want to try the various fine-tuned models created by the community, but those models are too large to download and use. LoRA instead fine-tunes the "residual" of the model rather than the entire model: i.e., it trains $\Delta W$ instead of $W$.
$$
W' = W + \Delta W
$$
where we can further decompose $\Delta W$ into low-rank matrices: $\Delta W = A B^T$, where $A \in \mathbb{R}^{n \times d}$, $B \in \mathbb{R}^{m \times d}$, and $d \ll n$.
This is the key idea of LoRA: we fine-tune $A$ and $B$ instead of $W$. In the end, you get an insanely small model, as $A$ and $B$ are much smaller than $W$.
Also, not all of the parameters need tuning: the authors found that tuning only $Q, K, V, O$ (i.e., the attention layers) of the transformer model is often enough. (This is also why the end result is so small.) This repo follows the same idea.
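To make the decomposition concrete, here is a minimal sketch of the idea in PyTorch. This is illustrative only, not this repo's actual implementation; `LoRALinear`, `rank`, and `alpha` are names chosen for this example, with `rank` playing the role of $d$ above.
```python
import torch.nn as nn


class LoRALinear(nn.Module):
    """Wrap a frozen nn.Linear with a trainable low-rank residual (sketch)."""

    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 1.0):
        super().__init__()
        self.base = base
        self.base.requires_grad_(False)  # freeze W
        # down plays the role of B^T, up the role of A
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.normal_(self.down.weight, std=1.0 / rank)
        nn.init.zeros_(self.up.weight)  # training starts at Delta W = 0
        self.alpha = alpha

    def forward(self, x):
        # W x + alpha * A (B^T x), without ever materializing Delta W
        return self.base(x) + self.alpha * self.up(self.down(x))
```
Because only `down` and `up` are trainable (and saved), the end result stays in the 1 MB ~ 6 MB range.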
Now, how would we actually use this to update a diffusion model? First, we will use Stable Diffusion from [stability-ai](https://stability.ai/). Their model is nicely ported through the Hugging Face API, so this repo has built various fine-tuning methods around it. In detail, there are three subtle but important distinctions among the methods that make this work.
1. [Dreambooth](https://arxiv.org/abs/2208.12242)
First, there is LoRA applied to Dreambooth. The idea is to use prior-preservation class images to regularize the training process, and to use rare tokens. This keeps the model's generalization capability while maintaining high fidelity. If you turn off prior preservation and train the text encoder embedding as well, it becomes naive fine-tuning.
2. [Textual Inversion](https://arxiv.org/abs/2208.01618)
Second, there is textual inversion. There is no room to apply LoRA here, but it is worth mentioning. The idea is to instantiate a new token and learn its embedding via gradient descent. This is a very powerful method, and it is worth trying out if your use case is focused not on fidelity but on inverting conceptual ideas.
3. [Pivotal Tuning](https://arxiv.org/abs/2106.05744)
The last method (although originally proposed for GANs) takes the best of both worlds.
Simply put, you first apply textual inversion to get a matching token embedding. Then, you use that token embedding together with prior-preservation class images to fine-tune the model. This two-stage nature makes it a strict generalization of both methods.
Enough of the lengthy introduction, let's get to the code.
# Installation
```bash
pip install git+https://github.com/cloneofsimo/lora.git
```
# Getting Started
## 1. Fine-tuning Stable Diffusion with LoRA CLI
If you have over 12 GB of GPU memory, it is recommended to use the Pivotal Tuning Inversion CLI provided with the LoRA implementation. It has the best performance and will be updated many times in the future as well. These are the parameters that worked for various datasets. _ALL OF THE EXAMPLES ABOVE WERE TRAINED WITH THE PARAMETERS BELOW._
```bash
export MODEL_NAME="runwayml/stable-diffusion-v1-5"
export INSTANCE_DIR="./data/data_disney"
export OUTPUT_DIR="./exps/output_dsn"
lora_pti \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --instance_data_dir=$INSTANCE_DIR \
  --output_dir=$OUTPUT_DIR \
  --train_text_encoder \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --scale_lr \
  --learning_rate_unet=1e-4 \
  --learning_rate_text=1e-5 \
  --learning_rate_ti=5e-4 \
  --color_jitter \
  --lr_scheduler="linear" \
  --lr_warmup_steps=0 \
  --placeholder_tokens="<s1>|<s2>" \
  --use_template="style" \
  --save_steps=100 \
  --max_train_steps_ti=1000 \
  --max_train_steps_tuning=1000 \
  --perform_inversion=True \
  --clip_ti_decay \
  --weight_decay_ti=0.000 \
  --weight_decay_lora=0.001 \
  --continue_inversion \
  --continue_inversion_lr=1e-4 \
  --device="cuda:0" \
  --lora_rank=1
  # --use_face_segmentation_condition
```
[Check here to see what these parameters mean](https://github.com/cloneofsimo/lora/discussions/121).
## 2. Other Options
Basic usage is as follows: inject sets of $A, B$ matrices into a UNet model, and fine-tune them.
```python
import itertools

import torch.optim as optim
from diffusers import UNet2DConditionModel

from lora_diffusion import inject_trainable_lora, extract_lora_ups_down

...

unet = UNet2DConditionModel.from_pretrained(
    pretrained_model_name_or_path,
    subfolder="unet",
)
unet.requires_grad_(False)  # freeze the whole unet first

# Inject trainable LoRA parameters; only the injected A, B matrices
# will require gradients.
unet_lora_params, train_names = inject_trainable_lora(unet)

optimizer = optim.Adam(
    itertools.chain(*unet_lora_params, text_encoder.parameters()), lr=1e-4
)
```
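Once trained, only the injected $A, B$ parameters need to be saved. The repo provides a `save_lora_weight` helper for this; a short sketch (check the package for the exact signature):
```python
from lora_diffusion import save_lora_weight

# Persist only the trained LoRA matrices -- this is what keeps the
# resulting file in the low-megabyte range.
save_lora_weight(unet, "./lora_weight.pt")
```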
Another example of this, applied to [Dreambooth](https://arxiv.org/abs/2208.12242), can be found in `training_scripts/train_lora_dreambooth.py`. Run this example with:
```bash
training_scripts/run_lora_db.sh
```
Another Dreambooth example, with text encoder training enabled, can be run with:
```bash
training_scripts/run_lora_db_w_text.sh
```
## Loading, merging, and interpolating trained LoRAs with CLIs
We've seen that people have been merging different checkpoints with different ratios, and this seems to be very useful to the community. LoRA is extremely easy to merge.
By the nature of LoRA, one can interpolate between different fine-tuned models by adding different $A, B$ matrices.
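As a minimal sketch of what such merging does per layer, assuming each LoRA is stored as an `(up, down)` pair of tensors (illustrative arithmetic only, not the CLI's internal code):
```python
import torch


def merge_lora_into_weight(
    W: torch.Tensor, up: torch.Tensor, down: torch.Tensor, alpha: float = 1.0
) -> torch.Tensor:
    # W' = W + alpha * Delta W, with Delta W = up @ down
    return W + alpha * (up @ down)


def interpolate_loras(
    up_a: torch.Tensor, down_a: torch.Tensor,
    up_b: torch.Tensor, down_b: torch.Tensor, t: float,
) -> torch.Tensor:
    # Linearly interpolate two residuals for the same layer:
    # Delta W = (1 - t) * up_a @ down_a + t * up_b @ down_b
    return (1 - t) * (up_a @ down_a) + t * (up_b @ down_b)
```
The `lora_add` CLI operates on whole checkpoints, but the per-layer arithmetic is this simple.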
Currently, the LoRA CLI has three options: merge a full model with a LoRA, merge a LoRA with another LoRA, or merge a full model with a LoRA and convert it to `ckpt` format (the original format).
```
SYNOPSIS
lora_add PATH_1 PATH_2 OUTPUT_PATH <flags>
POSITIONAL ARGUMENTS