I’ve been playing around with methods such as prompt tuning and LoRA, which are parameter-efficient in that they only fine-tune a very small fraction (that is, <1%) of all parameters.

But for both methods, you still have to cache the intermediate gradients during backprop, meaning that you don’t really save on GPU memory during training (or at most a small amount, from not having to store optimizer states for the frozen layers). For instance, I’ve had LoRA reduce the GPU memory footprint of my custom model from 8.5GB -> 8.1GB, which is very minimal. The reduction in fine-tuning time also isn’t a major advantage: per-batch time for the same model dropped by only 20ms, from 210ms to 190ms.

This begs the question - what really is the practical reason for the popularity of parameter-efficient fine-tuning (e.g. prompt tuning w/ 1.6k+ citations) if it doesn’t really save on GPU memory and training time?

I can see two possible reasons (but I’m not really convinced they explain the ‘hype’ around parameter-efficient fine-tuning):

  1. The fine-tuned model checkpoint for the downstream task is very significantly smaller. For example, in prompt tuning, we only need to save the tiny trained soft prompt (a few megabytes) rather than a full copy of the changed model weights (many, many GBs) on our hard disk/SSD (see the sketch after this list).
    1. But from a practical point of view, I feel that most people suffer more from a lack of compute (e.g. GPU memory) than from a lack of hard disk space. In other words, training time and GPU memory consumption seem like more relevant concerns than saving on checkpoint storage space.
  2. The second is robustness to domain shifts (since we are preserving the majority of the original model’s weights rather than destructively re-learning them), which was mentioned in the prompt tuning paper but not so much in the LoRA paper.
    1. I could see this as a possible reason, but the gains in performance in the prompt tuning paper in the out-of-distribution setting are marginal at best, and LoRA doesn’t mention domain shifts.
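
To make reason 1 concrete, here is a minimal sketch of what the per-task checkpoint would look like, assuming `model` is already wrapped so that only the adapter/soft-prompt parameters have `requires_grad=True` (the filename is just a placeholder):

```python
import torch

# Persist only the trainable (adapter / soft-prompt) tensors, not the frozen base model.
trainable_state = {
    name: param.detach().cpu()
    for name, param in model.named_parameters()
    if param.requires_grad
}
torch.save(trainable_state, "task_adapter.pt")  # typically a few MB instead of many GB
```

At load time you keep one shared copy of the base model and apply the tiny per-task state dict on top.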

(EDIT - I’m also wondering if there is something else I’m missing to decrease GPU memory and runtime? I’ve heard of QLoRA, which adds 4-bit quantization of the base model on top of LoRA, so perhaps that’s a way to tackle memory efficiency for LoRA. But I don’t know if there’s anything comparable to reduce the memory footprint for prompt tuning?)
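
For reference, here’s a rough QLoRA-style sketch using Hugging Face transformers + peft + bitsandbytes; the model id and LoRA hyperparameters below are placeholders, not recommendations:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Load the frozen base model in 4-bit (NF4), then attach trainable LoRA adapters.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",  # placeholder model id
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # which modules get adapters is a design choice
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of parameters are trainable
```

The LoRA part still matters for memory here because the 4-bit base weights stay frozen; only the small adapter matrices get gradients and optimizer states.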

  • SouthernXBlend@alien.topB · 10 months ago

    Not a LoRA expert, but I’m guessing that your model itself is also sitting on your GPU. The majority of that 8.5GB footprint is probably just the model weights, meaning that LoRA actually is giving you a significant decrease in the extra GPU memory used during training.

    Try just loading your model and checking your GPU memory usage. If it’s ~8GB, LoRA is cutting your training memory usage from 0.5 to 0.1GB.
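
    Something like this (a toy stand-in model; the same two measurements work for any model) separates the static weight footprint from the training overhead:

    ```python
    import torch

    # Toy stand-in for the actual model; swap in your own model here.
    model = torch.nn.Sequential(*[torch.nn.Linear(4096, 4096) for _ in range(8)]).cuda()
    torch.cuda.synchronize()
    print(f"after load: {torch.cuda.memory_allocated() / 2**30:.2f} GiB")

    # One full fine-tuning step: optimizer states + gradients + activations all count.
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    x = torch.randn(32, 4096, device="cuda")
    model(x).sum().backward()
    optimizer.step()
    print(f"peak during step: {torch.cuda.max_memory_allocated() / 2**30:.2f} GiB")
    ```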

  • Becomefilthyrich@alien.topB · 10 months ago

    Have been working with this extensively for the past 6 months, specifically LLM and Whisper fine-tuning.

    LoRA definitely gives a significant boost in training speed as well as a massive reduction in memory requirements.

  • bjergerk1ng@alien.topB · 10 months ago

    I think the big win comes from combining LoRA with quantization (i.e. QLoRA) which you can’t normally do with full fine-tuning.

    • patricky168@alien.topOPB · 10 months ago

      Thanks - I was wondering though, for QLoRA what does the LoRA bit really do?

      Since I feel like there has been some success(?) in just quantizing the model and doing full fine-tuning, which already reduces memory consumption, does the LoRA part mainly assist in trying to “recover” the lost precision? Or does the LoRA part of QLoRA still significantly reduce memory vs., say, just 4-bit quantization + full fine-tuning?

  • MadScie254@alien.topB · 10 months ago

    The motivation for parameter-efficient fine-tuning lies in the ability to improve model performance without drastically increasing computational requirements. While it may not directly reduce runtime or GPU memory usage, it allows for better utilization of existing resources. By fine-tuning only a subset of the model parameters, we can achieve similar performance gains as full fine-tuning while minimizing the computational overhead. This approach is particularly useful when working with limited computing resources or when fine-tuning large models that would otherwise be impractical to train from scratch.

  • lorenmontez@alien.topB · 10 months ago

    I had the exact same observations and concerns in my projects. For developing a VLM, I have tested and confirmed that LoRA/adapters can lead to significantly better training efficiency and improved robustness, as OP suggested. For developing a 3D diffusion model, I found that LoRA has minimal advantages, so simply fine-tuning a smaller model can give better performance (larger batches help significantly in diffusion models).

  • koolaidman123@alien.topB · 10 months ago

    You’re doing something wrong. I’ve managed to reduce VRAM usage by >4x with LoRA on 7B LLaMA models, from 160GB of VRAM down to 40GB.

    Performance is a separate issue, but that’s the tradeoff for memory savings.

  • Forsaken-Data4905@alien.topB · 10 months ago

    The point is that the adapted (frozen) layers have a significantly higher parameter count than the adapter layers, which is where the huge memory savings come from. You never take gradients with respect to the adapted layers’ weights, only with respect to the adapter layers and whatever is left unfrozen of the original model.

    This is of course not necessarily true for smaller models.
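
    A toy illustration of that ratio (the layer sizes are arbitrary; the point is how few parameters actually receive gradients and optimizer states):

    ```python
    import torch

    # A big frozen "base" layer vs. a small trainable low-rank adapter.
    base = torch.nn.Linear(4096, 4096).requires_grad_(False)
    adapter = torch.nn.Linear(4096, 8, bias=False)

    params = list(base.parameters()) + list(adapter.parameters())
    trainable = sum(p.numel() for p in params if p.requires_grad)
    total = sum(p.numel() for p in params)
    print(f"trainable: {trainable:,} / {total:,} ({100 * trainable / total:.2f}%)")
    ```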

  • lightSpeedBrick@alien.topB · 10 months ago

    My understanding is that with LoRA you reduce the number of trainable parameters and therefore the memory needed to track optimizer states (e.g. Adam tracks 2 state tensors for each trainable parameter). This means that you need far less memory to fine-tune the model. Imagine 70B parameters * 4 bytes for fp32 training, plus 70B * 8 bytes for Adam. LoRA reduces that second part to, say, 1% of 70B * 8 bytes.
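
    A back-of-the-envelope version of that arithmetic (fp32 weights, two fp32 Adam states per trainable parameter; gradients are left out to match the numbers above):

    ```python
    params = 70e9
    weights_gb = params * 4 / 1e9            # 280 GB of fp32 weights
    adam_full_gb = params * 8 / 1e9          # 560 GB of Adam states for full fine-tuning
    adam_lora_gb = 0.01 * params * 8 / 1e9   # ~5.6 GB if only ~1% of parameters train
    print(weights_gb + adam_full_gb)         # ~840 GB
    print(weights_gb + adam_lora_gb)         # ~285.6 GB
    ```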

    You can also use gradient checkpointing, which isn’t specific to LoRA, to reduce memory consumption at the expense of training time. Here you cache only some of the intermediate activations and recompute the rest during back-prop.
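
    A minimal sketch of that idea with torch.utils.checkpoint (Hugging Face Transformers models expose the same thing via model.gradient_checkpointing_enable()):

    ```python
    import torch
    from torch.utils.checkpoint import checkpoint

    block = torch.nn.Sequential(torch.nn.Linear(256, 256), torch.nn.ReLU())
    x = torch.randn(8, 256, requires_grad=True)

    # Activations inside `block` are not kept; they are recomputed during backward.
    y = checkpoint(block, x, use_reentrant=False)
    y.sum().backward()
    ```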

    Can you explain what you mean by “caching intermediate gradients during backprop”? I’m not familiar with what that is.

    • patricky168@alien.topOPB · 10 months ago

      Yeah, what I mean is that even though LoRA only updates the adapters on the attention weights, we still need to backpropagate gradients through the downstream layers that aren’t being updated, and that takes GPU memory. So the only memory saved is from the optimizer states, if I’m not mistaken.
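
      A toy version of what I mean (the Linear layers just stand in for a LoRA adapter followed by a frozen downstream layer):

      ```python
      import torch

      adapter = torch.nn.Linear(16, 16)                        # trainable, stands in for a LoRA adapter
      frozen = torch.nn.Linear(16, 16).requires_grad_(False)   # frozen downstream layer

      x = torch.randn(4, 16)
      loss = frozen(adapter(x)).sum()
      loss.backward()                           # backward still traverses `frozen`,
                                                # so its input activations had to be kept
      print(adapter.weight.grad is not None)    # True
      print(frozen.weight.grad is None)         # True: no grads or optimizer states for frozen weights
      ```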

        • Maykey@alien.topB · 10 months ago

        This “only memory saved” amounts to throwing away 2 copies of the entire model. Pretty sweet deal.

  • CrysisAverted@alien.topB · 10 months ago

    So you froze the non-LoRA weights and you aren’t seeing a significant improvement in train time or memory usage during training?

  • gwern@alien.topB · 10 months ago

    For example, in prompt tuning, we only need to save the tiny trained soft prompt (~very few megabytes), rather than the entire changed model weights (~many, many GBs) on our hard disk/SSD. But from a practical point-of-view, I feel that most people suffer from a lack of compute (e.g. GPU memory) than hard disk space. In other words, it seems that training time and GPU memory consumption are more relevant concerns than saving on checkpoint storage space.

    I think you severely overestimate how many people are training a LoRA ever, and underestimate how many are using them (ie. downloading them). For every person who actually gets their hands dirty training their own LoRA and burning GPU, there’s probably >100 downloading it (often repeatedly) just as part of a set of LoRAs to generate their own images. Look at Civitai or Hugging Face bandwidth usage. It makes a huge difference to the vastly overwhelming majority of people if the checkpoint is 100 gigabytes or 0.001 gigabytes! And if you have to spend a terabyte of disk space to store the handful of mods you want to try out, too…