
Introducing LaserTune: An Efficient Fine-Tuning Algorithm for On-Edge Devices and On-Premise Cloud





A typical pipeline for building a large language model consists of four stages: (i) pre-training, (ii) supervised fine-tuning and alignment, (iii) task-specific fine-tuning, and (iv) inference. Task-specific fine-tuning is particularly crucial when adapting a model to a specific application while optimizing for efficiency. While pre-training and alignment are typically done on publicly available datasets, fine-tuning and inference often run on private data. This makes fine-tuning on resource-constrained devices attractive, but also significantly more challenging. Recently, DeepSeek announced several optimizations that increase training and inference speed. LaserTune pushes this boundary even further, making fine-tuning and inference possible on users' devices and in the local cloud.


Resource-constrained devices (e.g., laptops, smartphones, smartwatches) and on-premise cloud servers have limited hardware, software, and network capabilities, as well as heterogeneity in data, operating systems, and system availability. Key constraints include (1) processing power (GPU/CPU), (2) memory, (3) battery life, (4) network bandwidth, and (5) I/O interfaces. Optimizations that better harness the underlying hardware address some of these limitations, but successful fine-tuning on such devices requires both algorithmic improvements and hardware optimizations.


Central Philosophy

LaserTune was designed with the philosophy that algorithmic improvements should be independent of computer architecture, model type, operating system, and hardware, so as not to limit the end user. As a result, LaserTune can run on any resource-constrained device, including but not limited to edge devices (e.g., laptops, iPads) and on-premise cloud servers.


Comparative Study

To compare fine-tuning algorithms, we set an accuracy goal and ask which algorithm reaches it with the fewest resources. This framing aligns with business needs, as different applications (e.g., medical or financial agents) require different accuracy levels. In this blog, we compare LaserTune against the state-of-the-art fine-tuning algorithms LoRA and RS-LoRA; we expect the trends to carry over to other models. The following setup is used for the comparison:

  1. Model: Qwen-2.5-7B (4-bit quantization)

  2. Hardware specification: 1x RTX 3090 (24 GB VRAM)

  3. Dataset used: Twitter Financial News Sentiment

  4. Hyper-parameter settings for LoRA and RS-LoRA were taken from the Hugging Face repository, assuming those are the best-tuned hyper-parameters (a minimal setup sketch follows this list).
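For readers who want to reproduce the baselines, here is a minimal sketch of how this setup might be assembled with the Hugging Face transformers and peft libraries. The hyper-parameter values and the dataset identifier are illustrative assumptions, not the exact repository settings, and LaserTune itself is not shown.

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit NF4 quantization, matching item (1) of the setup above.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B")
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B",
    quantization_config=bnb_config,
    device_map="auto",  # a single RTX 3090 (24 GB) suffices in 4-bit
)

# Dataset id is assumed; substitute the copy you use.
dataset = load_dataset("zeroshot/twitter-financial-news-sentiment")

# Illustrative LoRA hyper-parameters (not the tuned repository values).
# Setting use_rslora=True switches the adapter scaling to alpha / sqrt(r),
# i.e., the rank-stabilized (RS-LoRA) baseline.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
    use_rslora=False,  # True for RS-LoRA
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```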

Before presenting the comparative study, we point out a central question: most fine-tuning results are stated with respect to the training loss, but the real goal is accurate answers on unseen data. This question forms the basis of the next section.


What does proper fine-tuning mean?

A technique that fine-tunes on user data is not helpful if it overfits that data; the goal, as in any machine learning model, is to minimize the loss on unseen samples. In stochastic gradient descent (SGD), loss spikes, or "catapults," have been linked to improved generalization (ICML 2024). Since our training loss is small and these spikes are hard to track directly, we use the gradient norm as a substitute, as it is strongly correlated with training-loss spikes. We plot these curves to show that LaserTune outperforms LoRA and RS-LoRA in generalization. The gradient norm for RS-LoRA and LoRA is minimal (at most 6), making it barely visible in the plot.
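As a concrete reference, the gradient norm we plot can be computed with a few lines of PyTorch. This is a generic sketch of the metric, not LaserTune-specific code; the surrounding training loop is assumed.

```python
import torch

def global_grad_norm(model: torch.nn.Module) -> float:
    """L2 norm over all parameter gradients; call after loss.backward()."""
    norms = [
        p.grad.detach().norm(2)
        for p in model.parameters()
        if p.grad is not None
    ]
    return torch.norm(torch.stack(norms), 2).item()

# Inside a standard training loop:
#   loss.backward()
#   grad_norms.append(global_grad_norm(model))  # spikes here proxy loss "catapults"
#   optimizer.step()
#   optimizer.zero_grad()
```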



Comparative plot of the gradient norm for LaserTune, LoRA, and RS-LoRA


From the plot, we can see the following general trends: 

  1. LaserTune shows gradient behavior (occasional spikes) that indicates better generalization, a desirable feature for resource-constrained fine-tuning, or indeed for any training run.

  2. LoRA (and its variant RS-LoRA) exhibits almost no gradient spikes; the gradient norm stays nearly constant throughout training.


Training Loss and Memory Trade-off 

Ideally, one would improve all aspects (1)-(5) above, but the most pressing issues are the device's memory and compute-time constraints. Memory, in particular unified RAM (Mac M-series) and VRAM (PCs), is critical, as computation that spills to persistent storage is significantly slower than computation in RAM/VRAM. Consider two fine-tuning algorithms, A and B, under two typical constraints, GPU memory and compute time. This creates four possible scenarios:




  1. A uses less memory and less compute time than B: choose A.

  2. B uses less memory and less compute time than A: choose B.

  3. A uses less memory, but B uses less compute time: a trade-off.

  4. B uses less memory, but A uses less compute time: a trade-off.

The choice between the two algorithms is clear in the first two scenarios, where one algorithm uses both less space and less compute time. That kind of dominance is the ideal goal, and LaserTune achieves it.


Training Loss

The following plot compares the training loss of LaserTune with LoRA and its rank-stabilized variant, RS-LoRA. LaserTune shows a faster decay in training loss, reaching half the final loss of both LoRA and RS-LoRA. It also fine-tunes more quickly: RS-LoRA takes 40 minutes, while LaserTune takes only 25. In fact, LaserTune reaches a 2x reduction in training loss in less than one-third of RS-LoRA's fine-tuning time.
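This comparison is essentially a time-to-target measurement, matching the accuracy-goal framing from the comparative-study section. The sketch below shows one way to extract such numbers from logged training curves; the function name and the log values are hypothetical placeholders, not our actual measurements.

```python
from typing import Optional, Sequence, Tuple

def minutes_to_reach(
    log: Sequence[Tuple[float, float]],  # (elapsed_minutes, train_loss) pairs
    target_loss: float,
) -> Optional[float]:
    """Return the first elapsed time at which the logged loss hits the target."""
    for minutes, loss in log:
        if loss <= target_loss:
            return minutes
    return None  # target never reached within the run

# Hypothetical logs, for illustration only:
lasertune_log = [(5, 1.2), (10, 0.7), (13, 0.5), (25, 0.45)]
rslora_log = [(10, 1.5), (20, 1.2), (40, 1.0)]
print(minutes_to_reach(lasertune_log, 1.0))  # -> 10
print(minutes_to_reach(rslora_log, 1.0))     # -> 40
```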



Comparative plot of the training loss for LaserTune, LoRA, and RS-LoRA


Space Requirements

The second key constraint when fine-tuning on edge devices is memory. If fine-tuning consumes all available memory, it slows down the entire system, making memory optimization essential. LaserTune uses significantly less memory than both baselines: RS-LoRA uses 6.2 GB more (about 162% of LaserTune's usage) and LoRA uses 5.8 GB more (about 158%). This implies a similar reduction in bandwidth requirements in distributed settings.
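For reference, peak-VRAM figures like these can be collected with PyTorch's built-in CUDA memory statistics. The sketch below is generic measurement code under that assumption, not part of LaserTune, and it only tracks allocations made through PyTorch's caching allocator.

```python
import torch

# Reset the high-water mark before the run being measured.
torch.cuda.reset_peak_memory_stats()

# ... run one fine-tuning epoch here ...

# Peak memory allocated by PyTorch during the run, in GB.
peak_gb = torch.cuda.max_memory_allocated() / 1024**3
print(f"Peak GPU memory: {peak_gb:.1f} GB")
```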



Comparative plot of GPU memory usage for LaserTune, RS-LoRA, and LoRA



Conclusion

Fine-tuning on resource-constrained devices requires balancing memory efficiency and compute time. Our findings indicate the following:

  1. Memory-efficient methods like LaserTune are preferable for consumer devices.

  2. LaserTune exhibits better generalization behavior during fine-tuning than LoRA and RS-LoRA.

  3. Even though LoRA variants like rank-stabilized LoRA are more efficient than standard fine-tuning approaches, they are still significantly slower than LaserTune.


Stay tuned for more updates on LaserTune and efficient ML fine-tuning strategies!

