Exploring SparseGPT: Enhancing Language Models Through Pruning
Chapter 1: Understanding Language Model Limitations
Language models like GPT-3 and LaMDA have showcased remarkable capabilities, yet they come with significant limitations. Rumors about GPT-4's enormous parameter count only add to the intrigue. However, more parameters also mean more storage space and higher-performance hardware for training, which translates into higher costs and energy consumption.
The Generative Pretrained Transformer (GPT) family is among the most recognized language models, though it is not the sole contender. These language models (LMs) can perform a wide range of tasks, exhibiting impressive behaviors such as in-context learning. Unfortunately, training and inference remain prohibitively expensive.
The BigScience consortium aimed to democratize LM access by releasing BLOOM. Prior to BLOOM, major tech companies (like Google, Microsoft, NVIDIA, and OpenAI) monopolized the development of LMs. BLOOM is a substantial resource available on HuggingFace, but it comes with its own drawbacks: it has a staggering 176 billion parameters, requiring 320 GB of storage and "at least five A100 GPUs with 80GB of memory each for inference."
In contrast, StableDiffusion emerged as an open-source model that surpassed OpenAI's DALL-E and can be run on GPUs with a minimum of 8 GB of memory. This raises a crucial question: if open-source can succeed in text-to-image generation, can it do the same for language models?
The SparseGPT paper investigates that possibility. The authors point out that while many current compression methods rely on quantization (which reduces numerical precision and can cost accuracy), an alternative exists: pruning.
Pruning is not a novel concept; it originated with decision trees, where irrelevant branches are trimmed (there it also serves to mitigate overfitting). In neural networks, pruning involves either zeroing out individual weights (unstructured pruning) or removing entire rows/columns of weight matrices (structured pruning). However, these techniques typically require extensive retraining to recover the accuracy that is lost when weights are removed.
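To make the distinction concrete, here is a minimal PyTorch sketch of generic magnitude-based pruning (an illustration, not the paper's method): unstructured pruning zeroes individual small-magnitude weights, while structured pruning removes whole rows of a weight matrix. The layer size, sparsity level, and helper names are arbitrary.

```python
import torch

def unstructured_prune(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Zero out the smallest-magnitude individual weights until `sparsity` is reached."""
    k = int(weight.numel() * sparsity)
    threshold = weight.abs().flatten().kthvalue(k).values
    return weight * (weight.abs() > threshold)

def structured_prune_rows(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Zero out the rows with the smallest L2 norm (a simple structured scheme)."""
    n_prune = int(weight.shape[0] * sparsity)
    prune_idx = weight.norm(dim=1).argsort()[:n_prune]
    pruned = weight.clone()
    pruned[prune_idx] = 0.0
    return pruned

W = torch.randn(512, 512)  # toy weight matrix
print((unstructured_prune(W, 0.5) == 0).float().mean())    # ~0.50 of entries zeroed
print((structured_prune_rows(W, 0.5) == 0).float().mean())  # exactly 0.50, whole rows
```

Structured pruning shrinks the actual matrix dimensions (and thus compute), whereas unstructured pruning needs sparse kernels or storage formats to turn the zeros into real savings.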
To avoid retraining while pruning, we need an efficient one-shot method. Unfortunately, prior techniques have been too computationally intensive for models containing billions of parameters.
The authors introduce SparseGPT, the first effective one-shot pruning method tailored for models with 10 to 100 billion parameters. SparseGPT recasts pruning as a set of very large sparse regression problems, solved layer by layer with a novel approximate sparse regression solver; on the largest openly available GPT-style models (175B parameters) this takes only a few hours on a single GPU. Remarkably, accuracy loss after pruning is negligible, with no fine-tuning required.
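Concretely, for a layer with weights W and calibration inputs X, the layer-wise compression problem amounts to choosing a sparsity mask M and updated weights Ŵ that best reconstruct the dense layer's outputs:

```latex
\operatorname*{argmin}_{\mathbf{M},\,\hat{\mathbf{W}}}
\left\lVert \mathbf{W}\mathbf{X} - \left(\mathbf{M} \odot \hat{\mathbf{W}}\right)\mathbf{X} \right\rVert_2^2
```

Solving this jointly over the mask and the weights is exactly the hard part; SparseGPT's contribution is an approximate solver that makes it tractable at transformer scale.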
Chapter 2: The Mechanics of SparseGPT
The core contribution of this paper is the ability to prune 50-60% of a model's weights without significant accuracy loss, whereas previous methods suffered a performance collapse of more than 30%. The method is also compatible with further compression techniques, such as quantization. It operates locally, updating the weights of one layer at a time, and it generalizes to larger models, which can be sparsified further without compromising accuracy.
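As a rough illustration of how sparsity composes with quantization (a generic sketch under simple assumptions, not SparseGPT's own scheme), one can prune by magnitude and then quantize only the surviving weights to 8-bit integers:

```python
import torch

def prune_then_quantize_int8(weight: torch.Tensor, sparsity: float = 0.5):
    """Magnitude-prune to the target sparsity, then apply per-tensor symmetric
    int8 quantization to the remaining weights. Returns (int8 weights, scale)."""
    k = int(weight.numel() * sparsity)
    threshold = weight.abs().flatten().kthvalue(k).values
    sparse_w = weight * (weight.abs() > threshold)   # pruned entries stay exactly zero
    scale = sparse_w.abs().max() / 127.0             # one scale for the whole tensor
    q = torch.clamp((sparse_w / scale).round(), -127, 127).to(torch.int8)
    return q, scale

W = torch.randn(1024, 1024)
q, scale = prune_then_quantize_int8(W)
W_hat = q.float() * scale                            # dequantized, still ~50% zeros
print(f"sparsity kept: {(W_hat == 0).float().mean():.2f}")
```

The two savings multiply: roughly half of the weights vanish, and the ones that remain shrink from 16 bits to 8.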
The real challenge is to optimize both the weights of a layer and its pruning mask jointly, which is an NP-hard problem for layers of any reasonable width. Exact solutions are therefore out of reach, and existing approximate solvers such as AdaPrune remain efficient only for models with a few million parameters, far short of the billions in modern LMs.
The optimization process requires several careful approximations to mitigate complexity and runtime, making it applicable to transformer architectures. The strategy involves independent pruning of the different rows of the weight matrix, which introduces its own set of difficulties.
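The reason rows can be treated independently is that the squared reconstruction error of a layer decomposes over the rows of its weight matrix:

```latex
\left\lVert \mathbf{W}\mathbf{X} - \hat{\mathbf{W}}\mathbf{X} \right\rVert_2^2
= \sum_{i} \left\lVert \mathbf{W}_{i,:}\,\mathbf{X} - \hat{\mathbf{W}}_{i,:}\,\mathbf{X} \right\rVert_2^2
```

Each output row is thus its own sparse regression problem over the same inputs X; the difficulty the authors tackle is solving all of these rows efficiently when every row may end up with a different pruning mask.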
The authors evaluated the OPT model family and BLOOM. OPT comes in sizes from 125M to 175B parameters, which makes it convenient to study how pruning behaves as models scale. They measured perplexity on WikiText2, a standard language-modeling benchmark, after applying SparseGPT with unstructured sparsity across the various model sizes. As a baseline, simple magnitude pruning causes perplexity to collapse at these sparsity levels, rendering it unfeasible for large LMs.
In contrast, SparseGPT reveals a promising trend: larger models are significantly easier to sparsify. This is likely due to their higher level of over-parameterization and greater noise resilience. The authors suggest that further exploration of this phenomenon could yield valuable insights.
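For reference, here is a minimal sketch of how WikiText2 perplexity is typically computed with the HuggingFace libraries; the small facebook/opt-125m checkpoint, the 2048-token window, and the non-overlapping stride are illustrative choices, not the paper's exact evaluation setup.

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-125m"  # small OPT variant, used here only for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

# Concatenate the WikiText2 test split into one long token stream.
test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
encodings = tokenizer("\n\n".join(test["text"]), return_tensors="pt")
max_len, window = encodings.input_ids.size(1), 2048

nlls = []
for start in range(0, max_len, window):
    end = min(start + window, max_len)
    if end - start < 2:  # skip a trailing window too short to score
        break
    input_ids = encodings.input_ids[:, start:end]
    with torch.no_grad():
        # With labels == input_ids, the model returns the mean next-token NLL.
        loss = model(input_ids, labels=input_ids).loss
    nlls.append(loss * (end - start))

ppl = torch.exp(torch.stack(nlls).sum() / max_len)
print(f"WikiText2 perplexity: {ppl.item():.2f}")
```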
Chapter 3: Results and Future Directions
The authors closely examined the largest LMs, BLOOM and OPT-175B, observing that SparseGPT could achieve reasonable perplexity even at 80% sparsity. Notably, SparseGPT enables the removal of approximately 100 billion weights (roughly 60% of the ~175 billion parameters) from these models with minimal effect on accuracy.
In concluding their study, the authors demonstrated a novel algorithm that prunes a large fraction of a model's weights (up to 50-60%) without compromising performance, potentially lowering the computational requirements for using large LMs. This advancement can also decrease inference costs, since on the order of 100 billion parameters are effectively set to zero.
Another intriguing outcome is that larger models exhibit enhanced sparsification capabilities. As the number of parameters rises, the relative accuracy reduction for sparse models diminishes to the extent that inducing 50% sparsity results in nearly no accuracy decline for the largest models.
Some literature suggests that these large models may actually be under-fitted, which leaves open the question of whether the ability to zero out so many parameters reflects genuine over-parameterization or is instead a by-product of that underfitting.
The authors propose several avenues for future exploration, including investigating fine-tuning methods for large-scale models to enhance accuracy recovery. They hypothesize that achieving 80-90% sparsity through progressive pruning and fine-tuning is feasible. They also plan to investigate the application of their methods during training to reduce the computational burden associated with pre-training these massive models.
While we wait for an open-source release of the pruned models to test, you can explore my other articles and connect with me on LinkedIn. Claps, shares, and subscriptions are always appreciated and help you stay updated on future publications. For those interested in machine learning and AI resources, a GitHub repository is forthcoming.