Amitnikhade
January 10, 2023
Photo by Lucas Pezeta
Precision at a fraction of the size: Experience the power of quantization for your deep learning models
Model quantization is a technique for reducing the precision of the weights and activations of a neural network model. This process can be used to decrease the model's memory footprint and computational complexity, making it easier to deploy on resource-constrained devices such as smartphones and edge devices. Quantization can also speed up inference: fewer bits per weight and activation mean less memory traffic and cheaper arithmetic on hardware that supports low-precision operations. There are various approaches to quantization, including post-training quantization, quantization-aware training, and hybrid quantization. Overall, model quantization is a valuable tool that allows the deployment of large, complex models on a wide range of devices.
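To make the idea concrete, here is a minimal sketch (plain NumPy, not tied to any of the libraries discussed below) of 8-bit affine quantization: a float tensor is mapped to int8 values through a scale and a zero point, and mapped back with some rounding error. The function and variable names are illustrative, not from any particular framework.

import numpy as np

def quantize_int8(x):
    # Affine (asymmetric) quantization: map the observed float range onto [-128, 127].
    qmin, qmax = -128, 127
    scale = (x.max() - x.min()) / (qmax - qmin)
    zero_point = int(round(qmin - x.min() / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.int8)
    return q, scale, zero_point

def dequantize_int8(q, scale, zero_point):
    # Map the int8 values back to floats; the difference from the original is the quantization error.
    return (q.astype(np.float32) - zero_point) * scale

weights = np.random.randn(4, 4).astype(np.float32)
q, scale, zp = quantize_int8(weights)
print("max abs error:", np.abs(weights - dequantize_int8(q, scale, zp)).max())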
Model quantization is useful in situations where you need to deploy a deep learning model on a resource-constrained device, such as a mobile phone or an edge device. These devices often have limited memory and computational resources, making it difficult to run large, complex models. By quantizing the model, you can reduce the size of the model and the amount of resources required to run it, which makes it possible to deploy the model on these devices.
In addition to resource constraints, model quantization can also be useful in situations where you need to reduce the inference time of the model. By reducing the precision of the weights and activations, you can speed up the inference process, which can be important in real-time applications such as video streaming or online gaming.
Overall, model quantization is a powerful tool that can enable the deployment of large and complex models on resource-constrained devices, and can also be used to improve the performance of the model by reducing inference times.
Some potential drawbacks to using model quantization include:
- Loss of accuracy: lowering the precision of weights and activations introduces rounding error, and sensitive models can see a noticeable drop in quality. Quantization-aware training can reduce this, but it adds training cost.
- Limited hardware and framework support: not every operator, layer type, or target device runs efficiently in low precision, so the expected speedups and size savings do not always materialize.
- Extra engineering effort: choosing a quantization scheme, calibrating activation ranges, and re-validating the quantized model all add steps to the deployment pipeline.
Overall, while model quantization can be a useful tool in certain situations, it is important to carefully evaluate the trade-offs and consider whether it is the right approach for your use case.
There are a few alternatives to model quantization that can be used to reduce the size and computational complexity of a deep learning model:
- Pruning: removing weights (or whole channels) that contribute little to the output, leaving a sparser, smaller network; a minimal sketch follows this list.
- Knowledge distillation: training a smaller "student" model to mimic the outputs of a larger "teacher" model, as in DistilBERT.
- Low-rank factorization and weight sharing: approximating large weight matrices with smaller factors or reusing parameters across layers.
- Simply choosing a smaller architecture designed for mobile and edge deployment in the first place.
Overall, these alternatives can be effective in certain situations, but it is important to carefully evaluate the trade-offs and choose the right approach for your use case.
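As a quick illustration of one of these alternatives, here is a minimal sketch of magnitude pruning with PyTorch's torch.nn.utils.prune. The layer and the 30% sparsity level are arbitrary choices for the example, not a recommendation.

import torch
import torch.nn.utils.prune as prune

layer = torch.nn.Linear(128, 64)  # a stand-in layer for the example

# Zero out the 30% of weights with the smallest absolute value (L1 magnitude pruning).
prune.l1_unstructured(layer, name="weight", amount=0.3)

# Fold the pruning mask into the weight tensor to make it permanent.
prune.remove(layer, "weight")

sparsity = (layer.weight == 0).float().mean().item()
print(f"fraction of zero weights: {sparsity:.2f}")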
from transformers import BertTokenizer, BertModel
import torch
import os

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained("bert-base-uncased")

# Dynamically quantize all torch.nn.Linear layers to 8-bit integers.
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

def print_size_of_model(model):
    # Serialize the state dict to disk to measure its size on disk.
    torch.save(model.state_dict(), "temp.p")
    print('Size (MB):', os.path.getsize("temp.p") / 1e6)
    os.remove('temp.p')

print_size_of_model(model)
print_size_of_model(quantized_model)
Quantization is extremely useful for transformer-based models because they tend to be large, and deploying them can consume a lot of memory and compute, so it is important to optimize and compress the model before deployment. Using this code you can quantize BERT and other language models too.
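As a quick sanity check, the quantized model is used exactly like the original. The snippet below (a rough sketch continuing the code above, with an arbitrary example sentence and a naive timing loop) runs a forward pass through both models and compares latency.

import time

inputs = tokenizer("Quantization makes models smaller and faster.", return_tensors="pt")

def time_forward(m):
    # Average a few forward passes in inference mode.
    m.eval()
    with torch.no_grad():
        start = time.time()
        for _ in range(10):
            m(**inputs)
    return (time.time() - start) / 10

print("fp32 latency (s):", time_forward(model))
print("int8 latency (s):", time_forward(quantized_model))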
from fastT5 import (OnnxT5, get_onnx_runtime_sessions,
                    generate_onnx_representation, quantize)
from transformers import AutoTokenizer

model_or_model_path = 't5-small'

# Step 1. Convert the Hugging Face T5 model to ONNX.
onnx_model_paths = generate_onnx_representation(model_or_model_path)

# Step 2. (recommended) Quantize the converted model for fast inference and to reduce model size.
quant_model_paths = quantize(onnx_model_paths)

# Step 3. Set up ONNX Runtime sessions for the quantized encoder and decoder.
model_sessions = get_onnx_runtime_sessions(quant_model_paths)

# Step 4. Wrap the sessions in an OnnxT5 model that behaves like a regular T5 model.
model = OnnxT5(model_or_model_path, model_sessions)
Quantizing a T5 model has typically been a difficult task. fastT5 is a library that helps us overcome this problem: it exports the encoder and decoder to ONNX and quantizes them down to a smaller size.
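A short usage sketch, following the pattern in the fastT5 examples: the OnnxT5 model exposes the usual generate() API, so it drops into existing Hugging Face code. The example sentence and beam size here are arbitrary.

tokenizer = AutoTokenizer.from_pretrained(model_or_model_path)

t_input = "translate English to French: The universe is a dark forest."
token = tokenizer(t_input, return_tensors='pt')

# Generate with the quantized ONNX model just like a regular T5 model.
tokens = model.generate(input_ids=token['input_ids'],
                        attention_mask=token['attention_mask'],
                        num_beams=2)
print(tokenizer.decode(tokens.squeeze(), skip_special_tokens=True))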
import tensorflow as tf
import pathlib
from tensorflow import keras

# Load the trained Keras model.
model = keras.models.load_model('/content/model.h5')

# Convert to TFLite with the default optimizations (dynamic range quantization).
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model_quant = converter.convert()

# Write the quantized model to disk.
tflite_models_dir = pathlib.Path("/content/l")
tflite_models_dir.mkdir(exist_ok=True, parents=True)
tflite_model_quant_file = tflite_models_dir / "model_quant.tflite"
tflite_model_quant_file.write_bytes(tflite_model_quant)
TFLite (TensorFlow Lite) is TensorFlow's toolkit for converting large models into mobile-compatible ones, and its converter supports model quantization. PyTorch also provides a TFLite alternative, which is PyTorch Mobile.
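To sanity-check the converted file, it can be loaded with the TFLite interpreter. The sketch below feeds a random dummy input with whatever shape and dtype the original Keras model expects; treat it as a placeholder for real data.

import numpy as np
import tensorflow as tf

# Load the quantized model produced above.
interpreter = tf.lite.Interpreter(model_path=str(tflite_model_quant_file))
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()[0]
output_details = interpreter.get_output_details()[0]

# Feed a dummy input matching the expected shape and dtype.
dummy = np.random.random(tuple(input_details['shape'])).astype(input_details['dtype'])
interpreter.set_tensor(input_details['index'], dummy)
interpreter.invoke()

print(interpreter.get_tensor(output_details['index']))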
from optimum.onnxruntime import ORTQuantizer, ORTModelForSequenceClassification
from optimum.onnxruntime.configuration import AutoQuantizationConfig

model_id = "distilbert-base-uncased-finetuned-sst-2-english"

# Export the Transformers model to ONNX.
onnx_model = ORTModelForSequenceClassification.from_pretrained(model_id, from_transformers=True)

# Apply dynamic (non-static) int8 quantization targeting AVX512-VNNI CPUs.
quantizer = ORTQuantizer.from_pretrained(onnx_model)
dqconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)

model_quantized_path = quantizer.quantize(
    save_dir="path/to/output/model",
    quantization_config=dqconfig,
)
Optimum is a specialized extension of the popular Transformers library for training and running deep learning models on targeted hardware with the highest level of efficiency. Through optimization techniques such as pruning and quantization, and support for specialized hardware like TPUs and GPUs, Optimum can improve the performance of models and decrease the computational requirements for both training and deployment.
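The quantized model can then be loaded back and used through a normal Transformers pipeline. The sketch below reuses the placeholder save directory from above and assumes the quantizer wrote a file named model_quantized.onnx into it (the default naming in recent Optimum versions); check the actual file name in your output directory.

from transformers import AutoTokenizer, pipeline
from optimum.onnxruntime import ORTModelForSequenceClassification

save_dir = "path/to/output/model"  # same placeholder directory as above

# Load the quantized ONNX model; the file name is assumed to be the Optimum default.
ort_quantized_model = ORTModelForSequenceClassification.from_pretrained(
    save_dir, file_name="model_quantized.onnx"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

classifier = pipeline("text-classification", model=ort_quantized_model, tokenizer=tokenizer)
print(classifier("Quantized models are fast and small."))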
Nothing much to conclude; just try implementing the techniques above in your project and comment on the performance improvements you notice.
https://huggingface.co/docs/optimum/
https://www.tensorflow.org/lite/models
Thanks.