In this blog, we will implement a custom quantizer that reduces higher-precision model parameters to lower precision. Specifically, we will create a custom linear layer, called the W8A16_Linear_Layer, that replaces the standard linear layer during quantization. This layer will handle both the quantization process and inference with the quantized parameters. This technique, commonly known as post-training quantization, is applied after a model is fully trained.
Initially, the model is trained with its original high-precision parameters. Post-training, we quantize the model parameters, converting the weights to 8-bit integers (int8) while keeping 16-bit floating-point (FP16) activations. These quantized weights are then stored inside the custom layer and used during inference to produce outputs. After quantization, the model can be saved either locally or on the Hugging Face Hub. Let's dive into the details of this implementation.
In a previous blog post, I discussed how quantization works at the per-tensor, per-channel, and per-group levels. For a deeper understanding, I recommend reviewing that post.
In this blog, we'll extend those concepts into practical applications, specifically quantizing a pre-trained model and using the quantized weights for inference.
During the initial training phase, model parameters are updated in a high-precision format. However, using these high-precision weights directly can result in significant memory overhead, especially when resources are limited. This poses a challenge when deploying large language models on edge devices or in memory-constrained environments. Quantization mitigates these issues by compressing the model size while maintaining a reasonable level of accuracy.
Before we jump into the implementation, it's important to highlight the steps involved in our approach:
- Implement a custom linear layer, W8A16_Linear_Layer.
- Build a sample neural network for testing.
- Develop a method to replace standard linear layers with the custom W8A16_Linear_Layer.
- Validate the layer replacement using the sample neural network.
- Download an open-source pre-trained model from Hugging Face.
- Apply quantization to the pre-trained model.
The following libraries are required for this implementation:
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer
import numpy as np
Below is a simple neural network implemented for testing purposes. Note that the architecture and values of this model are arbitrary and not the focus here.
class NeuralNetwork(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.embd = torch.nn.Embedding(4, 8)
        self.linear_1 = nn.Linear(8, 16)
        self.linear_2 = nn.Linear(16, 4, bias=False)
        self.lm_head = nn.Linear(4, 6, bias=False)

    def forward(self, x):
        x = self.embd(x)
        x = self.linear_1(x)
        x = self.linear_2(x)
        x = self.lm_head(x)
        return x
This model contains an embedding layer with a vocabulary size of 4, allowing us to retrieve embeddings for values between 0 and 3. The embedding dimension is set to 8. Additionally, there are linear layers, which take two parameters: input features and output features. You can optionally set a bias term for each linear layer.
Here is the model architecture before quantization:
NeuralNetwork(
(embd): Embedding(4, 8)
(linear_1): Linear(in_features=8, out_features=16, bias=True)
(linear_2): Linear(in_features=16, out_features=4, bias=False)
(lm_head): Linear(in_features=4, out_features=6, bias=False)
)
Let's now test the model's output with some input. The input values should be integers between 0 and 3:
input = torch.randint(0, 4, (1, 2), dtype=torch.long)

input = tensor([[1, 2]])
When invoking the model with this input, I get the following output. Notice that the output shape is (1, 2, 6):
model = NeuralNetwork()
output = model(input)

output = tensor([[[-0.1355, -0.2336,  0.4340,  0.0514, -0.2103,  0.0347],
                  [-0.0025,  0.0313,  0.3539,  0.1842, -0.1190,  0.1377]]],
                grad_fn=<UnsafeViewBackward0>)
Now let's implement the custom linear layer.
The W8A16 naming convention describes the data types used for weights and activations:
- W8: Weights are stored in 8 bits (int8).
- A16: Activations are kept in 16 bits (FP16).
This custom linear layer is initialized with the number of input features, the number of output features, and an optional bias term. We use register buffers to store the quantized weights (int8), the scales, and the bias, because these values are non-trainable parameters and should not be part of backpropagation. These buffers hold the quantized weights and their corresponding scales during inference.
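Before the full layer, here is a minimal sketch of the symmetric per-channel quantization that the quantize method below performs; the weight matrix here is random and purely illustrative:

w = torch.randn(4, 8)                                         # hypothetical weight matrix (output_features, input_features)
scales = w.abs().max(dim=-1).values / 127                     # one scale per output channel (row)
w_int8 = torch.round(w / scales.unsqueeze(1)).to(torch.int8)  # quantized weights in [-127, 127]
w_dequant = w_int8.to(torch.float32) * scales.unsqueeze(1)    # approximate reconstruction
print((w - w_dequant).abs().max())                            # quantization error is on the order of the scales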
class W8A16LinearLayer(nn.Module):
    def __init__(self, input_features, output_features, bias=True, dtype=torch.float32):
        super().__init__()
        self.register_buffer("int8_weights",
                             torch.randint(-128, 127, (output_features, input_features), dtype=torch.int8))
        self.register_buffer("scales", torch.randn((output_features), dtype=dtype))
        if bias:
            self.register_buffer("bias", torch.randn((1, output_features), dtype=dtype))
        else:
            self.bias = None

    def forward(self, inputs):
        converted_weights = self.int8_weights.to(inputs.dtype)
        output = F.linear(inputs, converted_weights) * self.scales
        if self.bias is not None:
            output = output + self.bias
        return output

    def quantize(self, weights):
        w_fp32 = weights.clone().to(torch.float32)
        scales = w_fp32.abs().max(dim=-1).values / 127
        scales = scales.to(weights.dtype)
        int8_weights = torch.round(weights / scales.unsqueeze(1)).to(torch.int8)
        self.int8_weights = int8_weights
        self.scales = scales
In this implementation:
- The forward method handles activations by casting the stored int8 weights to the input dtype, performing the matrix multiplication with the inputs, and multiplying the result by the per-channel scales. If a bias is present, it is added to the output. This mimics the behavior of a standard linear layer in a neural network.
- The quantize method performs per-channel quantization: it first computes a scale for each output channel (the channel's maximum absolute value divided by 127), then divides each channel by its scale, rounds the result, and stores the quantized int8 weights together with the scales.
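To make the behavior concrete, here is a minimal, self-contained sketch of using the layer on its own; the shapes and weights are arbitrary and only for illustration:

layer = W8A16LinearLayer(8, 16)        # 8 input features, 16 output features
dummy_weights = torch.randn(16, 8)     # hypothetical full-precision weights
layer.quantize(dummy_weights)          # stores int8 weights and per-channel scales

activations = torch.randn(1, 8)
out = layer(activations)               # cast int8 weights, matrix-multiply, scale, add bias
print(out.shape)                       # torch.Size([1, 16])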
Now that we have implemented the custom W8A16 linear layer, we need a method to recursively traverse the layers of a model, identify the linear layers, and replace them with W8A16 linear layers.
def replace_linear_layer_with_W8A16Linear_layer_and_quantization(module, target, exclude_list):
    for name, child in module.named_children():
        if isinstance(child, nn.Linear) and not any([x == name for x in exclude_list]):
            old_bias = child.bias
            old_weights = child.weight

            new_module = target(child.in_features, child.out_features,
                                old_bias is not None, child.weight.dtype)
            setattr(module, name, new_module)
            getattr(module, name).quantize(old_weights)

            if old_bias is not None:
                getattr(module, name).bias = old_bias
        else:
            replace_linear_layer_with_W8A16Linear_layer_and_quantization(child, target, exclude_list)
This function recursively traverses all layers of the model, including sub-modules (e.g., layers inside a Sequential block). For example, consider the following architecture:
class SimpleNN(nn.Module):
    def __init__(self):
        super(SimpleNN, self).__init__()
        self.linear1 = nn.Linear(10, 20)
        self.linear2 = nn.Linear(20, 5)
        self.nested = nn.Sequential(
            nn.Linear(5, 10),
            nn.ReLU(),
            nn.Linear(10, 2)
        )
For models with nested layers, such as the Sequential block shown above, the function recurses through each sub-layer. An exclude_list is also defined to prevent certain layers from being replaced, for instance if you don't want to quantize the final output layer.
Here's what the arguments represent:
- module: The model that needs to be quantized.
- target: The custom W8A16 layer class.
- exclude_list: A list of layer names that should not be replaced.
If a module's child is a linear layer and its name is not in the exclude list, the function retrieves the layer's bias and weights. It creates a new instance of the W8A16 linear layer (called new_module) and replaces the existing linear layer. This is done using setattr():
setattr(module, name, new_module)
Next, it quantizes the weights of the old linear layer and stores them in the int8_weights buffer of the W8A16 layer. This is done through getattr():
getattr(module, name).quantize(old_weights)
If the old layer has a bias, the function also assigns it to the W8A16 layer.
Let's now test the replacement function on our sample neural network. We'll create two separate instances to compare the behavior with and without an exclude list. In the first instance, we exclude the "lm_head" layer from being replaced.
model_1 = NeuralNetwork()
model_2 = NeuralNetwork()

replace_linear_layer_with_W8A16Linear_layer_and_quantization(model_1,
                                                             W8A16LinearLayer, ["lm_head"])
The model architecture after replacement looks like this, where linear_1 and linear_2 are replaced by the custom W8A16 layers:
model_1 = NeuralNetwork(
(embd): Embedding(4, 8)
(linear_1): W8A16LinearLayer()
(linear_2): W8A16LinearLayer()
(lm_head): Linear(in_features=4, out_features=6, bias=False)
)
If we use an empty exclude_list, all linear layers, including the output layer (lm_head), are replaced:
replace_linear_layer_with_W8A16Linear_layer_and_quantization(model_2, W8A16LinearLayer, [])

model_2 = NeuralNetwork(
(embd): Embedding(4, 8)
(linear_1): W8A16LinearLayer()
(linear_2): W8A16LinearLayer()
(lm_head): W8A16LinearLayer()
)
Now, let's observe how the parameters of our test model are updated after quantization.
model_3 = NeuralNetwork()
replace_linear_layer_with_W8A16Linear_layer_and_quantization(model_3, W8A16LinearLayer, ["lm_head"])

model_3 = NeuralNetwork(
(embd): Embedding(4, 8)
(linear_1): W8A16LinearLayer()
(linear_2): W8A16LinearLayer()
(lm_head): Linear(in_features=4, out_features=6, bias=False)
)
The following code snippet retrieves the weights and scales of the W8A16 layers, as well as those of the standard embedding and lm_head layers:
for name, child in model_3.named_children():
    if isinstance(child, W8A16LinearLayer):
        print(child.int8_weights, child.scales)
    else:
        print(child.weight)
- The embedding and output layer weights remain floating-point values, as seen below:
embd Embedding(4, 8)
Parameter containing:
tensor([[-2.8465e-01, -1.2422e+00, -1.3466e-01,  1.4696e+00,  3.0245e-01, -3.5536e-01,  4.1974e-01,  1.0006e+00],
        [-9.1691e-01,  2.0813e-01,  7.2216e-01, -3.9679e-02,  5.0168e-02,  6.3664e-01,  4.1839e-01,  1.6653e-03],
        [-1.5351e+00, -1.6491e-01,  3.0808e-02,  7.2144e-01, -7.2972e-01,  1.6778e+00,  4.4839e-01,  1.5266e+00],
        [ 4.4786e-01,  1.1266e+00, -3.7570e-02, -6.8816e-01,  1.7809e+00,  1.2777e+00,  4.6440e-01, -8.9209e-01]],
       requires_grad=True)

lm_head Linear(in_features=4, out_features=6, bias=False)
Parameter containing:
tensor([[ 0.3276, -0.0866,  0.2146, -0.2245],
        [-0.4933,  0.0411, -0.0107,  0.3286],
        [-0.4630,  0.1674, -0.4232, -0.2771],
        [ 0.1484, -0.4660,  0.1650,  0.0888],
        [-0.1192,  0.0308,  0.0635, -0.3392],
        [ 0.0972, -0.2974, -0.1220, -0.1710]], requires_grad=True)
- The W8A16 linear layer weights and scales after quantization. Notice that the weights of the W8A16LinearLayer now range between the integer values -128 and 127, indicating 8-bit quantization:
linear_1 W8A16LinearLayer()
tensor([[ 104, 92, -115, 71, 127, -93, 113, -46],
[ 41, -123, 105, 127, 48, 23, 75, -109],
[ 77, 50, 117, 97, -127, 70, 5, 60],
[ 90, -15, -18, -21, 127, 91, -122, 32],
[ 25, 98, 14, -107, -64, -69, -94, -127],
[ -94, 93, -43, 53, 127, -28, -111, -46],
[ -49, 112, 60, -40, -92, -71, 12, -127],
[ 52, 41, 38, 69, -53, 127, -71, -93],
[ 127, 30, 4, -102, -98, -105, -110, 12],
[ -34, -8, 84, 101, 75, 5, -57, -127],
[ 31, 87, 105, 127, -2, 56, -58, -57],
[ 31, 108, -125, -55, -103, 33, -125, 127],
[ -78, -40, 78, -127, -109, -103, 108, -79],
[ -89, -127, 66, -14, -111, 92, -52, 55],
[ -78, 127, 46, 16, 35, 75, 101, 118],
[-127, 3, -26, -120, 3, 70, -85, 69]], dtype=torch.int8)
scales :
tensor([0.0026, 0.0027, 0.0024, 0.0025, 0.0028, 0.0026, 0.0027, 0.0020, 0.0025,
        0.0024, 0.0017, 0.0027, 0.0027, 0.0023, 0.0023, 0.0025],
       grad_fn=<DivBackward0>)
linear_2 W8A16LinearLayer()
tensor([[ -32, 63, 39, 11, -23, -127, -124, -18, -118, 60, 42, -5, -66, -30, -113, 14],
        [ 101, 111, -41, 9, -116, -104, 13, -14, 47, -69, 36, -127, -5, 103, 125, 51],
        [ 20, -15, 89, 111, -83, -79, 94, 127, 101, 104, -107, -85, -102, 71, -47, 85],
        [ -52, -120, 90, -36, 37, -48, 53, -82, -87, 66, 72, -96, 87, -127, -67, 79]], dtype=torch.int8)
scales :
tensor([0.0019, 0.0019, 0.0019, 0.0019], grad_fn=<DivBackward0>)
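As a quick sanity check, we can keep a copy of a layer's original weights before replacement and compare them with the dequantized weights afterwards. This is just a sketch on a fresh instance of the test network:

reference = NeuralNetwork()
original_w1 = reference.linear_1.weight.detach().clone()

replace_linear_layer_with_W8A16Linear_layer_and_quantization(reference, W8A16LinearLayer, ["lm_head"])

q_layer = reference.linear_1
dequantized = q_layer.int8_weights.to(torch.float32) * q_layer.scales.unsqueeze(1)
print((original_w1 - dequantized).abs().max())   # error should be on the order of the scales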
Next, we quantize an open-source model from Hugging Face.
model_id = "Salesforce/codegen-350M-mono"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, low_cpu_mem_usage=True)
We aim to reduce the model size through quantization.
memory_footprint_before_quantization = model.get_memory_footprint()/1e+6
print(f"model size before quantization : {memory_footprint_before_quantization} MB")
The original model size is:
model size before quantization : 797.310976 MB
The pre-quantization model architecture is as follows:
CodeGenForCausalLM(
(transformer): CodeGenModel(
(wte): Embedding(51200, 1024)
(drop): Dropout(p=0.0, inplace=False)
(h): ModuleList(
(0-19): 20 x CodeGenBlock(
(ln_1): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
(attn): CodeGenAttention(
(attn_dropout): Dropout(p=0.0, inplace=False)
(resid_dropout): Dropout(p=0.0, inplace=False)
(qkv_proj): Linear(in_features=1024, out_features=3072, bias=False)
(out_proj): Linear(in_features=1024, out_features=1024, bias=False)
)
(mlp): CodeGenMLP(
(fc_in): Linear(in_features=1024, out_features=4096, bias=True)
(fc_out): Linear(in_features=4096, out_features=1024, bias=True)
(act): NewGELUActivation()
(dropout): Dropout(p=0.0, inplace=False)
)
)
)
(ln_f): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
)
(lm_head): Linear(in_features=1024, out_features=51200, bias=True)
)
We replace the model's linear layers with W8A16LinearLayer (except for lm_head) to apply quantization:
replace_linear_layer_with_W8A16Linear_layer_and_quantization(model,
                                                             W8A16LinearLayer, ["lm_head"])
The post-quantization architecture now includes W8A16LinearLayer:
CodeGenForCausalLM(
(transformer): CodeGenModel(
(wte): Embedding(51200, 1024)
(drop): Dropout(p=0.0, inplace=False)
(h): ModuleList(
(0-19): 20 x CodeGenBlock(
(ln_1): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
(attn): CodeGenAttention(
(attn_dropout): Dropout(p=0.0, inplace=False)
(resid_dropout): Dropout(p=0.0, inplace=False)
(qkv_proj): W8A16LinearLayer()
(out_proj): W8A16LinearLayer()
)
(mlp): CodeGenMLP(
(fc_in): W8A16LinearLayer()
(fc_out): W8A16LinearLayer()
(act): NewGELUActivation()
(dropout): Dropout(p=0.0, inplace=False)
)
)
)
(ln_f): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
)
(lm_head): Linear(in_features=1024, out_features=51200, bias=True)
)
Now let's check the model size after quantization. It is significantly reduced:
memory_footprint_after_quantization = model.get_memory_footprint()/1e+6
print(f"model size after quantization : {np.round(memory_footprint_after_quantization, 2)} MB")

model size after quantization : 546.02 MB
This quantization saved roughly 250 MB of memory:
print(f"Memory saved : {np.round((memory_footprint_before_quantization - memory_footprint_after_quantization), 2)} MB")

Memory saved : 251.29 MB
Saving the Quantized Model's Parameters Locally
Once we've quantized the model, it's important to store the quantized weights so that others can use the model. In this section, we will store the model's parameters locally. You also have the option to save them on the Hugging Face Hub for easier sharing and accessibility.
quantized_model_state_dict = model.state_dict()
quantized_model_state_dict.keys()
The above command outputs the keys of the model's parameters, including weights, biases, and the quantized int8 weights and scales. For example, the output might look like this:
odict_keys(['transformer.wte.weight', 'transformer.h.0.ln_1.weight', 'transformer.h.0.ln_1.bias', 'transformer.h.0.attn.qkv_proj.int8_weights', 'transformer.h.0.attn.qkv_proj.scales', 'transformer.h.0.attn.out_proj.int8_weights', 'transformer.h.0.attn.out_proj.scales', 'transformer.h.0.mlp.fc_in.bias', 'transformer.h.0.mlp.fc_in.int8_weights', 'transformer.h.0.mlp.fc_in.scales', 'transformer.h.0.mlp.fc_out.bias', 'transformer.h.0.mlp.fc_out.int8_weights', 'transformer.h.0.mlp.fc_out.scales', 'transformer.h.1.ln_1.weight', 'transformer.h.1.ln_1.bias', 'transformer.h.1.attn.qkv_proj.int8_weights', 'transformer.h.1.attn.qkv_proj.scales', 'transformer.h.1.attn.out_proj.int8_weights', 'transformer.h.1.attn.out_proj.scales', 'transformer.h.1.mlp.fc_in.bias', 'transformer.h.1.mlp.fc_in.int8_weights', 'transformer.h.1.mlp.fc_in.scales', 'transformer.h.1.mlp.fc_out.bias', 'transformer.h.1.mlp.fc_out.int8_weights', 'transformer.h.1.mlp.fc_out.scales', 'transformer.h.2.ln_1.weight', 'transformer.h.2.ln_1.bias', 'transformer.h.2.attn.qkv_proj.int8_weights', 'transformer.h.2.attn.qkv_proj.scales', 'transformer.h.2.attn.out_proj.int8_weights', 'transformer.h.2.attn.out_proj.scales', 'transformer.h.2.mlp.fc_in.bias', 'transformer.h.2.mlp.fc_in.int8_weights', 'transformer.h.2.mlp.fc_in.scales', 'transformer.h.2.mlp.fc_out.bias', 'transformer.h.2.mlp.fc_out.int8_weights', 'transformer.h.2.mlp.fc_out.scales', 'transformer.h.3.ln_1.weight', 'transformer.h.3.ln_1.bias', 'transformer.h.3.attn.qkv_proj.int8_weights', 'transformer.h.3.attn.qkv_proj.scales', 'transformer.h.3.attn.out_proj.int8_weights', 'transformer.h.3.attn.out_proj.scales', 'transformer.h.3.mlp.fc_in.bias', 'transformer.h.3.mlp.fc_in.int8_weights', 'transformer.h.3.mlp.fc_in.scales', 'transformer.h.3.mlp.fc_out.bias', 'transformer.h.3.mlp.fc_out.int8_weights', 'transformer.h.3.mlp.fc_out.scales', 'transformer.h.4.ln_1.weight', 'transformer.h.4.ln_1.bias', 'transformer.h.4.attn.qkv_proj.int8_weights', 'transformer.h.4.attn.qkv_proj.scales', 'transformer.h.4.attn.out_proj.int8_weights', 'transformer.h.4.attn.out_proj.scales', 'transformer.h.4.mlp.fc_in.bias', 'transformer.h.4.mlp.fc_in.int8_weights', 'transformer.h.4.mlp.fc_in.scales', 'transformer.h.4.mlp.fc_out.bias', 'transformer.h.4.mlp.fc_out.int8_weights', 'transformer.h.4.mlp.fc_out.scales', 'transformer.h.5.ln_1.weight', 'transformer.h.5.ln_1.bias', 'transformer.h.5.attn.qkv_proj.int8_weights', 'transformer.h.5.attn.qkv_proj.scales', 'transformer.h.5.attn.out_proj.int8_weights', 'transformer.h.5.attn.out_proj.scales', 'transformer.h.5.mlp.fc_in.bias', 'transformer.h.5.mlp.fc_in.int8_weights', 'transformer.h.5.mlp.fc_in.scales', 'transformer.h.5.mlp.fc_out.bias', 'transformer.h.5.mlp.fc_out.int8_weights', 'transformer.h.5.mlp.fc_out.scales', 'transformer.h.6.ln_1.weight', 'transformer.h.6.ln_1.bias', 'transformer.h.6.attn.qkv_proj.int8_weights', 'transformer.h.6.attn.qkv_proj.scales', 'transformer.h.6.attn.out_proj.int8_weights', 'transformer.h.6.attn.out_proj.scales', 'transformer.h.6.mlp.fc_in.bias', 'transformer.h.6.mlp.fc_in.int8_weights', 'transformer.h.6.mlp.fc_in.scales', 'transformer.h.6.mlp.fc_out.bias', 'transformer.h.6.mlp.fc_out.int8_weights', 'transformer.h.6.mlp.fc_out.scales', 'transformer.h.7.ln_1.weight', 'transformer.h.7.ln_1.bias', 'transformer.h.7.attn.qkv_proj.int8_weights', 'transformer.h.7.attn.qkv_proj.scales', 'transformer.h.7.attn.out_proj.int8_weights', 'transformer.h.7.attn.out_proj.scales', 'transformer.h.7.mlp.fc_in.bias', 
'transformer.h.7.mlp.fc_in.int8_weights', 'transformer.h.7.mlp.fc_in.scales', 'transformer.h.7.mlp.fc_out.bias', 'transformer.h.7.mlp.fc_out.int8_weights', 'transformer.h.7.mlp.fc_out.scales', 'transformer.h.8.ln_1.weight', 'transformer.h.8.ln_1.bias', 'transformer.h.8.attn.qkv_proj.int8_weights', 'transformer.h.8.attn.qkv_proj.scales', 'transformer.h.8.attn.out_proj.int8_weights', 'transformer.h.8.attn.out_proj.scales', 'transformer.h.8.mlp.fc_in.bias', 'transformer.h.8.mlp.fc_in.int8_weights', 'transformer.h.8.mlp.fc_in.scales', 'transformer.h.8.mlp.fc_out.bias', 'transformer.h.8.mlp.fc_out.int8_weights', 'transformer.h.8.mlp.fc_out.scales', 'transformer.h.9.ln_1.weight', 'transformer.h.9.ln_1.bias', 'transformer.h.9.attn.qkv_proj.int8_weights', 'transformer.h.9.attn.qkv_proj.scales', 'transformer.h.9.attn.out_proj.int8_weights', 'transformer.h.9.attn.out_proj.scales', 'transformer.h.9.mlp.fc_in.bias', 'transformer.h.9.mlp.fc_in.int8_weights', 'transformer.h.9.mlp.fc_in.scales', 'transformer.h.9.mlp.fc_out.bias', 'transformer.h.9.mlp.fc_out.int8_weights', 'transformer.h.9.mlp.fc_out.scales', 'transformer.h.10.ln_1.weight', 'transformer.h.10.ln_1.bias', 'transformer.h.10.attn.qkv_proj.int8_weights', 'transformer.h.10.attn.qkv_proj.scales', 'transformer.h.10.attn.out_proj.int8_weights', 'transformer.h.10.attn.out_proj.scales', 'transformer.h.10.mlp.fc_in.bias', 'transformer.h.10.mlp.fc_in.int8_weights', 'transformer.h.10.mlp.fc_in.scales', 'transformer.h.10.mlp.fc_out.bias', 'transformer.h.10.mlp.fc_out.int8_weights', 'transformer.h.10.mlp.fc_out.scales', 'transformer.h.11.ln_1.weight', 'transformer.h.11.ln_1.bias', 'transformer.h.11.attn.qkv_proj.int8_weights', 'transformer.h.11.attn.qkv_proj.scales', 'transformer.h.11.attn.out_proj.int8_weights', 'transformer.h.11.attn.out_proj.scales', 'transformer.h.11.mlp.fc_in.bias', 'transformer.h.11.mlp.fc_in.int8_weights', 'transformer.h.11.mlp.fc_in.scales', 'transformer.h.11.mlp.fc_out.bias', 'transformer.h.11.mlp.fc_out.int8_weights', 'transformer.h.11.mlp.fc_out.scales', 'transformer.h.12.ln_1.weight', 'transformer.h.12.ln_1.bias', 'transformer.h.12.attn.qkv_proj.int8_weights', 'transformer.h.12.attn.qkv_proj.scales', 'transformer.h.12.attn.out_proj.int8_weights', 'transformer.h.12.attn.out_proj.scales', 'transformer.h.12.mlp.fc_in.bias', 'transformer.h.12.mlp.fc_in.int8_weights', 'transformer.h.12.mlp.fc_in.scales', 'transformer.h.12.mlp.fc_out.bias', 'transformer.h.12.mlp.fc_out.int8_weights', 'transformer.h.12.mlp.fc_out.scales', 'transformer.h.13.ln_1.weight', 'transformer.h.13.ln_1.bias', 'transformer.h.13.attn.qkv_proj.int8_weights', 'transformer.h.13.attn.qkv_proj.scales', 'transformer.h.13.attn.out_proj.int8_weights', 'transformer.h.13.attn.out_proj.scales', 'transformer.h.13.mlp.fc_in.bias', 'transformer.h.13.mlp.fc_in.int8_weights', 'transformer.h.13.mlp.fc_in.scales', 'transformer.h.13.mlp.fc_out.bias', 'transformer.h.13.mlp.fc_out.int8_weights', 'transformer.h.13.mlp.fc_out.scales', 'transformer.h.14.ln_1.weight', 'transformer.h.14.ln_1.bias', 'transformer.h.14.attn.qkv_proj.int8_weights', 'transformer.h.14.attn.qkv_proj.scales', 'transformer.h.14.attn.out_proj.int8_weights', 'transformer.h.14.attn.out_proj.scales', 'transformer.h.14.mlp.fc_in.bias', 'transformer.h.14.mlp.fc_in.int8_weights', 'transformer.h.14.mlp.fc_in.scales', 'transformer.h.14.mlp.fc_out.bias', 'transformer.h.14.mlp.fc_out.int8_weights', 'transformer.h.14.mlp.fc_out.scales', 'transformer.h.15.ln_1.weight', 'transformer.h.15.ln_1.bias', 
'transformer.h.15.attn.qkv_proj.int8_weights', 'transformer.h.15.attn.qkv_proj.scales', 'transformer.h.15.attn.out_proj.int8_weights', 'transformer.h.15.attn.out_proj.scales', 'transformer.h.15.mlp.fc_in.bias', 'transformer.h.15.mlp.fc_in.int8_weights', 'transformer.h.15.mlp.fc_in.scales', 'transformer.h.15.mlp.fc_out.bias', 'transformer.h.15.mlp.fc_out.int8_weights', 'transformer.h.15.mlp.fc_out.scales', 'transformer.h.16.ln_1.weight', 'transformer.h.16.ln_1.bias', 'transformer.h.16.attn.qkv_proj.int8_weights', 'transformer.h.16.attn.qkv_proj.scales', 'transformer.h.16.attn.out_proj.int8_weights', 'transformer.h.16.attn.out_proj.scales', 'transformer.h.16.mlp.fc_in.bias', 'transformer.h.16.mlp.fc_in.int8_weights', 'transformer.h.16.mlp.fc_in.scales', 'transformer.h.16.mlp.fc_out.bias', 'transformer.h.16.mlp.fc_out.int8_weights', 'transformer.h.16.mlp.fc_out.scales', 'transformer.h.17.ln_1.weight', 'transformer.h.17.ln_1.bias', 'transformer.h.17.attn.qkv_proj.int8_weights', 'transformer.h.17.attn.qkv_proj.scales', 'transformer.h.17.attn.out_proj.int8_weights', 'transformer.h.17.attn.out_proj.scales', 'transformer.h.17.mlp.fc_in.bias', 'transformer.h.17.mlp.fc_in.int8_weights', 'transformer.h.17.mlp.fc_in.scales', 'transformer.h.17.mlp.fc_out.bias', 'transformer.h.17.mlp.fc_out.int8_weights', 'transformer.h.17.mlp.fc_out.scales', 'transformer.h.18.ln_1.weight', 'transformer.h.18.ln_1.bias', 'transformer.h.18.attn.qkv_proj.int8_weights', 'transformer.h.18.attn.qkv_proj.scales', 'transformer.h.18.attn.out_proj.int8_weights', 'transformer.h.18.attn.out_proj.scales', 'transformer.h.18.mlp.fc_in.bias', 'transformer.h.18.mlp.fc_in.int8_weights', 'transformer.h.18.mlp.fc_in.scales', 'transformer.h.18.mlp.fc_out.bias', 'transformer.h.18.mlp.fc_out.int8_weights', 'transformer.h.18.mlp.fc_out.scales', 'transformer.h.19.ln_1.weight', 'transformer.h.19.ln_1.bias', 'transformer.h.19.attn.qkv_proj.int8_weights', 'transformer.h.19.attn.qkv_proj.scales', 'transformer.h.19.attn.out_proj.int8_weights', 'transformer.h.19.attn.out_proj.scales', 'transformer.h.19.mlp.fc_in.bias', 'transformer.h.19.mlp.fc_in.int8_weights', 'transformer.h.19.mlp.fc_in.scales', 'transformer.h.19.mlp.fc_out.bias', 'transformer.h.19.mlp.fc_out.int8_weights', 'transformer.h.19.mlp.fc_out.scales', 'transformer.ln_f.weight', 'transformer.ln_f.bias', 'lm_head.weight', 'lm_head.bias'])
To save the quantized model's parameters locally, use the torch.save() function, as shown below:
torch.save(quantized_model_state_dict, "quantized_model_state_dict.pt")
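To use the saved weights later, one option (a sketch, not the only way) is to rebuild the architecture, swap in the W8A16 layers again, and then load the saved state dict over the placeholder values:

# rebuild the skeleton and replace its linear layers with W8A16LinearLayer
reloaded_model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, low_cpu_mem_usage=True)
replace_linear_layer_with_W8A16Linear_layer_and_quantization(reloaded_model, W8A16LinearLayer, ["lm_head"])

# overwrite the freshly quantized buffers with the saved quantized parameters
state_dict = torch.load("quantized_model_state_dict.pt")
reloaded_model.load_state_dict(state_dict)

In practice, the skeleton is often created on the meta device to avoid re-quantizing the original weights, but the version above keeps the example short.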
Uploading the Model's Parameters to the Hugging Face Hub
If you'd prefer to make the quantized model more accessible, you can upload it to the Hugging Face Hub using the HfApi class from the huggingface_hub library. This allows others to easily retrieve and use the model from your repository.
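A minimal sketch of such an upload; the repository name is a placeholder, and you must be authenticated (for example via huggingface-cli login):

from huggingface_hub import HfApi, create_repo

repo_id = "your-username/codegen-350M-mono-w8a16"   # hypothetical repository name

create_repo(repo_id, exist_ok=True)
api = HfApi()
api.upload_file(
    path_or_fileobj="quantized_model_state_dict.pt",
    path_in_repo="quantized_model_state_dict.pt",
    repo_id=repo_id,
)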
In this blog, we implemented a W8A16 linear layer that quantizes weights and uses them during inference. We also created a function to replace the original linear layers in a model with the custom quantized layers. After quantizing the test model, we applied quantization to an open-source model from the Hugging Face repository and saved the resulting model locally. I hope this blog helped deepen your understanding of model quantization. Much of the material I applied here comes from a deep learning short course, which I highly recommend for anyone looking to expand their expertise in machine learning and generative AI.
Thank You