By Ethan Yang, AI Software Development Engineer, Intel
Background
“Segment Anything, We All Get Unemployed!” This phrase recently went viral on social media. It refers to the Segment Anything Model (SAM). What is SAM, and what can it do? Let’s find out.
SAM is a powerful AI image segmentation application developed by Meta AI Lab. It can automatically identify which pixels in an image belong to an object and apply automatic styling to the different objects in the image. It can be used for analyzing scientific images, editing photos, and more.
SAM’s complete application consists of two parts: an image encoder model and a combined mask decoder + prompt encoder model. Both can be treated as separate static models, and the image encoder accounts for most of the computing workload during inference. Improving the execution efficiency of the image encoder is therefore one of the main optimization directions for SAM applications.
In this blog, we demonstrate how to quantize and compress the SAM encoder with the OpenVINO™ NNCF model compression tool to improve inference performance on the CPU.
Quantization Introduction
Before we dive into the practical implementation, we need to introduce the concept of quantization. Quantization maps the value range of model parameters from FP32 to INT8 or INT4 without changing the model structure. It represents the same information with a smaller bit-width, which compresses the model size and reduces memory consumption.
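To make the mapping concrete, here is a minimal NumPy sketch of symmetric INT8 quantization with a single shared scale. It is only an illustration of the idea, not how NNCF computes its quantization parameters; the function names and the sample weights are made up for this example.

```python
import numpy as np


def quantize_symmetric_int8(values):
    """Map FP32 values onto INT8 codes using one shared scale (symmetric scheme)."""
    values = np.asarray(values, dtype=np.float32)
    max_abs = float(np.max(np.abs(values)))
    # Guard against an all-zero tensor, where any positive scale would do.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = np.clip(np.round(values / scale), -127, 127).astype(np.int8)
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate FP32 values from the INT8 codes."""
    return quantized.astype(np.float32) * scale


# Illustrative weights only; they do not come from the SAM encoder.
weights = np.array([-1.27, -0.5, 0.0, 0.25, 1.27], dtype=np.float32)
q_weights, scale = quantize_symmetric_int8(weights)
restored = dequantize(q_weights, scale)
max_error = float(np.max(np.abs(weights - restored)))

print("INT8 codes:", q_weights)
print("scale:", scale, "max round-trip error:", max_error)
print(weights.nbytes, "bytes as FP32 ->", q_weights.nbytes, "bytes as INT8")
```

The round-trip error stays within half a quantization step, while storage drops to one quarter of the FP32 size.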
The Intel AVX-512 VNNI extension fuses the INT8 multiply-and-accumulate sequence, which originally required three instructions, into a single instruction. In the latest AMX instruction set, multiple VNNI modules are stacked to achieve a multiple-fold performance improvement within a single cycle.
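The arithmetic behind the fused INT8 operation can be emulated in a few lines of NumPy: each 32-bit accumulator lane receives the sum of four unsigned-byte by signed-byte products. This sketch only mirrors the math and says nothing about timing or hardware behavior; the helper name dot4_accumulate is hypothetical.

```python
import numpy as np


def dot4_accumulate(acc, activations_u8, weights_s8):
    """Add the sum of four u8 * s8 products to each int32 accumulator lane."""
    a = np.asarray(activations_u8, dtype=np.uint8).astype(np.int32).reshape(-1, 4)
    w = np.asarray(weights_s8, dtype=np.int8).astype(np.int32).reshape(-1, 4)
    lane_sums = np.sum(a * w, axis=1, dtype=np.int32)
    return np.asarray(acc, dtype=np.int32) + lane_sums


# Two accumulator lanes, each fed by four byte pairs (illustrative values).
acc = np.zeros(2, dtype=np.int32)
activations = [1, 2, 3, 4, 255, 255, 255, 255]
weights = [1, -1, 1, -1, 127, 127, 127, 127]
result = dot4_accumulate(acc, activations, weights)
print(result)
```

Accumulating into 32 bits is what lets many 8-bit products be summed without overflow.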
NNCF Post-Training Quantization Mode
NNCF, short for Neural Network Compression Framework, is the component of the OpenVINO™ toolkit designed for model compression and acceleration. It provides various model compression algorithms such as quantization, pruning, and binarization. NNCF can be used in two modes: Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT). While QAT requires the original training script and dataset, PTQ compresses a trained model file directly, without additional training scripts or labeled datasets. PTQ is a new feature introduced by NNCF in the OpenVINO™ 2023.0 release, and it can be achieved through the following two steps:
Prepare a calibration dataset. In the quantization process, the calibration data is used solely to calculate data ranges and distributions, so it does not need labels. In addition, a DataLoader object and a transform_fn data conversion function need to be defined. The DataLoader reads each element of the calibration dataset, while transform_fn converts each element into input data that can be fed directly to OpenVINO™ model inference.
Run model quantization. First, import the model object and then bind the model object with the calibration dataset using the nncf.quantize() interface to initiate the quantization task. NNCF supports various model object types, including openvino.runtime.Model, torch.nn.Module, onnx.ModelProto, and tensorflow.Module.
(Optional) Accuracy control mode.
If the model exported by NNCF in the default mode shows a larger-than-expected drop in accuracy, the accuracy control mode can be used for post-training quantization. In this case, a labeled test dataset is required to evaluate how sensitive the model accuracy is to the quantization of each layer. Based on that evaluation, layers are gradually reverted to their original precision until the model reaches the desired accuracy.
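The steps above can be sketched as two small helpers. This is a minimal outline, assuming a recent NNCF release that exposes nncf.Dataset, nncf.quantize, and nncf.quantize_with_accuracy_control; the helper names, the subset_size value, and the max_drop value are illustrative choices, and validation_fn is a metric function you would supply yourself.

```python
def quantize_default(model, data_source, transform_fn, subset_size=300):
    """Steps 1 and 2: wrap the calibration data, then run default PTQ."""
    # Imported lazily so the helpers can be defined without NNCF installed.
    import nncf

    calibration_dataset = nncf.Dataset(data_source, transform_fn)
    return nncf.quantize(model, calibration_dataset, subset_size=subset_size)


def quantize_with_accuracy_budget(model, calibration_source, validation_source,
                                  transform_fn, validation_fn, max_drop=0.01):
    """Optional step: revert sensitive layers until the metric drop fits max_drop."""
    import nncf

    calibration_dataset = nncf.Dataset(calibration_source, transform_fn)
    validation_dataset = nncf.Dataset(validation_source, transform_fn)
    return nncf.quantize_with_accuracy_control(
        model,
        calibration_dataset=calibration_dataset,
        validation_dataset=validation_dataset,
        # validation_fn is user-supplied: it scores a model on the labeled items.
        validation_fn=validation_fn,
        max_drop=max_drop,
    )
```

Only the second helper needs labeled data, which matches the distinction between the default mode and the accuracy control mode.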
Segment Anything + NNCF Practical Application
Next, let’s take a step-by-step look at how to use NNCF’s PTQ mode to complete the quantization of the SAM encoder.
-Define the data loader
In this example, the coco128 dataset is used as the calibration dataset; it contains 128 images in .jpg format. Since the data loader must be a torch DataLoader when quantizing ONNX or IR static models, we inherit from torch.utils.data.Dataset and build a dataset class that implements the __getitem__ method, for accessing each object in the dataset, and the __len__ method, for returning the number of objects in the dataset. Finally, a DataLoader is created with torch.utils.data.DataLoader.
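A minimal sketch of such a dataset class is shown below. For brevity it implements __getitem__ and __len__ on a plain class, which torch.utils.data.DataLoader also accepts as a map-style dataset; deriving it from torch.utils.data.Dataset as described above changes nothing else. The names CalibrationImages, read_rgb, and build_loader are hypothetical, and reading images with OpenCV is an assumption.

```python
from pathlib import Path


def read_rgb(path):
    """Read an image file as an HxWx3 RGB array (assumes OpenCV is installed)."""
    import cv2

    image = cv2.imread(str(path))
    return cv2.cvtColor(image, cv2.COLOR_BGR2RGB)


class CalibrationImages:
    """Map-style dataset over the .jpg files in one folder."""

    def __init__(self, image_dir, read_image=read_rgb):
        self.paths = sorted(Path(image_dir).glob("*.jpg"))
        self.read_image = read_image

    def __getitem__(self, index):
        return self.read_image(self.paths[index])

    def __len__(self):
        return len(self.paths)


def build_loader(dataset):
    """Wrap the dataset in a torch DataLoader that yields one image per step."""
    from torch.utils.data import DataLoader

    return DataLoader(dataset, batch_size=1, shuffle=False)
```

With a batch size of 1, each item from the loader is a tensor with a leading batch axis, which the conversion step has to remove.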
-Define the data format conversion module
The next step is to define the data conversion module. We can use the previously defined preprocess_image function to preprocess the data. It’s worth noting that since the calibration_loader module returns a single data object in the torch tensor format, and the OpenVINO™ Python interface does not support this data type, we need to convert it to the numpy format first.
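The conversion step might look like the sketch below. The preprocess_image shown here is a simplified, hypothetical stand-in for the function mentioned above: it only normalizes with the per-channel mean and standard deviation commonly used for SAM (an assumption) and reorders the axes, leaving out the resizing and padding that a real encoder input would need.

```python
import numpy as np

# Assumed per-channel statistics in RGB order, on a 0-255 scale.
PIXEL_MEAN = np.array([123.675, 116.28, 103.53], dtype=np.float32)
PIXEL_STD = np.array([58.395, 57.12, 57.375], dtype=np.float32)


def preprocess_image(image):
    """Stand-in: normalize an HxWx3 RGB image and return a 1x3xHxW float32 array."""
    normalized = (image.astype(np.float32) - PIXEL_MEAN) / PIXEL_STD
    return np.expand_dims(normalized.transpose(2, 0, 1), 0)


def transform_fn(image_data):
    """Convert one DataLoader item into NumPy input for the OpenVINO model."""
    # A torch.Tensor exposes .numpy(); a plain array is accepted as well.
    image = image_data.numpy() if hasattr(image_data, "numpy") else np.asarray(image_data)
    # Drop the batch axis added by the DataLoader before preprocessing.
    return preprocess_image(np.squeeze(image, axis=0))


batch = np.zeros((1, 4, 6, 3), dtype=np.uint8)  # dummy 4x6 image with a batch axis
model_input = transform_fn(batch)
print(model_input.shape, model_input.dtype)
```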
-Run NNCF quantization
To ensure the accuracy of the quantized model, we use the original FP32 ONNX format model as the input object instead of the FP16 IR format model. Then, the model is passed into the nncf.quantize interface for quantization. This interface has several important additional parameters:
model_type: The model type, used to enable a specialized quantization strategy. For example, for Transformer-based models the strategy prioritizes preserving model accuracy.
preset: The quantization scheme. The default, PERFORMANCE, uses symmetric quantization for both weights and activations to maximize performance. In this case, we use MIXED, which quantizes weights symmetrically and activations asymmetrically, to achieve a balance between model accuracy and performance.
Since the SAM encoder has a complex network structure and quantization requires traversing the parameters of each layer multiple times, the process can take a long time. We recommend using a machine with more than 32GB of memory. If memory is insufficient, you can reduce the number of calibration samples through the subset_size parameter, for example by setting it to 100.
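Put together, the quantization call could look like the following sketch. It assumes a recent OpenVINO and NNCF installation in which ov.Core, nncf.ModelType.TRANSFORMER, and nncf.QuantizationPreset.MIXED are available; the function name quantize_sam_encoder and its arguments are hypothetical, and the loader and transform_fn are whatever you defined in the previous steps.

```python
def quantize_sam_encoder(onnx_path, calibration_loader, transform_fn, subset_size=100):
    """Quantize the FP32 ONNX encoder to INT8 with the Transformer strategy."""
    # Imported lazily so the function can be defined without these packages.
    import nncf
    import openvino as ov

    core = ov.Core()
    model = core.read_model(onnx_path)  # the original FP32 ONNX model
    calibration_dataset = nncf.Dataset(calibration_loader, transform_fn)
    return nncf.quantize(
        model,
        calibration_dataset,
        model_type=nncf.ModelType.TRANSFORMER,  # accuracy-oriented strategy
        preset=nncf.QuantizationPreset.MIXED,   # symmetric weights, asymmetric activations
        subset_size=subset_size,                # fewer samples lowers memory use
    )
```

Lowering subset_size trades calibration coverage for a smaller memory footprint, so it is worth re-checking accuracy after changing it.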
Model accuracy comparison:
Next, we compare the inference results of the INT8 and FP16 models:
It can be seen that in both prompt and auto modes, the INT8 model shows almost no change in accuracy compared to the FP16 model.
Note: In auto mode, masks are displayed in randomly generated colors.
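The comparison here is visual. If you want a number as well, one common option is the intersection over union (IoU) between the masks produced by the two models for the same prompt. The sketch below uses tiny made-up masks purely to show the calculation; they are not results from this experiment.

```python
import numpy as np


def mask_iou(mask_a, mask_b):
    """Intersection over union of two binary masks of the same shape."""
    a = np.asarray(mask_a, dtype=bool)
    b = np.asarray(mask_b, dtype=bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 1.0  # both masks are empty, so treat them as identical
    return float(np.logical_and(a, b).sum() / union)


# Toy masks: a 2x2 square versus a 2x3 rectangle that contains it.
fp16_mask = np.zeros((4, 4), dtype=bool)
fp16_mask[1:3, 1:3] = True
int8_mask = np.zeros((4, 4), dtype=bool)
int8_mask[1:3, 1:4] = True

iou = mask_iou(fp16_mask, int8_mask)
print("IoU:", iou)
```

An IoU close to 1.0 across a set of test prompts would support the visual impression that the two models agree.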
Performance comparison:
Finally, we compare the performance indicators using the benchmark_app tool provided by OpenVINO™:
On the CPU, the INT8 model achieves approximately a 30% performance improvement over the FP16 model, and the model size is reduced from around 350MB to less than 100MB.
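To reproduce this kind of measurement, benchmark_app can be launched from Python as sketched below. The -m, -d, and -t options select the model, the device, and the run time in seconds; the helper names, the 15-second duration, and the model file names are assumptions for illustration.

```python
import subprocess


def build_benchmark_cmd(model_path, device="CPU", seconds=15):
    """Build the benchmark_app command line for one model."""
    return ["benchmark_app", "-m", str(model_path), "-d", device, "-t", str(seconds)]


def run_benchmark(model_path, **options):
    """Run benchmark_app and return its text report (requires OpenVINO tools)."""
    completed = subprocess.run(
        build_benchmark_cmd(model_path, **options),
        capture_output=True,
        text=True,
        check=True,
    )
    return completed.stdout


# Hypothetical file names for the two encoder variants.
for path in ("sam_encoder_fp16.xml", "sam_encoder_int8.xml"):
    print(" ".join(build_benchmark_cmd(path)))
```

Running both models with identical options on the same machine keeps the comparison fair.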
Conclusion
Given the outstanding automatic segmentation capability of SAM, we expect this technology to be deployed in more and more application scenarios. When moving to production, developers often need to strike a balance between performance and accuracy to obtain a more cost-effective solution.
By quantizing and compressing part of the Segment Anything encoder, the OpenVINO™ NNCF tool significantly improves model runtime efficiency and reduces model size without significantly affecting model accuracy.
Notices & Disclaimers
Intel technologies may require enabled hardware, software or service activation.
No product or component can be absolutely secure. Your costs and results may vary.
© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.