Exporting edgetpu models #20

Open · lkaino opened this issue Sep 8, 2023 · 23 comments
@lkaino commented Sep 8, 2023

I tried generating new models with a different input size from https://github.com/DeGirum/ultralytics_degirum, but the scripts are not working. There is a clear typo in https://github.com/DeGirum/ultralytics_degirum/blob/131c0b71c03bf3455d43c6ede5af2813c7dfa64f/dg_quantize.py#L11: tflite.Interpreter should be tf.lite.Interpreter. Even after fixing that, it still does not work.

Is the above repository being maintained? What would be the easiest way to export, for example, "yolov8n_relu6_coco--576x576_quant_tflite_edgetpu_1" with a smaller input image size, e.g. 320x320?

@shashichilappagari (Contributor) commented Sep 12, 2023

@lkaino My apologies for the delay in responding. We are maintaining another repo that we are syncing daily to support quantized tflite export. The repo is at: https://github.com/DeGirum/ultralytics_yolov8/tree/franklin_current

In this fork, we introduced two new parameters for export: separate_outputs and export_hw_optimized. You can export using the regular export command, passing these two extra parameters set to True. You can then run predict using the resulting tflite file. Please let us know if you encounter any difficulties in making this work.

@lkaino (Author) commented Sep 15, 2023

> @lkaino My apologies for the delay in responding. We are maintaining another repo that we are syncing daily to support quantized tflite export. The repo is at: https://github.com/DeGirum/ultralytics_yolov8/tree/franklin_current
>
> In this fork, we introduced two new parameters for export: separate_outputs and export_hw_optimized. You can export using the regular export command, passing these two extra parameters set to True. You can then run predict using the resulting tflite file. Please let us know if you encounter any difficulties in making this work.

@shashichilappagari thanks for your response! I was able to export an Edge TPU version of yolov8s from the franklin_current branch (hash 1ea62c585b7) using the following code:

from ultralytics import YOLO

model_name = 'yolov8s'
input_width = 320
input_height = 320
model = YOLO(f"{model_name}.pt")
model.export(format='edgetpu',
             simplify=True,
             imgsz=(input_height, input_width),
             export_hw_optimized=True,
             separate_outputs=True
             )

However, not all of the operations will run on the Edge TPU. Is this normal, or did I do something wrong?

WARNING ⚠️ 'ultralytics.yolo.v8' is deprecated since '8.0.136' and will be removed in '8.1.0'. Please use 'ultralytics.models.yolo' instead.
WARNING ⚠️ 'ultralytics.yolo.utils' is deprecated since '8.0.136' and will be removed in '8.1.0'. Please use 'ultralytics.utils' instead.
Note this warning may be related to loading older models. You can update your model to current structure with:
    import torch
    ckpt = torch.load("model.pt")  # applies to both official and custom models
    torch.save(ckpt, "updated-model.pt")

Ultralytics YOLOv8.0.177 🚀 Python-3.10.12 torch-2.0.1+cu117 CPU (Intel Core(TM) i7-8850H 2.60GHz)
YOLOv8s summary (fused): 168 layers, 11156544 parameters, 0 gradients
[W NNPACK.cpp:64] Could not initialize NNPACK! Reason: Unsupported hardware.

PyTorch: starting from 'yolov8s.pt' with input shape (1, 3, 320, 320) BCHW and output shape(s) ((1, 1600, 64), (1, 400, 64), (1, 100, 64), (1, 1600, 80), (1, 400, 80), (1, 100, 80)) (21.5 MB)

TensorFlow SavedModel: starting export with tensorflow 2.13.0...

ONNX: starting export with onnx 1.14.1 opset 17...
============= Diagnostic Run torch.onnx.export version 2.0.1+cu117 =============
verbose: False, log level: Level.ERROR
======================= 0 NONE 0 NOTE 0 WARNING 0 ERROR ========================

ONNX: simplifying with onnxsim 0.4.33...
ONNX: export success ✅ 5.4s, saved as 'yolov8s.onnx' (42.6 MB)
TensorFlow SavedModel: running 'onnx2tf -i "yolov8s.onnx" -o "yolov8s_saved_model" -nuo --verbosity info -oiqt -qt per-tensor -prf /ultralytics_yolov8/ultralytics/utils/replace.json'

Automatic generation of each OP name started ========================================
Automatic generation of each OP name complete!

Model loaded ========================================================================

Model conversion started ============================================================
saved_model output started ==========================================================
saved_model output complete!
Float32 tflite output complete!
Float16 tflite output complete!
Input signature information for quantization
signature_name: serving_default
input_name.0: images shape: (1, 320, 320, 3) dtype: <dtype: 'float32'>
Dynamic Range Quantization tflite output complete!
fully_quantize: 0, inference_type: 6, input_inference_type: FLOAT32, output_inference_type: FLOAT32
INT8 Quantization tflite output complete!
fully_quantize: 0, inference_type: 6, input_inference_type: INT8, output_inference_type: INT8
Full INT8 Quantization tflite output complete!
INT8 Quantization with int16 activations tflite output complete!
Full INT8 Quantization with int16 activations tflite output complete!
TensorFlow SavedModel: export success ✅ 160.3s, saved as 'yolov8s_saved_model' (139.2 MB)
Edge TPU: WARNING ⚠️ Edge TPU known bug https://github.com/ultralytics/ultralytics/issues/1185

Edge TPU: starting export with Edge TPU compiler 16.0.384591198...
Edge TPU: running 'edgetpu_compiler -s -d -k 10 --out_dir "yolov8s_saved_model" "yolov8s_saved_model/yolov8s_full_integer_quant.tflite"'
Edge TPU Compiler version 16.0.384591198
Searching for valid delegate with step 10
Try to compile segment with 240 ops
Started a compilation timeout timer of 180 seconds.
ERROR: Restored original execution plan after delegate application failure.
Compilation failed: Compilation failed due to large activation tensors in model.
Compilation child process completed within timeout period.
Try to compile segment with 230 ops
Intermediate tensors: model/tf.math.add_72/Add;model/tf.nn.convolution_48/convolution;model/tf.nn.convolution_60/convolution;Const_1,model/tf.math.add_61/Add;model/tf.nn.convolution_49/convolution;Const_3,model/tf.math.multiply_259/Mul,model/tf.math.add_73/Add;model/tf.nn.convolution_49/convolution;model/tf.nn.convolution_61/convolution;Const_4,model/tf.math.add_82/Add;model/tf.nn.convolution_49/convolution;model/tf.nn.convolution_70/convolution;Const_5,model/tf.math.add_60/Add;model/tf.nn.convolution_48/convolution;Const
Compilation failed! 
Started a compilation timeout timer of 180 seconds.

Model compiled successfully in 5100 ms.

Input model: yolov8s_saved_model/yolov8s_full_integer_quant.tflite
Input size: 10.74MiB
Output model: yolov8s_saved_model/yolov8s_full_integer_quant_edgetpu.tflite
Output size: 11.04MiB
On-chip memory used for caching model parameters: 7.03MiB
On-chip memory remaining for caching model parameters: 1.25KiB
Off-chip memory used for streaming uncached model parameters: 3.62MiB
Number of Edge TPU subgraphs: 1
Total number of operations: 240
Operation log: yolov8s_saved_model/yolov8s_full_integer_quant_edgetpu.log

Model successfully compiled but not all operations are supported by the Edge TPU. A percentage of the model will instead run on the CPU, which is slower. If possible, consider updating your model to use only operations supported by the Edge TPU. For details, visit g.co/coral/model-reqs.
Number of operations that will run on Edge TPU: 230
Number of operations that will run on CPU: 10

Operator                       Count      Status

RESIZE_NEAREST_NEIGHBOR        2          Mapped to Edge TPU
QUANTIZE                       2          Mapped to Edge TPU
MUL                            64         Mapped to Edge TPU
MUL                            1          More than one subgraph is not supported
CONV_2D                        2          More than one subgraph is not supported
CONV_2D                        69         Mapped to Edge TPU
ADD                            6          Mapped to Edge TPU
LOGISTIC                       64         Mapped to Edge TPU
LOGISTIC                       1          More than one subgraph is not supported
MAX_POOL_2D                    3          Mapped to Edge TPU
CONCATENATION                  13         Mapped to Edge TPU
PAD                            7          Mapped to Edge TPU
RESHAPE                        6          More than one subgraph is not supported
Compilation child process completed within timeout period.
Compilation succeeded! 

@lkaino (Author) commented Sep 15, 2023

I exported yolov8n and it compiled without issues.

Unfortunately, the new postprocess function (decode_bbox()) in your branch depends on torch, which is not included in my target environment in Frigate (https://github.com/blakeblackshear/frigate). This is a shame; I don't have the time to convert it to use numpy/tensorflow.

It seems I have to abandon this project for now. Thanks for the help.

@shashichilappagari (Contributor)

> However, not all of the operations will run on the Edge TPU. Is this normal, or did I do something wrong?

Some operations may not map to the Edge TPU, but our tests show that this is fine.

@shashichilappagari (Contributor)

> I exported yolov8n and it compiled without issues.
>
> Unfortunately, the new postprocess function (decode_bbox()) in your branch depends on torch, which is not included in my target environment in Frigate (https://github.com/blakeblackshear/frigate). This is a shame; I don't have the time to convert it to use numpy/tensorflow.
>
> It seems I have to abandon this project for now. Thanks for the help.

@lkaino The frigate repo looks very interesting. We will see if we can help you with the numpy version of the code.

@lkaino (Author) commented Sep 15, 2023

> @lkaino The frigate repo looks very interesting. We will see if we can help you with the numpy version of the code.

That would be awesome!

I got the postprocessing running with the numpy version you linked.

Unfortunately, the model doesn't detect much of anything. Is there something wrong with the way I'm exporting the model?

from ultralytics import YOLO

model_name = 'yolov8n'
input_width = 280
input_height = 280
model = YOLO(f"{model_name}.pt")
model.export(format='edgetpu',
             simplify=True,
             imgsz=(input_height, input_width),
             export_hw_optimized=True,
             separate_outputs=True
             )

Of course, there might also be something wrong in the way I prepare the input tensor (or in the postprocessing):

import numpy as np
from PIL import Image
from tflite_runtime.interpreter import Interpreter  # or tf.lite.Interpreter

interpreter = Interpreter(
    model_path="yolov8n_saved_model/yolov8n_full_integer_quant.tflite",
)

interpreter.allocate_tensors()
tensor_input_details = interpreter.get_input_details()
tensor_output_details = interpreter.get_output_details()

details = tensor_input_details[0]
image = Image.open('thumbnail.jpg').resize((288, 288))
tensor_input = np.asarray(image)
tensor_input = tensor_input.astype('float32') / 255  # normalize to [0, 1]
scale, zero_point = details['quantization']
tensor_input = (tensor_input / scale + zero_point).astype(details['dtype'])  # quantize
tensor_input = np.expand_dims(tensor_input, 0)  # add batch dimension
interpreter.set_tensor(tensor_input_details[0]["index"], tensor_input)
interpreter.invoke()
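
For reference, a minimal sketch of reading back and dequantizing the quantized outputs (assuming the same interpreter and tensor_output_details as above; the output layout depends on the exported model):

outputs = []
for od in tensor_output_details:
    q = interpreter.get_tensor(od["index"])         # raw quantized output tensor
    out_scale, out_zero_point = od['quantization']  # per-tensor quantization params
    outputs.append((q.astype(np.float32) - out_zero_point) * out_scale)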

@shashichilappagari (Contributor)

@lkaino Looks like you exported with size (280,280) but are sending an input of size (288,288). However, this should have thrown an error at the interpreter.invoke() step itself.

@shashichilappagari (Contributor)

@lkaino I found one potential place where things could have gone wrong. Can you try one of the following:

 model = YOLO(model_yaml).load(model.ckpt_path)

or export with hw_optimized=False

@lkaino (Author) commented Sep 15, 2023

> @lkaino Looks like you exported with size (280,280) but are sending an input of size (288,288). However, this should have thrown an error at the interpreter.invoke() step itself.

Thanks! Exporting with 280 results in the input tensor being 288, for some reason.

@lkaino (Author) commented Sep 15, 2023

> @lkaino I found one potential place where things could have gone wrong. Can you try one of the following:
>
>  model = YOLO(model_yaml).load(model.ckpt_path)
>
> or export with hw_optimized=False

I'm intentionally using the tflite model and tensorflow directly without YOLO. This is how it is used in Frigate.

I can try with hw_optimized=False.

@shashichilappagari (Contributor)

@lkaino Sorry, what I meant was that during the export stage in the Ultralytics repo, you loaded the model using just the checkpoint. If hw_optimized is set to True, you need to load it using model_yaml. During inference, you can directly use the tflite model as in your code.

Btw, the YOLOv8 model expects input sizes to be multiples of 32. Hence, 280 becomes 288.
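
A quick illustration of that rounding (a sketch; the exact padding logic in Ultralytics may differ):

import math

def round_up_to_multiple_of_32(size: int) -> int:
    return math.ceil(size / 32) * 32

print(round_up_to_multiple_of_32(280))  # 288
print(round_up_to_multiple_of_32(320))  # 320 (already a multiple of 32)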

@lkaino (Author) commented Sep 15, 2023

> @lkaino Sorry, what I meant was that during the export stage in the Ultralytics repo, you loaded the model using just the checkpoint. If hw_optimized is set to True, you need to load it using model_yaml. During inference, you can directly use the tflite model as in your code.
>
> Btw, the YOLOv8 model expects input sizes to be multiples of 32. Hence, 280 becomes 288.

Ah sorry, I misunderstood. The model works when I export it without the hw optimization.

Could you explain how to load the model using ckpt_path with COCO pretrained weights? When I load the model as follows, model.ckpt_path is None:

from ultralytics import YOLO

model_name = 'yolov8n'
input_width = 280
input_height = 280
model = YOLO("relu6-yolov8.yaml")
model.export(format='edgetpu',
             simplify=True,
             imgsz=(input_height, input_width),
             export_hw_optimized=False,
             separate_outputs=True
             )

@shashichilappagari
Copy link
Contributor

@lkaino You can simply use model = YOLO('relu6-yolov8.yaml').load('relu6-yolov8n.pt')

I am assuming you trained the model with relu6. Otherwise, you should use whatever config yaml you used for training.

@lkaino (Author) commented Sep 15, 2023

> @lkaino You can simply use model = YOLO('relu6-yolov8.yaml').load('relu6-yolov8n.pt')
>
> I am assuming you trained the model with relu6. Otherwise, you should use whatever config yaml you used for training.

Sorry, I don't know YOLO that well.

'relu6-yolov8.yaml' loads the model architecture with relu6 activations, I assume. 'relu6-yolov8n.pt' would be the trained weights, which I don't have.

I haven't done any training yet; I would like to use the model pretrained on the COCO dataset, with HW optimizations. Is that possible, or do I have to retrain the model?

model = YOLO('relu6-yolov8.yaml').load("relu6-yolov8n.pt")
model.export(format='edgetpu',
             simplify=True,
             imgsz=(input_height, input_width),
             export_hw_optimized=True,
             separate_outputs=True
             )

@shashichilappagari (Contributor)

@lkaino If you are using standard checkpoints, then you can do the following:
model=YOLO('yolov8.yaml').load('yolov8n.pt')

@lkaino (Author) commented Sep 15, 2023

> @lkaino If you are using standard checkpoints, then you can do the following: model=YOLO('yolov8.yaml').load('yolov8n.pt')

This one works with the HW optimization on the Edge TPU! Thanks!

It would be interesting to know the difference between the original:

model = YOLO(f"yolov8n.pt")

and the working one:

model=YOLO('yolov8.yaml').load('yolov8n.pt')

I would need to check the documentation or the implementation of the YOLO class.

@shashichilappagari (Contributor)

@lkaino Great to hear that you could get the model working on the Edge TPU. When we use the hw_optimized flag, we add some new layers to the model (even though the functionality remains exactly the same). Because of this, the model needs to be initialized using the config yaml and then the weights need to be loaded from the checkpoint.

Do you see any speed difference between hw_optimized=True and hw_optimized=False? If possible, can you share the FPS you are seeing?
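
Putting the steps above together, a minimal sketch of the full export recipe with HW optimization (config yaml, checkpoint, and image size are placeholders taken from earlier in this thread):

from ultralytics import YOLO

input_width = 320
input_height = 320

# Initialize the architecture from the config yaml, then load the pretrained
# checkpoint weights, as required when export_hw_optimized=True.
model = YOLO('yolov8.yaml').load('yolov8n.pt')
model.export(format='edgetpu',
             simplify=True,
             imgsz=(input_height, input_width),
             export_hw_optimized=True,
             separate_outputs=True)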

@lkaino (Author) commented Sep 16, 2023

@shashichilappagari I measured the model execution time with and without hw_optimized. Unfortunately, there is no difference between them.

Both take 33 ms on average with a 288 input size on yolov8n.

Unfortunately, the numpy version of decode_bbox() takes 59 ms on average, which is almost double the model execution time. I can profile the function a bit more to see where the overhead comes from.
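
A simple way to separate model execution time from post-processing time (a sketch; decode_bbox and its arguments are placeholders for the numpy post-processing discussed above):

import time

t0 = time.perf_counter()
interpreter.invoke()                # model execution
t1 = time.perf_counter()
detections = decode_bbox(outputs)   # hypothetical post-processing call
t2 = time.perf_counter()
print(f"inference: {(t1 - t0) * 1e3:.1f} ms, post-process: {(t2 - t1) * 1e3:.1f} ms")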

@shashichilappagari (Contributor)

@lkaino Thanks for sharing these numbers. In our measurements, we were getting 33 ms for an input size of 640x640, so the inference time for 288x288 should be much faster. May I ask which host CPU you are using? Is it some Raspberry Pi type of device? If so, it could explain why the post-processing is so slow.

We developed a package called DeGirum PySDK that can run on our hardware, Edge TPUs, CPUs, and GPUs. In this package, the postprocessors are written in C++ and are much faster. You can try our software and see if you find it useful. You can even run various ML models directly in your browser, without installing any software, to get a feel for it and decide if it is worth spending any time on. You can sign up for our cloud platform at: https://cs.degirum.com

@lkaino (Author) commented Sep 16, 2023

The CPU is an Intel i7-4700MQ @ 2.40GHz; it's an old HP ZBook.

Your package sounds interesting, but unfortunately I can't afford much more time on this project. Do you have Python bindings for the SDK, and how difficult would it be to integrate?

My goal was to see whether YOLO would be a feasible alternative to MobileNetV2 in Frigate on the Edge TPU. The model itself is on par in execution time, but MobileNet needs very little postprocessing, which makes it the more energy-efficient choice.

@shashichilappagari (Contributor)

@lkaino The DeGirum package is actually a Python package, but all post-processing functions are implemented in C++, to which we have Python bindings. We believe that it is very easy to integrate, as the code to run any model on any HW is only 4 lines. You can take a look at our docs at: https://docs.degirum.com/content/

We are planning to spend some time on integrating PySDK into Frigate, and I will keep you posted on our progress.

@lkaino (Author) commented Sep 16, 2023

> @lkaino The DeGirum package is actually a Python package, but all post-processing functions are implemented in C++, to which we have Python bindings. We believe that it is very easy to integrate, as the code to run any model on any HW is only 4 lines. You can take a look at our docs at: https://docs.degirum.com/content/
>
> We are planning to spend some time on integrating PySDK into Frigate, and I will keep you posted on our progress.

That's nice to hear!

I spent some time profiling the execution time; most of it (50 ms) is spent in the double for loop (4x1701). See below; the operation is complex enough that I have no idea how to optimize it.

import numpy as np
from scipy.special import softmax  # the original snippet used an unspecified softmax helper; scipy's works here


class DFL:
    def __init__(self, c1=16):
        """Integral module of Distribution Focal Loss (DFL)."""
        self.c1 = c1

    def forward(self, x):
        """Decodes a numpy array 'x' of shape (b, 4*c1, a) into box distances of shape (b, 4, a)."""
        b, c, a = x.shape  # batch, channels, anchors
        x = x.reshape((b, 4, self.c1, a))
        x = x.transpose(0, 2, 1, 3)  # (b, c1, 4, a)
        x = softmax(x, axis=1)       # softmax over the c1 bins
        weights = np.arange(self.c1)
        weights = np.reshape(weights, (1, self.c1, 1, 1))
        output = np.zeros((1, 1, 4, a))
        for i in range(4):  ######################### this loop takes time
            for j in range(a):
                output[0, 0, i, j] = np.sum(x[0, :, i, j] * weights[0, :, 0, 0])

        output = output.reshape(b, 4, a)
        return output
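
One way the loop might be vectorized (a sketch under the same shape assumptions as the class above, not benchmarked): the weighted sum over the bin dimension can be expressed as a single einsum, removing the Python-level loops entirely.

import numpy as np
from scipy.special import softmax

def dfl_forward_vectorized(x, c1=16):
    """Same computation as DFL.forward above, without Python-level loops."""
    b, c, a = x.shape                                  # batch, channels, anchors
    x = x.reshape(b, 4, c1, a).transpose(0, 2, 1, 3)   # (b, c1, 4, a)
    x = softmax(x, axis=1)                             # softmax over the c1 bins
    weights = np.arange(c1, dtype=x.dtype)             # bin indices 0..c1-1
    # weighted sum over the bin axis: (b, c1, 4, a) x (c1,) -> (b, 4, a)
    return np.einsum('bcia,c->bia', x, weights)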

@shashichilappagari (Contributor)

@lkaino Thanks for this useful information on profiling. We will take a look and see if it can be improved.
