Why the classification model training test is a few seconds slower than PyTorch each time, with reproducible code #120

ccssu (Collaborator) opened this issue on Mar 9, 2023

  • Preface
  • py-spy analysis
  • Stable reproduction code
  • Upcoming plans

Preface

While investigating how to locate the C++ code behind PyTorch's Python APIs (https://github.com/Oneflow-Inc/OneTeam/issues/147),

I tried py-spy, a performance profiling tool recommended on the PyTorch website.

It traced the slowdown from PR #111 (the classification model training test runs a few seconds slower than PyTorch each time) to this line: tloss = (tloss * i + loss.item()) / (i + 1)  # update mean losses

Profiling with `py-spy`

Evaluating the performance impact of code changes in PyTorch can be complicated,
particularly if code changes happen in compiled code. One simple way to profile
both Python and C++ code in PyTorch is to use
py-spy, a sampling profiler for Python
that has the ability to profile native code and Python code in the same session.

py-spy can be installed via pip:

pip install py-spy

To use py-spy, first write a Python test script that exercises the
functionality you would like to profile. For example, this script profiles
torch.add:

import torch

t1 = torch.tensor([[1, 1], [1, 1.]])
t2 = torch.tensor([[0, 0], [0, 0.]])

for _ in range(1000000):
    torch.add(t1, t2)

Since the torch.add operation happens in microseconds, we repeat it a large
number of times to get good statistics. The most straightforward way to use
py-spy with such a script is to generate a flame graph:

py-spy record -o profile.svg --native -- python test_tensor_tensor_add.py

This will output a file named profile.svg containing a flame graph you can
view in a web browser or SVG viewer. Individual stack frame entries in the graph
can be selected interactively with your mouse to zoom in on a particular part of
the program execution timeline. The --native command-line option tells
py-spy to record stack frame entries for PyTorch C++ code. To get line numbers
for C++ code it may be necessary to compile PyTorch in debug mode by prepending
DEBUG=1 to your setup.py develop call. Depending on
your operating system it may also be necessary to run py-spy with root
privileges.
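For example (a sketch assuming a source build of PyTorch on Linux; exact steps depend on your environment), a debug build followed by a root-privileged profiling run could look like:

DEBUG=1 python setup.py develop
sudo py-spy record -o profile.svg --native -- python test_tensor_tensor_add.py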

py-spy can also work in an htop-like "live profiling" mode and can be
tweaked to adjust the stack sampling rate, see the py-spy readme for more
details.
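For instance (the flag values here are illustrative; see the py-spy readme for the authoritative options), the live mode and a higher sampling rate look like:

py-spy top --native -- python test_tensor_tensor_add.py
py-spy record --rate 200 -o profile.svg --native -- python test_tensor_tensor_add.py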

Original classification training test results

Original classification training test setup: #111 (comment)

[screenshot: original classification training test results]

py-spy analysis

In a flame graph the y-axis shows the call stack and the x-axis shows execution time, so the wider a function appears along the x-axis, the longer it runs, which also marks it as a performance bottleneck.
The two figures below show that the line tloss = (tloss * i + loss.item()) / (i + 1)  # update mean losses has a noticeable impact on performance.
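For reference, flame graphs like the two below can be generated with an invocation of this form, where train.py stands in for the classification training entry point from #111 (the script name and output file are placeholders):

py-spy record -o train_profile.svg --native -- python train.py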

With the PyTorch backend, the line tloss = (tloss * i + loss.item()) / (i + 1)  # update mean losses is so narrow that you practically need a magnifying glass to find it:

[flame graph: PyTorch backend]

With the OneFlow backend, the same line tloss = (tloss * i + loss.item()) / (i + 1)  # update mean losses is clearly visible:

[flame graph: OneFlow backend]

Stable reproduction code

  • Machine: oneflow27-root
  • OneFlow built on 2023-03-09
  • flow.__version__='0.9.1+cu117.git.a4b7145d01' elapsed 0.7273483276367188
  • torch.__version__='1.13.0+cu117' elapsed 0.11882472038269043

The code below defines a timing Profile class and two functions, test_torch and test_oneflow.

import time

LENGTH = 148 * 100  # number of timed iterations

class Profile():
    # Simplified from the YOLOv5 Profile class; used here as a 'with Profile(label):' context manager
    def __init__(self, v):
        self.v = v  # label printed alongside the elapsed time

    def __enter__(self):
        self.start = self.time()
        return self

    def __exit__(self, type, value, traceback):
        self.dt = self.time() - self.start  # delta-time
        print(f'{self.v} elapsed {self.dt}')

    def time(self):
        return time.time()

def test_oneflow():
    import oneflow as flow
    dt = Profile(f'{flow.__version__=}')
    x = flow.Tensor([1.34]).cuda()
    tloss = 0.0
    with dt:
        for i in range(LENGTH):
            tloss = (tloss * i + x.item()) / (i + 1)

def test_torch():
    import torch
    dt = Profile(f'{torch.__version__=}')
    x = torch.Tensor([1.34]).cuda()
    tloss = 0.0
    with dt:
        for i in range(LENGTH):
            tloss = (tloss * i + x.item()) / (i + 1)

if __name__ == '__main__':
    test_oneflow()
    test_torch()
Output

flow.__version__='0.9.1+cu117.git.a4b7145d01' elapsed 0.7273483276367188
torch.__version__='1.13.0+cu117' elapsed 0.11882472038269043
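To check that the gap comes from the .item() call itself (which typically involves a device-to-host copy and a synchronization) rather than from the surrounding Python arithmetic, here is a further minimal sketch, assuming the same machine and versions as above, that times .item() in isolation:

import time

LENGTH = 148 * 100

def bench_item(make_tensor, label):
    # Time LENGTH back-to-back .item() calls on a one-element GPU tensor.
    x = make_tensor([1.34]).cuda()
    x.item()  # warm-up: the first call may include one-time setup cost
    start = time.time()
    for _ in range(LENGTH):
        x.item()
    print(f'{label} .item() x{LENGTH}: {time.time() - start:.4f}s')

if __name__ == '__main__':
    import oneflow as flow
    import torch
    bench_item(flow.Tensor, f'oneflow {flow.__version__}')
    bench_item(torch.Tensor, f'torch {torch.__version__}')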

Upcoming plans

  • Learn how to locate operator code in PyTorch
  • Get started with the nsys tool