Why the classification model training test is a few seconds slower than PyTorch each time, with reproducible code #120

ccssu (Collaborator) opened this issue on Mar 9, 2023

  • Preface
  • py-spy analysis
  • Stable reproduction code
  • Upcoming plans

Preface

While investigating how to locate the C++ code behind PyTorch's Python APIs (https://github.com/Oneflow-Inc/OneTeam/issues/147),

I tried py-spy, a performance profiling tool recommended on the PyTorch website.

It traced the slowdown from PR #111 (the classification model training test runs a few seconds slower than PyTorch each time) to this line: tloss = (tloss * i + loss.item()) / (i + 1)  # update mean losses

Profiling with `py-spy`

Evaluating the performance impact of code changes in PyTorch can be complicated,
particularly if code changes happen in compiled code. One simple way to profile
both Python and C++ code in PyTorch is to use
py-spy, a sampling profiler for Python
that has the ability to profile native code and Python code in the same session.

py-spy can be installed via pip:

pip install py-spy

To use py-spy, first write a Python test script that exercises the
functionality you would like to profile. For example, this script profiles
torch.add:

import torch

t1 = torch.tensor([[1, 1], [1, 1.]])
t2 = torch.tensor([[0, 0], [0, 0.]])

for _ in range(1000000):
    torch.add(t1, t2)

Since the torch.add operation happens in microseconds, we repeat it a large
number of times to get good statistics. The most straightforward way to use
py-spy with such a script is to generate a flame graph:

py-spy record -o profile.svg --native -- python test_tensor_tensor_add.py

This will output a file named profile.svg containing a flame graph you can
view in a web browser or SVG viewer. Individual stack frame entries in the graph
can be selected interactively with your mouse to zoom in on a particular part of
the program execution timeline. The --native command-line option tells
py-spy to record stack frame entries for PyTorch C++ code. To get line numbers
for C++ code it may be necessary to compile PyTorch in debug mode by prepending
DEBUG=1 to your setup.py develop call. Depending on
your operating system it may also be necessary to run py-spy with root
privileges.
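For example (a sketch assuming a source build of PyTorch on Linux; exact steps depend on your environment), a debug build followed by a root-privileged profiling run could look like:

DEBUG=1 python setup.py develop
sudo py-spy record -o profile.svg --native -- python test_tensor_tensor_add.py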

py-spy can also work in an htop-like "live profiling" mode and can be
tweaked to adjust the stack sampling rate, see the py-spy readme for more
details.
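For instance (the flag values here are illustrative; see the py-spy readme for the authoritative options), the live mode and a higher sampling rate look like:

py-spy top --native -- python test_tensor_tensor_add.py
py-spy record --rate 200 -o profile.svg --native -- python test_tensor_tensor_add.py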

Original classification training test results

Original classification training test setup: #111 (comment)

[screenshot: original classification training test results]

py-spy analysis

In a flame graph the y-axis shows the call stack and the x-axis shows execution time, so the wider a function appears along the x-axis, the longer it runs, which also marks it as a performance bottleneck.
The two figures below show that the line tloss = (tloss * i + loss.item()) / (i + 1)  # update mean losses has a noticeable impact on performance.
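For reference, flame graphs like the two below can be generated with an invocation of this form, where train.py stands in for the classification training entry point from #111 (the script name and output file are placeholders):

py-spy record -o train_profile.svg --native -- python train.py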

With the PyTorch backend, the line tloss = (tloss * i + loss.item()) / (i + 1)  # update mean losses is so narrow that you practically need a magnifying glass to find it:

[flame graph: PyTorch backend]

With the OneFlow backend, the same line tloss = (tloss * i + loss.item()) / (i + 1)  # update mean losses is clearly visible:

[flame graph: OneFlow backend]

Stable reproduction code

  • Machine: oneflow27-root
  • OneFlow built on 2023-03-09
  • flow.__version__='0.9.1+cu117.git.a4b7145d01' elapsed 0.7273483276367188
  • torch.__version__='1.13.0+cu117' elapsed 0.11882472038269043

The code below defines a timing Profile class and two functions, test_torch and test_oneflow.

import time

LENGTH = 148 * 100  # number of timed iterations

class Profile():
    # Simplified from the YOLOv5 Profile class; used here as a 'with Profile(label):' context manager
    def __init__(self, v):
        self.v = v  # label printed alongside the elapsed time

    def __enter__(self):
        self.start = self.time()
        return self

    def __exit__(self, type, value, traceback):
        self.dt = self.time() - self.start  # delta-time
        print(f'{self.v} elapsed {self.dt}')

    def time(self):
        return time.time()

def test_oneflow():
    import oneflow as flow
    dt = Profile(f'{flow.__version__=}')
    x = flow.Tensor([1.34]).cuda()
    tloss = 0.0
    with dt:
        for i in range(LENGTH):
            tloss = (tloss * i + x.item()) / (i + 1)

def test_torch():
    import torch
    dt = Profile(f'{torch.__version__=}')
    x = torch.Tensor([1.34]).cuda()
    tloss = 0.0
    with dt:
        for i in range(LENGTH):
            tloss = (tloss * i + x.item()) / (i + 1)

if __name__ == '__main__':
    test_oneflow()
    test_torch()
Output

flow.__version__='0.9.1+cu117.git.a4b7145d01' elapsed 0.7273483276367188
torch.__version__='1.13.0+cu117' elapsed 0.11882472038269043
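To check that the gap comes from the .item() call itself (which typically involves a device-to-host copy and a synchronization) rather than from the surrounding Python arithmetic, here is a further minimal sketch, assuming the same machine and versions as above, that times .item() in isolation:

import time

LENGTH = 148 * 100

def bench_item(make_tensor, label):
    # Time LENGTH back-to-back .item() calls on a one-element GPU tensor.
    x = make_tensor([1.34]).cuda()
    x.item()  # warm-up: the first call may include one-time setup cost
    start = time.time()
    for _ in range(LENGTH):
        x.item()
    print(f'{label} .item() x{LENGTH}: {time.time() - start:.4f}s')

if __name__ == '__main__':
    import oneflow as flow
    import torch
    bench_item(flow.Tensor, f'oneflow {flow.__version__}')
    bench_item(torch.Tensor, f'torch {torch.__version__}')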

Upcoming plans

  • Learn how to locate operator code in PyTorch
  • Get started with the nsys tool