Mac M1 Pro: the model runs, but it hangs and never returns an answer! #462

delete-x · 2023-04-08T07:45:45Z

Is there an existing issue for this?

I have searched the existing issues

Current Behavior

queue: 1/1 | 561.3s.

Expected Behavior

No response

Steps To Reproduce

mac m1pro
torch_nightly_env
half().to('mps')

Environment

- OS:  macos 13.3.1
- Python:  3.10
- Transformers:  4.27
- PyTorch: torch_nightly_env
- CUDA Support (`python -c "import torch; print(torch.cuda.is_available())"`) :   mps

Anything else?

No response

cylee0909 · 2023-04-08T08:56:44Z

Try changing half().to('mps') to float().to('mps').

duzx16 · 2023-04-09T08:13:01Z

This is probably a PyTorch bug on MPS, pytorch/pytorch#96602, which makes the attention scores come out as NaN.
Following huggingface/diffusers#2643, I made a fix: https://huggingface.co/THUDM/chatglm-6b/commit/cde457b39fe0670b10dd293909aab17387ea2c80 . It tested fine on an M1 Pro with pytorch 2.1.0.dev20230408.

zz10247 · 2023-04-10T02:48:20Z

Hello, I'd like to ask: on an M1 Pro with 16 GB, replies are extremely slow with .half().to('mps'). Is that normal? The first reply took more than 1000 seconds.

duzx16 · 2023-04-10T06:35:31Z

That's not normal. Open Activity Monitor and check the GPU usage.

zz10247 · 2023-04-10T07:07:29Z

GPU usage is very low. Does that mean the to('mps') mode isn't actually active?

dydwgmcnl4241 · 2023-04-11T06:18:51Z

On an M1 machine, .half().to('mps') fails with the error below, while float() works fine. What should I change? Thanks!

/Users/wilson/miniforge3/lib/python3.10/site-packages/transformers/generation/utils.py:686: UserWarning: MPS: no support for int64 repeats mask, casting it to int32 (Triggered internally at /Users/runner/work/pytorch/pytorch/pytorch/aten/src/ATen/native/mps/operations/Repeat.mm:236.)
input_ids = input_ids.repeat_interleave(expand_size, dim=0)
The dtype of attention mask (torch.int64) is not bool
loc("varianceEps"("(mpsFileLoc): /AppleInternal/Library/BuildRoots/97f6331a-ba75-11ed-a4bc-863efbbaf80d/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm":228:0)): error: input types 'tensor<1x4x1xf16>' and 'tensor<1xf32>' are not broadcast compatible
LLVM ERROR: Failed to infer result type(s).

yg838457845 · 2023-04-15T10:45:47Z

same problem, after I update the system version to 13.3.1

zhaopengme · 2023-04-15T11:06:53Z

Did you get it running? I just tried and it also hangs forever.

zhouxiunai · 2023-04-18T06:51:33Z

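On an M1 with half, asking a simple "你好" ("hello") takes more than 10 minutes. float runs straight out of memory. Quantization fails with an error demanding CUDA. Is anyone else seeing the same thing? Separately, the BELLE project can use int4 quantization, and it works quite well.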

liury889 · 2023-04-18T07:17:10Z

Yes, I spent half a day hitting these pitfalls; my situation is the same as yours.

duzx16 · 2023-04-18T08:11:55Z

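For quantization, use https://huggingface.co/THUDM/chatglm-6b-int4 directly; it ships a CPU kernel and doesn't need CUDA.
As for half on MPS, it was quite fast when I tested on a machine with 32 GB of memory. With 16 GB it probably has to swap, which is why it's so slow.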
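
The swap explanation can be checked with back-of-the-envelope arithmetic. The sketch below estimates the size of the weights alone at each precision; the parameter count (roughly 6.2 billion) is an assumption that does not come from this thread, and activations, the KV cache, and the OS itself need memory on top of this.

```python
# Rough weight-memory estimate. PARAMS is an assumed round figure, not a measured value.
PARAMS = 6.2e9  # assumption: ChatGLM-6B has roughly 6.2 billion parameters

# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"float32": 4.0, "float16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_gib(dtype: str, params: float = PARAMS) -> float:
    """Return the approximate size of the weights alone, in GiB."""
    return params * BYTES_PER_PARAM[dtype] / 2**30


if __name__ == "__main__":
    for dtype in BYTES_PER_PARAM:
        print(f"{dtype:>8}: ~{weight_gib(dtype):.1f} GiB")
```

Under that assumption the half-precision weights alone take most of a 16 GB machine's unified memory, which is consistent with the swapping described above, and float32 needs twice as much again.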

liury889 · 2023-04-18T08:15:21Z

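int4 won't run at all on my M1 Pro 16 GB machine. Can you actually run it directly? Are there detailed steps? Please advise; I can't tell where it's going wrong.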

silentlit · 2023-04-18T13:37:16Z

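I ran into this problem too, and it's now solved.
M1 Max, macOS 13.2/13.3.1, 64 GB.
With the environment set up according to the README, everything worked without to('mps'); with to('mps') there was no reply and memory blew up.

Here is the whole process again, using Anaconda.
First create a virtual environment:
conda create -n chatGLM python=3.9

Activate the environment:
conda activate chatGLM

Set up the environment:
conda install pytorch torchvision torchaudio -c pytorch-nightly

Then install the dependencies listed in requirements.txt one by one with conda install, except torch. Some of them have to come from pip:
conda install protobuf
conda install transformers
pip install cpm_kernels
pip install gradio
pip install mdtex2html
conda install sentencepiece
pip install accelerate

Next, go to https://anaconda.org/conda-forge/transformers/files and download [noarch/transformers-4.26.0-pyhd8ed1ab_0.conda](https://anaconda.org/conda-forge/transformers/4.26.0/download/noarch/transformers-4.26.0-pyhd8ed1ab_0.conda)
Once it has downloaded, run conda install followed by the path to the transformers-4.26.0 file you downloaded.

After that, MPS acceleration worked normally.

I didn't modify the code from git, so it is almost certainly an environment configuration problem.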
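
Since the fix here came down to which package versions ended up in the environment, it can help to print them before debugging further. The following is a minimal sketch, not part of the original steps: the function names are made up for illustration, and it only reports what is installed without changing anything.

```python
import platform
from importlib import metadata


def installed_version(package: str) -> str:
    """Return the installed version of a package, or 'not installed'."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return "not installed"


def mps_status() -> str:
    """Report whether the installed PyTorch can see the MPS backend."""
    try:
        import torch
    except ImportError:
        return "torch not installed"
    mps = getattr(torch.backends, "mps", None)
    if mps is None or not mps.is_built():
        return "this torch build has no MPS support"
    return "available" if mps.is_available() else "built but unavailable"


def environment_report() -> dict:
    """Collect the facts that mattered in this thread into one dictionary."""
    return {
        "python": platform.python_version(),
        "machine": platform.machine(),
        "torch": installed_version("torch"),
        "transformers": installed_version("transformers"),
        "mps": mps_status(),
    }


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```

Comparing this output between a working and a non-working environment shows at a glance whether the torch build, the transformers version, or the interpreter architecture differs.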

duzx16 · 2023-04-18T14:25:55Z

Yes, it runs. The earlier code forced the use of OpenMP, so it couldn't run. Update and it should work. (You need to install the Xcode Command Line Tools.)

liury889 · 2023-04-18T16:24:30Z

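1. I downloaded everything again and retried, using the chatglm-6b-int4 model: https://huggingface.co/THUDM/chatglm-6b-int4
2. Change the model-loading code in web_demo.py or api.py as in this example:

from transformers import AutoModel, AutoTokenizer

model_dir = "your local dir"  # local path to the downloaded model files
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
# For the INT8-quantized model, change "THUDM/chatglm-6b-int4" to "THUDM/chatglm-6b-int8"
# .float() with no .to('mps') keeps the quantized model on the CPU
model = AutoModel.from_pretrained(model_dir, trust_remote_code=True).float()

3. Download the https://github.com/THUDM/ChatGLM-6B.git project, change into the project directory, and run

python  web_demo.py

or

python  api.py

4. Ignore the startup error (of this kind: Message: 'Failed to load cpm_kernels:' Arguments: (RuntimeError('Unknown platform: darwin'),)) and just use it. Answers arrive after a delay of about 5 seconds (M1 Pro + 16 GB).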

liury889 · 2023-04-18T16:31:39Z

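I gave it a try. With int4 on the CPU, it starts locally and can be called successfully. Started the GPU way (to("mps")), it starts up fine, but calling the service produces the following error:
func = kernels.int4WeightExtractionHalf
AttributeError: 'NoneType' object has no attribute 'int4WeightExtractionHalf'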

zwzheng45 · 2023-04-19T12:55:16Z

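I get this error too on a Mac with an AMD 5500M. I reinstalled the dependencies following the method posted above and it still doesn't work. Changing half().to('mps') to float() so that it runs on the CPU does give normal answers.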

duzx16 · 2023-04-19T14:38:21Z

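Quantized models can't run on the MPS backend, because the kernels they use are written in CUDA; they can only run on the CPU.
The half-precision model can run on the MPS backend. Apple's official line is that AMD GPUs can use the MPS backend, but I haven't tried it.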
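
The rule above can be written down as a small decision helper. This is only a sketch of the rule as stated in this thread: choose_placement is a hypothetical name, and the CPU fallback for machines without MPS is an added assumption.

```python
from typing import Tuple


def choose_placement(quantized: bool, mps_available: bool) -> Tuple[str, str]:
    """Return (device, dtype) following the rule described in this thread."""
    if quantized:
        # Quantized checkpoints depend on CUDA kernels, so on a Mac they stay on the CPU.
        return ("cpu", "float32")
    if mps_available:
        # Unquantized half-precision weights can run on the MPS backend.
        return ("mps", "float16")
    # Assumed fallback when no MPS device is visible.
    return ("cpu", "float32")
```

In the demo scripts, ("cpu", "float32") corresponds to ending the load call with .float(), and ("mps", "float16") to .half().to('mps'). With PyTorch installed, mps_available would come from torch.backends.mps.is_available().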

newpepsi · 2023-04-20T02:45:20Z

I made the change as suggested:
model = AutoModel.from_pretrained(model_name, trust_remote_code=True).float().to('mps')
When I start the program and enter the first question, this error appears:
用户：你是谁？ (User: Who are you?)
Traceback (most recent call last):
File "/Users/wanghao/PycharmProjects/ChatGLM-6B/mac_cli_demo.py", line 57, in
main()
File "/Users/wanghao/PycharmProjects/ChatGLM-6B/mac_cli_demo.py", line 42, in main
for response, history in model.stream_chat(tokenizer, query, history=history):
File "/Users/wanghao/PycharmProjects/ChatGLM-6B/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 35, in generator_context
response = gen.send(None)
File "/Users/wanghao/.cache/huggingface/modules/transformers_modules/chatglm-6b-int4/modeling_chatglm.py", line 1312, in stream_chat
for outputs in self.stream_generate(**inputs, **gen_kwargs):
File "/Users/wanghao/PycharmProjects/ChatGLM-6B/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 35, in generator_context
response = gen.send(None)
File "/Users/wanghao/.cache/huggingface/modules/transformers_modules/chatglm-6b-int4/modeling_chatglm.py", line 1389, in stream_generate
outputs = self(
File "/Users/wanghao/PycharmProjects/ChatGLM-6B/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/Users/wanghao/.cache/huggingface/modules/transformers_modules/chatglm-6b-int4/modeling_chatglm.py", line 1191, in forward
transformer_outputs = self.transformer(
File "/Users/wanghao/PycharmProjects/ChatGLM-6B/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/Users/wanghao/.cache/huggingface/modules/transformers_modules/chatglm-6b-int4/modeling_chatglm.py", line 997, in forward
layer_ret = layer(
File "/Users/wanghao/PycharmProjects/ChatGLM-6B/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/Users/wanghao/.cache/huggingface/modules/transformers_modules/chatglm-6b-int4/modeling_chatglm.py", line 627, in forward
attention_outputs = self.attention(
File "/Users/wanghao/PycharmProjects/ChatGLM-6B/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/Users/wanghao/.cache/huggingface/modules/transformers_modules/chatglm-6b-int4/modeling_chatglm.py", line 445, in forward
mixed_raw_layer = self.query_key_value(hidden_states)
File "/Users/wanghao/PycharmProjects/ChatGLM-6B/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/Users/wanghao/.cache/huggingface/modules/transformers_modules/chatglm-6b-int4/quantization.py", line 375, in forward
output = W8A16Linear.apply(input, self.weight, self.weight_scale, self.weight_bit_width)
File "/Users/wanghao/PycharmProjects/ChatGLM-6B/venv/lib/python3.10/site-packages/torch/autograd/function.py", line 506, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
File "/Users/wanghao/.cache/huggingface/modules/transformers_modules/chatglm-6b-int4/quantization.py", line 53, in forward
weight = extract_weight_to_half(quant_w, scale_w, weight_bit_width)
File "/Users/wanghao/.cache/huggingface/modules/transformers_modules/chatglm-6b-int4/quantization.py", line 262, in extract_weight_to_half
func = kernels.int4WeightExtractionHalf
AttributeError: 'NoneType' object has no attribute 'int4WeightExtractionHalf'

duzx16 · 2023-04-20T02:48:39Z

The MPS backend can't be used with quantized models.
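
One plausible reading of the traceback, given the earlier 'Failed to load cpm_kernels' warning, is that the kernel object was left as None after the load failed, and the failure only surfaced later as an AttributeError. The sketch below reproduces that pattern with a simplified stand-in; FakeKernels, load_kernels, and get_extraction_func are hypothetical names, and this is not the actual code in quantization.py.

```python
from typing import Callable, Optional


class FakeKernels:
    """Hypothetical stand-in for a compiled kernel module (not the real class)."""

    def int4WeightExtractionHalf(self) -> str:
        return "extracted"


def load_kernels(platform_supported: bool) -> Optional[FakeKernels]:
    """Return a kernel object, or None when the platform cannot load one."""
    try:
        if not platform_supported:
            raise RuntimeError("Unknown platform: darwin")
        return FakeKernels()
    except RuntimeError as exc:
        # The load failure is only logged, so the caller receives None.
        print(f"Failed to load kernels: {exc}")
        return None


def get_extraction_func(kernels: Optional[FakeKernels]) -> Callable[[], str]:
    """Look up the kernel, failing with a readable message instead of AttributeError."""
    if kernels is None:
        raise RuntimeError("quantized kernels are unavailable; keep the model on the CPU")
    return kernels.int4WeightExtractionHalf
```

An explicit None check like the one in get_extraction_func turns the confusing AttributeError into a message that points at the real cause.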

iiilin · 2023-04-22T13:16:05Z

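M1 Pro, 32 GB, macOS 13.3.1: I hit the same problem. After some digging, it turned out to be my Python. With an x86_64 Python it hangs and eventually runs out of memory; switching to the arm build of Python that ships with miniconda fixed it.
If you see something similar, you can run import platform; print(platform.uname()[4]) to check whether your build is x86_64 or arm64.
After the change, the output for "你好" ("hello") took 8.73 s and the output for "晚上睡不着" ("can't sleep at night") took 83.07 s, with no out-of-memory, so this should be the expected behavior.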
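
The architecture check can be wrapped in a small script. This is a minimal sketch: describe_arch is a made-up helper name, and platform.machine() returns the same value as platform.uname()[4].

```python
import platform


def describe_arch(machine: str) -> str:
    """Explain what the interpreter's reported architecture means on a Mac."""
    if machine == "arm64":
        return "native Apple Silicon build"
    if machine == "x86_64":
        # On an M1 Mac, an x86_64 interpreter runs through Rosetta translation.
        return "x86_64 build (translated by Rosetta on Apple Silicon)"
    return f"unrecognised architecture: {machine}"


if __name__ == "__main__":
    print(describe_arch(platform.machine()))
```

If this reports an x86_64 build on an M1 machine, installing an arm64 Python (for example the one that ships with miniconda, as described above) is the first thing to try.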

yhyu13 · 2023-04-23T08:47:46Z

@ALL The text-generation-webui project has already hit this pitfall! oobabooga/text-generation-webui#393

PyTorch currently has to be the nightly build. PyTorch 2.1.0 will probably include the fix.

unfish · 2023-04-23T10:59:29Z

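On an M1 Ultra with 64 GB of memory, the unquantized version runs without trouble and the output speed is acceptable, but once an answer goes past about 500 characters memory blows up and it produces one character every ten-odd seconds.
After switching to the 8-bit quantized version, memory usage did drop, but it runs entirely on the CPU, which is even slower than the memory-bound case above; it is painfully slow from the very start of the answer.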

zhaopengme · 2023-04-24T06:15:23Z

Mac M1 16 GB: CPU utilization is 20%.

yfyang86 · 2023-05-03T04:33:13Z

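I added some Mac (ARM64) documentation in this fork:

How to configure OpenMP
Calling the quantized model on the CPU (OpenMP mode / single-core mode)
The MPS problem with the quantized model and its root cause

If there are no problems, I'll submit a PR.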

K024 · 2023-05-15T06:19:48Z

I exported an ONNX int8-quantized model that goes through the ONNXRuntime MatMulInteger operator. In testing on an M1 the speed is perfectly fine. (Hugging Face link)

dalei2019 · 2023-05-22T05:58:03Z

fork

Hello, are you running it with the MPS backend?

I'm on an M1 as well, but the replies are all emoji. Could you help take a look at this issue:

#1080

jeffwubj · 2023-06-18T14:31:09Z

Tested on M1 Pro + pytorch 2.1.0.dev20230408

MacBook Pro: after upgrading PyTorch to nightly, half now runs. 👍

duzx16 closed this as completed Apr 9, 2023

duzx16 pinned this issue Apr 9, 2023

TreasureJade mentioned this issue Apr 26, 2023

[BUG/Help] M1 Pro 16G: questions always hang #823 (Open)

sdhjl2000 mentioned this issue Apr 27, 2023

M1 with 32 GB of memory: inference with MPS is very slow #835 (Open)

yfyang86 mentioned this issue May 3, 2023

[Document] Update the Mac deployment instructions #899 (Merged)

Zylsjsp mentioned this issue May 8, 2024

[BUG/Help] <title> RuntimeError: "bernoulli_scalar_cpu_" not implemented for 'Half' when reproducing P-Tuning fine-tuning #1468 (Open)
