
Excessive memory consumption #759

Open · 1 of 5 tasks
iSuslov opened this issue May 12, 2024 · 4 comments
Labels
bug Something isn't working

Comments

iSuslov commented May 12, 2024

System Info

  • M1 Pro, 16 GB
  • I assume the Phi-3 demo uses version 2.17.1

Environment/Platform

  • [x] Website/web-app
  • [ ] Browser extension
  • [ ] Server-side (e.g., Node.js, Deno, Bun)
  • [ ] Desktop app (e.g., Electron)
  • [ ] Other (e.g., VSCode extension)

Description

For the latest Phi-3 demo, Chrome uses 5.31 GB for the Renderer process and 4.16 GB for the GPU process, totaling almost 10 GB while running a ~2 GB model.

After the first inference, memory consumption jumps above 12 GB. That can't be normal.

Reproduction

  1. Open the demo page: https://huggingface.co/spaces/Xenova/experimental-phi3-webgpu
  2. Click "Load model".
  3. Check memory consumption.
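For anyone trying to reproduce this outside the Space, here is a minimal sketch using the published @xenova/transformers package (the demo itself ran an experimental WebGPU build, so its internals differ; `performance.memory` is a non-standard, Chrome-only API):

```ts
import { pipeline } from '@xenova/transformers';

// Load the same model repo the demo downloads from (sketch; the
// Space's actual loading code uses an experimental WebGPU build).
const generator = await pipeline(
  'text-generation',
  'Xenova/Phi-3-mini-4k-instruct_fp16',
);

// Non-standard, Chrome-only API: samples the renderer's JS heap.
// GPU-process memory is only visible in Chrome's Task Manager.
const mem = (performance as any).memory;
if (mem) {
  console.log(`JS heap: ${(mem.usedJSHeapSize / 2 ** 30).toFixed(2)} GiB`);
}

await generator('Hello, my name is', { max_new_tokens: 16 });
```

Note that `usedJSHeapSize` covers only the V8 heap, which is typically far less than the full Renderer-process figure shown in Chrome's Task Manager.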
iSuslov added the bug label on May 12, 2024
xenova (Owner) commented May 12, 2024

> 5.31 GB for the Renderer process

Can you confirm that you don't have any other tabs open? I can't see how this could be related to the application itself (not much is being rendered).

> 4.16 GB for the GPU process

This makes sense, since it's (most likely) running in fp16 mode.
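A rough back-of-envelope to put that in perspective (my numbers, assuming Phi-3-mini's roughly 3.8B parameters; nothing here was measured in this thread):

```ts
const params = 3.8e9; // approximate Phi-3-mini parameter count (assumption)

// 4-bit weights as downloaded, ignoring quantization scales:
const q4GB = (params * 0.5) / 1e9;  // ~1.9 GB
// The same weights held as fp16 tensors:
const fp16GB = (params * 2) / 1e9;  // ~7.6 GB

console.log({ q4GB, fp16GB });
// If part of the graph is kept as fp16 buffers alongside the q4
// weights, a multi-GB GPU process is plausible even before the KV
// cache and runtime overhead are counted.
```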

iSuslov (Author) commented May 12, 2024

@xenova I can confirm:

  • no other tabs open
  • no browser extensions

One empty tab open:
[Screenshot: memory usage, 2024-05-12 9:52 AM]

Navigated to https://huggingface.co/spaces/Xenova/experimental-phi3-webgpu and loaded the model:
[Screenshot: memory usage, 2024-05-12 9:53 AM]


> This makes sense since it's (most likely) running in fp16 mode.

Can we make it run at lower precision if we use q4 quantization? (We can't; ONNX doesn't support q4 yet.)

To avoid any confusion, this is the downloaded model:
/Xenova/Phi-3-mini-4k-instruct_fp16/resolve/main/onnx/model_q4.onnx (838 MB)
/Xenova/Phi-3-mini-4k-instruct_fp16/resolve/main/onnx/model_q4.onnx_data (1454 MB)
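For what it's worth, the API that later shipped as transformers.js v3 lets you request those q4 weights explicitly; a hypothetical sketch using the v3 option names (the experimental branch the demo ran may have spelled them differently):

```ts
import { pipeline } from '@huggingface/transformers'; // v3 package name

const generator = await pipeline(
  'text-generation',
  'Xenova/Phi-3-mini-4k-instruct_fp16',
  {
    device: 'webgpu', // run on the GPU instead of the WASM backend
    dtype: 'q4',      // fetch onnx/model_q4.onnx(_data) instead of fp16/fp32
  },
);
```

Even then, q4 is a storage format: the weights are dequantized for compute, so the matmuls still run at 16- or 32-bit precision.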

flatsiedatsie commented
WebGPU currently only supports 16- and 32-bit modes.
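A quick way to check whether a given browser/GPU combination actually exposes fp16 ('shader-f16' is the standard WebGPU feature name):

```ts
const adapter = await navigator.gpu?.requestAdapter();
if (!adapter) {
  console.log('WebGPU is not available in this browser');
} else {
  // Without 'shader-f16', shaders compute in 32-bit even when the
  // model weights were exported as fp16.
  console.log('shader-f16 supported:', adapter.features.has('shader-f16'));
}
```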

everythinginjs commented
Same here.
