Magic number in example #791

Open

wilderfield opened this issue Apr 8, 2024 · 7 comments

Comments
@wilderfield
Contributor

ctx_size += 1024; // some overhead

Can this magic number 1024 be explained, or perhaps replaced with a calculation?

Does it depend on the size of the output?

(I notice that if I increase the size of the input tensors, this example stops working.)

@wilderfield
Contributor Author

@FSSRepo First off, thanks for contributing this example. I just want to include you on this issue so we can discuss it. Do you recall why you picked 1024 for this overhead? Could we calculate it instead?

@FSSRepo
Collaborator

FSSRepo commented Apr 9, 2024

That number is a small extra space for the data since some operations require padding; this is necessary when performing calculations with the context (without using ggml-alloc, which internally adds that small overhead).

As for calculating it, it's mostly a matter of experimenting: try removing it and see what happens.

@wilderfield
Contributor Author

I was gdb'ing last night, and I saw that when building the graph, memory is allocated from the context's memory pool for the output tensor. It happened somewhere under ggml_mul_mat(). This logic doesn't account for that, correct?

If the inputs are 4096x2 and 2x4096, the output is 4096x4096, and ctx_size would not have enough space if we don't account for the output tensor's size. (This example highlights how the output can be far larger than the sum of the two inputs.)

Also, do we even need to reserve space for the two inputs? They are already allocated in the example, aren't they?
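
A back-of-the-envelope sketch of what such a calculation could look like for these shapes (illustrative only, not the example's code; the 16-byte alignment and the F32 type are assumptions):

#include "ggml.h"

// Rough context sizing for a 4096x2 by 2x4096 mul_mat with a 4096x4096 output.
static size_t pad_to(size_t n, size_t align) {
    return (n + align - 1) & ~(align - 1); // round n up to a multiple of align
}

static size_t estimate_ctx_size(void) {
    const size_t align = 16; // assumption: matches ggml's internal data alignment

    size_t ctx_size = 0;
    ctx_size += 3*ggml_tensor_overhead();                                  // tensor structs for a, b, and the output
    ctx_size += pad_to(4096ULL*2*ggml_type_size(GGML_TYPE_F32), align);    // input a
    ctx_size += pad_to(2ULL*4096*ggml_type_size(GGML_TYPE_F32), align);    // input b
    ctx_size += pad_to(4096ULL*4096*ggml_type_size(GGML_TYPE_F32), align); // mul_mat output
    ctx_size += ggml_graph_overhead();                                     // graph metadata, if built in the same context
    return ctx_size;
}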

@FSSRepo
Collaborator

FSSRepo commented Apr 9, 2024

You're right, that 1024 should be the size of the output tensor data. Honestly, I'm not sure how to calculate it correctly before creating the context. @slaren Any idea on how to calculate the compute buffer size before creating the compute graph with the legacy API?

The maximum memory buffer in the gpt-2 example is 256 MB:

static size_t buf_size = 256u*1024*1024;
static void * buf = malloc(buf_size);

if (mem_per_token > 0 && mem_per_token*N > buf_size) {
    const size_t buf_size_new = 1.1*(mem_per_token*N); // add 10% to account for ggml object overhead
    //printf("\n%s: reallocating buffer from %zu to %zu bytes\n", __func__, buf_size, buf_size_new);

    // reallocate
    buf_size = buf_size_new;
    buf = realloc(buf, buf_size);
    if (buf == nullptr) {
        fprintf(stderr, "%s: failed to allocate %zu bytes\n", __func__, buf_size);
        return false;
    }
}

struct ggml_init_params params = {
    /*.mem_size   =*/ buf_size,
    /*.mem_buffer =*/ buf,
    /*.no_alloc   =*/ false,
};

@slaren
Collaborator

slaren commented Apr 9, 2024

You would have to pad the size of the tensor to the alignment value. My recommendation is to use ggml-alloc for compute buffers, and ggml_backend_alloc_ctx_tensors for static tensor buffers, and let it do it for you.
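
A minimal sketch of that approach (illustrative only; the CPU backend and the small shapes are assumptions, not the example's actual code):

#include "ggml.h"
#include "ggml-alloc.h"
#include "ggml-backend.h"

int main(void) {
    // the context only holds tensor and graph metadata; data lives in backend buffers
    struct ggml_init_params params = {
        /*.mem_size   =*/ 8*ggml_tensor_overhead() + ggml_graph_overhead(),
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ true,
    };
    struct ggml_context * ctx = ggml_init(params);

    struct ggml_tensor * a = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 2, 4096);
    struct ggml_tensor * b = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 2, 4096);

    // static tensors: sized and allocated in a backend buffer automatically
    ggml_backend_t backend = ggml_backend_cpu_init();
    ggml_backend_buffer_t weights = ggml_backend_alloc_ctx_tensors(ctx, backend);
    // ... fill a and b with ggml_backend_tensor_set() here ...

    struct ggml_tensor * out = ggml_mul_mat(ctx, a, b); // 4096x4096 result
    struct ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, out);

    // compute buffer: ggml-alloc measures the graph and allocates exactly what it needs
    ggml_gallocr_t galloc = ggml_gallocr_new(ggml_backend_get_default_buffer_type(backend));
    ggml_gallocr_alloc_graph(galloc, gf);

    ggml_backend_graph_compute(backend, gf);

    ggml_gallocr_free(galloc);
    ggml_backend_buffer_free(weights);
    ggml_backend_free(backend);
    ggml_free(ctx);
    return 0;
}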

@wilderfield
Contributor Author

Tangentially, I also wanted to profile the matrix multiplication. I put a loop and timers around this line and ran it for 1000 iterations:

ggml_graph_compute_with_ctx(model.ctx, gf, n_threads);

Again, I see the context running out of memory. How could this example be modified to run iteratively?

@slaren
Collaborator

slaren commented Apr 9, 2024

ggml_graph_compute_with_ctx uses the context buffer to allocate a work buffer. Calling it repeatedly will cause the work buffer to be allocated on every iteration, until it runs out of memory. This is not a good way to test the performance of an operation since it will include other overheads, such as starting the threads. test-backend-ops has an option to test the performance of individual ops.
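
For completeness, one way to loop without exhausting the context is to build the plan and work buffer once and reuse them with the lower-level ggml_graph_plan / ggml_graph_compute API. This is only a sketch: it assumes the gf and n_threads from the example, the thread-start overhead mentioned above still applies, and the exact ggml_graph_plan signature varies between ggml versions.

// Build the plan and the work buffer once, then reuse them, instead of letting
// ggml_graph_compute_with_ctx carve a new work buffer out of the context on every call.
struct ggml_cplan plan = ggml_graph_plan(gf, n_threads);
uint8_t * work_buf = (uint8_t *) malloc(plan.work_size > 0 ? plan.work_size : 1);
plan.work_data = work_buf;

const int64_t t_start_us = ggml_time_us();
for (int i = 0; i < 1000; ++i) {
    ggml_graph_compute(gf, &plan);
}
const int64_t t_end_us = ggml_time_us();
printf("avg: %.1f us per iteration\n", (t_end_us - t_start_us)/1000.0);

free(work_buf);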
