
Graph Convolutional Layers - Builtin #1941

Open · wants to merge 3 commits into base: main

Conversation

@MaximilianSchreff (Contributor)

Adding a built-in graph convolutional layer to our nn layer directory. The graph convolutional layer follows the paper "Semi-Supervised Classification with Graph Convolutional Networks" by Kipf and Welling. This includes the forward and the backward pass of the layer, as well as an example network showing how to use the layers.

We tested the implementation against PyTorch: our implementation computes exactly the same values as PyTorch and handles missing in-bound edges of a node the same way in the spectral convolution. This is also covered by a component test (NNComponentTest), where we hard-coded the PyTorch-initialized weights and the expected result.
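For reference, the forward rule from the Kipf & Welling paper can be sketched in NumPy (illustrative only; function and variable names here are hypothetical, not the DML API in scripts/nn/layers/graph_conv.dml):

```python
import numpy as np

# Illustrative NumPy sketch of the Kipf & Welling GCN forward rule:
#   H = D_hat^(-1/2) @ A_hat @ D_hat^(-1/2) @ X @ W + b,  with A_hat = A + I.
def gcn_forward(A, X, W, b):
    n = A.shape[0]
    A_hat = A + np.eye(n)                      # add self-loops
    d = A_hat.sum(axis=1)                      # node degrees incl. self-loop
    d_inv_sqrt = 1.0 / np.sqrt(d)
    A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]  # D^-1/2 A_hat D^-1/2
    return A_norm @ X @ W + b                  # aggregate, transform, add bias

# toy example: 3-node chain graph, 2 input features, 1 output feature
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
X = np.random.randn(3, 2)
W = np.random.randn(2, 1)
b = np.zeros((1, 1))
out = gcn_forward(A, X, W, b)
print(out.shape)  # (3, 1)
```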

@MaximilianSchreff (Contributor Author)

About the performance: my machine showed performance issues when testing against PyTorch for very large inputs.

Stress test: SystemDS: 340 seconds - PyTorch: 32 seconds

The stress test consisted of about 300 forward passes with roughly 10,000 x 10,000 matrices. This is likely a problem with my setup rather than my implementation, since the affine layer alone with the same inputs took 220 seconds. The GCL consists of a simple affine part and a convolutional part, with the convolutional part being considerably more complex. So the implementation is likely quite fast, because the complex convolution part makes up less than a third of the runtime.
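A stress test of this shape can be sketched as follows (NumPy, with sizes scaled down so it runs quickly; the actual test used ~300 passes over ~10,000 x 10,000 matrices):

```python
import time
import numpy as np

# Scaled-down sketch of the stress test described above: repeated
# forward passes where an affine step (X @ W + b) dominates the runtime.
# 500 x 500 with 10 repetitions is used here so the script finishes fast.
n, reps = 500, 10
X = np.random.randn(n, n)
W = np.random.randn(n, n)
b = np.random.randn(1, n)

start = time.perf_counter()
for _ in range(reps):
    out = X @ W + b
elapsed = time.perf_counter() - start
print(f"{reps} affine passes over {n}x{n} matrices: {elapsed:.3f} s")
```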

@MaximilianSchreff (Contributor Author)

About caching:

One option would be to cache the normalized weights, since they are fairly expensive to compute (counting every degree plus the spectral convolution). In the stress test, this only improved the runtime by 10 seconds out of 340. Moreover, when testing the caching feature on smaller inputs than the huge inputs of the stress test, it was always slower.
Given the added complexity of handling cached weights and the additional layer outputs, I decided against caching.

@Baunsgaard (Contributor)

Baunsgaard commented Nov 6, 2023

> About the performance: My machine showed performance issues when testing against PyTorch for very large inputs. […]

Oops, this was graph conv - I thought I was commenting on the ResNet.

@Baunsgaard (Contributor)

> About the performance: My machine showed performance issues when testing against PyTorch for very large inputs. […]

Can you show the `-stats` output of calling it, to indicate where the time is spent?
Maybe we also want to profile it using a profiler: https://github.com/async-profiler/async-profiler

If you are in doubt about how to use it, I can show you in the office.

@MaximilianSchreff (Contributor Author)

I already profiled it. The graph convolutional layer consists of a linear part and a convolution part; the linear part takes up more than 70% of the runtime. This suggests there are issues in my configuration restricting SystemDS from being faster, since the linear part is the very baseline of this layer.

@Baunsgaard (Contributor)

> I already profiled it. […]

With "linear layer", do you mean a simple fully connected affine layer (i.e. a matrix multiplication), or something else?

Can you maybe give me an example in code?

@MaximilianSchreff (Contributor Author)

Yes, exactly. There is a matrix multiplication followed by adding a bias. That is what takes up 70% of the runtime.
You can see it in scripts/nn/layers/graph_conv.dml in the forward-pass function: just two lines that use most of the runtime. These two lines cannot be optimized further within my layer.

As for the rest of the forward pass, I already took numerous steps to optimize it, from initially 780 seconds down to 340 seconds for the whole layer. This includes:

  • Merging the different functions into one big function to combine for-loops, improving the parallelization gain
  • Optimizing the addition of self-loops
  • Caching normalization weights (removed again, since it wasn't faster)
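The self-loop step from the list above can be done as one vectorized append rather than a per-node loop; a NumPy sketch (names are illustrative, not the DML code):

```python
import numpy as np

# Vectorized addition of self-loops to an (m, 2) edge list of
# (source, target) pairs, instead of appending one loop per node
# inside a for-loop. n is the number of nodes.
def add_self_loops(edge_index, n):
    loops = np.stack([np.arange(n), np.arange(n)], axis=1)  # (i, i) rows
    return np.concatenate([edge_index, loops], axis=0)

edges = np.array([[0, 1], [1, 2]])
print(add_self_loops(edges, 3))
# rows: (0,1), (1,2), then self-loops (0,0), (1,1), (2,2)
```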

@Baunsgaard Baunsgaard (Contributor) left a comment

I quickly looked through it, and I am unsure which operation you are referring to.
I assume the initial matrix multiplication is expensive, but that only depends on the input size, and it will always be expensive. Some of the other code contains many as.integer or as.scalar calls; most of them should not be needed, but I do not think they have much impact on performance.

{
edge_weight[j, 1] = 0.0
}
X_out[as.integer(as.scalar(edge_index[j, 2]))] += as.scalar(edge_weight[j, 1]) * X_hat[as.integer(as.scalar(edge_index[j, 1]))]
Contributor

All these as.integer(as.scalar(...)) calls should not be needed.

Contributor Author

I think I got some issues when not using them. I will take a look at them.

m = nrow(edge_index)

# transform
X_hat = X %*% W
Contributor

This is the slow operation?

Contributor Author

Yes, this is the slow operation. But adding the bias at the end takes just as much time.

Contributor Author

This matrix multiplication takes around 35% of the runtime, while adding the bias at the end also takes around 35%.

Contributor

In practice, adding the bias should be very fast compared to the matrix multiplication.
Looking at the code, it seems to me that you access indices rather than adding vectors.
This might be the issue.
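The difference the reviewer is pointing at can be illustrated in NumPy: per-row index access versus a single broadcast addition. Both compute the same result, but the vectorized form is far faster:

```python
import numpy as np

# Per-row index access vs. one vectorized (broadcast) bias addition.
X_hat = np.random.randn(1000, 64)
b = np.random.randn(64)

# slow: touch each row through an index
out_slow = X_hat.copy()
for i in range(out_slow.shape[0]):
    out_slow[i, :] = out_slow[i, :] + b

# fast: one broadcast addition over all rows
out_fast = X_hat + b

assert np.allclose(out_slow, out_fast)
```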

Contributor Author

Everything related to indices is for the convolution part, not the linear part. It is actually possible to do the convolution without any indices, since the formula for a graph convolutional layer is OUT = D^(-1/2) * A_hat * D^(-1/2) * X * W + b (A_hat: adjacency matrix with self-loops, n x n; D: degree matrix, n x n; X: input, n x features; W: weights, f_in x f_out). But as you can see, doing the convolution without indices (normalization and message passing) requires 3 extra matrix multiplications instead. In normal use cases (n >> number of features), these multiplications are even bigger than the linear part (X * W + b), because D and A_hat are larger matrices than X and W.
So, since X * W + b takes 220 seconds (the matrix multiplication alone takes around 110 seconds), doing the normalization and convolution without indices would take far longer than 340 seconds, likely around 600 seconds.

This is also why other well-known libraries (PyTorch, TF) mainly use an edge list with index accesses for the convolution part, or a sparse-matrix datatype (which is essentially also an edge list) in their GCL implementations.
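To illustrate the trade-off (a NumPy sketch with illustrative names, not the DML code): the dense formulation multiplies n x n matrices regardless of sparsity, while the edge-list formulation only touches the m existing edges:

```python
import numpy as np

# Dense aggregation: O(n^2 * f) work, regardless of how sparse the graph is.
def aggregate_dense(A_norm, X_hat):
    return A_norm @ X_hat

# Edge-list aggregation: O(m * f) work, touching only existing edges.
def aggregate_edge_list(edge_index, edge_weight, X_hat):
    out = np.zeros_like(X_hat)
    src, dst = edge_index[:, 0], edge_index[:, 1]
    np.add.at(out, dst, edge_weight[:, None] * X_hat[src])  # scatter-add per edge
    return out

# toy graph: edges 0->1 and 1->2 with unit weights, 2 features per node
edge_index = np.array([[0, 1], [1, 2]])
edge_weight = np.ones(2)
X_hat = np.arange(6, dtype=float).reshape(3, 2)

# equivalent dense adjacency for the same graph
A_norm = np.zeros((3, 3))
A_norm[edge_index[:, 1], edge_index[:, 0]] = edge_weight
assert np.allclose(aggregate_dense(A_norm, X_hat),
                   aggregate_edge_list(edge_index, edge_weight, X_hat))
```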

Contributor Author

My -Xms/-Xmx arguments were 16g, I think.

Also, I ran the tests multiple times. They were always very consistent, varying only by -3 to +3 seconds.

Contributor

What I mean is not from the outside, but inside your script.
Or maybe that is what you do already?

Contributor Author

Sorry, the stress-test file itself would probably help: the stress test is a big forward pass over 3 layers, repeated 100 times.
So it measures both from the outside and the inside.

Contributor

Ah okay, when will you be in the office next time? Then maybe we can talk about it.

Contributor Author

Sure! I will be in the office tomorrow and on Friday.

dX = dOut_agg_rev %*% t(W)

# calculate gradient w.r.t. W (Formula: X^T * A_hat^T * dOut)
dW = t(X) %*% dOut_agg_rev
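The two gradient lines above can be sanity-checked numerically; here is a NumPy sketch (illustrative names): for Y = A_hat @ X @ W with upstream gradient dOut, taking dOut_agg_rev = A_hat^T @ dOut gives dX = dOut_agg_rev @ W^T and dW = X^T @ dOut_agg_rev, which we compare against a finite-difference approximation:

```python
import numpy as np

# Numerical check (illustrative) of the analytic gradients in the
# snippet above, for Y = A_hat @ X @ W and upstream gradient dOut.
rng = np.random.default_rng(0)
n, f_in, f_out = 4, 3, 2
A_hat = rng.random((n, n))
X = rng.random((n, f_in))
W = rng.random((f_in, f_out))
dOut = rng.random((n, f_out))

dOut_agg_rev = A_hat.T @ dOut
dW = X.T @ dOut_agg_rev                 # analytic gradient w.r.t. W

# finite-difference check on one entry of W
loss = lambda W_: np.sum((A_hat @ X @ W_) * dOut)
eps = 1e-6
W_pert = W.copy()
W_pert[0, 0] += eps
num = (loss(W_pert) - loss(W)) / eps
print(abs(num - dW[0, 0]) < 1e-4)  # True
```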
Contributor

Or is it the gradient that takes the time?

Contributor Author

In the stress test, I only used the forward pass.

@j143 j143 added this to the systemds-3.2.0 milestone Dec 17, 2023