Memory error with large network - centralities #52

Closed
alberto-bracci opened this issue Nov 11, 2019 · 15 comments

@alberto-bracci

Hi,

I am just starting with Teneto, installed with pip on Anaconda (Windows).
I am trying to load the temporal network from here.

I put a line "i,j,t" at the beginning of the file, loaded it with pandas as a dataframe, and used teneto.TemporalNetwork(from_df=dataframe), but I receive a memory error.
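Roughly what I am running (a tidied sketch; the filename is just a placeholder, and the edge-list variant is the call that produced the traceback below):

```python
import pandas as pd
import teneto

# the edge list file with the added "i,j,t" header line
D = pd.read_csv('edgelist.csv')  # placeholder filename

# both of these attempts end in the MemoryError shown below
tnet1 = teneto.TemporalNetwork(from_df=D)
tnet2 = teneto.TemporalNetwork(from_edgelist=[list(d) for d in D.values])
```

The full traceback: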

MemoryError Traceback (most recent call last)
in
----> 1 tnet2 = tnet.TemporalNetwork(from_edgelist=[list(d) for d in D.values])

C:\ProgramData\Anaconda3\lib\site-packages\teneto\classes\network.py in init(self, N, T, nettype, from_df, from_array, from_dict, from_edgelist, timetype, diagonal, timeunit, desc, starttime, nodelabels, timelabels, hdf5, hdf5path, forcesparse)
131 self.network_from_df(from_df)
132 if from_edgelist is not None:
--> 133 self.network_from_edgelist(from_edgelist)
134 elif from_array is not None:
135 self.network_from_array(from_array, forcesparse=forcesparse)

C:\ProgramData\Anaconda3\lib\site-packages\teneto\classes\network.py in network_from_edgelist(self, edgelist)
257 colnames = ['i', 'j', 't']
258 self.network = pd.DataFrame(edgelist, columns=colnames)
--> 259 self._update_network()
260
261 def network_from_dict(self, contact):

C:\ProgramData\Anaconda3\lib\site-packages\teneto\classes\network.py in _update_network(self)
220 """
221 self._calc_netshape()
--> 222 self._set_nettype()
223 if self.nettype:
224 if self.nettype[1] == 'u':

C:\ProgramData\Anaconda3\lib\site-packages\teneto\classes\network.py in _set_nettype(self)
172 self.nettype = 'xu'
173 G1 = teneto.utils.df_to_array(
--> 174 self.network, self.netshape, self.nettype)
175 self.nettype = 'xd'
176 G2 = teneto.utils.df_to_array(

C:\ProgramData\Anaconda3\lib\site-packages\teneto\utils\utils.py in df_to_array(df, netshape, nettype)
749 if len(df) > 0:
750 idx = np.array(list(map(list, df.values)))
--> 751 G = np.zeros([netshape[0], netshape[0], netshape[1]])
752 if idx.shape[1] == 3:
753 if nettype[-1] == 'u':

MemoryError:

Am I doing something wrong, or can this representation not handle large networks?

@wiheto
Owner

wiheto commented Nov 11, 2019 via email

@alberto-bracci
Author

It seems that argument is not present, at least in my version. I tried 'forcesparse=True' and also 'hdf5=True' without success. What's more, the error is the same, which I wouldn't expect in the latter case, as it should use a different format.
The network has around 900 nodes and 33720 time-stamped links.

@wiheto
Owner

wiheto commented Nov 11, 2019

Sorry, I meant forcesparse=True (I was sitting on a train and didn't double-check the argument). The HDF5 compatibility was never completed/optimized, as it was also slowing down processing on smaller networks. It is on my todo list to fix all of this in December, when I have time to contribute here instead of to other projects, so bear with me. There may be one or two errors, but we can probably get them sorted quite easily when they arise.

But this problem seems to be the function trying to figure out what type of network your input is: to determine this, it converts the dataframe into a dense numpy array (not optimal). If you add the argument nettype='bu' (or 'bd', 'wu', 'wd'), depending on whether your network is binary/weighted and undirected/directed, that function shouldn't be called. A minimal sketch is below.
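Something like this should skip the type detection entirely (an untested sketch; the filename is just a placeholder):

```python
import pandas as pd
import teneto

# edge list with columns i, j, t
df = pd.read_csv('edgelist.csv')  # placeholder filename

# specifying nettype up front means Teneto never has to densify the
# whole network just to work out whether it is directed/weighted
tnet = teneto.TemporalNetwork(from_df=df, nettype='bd')  # or 'bu'/'wu'/'wd'
```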

That is slightly bigger than most of the networks I usually use (ca 500 nodes and 1000 time points). But the HDF5 representation should work.

@alberto-bracci
Author

This indeed worked! Setting the type also makes forcesparse or HDF5 seemingly unnecessary.
Quick unrelated question (to avoid opening another issue): is it possible to have references for the centrality measures implemented in the library, like the formula or article they refer to?

Thanks for your quick help!
Alberto

@wiheto
Owner

wiheto commented Nov 11, 2019

Which centrality measure in particular are you after?

I generally follow Masuda and Lambiotte's book "A guide to temporal networks" for the maths behind many of the measures. Adding citations to all the docstrings is also on the todo list. Some measures already have quite detailed information in the documentation (e.g. here), but I've not had time to write one for every measure yet.

So if there are any you want me to find, I can add them to the docstrings and provide the references for you here too.

@alberto-bracci
Author

I was mainly interested in the centralities for now, so closeness, betweenness and degree are the ones missing. I am asking because I found different definitions in different papers, and at the moment I am not able to get a copy of the book to look them up myself.
Really appreciate your help here!

@wiheto
Owner

wiheto commented Nov 11, 2019

Alright. I have some writing time assigned later today, so I'll add them then; within 24 hours I'll have the documentation for all three of those. For closeness and betweenness especially, I'll also add to the documentation of shortest temporal paths (as that is where I've seen the most differences in equations).

@wiheto
Owner

wiheto commented Nov 12, 2019

You may want to update from the developer branch (https://github.com/wiheto/teneto/tree/develop), as some argument names are changing in the upcoming 0.5.0, so the documentation isn't fully in line with the functions in 0.4.6.

The more in-depth documentation is here:

https://teneto.readthedocs.io/en/develop/networkmeasures/temporal_closeness_centrality.html#module-teneto.networkmeasures.temporal_closeness_centrality

https://teneto.readthedocs.io/en/develop/networkmeasures/temporal_degree_centrality.html#module-teneto.networkmeasures.temporal_degree_centrality

As with a lot of Teneto's documentation, I write far too quickly to get doc coverage and sometimes lose clarity. Just leave an issue whenever anything is unclear.

Two changes still to make:

The shortest temporal paths function is HDF5-ready, but the calculation of closeness centrality is not. It is an easy fix, but I want to test it tomorrow to make sure it works. Since you will need the shortest temporal paths for both betweenness and closeness centrality, you may as well precompute them first and save the result.

I didn't get round to the betweenness centrality docs; I'll also try to do that tomorrow.

@wiheto
Owner

wiheto commented Nov 13, 2019

https://teneto.readthedocs.io/en/develop/api/teneto.networkmeasures.temporal_betweenness_centrality.html#teneto.networkmeasures.temporal_betweenness_centrality

I've also updated the normalization for 0.5.0 to follow the reference mentioned above; previously it did not divide by sigma_jk. I need to write a test to make sure this is working as expected (today or tomorrow).
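For reference, the general form I have in mind (my notation, a sketch rather than the exact implementation; see the linked docs for the definitive definition):

$$
B(i) = \sum_{t} \sum_{\substack{j \neq k \\ j, k \neq i}} \frac{\sigma^{t}_{jk}(i)}{\sigma^{t}_{jk}}
$$

where $\sigma^{t}_{jk}$ is the number of shortest temporal paths from $j$ to $k$ starting at time $t$, and $\sigma^{t}_{jk}(i)$ is the number of those paths that pass through node $i$ (hence the division by sigma_jk mentioned above).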

Otherwise, can I close this issue now? Seems like the problems are sorted.

@alberto-bracci
Author

Yes, everything should be fine. Just a question: how quick do you expect the shortest path function to be? I tried it with a network of around 90 nodes and 300 links, and after 6 hours it wasn't finished yet (Core i7 laptop).

Also, it is better to first compute the shortest paths and then use them as an argument for closeness and betweenness, right?

@alberto-bracci
Author

Also, there might be another issue with the shortest path function:
With a 'bd' network the behavior is as described above, but the same network loaded as 'bu' returns the following error:

File "", line 1, in
shortest_paths = tnt.networkmeasures.shortest_temporal_path(t)

File "C:\ProgramData\Anaconda3\lib\site-packages\teneto\networkmeasures\shortest_temporal_path.py", line 201, in shortest_temporal_path
network = tnet.get_network_when(ij=list(ij), t=t)

File "C:\ProgramData\Anaconda3\lib\site-packages\teneto\classes\network.py", line 483, in get_network_when
return teneto.utils.get_network_when(self, **kwargs)

File "C:\ProgramData\Anaconda3\lib\site-packages\teneto\utils\utils.py", line 993, in get_network_when
network['j'].isin(ij))), (network['t'].isin(t)))]

TypeError: and_ expected 2 arguments, got 1

alberto-bracci changed the title from "Memory error with large network (?)" to "Memory error with large network - centralities" on Nov 13, 2019
@wiheto
Owner

wiheto commented Nov 14, 2019

> Yes, everything should be fine. Just a question: how quick do you expect the shortest path function to be? I tried it with a network of around 90 nodes and 300 links, and after 6 hours it wasn't finished yet (Core i7 laptop).

When making the objects HDF5-compatible, I compromised on speed. This is the major backbone speed issue that has to be solved, and it is planned for the end of December (the start of #36 is relevant here).

> Also, it is better to first compute the shortest paths and then use them as an argument for closeness and betweenness, right?

Yes, because otherwise you have to calculate the paths twice, and that is the most computationally intensive part.
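Roughly like this (an untested sketch; I'm assuming the precomputed paths are passed via a paths= keyword, so double-check the signatures in the linked docs):

```python
import teneto

# compute the expensive shortest temporal paths once...
paths = teneto.networkmeasures.shortest_temporal_path(tnet)

# ...and reuse them for both centralities.
# NOTE: the keyword name paths= is an assumption here; check the
# function signatures in the documentation linked above.
closeness = teneto.networkmeasures.temporal_closeness_centrality(paths=paths)
betweenness = teneto.networkmeasures.temporal_betweenness_centrality(paths=paths)

# if the result comes back as a pandas DataFrame, paths.to_pickle('paths.pkl')
# is an easy way to save it for later runs (filename is a placeholder)
```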

Regarding the error: interesting. I'm going to open a new issue about that, as it concerns undirected HDF5 network referencing.

@wiheto
Owner

wiheto commented Nov 14, 2019

Also, regarding the speed of shortest_temporal_path: to reduce the possible path space, you could change the value of steps_per_t.

The default value of steps_per_t in shortest_temporal_path is 'all'. This means that, at each time-point, a path can travel across multiple nodes. This is not a reasonable assumption for many temporal networks. If you set this parameter to an integer (e.g. 1, meaning only one edge can be traversed per time-point per path), it will speed up the calculation.
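For instance (untested sketch):

```python
import teneto

# allow at most one edge traversal per time-point per path,
# which shrinks the path space considerably
paths = teneto.networkmeasures.shortest_temporal_path(tnet, steps_per_t=1)
```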

@wiheto
Owner

wiheto commented Nov 14, 2019

Another possible way to speed it up at the moment is to set the i argument and run it in parallel (so for 90 nodes you can run 90 jobs at once, but that requires access to a cluster).
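A rough sketch of what I mean (untested; joblib is just one way to fan the jobs out, and I'm assuming i= takes a single node index):

```python
from joblib import Parallel, delayed
import teneto

n_nodes = tnet.netshape[0]  # netshape is (nodes, time-points)

# one job per source node; i= restricts the calculation to paths
# starting from that node. If the tnet object does not pickle
# (e.g. when HDF5-backed), construct it inside each job instead.
per_node_paths = Parallel(n_jobs=4)(
    delayed(teneto.networkmeasures.shortest_temporal_path)(tnet, i=i)
    for i in range(n_nodes)
)
```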

@wiheto
Owner

wiheto commented Nov 15, 2019

Aside from the computational time, I think all the issues here have been solved, so I'm closing this issue.

wiheto closed this as completed on Nov 15, 2019.