
Error using pickle #20

Open
plison opened this issue Nov 15, 2017 · 10 comments

Comments

@plison

plison commented Nov 15, 2017

I found another bug, related this time to pickling pytricia objects:

import pickle
import pytricia
pyt = pytricia.PyTricia()
pyt["192.168.0.1"] = "foo"
pickle.dump(pyt, open("test.pkl", "wb"))
pyt2 = pickle.load(open("test.pkl", "rb"))
pyt2.keys()

gives a Segmentation fault (core dumped). I'm using Python 3.

@jsommers
Owner

jsommers commented Dec 8, 2017

Thanks for posting this. There's basically no support for pickling right now, but it's a reasonable thing to add.

@Darkheir

Hi, is this something that would be complicated to add?
py-radix offers it, but not in an efficient manner (by returning all the IPs contained in the tree).

@jsommers
Owner

Not really, but I have zero time to do it. Maybe over the summer but not any time soon. Sorry.

@Darkheir

No problem, I understand.

@samagra14

I am using a naive way to do this in my project:

import joblib
import pytricia

def save_tree(tree, file_name):
    # Flatten the trie into a plain dict of prefix -> value and dump that.
    tree_saver = {}
    for prefix in tree:
        tree_saver[prefix] = tree[prefix]
    joblib.dump(tree_saver, file_name)

def load_tree(file_name):
    # Rebuild the trie by re-inserting every saved prefix.
    tree_saver = joblib.load(file_name)
    tree = pytricia.PyTricia()
    for prefix in tree_saver.keys():
        tree[prefix] = tree_saver[prefix]
    return tree
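
For reference, the same flattening idea can be hooked into pickle itself by registering a reducer with copyreg, so that pickle.dump/pickle.load work directly on PyTricia objects. This is only a sketch against the dict-like PyTricia interface used above; it assumes the default IPv4/32-bit constructor and that the stored values are themselves picklable:

import copyreg
import pickle
import pytricia

def _rebuild_pytricia(entries):
    # Module-level so pickle can locate it by name when loading.
    pyt = pytricia.PyTricia()
    for prefix, value in entries:
        pyt[prefix] = value
    return pyt

def _reduce_pytricia(pyt):
    # Flatten the trie into (prefix, value) pairs for the rebuild function.
    return _rebuild_pytricia, ([(prefix, pyt[prefix]) for prefix in pyt],)

copyreg.pickle(pytricia.PyTricia, _reduce_pytricia)

pyt = pytricia.PyTricia()
pyt["192.168.0.0/24"] = "foo"
pyt2 = pickle.loads(pickle.dumps(pyt))
print(pyt2.keys())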

@sk2

sk2 commented Jul 29, 2022

Hello, is this the only way to serialise an instantiated tree? I have a tree that takes about 30 seconds to build, so I would be keen to be able to serialise it for future use.
Would this likely offer any benefit, i.e. is one-by-one insertion more computationally intensive than loading a tree from a file?

Or is there a best practice to bulk import?
I saw the route views in the GitHub source but didn’t see any examples of bulk loading them.

Thanks

@jsommers
Owner

Yes, for now it is. The routeviews file is used by testload.py but it just loads in an iterative fashion, not, as you put it, by "bulk" loading. I recognize that fast loading (of whatever kind) is a feature desired by several folks.

@sk2

sk2 commented Jul 30, 2022

Thanks for the reply, and glad to see you're still active on the project.
Pytricia is still by far the cleanest implementation that I've seen, so I'd be keen to try and help out with adding the bulk/fast loading.

The dataset I am currently using has about 9 million entries; it takes about 30 seconds to build the trie, but only a few seconds to run a set of 2 million IPs against it.
So this suggests that tree construction is the expensive operation, and that memory isn't an issue even for a large tree.

Do you have any pointers as to where you'd start to implement such a solution?
I'm inserting the data from Pandas, so I can vectorise and sort the data before inserting. Would there be an ordering (e.g. largest prefixes first) that would be less expensive to build?

Alternatively, being able to serialise it to a binary file and then load this for future use would also work. This wouldn't need to be pickle necessarily.
In my case (and, I think, in many large-dataset use cases) each trie node simply contains a single integer, which is a lookup key into a data dictionary (or Pandas DataFrame) that stores the actual data.
So dumping to a binary format would only need to write each _patricia_node_t and an integer value (avoiding the need to handle arbitrary Python objects in the pickling).

Pytricia is amazing for very fast lookups, and works very nicely with the apply function in Pandas across very large datasets.
Being able to quickly build the trie, and/or dump/save would be a very beneficial tool for network applications in the PyData space.
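
As a concrete illustration of that integer-index pattern (a sketch only; the column names and data are made up, and it relies on PyTricia's dict-like get() returning None on a miss):

import pandas as pd
import pytricia

# Hypothetical prefix table; the trie stores only the DataFrame row index.
prefixes = pd.DataFrame({
    "prefix": ["10.0.0.0/8", "192.168.0.0/16", "192.168.1.0/24"],
    "asn": [64512, 64513, 64514],
})

pyt = pytricia.PyTricia()
for idx, prefix in prefixes["prefix"].items():
    pyt[prefix] = idx

# Longest-prefix match a column of IPs; unmatched addresses become NaN.
ips = pd.DataFrame({"ip": ["192.168.1.42", "10.1.2.3", "8.8.8.8"]})
row_for_ip = ips["ip"].apply(pyt.get)
ips["asn"] = row_for_ip.map(prefixes["asn"])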

Another advantage of pickle, or of some other serialise/deserialise method, is that the trie construction could be run across multiple cores, with membership of each sub-trie checked individually.

Ideally the sub-tries (e.g. one per /8) could then be re-parented, but this would be a little more involved.

I'd like to be able to help implement these - but some pointers would be very helpful.

thanks!

@jsommers
Owner

I'd have to spend some time thinking about different ways to serialize and the tradeoffs. Not sure I'd have any meaningful pointers at this time, but I appreciate you asking! I would likely be able to spend some time on this starting in later October.

@sk2

sk2 commented Jul 31, 2022

Thanks, that timeframe sounds good.

I've been playing around with a pure-Python radix tree to look at possible performance enhancements.

There are a few potential avenues to improve bulk performance. The time-consuming part is building the tree, which is CPU-bound and, in Python, constrained by the GIL.
Being able to serialise the tree means it can also be deserialised, and loading a tree from a dumped data structure is very fast, even in pure Python.

This would allow the work to be divided amongst cores using multiprocessing, e.g. a tree for each /8 (or similar, depending on the prefix distribution), and then a simple lookup table to map an address to the appropriate trie, e.g.:
if octet1 == 192 -> use tree 192
if octet1 == 10 -> use tree 10
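
A minimal sketch of that lookup-table dispatch (hypothetical class and method names, IPv4 only, and it assumes every prefix is /8 or longer so that no prefix spans two first octets):

import pytricia

class ShardedTrie:
    """One PyTricia per first octet, plus a dict to dispatch lookups."""

    def __init__(self):
        self._shards = {}  # first octet -> PyTricia

    def insert(self, prefix, value):
        # Assumes the prefix is /8 or longer, so it belongs to exactly one shard.
        octet1 = int(prefix.split(".", 1)[0])
        shard = self._shards.setdefault(octet1, pytricia.PyTricia())
        shard[prefix] = value

    def get(self, ip, default=None):
        shard = self._shards.get(int(ip.split(".", 1)[0]))
        return shard.get(ip, default) if shard is not None else default

Each shard could then be built in its own worker process once the tries themselves can be serialised.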

The next step from this would be to parent these sub-tries to the root node (and its immediate children).
The idea would be to build the large sub-tries in parallel, and then do the relatively simple operation of re-pointing them to the small number of root nodes.
This would remove the hacky lookup-table approach and replace it with nodes under the root node (conceptually a very similar thing, but all integrated into the trie itself).

To be able to do this, the user would need to sort the data, but this is fast using something like Pandas (which itself allows workloads to be parallelised using tools like Dask).
I don't think sorting the data is onerous on the user given the tooling in the PyData ecosystem.

For very large trees, I think it is reasonable for the user to treat the trie as an index lookup. This simplifies the nodes to storing only an integer key, rather than handling arbitrary Python data structures.
I think it is fair to place this as a requirement on users who wish to use pickling/dumping; those with smaller datasets can simply construct the trie on the fly as they do today.
This restriction would allow the dump/restore to be done with a textbook tree-walking algorithm in C, rather than needing to accommodate the full Python pickling protocol.

This would give very fast tree-building performance, relatively easily, even for very large datasets.

If the user provides the data sorted, then a further performance optimisation would be to cache the parent node. In my pure-Python experiment, I was just caching at the classic octet boundaries: I kept a lookup dictionary of the /8, /16 and /24 prefixes, holding pointers to those specific nodes. This removed the many redundant tree traversals that occur when the prefixes being inserted share a close common ancestor.

These options would significantly increase performance for very large datasets, and based on my pure-Python experiments, shouldn't be too much new code to write (I'd do it myself if I were more familiar with C).

Given that building the tree is time-consuming, while saving/loading/searching are fast, being able to store to a file would allow multiple tries (such as one per router) to be loaded and unloaded as necessary, paying the CPU cost of building each tree only once.

These improvements would really open up the PyData ecosystem to powerful network analytics.

This has strayed way beyond the initial Pickle discussion - happy to take it to a new thread if that's preferable.

Thanks!
Simon

Edit: actually the bulk import / bulk cross-referencing speedup could be simplified:

  1. keep the last inserted/looked-up address as an integer/bit string
  2. keep an array of the path to this node (pointers, max length 32/128)
  3. on a new insert/get, do a bitwise AND of the new address against the last address from step 1
  4. if inserting, then also AND this with the netmask for the prefix length
  5. find the last 1 in the AND result; this is the index of the closest common ancestor
  6. use this index to retrieve the ancestor pointer from the array in step 2
  7. get the ancestor from the pointer
  8. proceed as normal with the lookup process
  9. update the new path into the array from step 2 (i.e. the indices that are 0 in the AND result from step 3)

This would save repeated lookup steps when there is clustering, which will happen when the source (build) or lookup datasets are large and sorted. Sorting is cheap (and can be parallelised), so this would exploit the inherent locality of very large datasets.
It would only require two or three AND operations, storing the previous IP (32/128 bits), and the array of 32/128 pointers.
IPv6 would be sparser, so perhaps some heuristics could be implemented to reduce the storage space (e.g. only store a subset of the path at common branching points).
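
A minimal sketch of the common-ancestor computation (steps 3-6 above), written with XOR rather than AND, since the most significant set bit of the XOR is exactly the first bit where the two addresses differ; the function name is made up:

import ipaddress

def common_prefix_len(prev_ip, new_ip, max_bits=32):
    # Number of leading bits shared by the two addresses. Everything above
    # the most significant set bit of (prev ^ new) is common to both.
    prev = int(ipaddress.ip_address(prev_ip))
    new = int(ipaddress.ip_address(new_ip))
    diff = prev ^ new
    return max_bits if diff == 0 else max_bits - diff.bit_length()

# e.g. common_prefix_len("192.168.0.1", "192.168.1.1") == 23, so the cached
# path array entry at index 23 is the closest ancestor to resume the walk from.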
