Refactor Snapshot and ArchiveResult to use ulid and typeid instead of uuidv4 #1430
Summary
Migrate to Snapshot.ulid and ArchiveResult.ulid (instead of using Snapshot.timestamp as unique key).
Related issues
Fixes: #74
Changes these areas
This is the new folder layout I'm migrating to after the switch from timestamps to ulids:
There is some nesting to avoid running into trouble with directories having 100k+ files and taking a long time to list.
For maximum practicality I went with [objecttype] / [date] / [domain] / [ulid] as the nesting order. This satisfies a bunch of the most common use cases:
For maximum fun, the ULIDs also embed information about the object type, timestamp, url, subtype, and some randomness (in case you happen to snapshot the same domain with the same extractor a few thousand times in the same millisecond).
This has the very cool property that all of the ArchiveResults under a certain snapshot share the same prefix, e.g.:
This means that all the data in the system that uses this ulid format will sort lexicographically together properly, and in the same order/grouping as the nested archive/ folder structure provides. Even if all the data were thrown together in one big folder it would maintain all the nice ordering properties of objtype > date > domain > subtype > uuid.
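To make the prefix and sorting properties concrete, here is a rough Python sketch. It is not the encoding this PR implements: the field widths, the hex alphabet (real ULIDs use Crockford base32), the `res` type prefix, the `YYYYMMDD` date folder, and the helper names `make_sortable_id` and `archive_path` are all assumptions made for illustration.

```python
import hashlib
import secrets
from datetime import datetime, timezone
from os.path import commonprefix
from pathlib import PurePosixPath


def short_hash(text: str, length: int) -> str:
    """Return the first `length` hex characters of the SHA-256 of `text`."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:length]


def make_sortable_id(objtype: str, ts: datetime, domain: str, subtype: str) -> str:
    """Hypothetical ID layout: objtype, timestamp, domain, subtype, randomness.

    Every field has a fixed width, so plain string comparison orders IDs
    by objtype first, then time, then domain hash, then subtype hash.
    """
    ts_part = f"{int(ts.timestamp() * 1000):012x}"  # milliseconds, 48 bits
    domain_part = short_hash(domain, 6)             # assumed width
    subtype_part = short_hash(subtype, 4)           # assumed width
    rand_part = secrets.token_hex(2)                # tie-breaker, 4 hex chars
    return f"{objtype}_{ts_part}{domain_part}{subtype_part}{rand_part}"


def archive_path(objtype: str, ts: datetime, domain: str, obj_id: str) -> PurePosixPath:
    """Nest as archive/[objecttype]/[date]/[domain]/[id]; date format is assumed."""
    return PurePosixPath("archive") / objtype / ts.strftime("%Y%m%d") / domain / obj_id


if __name__ == "__main__":
    ts = datetime(2024, 5, 1, 12, 0, 0, tzinfo=timezone.utc)
    # Three results for the same snapshot (same time and domain), one per extractor.
    results = [
        make_sortable_id("res", ts, "example.com", extractor)
        for extractor in ("wget", "pdf", "screenshot")
    ]
    print("shared prefix:", commonprefix(results))
    for rid in sorted(results):
        print(archive_path("res", ts, "example.com", rid))
```

Because the type, timestamp, and domain fields come first and never change width, every result created for the same snapshot shares at least those leading characters, and sorting the IDs as plain strings reproduces the grouping of the nested folders. Hashing the domain groups identical domains together but does not order them alphabetically; that trade-off is specific to this sketch.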