History entry timestamps aren't accurate #119

Closed
kergoth opened this issue Dec 3, 2018 · 3 comments
Labels
size: hard status: wip Work is in-progress / has already been partially completed type: bug report

Comments

@kergoth

kergoth commented Dec 3, 2018

Firefox uses PRTime and Chrome uses WebKit timestamps, and neither matches bookmark-archiver's timestamp expectations as is. Firefox's timestamps need to be multiplied by 10; otherwise this year's history entries show up as 1974. Chrome's timestamps are in microseconds since 1601. As a workaround, use (last_visit_time-11644446702000000)*10 rather than last_visit_time for Chrome, and last_visit_date*10 rather than last_visit_date for Firefox.

I'm also testing the addition of Safari history export, but its dates need more massaging than the other two: they're Mac Absolute Time, in <seconds from 2001>.<microseconds> form. Just multiplying to eliminate the decimal doesn't work, because the microseconds lack leading zero padding.
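
To make the three formats concrete, here is a minimal sketch that normalizes each one to plain Unix seconds, rather than to whatever scale bookmark-archiver's parser expects (so the *10 factor from the workaround is deliberately left out). The function names are hypothetical and not part of bookmark-archiver. It assumes PRTime is microseconds since 1970, WebKit time is microseconds since 1601, and the Safari value is an ordinary decimal number of seconds since 2001. It uses the standard 1601-to-1970 offset of 11644473600 seconds, which is not the same constant as in the workaround above.

```python
from datetime import datetime, timezone
from decimal import Decimal

# Seconds between 1601-01-01 and 1970-01-01 (WebKit/Chrome epoch offset).
WEBKIT_EPOCH_OFFSET_S = 11_644_473_600
# Seconds between 1970-01-01 and 2001-01-01 (Mac Absolute Time offset).
MAC_EPOCH_OFFSET_S = 978_307_200


def firefox_to_unix(last_visit_date: int) -> float:
    """Assumes PRTime: microseconds since the Unix epoch."""
    return last_visit_date / 1_000_000


def chrome_to_unix(last_visit_time: int) -> float:
    """Assumes WebKit time: microseconds since 1601-01-01 UTC."""
    return last_visit_time / 1_000_000 - WEBKIT_EPOCH_OFFSET_S


def safari_to_unix(visit_time: str) -> float:
    """Assumes a decimal string of seconds since 2001-01-01 UTC.

    Parsing the whole value as a decimal number sidesteps the padding
    problem that appears when the decimal point is simply removed.
    """
    return float(Decimal(visit_time) + MAC_EPOCH_OFFSET_S)


def to_utc(unix_seconds: float) -> datetime:
    """Convert Unix seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(unix_seconds, tz=timezone.utc)


if __name__ == "__main__":
    # Hypothetical sample values, chosen to land on the same instant.
    print(to_utc(firefox_to_unix(1543795200000000)))
    print(to_utc(chrome_to_unix(13188268800000000)))
    print(to_utc(safari_to_unix("565488000.5")))
```

Whether the Safari export really is a plain decimal is an assumption here; if the fractional part turns out to be an unpadded integer microsecond count, the two halves would need to be split on the dot and combined separately.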

For reference, see:

@pirate
Member

pirate commented Dec 4, 2018

Thanks for pointing this out.

I think timestamps are fundamentally flawed as a unique identifier. The new design I'm working on makes them entirely optional and uses a sha256 of the URL instead, but it's going to be hard to change the folder layout of the archive to hashes when everyone's existing archives are timestamp-based.
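
As a rough illustration of the hash-based identifier idea, here is a minimal sketch. The function name url_id is hypothetical, and hashing the raw UTF-8 URL with no normalization is an assumption, not necessarily what the new design does.

```python
import hashlib


def url_id(url: str) -> str:
    """Return a hex sha256 digest of the URL, usable as a stable identifier.

    Assumption: the URL is hashed exactly as given, so two spellings of
    the same page (for example with and without a trailing slash) get
    different identifiers.
    """
    return hashlib.sha256(url.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    print(url_id("https://example.com"))
```

Unlike a timestamp, the digest depends only on the URL, so importing the same link from two browsers with different time formats would yield the same identifier.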

Related to: #74

@pirate pirate added the status: wip Work is in-progress / has already been partially completed label Dec 7, 2018
@pirate
Member

pirate commented Mar 30, 2019

@kergoth a quick update: v0.3.0 adds some improvements to the timestamp parsing, but it's still not perfect.

It doesn't yet handle Firefox's timestamps being off by 10x, and Chrome's timestamps aren't fixed from 1601 yet either, but it's a start:

https://github.com/pirate/ArchiveBox/blob/dev/archivebox/util.py#L369

@pirate
Member

pirate commented Jul 24, 2020

I think the latest django branch gets us as close as we're going to get without implementing custom offset parsing for different sources.

git checkout django
git pull
docker build . -t archivebox
docker run -v $PWD/output:/data archivebox init
docker run -v $PWD/output:/data archivebox add 'https://example.com'
docker run -v $PWD/output:/data archivebox remove --delete 'https://example.com'
docker run -v $PWD/output:/data archivebox update

Comment back here if you're still having trouble with timestamps being wildly off and I can reopen the ticket.

@pirate pirate closed this as completed Jul 24, 2020

2 participants