Add a parser for a list of links (see issue #39) #59
Conversation
- Add a `basic_link_info` method in parse.py to avoid code duplication.
- Add a fetch method to get missing titles.
- Fix some HTML syntax errors in the templates.
- Simplify `html_appended_url` in utils.py using urlsplit.
- Update the README accordingly; remove some repetition and make minor style improvements.
- Add instructions to install requests (which was already used in utils.py); add requirements.txt.
- Fix a bug where a KeyError would be thrown when generating the templates if screenshots or PDFs are disabled.
- Handle the case where FETCH_WGET_REQUISITES is disabled: wget then stores its output differently, which was previously unhandled and caused bugs.
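The simplified `html_appended_url` itself isn't shown in this thread. Below is a minimal sketch of what a urlsplit-based version might look like; the exact rules for where wget stores its output are assumptions here, not the PR's actual code:

```python
from urllib.parse import urlsplit

def html_appended_url(url):
    """Map a URL to the on-disk path wget typically uses for saved output.

    Hypothetical sketch: URLs that already end in .html/.htm map directly
    to a file, and everything else is treated as a directory that wget
    fills with an index.html. The real PR code may differ.
    """
    split = urlsplit(url)
    path = split.path or '/'
    if path.endswith('.html') or path.endswith('.htm'):
        return split.netloc + path
    # Treat everything else as a directory containing index.html
    return split.netloc + path.rstrip('/') + '/index.html'
```
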
This is an awesome feature. Does this project still hold your interest, and would you find time to fix the merge conflicts?
@issmirnov unfortunately this issue is not as easy as it looks; it's dangerous to merge before #74 is fully resolved. The problem right now is that links are uniquified based on their timestamp, and if links are imported with no timestamps it becomes harder to sort/access/dedupe them without running into situations where two links get archived into the same timestamp folder. I'm moving away from timestamp-based IDs, but that refactor isn't finished yet. Once the new URL-hash-based unique ID system is merged, this PR will be trivial to add. Apologies for the delay.
Ah, interesting. Yes, this makes sense. I just created a small
What ID scheme did you decide to use? Domains + page path hash, or something else?
Full URL, actually, including the scheme; it's what the Internet Archive uses, and I'd rather archive two similar URLs separately than merge two different pages into one by accident. I'll probably drop trailing slashes, but http and https would be different links. Most sites redirect to https now anyway, so I doubt it's going to be an issue.
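A deterministic full-URL id along these lines could be sketched as follows. The thread only specifies hashing the full URL with trailing slashes dropped; md5 is borrowed from a later comment about a different tool, so the hash choice and function name here are illustrative assumptions:

```python
from hashlib import md5

def link_id(url):
    """Derive a stable unique id from the full URL.

    Assumptions for illustration: trailing slashes are dropped (as
    suggested above), but the scheme is kept, so the http:// and
    https:// versions of a page get distinct ids.
    """
    normalized = url.rstrip('/')
    return md5(normalized.encode('utf-8')).hexdigest()
```

Because the id depends only on the URL, re-importing the same link always maps to the same archive folder, which is the deterministic-prefix property mentioned below.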
I recommend using @nodiscc's script to convert a plain-text list into a valid Netscape HTML format for now.
Awesome, that also makes my personal use case way simpler. I'm planning to archive all links on my wiki, so having a deterministic prefix is ideal. If there are any subtasks you can farm out, I'd be happy to take a stab at them - always happy to help good projects! :)
That would be awesome, thanks @issmirnov!

This is a good first ticket to get familiar with the codebase: #38. It involves adding a couple of new options in

This is another good medium-sized task: #49. It involves adding a new archive method with some corresponding config options.

If you're more ambitious, this task is bigger: #48, but it's already 90% done; the code is just commented out and a few last issues need to be tested and fixed.
I'm going to close this because I'm removing timestamps as the unique key and using URL hashes instead in the new layout, as mentioned here:
Once unique ordering is established separately from timestamps, parsing links from a plain-text list will be trivial, and I'll make sure I push it with the new layout changes.
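A plain-text list parser of the kind described above really is trivial once IDs no longer depend on timestamps. A hedged sketch, with a hypothetical function name and the assumption that each non-empty, non-comment line is a URL:

```python
def parse_plain_text(path):
    """Yield link dicts from a plain-text file, one URL per line.

    Illustrative sketch: blank lines and lines starting with '#'
    are skipped; titles are left unset for a later fetch step.
    """
    with open(path) as f:
        for line in f:
            url = line.strip()
            if url and not url.startswith('#'):
                yield {'url': url, 'title': None}
```
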
Awesome, thanks! I guess it'll be in the next release? |
Using URL hashes is very nice, and I drew the same conclusion when working on my own archiving tool (now unmaintained). I went for md5. Good luck with the next release!
See #58 and the comments therein. Thanks, and don't hesitate to ask if you have any questions about the changes.