
Add a parser for a list of links (see issue #39) #59

Closed · wants to merge 1 commit

Conversation

frinkelpi

See #58 and the comments therein. Thanks, and don't hesitate to ask if you have any questions about the changes.

- Add a `basic_link_info` method in parse.py to avoid code duplication.
- Add a fetch method to retrieve missing titles.
- Fix some HTML syntax errors in the templates.
- Simplify `html_appended_url` in utils.py using urlsplit (see the sketch after this list).
- Update the README accordingly; remove some repetition and make minor style improvements.
- Add instructions for installing requests (which was already used in utils.py). Add a requirements.txt.
- Fix a bug where a KeyError was thrown when generating the templates if screenshots or PDFs are disabled.
- If FETCH_WGET_REQUISITES is disabled, wget stores its output differently; this case was not handled and caused bugs. This has been corrected.
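
For context, a rough sketch of what a urlsplit-based `html_appended_url` might look like; the naming rules assumed here (appending `.html`, falling back to `index.html`) are illustrative and not necessarily exactly what this PR implements:

```python
from urllib.parse import urlsplit

def html_appended_url(url):
    """Guess the local file that wget will write a page to.

    Illustrative only: assumes wget --adjust-extension appends .html to
    paths that don't already end in .html, and that an empty path maps
    to index.html.
    """
    split = urlsplit(url)
    path = split.path.strip('/') or 'index.html'
    if not path.endswith('.html'):
        path += '.html'
    return '{}/{}'.format(split.netloc, path)
```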
@issmirnov
Contributor

This is an awesome feature. Does this project still hold your interest, and would you have time to fix the merge conflicts?

@pirate
Member

pirate commented Jul 6, 2018

@issmirnov unfortunately this isn't as easy as it looks; it's dangerous to merge before #74 is fully resolved.

The problem right now is that links are uniquified by their timestamp; if links are imported with no timestamps, it's harder to sort/access/dedupe them without running into situations where two links get archived into the same timestamp folder.

I'm moving away from timestamp-based IDs, but that refactor isn't finished yet. Once the new URL-hash-based unique ID system is merged, this PR will be trivial to add.

Apologies for the delay.

@issmirnov
Contributor

Ah, interesting. Yes, this makes sense. I just created a small bookmarks.html with fake data and ran into this issue.

What ID scheme did you decide to use? Domains + page path hash, or something else?

@pirate
Member

pirate commented Jul 6, 2018

The full URL, actually, including the scheme. It's what the Internet Archive uses, and I'd rather archive two similar URLs separately than accidentally merge two different pages into one.

I'll probably drop trailing slashes, but http and https would be different links. Most sites redirect to https now anyway, so I doubt it will be an issue.
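
A minimal sketch of what a URL-hash-based ID along those lines could look like; the normalization (stripping only a trailing slash) follows the description above, but the hash algorithm and digest length are placeholders, not the project's final choice:

```python
import hashlib

def link_id(url):
    """Illustrative unique id keyed on the full url, scheme included.

    Placeholder choices: only a trailing slash is stripped, and sha256
    truncated to 16 hex chars stands in for whatever hash the project
    actually adopts.
    """
    normalized = url.rstrip('/')
    return hashlib.sha256(normalized.encode('utf-8')).hexdigest()[:16]

# http and https versions of the same page get different ids:
# link_id('http://example.com/page') != link_id('https://example.com/page')
# while a trailing slash makes no difference:
# link_id('https://example.com/page/') == link_id('https://example.com/page')
```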

@pirate
Member

pirate commented Jul 6, 2018

For now, I recommend using @nodiscc's script to convert a plain-text list of links into valid Netscape bookmark HTML.
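
As a rough idea of what such a conversion involves (this is a hypothetical stand-in, not @nodiscc's actual script): read one URL per line and emit a minimal Netscape bookmark file. The title and ADD_DATE values are placeholders:

```python
# Usage: python netscape_from_list.py < urls.txt > bookmarks.html
import sys
import time

print('<!DOCTYPE NETSCAPE-Bookmark-file-1>')
print('<TITLE>Bookmarks</TITLE>')
print('<H1>Bookmarks</H1>')
print('<DL><p>')
now = int(time.time())
for line in sys.stdin:
    url = line.strip()
    if url:
        # use the url itself as the link title since none is available
        print('    <DT><A HREF="{}" ADD_DATE="{}">{}</A>'.format(url, now, url))
print('</DL><p>')
```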

@issmirnov
Contributor

issmirnov commented Jul 6, 2018

Awesome, that also makes my personal use case way simpler. I'm planning to archive all links on my wiki, so having a deterministic prefix is ideal.

If there are any subtasks you can farm out, I'd be glad to take a stab at them; always happy to help good projects! :)

@pirate
Member

pirate commented Jul 6, 2018

That would be awesome, thanks @issmirnov! This is a good first ticket to get familiar with the codebase: #38

It involves adding a couple of new options in config.py and some logic to skip certain URLs when creating the index and archive.
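
A minimal sketch of roughly what that could look like; the option and function names here (`URL_BLACKLIST`, `should_archive`) are made up for illustration and aren't taken from the codebase:

```python
import os
import re

# e.g. URL_BLACKLIST=r'://(localhost|127\.0\.0\.1)'
URL_BLACKLIST = os.getenv('URL_BLACKLIST')

def should_archive(url):
    """Return False for urls that should be skipped when building the
    index and archive."""
    if URL_BLACKLIST and re.search(URL_BLACKLIST, url):
        return False
    return True
```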

This is another good medium-sized task: #49. It involves adding a new archive method with some corresponding config options.

If you're more ambitious, this task is bigger: #48. It's already 90% done; the code is just commented out, and a few remaining issues need to be tested and fixed.

@pirate added the "status: needs followup" (work is stalled awaiting a follow-up from the original issue poster or ArchiveBox maintainers) and "size: hard" labels on Sep 24, 2018
@pirate
Member

pirate commented Dec 7, 2018

I'm going to close this because I'm removing timestamps as the unique key and using URL hashes instead in the new layout, as mentioned here: #120

Once unique ordering is established separately from timestamps, parsing links from a plain-text list will be trivial, and I'll make sure to push it with the new layout changes.

@pirate closed this Dec 7, 2018
@sbrl
Contributor

sbrl commented Dec 7, 2018

Awesome, thanks! I guess it'll be in the next release?

@nodiscc
Contributor

nodiscc commented Dec 7, 2018

Using URL hashes is very nice; I drew the same conclusion when working on my own archiving tool (now unmaintained). I went with md5. Good luck with the next release!
