
Add a parser for a list of links (see issue #39) #59

Closed · wants to merge 1 commit

Conversation

frinkelpi

See #58 and the comments therein. Thanks, and don't hesitate to ask if you have any questions about the changes.

- Add a `basic_link_info` method in parse.py to avoid code duplication.
- Add a fetch method to retrieve missing titles.
- Fix some HTML syntax errors in the templates.
- Simplify `html_appended_url` in utils.py using urlsplit (see the sketch after this list).
- Update the README accordingly; remove some repetition and make minor style improvements.
- Add instructions for installing requests (which was already used in utils.py). Add a requirements.txt.
- Fix a bug where a KeyError was thrown when generating the templates if screenshots or PDFs are disabled.
- If FETCH_WGET_REQUISITES is disabled, wget stores its output differently; this case was not handled and caused bugs. This has been corrected.
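
For context, a rough sketch of what a urlsplit-based `html_appended_url` might look like; the naming rules assumed here (appending `.html`, falling back to `index.html`) are illustrative and not necessarily exactly what this PR implements:

```python
from urllib.parse import urlsplit

def html_appended_url(url):
    """Guess the local file that wget will write a page to.

    Illustrative only: assumes wget --adjust-extension appends .html to
    paths that don't already end in .html, and that an empty path maps
    to index.html.
    """
    split = urlsplit(url)
    path = split.path.strip('/') or 'index.html'
    if not path.endswith('.html'):
        path += '.html'
    return '{}/{}'.format(split.netloc, path)
```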
@issmirnov
Contributor

This is an awesome feature. Does this project still hold your interest, and would you have time to fix the merge conflicts?

@pirate
Member

pirate commented Jul 6, 2018

@issmirnov unfortunately this isn't as easy as it looks; it's dangerous to merge before #74 is fully resolved.

The problem right now is that links are uniquified by their timestamp; if links are imported with no timestamps, it's harder to sort/access/dedupe them without running into situations where two links get archived into the same timestamp folder.

I'm moving away from timestamp-based IDs, but that refactor isn't finished yet. Once the new URL-hash-based unique ID system is merged, this PR will be trivial to add.

Apologies for the delay.

@issmirnov
Contributor

Ah, interesting. Yes, this makes sense. I just created a small bookmarks.html with fake data and ran into this issue.

What ID scheme did you decide to use? Domains + page path hash, or something else?

@pirate
Member

pirate commented Jul 6, 2018

The full URL, actually, including the scheme. It's what the Internet Archive uses, and I'd rather archive two similar URLs separately than accidentally merge two different pages into one.

I'll probably drop trailing slashes, but http and https would be different links. Most sites redirect to https now anyway, so I doubt it will be an issue.
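
A minimal sketch of what a URL-hash-based ID along those lines could look like; the normalization (stripping only a trailing slash) follows the description above, but the hash algorithm and digest length are placeholders, not the project's final choice:

```python
import hashlib

def link_id(url):
    """Illustrative unique id keyed on the full url, scheme included.

    Placeholder choices: only a trailing slash is stripped, and sha256
    truncated to 16 hex chars stands in for whatever hash the project
    actually adopts.
    """
    normalized = url.rstrip('/')
    return hashlib.sha256(normalized.encode('utf-8')).hexdigest()[:16]

# http and https versions of the same page get different ids:
# link_id('http://example.com/page') != link_id('https://example.com/page')
# while a trailing slash makes no difference:
# link_id('https://example.com/page/') == link_id('https://example.com/page')
```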

@pirate
Member

pirate commented Jul 6, 2018

For now, I recommend using @nodiscc's script to convert a plain-text list of links into valid Netscape bookmark HTML.
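
As a rough idea of what such a conversion involves (this is a hypothetical stand-in, not @nodiscc's actual script): read one URL per line and emit a minimal Netscape bookmark file. The title and ADD_DATE values are placeholders:

```python
# Usage: python netscape_from_list.py < urls.txt > bookmarks.html
import sys
import time

print('<!DOCTYPE NETSCAPE-Bookmark-file-1>')
print('<TITLE>Bookmarks</TITLE>')
print('<H1>Bookmarks</H1>')
print('<DL><p>')
now = int(time.time())
for line in sys.stdin:
    url = line.strip()
    if url:
        # use the url itself as the link title since none is available
        print('    <DT><A HREF="{}" ADD_DATE="{}">{}</A>'.format(url, now, url))
print('</DL><p>')
```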

@issmirnov
Contributor

issmirnov commented Jul 6, 2018

Awesome, that also makes my personal use case way simpler. I'm planning to archive all links on my wiki, so having a deterministic prefix is ideal.

If there are any subtasks you can farm out, I'd be glad to take a stab at them; always happy to help good projects! :)

@pirate
Member

pirate commented Jul 6, 2018

That would be awesome, thanks @issmirnov! This is a good first ticket to get familiar with the codebase: #38

It involves adding a couple of new options in config.py and some logic to skip certain URLs when creating the index and archive.
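
A minimal sketch of roughly what that could look like; the option and function names here (`URL_BLACKLIST`, `should_archive`) are made up for illustration and aren't taken from the codebase:

```python
import os
import re

# e.g. URL_BLACKLIST=r'://(localhost|127\.0\.0\.1)'
URL_BLACKLIST = os.getenv('URL_BLACKLIST')

def should_archive(url):
    """Return False for urls that should be skipped when building the
    index and archive."""
    if URL_BLACKLIST and re.search(URL_BLACKLIST, url):
        return False
    return True
```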

This is another good medium-sized task: #49. It involves adding a new archive method with some corresponding config options.

If you're more ambitious, this task is bigger: #48. It's already 90% done; the code is just commented out, and a few remaining issues need to be tested and fixed.

@pirate added the "status: needs followup" (work is stalled awaiting a follow-up from the original issue poster or ArchiveBox maintainers) and "size: hard" labels on Sep 24, 2018
@pirate
Member

pirate commented Dec 7, 2018

I'm going to close this because I'm removing timestamps as the unique key and using URL hashes instead in the new layout, as mentioned here: #120

Once unique ordering is established separately from timestamps, parsing links from a plain-text list will be trivial, and I'll make sure to push it with the new layout changes.

@pirate closed this Dec 7, 2018
@sbrl
Contributor

sbrl commented Dec 7, 2018

Awesome, thanks! I guess it'll be in the next release?

@nodiscc
Contributor

nodiscc commented Dec 7, 2018

Using URL hashes is very nice; I drew the same conclusion when working on my own archiving tool (now unmaintained). I went with md5. Good luck with the next release!
