
[Feature Proposal] Download all content #600

Open

RayBB opened this issue Apr 15, 2020 · 1 comment

RayBB commented Apr 15, 2020

I am making a new issue to have a complete ticket to point to that will be easy to find when people wonder why they can't download quizzes, notes, assignments, assessments, handout sheets, knowledge checks, questions, or HTML content in general.

It seems that everyone in all these comments is on the same page: downloading content besides videos/PDFs is very valuable. However, it's tricky to implement. Part of that trickiness seems to be due to the variety of course structures (see #102), though that may have improved since 2014.

These OPEN tickets are related:
#102 -
#253 - ask to download pages with embedded html
#283 - has a lot of discussion of download hierarchy
#337 - using edx-platform-api (seems unlikely)
#429 - includes a patch that used to work
#447 -
#524 -
#550 -
#561 -
#596 -

I'd recommend closing some of those tickets, since anyone reading them will see the link to this new ticket.

Temporary Fix

As a stopgap measure I've written a very small JS script folks can run in the browser to save the content they want: https://github.com/RayBB/edx-scrape

Implementation

To get this set up in edx-dl, it looks like we'll need to determine a few things:

  1. What to download
  2. Directory structure
  3. Usage

What to Download

For a very naive solution, we could start by making something similar to the script linked above. That would mean going to the "progress" page, grabbing all of the links there, and then downloading all of those pages.

That would at least save the text content.
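
To make that concrete, here's a minimal sketch of the naive approach. The function name, flat file naming, and session handling are placeholders, not edx-dl internals; it assumes edx-dl's existing login code supplies an authenticated session:

    import os
    from urllib.parse import urljoin

    from bs4 import BeautifulSoup

    # Sketch only: `session` is assumed to be an already-authenticated
    # requests.Session (edx-dl's login machinery would provide the cookies).
    def download_course_pages(session, progress_url, out_dir="pages"):
        os.makedirs(out_dir, exist_ok=True)
        soup = BeautifulSoup(session.get(progress_url).text, "html.parser")
        # Naively follows every link on the progress page, nav links included.
        for i, a in enumerate(soup.find_all("a", href=True)):
            page_url = urljoin(progress_url, a["href"])
            html = session.get(page_url).text
            with open(os.path.join(out_dir, "page_%03d.html" % i), "w", encoding="utf-8") as f:
                f.write(html)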

The next thing to think about is images or other content embedded in the HTML hosted on edX. Images are the main thing I'm aware of, but there may be other requirements; please let me know if you know of any.

If we follow the simple solution of downloading the HTML as above, we may also want to scrape some of the JS files that load on the page so that people can still view pages offline.
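
For the image part, something along these lines could save each asset and rewrite the page to point at the local copy; again just a sketch with made-up names:

    import os
    from urllib.parse import urljoin, urlparse

    from bs4 import BeautifulSoup

    # Sketch only: same session assumption as above; file-name collisions
    # and non-image assets (CSS, JS) are ignored here.
    def localize_images(session, page_url, html, asset_dir="pages/assets"):
        os.makedirs(asset_dir, exist_ok=True)
        soup = BeautifulSoup(html, "html.parser")
        for img in soup.find_all("img", src=True):
            asset_url = urljoin(page_url, img["src"])
            name = os.path.basename(urlparse(asset_url).path) or "asset"
            with open(os.path.join(asset_dir, name), "wb") as f:
                f.write(session.get(asset_url).content)
            # Relative path so the saved page renders offline.
            img["src"] = "assets/" + name
        return str(soup)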

Ideal solution

The ideal solution would be to parse the HTML of each specific page, grab the text and input box values, and deal with the many other possible formats. However, it seems like that would be a lot more work.
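
A rough illustration of that direction might look like this; the selectors and output shape are guesses, since real edX pages vary a lot, which is exactly why this path is more work:

    from bs4 import BeautifulSoup

    # Sketch only: extracts plain text plus input values instead of
    # saving raw HTML. Real pages would need per-format handling.
    def extract_page_content(html):
        soup = BeautifulSoup(html, "html.parser")
        return {
            "text": soup.get_text(separator="\n", strip=True),
            "inputs": {
                (inp.get("name") or inp.get("id") or "unnamed"): inp.get("value", "")
                for inp in soup.find_all("input")
            },
        }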

Directory structure

I don't have strong feelings here. I think a simple "pages" folder would suffice, but if others have ideas on how to make it better, that would be great.
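
For concreteness, something like this under the existing per-course output directory (purely a suggestion, matching the naive sketch above):

    Course Name/
        ... videos and PDFs as edx-dl lays them out today ...
        pages/
            page_000.html
            page_001.html
            assets/
                some_diagram.png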

Usage

It would be nice if downloading all pages were the default, but we'll probably also want to add a flag to disable it.
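
Assuming edx-dl's argparse-based CLI, the wiring could be as small as this (the --skip-pages name is made up, just to show the shape):

    import argparse

    parser = argparse.ArgumentParser(prog="edx-dl")
    # Pages would be downloaded by default; the flag opts out.
    parser.add_argument(
        "--skip-pages",
        dest="download_pages",
        action="store_false",
        help="do not download HTML course pages",
    )
    args = parser.parse_args()
    if args.download_pages:
        pass  # hook in the page-downloading code here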

Final Thoughts

Thank you so much to the developers of this project, who have already made a really fantastic tool! I know implementing this isn't easy, and I'd be willing to help out where I can, but the first step is deciding how and what needs to be done. I'm trying my best to contribute by putting this together and rounding up the above tickets that can be closed out.

Please let me know how you all would like to proceed.

JohnVeness commented

You might want to take notice of https://github.com/EugeneLoy/edx-archive
