
[Feature Proposal] Download all content #600

Open

RayBB opened this issue Apr 15, 2020 · 1 comment

RayBB commented Apr 15, 2020

I am making a new issue to have a complete ticket to point to that will be easy to find when people wonder why they can't download quizzes, notes, assignments, assessments, handout sheets, knowledge checks, questions, or HTML content in general.

It seems that everyone in all these comments is on the same page: downloading content besides videos/PDFs is very valuable. However, it's tricky to implement. Part of that trickiness seems to be due to the variety of course structures (see #102), though that may have improved since 2014.

These OPEN tickets are related:
#102 -
#253 - ask to download pages with embedded html
#283 - has a lot of discussion of download hierarchy
#337 - using edx-platform-api (seems unlikely)
#429 - includes a patch that used to work
#447 -
#524 -
#550 -
#561 -
#596 -

I'd recommend closing some of those tickets, since anyone reading them will see the link to this new ticket.

Temporary Fix

As a stopgap measure I've written a very small JS script folks can run in the browser to save the content they want: https://github.com/RayBB/edx-scrape

Implementation

To get this set up in edx-dl, it looks like we'll need to determine a few things:

  1. What to download
  2. Directory structure
  3. Usage

What to Download

For a very naive solution, we could start by making something similar to the script linked above. That would mean going to the "progress" page, grabbing all of the links there, and then downloading all of those pages.

That would at least save the text content.
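
To make that concrete, here's a minimal sketch of the naive approach. The function name, flat file naming, and session handling are placeholders, not edx-dl internals; it assumes edx-dl's existing login code supplies an authenticated session:

    import os
    from urllib.parse import urljoin

    from bs4 import BeautifulSoup

    # Sketch only: `session` is assumed to be an already-authenticated
    # requests.Session (edx-dl's login machinery would provide the cookies).
    def download_course_pages(session, progress_url, out_dir="pages"):
        os.makedirs(out_dir, exist_ok=True)
        soup = BeautifulSoup(session.get(progress_url).text, "html.parser")
        # Naively follows every link on the progress page, nav links included.
        for i, a in enumerate(soup.find_all("a", href=True)):
            page_url = urljoin(progress_url, a["href"])
            html = session.get(page_url).text
            with open(os.path.join(out_dir, "page_%03d.html" % i), "w", encoding="utf-8") as f:
                f.write(html)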

The next thing to think about is images or other content embedded in the HTML hosted on edX. Images are the main thing I'm aware of, but there may be other requirements; please let me know if you know of any.

If we follow the simple solution of downloading the HTML as above, we may also want to scrape some of the JS files that load on the page so that people can still view pages offline.
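
For the image part, something along these lines could save each asset and rewrite the page to point at the local copy; again just a sketch with made-up names:

    import os
    from urllib.parse import urljoin, urlparse

    from bs4 import BeautifulSoup

    # Sketch only: same session assumption as above; file-name collisions
    # and non-image assets (CSS, JS) are ignored here.
    def localize_images(session, page_url, html, asset_dir="pages/assets"):
        os.makedirs(asset_dir, exist_ok=True)
        soup = BeautifulSoup(html, "html.parser")
        for img in soup.find_all("img", src=True):
            asset_url = urljoin(page_url, img["src"])
            name = os.path.basename(urlparse(asset_url).path) or "asset"
            with open(os.path.join(asset_dir, name), "wb") as f:
                f.write(session.get(asset_url).content)
            # Relative path so the saved page renders offline.
            img["src"] = "assets/" + name
        return str(soup)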

Ideal solution

The ideal solution would be to parse the HTML of each specific page, grab the text and input box values, and deal with the many other possible formats. However, it seems like that would be a lot more work.
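
A rough illustration of that direction might look like this; the selectors and output shape are guesses, since real edX pages vary a lot, which is exactly why this path is more work:

    from bs4 import BeautifulSoup

    # Sketch only: extracts plain text plus input values instead of
    # saving raw HTML. Real pages would need per-format handling.
    def extract_page_content(html):
        soup = BeautifulSoup(html, "html.parser")
        return {
            "text": soup.get_text(separator="\n", strip=True),
            "inputs": {
                (inp.get("name") or inp.get("id") or "unnamed"): inp.get("value", "")
                for inp in soup.find_all("input")
            },
        }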

Directory structure

I don't have strong feelings here. I think a simple "pages" folder would suffice, but if others have ideas on how to make it better, that would be great.
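
For concreteness, something like this under the existing per-course output directory (purely a suggestion, matching the naive sketch above):

    Course Name/
        ... videos and PDFs as edx-dl lays them out today ...
        pages/
            page_000.html
            page_001.html
            assets/
                some_diagram.png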

Usage

It would be nice if downloading all pages were the default, but we'll probably also want to add a flag to disable it.
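
Assuming edx-dl's argparse-based CLI, the wiring could be as small as this (the --skip-pages name is made up, just to show the shape):

    import argparse

    parser = argparse.ArgumentParser(prog="edx-dl")
    # Pages would be downloaded by default; the flag opts out.
    parser.add_argument(
        "--skip-pages",
        dest="download_pages",
        action="store_false",
        help="do not download HTML course pages",
    )
    args = parser.parse_args()
    if args.download_pages:
        pass  # hook in the page-downloading code here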

Final Thoughts

Thank you so much to the developers of this project, who have already made a really fantastic tool! I know implementing this isn't easy, and I'd be willing to help out where I can, but the first step is deciding how and what needs to be done. I'm trying my best to contribute by putting this together and rounding up the above tickets that can be closed out.

Please let me know how you all would like to proceed.

JohnVeness commented

You might want to take notice of https://github.com/EugeneLoy/edx-archive
