Project Evolution #192

Open
acrois opened this issue Jul 18, 2021 · 4 comments

@acrois

acrois commented Jul 18, 2021

Hello, I would like to propose a plan for the future of this project.

Purpose:

It will give the community of developers and users a common structure for collaboration, and it will improve the quality and speed of archiving the web.

Scary issues out of the way first: we don't want to make any license changes, we want to avoid major (breaking) refactors (requiring end-user changes), and we want to maintain support for existing functionality. We should focus new development efforts on performance, stability, scalability, features, and pain points (a little of everything). We want to encourage adoption of the library and improve the user experience of the application. We also want to establish and grow the project's visibility by giving aspiring archivers (known by some as web preservationists) a tool to use, as well as a clear way to contribute to the massive work that needs to be done in order to preserve the web as we know it.

As a result, I present an outline of the things I would like us to agree on and embark on with this project. I have roughly ordered them by (my perceived) importance to the project (vs. complexity vs. dependency), considering what falls within the aforementioned categories of visibility, stability, performance, and features. This list and its ordering are subject to change as the community prioritization process happens and we take this project to the next level. I would love to have a discussion in the comments if you are curious about any of this. After all, we need consensus to be considered a community. This is just a launch point for us to get started from.

I truly believe that all this is achievable, but it will require help from the community (you). Without any further feedback, I will personally be pitching in a lot on the code, as well as providing consultancy for PRs, but this project is bigger than me, and I expect that we would like to encourage open collaboration on these issues. I am grateful to have the opportunity not only to use this wonderful project, but also to contribute to it. I hope we can evolve it into something greater, together, within the next 3-6 months!

Please review:

Project management
	Discover, define, estimate tasks
		GitHub Projects or other project management/tracking software
	Contribution guidelines
		Application versioning - Semantic versioning
		Git style
			Git flow?
		Code standards
			Linting
		Release creation documentation
	Application Package Roadmap(s)
		Encompassing all features and their releases in a timeline projected into the future few quarters.

Dockerization
	Document usage in README
	Follow best practices (conventions & security)
		Minimal layers
		Parameter pass-through
		User-space isolation
	Docker-compose/Helm/Kubernetes example(s)
	Test-suite harness
		Automated testing
	Daemon to spin up grab-clients?
		** Do not expose the docker socket!! :D
		Start crawl from dashboard
		Maybe custom resource definition for a grab-client operator in Kubernetes?
	Related:
		#93
		#182
		#149
		#176
		#175
	Browsing, searching the warc locally
		Visually browse/search warc (external project)
		browse @ http host (external project)
		Usage example and documentation

Dashboard improvements
	Re-organize front-end code
		It is somewhat bloated: several files are over 1k lines, HTML files mix different types of content, etc.
	Include attempted crawls, connection errors, etc. (re: #93), queued URLs (optionally)
		Queued URL logging would be less costly
			If we were able to put a cap on it, rather than using as much memory as I (you) have available to the browser.
	Ability to manage ignore sets and ignore rules while crawling
		Potentially modify any option
		Related: #3
	No log mode
		Display aggregate crawl stats only
	Authentication provider / access control
	TLS usage example
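
To make the "put a cap" idea for queued URL logging concrete, here is a minimal sketch. It is written in Python purely for illustration (the dashboard front-end itself is browser code), and `BoundedLog` and `max_entries` are hypothetical names, not anything that exists in the project today. The point is only that a fixed-size buffer drops the oldest lines instead of growing without limit.

```python
from collections import deque


class BoundedLog:
    """Keep only the most recent `max_entries` lines (hypothetical helper)."""

    def __init__(self, max_entries):
        # A deque with maxlen silently discards the oldest item when full.
        self._lines = deque(maxlen=max_entries)
        self.dropped = 0  # how many lines have been discarded so far

    def append(self, line):
        if len(self._lines) == self._lines.maxlen:
            self.dropped += 1
        self._lines.append(line)

    def lines(self):
        return list(self._lines)


log = BoundedLog(max_entries=3)
for i in range(5):
    log.append(f"queued: https://example.com/page/{i}")
```

After the loop, the log holds only the last three lines and `log.dropped` is 2, so the display could show something like "2 earlier lines omitted" rather than keeping everything in memory.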

Server improvements
	Generalize log exporting layer
		Prometheus metric format exporter
			Allows usage by common libs and integration of system reporting services
		Globally & per-crawl
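
As a rough illustration of what a Prometheus-format exporter would emit, the sketch below renders a counter in the Prometheus text exposition format using only the standard library. The metric name `grabsite_urls_fetched_total`, the `crawl` label, and the function names are assumptions made up for this example; nothing like this exists in the project yet. Per-crawl values are exposed as labeled series, and a global total could then be derived on the Prometheus side by summing over the label.

```python
def escape_label(value):
    """Escape a label value per the text exposition format."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_counter(name, help_text, samples):
    """Render one counter; `samples` is a list of (labels_dict, value) pairs."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples:
        if labels:
            label_str = ",".join(
                f'{key}="{escape_label(str(val))}"'
                for key, val in sorted(labels.items())
            )
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical per-crawl samples, keyed by a made-up crawl identifier.
text = render_counter(
    "grabsite_urls_fetched_total",
    "URLs fetched so far.",
    [({"crawl": "abc123"}, 1500), ({"crawl": "def456"}, 42)],
)
```

In practice the official `prometheus_client` library would likely be a better fit than hand-rolled formatting; the sketch is only meant to show the shape of the output.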

Client improvements
	Investigate and address random crawl hang issue
		May be able to improve the ratio of connections to request-per-second and archival throughput
		GC, database, expensive function calls, still need to do a perf. analysis on the app while crawling
		Related: #60
	More application-specific ignore set defaults to choose from
		Review to ensure top platforms are up to date
		Better support for forums (vBulletin, IPB, XenForo)
			Related: #178
			The defaults are good but not 100% for some versions of this software
				(vb tab pages, print views, smf sorting, etc.)
				This has made a difference between a 500k crawl and a 5m crawl for me
	Ability to resume crawl
		Using ID
		Must define default behavior for when the directory exists once implemented
		Resync (recrawl/reindex)
			Delta WARC?
		Related:
			#57
			#58
			#185
	Dead URL / Dupe spotter false positives
		Optimize to avoid dead URLs
		Related: #43
	Windows CRLF/LF/CR adaptability
		Related: #48
	Detect when crawls have been limited, back off exponentially until crawl can resume
		Max retries, automatically adjust rate limiting until "sweet spot" is achieved where blocking does not occur
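
The back-off item above could look roughly like the following sketch. Everything here is hypothetical: `backoff_delays`, `fetch_with_backoff`, the parameter names, and the idea of passing in an `is_rate_limited` check (for example, treating an HTTP 429 as "limited") are assumptions for illustration, not the crawler's actual API. It covers only the "max retries" and "back off exponentially" parts; automatically tuning towards a "sweet spot" rate would need additional state.

```python
import time


def backoff_delays(base=1.0, factor=2.0, max_delay=300.0, max_retries=5):
    """Yield up to `max_retries` delays that grow by `factor`, capped at `max_delay`."""
    delay = base
    for _ in range(max_retries):
        yield min(delay, max_delay)
        delay *= factor


def fetch_with_backoff(fetch, is_rate_limited, sleep=time.sleep, **backoff):
    """Call `fetch()`; while the response looks rate-limited, wait and retry.

    Returns the last response, which may still be rate-limited if the
    retries were exhausted.
    """
    response = fetch()
    for delay in backoff_delays(**backoff):
        if not is_rate_limited(response):
            return response
        sleep(delay)
        response = fetch()
    return response


# Simulated server: rate-limits twice, then answers normally.
responses = iter([429, 429, 200])
waited = []
result = fetch_with_backoff(
    fetch=lambda: next(responses),
    is_rate_limited=lambda status: status == 429,
    sleep=waited.append,  # record delays instead of actually sleeping
    base=1,
    factor=2,
    max_delay=60,
    max_retries=5,
)
```

Here `sleep` is injectable so the behaviour can be tested without waiting; a real implementation would probably also add random jitter to the delays.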

Documentation improvements
	Document resumption of a grab (specifically ID ("job_data.ident") field)
		Including more explicit docs on the STOP and START process signal option (which is already supported), with a code example
		Different failure scenarios, also: when to just... start over!
	Some copy or quote to inspire people to become archivists
		Link to the ArchiveTeam wiki
	Systems documentation
		Internal concepts, implementation specific details, etc.
		Directory structure / descriptions of operational data files
		More in-depth documentation on gs-server and the role of it in the application architecture
			Functional dependencies between grab-site and gs-server, running grab-site standalone?
		Document parameters and live config update conventions
			Which parameters can be updated live, limitations, etc.
	Management
		Background Processing / Daemonization
		Scaling
		Logging
		Storage Management
		Resource Monitoring (IO, CPU, RAM, HDD)

Application packaging
	Deployment
		GitHub packages (docker, python)
		Docker Hub (docker)

Project website
	Statically generated website for landing, docs, etc.
	GitHub actions & GitHub sites?
@TheTechRobo
Contributor

I agree with most of what you said, but I don't like this:

no major (breaking) refactors (requiring end-user changes),

IMO while we should avoid them, there will be times when it is necessary to do it.

@acrois
Author

acrois commented Jul 19, 2021

@TheTechRobo Thanks for the feedback

I totally agree with you. You have to sometimes. Sweeping statements/ideas like "don't make breaking changes" are never universally true or achievable. What I meant by that is if we build our next milestones in 1-2 quarters based on just the list, I feel strongly that it should be achievable without breaking things at all. I've revised the wording in the issue a bit to try and reflect that.

In the case of grab-site, I think getting a stable 2.x and then thinking about breaking things in a 3.x release version with all that stuff (and potentially upgraded/rewritten features) would be less volatile. I suppose it all just depends on the nature of change.

On that subject:
Maybe there are some other features that people want/need that might/will break things.
What do you think would be the best way to go about discovering and including those issues in creating a more concrete plan for this project?

@TheTechRobo
Contributor

What do you think would be the best way to go about discovering and including those issues in creating a more concrete plan for this project?

Probably create a GitHub project board or milestone for "Look at later; requires refac" or something.

This was referenced Jul 20, 2021
@TomLucidor

Hope things turn out a bit better in the future

3 participants