Quality Plan

Table of contents:

  • Quality goals
  • Realization
  • Evaluation
  • Setting up unit tests
  • Conventions
  • Lessons learned

Quality goals

| Essential Goals¹ | Why² + Scope |
|------------------|--------------|
| Monetary costs ≯ PC + Internet | Solo-developer non-profit side project; out of scope: distributed scraping with unique IP addresses (due to request throttling) -- we can easily wait for results |
| Unattendability | Scraping can take hours: allow people to leave the computer/process alone or to run the toolbox on a remote computer/server |
| Fault-tolerance | Scraping can take hours: expect Internet connection issues; Goodreads throws exceptions and is sometimes over capacity or in maintenance mode; invalid dates, ...; supports the unattendability goal (fault tolerance is not high availability) |
| Resumability | Scraping can take hours: allow intentional breaks; expect program or computer crashes and power issues -- we don't want to start from the beginning |
| Testability | Scraping the Goodreads website depends on stable HTML/JS parts, and we cannot know in advance when and where changes will occur (long-term failure), so regular and thorough (i.e., automated) testing is needed |
| Correctness | Worst case: wasted computer time and power consumption, missed book-discovery opportunities, too many annoying/useless emails (recentrated); out of scope: formal proofs, deep specifications |
| Repair turnaround time | Scraping can take hours: shouldn't impact regular debugging too much |
| Ease of use on UNIX systems | Out of scope: Windows, GUIs, browser add-ons, SaaS -- too much effort, although it would increase the potential user base |
| Learnability | Lots of program options and functions (libs) -- you cannot remember everything; no docs = no users; correct use and some expectation management support the correctness goal |
| Integrity | Users on GR might try to abuse scrapers such as our programs, or other programs reading our outputs, by saving rogue strings in reviews, usernames, etc. (XSS) |

¹) List of possible goals...
²) Risks, worst-case, constraints, ...

Realization

| Activity¹ | Coverage/Frequency | Operational Notes |
|-----------|--------------------|-------------------|
| Unit testing | libraries' public functions | Use cache < 24 h |
| Regression testing | before pushing to GitHub and inside new Docker images | Running unit tests automatically via a git hook reduces the chance of distributing a buggy release; per-commit would be annoying because some tests need 3-8 minutes (without cache) |
| Manual testing | user-scripts, when sth. significant changed | Automated UI tests are not worth the effort here; manual fault injection: disable the network; as a one-man side project, this also has its limits in terms of effort |
| Syntactic check | user-scripts, before each commit | Automatically via a git hook (see the first sketch below this table), because small (accidental) changes are not always manually tested but break things too; `use strict; use warnings;` |
| Push logic down the stack | user-scripts | Keep very little code in the user-scripts by moving as much as possible into the libs (down the technology stack); tests covering the libs then cover most fallible code, good enough to gain confidence; external libraries are usually more mature; less repetition in user-scripts, centralized changes, technical debt and code smells isolated (API of higher importance) |
| Persistent caching | all scraped raw source data (not results) | Caching the sources makes it easier (faster) to fix scraping and calculation errors; caching (false) results would require downloading the sources again, which takes much time -- CPU is cheap, I/O is expensive; also easier to build apps on top of that, since apps don't need to care about caching / it's fully transparent |
| Outwait I/O issues | libraries | Wait, retry n times, skip less important items |
| HTML entity encoding | user-scripts' HTML generation | Prevent XSS (see the second sketch below this table) |
| Docker container | all | Scripted builds/uploads via Makefile; moved from DockerHub to GitHub since automated builds cost money now |
| Makefile | dependencies, Docker, developer setup | |
| Unit test = tutorial | libraries, emergent | Reduces errors caused by incorrect use or assumptions; no need to write (outdated) tutorials |
| Inline man pages | user-scripts, program parameters, examples | Man-page POD header in each script: more likely to be up to date, and can be extracted and displayed on incorrect program use |
| Help files | user-scripts, everything but program parameters (DRY) | Markdown file in the help directory, with screenshot, motivation, install instructions, lessons learned, etc.; program parameters are documented in the man pages |
| Documented conventions | user-scripts, common program parameters | Developer convenience, consistent look and feel, principle of least astonishment (POLA) |
| Field failure reports | | Ask for reports; contact options in scripts / help |
| Issue tracking | all | GitHub issue tracker: feedback (feature requests, usage problems), troubleshooting history |
| Version control | all | Git and GitHub: reverting code / source history, releasing, sync between computers |
| Use free software only | all | Free as in beer |

¹) Quality assurance activities: defect prevention and product evaluation (quality control/testing)
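
For the git-hook-based syntactic check, a minimal pre-commit hook could look like the sketch below. This is an illustration only, not the hook actually used in this repository; the file selection and error handling are assumptions.

```perl
#!/usr/bin/env perl
# .git/hooks/pre-commit -- illustrative sketch only, not this repo's hook:
# runs "perl -c" on every staged Perl file and aborts the commit on errors.
use strict;
use warnings;

my @staged = grep { /\.(?:pl|pm|t)$/ }
             split /\n/, `git diff --cached --name-only --diff-filter=ACM`;

for my $file (@staged)
{
	# Compile-only check; catches the small accidental breakage that
	# manual testing does not always cover:
	system( 'perl', '-c', $file ) == 0
		or die "Syntax check failed for $file -- commit aborted.\n";
}
exit 0;
```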

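The HTML entity encoding activity boils down to never writing scraped strings into generated HTML verbatim. A minimal sketch using the CPAN module HTML::Entities (the module choice is an assumption, not necessarily what the toolbox uses):

```perl
use strict;
use warnings;
use HTML::Entities qw( encode_entities );

# Scraped strings are attacker-controlled: a Goodreads username or review
# may contain markup. Encode before interpolating into generated HTML:
my $username = '<script>alert(1)</script>';

printf "<td>%s</td>\n", encode_entities( $username );
# Prints: <td>&lt;script&gt;alert(1)&lt;/script&gt;</td>
```
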
Worth considering:

  • Perl taint mode (perl -T)
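
Taint mode makes Perl track all outside input (arguments, environment, file input) and refuse to use it in risky operations such as file names or system calls until it has been explicitly validated. A minimal sketch of the untainting pattern -- the variable names and the validation regex are hypothetical:

```perl
#!/usr/bin/perl -T
# -T marks all outside input as tainted.
use strict;
use warnings;

my $shelf = $ARGV[0] // '';   # tainted: comes from the command line

# The sanctioned way to untaint is an explicit regex capture:
my( $safe_shelf ) = $shelf =~ /\A([\w-]+)\z/
	or die "Invalid shelf name: '$shelf'\n";

print "Shelf: $safe_shelf\n";   # $safe_shelf is untainted now
```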

Evaluation

| Goal | Unit | Regr | ManT | Synt | Down | Cach | Wait | HtmE | Dock | Make | ManP | Help | Conv | Issu | VC | Free | Overall |
|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|----|------|---------|
| Monetary costs | none | none | none | none | none | none | none | none | none | none | none | none | none | none | none | +++ | strong |
| Correctness | +++ | +++ | +++ | ++ | +++ | none | none | none | ++ | none | + | + | + | ++ | none | none | strong |
| Unattendability | none | none | none | none | none | none | +++ | none | none | none | none | none | none | none | none | none | weak |
| Fault-tolerance | none | none | + | none | none | none | +++ | none | none | none | none | none | none | none | none | none | weak |
| Resumability | none | none | none | none | none | +++ | + | none | none | none | none | none | none | none | none | none | strong |
| Testability | +++ | +++ | none | none | +++ | + | none | none | + | none | none | none | none | none | none | none | strong |
| Repair turnaround time | +++ | +++ | none | none | ++ | +++ | none | none | none | none | none | none | none | none | + | none | strong |
| Ease of use on UNIX | none | none | none | none | none | none | none | none | +++ | ++ | +++ | +++ | + | none | none | none | strong |
| Learnability | ++ | none | none | none | none | none | none | none | none | none | +++ | +++ | + | none | none | none | strong |
| Integrity | none | none | none | none | none | none | none | ++ | none | none | none | none | none | none | none | none | at-risk |

Values: +++, ++, +, none (does not address this goal)
Overall assurance: strong, weak, at-risk

Note: As a rule of thumb, a goal needs at least two "+++" activities plus one "++" activity for a "strong" overall rating, and at least two "++" plus one "+" activity for a "weak" rating.

Setting up unit tests

Rename config.pl-example to config.pl and edit the file. Replace the email, pass, user-id values.

Running all tests via a GNU/Linux terminal:

```
$ cd goodreads
$ prove
t/gisxxx.t ........... ok
t/glogin.t ........... ok
t/gmeter.t ........... ok
t/greadauthors.t ..... ok
...
t/gverifyxxx.t ....... ok
All tests successful.
Files=16, Tests=253, 11 wallclock secs ( 0.16 usr  0.03 sys +  9.75 cusr  0.48 csys = 10.42 CPU)
Result: PASS
```
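
Each t/*.t file can double as a tutorial (see "Unit test = tutorial" in the Realization table): it demonstrates correct use of the library's public functions. A minimal sketch in that style -- the module and function names are guesses inferred from the test file names above, not a documented API, and the lib path may differ:

```perl
#!/usr/bin/perl
# Sketch of a t/*.t file that doubles as a usage example:
use strict;
use warnings;
use Test::More tests => 2;
use lib '.';   # adjust to wherever the library actually lives

use_ok( 'Goodscrapes' );                             # library loads cleanly
can_ok( 'Goodscrapes', qw( gmeter greadauthors ) );  # public functions exist
```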

Conventions

Program calling conventions

Don't repurpose these switches in new or extended programs:

```
-c,  --cache
-d,  --dict
-i,  --ignore-errors
-o,  --outdir     or  --outfile
-r,  --minrated   or  --ratings    (TODO confusing)
-s,  --shelf
-u,  --userid
-?,  --help
```
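
A sketch of how a script might wire a subset of these switches to variables with Getopt::Long, falling back to its POD man page on bad usage (variable names, value types, and the pod2usage behavior are assumptions for illustration, not this repo's actual code):

```perl
use strict;
use warnings;
use Getopt::Long;
use Pod::Usage;

GetOptions( 'cache|c=i'       => \my $cache_days,
            'dict|d=s'        => \my $dict_path,
            'ignore-errors|i' => \my $ignore_errors,
            'outdir|o=s'      => \my $outdir,
            'shelf|s=s'       => \my $shelf,
            'userid|u=s'      => \my $userid,
            'help|?'          => sub{ pod2usage( -verbose => 2 ) })
	or pod2usage( 1 );   # incorrect use: extract and show the POD man page
```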

Lessons learned

Speeding up scraping

  • pay attention to the print views of Goodreads pages; they may offer more data per request than the normal web view, e.g., 200 book titles instead of 30 (requires login!)
  • due to Goodreads request throttling, multi-threaded requests had no significant performance impact but made the code more complex; a real speed-up would likely require access via multiple IP addresses, which hasn't seemed worth the effort so far
  • the official API is slow too, and risks being throttled further whenever Goodreads has capacity problems again; the API is not used internally and is rather neglected -- API users are of secondary importance compared to web users
  • use a cache (see the sketch after this list)
  • although retaining backwards compatibility with older page versions (served by not-yet-updated servers) is generally a good idea when scraping, it isn't needed on Goodreads
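
As a minimal illustration of the cache advice: keep the raw HTML on disk and hit the network only on a cache miss or expiry. The path scheme, expiry handling, and module choices (LWP::UserAgent, File::Slurp) are assumptions for this sketch, not the toolbox's actual caching code:

```perl
use strict;
use warnings;
use LWP::UserAgent;
use File::Slurp qw( read_file write_file );
use Digest::MD5 qw( md5_hex );

# Returns the page body for $url, served from a local file cache when the
# cached copy is younger than $max_age_secs; otherwise downloads and caches.
sub cached_get
{
	my( $url, $max_age_secs ) = @_;
	my $path = '/tmp/grcache-' . md5_hex( $url );
	
	return read_file( $path )
		if -e $path && time - (stat $path)[9] < $max_age_secs;
	
	my $res = LWP::UserAgent->new->get( $url );
	die "GET $url failed: " . $res->status_line unless $res->is_success;
	
	write_file( $path, $res->decoded_content );
	return $res->decoded_content;
}

my $html = cached_get( 'https://www.goodreads.com/book/show/1', 24*60*60 );
```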

Typical scraping mistakes on Goodreads pages

  • number formats: "1,123,123" -- thousands separators break naive numeric parsing
  • dates such as "Jan 01, 1010" -- user-entered dates parse fine but can be obviously wrong (see the sketch below this list)
  • TODO
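
A sketch of defensive parsing for both pitfalls; the regexes and the plausible year range are assumptions, not the toolbox's actual validation code:

```perl
use strict;
use warnings;

# Pitfall 1: thousands separators -- "1,123,123" is not a Perl number.
sub parse_count
{
	my $raw = shift // '';
	$raw =~ tr/,//d;                       # strip separators first
	return $raw =~ /\A\d+\z/ ? 0 + $raw : undef;
}

# Pitfall 2: user-entered dates such as "Jan 01, 1010" parse fine but are
# almost certainly wrong; sanity-check the year against a plausible range.
sub parse_year
{
	my $raw  = shift // '';
	my( $y ) = $raw =~ /(\d{4})\s*\z/ or return undef;
	return $y >= 1450 && $y <= 2100 ? $y : undef;   # assumed range
}

print parse_count( '1,123,123' ), "\n";                     # 1123123
print parse_year( 'Jan 01, 1010' ) // 'suspect date', "\n"; # suspect date
```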