This repository has been archived by the owner on Mar 9, 2023. It is now read-only.

My script for finding books by looking at bookshelves of people who read similar books #36

Open
san-kumar opened this issue Jul 30, 2021 · 9 comments

Comments

@san-kumar

san-kumar commented Jul 30, 2021

Love this toolbox. But it was missing a feature for finding books by looking at bookshelves of people who read similar books. So I wrote this small perl script for that today.

Here is how it works:

  • fetches books with 4 and 5 stars in your profile

  • crawls reviews of these books to find users who also rated it 4 or 5 stars

  • looks up the bookshelves of those users to see which books they rated 4 or 5 stars

  • ranks books based on number of votes from these users

  • also ranks users by number of books they have in common (min 3)

  • also gives more votes to users who love the same books as you but also hate the same books as you (i.e. 1 or 2 star)

Output:

  • Gives you a list of books that were rated highly by people who share the same taste as you
  • Gives you a list of doppelgangers, i.e. people who have rated books very similar to you.
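
The steps above can be sketched on toy data. This is only a minimal illustration of the idea, not the script itself: the shelves, the names and the `$min_common` threshold are made up, agreement on hated books is simply counted like agreement on loved books, and books you have already shelved are skipped.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical toy data: user => { book => star rating }. Not real Goodreads data.
my %shelves = (
    me    => { Dune => 5, Emma => 4, Ulysses => 1 },
    alice => { Dune => 5, Emma => 5, Ulysses => 2, Hyperion => 5, Solaris => 4 },
    bob   => { Dune => 4, Emma => 4, Hyperion => 5 },
    carol => { Dune => 2, Twilight => 5 },
);

my $me         = 'me';
my $min_common = 2;    # assumed threshold for this toy example

my (%common, %votes);
for my $user (grep { $_ ne $me } keys %shelves) {
    # Count books both of us loved (4-5 stars) or both hated (1-2 stars).
    for my $book (keys %{ $shelves{$me} }) {
        my $mine  = $shelves{$me}{$book};
        my $other = $shelves{$user}{$book} // 0;    # 0 = not rated by them
        $common{$user}++ if $mine >= 4 && $other >= 4;
        $common{$user}++ if $mine <= 2 && $other >= 1 && $other <= 2;
    }
}

# Like-minded users vote for the books they loved that I have not shelved yet.
for my $user (grep { ($common{$_} // 0) >= $min_common } keys %common) {
    for my $book (keys %{ $shelves{$user} }) {
        next if exists $shelves{$me}{$book};
        $votes{$book}++ if $shelves{$user}{$book} >= 4;
    }
}

# Sort keys by descending count, then alphabetically for a stable order.
sub ranked {
    my $h = shift;
    return sort { $h->{$b} <=> $h->{$a} || $a cmp $b } keys %$h;
}

print "doppelgangers: ", join(', ', map { sprintf('%s (%d)', $_, $common{$_}) } ranked(\%common)), "\n";
print "books: ",         join(', ', map { sprintf('%s (%d)', $_, $votes{$_}) }  ranked(\%votes)),  "\n";
# prints "doppelgangers: alice (3), bob (2)"
# prints "books: Hyperion (2), Solaris (1)"
```

Here carol never becomes a doppelganger because she disliked the one book she shares with `me`, so her favourites get no vote.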

My Perl is a little rusty, so this isn't the best way to do it, but then Perl's motto is TIMTOWTDI, and it did produce some good output.

Let me know what you guys think. Will post the script in the next comment.

@san-kumar
Author

#!/usr/bin/env perl

#<--------------------------------- MAN PAGE --------------------------------->|

=pod

=head1 NAME

bookfinder - finding books by looking at bookshelves of people who read similar books

=head1 PURPOSE

=over

=item * fetches books with 4 and 5 stars in your profile

=item * crawls reviews of these books to find users who also rated it 4 or 5 stars

=item * looks up the bookshelves of those users to see which books they rated 4 or 5 stars

=item * ranks books based on number of votes from these users

=item * also ranks users by number of books they have in common (min 3)

=item * also gives more votes to users who love the same books as you but also hate the same books as you (i.e. 1 or 2 stars)

=back

=head1 SYNOPSIS

B<bookfinder.pl>
[B<-n> F<number>]
[B<-x> F<number>] 
[B<-d> F<filename>] 
[B<-u> F<number>] 
[B<-c> F<numdays>] 
[B<-o> F<filename>] 
[B<-s> F<shelfname> ...] 
[B<-i>]
F<goodloginmail> [F<goodloginpass>]


=head1 OPTIONS

Mandatory arguments to long options are mandatory for short options too.

=over 4

=item B<-n, --maxbooks>=F<number>

Max number of books in a user's bookshelf. Currently set to
500. People who have hundreds or thousands of books often
add more noise than signal to your results.


=item B<-x, --rigor>=F<numlevel>

We need to find members who rated the books you rated highly, 
though Goodreads just shows a few ratings. 
We exploit ratings filters and the reviews-search to find more members:

 level 1 = filters-based search of book-raters (max 5400 ratings) - default
 level 2 = like 1 plus dict-search if >3000 ratings with stall-time of 2min
 level n = like 1 plus dict-search with stall-time of n minutes

Rigor level 0 is useless here (latest readers only), 
and 2+ (dict-search) has a bad cost/benefit ratio given hundreds of books.


=item B<-d, --dict>=F<filename>

default is F<./list-in/dict.lst>


=item B<-u, --userid>=F<number>

check another member instead of the one identified by the login-mail 
and password arguments. You find the ID by looking at the shelf URLs.


=item B<-c, --cache>=F<numdays>

number of days to store and reuse downloaded data in F</tmp/FileCache/>,
default is 31 days. This helps with cheap recovery on a crash, power blackout 
or pause, and when experimenting with parameters. Loading data from Goodreads
is a very time consuming process.


=item B<-o, --outfile>=F<filename>

name of the CSV file where we write the ranked books to, default is
F<./list-out/bookfinder-books.csv>. The ranked users are always written
to F<./list-out/bookfinder-users.csv>.


=item B<-i, --ignore-errors>

Don't retry on errors, just keep going. 
Sometimes useful if a single Goodreads resource hangs over long periods 
and you're okay with some values missing in your result.
This option is not recommended when you run the program unattended.




=item B<-?, --help>

show full man page

=back


=head1 FILES

F<./list-in/dict.lst>

F<./list-out/bookfinder-books.csv>

F<./list-out/bookfinder-users.csv>

F</tmp/FileCache/>


=head1 EXAMPLES

$ ./bookfinder.pl goodloginmail MyPASSword

$ ./bookfinder.pl -c 31 -o myfile.csv goodloginmail pass


=head1 REPORTING BUGS

Report bugs to <[email protected]> or use Github's issue tracker
L<https://github.com/andre-st/goodreads-toolbox/issues>


=head1 COPYRIGHT

This is free software. You may redistribute copies of it under the terms of
the GNU General Public License L<https://www.gnu.org/licenses/gpl.html>.
There is NO WARRANTY, to the extent permitted by law.



=head1 VERSION

2020-01-23 (Since 2018-06-22)

=cut

#<--------------------------------- 79 chars --------------------------------->|


use strict;
use warnings qw(all);
use locale;
use 5.18.0;

# Perl core:
use FindBin;
use lib "$FindBin::Bin/lib/";
use Time::HiRes qw(time tv_interval);
use POSIX qw(strftime floor locale_h);
use File::Spec; # Platform indep. directory separator
use IO::File;
use Getopt::Long;
use Pod::Usage;
# Third party:
use Text::CSV;
# Ours:
use Goodscrapes;


# ----------------------------------------------------------------------------
# Program configuration:
#
setlocale(LC_CTYPE, "en_US"); # GR dates all en_US
STDOUT->autoflush(1);
gsetopt(cache_days => 31);

our $TSTART = time();
our $MINCOMMON = 5;
our $MAXAUBOOKS = 100;
our $RIGOR = 1;
our $MAXBOOKS = 500;
our $DICTPATH = File::Spec->catfile($FindBin::Bin, 'list-in', 'dict.lst');
our $OUTPATH;
our @SHELVES;
our $USERID;

GetOptions('rigor|x=i'          => \$RIGOR,
    'dict|d=s'           => \$DICTPATH,
    'userid|u=s'         => \$USERID,
    'outfile|o=s'        => \$OUTPATH,
    'maxbooks|n=s'       => \$MAXBOOKS,
    'shelf|s=s'          => \@SHELVES,
    'ignore-errors|i'    => sub {gsetopt(ignore_errors => 1);},
    'cache|c=i'          => sub {gsetopt(cache_days => $_[1]);},
    'help|?'             => sub {pod2usage(-verbose => 2);})
    or pod2usage(1);

pod2usage(1) if !$ARGV[0];

glogin(usermail => $ARGV[0], # Login also allows to load 200 books in 1 request
    userpass    => $ARGV[1], # Asks pw if omitted
    r_userid    => \$USERID);

sub bookshelf {
    my $id = shift;
    my %books;

    print "\nLooking up bookshelf of $id..";

    greadshelf(from_user_id => $id,
        ra_from_shelves     => [ 'read' ],
        rh_into             => \%books,
        # on_book       => sub{},
        on_progress         => gmeter('books')
    );

    my (@good, @bad);
    for my $book_id (keys %books) {
        my $book = $books{$book_id};
        #next unless $book->{title} =~ /Club/;

        my $rating = $book->{user_rating} // 0; # 0 means the book is unrated
        push(@good, $book) if ($rating >= 4);
        push(@bad, $book) if ($rating >= 1 && $rating <= 2); # don't count unrated books as hated

        #warn("cannot find rating for $book->{title} of $id\n") unless ($rating >= 1);
    }

    return (\@good, \@bad);
}

sub bookgenres {
    my $bid = shift;
    my $html = Goodscrapes::_html(Goodscrapes::_book_url($bid));
    my @genres;
    while ($html =~ m[href="/genres/([\w-]+)"]g) {
        push(@genres, $1);
    }

    return \@genres;
}

my ($su_good, $su_bad) = bookshelf($USERID);
my (%good_users, %good_books, %haters);

for my $b (@$su_good) {
    print "\nLooking up reviews for $b->{title}..";
    $b->{reviews} = {};
    greadreviews(rh_for_book => $b,
        rh_into              => $b->{reviews},
        rigor                => $RIGOR,
        dict_path            => $DICTPATH,
        on_progress          => gmeter('memb'));

    for my $rev (values %{$b->{reviews}}) {
        my $u = $rev->{rh_user};
        if ($rev->{rating} >= 4) {
            # One vote per book of ours that this user also rated highly
            $good_users{$u->{id}}{votes}++;
            $good_users{$u->{id}}{user} = $u;
        } elsif ($rev->{rating} <= 2) {
            # One vote per book of ours that this user rated poorly
            $haters{$u->{id}}{votes}++;
            $haters{$u->{id}}{user} = $u;
        }
    }
}

for my $u (keys %good_users) {
    $good_users{$u}->{'bad'} = defined($haters{$u}->{votes}) ? $haters{$u}->{votes} : 0;
}

printf("\nHere are your best users (out of %d users):\n", scalar keys %good_users);
my $filename = File::Spec->catfile($FindBin::Bin, 'list-out', "bookfinder-users.csv");
my $csv = Text::CSV->new({ binary => 1, eol => $/ }) or die "Failed to create a CSV handle: $!";
open my $fh, ">:encoding(utf8)", $filename or die "failed to create $filename: $!";

$csv->print($fh, [ 'uid', 'name', 'good_common', 'bad_common', 'total_common', 'total_books', 'ratio', 'url' ]);

for my $user_id (keys %good_users) {
    my $userHash = $good_users{$user_id};

    if (($user_id ne $USERID) && ($userHash->{votes} >= 2)) {
        my $user = $userHash->{user};
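        # List assignment takes the first return value (the 4-5 star books);
        # a scalar assignment would receive the last one (the 1-2 star books).
        my ($uBooks) = bookshelf($user_id);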
        my $numBooks = scalar @$uBooks;

        if (!$MAXBOOKS || ($numBooks <= $MAXBOOKS)) {
            my $total = $userHash->{votes} + $userHash->{bad};
            $csv->print($fh, [ $user->{id}, $user->{name}, $userHash->{votes}, $userHash->{bad}, $total, $numBooks, $numBooks > 0 ? $total / $numBooks : 0, "https://www.goodreads.com/review/list/$user_id?sort=rating" ]);

            for my $gb (@$uBooks) {
                # One vote per like-minded user who rated this book highly
                $good_books{$gb->{id}}{votes}++;
                $good_books{$gb->{id}}{book} = $gb;
            }
        } else {
            print "\nskipped books for $user_id: $numBooks > $MAXBOOKS\n";
        }
    }
}

close $fh or die "failed to close $filename: $!";

printf("\nHere are your best books (out of %d books):\n", scalar keys %good_books);
$OUTPATH = File::Spec->catfile($FindBin::Bin, 'list-out', "bookfinder-books.csv") if !$OUTPATH;

$csv = Text::CSV->new({ binary => 1, eol => $/ }) or die "Failed to create a CSV handle: $!";
open $fh, ">:encoding(utf8)", $OUTPATH or die "failed to create $OUTPATH: $!";

$csv->print($fh, [ 'bid', 'title', 'author', 'votes', 'avg_rating', 'num_ratings', 'genres', 'img_url' ]);

for my $bk (sort {$b->{votes} <=> $a->{votes}} values(%good_books)) {
    if ($bk->{votes} > 1) {
        my $b = $bk->{book};
        my $genres = bookgenres($b->{id});
        printf("%s with %d votes\n", $b->{title}, $bk->{votes});
        $csv->print($fh, [ $b->{id}, $b->{title}, $b->{rh_author}->{name}, $bk->{votes}, $b->{avg_rating}, $b->{num_ratings}, join(', ', @$genres), $b->{img_url} ]);
    }
}

close $fh or die "failed to close $OUTPATH: $!";

@san-kumar
Author

For this to work, Goodscrapes.pm needs a minor patch at line 2075:

$bk{ user_rating     } = $row =~            /data-rating="(\d+)"/                   ? ($1?$1:0) : 0;

I guess Goodreads has changed the HTML, so without the patch the user rating is always 0. The line above fixes it.
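
To see what the patched line does in isolation, here is a minimal sketch. The HTML rows are made up for illustration (the real Goodreads markup may look different), and `user_rating` is just a hypothetical wrapper around the same pattern:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical shelf-table rows; the real Goodreads markup may differ.
my @rows = (
    '<tr><td class="rating"><div class="stars" data-rating="4"></div></td></tr>',
    '<tr><td class="rating"><div class="stars" data-rating="0"></div></td></tr>',
    '<tr><td class="rating"></td></tr>',
);

sub user_rating {
    my $row = shift;
    # Same pattern as the patched line: use the captured digits, else 0.
    return $row =~ /data-rating="(\d+)"/ ? ($1 ? $1 : 0) : 0;
}

print join(', ', map { user_rating($_) } @rows), "\n";
# prints "4, 0, 0"
```

A row without a `data-rating` attribute and a row with `data-rating="0"` both end up as 0, which the script treats as "not rated".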

@andre-st
Owner

Hi San Kumar, thanks for sharing your script. I will definitely check this out over the course of the next week.

@WaterSibilantFalling

Super like

@mcleanle

This is exactly what I've been looking for! Can it be run in Docker?

@san-kumar
Author

> This is exactly what I've been looking for! Can it be run in Docker?

I haven't tried it, but it shouldn't be too hard. Just modify the goodreads-toolbox Dockerfile to copy this script into the container, and the rest should be the same.

@mcleanle

I added your script and the patch to the goodreads-toolbox directory and then modified the .dockerignore file to include the new script in the exceptions list, then rebuilt the container from my local drive instead of pointing to github in the build command. However, it seems to have broken my bash prompt and I get "no such file or directory" when trying to run any of the scripts in the container. Oh well! I'm not a Linux programmer and have never messed around with Docker before until today. I realize this isn't a Docker help forum, however if you happen to have any tips I would love to hear them. Thank you for your awesome work on this! I hope the toolbox will be supported again one day and this can be added as an official script.

@san-kumar
Author

I think your Dockerfile may be missing the entrypoint. I haven't tried this in Docker yet and haven't looked at the Dockerfile (I'll maybe check on the weekend), but you need to copy the entrypoint from the original Dockerfile into the modified file. OTOH, if you don't want to mess with the Dockerfile, you can just mount a volume (with the -v option) and put this script there. Then use docker exec -it $pid bash (where $pid is the container ID) to enter the container and run perl script.pl. Sorry, I'm typing all this from memory, so you may have to do some digging around, but I reckon both approaches should work.

@mcleanle

mcleanle commented Dec 16, 2022

Thanks again for your help! For anyone who stumbles across this in future, here are all the steps I took to eventually get this working in Docker for Windows:

  1. Clone the repo
  2. Paste @san-kumar 's script into a new blank text file called bookfinder.pl
  3. Replace line 205 with the following: use local::lib "$FindBin::Bin/lib/local/"; use lib "$FindBin::Bin/lib/";
  4. Patch /lib/Goodscrapes.pm as @san-kumar mentions above
  5. Add perl-text-csv \ at line 47 in Dockerfile
  6. Add !/bookfinder.pl anywhere in .dockerignore
  7. Open a command prompt and cd to the repo directory
  8. Enter docker build -t goodreads-toolbox . and wait for the build to complete
  9. Enter docker run -it --publish=8080:80 goodreads-toolbox
  10. At the bash prompt, run perl bookfinder.pl

That's it!
