This repository has been archived by the owner on Dec 14, 2023. It is now read-only.

stories_public/list empty story_tags when more than 100 rows requested #729

Open
pypt opened this issue Sep 29, 2020 · 2 comments

@pypt
Contributor

pypt commented Sep 29, 2020

(Moved from #725.)

More confusingly, requesting more than 100 rows per page seems to make the story_tags disappear from the results.

This call returns a story (media_id 105831) with story_tags on it:

mc.storyList('robot', mc.dates_as_query_clause(dt.date(2020,8,2), dt.date(2020,8,3)), rows=100)[0]

But this call, with rows=200, returns the same story with NO story_tags on it:

mc.storyList('robot', mc.dates_as_query_clause(dt.date(2020,8,2), dt.date(2020,8,3)), rows=200)[0]
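
Until the cause is found, one possible client-side workaround is to never request more than 100 rows at once and page through the results instead. The sketch below is only an illustration: `fetch_in_small_pages` and `fetch_page` are hypothetical names, and it assumes each story carries a `processed_stories_id` that can be passed back as the paging cursor.

```python
from typing import Callable, Dict, List

# Largest page size at which story_tags was still populated in this report.
PAGE_SIZE = 100


def fetch_in_small_pages(
    fetch_page: Callable[[int, int], List[Dict]],
    total_rows: int,
    page_size: int = PAGE_SIZE,
) -> List[Dict]:
    """Collect up to total_rows stories without asking for more than page_size at once.

    fetch_page(last_id, rows) is a hypothetical callable that returns the next
    `rows` stories whose processed_stories_id is greater than last_id.
    """
    stories: List[Dict] = []
    last_id = 0
    while len(stories) < total_rows:
        want = min(page_size, total_rows - len(stories))
        page = fetch_page(last_id, want)
        if not page:
            break  # no more results
        stories.extend(page)
        # Use the last story of the page as the cursor for the next request.
        last_id = page[-1]["processed_stories_id"]
    return stories
```

With the Python client, `fetch_page` could plausibly be a lambda wrapping `mc.storyList(q, fq, last_processed_stories_id=last_id, rows=rows)`; the `last_processed_stories_id` keyword is an assumption here and should be checked against the client version in use.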
@pypt
Contributor Author

pypt commented Sep 29, 2020

Prep:

>>> import mediacloud.api, json, datetime as dt
>>> import pprint
>>> pp = pprint.PrettyPrinter(indent=4)
>>> mc = mediacloud.api.MediaCloud('YOUR_KEY')
>>> tag_sets_id = mediacloud.tags.TAG_SET_NYT_THEMES_VERSION
>>> q = '*'
>>> fq = mc.dates_as_query_clause(dt.date(2020,8,20), dt.date(2020,8,24))

99 rows - story_tags looks okay:

>>> pp.pprint(mc.storyList('robot', mc.dates_as_query_clause(dt.date(2020,8,2), dt.date(2020,8,3)), rows=99)[0])
{   'ap_syndicated': False,
    'collect_date': '2020-03-09 18:44:54.488650',
    'feeds': None,
    'guid': 'https://www.sinembargo.mx/08-02-2020/3727176',
    'language': 'es',
    'media_id': 105831,
    'media_name': 'sinembargo.mx',
    'media_url': 'http://sinembargo.mx/#spider',
    'metadata': {   'date_guess_method': {   'stories_id': 1543287159,
                                             'tag': 'guess_by_unknown',
                                             'tag_set': 'date_guess_method',
                                             'tag_sets_id': 508,
                                             'tags_id': 50741492},
                    'extractor_version': {   'stories_id': 1543287159,
                                             'tag': 'readability-lxml-0.7',
                                             'tag_set': 'extractor_version',
                                             'tag_sets_id': 1354,
                                             'tags_id': 81092444},
                    'geocoder_version': None,
                    'nyt_themes_version': None},
    'processed_stories_id': 1950370689,
    'publish_date': '2020-08-02 00:00:00',
    'stories_id': 1543287159,
    'story_tags': [   {   'stories_id': 1543287159,
                          'tag': 'guess_by_unknown',
                          'tag_set': 'date_guess_method',
                          'tag_sets_id': 508,
                          'tags_id': 50741492},
                      {   'stories_id': 1543287159,
                          'tag': 'readability-lxml-0.7',
                          'tag_set': 'extractor_version',
                          'tag_sets_id': 1354,
                          'tags_id': 81092444}],
    'title': 'Penaut, el robot que alimenta a personas en cuarentena por '
             'Coronavirus en un hotel de China',
    'url': 'https://www.sinembargo.mx/08-02-2020/3727176',
    'word_count': None}

100 rows - story_tags looks okay:

>>> pp.pprint(mc.storyList('robot', mc.dates_as_query_clause(dt.date(2020,8,2), dt.date(2020,8,3)), rows=100)[0])
{   'ap_syndicated': False,
    'collect_date': '2020-03-09 18:44:54.488650',
    'feeds': None,
    'guid': 'https://www.sinembargo.mx/08-02-2020/3727176',
    'language': 'es',
    'media_id': 105831,
    'media_name': 'sinembargo.mx',
    'media_url': 'http://sinembargo.mx/#spider',
    'metadata': {   'date_guess_method': {   'stories_id': 1543287159,
                                             'tag': 'guess_by_unknown',
                                             'tag_set': 'date_guess_method',
                                             'tag_sets_id': 508,
                                             'tags_id': 50741492},
                    'extractor_version': {   'stories_id': 1543287159,
                                             'tag': 'readability-lxml-0.7',
                                             'tag_set': 'extractor_version',
                                             'tag_sets_id': 1354,
                                             'tags_id': 81092444},
                    'geocoder_version': None,
                    'nyt_themes_version': None},
    'processed_stories_id': 1950370689,
    'publish_date': '2020-08-02 00:00:00',
    'stories_id': 1543287159,
    'story_tags': [   {   'stories_id': 1543287159,
                          'tag': 'guess_by_unknown',
                          'tag_set': 'date_guess_method',
                          'tag_sets_id': 508,
                          'tags_id': 50741492},
                      {   'stories_id': 1543287159,
                          'tag': 'readability-lxml-0.7',
                          'tag_set': 'extractor_version',
                          'tag_sets_id': 1354,
                          'tags_id': 81092444}],
    'title': 'Penaut, el robot que alimenta a personas en cuarentena por '
             'Coronavirus en un hotel de China',
    'url': 'https://www.sinembargo.mx/08-02-2020/3727176',
    'word_count': None}

101 rows - story_tags is empty:

>>> pp.pprint(mc.storyList('robot', mc.dates_as_query_clause(dt.date(2020,8,2), dt.date(2020,8,3)), rows=101)[0])
{   'ap_syndicated': False,
    'collect_date': '2020-03-09 18:44:54.488650',
    'feeds': None,
    'guid': 'https://www.sinembargo.mx/08-02-2020/3727176',
    'language': 'es',
    'media_id': 105831,
    'media_name': 'sinembargo.mx',
    'media_url': 'http://sinembargo.mx/#spider',
    'metadata': {   'date_guess_method': None,
                    'extractor_version': None,
                    'geocoder_version': None,
                    'nyt_themes_version': None},
    'processed_stories_id': 1950370689,
    'publish_date': '2020-08-02 00:00:00',
    'stories_id': 1543287159,
    'story_tags': [],
    'title': 'Penaut, el robot que alimenta a personas en cuarentena por '
             'Coronavirus en un hotel de China',
    'url': 'https://www.sinembargo.mx/08-02-2020/3727176',
    'word_count': None}
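
The three manual checks above can be automated when looking for the exact threshold. This is a minimal sketch, not part of the client: `first_rows_without_tags` and `fetch` are hypothetical names, and `fetch(rows)` stands in for a call such as `mc.storyList(...)` with the given `rows` value.

```python
from typing import Callable, Dict, Iterable, List, Optional


def first_rows_without_tags(
    fetch: Callable[[int], List[Dict]],
    candidates: Iterable[int],
) -> Optional[int]:
    """Return the first `rows` value for which the first story has no story_tags.

    Returns None if every candidate yields a first story with tags
    (or yields no stories at all).
    """
    for rows in candidates:
        stories = fetch(rows)
        # An empty or missing story_tags list on the first story is the symptom.
        if stories and not stories[0].get("story_tags"):
            return rows
    return None
```

For the query in this report, calling it with `range(98, 104)` as the candidates would be expected to return 101, given the outputs shown above.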

@pypt
Contributor Author

pypt commented Sep 29, 2020

I think this natatime() call could be the one to blame, but I can't figure out how:

my $iter = natatime(100, @{ $stories } );
while ( my @chunk_stories = $iter->() )
{
    my $chunk_ids_list = join( ',', map { int( $_->{ stories_id } ) } @chunk_stories );
    my $tag_data = $db->query(
        <<SQL
        SELECT
            s.stories_id::int,
            t.tags_id,
            t.tag,
            ts.tag_sets_id,
            ts.name AS tag_set
        FROM stories_tags_map AS s
            JOIN tags AS t
                ON t.tags_id = s.tags_id
            JOIN tag_sets AS ts
                ON ts.tag_sets_id = t.tag_sets_id
        WHERE stories_id in ( $chunk_ids_list )
        ORDER BY t.tags_id
SQL
    )->hashes;
    $stories = MediaWords::DBI::Stories::attach_story_data_to_stories( $stories, $tag_data, 'story_tags' );
}
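
One mechanism that would match the symptom, offered only as a guess: the loop passes the full `$stories` list to the attach helper on every chunk, together with tag rows for that chunk alone. If the helper resets the field on every story it receives before filling it in (an assumption, not verified against the Perl source), then the second chunk would wipe the tags attached during the first. The Python model below uses hypothetical stand-ins (`attach_story_data`, `attach_tags_in_chunks`) to show that behaviour.

```python
from typing import Dict, List


def attach_story_data(stories: List[Dict], rows: List[Dict], field: str) -> List[Dict]:
    """Hypothetical stand-in for attach_story_data_to_stories.

    ASSUMPTION: it overwrites `field` on every story passed in, using an
    empty list for stories that have no matching rows.
    """
    by_id: Dict[int, List[Dict]] = {}
    for row in rows:
        by_id.setdefault(row["stories_id"], []).append(row)
    for story in stories:
        story[field] = by_id.get(story["stories_id"], [])
    return stories


def attach_tags_in_chunks(stories: List[Dict], tag_rows: List[Dict], chunk_size: int = 100) -> List[Dict]:
    """Model of the chunked loop: tags are looked up per chunk of story IDs."""
    for start in range(0, len(stories), chunk_size):
        chunk_ids = {s["stories_id"] for s in stories[start:start + chunk_size]}
        chunk_rows = [r for r in tag_rows if r["stories_id"] in chunk_ids]
        # As in the loop above, the helper gets ALL stories, not just the chunk.
        stories = attach_story_data(stories, chunk_rows, "story_tags")
    return stories


def make_stories(n: int) -> List[Dict]:
    return [{"stories_id": i} for i in range(1, n + 1)]


def make_tags(n: int) -> List[Dict]:
    return [{"stories_id": i, "tag": "t"} for i in range(1, n + 1)]
```

In this model, 100 stories fit in a single chunk and keep their tags, while 101 stories produce a second chunk of one story whose pass clears `story_tags` on the first 100. If that is what happens, passing only `@chunk_stories` to the helper, or attaching once after collecting all tag rows, would be the obvious things to try.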
