This repository has been archived by the owner on Dec 14, 2023. It is now read-only.

1-round spidering problem/error #774

Open
dsjen opened this issue Mar 3, 2021 · 0 comments

Comments

@dsjen
Contributor

dsjen commented Mar 3, 2021

via @ebndulue

As part of our project to identify when news stories link to preprint server URLs, we ran a topic for all stories (a * query) from the Nigeria - National collection for one day, with 1 round of spidering. Our expected results are therefore all the stories published by those Nigerian news outlets on that date, plus any URLs that those articles directly link to. Here is a link to the topic: https://topics.mediacloud.org/#/topics/5530/summary?focusId&q&snapshotId=6351&timespanId=1255893.

I queried the topic for the media_id of one of the preprint servers we have identified, arxiv.org (id: 19472), and found 1 article. On the Links tab (view here: https://topics.mediacloud.org/#/topics/5530/stories/1834760437?focusId&q&snapshotId=6351&timespanId=1255893), you can see that the news source that links to the arxiv.org article is Wired.com. Wired is a US news source and is not part of the Nigeria collection. So, if the Wired story is the only one that links to the arxiv.org URL, then the arxiv.org URL should not be in the topic. Clearly some additional spidering has happened here beyond the initial 1 round.
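To make the expectation concrete, here is a minimal sketch of the check I have in mind. It is not Media Cloud code and does not use the Media Cloud API: the function name, the story IDs, and the shape of the inputs (a set of seed story IDs, a list of all story IDs in the topic, and a list of `(source, target)` link pairs) are all hypothetical. The idea is that with 1 round of spidering, every story in the topic should either be a seed story or be linked to directly by a seed story.

```python
from typing import Iterable, List, Set, Tuple


def find_unexpected_stories(
    seed_story_ids: Set[int],
    all_story_ids: Iterable[int],
    links: Iterable[Tuple[int, int]],
) -> List[int]:
    """Return stories that 1 round of spidering should not have added.

    All inputs are hypothetical: `links` holds (source_id, target_id) pairs.
    """
    # Stories that at least one seed story links to directly.
    linked_from_seed = {
        target for source, target in links if source in seed_story_ids
    }
    # After exactly 1 round, only seeds and their direct targets are allowed.
    allowed = seed_story_ids | linked_from_seed
    return sorted(s for s in all_story_ids if s not in allowed)


# Hypothetical IDs: 1 and 2 are seed stories from the collection,
# 4 stands in for the Wired story, 5 for the arxiv.org story.
seeds = {1, 2}
stories = [1, 2, 3, 4, 5]
story_links = [(1, 3), (1, 4), (4, 5)]

print(find_unexpected_stories(seeds, stories, story_links))  # prints [5]
```

In this toy example, story 4 is allowed because a seed story links to it, but story 5 is reachable only through story 4, so it would require a second round of spidering. That matches what I seem to be seeing with the arxiv.org story.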

As I mentioned, a similar error has happened to me in 1-round spidered topic(s) before, and Hal said it was somewhat inevitable for things to sneak in sometimes, but I didn't understand why.

Anything you can explain about this, and ways we can address it, would be much appreciated! Several of our research questions have to do with limiting the scope of a topic to only a select set of news sources and the stories they link to, so having that feature be unreliable is somewhat of a challenge.

@dsjen dsjen added the bug label Mar 3, 2021
@dsjen dsjen added this to To do in Dev via automation Mar 3, 2021