
Crawler detection improvement #50

Open
Yongyao opened this issue Aug 25, 2016 · 5 comments

@Yongyao (Collaborator) commented Aug 25, 2016

Write the implementation and also write tests to validate it.
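For illustration, a minimal sketch of the kind of validation test this could mean (JUnit 4; the `isCrawler` predicate below is a hypothetical stand-in, not the real CrawlerDetection API):

```java
import static org.junit.Assert.assertFalse;
import static org.junit.Assert.assertTrue;

import org.junit.Test;

public class CrawlerDetectionTest {

  // Hypothetical stand-in predicate; the real CrawlerDetection API may differ.
  private boolean isCrawler(String agent) {
    return agent == null
        || agent.toLowerCase().matches(".*(bot|crawler|spider).*");
  }

  @Test
  public void flagsKnownCrawlerAgents() {
    assertTrue(isCrawler("Mozilla/5.0 (compatible; Googlebot/2.1)"));
  }

  @Test
  public void passesOrdinaryBrowserAgents() {
    assertFalse(isCrawler("Mozilla/5.0 (Windows NT 10.0; Win64; x64) Firefox/48.0"));
  }
}
```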

lewismc added this to the 09/02/2016 milestone on Aug 26, 2016
@lewismc (Collaborator) commented Aug 26, 2016

This issue concerns the code present within CrawlerDetection.java, which is too static in nature and does not accurately capture all of the Web crawlers that appear within HTTP, FTP, etc. logs. We need to use a more intelligent mechanism for detecting Web crawlers within the PO.DAAC logs.
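As a strawman, here is a minimal sketch of one more flexible direction (class and file names are hypothetical; this is not the current CrawlerDetection.java API): load the crawler user-agent patterns from an external file so the list can be updated without recompiling.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

/**
 * Hypothetical sketch: a crawler detector driven by an external
 * pattern file instead of hard-coded agent names.
 */
public class ConfigurableCrawlerDetector {

  private final List<Pattern> agentPatterns;

  public ConfigurableCrawlerDetector(String patternFile) throws IOException {
    // One case-insensitive regex per line, e.g. "googlebot" or "crawler|spider";
    // blank lines and "#" comments are ignored.
    agentPatterns = Files.readAllLines(Paths.get(patternFile)).stream()
        .map(String::trim)
        .filter(line -> !line.isEmpty() && !line.startsWith("#"))
        .map(line -> Pattern.compile(line, Pattern.CASE_INSENSITIVE))
        .collect(Collectors.toList());
  }

  /** Returns true if the user-agent string matches any known crawler pattern. */
  public boolean isCrawler(String userAgent) {
    if (userAgent == null) {
      return true; // a missing agent string is treated as suspicious
    }
    return agentPatterns.stream().anyMatch(p -> p.matcher(userAgent).find());
  }
}
```

Even so, any list-driven matcher will miss bots that spoof browser user-agents, which is where the behavioral ideas discussed below come in.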

@Yongyao (Collaborator, Author) commented Aug 26, 2016

Any quick solutions or suggestions right now?


@lewismc (Collaborator) commented Aug 26, 2016

This is not an easy issue to tackle. I've been writing Web crawlers (and search engines) for years, so I have experienced this issue from both sides of the table.

Sites such as user-agents.org, robotstxt.org, and botsvsbrowsers.com exist; unfortunately, however, we've found that bot activity is too numerous and varied to filter accurately. If you want accurate download counts (or, in our case, dataset landing page hits/downloads), our best bet may be to push a requirement on PO.DAAC to require JavaScript to trigger the download. That is basically the only thing that will reliably filter out the bots. It also means we can catch the requests that are not intelligent enough to invoke JavaScript in order to acknowledge the download. It's also why all site traffic analytics engines these days are JavaScript based.

That being said, we've actually modified Apache Nutch to interact with JavaScript, so with Nutch we can bypass JavaScript download verification as well.

I think that this is a difficult issue... there is actually a good deal of research in this area. I will try to find some and post it here.

lewismc modified the milestones: 09/16/2016, 09/02/2016 on Sep 16, 2016
@lewismc (Collaborator) commented Sep 16, 2016

lewismc modified the milestones: 09/30/2016, 09/16/2016 on Sep 24, 2016
lewismc removed this from the 09/30/2016 milestone on Oct 12, 2016
lewismc added this to Engine Integration, Deployment and Testing in AIST Master Schedule on Feb 1, 2017
lewismc moved this from Engine Integration and Deployment to Testing in AIST Master Schedule on Feb 1, 2017
@lewismc (Collaborator) commented Feb 2, 2017

@Yongyao we need to make this a priority. Right now it takes forever.
