
Crawler detection improvement #50

Open
Yongyao opened this issue Aug 25, 2016 · 5 comments

@Yongyao (Collaborator) commented Aug 25, 2016

Write the implementation and also write tests to validate it.
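For illustration, a minimal sketch of the kind of validation test this could mean (JUnit 4; the `isCrawler` predicate below is a hypothetical stand-in, not the real CrawlerDetection API):

```java
import static org.junit.Assert.assertFalse;
import static org.junit.Assert.assertTrue;

import org.junit.Test;

public class CrawlerDetectionTest {

  // Hypothetical stand-in predicate; the real CrawlerDetection API may differ.
  private boolean isCrawler(String agent) {
    return agent == null
        || agent.toLowerCase().matches(".*(bot|crawler|spider).*");
  }

  @Test
  public void flagsKnownCrawlerAgents() {
    assertTrue(isCrawler("Mozilla/5.0 (compatible; Googlebot/2.1)"));
  }

  @Test
  public void passesOrdinaryBrowserAgents() {
    assertFalse(isCrawler("Mozilla/5.0 (Windows NT 10.0; Win64; x64) Firefox/48.0"));
  }
}
```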

lewismc added this to the 09/02/2016 milestone on Aug 26, 2016
@lewismc (Collaborator) commented Aug 26, 2016

This issue concerns the code present within CrawlerDetection.java, which is too static in nature and does not accurately capture all of the Web crawlers that appear within HTTP, FTP, etc. logs. We need to use a more intelligent mechanism for detecting Web crawlers within the PO.DAAC logs.
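As a strawman, here is a minimal sketch of one more flexible direction (class and file names are hypothetical; this is not the current CrawlerDetection.java API): load the crawler user-agent patterns from an external file so the list can be updated without recompiling.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

/**
 * Hypothetical sketch: a crawler detector driven by an external
 * pattern file instead of hard-coded agent names.
 */
public class ConfigurableCrawlerDetector {

  private final List<Pattern> agentPatterns;

  public ConfigurableCrawlerDetector(String patternFile) throws IOException {
    // One case-insensitive regex per line, e.g. "googlebot" or "crawler|spider";
    // blank lines and "#" comments are ignored.
    agentPatterns = Files.readAllLines(Paths.get(patternFile)).stream()
        .map(String::trim)
        .filter(line -> !line.isEmpty() && !line.startsWith("#"))
        .map(line -> Pattern.compile(line, Pattern.CASE_INSENSITIVE))
        .collect(Collectors.toList());
  }

  /** Returns true if the user-agent string matches any known crawler pattern. */
  public boolean isCrawler(String userAgent) {
    if (userAgent == null) {
      return true; // a missing agent string is treated as suspicious
    }
    return agentPatterns.stream().anyMatch(p -> p.matcher(userAgent).find());
  }
}
```

Even so, any list-driven matcher will miss bots that spoof browser user-agents, which is where the behavioral ideas discussed below come in.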

@Yongyao (Collaborator, Author) commented Aug 26, 2016

Any quick solutions or suggestions right now?


@lewismc (Collaborator) commented Aug 26, 2016

This is not an easy issue to tackle. I've been writing Web crawlers (and search engines) for years, so I have experienced this issue from both sides of the table.

Sites such as user-agents.org, robotstxt.org, and botsvsbrowsers.com exist; unfortunately, however, we've found that bot activity is too numerous and varied to filter accurately. If you want accurate download counts (or, in our case, dataset landing page hits/downloads), our best bet may be to push a requirement on PO.DAAC to require JavaScript to trigger the download. That is basically the only thing that will reliably filter out the bots. It also means we can catch the requests that are not intelligent enough to invoke JavaScript in order to acknowledge the download. It's also why all site traffic analytics engines these days are JavaScript based.

That being said, we've actually modified Apache Nutch to interact with JavaScript, so with Nutch we can bypass JavaScript download verification as well.

I think that this is a difficult issue... there is actually a good deal of research in this area. I will try to find some and post it here.

lewismc modified the milestones: 09/16/2016, 09/02/2016 on Sep 16, 2016
@lewismc (Collaborator) commented Sep 16, 2016

lewismc modified the milestones: 09/30/2016, 09/16/2016 on Sep 24, 2016
lewismc removed this from the 09/30/2016 milestone on Oct 12, 2016
lewismc added this to Engine Integration, Deployment and Testing in AIST Master Schedule on Feb 1, 2017
lewismc moved this from Engine Integration and Deployment to Testing in AIST Master Schedule on Feb 1, 2017
@lewismc (Collaborator) commented Feb 2, 2017

@Yongyao we need to make this a priority. Right now it takes forever.
