Create custom scraper #18

Open
bidoubiwa opened this issue Jun 20, 2023 · 0 comments
Labels
enhancement New feature or request

bidoubiwa commented Jun 20, 2023

Currently, there are three strategies:

  • default
  • docssearch
  • schema

Each one covers a common use case. We assume that together they handle about 90% of website structures.
Nonetheless, some sites may not follow these structures, or the user may want to index only a specific part of each page rather than all of it.

For these cases, we want to make it possible to provide a custom scraper.
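
As a rough sketch of how a custom scraper could sit alongside the three built-in strategies, the snippet below resolves a scraper from a config object. The `strategy` key, the `custom` value, the `custom_scraper` field and the `resolveScraper` helper are all assumptions made for illustration; none of them is confirmed by this issue.

```javascript
// Hypothetical sketch: every name here is illustrative, not the project's API.
const BUILT_IN_STRATEGIES = ["default", "docssearch", "schema"];

// Pick a built-in strategy by name, or fall back to a user-supplied function.
function resolveScraper(config) {
  const strategy = config.strategy || "default";
  if (BUILT_IN_STRATEGIES.includes(strategy)) {
    return { kind: "built-in", name: strategy };
  }
  if (strategy === "custom" && typeof config.custom_scraper === "function") {
    return { kind: "custom", scrape: config.custom_scraper };
  }
  throw new Error(`Unknown strategy: ${strategy}`);
}

// Usage: a user-provided function that keeps only part of the page data.
const resolved = resolveScraper({
  strategy: "custom",
  custom_scraper: (pageData) => ({ title: pageData.title }),
});
console.log(resolved.kind);
```

Keeping the built-in names in one list means an unknown or misspelled strategy fails loudly instead of silently falling back to the default.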

A base for the custom scraper was written, but it has since been removed:

import { v4 as uuidv4 } from "uuid";

export default class CustomScraper {
  constructor(sender, config) {
    console.info("CustomScraper::constructor");
    this.sender = sender;
    this.config = config;
    // Assumption: the field flags live under `config.custom_crawler`.
    // The original draft read `this.custom_crawler` without ever setting it.
    this.custom_crawler = config.custom_crawler || {};
    if (config.custom_settings) {
      this.sender.updateSettings(config.custom_settings);
    }
  }

  // Build one document for `url` from the browser `page` and send it.
  async get(url, page) {
    const data = {};
    if (this.custom_crawler.get_title) {
      data.title = await page.title();
    }
    data.uid = uuidv4();
    if (this.custom_crawler.get_meta) {
      // Collect every <meta name="..." content="..."> pair on the page.
      data.meta = await page.evaluate(() => {
        const metas = document.getElementsByTagName("meta");
        const meta = {};
        for (let i = 0; i < metas.length; i++) {
          const name = metas[i].getAttribute("name");
          const content = metas[i].getAttribute("content");
          if (name && content) {
            meta[name] = content;
          }
        }
        return meta;
      });
    }
    if (this.custom_crawler.get_url) {
      data.url = url;
    }
    await this.sender.add(data);
  }
}

It will stay removed until we deem this feature necessary.
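
To make the intended behavior easier to reason about, the field selection can be sketched as a pure function that runs without a browser. The `get_title`, `get_meta` and `get_url` flags come from the removed draft; `buildDocument`, `collectMeta` and the `metaTags` input shape are hypothetical helpers introduced only for this illustration, and the `uid` is left out so the result is deterministic.

```javascript
// Hypothetical sketch: `collectMeta` and `buildDocument` are illustrative helpers.

// Turn a list of { name, content } pairs (as read from <meta> tags) into an
// object, skipping tags that lack either attribute.
function collectMeta(metaTags) {
  const meta = {};
  for (const { name, content } of metaTags) {
    if (name && content) {
      meta[name] = content;
    }
  }
  return meta;
}

// Build the document to index, keeping only the fields enabled in `options`.
function buildDocument(url, pageData, options = {}) {
  const doc = {};
  if (options.get_title) doc.title = pageData.title;
  if (options.get_meta) doc.meta = collectMeta(pageData.metaTags);
  if (options.get_url) doc.url = url;
  return doc;
}

// Usage: only the title and meta tags are requested, so `url` is omitted.
const doc = buildDocument(
  "https://example.com/docs",
  {
    title: "Docs",
    metaTags: [
      { name: "description", content: "Intro" },
      { name: null, content: "ignored" },
    ],
  },
  { get_title: true, get_meta: true }
);
console.log(doc);
```

Separating the selection logic from the browser calls would let this part be unit-tested on its own if the feature is brought back.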

bidoubiwa added the enhancement label Jun 20, 2023