Two problems with doc-scraper #770
Replies: 2 comments 4 replies
-
Hello @kappa-wingman! Glad to read you are satisfied with MeiliSearch 😄 About your issues with docs-scraper: I'm going to give you a quick first answer, and I will take the time to give a better one tomorrow or this weekend if this one does not help. I know the docs-scraper README is not very detailed yet; issues are open for that, and we plan to improve it. In addition, here is the config file we use in production for our own documentation. It might help. Tell me if you were already aware of both of these links, or if they do not help!
-
Hello again @kappa-wingman! I'm not sure I understand exactly what you wanted to change, so I followed this idea: find the best config file to scrape your website, so that it can provide a search bar with a great user experience 🙂

TL;DR
Here is the config file I suggest:
{
  "index_uid": "docs",
  "start_urls": [
    "https://www.kappawingman.com"
  ],
  "stop_urls": [
    "https://www.kappawingman.com/archives.html",
    "https://www.kappawingman.com/tags.html",
    "https://www.kappawingman.com/tags/",
    "https://www.kappawingman.com/categories.html",
    "https://www.kappawingman.com/categories/",
    "https://www.kappawingman.com/category/storage.html",
    "https://www.kappawingman.com/category/webdev.html",
    "https://www.kappawingman.com/category/siteinfo.html",
    "https://www.kappawingman.com/category/security.html"
  ],
  "selectors": {
    "lvl0": {
      "selector": ".navbar-nav .active",
      "global": true,
      "default_value": "Kappa ITC Wingman"
    },
    "lvl1": "#content h1",
    "lvl2": "#content h2",
    "text": "#content p, #content li"
  },
  "custom_settings": {
    "synonyms": {
      "static site generator": ["ssg"],
      "ssg": ["static site generator"]
    },
    "stopWords": ["of", "the", "for", "from"]
  },
  "scrap_start_urls": false,
  "nb_hits": 226
}
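To make the role of `stop_urls` more concrete, here is a minimal Python sketch of how such a list can filter out pages before they are scraped. It is only an illustration, not the scraper's actual code: it assumes each entry is used as a regular expression searched within the URL, and the candidate URLs (apart from the category page taken from the config) are hypothetical.

```python
import re

# A few entries copied from the "stop_urls" list in the config above.
STOP_URLS = [
    "https://www.kappawingman.com/archives.html",
    "https://www.kappawingman.com/tags/",
    "https://www.kappawingman.com/categories/",
    "https://www.kappawingman.com/category/storage.html",
]


def is_stopped(url, stop_urls=STOP_URLS):
    """Return True if the URL matches any stop pattern.

    Assumption: patterns are applied with re.search; the real scraper
    may match them differently.
    """
    return any(re.search(pattern, url) for pattern in stop_urls)


# Hypothetical URLs, for illustration only.
candidates = [
    "https://www.kappawingman.com/tags/pelican.html",
    "https://www.kappawingman.com/category/storage.html",
    "https://www.kappawingman.com/some-article.html",
]

# Only pages that match no stop pattern are kept for scraping.
kept = [url for url in candidates if not is_stopped(url)]
print(kept)
```

With these inputs, the tag and category pages are dropped and only the article page is kept, which is the behavior we want in order to avoid duplicated summaries.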
Explanation

What were the issues I noticed?
Here is the main "issue" with your website: the same content is duplicated many times, because you have a summary of each article on the home page, on the category pages, etc. This makes the site harder to scrape, because it leads to duplicate content in MeiliSearch, and therefore in your search bar. The only article contents we want to scrape are the "final", complete articles in this kind of URL: [...] Even after this first step, there is still an issue with the intermediate pages in [...]

More about selectors
Because you told me the selectors weren't clear to you, I will try to explain them better. Selectors are needed to tell the scraper "I want to get the content in this HTML tag/id/class". This HTML tag/id/class is a selector.

lvl0 selector in your config file
This selector is a little different from the other selectors in your config file. As you can see in the picture above, lvl0 corresponds to the main title in the search bar, so I wanted to make it relevant for your search bar! The [...]

What I removed
I removed the [...] I also removed the [...]

Tips to customize your MeiliSearch
You can notice I added [...] Neither of these settings is mandatory, and your search bar would still work perfectly without them.

The result
You can now search the articles by title, find the article sub-titles (h2), and get relevant results by just typing a word. Your config file is now perfect to provide a relevant search bar! Your blog is really great by the way! I hope you'll keep adding content 😉

Edit
You may notice the highlighting is sometimes buggy in the screenshots I took: that's because I tested on an old MeiliSearch version. The bug is fixed now 😉
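To illustrate what selectors such as `#content h1`, `#content h2` and `#content p, #content li` pick up, here is a small Python sketch using only the standard library. It is not how docs-scraper is implemented; it just mimics those three selectors on a hypothetical page. The sample HTML is invented for the example (the headings are the ones from the question), and the `ContentExtractor` class is a hypothetical helper.

```python
from html.parser import HTMLParser

# Maps tags inside #content to the config keys they feed.
LEVELS = {"h1": "lvl1", "h2": "lvl2", "p": "text", "li": "text"}
# Tags without a closing tag; ignored so the depth counter stays balanced.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class ContentExtractor(HTMLParser):
    """Collects h1/h2/p/li text found inside the element with id="content"."""

    def __init__(self):
        super().__init__()
        self.depth = 0           # nesting depth inside #content (0 = outside)
        self.capture_tag = None  # tag whose text is currently being collected
        self.records = []        # list of [level, text] pairs

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
            if tag in LEVELS and self.capture_tag is None:
                self.capture_tag = tag
                self.records.append([LEVELS[tag], ""])
        elif dict(attrs).get("id") == "content":
            self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        self.depth -= 1
        if tag == self.capture_tag:
            self.capture_tag = None

    def handle_data(self, data):
        if self.capture_tag is not None:
            self.records[-1][1] += data


# Hypothetical page: only the part inside #content should be extracted.
SAMPLE_HTML = """
<nav><h1>Site title</h1></nav>
<div id="content">
  <h1>From WordPress to Pelican</h1>
  <h2>Choosing the Static Site Generator</h2>
  <p>Hypothetical paragraph text.</p>
  <h2>Choices of hosting service provider</h2>
</div>
<footer><p>Footer text</p></footer>
"""

extractor = ContentExtractor()
extractor.feed(SAMPLE_HTML)
for level, text in extractor.records:
    print(level, "->", text)
```

The title in the `nav` and the paragraph in the `footer` are skipped because they sit outside `#content`, which is the point of prefixing every selector with `#content`.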
-
Hello guys, I am using Pelican with MeiliSearch. I am not a front-end developer, and I have run into a small problem with doc-scraper.
Overall I am very satisfied with MeiliSearch 👍
Problem 1:
I have trouble understanding the doc-selector and global in the doc-scraper.
Problem 2:
These are my headings/anchors in the article.
When I type 'hosting' in the search box, the text is stacked together.
From WordPress to Pelican (h1)
Choosing the Static Site Generator (h2)
Customization of the theme (h2)
Choices of hosting service provider (h2)
Here is my config
Only articles like this one, with a TOC, have a left sidebar.
Thanks.