A web services application that scrapes the NHS Choices website and provides a smart search engine over it

andrei-l/health-site-scrapper-with-search-engine

Project details

This is a Scala-based project built on Spring Boot and Elasticsearch. To view the API, open http://localhost:19000/swagger-ui.html and look at nhs-choices-api-endpoint. You can use this UI to operate the scraper and the search engine. It has 3 main methods:

  • GET /nhs-conditions/cache - returns the cache, resolved in this order: a) if the cache is held in application memory, it is returned from there; b) if there is no cache in memory, it is loaded from the file nhs-choices.json, and application memory and the Elasticsearch index are updated; c) if the file does not exist, the website is scraped, and the file, application memory, and the Elasticsearch index are updated

  • POST /nhs-conditions/cache/reload - scrapes the website, then updates the file, application memory, and the Elasticsearch index

  • GET /nhs-conditions?q=<query> - performs a search against the Elasticsearch index
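
The fallback order used by the two cache endpoints can be sketched roughly as follows. This is an illustrative sketch, not the project's actual code: the class name `ConditionsCache` and the `scrape` and `index` parameters are hypothetical stand-ins for the real scraper and the Elasticsearch indexing call, and the cache is treated as one JSON string.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path}

// Hypothetical three-tier cache: application memory -> JSON file -> fresh scrape.
// `scrape` stands in for the website scraper, `index` for the Elasticsearch update.
final class ConditionsCache(file: Path, scrape: () => String, index: String => Unit) {
  private var memory: Option[String] = None

  // Mirrors GET /nhs-conditions/cache
  def get(): String = synchronized {
    memory.getOrElse {
      if (Files.exists(file)) {
        // b) no cache in memory: load the file, then update memory and the index
        val json = new String(Files.readAllBytes(file), StandardCharsets.UTF_8)
        memory = Some(json)
        index(json)
        json
      } else {
        // c) no file either: fall back to a full scrape
        reload()
      }
    }
  }

  // Mirrors POST /nhs-conditions/cache/reload
  def reload(): String = synchronized {
    val json = scrape()
    Files.write(file, json.getBytes(StandardCharsets.UTF_8))
    memory = Some(json)
    index(json)
    json
  }
}
```

With this shape, only the first request after start-up pays for the file read or the scrape; later calls to `get()` are served from memory until `reload()` is called.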

The scraped website is approximately 10.8 MB as JSON. Search queries almost always return the best match; further tuning of the Elasticsearch index configuration would improve the results.
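
To try the GET /nhs-conditions?q=<query> endpoint outside the Swagger UI, a small client along these lines may help. It is a sketch under assumptions: the object name `SearchClient` is hypothetical, the base URL assumes the API is served on the same host and port as the Swagger page above, and the response is returned as a raw string because its format is not described here.

```scala
import java.net.URLEncoder
import scala.io.Source

object SearchClient {
  // Assumed base URL, taken from the Swagger address in this README.
  val baseUrl = "http://localhost:19000"

  // Builds the search URL, URL-encoding the query text.
  def searchUrl(query: String): String =
    s"$baseUrl/nhs-conditions?q=" + URLEncoder.encode(query, "UTF-8")

  // Performs the GET request and returns the raw response body.
  def search(query: String): String = {
    val source = Source.fromURL(searchUrl(query), "UTF-8")
    try source.mkString finally source.close()
  }
}
```

For example, `SearchClient.searchUrl("sore throat")` produces `http://localhost:19000/nhs-conditions?q=sore+throat`, and `SearchClient.search("sore throat")` requires the application to be running locally.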

Project Build Details:

To build the application, run sbt oneJar. This creates a runnable JAR file, which you can start with java -jar nhs-choices_2.11-1.0-one-jar.jar

Implementation Details:

Frameworks and software used: SBT + Scala + Spring Boot + Elasticsearch + ScalaScrapper + Jackson

There is a test that runs the complete scraping process against a few pages and returns the result (it emulates a GET /nhs-conditions/cache request).
