RGGH/Scrapy14

Scrapy14

Scraping News Stories - Multiple Sources

  • 'independent'
  • 'guardian'
  • 'express'

View the YouTube Playlist for the entire project (https://www.youtube.com/playlist?list=PLKMY3XNPiQ7u_ljiiDt1382T9T4xgLpRI)

Objective: run multiple spiders that share ONE items.py and write to a single MySQL database, so the data stays consistent across sources

Check all potential news sites in Scrapy shell first

Use the Scrapy shell's fetch(url, headers={...}) to request a page with custom headers (video: https://youtu.be/UaqSo7hlX9g)

You can also check with scrapy shell and curl

[Image: Curl from Browser]

[Image: Curl Scrapy]

### Plan the columns / fields for "items" to scrape

This project also includes a fix for the 'module not found' error that can appear when a spider imports from items.py:

Add this alongside the imports in each spider

import sys
sys.path.insert(0,'..')
from items import NewzzItem

[Image: Scrapy import from items 'module not found' error]
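The '..' in the fix above is resolved against the current working directory, so it depends on where the spider is launched from. A possible variant, sketched here, derives the directory from the spider file's own location instead. The helper name add_project_dir_to_path is hypothetical, and the sketch assumes items.py sits one directory above the spiders folder.

```python
import sys
from pathlib import Path


def add_project_dir_to_path(spider_file):
    """Put the directory above the spider's folder on sys.path and return it."""
    project_dir = str(Path(spider_file).resolve().parent.parent)
    if project_dir not in sys.path:
        sys.path.insert(0, project_dir)
    return project_dir


# In a spider module you would call it like this (left commented out here
# because it only makes sense inside a real project layout):
# add_project_dir_to_path(__file__)
# from items import NewzzItem
```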

Add new database to MySQL

sudo mysql -u root -p -h localhost

DROP DATABASE IF EXISTS newz;
CREATE DATABASE newz;

GRANT ALL PRIVILEGES ON newz.* TO 'pi'@'localhost';

FLUSH PRIVILEGES;

Allow remote connection to database

GRANT ALL PRIVILEGES ON newz.* TO 'user1'@'%';
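Once the database exists, each scraped item has to become a row. The sketch below shows one way to build a parameterized INSERT from an item. The column names, the table name "stories" and the helper build_insert are assumptions for illustration, not this project's schema. The %s placeholder style is the one used by the common MySQL drivers for Python.

```python
# Hypothetical column list; it should match the fields declared in items.py.
COLUMNS = ("source", "title", "url", "author", "published")


def build_insert(table, item):
    """Return (sql, params) for a parameterized INSERT of one item.

    `table` must be a trusted constant: it is interpolated into the SQL text.
    Item values are passed separately so the driver can escape them.
    """
    column_list = ", ".join(COLUMNS)
    placeholders = ", ".join(["%s"] * len(COLUMNS))
    sql = f"INSERT INTO {table} ({column_list}) VALUES ({placeholders})"
    # Missing fields become None, which the driver stores as NULL.
    params = tuple(item.get(col) for col in COLUMNS)
    return sql, params


sql, params = build_insert("stories", {"source": "guardian", "title": "T", "url": "u"})
print(sql)
print(params)
```

In a Scrapy item pipeline, process_item could pass the result to cursor.execute(sql, params) on a connection to the newz database and then commit.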

XPath selectors - some more advanced examples

response.xpath('//*[@id="articleHeader"]//a[contains(@href,"/author/")]/text()')[0].get()
response.xpath('//a[@class="title"][not(contains(@href,"https://www.independent.co.uk/vouchercodes"))]/@href')

More to follow. Also visit my web scraping and automation site: https://redandgreen.co.uk/