DynamicExtraction
Dan Brickley edited this page Sep 19, 2017 · 1 revision
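Sometimes sites add Schema.org data dynamically, via JavaScript (e.g. when widgets are added from another server that holds interesting data). Such data is not in the HTML the server sends, so the page has to be rendered before it can be extracted.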
Here is a quick example that drives a browser from Python (via Selenium), then extracts the data with rdflib:
See link for full code: https://gist.github.com/danbri/ad684a50872fffb30e0bbd2c22ea3e18
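import warnings

from bs4 import BeautifulSoup
from rdflib import Graph
from rdflib.plugin import register, Serializer
from selenium import webdriver

# Only needed with rdflib < 6 and the separate rdflib-jsonld package;
# rdflib 6+ ships JSON-LD support, so this line can be dropped there.
register('json-ld', Serializer, 'rdflib_jsonld.serializer', 'JsonLDSerializer')

u = "https://example.org/"  # placeholder: the URL of the page to inspect

# Load the page in Firefox so that its scripts run, then grab the resulting DOM.
# Note: this opens a visible window unless headless options are passed.
browser = webdriver.Firefox()
browser.get(u)
pagetext = browser.page_source

# Shut the browser down, ignoring any warnings or errors raised on the way out.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    try:
        browser.close()
        browser.quit()
    except Exception:
        print("...")

# Parse each JSON-LD script block into an RDF graph and print its triples.
soup = BeautifulSoup(pagetext, 'lxml')
for tag in soup.find_all('script'):
    tt = str(tag.get('type', None))
    if tt.endswith("application/ld+json"):
        myJsonLd = tag.get_text()
        g = Graph()
        g.parse(data=myJsonLd, format='json-ld', base=u)
        for s, p, o in g.triples((None, None, None)):
            print("%s %s %s" % (s, p, o))
        g.close()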
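The extraction step can also be tried without a browser or rdflib. The following is a minimal sketch using only the Python standard library, assuming the rendered page source is already available as a string. The names `JsonLdScriptCollector`, `extract_json_ld` and `SAMPLE_HTML` are hypothetical and not part of the gist, and the sample markup is a made-up stand-in for `browser.page_source`. It returns plain Python dicts rather than RDF triples, so it is a quick check that a page carries JSON-LD, not a replacement for the rdflib parsing above.

```python
import json
from html.parser import HTMLParser


class JsonLdScriptCollector(HTMLParser):
    """Collect the raw text of every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self.blocks = []      # finished script bodies, in document order
        self._buffer = None   # text chunks while inside a matching script tag

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            script_type = (dict(attrs).get("type") or "").strip().lower()
            if script_type == "application/ld+json":
                self._buffer = []

    def handle_data(self, data):
        # Script content may arrive in several chunks, so accumulate it.
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._buffer is not None:
            self.blocks.append("".join(self._buffer))
            self._buffer = None


def extract_json_ld(html):
    """Return the parsed JSON-LD documents found in an HTML string."""
    collector = JsonLdScriptCollector()
    collector.feed(html)
    collector.close()
    documents = []
    for block in collector.blocks:
        try:
            documents.append(json.loads(block))
        except ValueError:
            # Skip blocks that are not valid JSON instead of failing the page.
            continue
    return documents


# Hypothetical stand-in for browser.page_source after the scripts have run.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Event", "name": "Example Meetup"}
</script>
<script type="text/javascript">var x = 1;</script>
</head><body></body></html>
"""

for doc in extract_json_ld(SAMPLE_HTML):
    print(doc.get("@type"), doc.get("name"))
```

In this sketch, blocks that fail to parse as JSON are skipped silently, which is a design choice for illustration. A real crawler would probably want to log them.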