
DynamicExtraction

Dan Brickley edited this page Sep 19, 2017 · 1 revision

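Sometimes sites add Schema.org data dynamically, via JavaScript (e.g. when a widget is loaded from another server that holds the interesting data). In that case the markup is not in the HTML the server sends; it only appears once a browser has run the page's scripts.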

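Here is a quick example that drives a browser from Python with Selenium WebDriver (Firefox here; it can also be run headless), pulls the JSON-LD out of the rendered page with BeautifulSoup, and parses it with rdflib: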

See link for full code: https://gist.github.com/danbri/ad684a50872fffb30e0bbd2c22ea3e18

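import warnings

from bs4 import BeautifulSoup
from rdflib import Graph
from rdflib.plugin import register, Serializer
from selenium import webdriver

# u is the URL of the page to fetch; it is defined in the full code linked above.
# Register the JSON-LD plugin from rdflib-jsonld (rdflib 6.0+ has JSON-LD support built in).
register('json-ld', Serializer, 'rdflib_jsonld.serializer', 'JsonLDSerializer')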
browser = webdriver.Firefox()
browser.get(u)
pagetext  = browser.page_source

with warnings.catch_warnings():

        try:
                warnings.simplefilter("ignore")
                browser.close()
                browser.quit()

        except Exception as e:
                print("...")

soup = BeautifulSoup(pagetext, 'lxml')

for tag in soup.find_all('script'):
        tt = str(tag.get('type',None))
        if tt.endswith("application/ld+json"):
                myJsonLd = tag.get_text()
                g = Graph()
                g.parse(data=myJsonLd, format='json-ld', base=u)
                g.close()
                # Print every triple found in this JSON-LD block
                for s, p, o in g.triples((None, None, None)):
                        print("%s %s %s" % (s, p, o))
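
The same two steps (render the page in a browser, then pick out the JSON-LD script blocks) can also be sketched in Python 3 as a pair of small functions. This is only an illustration under stated assumptions, not the code from the gist: the names `fetch_rendered_html`, `JsonLdCollector` and `extract_json_ld` are hypothetical, the browser is started headless using Selenium's Firefox `Options`, and the standard library's `html.parser` and `json` are swapped in for BeautifulSoup and rdflib. As a result it returns plain decoded JSON objects rather than RDF triples.

```python
import json
from html.parser import HTMLParser


def fetch_rendered_html(url):
    """Return the page source after scripts have run, using headless Firefox."""
    # Imported lazily so the rest of this sketch works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options

    options = Options()
    options.add_argument("-headless")  # run Firefox without opening a window
    browser = webdriver.Firefox(options=options)
    try:
        browser.get(url)
        return browser.page_source
    finally:
        browser.quit()  # always shut the browser down, even on errors


class JsonLdCollector(HTMLParser):
    """Collect the text of every <script type="application/ld+json"> element."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._buffer = None  # a list while inside a JSON-LD script, else None

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").strip().lower()
        if tag == "script" and script_type == "application/ld+json":
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._buffer is not None:
            self.blocks.append("".join(self._buffer))
            self._buffer = None


def extract_json_ld(html):
    """Parse each JSON-LD block in html; skip blocks that are not valid JSON."""
    collector = JsonLdCollector()
    collector.feed(html)
    collector.close()
    documents = []
    for block in collector.blocks:
        try:
            documents.append(json.loads(block))
        except ValueError:
            continue
    return documents
```

With Selenium and Firefox installed, `extract_json_ld(fetch_rendered_html(u))` would give one Python object per JSON-LD block on the page at `u`. Widgets that load their data late may need an explicit wait before `page_source` is read. To get triples, each block's text can still be handed to rdflib's `Graph.parse` as in the example above.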