Skip to content

An R package to scrape meta-data from scientific articles hosted on Scielo

License

Notifications You must be signed in to change notification settings

meirelesff/rscielo

Repository files navigation

rscielo

Authors: Fernando Meireles, Denisson Silva, and Rogerio Barbosa

Travis-CI Build Status AppVeyor Build Status Package-License CRAN_Status_Badge Lifecycle: maturing

rscielo offers functions to easily scrape bibliometric information from scientific journals and articles hosted on the Scientific Electronic Library Online Platform (Scielo.org). The retrieved data includes a journal’s details and citation counts; article’s contents, footnotes, bibliographic references; and several other common information used in bibliometric studies. The package also provides functions to quickly summarize the scraped data.

Installing

To install the latest stable release of rscielo from CRAN, use:

install.packages("rscielo")

Alternatively, one may install the latest pre-release version from GitHub via:

if(!require("remotes")) install.packages("remotes")
remotes::install_github("meirelesff/rscielo")

How does it work?

At its core, rscielo is a scraper that offers a transparent and reproducible approach to gather data from the Scientific Electronic Library Online Platform (Scielo.br), one of the largest open repositories for scientific publications in the world. In particular, the package provides functions to automatically extract and parse different types of information from (1) scientific journals (pointed by _journal or _journal_ in their names) and (2) articles (with functions that contains _article or _article_ in their names).

Data from journals

Getting a journal’s ID

To get data from a particular journal, such as citation counts and ISSN, the rscielo relies on an ID (or pid) that uniquely identifies each journal within the Scielo repository. As an example, this is the URL of the Brazilian Political Science Review homepage on Scielo:

http://www.scielo.br/scielo.php?script=sci_serial&pid=1981-3821&lng=en&nrm=iso

The journal ID can be found between &pid= and &lng (i.e., 1981-3821). Most of rscielo’s functions that retrieve data from journals rely on this information to work. To automatically extract an ID from the URL of a journal, one may use the get_journal_id() function:

get_journal_id("http://www.scielo.br/scielo.php?script=sci_serial&pid=1981-3821&lng=en&nrm=iso")
#> [1] "1981-3821"

Scraping data from a journal

With a journal ID in hand, use the get_journal() function to scrape meta-data from all articles published in its last issue:

df <- get_journal("1981-3821")

This code returns a tibble in which the observations correspond to the articles that appeared in the selected journal’s lastest issue. Among the returned variables are authors’ names, institutional affiliations, and home countries; articles’ abstracts, keywords, and the number of pages (check the get_journal documentation executing help(get_journal) for a full description of the retrieved data).

For a quick glimpse at the scraped data, one may use the summary method:

summary(df)
#> 
#> ### JOURNAL: Brazilian Political Science Review
#> 
#> 
#>  Total number of articles:  1 
#>  Total number of articles (reviews excluded):  1
#> 
#>  Mean number of authors per article:  5 
#>  Mean number of pages per article:  Not available

get_journal() also extracts data from all articles ever published by a journal. To do that, set the argument last_issue to FALSE:

get_journal("1981-3821", last_issue = "FALSE")

Scraping journal metrics

rscielo contains functions to scrape and report publication and citation counts of a journal:

# Gets citation metrics
cit <- get_journal_metrics("1981-3821")

# Plots the data for a quick visualization
plot(cit)

Other functions

get_journal_info() and get_journal_list() scrapes a journal’s meta-information (publisher, ISSN, and mission) and a list of all journals hosted on Scielo, respectively:

# Get a journal's meta-information
meta_info <- get_journal_info("1981-3821")


# Get a list with all journals names, URLs and IDs
journals <- get_journal_list()

Data from articles

Getting an articles’ ID

Scientific articles stored on Scielo are also identified by a unique ID, which is formed by a combination between their Digital Object Identifiers (DOI) plus other characters. These IDs can se seen in each article’s URL (after &pid= until &lng=):

# URL of an article
url_article <- "http://www.scielo.br/scielo.php?script=sci_arttext&pid=S1981-38212016000200201&lng=en&nrm=iso&tlng=en"

By design, rscielo handles full articles’ URLs as inputs, but users may obtain the IDs by using the get_article_id function:

get_article_id(url_article)
#> [1] "S1981-38212016000200201"

Contents of a single article

To scrape the content of a single scientific article, the rscielo provides the get_article() function:

# Scrape the meta-data
article <- get_article(url_article)

As can be seen, the function returns the full text of the requested article as a character vector. Users may also pass the article’s ID to the function to achieve the same results:

article <- get_article("S1981-38212016000200201")

Or set the argument output_text to FALSE to get a tibble with the article’s DOI (which might be useful in bibliometric analysis):

article <- get_article("S1981-38212016000200201", output_text = FALSE)

Meta-data of an article

Similar to the get_journal() function, get_article_meta returns meta-data of a selected article hosted on Scielo:

url <- "http://www.scielo.br/scielo.php?script=sci_arttext&pid=S1981-38212016000200201&lng=en&nrm=iso&tlng=en"
article_meta <- get_article_meta(url)

Bibliographic references and footnotes

To retrieve a list of bibliographic items cited by an article, use get_article_referencs():

article_references <- get_article_references(url)

The function outputs a tibble in which every bibliographic item corresponds to an observation. get_article_footnotes() returns a similar object, but with footnotes in the rows:

article_foots <- get_article_footnotes(url)

A list of functions

For convenience, here is a description of the rscielo functions.

Function to extract data from journals:

  • get_journal_id(): Get a journal’s ID from its URL.
  • get_journal(): Get meta-data of all articles published by a journal.
  • get_journal_info(): Get a journal’s description.
  • get_journal_list(): Get a list with all journals’ names, URLs and ID’s.
  • get_journal_metrics(): Get publication and citation counts of a journal.

Function to extract data from articles:

  • get_article_id(): Get an article’s ID from its URL.
  • get_article(): Get the full text of a single article.
  • get_article_meta(): Get meta-data of a single article.
  • get_article_referencs(): Get the list of bibliographic references cited by a single article.
  • get_article_footnotes(): Get the list of the footnotes of a single article.

Methods:

  • summary.Scielo(): Summarize the data of a tibble returned by get_journal.
  • plot.scielo_metrics(): Plot citation counts of a journal retrieved by get_journal_metrics.

A note about the data

The rscielo‘s functions extract data directly from the Scielo online repository. In any event, sometimes users might find errors or obtain incomplete information when using its functions, mainly when using the _article ones to scrape articles’ full contents. This happens when journals feeds invalid or wrongly formatted information into the Scielo platform. In most situations, a bit of data cleaning solves the issues, but users must be aware that the retrieved data still might be lacking.

Citation

To cite rscielo in publications, use:

citation("rscielo")
#> 
#> To cite package 'rscielo' in publications use:
#> 
#>   Fernando Meireles, Denisson Silva and Rogerio Barbosa (2019).
#>   rscielo: A Scraper for Scientific Journals Hosted on Scielo. R
#>   package version 1.0.0.
#>   https://CRAN.R-project.org/package=rscielo
#> 
#> A BibTeX entry for LaTeX users is
#> 
#>   @Manual{,
#>     title = {rscielo: A Scraper for Scientific Journals Hosted on Scielo},
#>     author = {Fernando Meireles and Denisson Silva and Rogerio Barbosa},
#>     year = {2019},
#>     note = {R package version 1.0.0},
#>     url = {https://CRAN.R-project.org/package=rscielo},
#>   }

Contributions

We welcome comments or suggestions to improve the package. Feel free to start a issue at our GitHub repository.

About

An R package to scrape meta-data from scientific articles hosted on Scielo

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Languages