The rscopus package uses the Scopus API to run queries about authors and affiliations.
Here we will use an example from Clarke Iakovakis.
First, let’s load in the packages we’ll need.
Next, we need to check whether an API key is available. See the API key vignette for more information on how to set the keys up. We will use the have_api_key() function.
We also need to make sure you are authorized to search Scopus, using is_elsevier_authorized(). If you have an API key but is_elsevier_authorized() is FALSE, it is likely because you are not on the IP range that the key requires:
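A minimal sketch of this check; the resulting authorized flag gates the search code in the chunks below:

```r
library(rscopus)

# Only attempt Scopus searches when a key is set and the key
# actually works from the current IP address.
authorized = have_api_key() && is_elsevier_authorized()
authorized
```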
Here we will create a query of a specific affiliation, subject area, publication year, and type of access (OA = open access). Let’s look at the different types of subject areas:
rscopus::subject_areas()
#> [1] "AGRI" "ARTS" "BIOC" "BUSI" "CENG" "CHEM" "COMP" "DECI" "DENT" "EART"
#> [11] "ECON" "ENER" "ENGI" "ENVI" "HEAL" "IMMU" "MATE" "MATH" "MEDI" "NEUR"
#> [21] "NURS" "PHAR" "PHYS" "PSYC" "SOCI" "VETE" "MULT"
These categories are helpful because searching all documents in a single call would be too large, and we may get rate limited. Instead, we can search each subject area separately, store the information, save the results, merge them, and then analyze the combined data.
The author of this example was analyzing data from OSU (Oklahoma State University) and uses the affiliation ID for that institution (60006514). If you know the institution name but not the ID, you can use process_affiliation_name to retrieve it.
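As a hedged sketch of that lookup, assuming the institution name is passed as the first argument (argument names may differ between rscopus versions), and noting the call requires an authorized API key:

```r
library(rscopus)

# Look up the Scopus affiliation ID from an institution name.
if (is_elsevier_authorized()) {
  osu_id = process_affiliation_name("Oklahoma State University")
  print(osu_id)
}
```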
Here we make the queries for each subject area:
Let’s pull the information for the first subject area. Note that the count may depend on your API key limits. We are also asking for the COMPLETE view rather than the STANDARD view. The max_count argument is set to 20000, which may not be enough for your query; adjust it if needed.
if (authorized) {
  # Build a Scopus query for one subject area at OSU, restricted
  # to open-access publications from 2018
  make_query = function(subj_area) {
    paste0("AF-ID(60006514) AND SUBJAREA(",
           subj_area,
           ") AND PUBYEAR = 2018 AND ACCESSTYPE(OA)")
  }
  i = 3
  subj_area = subject_areas()[i]
  print(subj_area)
  completeArticle <- scopus_search(
    query = make_query(subj_area),
    view = "COMPLETE",
    count = 200)
  print(names(completeArticle))
  total_results = completeArticle$total_results
  total_results = as.numeric(total_results)
} else {
  total_results = 0
}
Here we see the total number of results for the query. This is useful for checking whether total_results is 0, or whether it is greater than the max count specified (in which case not all matching Scopus records are returned).
The gen_entries_to_df function is an attempt at turning the parsed JSON from the API output into something more manageable. You may want to go over the get_statements list elements in the output of completeArticle. The original content can be extracted using httr::content(), where the type can be specified, such as "text", and then jsonlite::fromJSON can be applied explicitly to the JSON output. Alternatively, arguments to jsonlite::fromJSON, such as flatten or simplifyDataFrame, can be passed directly into httr::content(). These are all alternative options, but we will use rscopus::gen_entries_to_df. The output is a list of data.frames after we pass in the entries elements from the list.
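As a sketch, the two approaches look like this, assuming completeArticle comes from the scopus_search call above (element names taken from the rscopus output described in this vignette):

```r
library(rscopus)
library(httr)
library(jsonlite)

# Option 1: let rscopus reshape the parsed entries into data.frames
df_list = gen_entries_to_df(completeArticle$entries)
names(df_list)

# Option 2: re-parse a raw HTTP response yourself
raw = completeArticle$get_statements[[1]]
txt = content(raw, as = "text", encoding = "UTF-8")
parsed = fromJSON(txt, flatten = TRUE)
```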
if (authorized) {
  # areas = subject_areas()[12:13]
  areas = c("ENER", "ENGI")
  names(areas) = areas
  results = purrr::map(
    areas,
    function(subj_area) {
      print(subj_area)
      completeArticle <- scopus_search(
        query = make_query(subj_area),
        view = "COMPLETE",
        count = 200,
        verbose = FALSE)
      return(completeArticle)
    })
  entries = purrr::map(results, function(x) {
    x$entries
  })
  total_results = purrr::map_dbl(results, function(x) {
    as.numeric(x$total_results)
  })
  total_results = sum(total_results, na.rm = TRUE)
  df = purrr::map(entries, gen_entries_to_df)
  MainEntry = purrr::map_df(df, function(x) {
    x$df
  }, .id = "subj_area")
  ddf = MainEntry %>%
    filter(as.numeric(`author-count.$`) > 99)
  if ("message" %in% colnames(ddf)) {
    ddf = ddf %>%
      select(message, `author-count.$`)
    print(head(ddf))
  }
  MainEntry = MainEntry %>%
    mutate(
      scopus_id = sub("SCOPUS_ID:", "", `dc:identifier`),
      entry_number = as.numeric(entry_number),
      doi = `prism:doi`)
  #################################
  # remove duplicated entries
  #################################
  MainEntry = MainEntry %>%
    filter(!duplicated(scopus_id))
  Authors = purrr::map_df(df, function(x) {
    x$author
  }, .id = "subj_area")
  Authors$`afid.@_fa` = NULL
  Affiliation = purrr::map_df(df, function(x) {
    x$affiliation
  }, .id = "subj_area")
  Affiliation$`@_fa` = NULL
  # keep only these non-duplicated records
  MainEntry_id = MainEntry %>%
    select(entry_number, subj_area)
  Authors = Authors %>%
    mutate(entry_number = as.numeric(entry_number))
  Affiliation = Affiliation %>%
    mutate(entry_number = as.numeric(entry_number))
  Authors = left_join(MainEntry_id, Authors)
  Affiliation = left_join(MainEntry_id, Affiliation)
  # first filter to get only OSU authors
  osuauth <- Authors %>%
    filter(`afid.$` == "60006514")
}
At the end of the day, we have the author-level information for each paper. The entry_number column will join these data.frames if necessary. In this example, the df element has the paper-level information, and the author data.frame has author information, including affiliations. There can be multiple affiliations, even within an institution, such as multiple department affiliations within an institution affiliation. The affiliation information describes those affiliations and can be merged with the author information.
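A sketch of that merge, joining on the subj_area and entry_number columns that both data.frames carry from the steps above (any additional join keys would depend on the actual gen_entries_to_df output):

```r
library(dplyr)

# Attach affiliation details to each author row
AuthorAffil = left_join(Authors, Affiliation,
                        by = c("subj_area", "entry_number"))
```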
Here we look at the funding agencies listed on all the papers. This can show whether there is a pattern between the funding sponsor and open-access publication. More broadly, if a specific funder requires open access, we would want to see the funding of all the papers. This check helps libraries and researchers ensure they are following the funding agency's guidelines.
if (total_results > 0) {
  cn = colnames(MainEntry)
  cn[grep("fund", tolower(cn))]
  tail(sort(table(MainEntry$`fund-sponsor`)))
  funderPoland <- filter(
    MainEntry,
    `fund-sponsor` == "Ministerstwo Nauki i Szkolnictwa Wyższego")
  dim(funderPoland)
  osuFunders <- MainEntry %>%
    group_by(`fund-sponsor`) %>%
    tally() %>%
    arrange(desc(n))
  osuFunders
}
The Scopus API has limits for different searches and calls. Using a combination of APIs, we can gather all the information on authors that we would like. This gives us a full picture of the authors and co-authorship at a specific institution in specific scenarios, such as the open access publications from 2018.