Title: | Google Citation Parser |
---|---|
Description: | Scrapes Google Citation pages and creates data frames of citations over time. |
Authors: | John Muschelli [aut, cre] |
Maintainer: | John Muschelli <[email protected]> |
License: | GPL-3 |
Version: | 0.10.1 |
Built: | 2024-11-04 03:03:13 UTC |
Source: | https://github.com/muschellij2/gcite |
Takes a vector of authors and then creates a frequency table of those words and plots a wordcloud
author_cloud(
  authors,
  addstopwords = gcite_stopwords(),
  author_pattern = NULL,
  split = ",",
  verbose = TRUE,
  colors = c("#66C2A4", "#41AE76", "#238B45", "#006D2C", "#00441B"),
  ...
)

author_frequency(
  authors,
  author_pattern = NULL,
  split = ",",
  addstopwords = gcite_stopwords(),
  verbose = TRUE
)
authors |
Vector of authors of papers |
addstopwords |
Additional words to remove from wordcloud |
author_pattern |
regular expression for patterns to exclude from individual authors |
split |
split author names (default ",") |
verbose |
Print diagnostic messages |
colors |
color words from least to most frequent. Passed to
|
... |
additional options passed to |
A data.frame of the words and the frequencies of the authors
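As a rough illustration of the kind of frequency table described above, the following base R sketch splits author strings on a separator and counts each name. It is not the package's implementation: `count_authors` is a hypothetical helper, and the real `author_frequency` may normalize names or apply stopwords differently.

```r
# Hypothetical helper (not part of gcite): count how often each author
# name appears across a vector of separator-delimited author strings.
count_authors <- function(authors, split = ",") {
  # Split each string on the separator (fixed = TRUE: literal, not regex)
  pieces <- unlist(strsplit(authors, split = split, fixed = TRUE))
  # Trim surrounding whitespace and drop empty entries
  pieces <- trimws(pieces)
  pieces <- pieces[nzchar(pieces)]
  # Tabulate, most frequent first, and return as a data.frame
  tab <- sort(table(pieces), decreasing = TRUE)
  data.frame(
    word = names(tab),
    freq = as.integer(tab),
    stringsAsFactors = FALSE
  )
}

authors <- c("J Muschelli, A Smith", "A Smith, B Jones", "J Muschelli")
count_authors(authors)
```

The author strings here are made up for the example; in practice they would come from the `authors` column of a paper data frame.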
## Not run:
L = gcite_author_info("John Muschelli")
paper_df = L$paper_df
authors = paper_df$authors
author_cloud(authors)
## End(Not run)
Wraps getting the information from Google Citations and plotting the wordcloud
gcite(
  author,
  user,
  plot_wordcloud = TRUE,
  author_args = list(),
  title_args = list(),
  warn = FALSE,
  force = FALSE,
  sleeptime = 0,
  ...
)
author |
author name separated by spaces |
user |
user ID for google Citations |
plot_wordcloud |
should the wordcloud be plotted |
author_args |
Arguments to pass to |
title_args |
Arguments to pass to |
warn |
should warnings be printed from wordcloud? |
force |
If passing a URL and there is a failure, should the
program return |
sleeptime |
time in seconds between http requests, to avoid Google Scholar rate limit |
... |
additional options passed to |
List from either gcite_user_info or gcite_author_info
if (!is_travis() & !is_cran()) {
  res = gcite(author = "John Muschelli")
  paper_df = res$paper_df
  gcite_wordcloud(paper_df)
  author_cloud(paper_df$authors)
}
Calls gcite_user_info after getting the user identifier
gcite_author_info(
  author,
  ask = TRUE,
  pagesize = 100,
  verbose = TRUE,
  secure = TRUE,
  force = FALSE,
  read_citations = TRUE,
  sleeptime = 0,
  ...
)
author |
author name separated by spaces |
ask |
If multiple authors are found, should a menu be given |
pagesize |
Size of pages, max 100, passed to |
verbose |
Print diagnostic messages |
secure |
use https vs. http |
force |
If passing a URL and there is a failure, should the
program return |
read_citations |
Should all citation pages be read? |
sleeptime |
time in seconds between http requests, to avoid Google Scholar rate limit |
... |
Additional arguments passed to |
A list of citations, citation indices, a data.frame of authors, journal, and citations, and a data.frame of the links to all paper URLs.
## Not run:
if (!is_travis()) {
  df = gcite_author_info(author = "John Muschelli", secure = FALSE)
}
## End(Not run)
if (!is_travis() & !is_cran()) {
  df = gcite_author_info(author = "Jiawei Bai", secure = FALSE)
}
Parses Google Citation indices (h-index, etc.) from the main page
gcite_citation_index(doc, ...)

## S3 method for class 'xml_node'
gcite_citation_index(doc, ...)

## S3 method for class 'xml_document'
gcite_citation_index(doc, ...)

## S3 method for class 'character'
gcite_citation_index(doc, ...)
doc |
An xml_document or the URL for the main page |
... |
Additional arguments passed to |
A matrix of indices
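For readers unfamiliar with the h-index mentioned above, here is a small base R sketch of the standard definition: the largest h such that h papers each have at least h citations. This only illustrates the index itself. The package reads the values from the page rather than computing them, and `h_index` is a hypothetical helper.

```r
# Hypothetical helper (not part of gcite): compute an h-index from a
# vector of per-paper citation counts.
h_index <- function(citations) {
  # Rank papers from most to least cited
  sorted <- sort(citations, decreasing = TRUE)
  # Count the ranks at which the paper still has at least `rank` citations
  sum(sorted >= seq_along(sorted))
}

# Made-up citation counts for five papers
h_index(c(10, 8, 5, 4, 3))
```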
library(httr)
library(rvest)
library(gcite)
url = "https://scholar.google.com/citations?user=T9eqZgMAAAAJ"
url = gcite_url(url = url, pagesize = 10, cstart = 0)
if (!is_travis() & !is_cran()) {
  ind = gcite_citation_index(url)
  doc = content(httr::GET(url))
  ind = gcite_citation_index(doc)
  ind_nodes = rvest::html_nodes(doc, "#gsc_rsb_st")[[1]]
  ind = gcite_citation_index(ind_nodes)
}
Parses Google Citation indices (h-index, etc.) from the main page
gcite_citation_page(doc, title = NULL, force = FALSE, ...)

## S3 method for class 'xml_nodeset'
gcite_citation_page(doc, title = NULL, force = FALSE, ...)

## S3 method for class 'xml_document'
gcite_citation_page(doc, title = NULL, force = FALSE, ...)

## S3 method for class 'character'
gcite_citation_page(doc, title = NULL, force = FALSE, ...)

## S3 method for class 'list'
gcite_citation_page(doc, title = NULL, force = FALSE, ...)

## Default S3 method:
gcite_citation_page(doc, title = NULL, force = FALSE, ...)
doc |
An xml_document or the URL for the main page |
title |
title of the article |
force |
If passing a URL and there is a failure, should the
program return |
... |
arguments passed to |
A matrix of indices
library(httr)
library(rvest)
url = paste0("https://scholar.google.com/citations?view_op=view_citation&",
             "hl=en&oe=ASCII&user=T9eqZgMAAAAJ&pagesize=100&",
             "citation_for_view=T9eqZgMAAAAJ:W7OEmFMy1HYC")
url = gcite_url(url = url, pagesize = 10, cstart = 0)
if (!is_travis() & !is_cran()) {
  ind = gcite_citation_page(url)
  doc = content(httr::GET(url))
  ind = gcite_citation_page(doc)
  ind_nodes = html_nodes(doc, "#gsc_oci_table div")
  ind_nodes = html_nodes(ind_nodes, xpath = '//div[@class = "gs_scl"]')
  ind = gcite_citation_page(ind_nodes)
}
Parses Google citations over time from the main Citation page
gcite_cite_over_time(doc, ...)

## S3 method for class 'xml_node'
gcite_cite_over_time(doc, ...)

## S3 method for class 'xml_document'
gcite_cite_over_time(doc, ...)

## S3 method for class 'character'
gcite_cite_over_time(doc, ...)

## Default S3 method:
gcite_cite_over_time(doc, ...)
doc |
An xml_document or the URL for the main page |
... |
arguments passed to |
A matrix of citations
library(httr)
library(rvest)
url = "https://scholar.google.com/citations?user=T9eqZgMAAAAJ"
url = gcite_url(url = url, pagesize = 10, cstart = 0)
if (!is_travis() & !is_cran()) {
  #' ind = gcite_cite_over_time(url)
  doc = content(httr::GET(url))
  ind = gcite_cite_over_time(doc)
  ind_nodes = rvest::html_nodes(doc, ".gsc_md_hist_b")
  ind = gcite_cite_over_time(ind_nodes)
}
Parses a Google Citation bar graph from HTML
gcite_graph(citations, ...)

## S3 method for class 'xml_node'
gcite_graph(citations, ...)

## S3 method for class 'xml_document'
gcite_graph(citations, ...)

## S3 method for class 'character'
gcite_graph(citations, ...)

## Default S3 method:
gcite_graph(citations, ...)
citations |
A list of nodes or xml_node |
... |
arguments passed to |
A matrix of citations and years
Parses a Google Citation bar graph from HTML
gcite_main_graph(citations, ...)

## S3 method for class 'xml_document'
gcite_main_graph(citations, ...)

## S3 method for class 'character'
gcite_main_graph(citations, ...)

## Default S3 method:
gcite_main_graph(citations, ...)
citations |
A list of nodes or xml_node |
... |
arguments passed to |
A matrix of citations and years
Get Paper Data Frame from Title URLs
gcite_paper_df(urls, verbose = TRUE, force = FALSE, sleeptime = 0, ...)
urls |
A character vector of urls, from
|
verbose |
Print diagnostic messages |
force |
If passing a URL and there is a failure, should the
program return |
sleeptime |
time in seconds between http requests, to avoid Google Scholar rate limit |
... |
Additional arguments passed to |
A data.frame of authors, journal, and citations
if (!is_travis() & !is_cran()) {
  L = gcite_user_info(user = "uERvKpYAAAAJ", read_citations = FALSE)
  urls = L$all_papers$title_link
  paper_df = gcite_paper_df(urls = urls, force = TRUE)
}
Parses Google Citation indices (h-index, etc.) from the main page
gcite_papers(doc, ...)

## S3 method for class 'xml_nodeset'
gcite_papers(doc, ...)

## S3 method for class 'xml_document'
gcite_papers(doc, ...)

## S3 method for class 'character'
gcite_papers(doc, ...)

## Default S3 method:
gcite_papers(doc, ...)
doc |
An xml_document or the URL for the main page |
... |
Additional arguments passed to |
A matrix of indices
library(httr)
library(rvest)
url = "https://scholar.google.com/citations?user=T9eqZgMAAAAJ"
url = gcite_url(url = url, pagesize = 10, cstart = 0)
if (!is_travis() & !is_cran()) {
  ind = gcite_papers(url)
  doc = content(httr::GET(url))
  ind = gcite_papers(doc)
  ind_nodes = rvest::html_nodes(doc, "#gsc_a_b")
  ind = gcite_papers(ind_nodes)
}
Additional stopwords to remove from Google Cite results
gcite_stopwords()
Character Vector
gcite_stopwords()
Simple wrapper for adding in pagesize and start values for the page
gcite_url(url, cstart = 0, pagesize = 100)

gcite_base_url(secure = TRUE)

gcite_user_url(user, secure = TRUE)
url |
URL of the google citations page |
cstart |
Starting value for the citation page |
pagesize |
number of citations to return, max is 100 |
secure |
should https be used (default), instead of http |
user |
Username/user ID for Google Scholar Citations |
A character string
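To make the idea of adding paging values to a URL concrete, here is a minimal base R sketch. It is not the package's code: `add_paging` is a hypothetical helper, and it assumes the query parameters are literally named `cstart` and `pagesize` and that the URL does not already contain them.

```r
# Hypothetical helper (not gcite's code): append paging parameters to a URL.
# Assumes the query parameters are named "cstart" and "pagesize" and are
# not already present in `url`.
add_paging <- function(url, cstart = 0, pagesize = 100) {
  # Cap pagesize at 100, the documented maximum
  pagesize <- min(pagesize, 100)
  # Use "&" if the URL already has a query string, otherwise start one
  sep <- if (grepl("?", url, fixed = TRUE)) "&" else "?"
  paste0(url, sep,
         "cstart=", as.integer(cstart),
         "&pagesize=", as.integer(pagesize))
}

url <- "https://scholar.google.com/citations?user=T9eqZgMAAAAJ"
add_paging(url, cstart = 5, pagesize = 100)
```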
url = "https://scholar.google.com/citations?user=T9eqZgMAAAAJ"
gcite_url(url = url, pagesize = 100, cstart = 5)
Loops through pages for all information on Google Citations
gcite_user_info(
  user,
  pagesize = 100,
  verbose = TRUE,
  secure = TRUE,
  force = FALSE,
  read_citations = TRUE,
  sleeptime = 0,
  ...
)
user |
user ID for google Citations |
pagesize |
Size of pages, max 100, passed to |
verbose |
Print diagnostic messages |
secure |
use https vs. http |
force |
If passing a URL and there is a failure, should the
program return |
read_citations |
Should all citation pages be read? |
sleeptime |
time in seconds between http requests, to avoid Google Scholar rate limit |
... |
Additional arguments passed to |
A list of citations, citation indices, a data.frame of authors, journal, and citations, a data.frame of the links to all paper URLs, and the character string of the user name.
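The paging behaviour implied by `pagesize` and `sleeptime` can be sketched as follows. This is only an illustration under stated assumptions: `page_starts` is a hypothetical helper, the total number of items is taken as known in advance (the package may instead keep requesting pages until one comes back short), and no HTTP request is actually made.

```r
# Hypothetical helper: starting offsets needed to cover n_items results
# when each request returns at most `pagesize` of them.
page_starts <- function(n_items, pagesize = 100) {
  if (n_items <= 0) return(integer(0))
  n_pages <- ceiling(n_items / pagesize)
  as.integer(seq(0, by = pagesize, length.out = n_pages))
}

starts <- page_starts(250, pagesize = 100)
sleeptime <- 0  # seconds to wait between requests; raise to slow down

for (cstart in starts) {
  # A real loop would request the page starting at `cstart` here
  message("Would request results starting at ", cstart)
  Sys.sleep(sleeptime)
}
```

A non-zero `sleeptime` spaces the requests out, which is the stated purpose of that argument.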
## Not run:
if (!is_travis() & !is_cran()) {
  df = gcite_user_info(user = "uERvKpYAAAAJ")
}
## End(Not run)
Search Google Citation for an author username
gcite_username(author, verbose = TRUE, ask = TRUE, secure = TRUE, ...)
author |
author name separated by spaces |
verbose |
Verbose diagnostic printing |
ask |
If multiple authors are found, should a menu be given |
secure |
use https vs. http |
... |
arguments passed to |
A character vector of the username of the author
if (!is_travis() & !is_cran()) {
  gcite_username("John Muschelli")
}
Simple wrapper for author_cloud and title_cloud
gcite_wordcloud(
  paper_df,
  author_args = list(),
  title_args = list(),
  warn = FALSE
)
paper_df |
A |
author_args |
Arguments to pass to |
title_args |
Arguments to pass to |
warn |
should warnings be printed from wordcloud? |
Simple wrapper for wordcloud with different defaults
gcite_wordcloud_spec(
  words,
  freq,
  min.freq = 1,
  max.words = Inf,
  random.order = FALSE,
  colors = c("#F768A1", "#DD3497", "#AE017E", "#7A0177", "#49006A"),
  vfont = c("sans serif", "plain"),
  ...
)
words |
words to be plotted |
freq |
the frequency of those words |
min.freq |
words with frequency below min.freq will not be plotted |
max.words |
Maximum number of words to be plotted. The least frequent terms are dropped |
random.order |
plot words in random order. If false, they will be plotted in decreasing frequency |
colors |
color words from least to most frequent |
vfont |
passed to text for the font |
... |
additional options passed to |
Nothing
Simple check for Travis CI for examples
is_travis()

is_cran()
Logical if user is named travis
is_travis()
is_cran()
Set Cookies from Text file
set_cookies_txt(file)
file |
tab-delimited text file of cookies, to be read in using
|
Either NULL if no domains contain the word "scholar", or an object of class request from set_cookies
This function searches for domains that contain the word "scholar"
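The domain filtering described above can be sketched in base R as follows. This is an assumption-laden illustration, not the package's code: the seven-column layout follows the common Netscape cookies.txt convention, which the real file may or may not use, and the two cookie rows are made up.

```r
# Assumed column layout (Netscape cookies.txt style); the real file may differ.
cookie_cols <- c("domain", "flag", "path", "secure",
                 "expiration", "name", "value")

# Two made-up, tab-delimited cookie rows standing in for a file on disk
cookie_text <- paste(
  ".scholar.google.com\tTRUE\t/\tFALSE\t0\tEXAMPLE\tabc",
  ".example.org\tTRUE\t/\tFALSE\t0\tOTHER\txyz",
  sep = "\n"
)

cookies <- read.delim(
  text = cookie_text,
  header = FALSE,
  col.names = cookie_cols,
  stringsAsFactors = FALSE,
  comment.char = "#"
)

# Keep only rows whose domain contains the word "scholar"
scholar <- cookies[grepl("scholar", cookies$domain, fixed = TRUE), ]
scholar$name
```

If no row matches, `scholar` has zero rows, which corresponds to the documented case where NULL is returned.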
Takes a vector of titles and then creates a frequency table of those words and plots a wordcloud
title_cloud(titles, addstopwords = gcite_stopwords(), ...)

paper_cloud(...)

title_word_frequency(titles, addstopwords = NULL)
titles |
Vector of titles of papers |
addstopwords |
Additional words to remove from wordcloud |
... |
additional options passed to |
A data.frame of the words and the frequencies of the title words
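The effect of removing stopwords before counting title words can be sketched in base R. `title_words` is a hypothetical helper rather than the package's `title_word_frequency`; it handles only ASCII letters and digits, and the titles are invented for the example.

```r
# Hypothetical helper (not gcite's code): word frequencies for titles,
# with an optional set of stopwords removed first.
title_words <- function(titles, stopwords = character(0)) {
  # Lowercase, then split on anything that is not an ASCII letter or digit
  words <- unlist(strsplit(tolower(titles), "[^a-z0-9]+"))
  # Drop empty strings and any stopwords
  words <- words[nzchar(words) & !(words %in% tolower(stopwords))]
  # Tabulate, most frequent first
  tab <- sort(table(words), decreasing = TRUE)
  data.frame(word = names(tab), freq = as.integer(tab),
             stringsAsFactors = FALSE)
}

titles <- c("Imaging of the brain", "Brain imaging methods")
title_words(titles, stopwords = c("of", "the"))
```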
## Not run:
L = gcite_author_info("John Muschelli")
paper_df = L$paper_df
titles = paper_df$title
title_cloud(titles)
## End(Not run)