Parquet vs the RDS Format

Author: Colin Gillespie

Published: February 1, 2024

tags: r, arrow, parquet, rds

This is part of a series of related posts on Apache Arrow. Other posts in the series are:

  • Understanding the Parquet file format
  • Reading and Writing Data with {arrow}
  • Parquet vs the RDS Format (This post)

The benefit of using the {arrow} package with parquet files is that it enables you to work with ridiculously large data sets from the comfort of an R session. Using the NYC-Taxi data from the previous blog post, we can perform standard data science operations, such as:

library("arrow")
nyc_taxi = open_dataset(nyc_data)
nyc_taxi |>
  dplyr::filter(year == 2019) |> 
  dplyr::group_by(month) |>
  dplyr::summarise(trip_distance = max(trip_distance)) |> 
  dplyr::collect()

with a speed that seems almost magical. When your dataset is as large as the NYC-Taxi data, standard file formats such as CSV and R binary files simply aren’t an option.


However, let’s suppose you are in a situation where your data is inconvenient: not big, just a bit annoying. For example, we can take a single year and a single month:

taxi_subset = open_dataset(nyc_data) |>
  dplyr::filter(year == 2019 & month == 1) |>
  dplyr::collect()  

The data is still large, with around eight million rows

nrow(taxi_subset)

and takes around 1.2GB of RAM when we load it into R. The data isn’t big, just annoying! In this situation, should we use the native binary format or stick with parquet?

In theory, we could use CSV, but that’s really slow!
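
If you want to check the memory figure for yourself, base R’s object.size() will report it; the exact value depends on the subset you pull, so treat the 1.2GB as approximate.

# In-memory size of the taxi subset (roughly 1.2GB for January 2019)
format(object.size(taxi_subset), units = "GB")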

RDS vs Parquet

The RDS format is a binary file format, native to R. It has been part of R for many years, and provides a convenient method for saving R objects, including data sets.

The obvious question is: which file format should you use for storing tabular data, RDS or parquet? For this comparison, I’m interested in the following characteristics:

  • the time required to save the file;
  • the file size;
  • the time required to load the file.

I’m also a firm believer in keeping things stable and simple. So if both methods are roughly the same, or even if parquet is a little better, I would stick with R’s binary format. Consequently, I don’t really care about a few MBs or seconds.

Reading and writing the data

To save the taxi data subset, we use saveRDS() for the RDS format and write_parquet() for the parquet format. The default compression method used by saveRDS() is gzip, whereas write_parquet() uses snappy. As you might guess, gzip produces smaller files, but takes longer to write.

saveRDS(taxi_subset, file = "taxi.rds")
# Default parquet compression is "snappy"
tf1 = tempfile(fileext = ".parquet")
write_parquet(taxi_subset, sink = tf1, compression = "snappy")
tf2 = tempfile(fileext = ".gzip.parquet")
write_parquet(taxi_subset, sink = tf2, compression = "gzip")
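
If you want to compare the on-disk sizes yourself, file.size() (base R) reports bytes, so dividing by 1e6 gives megabytes:

# On-disk sizes in MB; expect numbers in the same ballpark as the table below
file.size("taxi.rds") / 1e6
file.size(tf1) / 1e6
file.size(tf2) / 1e6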

Reading in either file type is also straightforward

readRDS("taxi.rds")
# Need to use collect() to make comparison far
open_dataset(file_path) |>
  dplyr::collect()
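
For the timings, base R’s system.time() is sufficient at this scale; a minimal sketch (each call repeated a few times and averaged) is:

# Minimal timing sketch; repeat and average for more stable numbers
system.time(saveRDS(taxi_subset, file = "taxi.rds"))
system.time(write_parquet(taxi_subset, sink = tf1, compression = "snappy"))
system.time(readRDS("taxi.rds"))
system.time(open_dataset(tf1) |> dplyr::collect())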

Results

Each test was run a couple of times, and the average is given in the table below. The read times and file sizes were fairly consistent, but the write times varied considerably between runs.

Method   Compression  Size (MB)  Write Time (s)  Read Time (s)
RDS      gzip         115        27              5.7
Parquet  snappy       143        4               0.3
Parquet  gzip         105        12              0.4

For me, the results suggest that for files of this size, I would only consider using the native binary R format if

  1. the writing and reading times weren’t an issue;
  2. and/or using the native binary R format (and the stability it implies) was really important.

However, parquet and {arrow} do look appealing.

When Should We Use Parquet over RDS?

The above timings are for one particular data set (around 110MB). However, a few quick experiments show the performance improvement is fairly consistent across different file sizes:

  • Writing (parquet vs RDS): around 6 times faster using snappy, and around twice as fast using gzip;
  • Reading (parquet vs RDS): around 16 times faster.

So, to answer the question: when should we use parquet over RDS? For me, that depends. If it was for a standard analysis, and the files were fairly modest (less than 20 MB), I would probably just go for an RDS file. However, if I had a Shiny application, the threshold at which I would switch to parquet would be significantly lower, for the simple reason that one second in a web application feels like a lifetime. Remember that if you are using {pins}, then pin_write() can handle parquet files without any issue; a quick sketch is below.
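
As a minimal sketch (a temporary board is used purely for illustration; a real application would use board_connect(), board_s3(), or similar, and type = "parquet" requires a reasonably recent version of {pins}):

library("pins")
# Throwaway board for illustration only
board = board_temp()
# Store the subset as parquet rather than the default rds
pin_write(board, taxi_subset, name = "taxi_subset", type = "parquet")
taxi_from_pin = pin_read(board, "taxi_subset")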

