Importing Data with Python

Author: Russ Hyde

Published: July 17, 2025

tags: python
The Polars and Pandas logos

Importing data is a key step in the data science workflow, and it carries a lot of responsibility. How you import (or connect to) a dataset has consequences for how you work with that data throughout a project, because a Pandas DataFrame (say) requires Pandas-specific code. Similarly, your data constrains your code: if it fits in memory on a single computer, the constraints are different from those you face when the data is so large that its storage must be distributed. Data import is also a key place where a data project can go wrong. If you import a dataset without validating that it contains sensible values, a nasty surprise may await you…

Python has wonderful libraries for data manipulation and analysis. You can readily work with data sources of many kinds, for example sources:

  • that are too big to hold in memory;
  • that are distributed across a network;
  • that update rapidly;
  • or that don’t easily conform to a tabular, relational form.

The Python data stack is sufficiently mature that there are multiple libraries for all of these settings, and there are some moves to introduce a standardised syntax for some data-frame functionality across libraries. At Jumping Rivers, we have a number of Python training courses: we teach Pandas, PySpark and NumPy, and how to work with databases from Python via SQL.

Whether you want to start from scratch, or improve your skills, Jumping Rivers has a training course for you.

Importing into memory

Firstly, we will compare two in-memory data-frame packages: Pandas and Polars. We will work in a virtual environment, so that the packages we install are managed independently of the system-wide Python (see our Barbie-themed blog post on virtual environments).

# [Linux terminal commands]
# Create & activate a new virtual environment
python -m venv .venv
source .venv/bin/activate
# Install pandas and polars into the environment
pip install pandas polars

We will generate a simple dataset to import with the two packages. PyPI download numbers for a specific Python package can be obtained using the pypistats package. After installing it, we will pull out the number of downloads for the package pytest - see our recent blog posts for an introduction to this testing library.

# [Linux terminal commands]
# Install pypistats
pip install pypistats
# Create a directory to hold the downloaded data
mkdir -p data
# Obtain download statistics for `pytest` in tab-separated format
pypistats python_minor -f tsv pytest > data/pytest-downloads.tsv

The structure of that file is straightforward. It records both the number and the percentage of pytest downloads for each minor version of Python (“3.8”, “3.9” and so on) over the last 180 days (the default time-span).

# [Linux terminal commands]
head data/pytest-downloads.tsv
## "category"   "percent"   "downloads"
## "3.11"   "23.40%"    282,343,944
## "3.9"    "18.78%"    226,548,604
## "3.10"   "18.74%"    226,155,405
## "3.12"   "15.48%"    186,819,921
## "null"   "8.41%" 101,489,156
## "3.8"    "6.13%" 73,965,471
## "3.13"   "4.91%" 59,253,846
## "3.7"    "3.36%" 40,551,618
## "3.6"    "0.54%" 6,546,017

So it should be trivial to import it into Python using either Pandas or Polars.

files = {
  "pytest_data": "data/pytest-downloads.tsv"
}
import pandas as pd
downloads_pd = pd.read_csv(files["pytest_data"], sep="\t")
downloads_pd.head()
##   category percent    downloads
## 0     3.11  23.40%  282,343,944
## 1      3.9  18.78%  226,548,604
## 2     3.10  18.74%  226,155,405
## 3     3.12  15.48%  186,819,921
## 4      NaN   8.41%  101,489,156
import polars as pl
downloads_pl = pl.read_csv(files["pytest_data"], separator="\t")
downloads_pl.head()
## shape: (5, 3)
## ┌──────────┬─────────┬─────────────┐
## │ category ┆ percent ┆ downloads   │
## │ ---      ┆ ---     ┆ ---         │
## │ str      ┆ str     ┆ str         │
## ╞══════════╪═════════╪═════════════╡
## │ 3.11     ┆ 23.40%  ┆ 282,343,944 │
## │ 3.9      ┆ 18.78%  ┆ 226,548,604 │
## │ 3.10     ┆ 18.74%  ┆ 226,155,405 │
## │ 3.12     ┆ 15.48%  ┆ 186,819,921 │
## │ null     ┆ 8.41%   ┆ 101,489,156 │
## └──────────┴─────────┴─────────────┘

We can see that pytest was downloaded ~280 million times on Python 3.11, accounting for about 23% of all downloads.

The syntax for importing the dataset using the two libraries is almost identical. The only difference for the read_csv() function is that you use sep=... in Pandas and separator=... in Polars.

Polars is more memory-efficient than Pandas in many settings. One of the memory efficiencies that Polars allows is filtering by rows and columns during import. To take advantage of this, you can use the scan_csv() function in Polars. This creates a lazy data-frame that can be manipulated with the usual Polars DataFrame methods; those method calls are applied at the point when the data is loaded, rather than to an in-memory data-frame. The Polars website has far more details about the lazy API and the benefits it can bring to your work.
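As an illustration (a sketch only; the lazy API is not used in the rest of this post), a scan_csv() pipeline might look like the following, with the row filter and column selection pushed down into the scan itself:

import polars as pl

# Lazily "scan" the file rather than reading it eagerly
lazy_downloads = (
    pl.scan_csv("data/pytest-downloads.tsv", separator="\t")
    # Keep only the rows for Python 3.x versions
    .filter(pl.col("category").str.starts_with("3."))
    # Read only the columns we need
    .select(["category", "downloads"])
)
# Nothing is read from disk until collect() is called
downloads_subset = lazy_downloads.collect()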

Here, we will be working with eagerly loaded, in-memory data-frames - a good choice during exploratory work.

Validating a dataset

Have we loaded what we wanted?

The Pandas/Polars .dtypes attribute gives you information about the data-types in each column of a data-frame:

print(downloads_pd.dtypes)
## category     object
## percent      object
## downloads    object
## dtype: object
print(downloads_pl.dtypes)
## [String, String, String]

Again, the formatting of the output looks a little different between the two packages, but the results are broadly similar: all of our data (the version-strings in ‘category’, the percentage values and the download counts) has been read as character strings. Pandas tells us they are ’object’s, but typically that means they are strings:

type(
  downloads_pd["category"][0]
)
## <class 'str'>
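The Polars equivalent is the schema attribute, which maps each column name to its data-type (all three columns are strings here):

# Maps each column name to its Polars data-type
print(downloads_pl.schema)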

So Python has imported our data incorrectly. In a real project we would want to know about this, so we should validate the imported data. We would also want to prevent data-import mistakes as far as possible (see later) by being more explicit about how each column is imported and converted.

Python has a range of packages for validating both the schema for, and the values in, a dataset.

If you want to check the data-types stored in the fields of a general class, you could use Pydantic:

# [Linux terminal commands]
pip install pydantic
from pydantic import BaseModel, NonNegativeInt

class Person(BaseModel):
    name: str
    age: NonNegativeInt

Correct data-types cause no fuss:

Person(name="Russ", age=47)
## Person(name='Russ', age=47)

But incorrect data (here, a negative age) throws an error:

Person(name="Buddy", age=-1)
## pydantic_core._pydantic_core.ValidationError: 1 validation error for Person
## age
##   Input should be greater than or equal to 0 [type=greater_than_equal, input_value=-1, input_type=int]
##     For further information visit https://errors.pydantic.dev/2.11/v/greater_than_equal

There are similar tools that extend this style of validation to data-frames. For example, Pandera works with Pandas, Polars and several other data-frame libraries. You need to install a different Pandera extra depending on the data-frame library you are working with (“pandera[pandas]” for Pandas, “pandera[polars]” for Polars, and so on).

# [Linux terminal commands]
pip install "pandera[pandas]"
import pandera.pandas as pa

schema = pa.DataFrameSchema({
    "category": pa.Column(str, nullable=True),
    "percent": pa.Column(float, checks=pa.Check.in_range(0, 100)),
    "downloads": pa.Column(int)
})
schema.validate(downloads_pd)
## pandera.errors.SchemaError: non-nullable series 'percent' contains null values:
## 17    NaN
## 18    NaN
## Name: percent, dtype: object

Validation of the data has identified an issue with the dataset: there are missing values in the “percent” column towards the end of the dataset. There are other issues - the two numeric columns are currently strings - but let’s check out the issue that Pandera has identified first.
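As an aside, Pandera stops at the first failing check by default; if you pass lazy=True, it collects every failure into a single SchemaErrors exception (a sketch, not used in the rest of this post):

from pandera.errors import SchemaErrors

try:
    schema.validate(downloads_pd, lazy=True)
except SchemaErrors as err:
    # A data-frame with one row per failing check / failing value
    print(err.failure_cases)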

This is the end of the dataset:

downloads_pd.tail()
##                                category percent      downloads
## 14                                  3.3   0.00%          2,259
## 15                                  2.6   0.00%             87
## 16                                  3.2   0.00%             68
## 17                                Total     NaN  1,206,596,056
## 18  Date range: 2024-12-31 - 2025-07-07     NaN            NaN

The final couple of lines of the file contain summary metadata (a “Total” row and the date range) that should be dropped at import. For now, let’s simply limit the number of rows we read (this isn’t a very robust solution, but it is fine here; a sketch of a sturdier approach follows the output below).

downloads_pd = pd.read_csv(files["pytest_data"], sep="\t", nrows = 17)
downloads_pd.tail()
##     category percent downloads
## 12      3.40   0.00%    12,777
## 13      3.15   0.00%     2,882
## 14      3.30   0.00%     2,259
## 15      2.60   0.00%        87
## 16      3.20   0.00%        68
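As promised, here is a sketch of a sturdier alternative (the downloads_pd_robust name is ours, and we assume the trailing metadata rows are the only ones whose category is neither a version string nor missing): keep just the rows that look like real data, however many trailing rows there happen to be.

raw = pd.read_csv(files["pytest_data"], sep="\t")
# Keep rows whose "category" looks like a Python version (e.g. "3.11"),
# or is missing (the "null" category); this drops the trailing metadata rows
is_version = raw["category"].str.fullmatch(r"\d+\.\d+", na=False)
downloads_pd_robust = raw[is_version | raw["category"].isna()].reset_index(drop=True)

The rest of the post sticks with the simpler nrows=17 version.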

Now, if we validate the Pandas data-frame, another issue is identified.

schema.validate(downloads_pd)
## pandera.errors.SchemaError: expected series 'category' to have type str:
## failure cases:
##     index  failure_case
## 0       0          3.11
## 1       1          3.90
## 2       2          3.10
## 3       3          3.12
## 4       5          3.80
## 5       6          3.13
## 6       7          3.70
## 7       8          3.60
## 8       9          2.70
## 9      10          3.14
## 10     11          3.50
## 11     12          3.40
## 12     13          3.15
## 13     14          3.30
## 14     15          2.60
## 15     16          3.20

This is a more substantial issue - the Python version-strings (3.2, 3.3 and so on) have been converted into floating-point numbers during import. So now Python version “3.2” is Python 3.20 in the data-frame.
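A quick check confirms that the column is now floating-point:

# The "category" column was inferred as floats, so "3.2" and "3.20" are
# now indistinguishable
print(downloads_pd["category"].dtype)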

Rich Iannone has written a useful blog post comparing various data-validation libraries for Polars on the “Posit-dev” blog. The tools mentioned in his post can check more substantial matters than the missing-data, data-type and data-range issues that we mentioned above. In particular, the tool “pointblank” can create data-validation summary reports that can be used to report back to data-collection teams or to analysts. Python also has longer-established data-validation tools like “great expectations” and “test-driven data analysis”.

Importing with data-type constraints

I knew the structure of the tab-separated file before attempting to load it with Pandas and Polars. That is, I knew that the percentages looked like “12.34%” and that the download counts looked like “123,456,789” (with commas separating the thousands and millions). Neither package can automatically convert these formats into numeric values without a bit of help. Even if we specify what the post-import data-type should be for each column, the two libraries cannot parse the input values directly.

Both Polars and Pandas allow you to provide a pre-defined schema when importing data. For Pandas, you provide a dtype dictionary which specifies what each output column’s data-type should be. For Polars, you provide a schema argument, where the data-types are specified using Polars’ own type objects (pl.Utf8, pl.Float64 and so on) rather than Python’s int and str, reflecting the Rust types that Polars uses to store the data.

If we import our data using a schema, Pandas and Polars will complain:

downloads_pd = pd.read_csv(
    files["pytest_data"],
    sep="\t",
    nrows = 17,
    dtype={"category": str, "percent": float, "downloads": int}
)
## ValueError: could not convert string to float: '23.40%'
downloads_pl = pl.read_csv(
    files["pytest_data"],
    separator="\t",
    n_rows=17,
    schema={"category": pl.Utf8, "percent": pl.Float64, "downloads": pl.Int64}
)
## polars.exceptions.ComputeError: could not parse `"23.40%"` as dtype `f64` at column 'percent' (column number 2)
## 
## The current offset in the file is 7 bytes.
## 
## You might want to try:
## - increasing `infer_schema_length` (e.g. `infer_schema_length=10000`),
## - specifying correct dtype with the `schema_overrides` argument
## - setting `ignore_errors` to `True`,
## - adding `"23.40%"` to the `null_values` list.
## 
## Original error: ```remaining bytes non-empty```

The errors arise because we also need to specify how to convert the raw values into the expected data-types when the conversion isn’t obvious. In Pandas, this is done using converters:

downloads_pd = pd.read_csv(
    files["pytest_data"],
    sep="\t",
    nrows = 17,
    dtype={"category": str},
    converters = {
      # 12.34% -> 12.34
      "percent": lambda x: float(x.strip("%")),
      # 123,456,789 -> 123456789
      "downloads": lambda x: int(x.replace(",", ""))
    }
)

downloads_pd.head()
##   category  percent  downloads
## 0     3.11    23.40  282343944
## 1      3.9    18.78  226548604
## 2     3.10    18.74  226155405
## 3     3.12    15.48  186819921
## 4      NaN     8.41  101489156
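As an aside, Pandas can also strip the thousands separators itself via the thousands argument, leaving only the percent column in need of a converter (a sketch; the downloads_pd_alt name is ours):

downloads_pd_alt = pd.read_csv(
    files["pytest_data"],
    sep="\t",
    nrows = 17,
    thousands=",",
    dtype={"category": str, "downloads": int},
    converters = {
      # 12.34% -> 12.34
      "percent": lambda x: float(x.strip("%"))
    }
)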

In Polars, the with_columns() method allows conversion to the expected data-types.

downloads_pl = pl.read_csv(
    files["pytest_data"],
    separator="\t",
    n_rows=17
).with_columns(
  # 12.34% -> 12.34
  pl.col("percent").str.replace("%", "").cast(pl.Float64),
  # 123,456,789 -> 123456789
  pl.col("downloads").str.replace_all(",", "").cast(pl.Int64)
)

downloads_pl.head()
## shape: (5, 3)
## ┌──────────┬─────────┬───────────┐
## │ category ┆ percent ┆ downloads │
## │ ---      ┆ ---     ┆ ---       │
## │ str      ┆ f64     ┆ i64       │
## ╞══════════╪═════════╪═══════════╡
## │ 3.11     ┆ 23.4    ┆ 282343944 │
## │ 3.9      ┆ 18.78   ┆ 226548604 │
## │ 3.10     ┆ 18.74   ┆ 226155405 │
## │ 3.12     ┆ 15.48   ┆ 186819921 │
## │ null     ┆ 8.41    ┆ 101489156 │
## └──────────┴─────────┴───────────┘
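The same conversions also slot into the lazy API (a sketch): swap read_csv() for scan_csv() and call collect() at the end.

downloads_pl_lazy = (
    pl.scan_csv(
        files["pytest_data"],
        separator="\t",
        n_rows=17
    )
    .with_columns(
      # 12.34% -> 12.34
      pl.col("percent").str.replace("%", "").cast(pl.Float64),
      # 123,456,789 -> 123456789
      pl.col("downloads").str.replace_all(",", "").cast(pl.Int64)
    )
    .collect()
)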

Now if we validate our datasets against the Pandera schema, we should have a little more success:

schema.validate(downloads_pd)
##    category  percent  downloads
## 0      3.11    23.40  282343944
## 1       3.9    18.78  226548604
## 2      3.10    18.74  226155405
## 3      3.12    15.48  186819921
## 4       NaN     8.41  101489156
## 5       3.8     6.13   73965471
## 6      3.13     4.91   59253846
## 7       3.7     3.36   40551618
## 8       3.6     0.54    6546017
## 9       2.7     0.20    2371860
## 10     3.14     0.02     300816
## 11      3.5     0.02     231325
## 12      3.4     0.00      12777
## 13     3.15     0.00       2882
## 14      3.3     0.00       2259
## 15      2.6     0.00         87
## 16      3.2     0.00         68

Nice.

Summary

In this post we have looked at importing data with both Pandas and Polars. We have seen that the import functions are similar (read_csv(...)), but that there are some subtle differences in how you specify certain options (the column-separator argument, the data schema and so on). Explicit conversion of the imported data is also performed differently by the two packages.

Once your data has been imported, you should check that it contains what it is supposed to: are all the columns you need present, do they store the correct data-types, and do the values within those columns sit within the expected ranges? Data-validation libraries like Pointblank and Pandera are really useful for checking data schemas and data values before you do anything critical with your dataset.

