Have you ever dealt with a big-scary CSV file that has many columns that you don’t want and many records that slow down the process for you to filter and get the desired information?
This tutorial is about using two command-line programs that can solve these problems; csvkit and xsv. We will compare the two at the end and see how performant each and when we can use one and not the other in terms of speed especially if we’re processing a large CSV file. In the last blog post, we talked about how to clean text data at the command line that I recommend to have a look at.
Downloading COVID data from covidtracking
Let's first download recent coronavirus data across the United States from COVID Tracking Project which is a volunteer organization dedicated to collecting and publishing the data required to understand the COVID-19 outbreak in the US. Btw, The data is published under a Creative Commons CC BY 4.0 license.
Let's do this by downloading the CSV file manually or using curl:
-LO is a combination of -L and -O
- -L is used to make sure if the URL has changed to another location, curl will redo the request on the new redirection link
- -O this option is used to create an output file of the same name of the requested file name which is all-states-history.csv here
Printing the CSV file headers
Let's first print what column names we have for this all-states.history.csv file:
As you can see, using csvcut with the option -n can list all the headers we have with their associated order which can help us select some specific columns that we're interested in.
Selecting specific columns
In this tutorial, we're interested in four columns and these are their descriptions as reported by the COVID Tracking Project:
- data: Date on which data was collected by The COVID Tracking Project.
- state: Two-letter abbreviation for the state or territory.
- positive: Total number of confirmed plus probable cases of COVID-19 reported by the state or territory
- death: Total fatalities with confirmed OR probable COVID-19 case diagnosis
Let's see how we can get the first 10 lines of these 4 columns in our CSV file at the command line:
So csvcut with the option -c is used here to select the upcoming columns separated by commas. These 10 lines look better aligned with csvlook
Note that we could've done that with either of the following commands:
Meaning you can select the columns with their numbers or ranges or a combination of numbers and column names as strings.
Take care that this CSV data may differ from yours if you're using the recent data from the COVID Tracking Project on another day than the day this tutorial was written.
Let's now filter out COVID data at California state:
We used here csvgrep with the option -c to select the column that we’re filtering which is the state here to match AL using -m option that matches the pattern we search for.
I'd like to make sure of this data, so I went to Google and asked how many cases we have at Alabama and this is the answer:
Looks like the data reported by the COVID Tracking Project is close to what Google is reporting having 186K positive cases and 2892 fatalities.
If you also put another column to show the increase in the positive cases from the previous day, you'd find:
967 positive cases increased from Oct. 26 to Oct. 27 and this number exactly matches what Google reports (+967) below the Total cases number in the image above.
Joining two CSVs
I'm not familiar with some abbreviations in the state column, so let's have the second CSV file which we can join on to get a cleaner output of CSV data we understand. Let's download it using curl:
This states.csv file has two columns: State and Abbreviation
Let's see how we can make this interesting join here:
Note here that csvjoin command takes much time because it's reading both files into memory.
Here we joined the two CSV files on a column for each CSV; Abbreviation in the first file and state in the second one and then we filtered out 5 columns to view using csvcut -c
Also, note that there the second column you filtered out when you joined is gone meaning if you filtered out state (which was the column that has the two-letter abbreviation of the state) it will give an error that 'state' is invalid which means this column is not there anymore.
Comparing between xsv and csvkit utilities
As we noticed, some commands took much time using csvkit command line utility. Let's see a quick comparison between its command-line tools and their associated ones at xsv.
All the upcoming commands run are relative to my machine, let's compare one by another:
xsv headers vs csvcut -n
Time of csvkit's csvcut -n: ~307ms
Time of xsv's headers: ~13ms
xsv select vs csvcut -c
Time of csvkit's csvcut -c: ~288ms
Time of xsv's select: ~35ms
xsv search vs csvgrep
Time of csvkit's csvgrep with csvcut: ~438ms
Time of xsv's search with select: ~38ms
xsv table vs csvlook
Time of csvkit's csvlook with csvgrep and csvcut: ~476ms
Time of xsv's table with search with select: ~41ms
xsv join vs csvjoin
Time of csvkit's csvjoin with csvcut: ~1.6min
Time of xsv's join with select: ~51ms
Did you see what just happened?! 51 milliseconds vs 1.6 minutes?!
Well, xsv can do better than 51 ms here with indexing with:
but we can feel more how fast xsv is if the file we’re investigating has many more records.
From our investigation of the 13269 records data of the COVID Tracking Project CSV file, it seems that xsv is killing it and it's more performant than csvkit. We've seen a tremendous improvement in speed when cleaning our data by:
- knowing the headers of the CSV file using xsv headers and comparing it with csvcut -n
- filtering out our desired columns using xsv select comparing it with csvcut -c
- searching for a specific pattern using xsv select comparing it with csvgrep
- better looking for the CSV using xsv table comparing it with csvlook
- or joining two tables using xsv join comparing it with csvjoin
In the end, you have the choice to choose whatever you want from the csvkit or xsv but it'd be fair to use what makes our life easy which is xsv especially when we're dealing with a large CSV file and if speed and performance is not what we look for especially when we're dealing with small CSVs, we can go for csvkit.
You might have noticed that the syntax is kind of similar, it just differs in some commands like joining two CSVs. So you always have the choice!
This tutorial is mainly motivated by Data Science at the Command Line
Disclosure: The Amazon links for the book (in this section) are paid links so if you buy the book, I will have a small commission
This book tries to catch your attention on the ability of the command line when you do data science tasks - meaning you can obtain your data, manipulate it, explore it, and make your prediction on it using the command line. If you are a data scientist, aspiring to be, or want to know more about it, I highly recommend this book. You can read it online for free from its website or order an ebook or paperback. In this tutorial we're gonna focus on using the command line to clean our data.
Stay safe everybody, will see you in the next tutorials ;)
- csvkit Documentation
- 5 Magic Fixes for the Most Common CSV File reader Problems | ThoughtSpot
- COVID Tracking Data
- States Abbreviations Data
- xsv Repo by the BruntSushi Author
- XSV: A fast CSV command-line toolkit written in Rust