Today, we're talking about part 2 of cleaning CSV data at the command line investigating a large CSV file (from Kaggle) that contains ~26 million users and their number of comments posted from 2005 to 2017. We also talked about cleaning text files in general using the command line if you want to check out.
You just need to install csvkit and xsv with the following commands:
Just need pip to be able to install csvkit:
Binaries for Windows, Linux and macOS are available from Github.
Or if you're using macOS Homebrew, install it just like this:
Compiling from its repository also works similarly:
but you need to have Cargo installed which is Rust's package manager since xsv is written in Rust
Concatenating: xsv cat vs csvstack
Let's first create a CSV file to play with instead of the large file. Create it with your favorite text editor or just at the command line:
Here, we’d like to concatenate this fake_users.csv file with the large users.csv file.
To see the effect of the performance here, we will stack the large file first because that will be more accurate when using tail to validate that the second file is already read into memory and concatenated.
xsv has a tool for that which is cat, let's use it:
- rows here is used to indicate that we will concatenate the rows so the rows of the second file will be appended to the rows of the first file
Time here taken by xsv cat is ~15s
csvkit also has a csvstack tool, let’s see it:
Time here taken by csvstack is ~1.7mins
So approximately, xsv is ~7x faster here than csvkit
Sorting: pandas.sort_values() vs xsv sort vs GNU sort vs csvsort
Let’s first see how csvsort and xsv sort work by sorting fake_users.csv by the number of comments n:
We can also sort in reverse order by setting the option -r:
The equivalent xsv command is:
-RNs is a combination of:
- -R for reverse order
- -N for numeric sorting
- -s for selecting the column that you sort by
After downloading users.csv, the ~359MB CSV file, let's sort it with xsv first. This is a +25 million records so expect a latency here!
So xsv sort took here ~1.4 mins for this much data.
Let's try the sort command coming with the UNIX system:
- -t, to determine the field separator which is comma here as the file is a CSV file
- -r for reverse order
- -n for numeric sorting
- -k2 for setting the key field (the column that you sort by which is the second column here)
Here BSD sort took ~6.3min which is ~5x slower than xsv
What about csvsort?
Do you know that I have terminated my terminal after more than an hour of waiting for this command to complete execution and it's not completed yet!
Knowing that csvkit is written in pure python can somehow proves that it's not advisable to write a very performant tool with such technology alone unless you use C or Cython with Python as Pandas did:
Pandas here beats ALL, csvkit, BSD sort, and even xsv with execution time of ~78 seconds which means it's very close to xsv result (1.4min = 84s) so just 6 seconds difference and with each run, the time varies so it's acceptable for xsv to be competing with pandas for performance.
From our investigation of the 26 million records data of the Reddit username data, it seems that xsv is the fastest command line I ever used and I think ever existed at the terminal. It is more performant than csvkit. We’ve seen a tremendous improvement in speed when cleaning our data by:
- concatenating the rows of the CSV files using xsv cat rows and comparing it with csvstack
- sorting by a column and in a reverse order using xsv sort -sRN comparing it with csvsort -rc
- also comparing the two utilities with BSD sort and pandas resulting in a very good performing of pandas and xsv
In the end, as always you have the choice to choose whatever you want from the csvkit or xsv but it'd be fair to use what makes our life easy which is xsv especially when we're dealing with a large CSV file like what we saw today and if speed and performance are not what we look for especially when we're dealing with small CSVs, we can go for csvkit.
You might have noticed that the syntax is kind of similar in sorting and stacking. So you always have the choice!
This tutorial is mainly motivated by Data Science at the Command Line
Disclosure: The Amazon links for the book (in this section) are paid links so if you buy the book, I will have a small commission
This book tries to catch your attention on the ability of the command line when you do data science tasks - meaning you can obtain your data, manipulate it, explore it, and make your prediction on it using the command line. If you are a data scientist, aspiring to be, or want to know more about it, I highly recommend this book. You can read it online for free from its website or order an ebook or paperback.
Take care, will see you in the next tutorials :)
Click here to get fresh content to your inbox
- csvkit Documentation
- xsv Repo by the BruntSushi Author
- Reddit Usernames data | Kaggle
- how to sort pandas dataframe from one column
- How do you calculate program run time in python? [duplicate]
- Photo by Yan Ots on Unsplash
Please clap on medium if you like this article, thank you! :)