How to Clean CSV Data at the Command Line | Part 2

A tutorial about cleaning a large CSV file using command-line programs csvkit and xsv when sorting and concatenating, and their performance

In the last blog post, we proved that xsv is ~1882x faster than csvkit when joining two CSVs and saw how performant xsv is and when we can use xsv or csvkit when we are at the terminal.

Today, we're talking about part 2 of cleaning CSV data at the command line investigating a large CSV file (from Kaggle) that contains ~26 million users and their number of comments posted from 2005 to 2017. We also talked about cleaning text files in general using the command line if you want to check out.

Prerequisites

You just need to install csvkit and xsv with the following commands:

install csvkit

Just need pip to be able to install csvkit:

$ pip install csvkit

install xsv

Binaries for Windows, Linux and macOS are available from Github.

Or if you're using macOS Homebrew, install it just like this:

$ brew install xsv

Compiling from its repository also works similarly:

$ git clone git://github.com/BurntSushi/xsv
$ cd xsv
$ cargo build --release

but you need to have Cargo installed which is Rust's package manager since xsv is written in Rust

Concatenating: xsv cat vs csvstack

Let's first create a CSV file to play with instead of the large file. Create it with your favorite text editor or just at the command line:

$ cat >> fake_users.csv
author,n
x,5
z,7
y,6

Here, we’d like to concatenate this fake_users.csv file with the large users.csv file.

To see the effect of the performance here, we will stack the large file first because that will be more accurate when using tail to validate that the second file is already read into memory and concatenated.

xsv has a tool for that which is cat, let's use it:

$ time xsv cat rows users.csv fake_users.csv | tail
youbd636,1
generalkickstand,1
Throwaway_injuries,5
LowlzMan,3
ImSoLitBroLMFAO,1
i_want_to_die_killme,2
beeman3459,1
x,5
z,7
y,6

real	0m15.198s
user	0m17.499s
sys	0m0.364s
  • rows here is used to indicate that we will concatenate the rows so the rows of the second file will be appended to the rows of the first file

Time here taken by xsv cat is ~15s

csvkit also has a csvstack tool, let’s see it:

$ time csvstack users.csv fake_users.csv | tail
youbd636,1
generalkickstand,1
Throwaway_injuries,5
LowlzMan,3
ImSoLitBroLMFAO,1
i_want_to_die_killme,2
beeman3459,1
x,5
z,7
y,6

real	1m7.213s
user	1m21.464s
sys	0m1.026s

Time here taken by csvstack is ~1.7mins

So approximately, xsv is ~7x faster here than csvkit

Sorting: pandas.sort_values() vs xsv sort vs GNU sort vs csvsort

Let’s first see how csvsort and xsv sort work by sorting fake_users.csv by the number of comments n:

$ csvsort -c n fake_users.csv
author,n
x,5
y,6
z,7

We can also sort in reverse order by setting the option -r:

$ csvsort -rc n fake_users.csv
author,n
z,7
y,6
x,5

The equivalent xsv command is:

$ xsv sort -RNs n fake_users.csv 

-RNs is a combination of:

  • -R for reverse order
  • -N for numeric sorting
  • -s for selecting the column that you sort by

After downloading users.csv, the ~359MB CSV file, let's sort it with xsv first. This is a +25 million records so expect a latency here!

$ time xsv sort -RNs n users.csv > /dev/null

real	1m48.611s
user	1m27.743s
sys	0m16.621s

So xsv sort took here ~1.4 mins for this much data.

Let's try the sort command coming with the UNIX system:

$ time sort -t, -rnk2 users.csv > /dev/null

real	6m29.940s
user	5m49.416s
sys	0m32.340s
  • -t, to determine the field separator which is comma here as the file is a CSV file
  • -r for reverse order
  • -n for numeric sorting
  • -k2 for setting the key field (the column that you sort by which is the second column here)

Here BSD sort took ~6.3min which is ~5x slower than xsv

What about csvsort?

$ time csvsort -rc n users.csv > /dev/null

Do you know that I have terminated my terminal after more than an hour of waiting for this command to complete execution and it's not completed yet!

Knowing that csvkit is written in pure python can somehow proves that it's not advisable to write a very performant tool with such technology alone unless you use C or Cython with Python as Pandas did:

$ python sort_with_pandas.py
Time: 77.732413994

Pandas here beats ALL, csvkit, BSD sort, and even xsv with execution time of ~78 seconds which means it's very close to xsv result (1.4min = 84s) so just 6 seconds difference and with each run, the time varies so it's acceptable for xsv to be competing with pandas for performance.

Final thoughts

From our investigation of the 26 million records data of the Reddit username data, it seems that xsv is the fastest command line I ever used and I think ever existed at the terminal. It is more performant than csvkit. We’ve seen a tremendous improvement in speed when cleaning our data by:

  • concatenating the rows of the CSV files using xsv cat rows and comparing it with csvstack
  • sorting by a column and in a reverse order using xsv sort -sRN comparing it with csvsort -rc
  • also comparing the two utilities with BSD sort and pandas resulting in a very good performing of pandas and xsv

In the end, as always you have the choice to choose whatever you want from the csvkit or xsv but it'd be fair to use what makes our life easy which is xsv especially when we're dealing with a large CSV file like what we saw today and if speed and performance are not what we look for especially when we're dealing with small CSVs, we can go for csvkit.

You might have noticed that the syntax is kind of similar in sorting and stacking. So you always have the choice!

This tutorial is mainly motivated by Data Science at the Command Line

Disclosure: The Amazon links for the book (in this section) are paid links so if you buy the book, I will have a small commission

This book tries to catch your attention on the ability of the command line when you do data science tasks - meaning you can obtain your data, manipulate it, explore it, and make your prediction on it using the command line. If you are a data scientist, aspiring to be, or want to know more about it, I highly recommend this book. You can read it online for free from its website or order an ebook or paperback.

You might be interested in my previous tutorials on part 1 of cleaning CSV data or why we use docker tutorial or the one similar to this on how to clean text data at the command line

Take care, will see you in the next tutorials :)

Click here to get fresh content to your inbox

Resources

Please clap on medium if you like this article, thank you! :)

Get FREE coupons and discounts
on my upcoming courses
when you subscribe!

Thank you! Your submission has been received!
Oops! Something went wrong while submitting the form.