How to Clean Text Data at the Command Line

Created on Oct 14, 2020
Updated on Mar 6, 2021

Cleaning data is like cleaning the walls of your house: you clear the scribbles, remove the dust, and get rid of whatever makes the walls ugly. Cleaning data is the same: you filter what you want and remove what you don’t, so the raw data becomes useful and isn’t raw anymore. You can do the cleaning with Python, R, or whatever language you prefer, but in this tutorial I’m going to explain how you can clean text files at the command line, drawing insights from a paper that studied clickbait and non-clickbait headlines.

This tutorial is mainly motivated by Data Science at the Command Line

This book draws your attention to what the command line can do for data science tasks, meaning you can obtain your data, manipulate it, explore it, and make your predictions on it using the command line. If you are a data scientist, aspire to be one, or want to know more about the field, I highly recommend this book. You can read it online for free on its website or order the ebook or paperback. In this tutorial we’re going to focus on using the command line to clean our data.

Pulling and running the docker image

To remove the hassle of downloading the files we’ll work with and the dependencies we need, I’ve made a Docker image for you that has everything you need. You just pull it from Docker Hub and you’ll find everything ready, so you can focus on the cleaning part. So let’s pull that image and then run it interactively to enter the shell and write some command lines.

$ docker pull ezzeddin/clean-data
$ docker run --rm -it ezzeddin/clean-data

If using Docker is still unclear to you, see the tutorial on why we use Docker.

Cleaning text files

Let’s clean two text files containing clickbait and non-clickbait headlines, about 16,000 each. This data comes from a paper titled Stop Clickbait: Detecting and Preventing Clickbaits in Online News Media, presented at the 2016 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM). Our goal here is to get the most common words used in both clickbait and non-clickbait headlines.
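
Once you’re inside the container’s shell, a quick ls confirms the data files are there (the working directory may contain other files too, depending on the image):

$ ls
clickbait_data    non_clickbait_data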

These two text files, clickbait_data and non_clickbait_data, are what we’ll work with. Before diving in, let’s see the final output we want to get. For the clickbait data, we want the 20 most common words represented like this, with their counts:

[Figure: the 20 most common words in clickbait headlines, with their counts — image by the author]

And for the 20 most common words of non-clickbait headlines:

[Figure: the 20 most common words in non-clickbait headlines, with their counts — image by the author]

Let’s see how we can produce these histograms at the command line, step by step. After running the docker image, we’re now in a shell in the new environment. Let’s first see what the clickbait_data file contains by getting its first 10 lines:

$ head clickbait_data

So it seems this file contains the headlines that are labeled as clickbait, as you can see:

Should I Get Bings

Which TV Female Friend Group Do You Belong In

The New "Star Wars: The Force Awakens" Trailer Is Here To Give You Chills

This Vine Of New York On "Celebrity Big Brother" Is Fucking Perfect

A Couple Did A Stunning Photo Shoot With Their Baby After Learning She Had An Inoperable Brain Tumor

And if you use head to get the first lines of non_clickbait_data you’ll find:
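
$ head non_clickbait_data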

Bill Changing Credit Card Rules Is Sent to Obama With Gun Measure Included
In Hollywood, the Easy-Money Generation Toughens Up
1700 runners still unaccounted for in UK's Lake District following flood

Yankees Pitchers Trade Fielding Drills for Putting Practice
Large earthquake rattles Indonesia; Seventh in two days

Coldplay's new album hits stores worldwide this week

U.N. Leader Presses Sri Lanka on Speeding Relief to War Refugees in Camps

We’re interested in words here, not phrases, so let’s pull out the words of 3 or more letters with:

$ head clickbait_data | grep -oE '\w{3,}'

head clickbait_data is used here because, for now, we’re only computing statistics on the handful of headlines at the top of the file; its output is piped into the grep command, grep -oE '\w{3,}':

grep -oE: -o prints only the matching parts (the words themselves, not whole lines), and -E enables extended regular expressions for the pattern that follows.

'\w{3,}': this pattern is equivalent to '\w\w\w+' and matches whole words of 3 or more characters.
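
As a quick sanity check on one of the headlines we saw above, the pattern keeps only the words of 3 or more letters:

$ echo "Should I Get Bings" | grep -oE '\w{3,}'
Should
Get
Bings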

To get the counts of each word we need the unique words, which we can get with the uniq command and its -c option to print counts. But uniq only collapses adjacent duplicate lines, so you need to sort first:

$ head clickbait_data | grep -oE '\w{3,}' | sort | uniq -c
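
To see why the sort step matters, here is a tiny illustration (the exact spacing of the counts may vary between uniq implementations):

$ printf 'you\nthe\nyou\n' | uniq -c
      1 you
      1 the
      1 you
$ printf 'you\nthe\nyou\n' | sort | uniq -c
      1 the
      2 you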

That pipeline only covered the first 10 lines; let’s run it across the entire set of clickbait headlines:

$ cat clickbait_data | grep -oE '\w{3,}' | sort | uniq -c | sort -nr | head

sort -nr sorts the counts numerically in reverse order (largest first), and the final head keeps only the top 10. Here is the output:

   5538 You
   4983 The
   2538 Your
   1945 That
   1942 Are
   1812 This
   1645 And
   1396 For
   1326 What
   1242 Will

It looks like we’re getting close to a good shape; let’s see what we can do to clean it up further.

If we take a deeper look, we can see that we’re missing the lowercase and all-caps variants of these words.

For example, for the word ‘You’ we’re missing ‘you’ and we’re also missing ‘YOU’. Let’s check whether these words already exist:

$ cat clickbait_data | grep -oE '\w{3,}' | sort | uniq -c | sort -nr | grep you
$ cat clickbait_data | grep -oE '\w{3,}' | sort | uniq -c | sort -nr | grep YOU

So as we can see:

      1 your
      1 you
      1 YOUR
      1 YOU

We’re missing two occurrences each for ‘You’ and ‘Your’; counting them would bring their totals to 5540 and 2540 respectively.
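
A side note: grep you in the command above also matches any word that merely contains ‘you’, such as ‘your’; to check the exact word only, grep’s -w flag restricts matching to whole words:

$ cat clickbait_data | grep -oE '\w{3,}' | sort | uniq -c | sort -nr | grep -w you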

What we need to do first is convert every uppercase letter to lowercase using tr, a command-line utility that translates characters:

$ cat clickbait_data | tr '[:upper:]' '[:lower:]'| grep -oE '\w{3,}' \
| sort | uniq -c | sort -nr | head

tr '[:upper:]' '[:lower:]' here translates the contents of clickbait_data into lowercase. [:upper:] is a character class that represents all uppercase characters and [:lower:] is a character class that represents all lowercase characters.
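
As a quick illustration, using a phrase we’ll see later as typical clickbait:

$ echo 'Will Blow Your Mind' | tr '[:upper:]' '[:lower:]'
will blow your mind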

To prepend a header to these values, we can use sed to add two column names, one for each column:

$ cat clickbait_data | tr '[:upper:]' '[:lower:]'| grep -oE '\w{3,}' \
| sort | uniq -c | sort -nr | sed '1i count,word' | head 

sed '1i count,word' adds a header line, with count representing the number of occurrences and word, obviously, representing the word.

1i inserts that text before the first line of the input. Since sed is reading from a pipe here, the original file is not modified; the header only appears in the output stream. (This one-line form of i is a GNU sed extension; BSD sed expects the text on its own line after 1i\.)
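
On a tiny two-line stream, with counts taken from the output above, the effect looks like this:

$ printf '5540 you\n4992 the\n' | sed '1i count,word'
count,word
5540 you
4992 the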

Running the full pipeline outputs:

count,word 
5540 you
4992 the
2540 your
1950 that
1944 are
1812 this
1653 and
1397 for
1326 what

To print this in a prettier shape we can use csvlook, which gives us this:

| count        | word |
| ------------ | ---- |
|    5540 you  |      |
|    4992 the  |      |
|    2540 your |      |
|    1950 that |      |
|    1944 are  |      |
|    1812 this |      |
|    1653 and  |      |
|    1397 for  |      |
|    1326 what |      |

which is not pretty at all. The reason is that csvlook (part of the csvkit suite), as its name indicates, gives a better look to a CSV (Comma-Separated Values) file, so we need a CSV file first. We should therefore find a way to separate the values on each line with a comma. At this point, we can use awk, a pattern-directed scanning and processing language:

$ cat clickbait_data | tr '[:upper:]' '[:lower:]'| grep -oE '\w{3,}' \
| sort | uniq -c | sort -nr | awk '{print $1","$2}' | sed '1i count,word' | head | csvlook

awk ‘{print $1","$2}'

'{

print $1 here prints the first field (which is the count column) followed by…

"," a comma followed by…

$2 the second field which is the word column

}'
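
On a single line, the effect looks like this:

$ echo '5540 you' | awk '{print $1","$2}'
5540,you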

It seems we’re in much better shape now:

| count | word |
| ----- | ---- |
| 5,540 | you  |
| 4,992 | the  |
| 2,540 | your |
| 1,950 | that |
| 1,944 | are  |
| 1,812 | this |
| 1,653 | and  |
| 1,397 | for  |
| 1,326 | what |

If we want the word column in the first field and the count column in the second, we just need to swap the order in the awk and sed commands:

$ cat clickbait_data | tr '[:upper:]' '[:lower:]'| grep -oE '\w{3,}' \
| sort | uniq -c | sort -nr | awk '{print $2","$1}' | sed '1i word,count' | head | csvlook

In order to get the same output for the non-clickbait data, we just need to change the filename:

$ cat non_clickbait_data | tr '[:upper:]' '[:lower:]'| grep -oE '\w{3,}' \
| sort | uniq -c | sort -nr | awk '{print $2","$1}' | sed '1i word,count' | head | csvlook

Getting insight into the clickbait study

The study addresses both clickbait and non-clickbait headlines with the goal of detecting them automatically, because, as the paper reports:

Most of the online news media outlets rely heavily on the revenues generated from the clicks made by their readers, and due to the presence of numerous such outlets, they need to compete with each other for reader attention.

This tutorial, then, shows a way to clean up data at the command line so that we can get some insight into the paper’s results and see whether some of the points it claims show up in the data.

Again, the final distribution of the 20 most common words in the clickbait headlines of this data set is:

[Figure: the 20 most common words in clickbait headlines, with their counts — image by the author]

We can clearly see the heavy use of the second-person pronoun you, rather than third-person references like he, she, or a specific name, and of the possessive your, which shows up in phrases commonly used in clickbait such as ‘Will Blow Your Mind’. We can also find determiners like this and that, which, as the paper puts it, make the user inquisitive about the object being referenced and persuade them to pursue the article further.

On the other hand, the distribution of the 20 most common words in non-clickbait headlines is:

[Figure: the 20 most common words in non-clickbait headlines, with their counts — image by the author]

Here we can see non-possessive words like australian, president, and obama, along with other words that appear in both kinds of headlines.

Final thoughts

The clickbait paper goes into much deeper investigation than we did here, but we could still get some valuable insights with just one line at the command line. We learned how to use tr to translate characters, grep to extract words of 3 or more letters, sort and uniq to build a histogram of word occurrences, awk to print fields in the order we want, and sed to prepend a header to the stream we’re processing.

Thank you for reading this far!

If you want the whole package of cleaning data at the command line in PDF format, get my ebook below 👇

Get your ebook


More tutorials?

If you want to see more tutorial blog posts, check out:

https://www.ezzeddinabdullah.com/topic/tutorials

Published on Medium