How to Clean JSON Data at the Command Line

Created on Nov 14, 2020
Updated on Mar 6, 2021

jq is a lightweight command-line JSON processor written in C. It follows the Unix philosophy: it focuses on one thing and does it very well. In this tutorial, we see how jq can be used to clean JSON data, retrieve the information we want, and get rid of what we don't.

Some data is better suited to JSON than to CSV or any other format. Most modern APIs and NoSQL databases support JSON, and it is also useful when your data is hierarchical: a tree that can nest to any depth, unlike CSV, which is two-dimensional and can only hold tabular data.

Chatbots: Intent Recognition Dataset

Today, we’re investigating a JSON file (from Kaggle) that contains intent recognition data. Please download it, as this is the file we work on in this tutorial.

Prerequisite

With Homebrew

If you’re using macOS, try this:

$ brew install jq

or this if you want the latest version:

$ brew install --HEAD jq

From GitHub

$ mkdir github
$ cd github
$ git clone https://github.com/stedolan/jq.git
$ cd jq
$ autoreconf -i
$ ./configure --disable-maintainer-mode
$ make

Windows with choco

$ choco install jq

If you need more information on how to install jq, please check out the installation page in the jq wiki.
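Whichever route you take, a quick sanity check like the one below can confirm the install worked. This is only a sketch: the exact version string depends on your installation, and the one-key JSON document is made up for illustration.

```shell
# Confirm that jq is on the PATH and print its version string.
jq --version

# Pipe a tiny made-up JSON document through jq and extract one key.
echo '{"ok": true}' | jq '.ok'   # prints: true
```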

Filtering JSONs by indexing

For this chatbot data, we have a set of possible conversation intents for the chatbot to use. Each intent has multiple keys, such as the intent type, the text that the user can type, the responses the chatbot should reply with, and so on.

Identity operator: .

Let’s now experiment with jq by using the identity filter '.', which returns its input unchanged.
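Here is a minimal sketch of the identity filter, assuming jq is installed. The file name sample_intents.json, the top-level intents key, and the values inside are hypothetical stand-ins for the downloaded dataset, whose exact name and layout may differ.

```shell
# Create a tiny hypothetical stand-in for the dataset (assumed layout).
cat > sample_intents.json <<'EOF'
{"intents": [
  {"intent": "Greeting", "text": ["Hi", "Hello"], "responses": ["Hello there"]},
  {"intent": "Thanks", "text": ["Thanks", "Thank you"], "responses": ["You are welcome."]}
]}
EOF

# The identity filter '.' returns the input unchanged, pretty-printed.
jq '.' sample_intents.json
```

Even with a filter that changes nothing, this is handy for pretty-printing a minified file or checking that it is valid JSON.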

Array index: .[0]

Now let’s see the content of the first object in the array.
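A sketch of array indexing, under the assumption that the intents sit in an array under a top-level intents key. The sample file below is a hypothetical stand-in, not the real dataset.

```shell
# Create a tiny hypothetical stand-in for the dataset (assumed layout).
cat > sample_intents.json <<'EOF'
{"intents": [
  {"intent": "Greeting", "text": ["Hi", "Hello"], "responses": ["Hello there"]},
  {"intent": "Thanks", "text": ["Thanks", "Thank you"], "responses": ["You are welcome."]}
]}
EOF

# .intents selects the array; [0] picks its first element (zero-based).
jq '.intents[0]' sample_intents.json
```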

Object Identifier-Index: .foo.bar

We can also chain indexes. Let’s get the first intent type.
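A sketch of chaining an array index with a key lookup. The sample file and its intents key are hypothetical stand-ins for the real dataset.

```shell
# Create a tiny hypothetical stand-in for the dataset (assumed layout).
cat > sample_intents.json <<'EOF'
{"intents": [
  {"intent": "Greeting", "text": ["Hi", "Hello"], "responses": ["Hello there"]},
  {"intent": "Thanks", "text": ["Thanks", "Thank you"], "responses": ["You are welcome."]}
]}
EOF

# First element of the array, then its "intent" key.
jq '.intents[0].intent' sample_intents.json      # prints: "Greeting"

# -r (--raw-output) drops the JSON quotes around strings.
jq -r '.intents[0].intent' sample_intents.json   # prints: Greeting
```

The raw form is usually the one you want when piping the result into other command-line tools.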

Array/Object Value Iterator: .[]

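What if we want to get all the intent types that this chatbot can understand? The value iterator .[] returns every element of an array, so we can read the same key from each one.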
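A sketch of the value iterator, again on a hypothetical sample file that assumes the intents live in an array under an intents key.

```shell
# Create a tiny hypothetical stand-in for the dataset (assumed layout).
cat > sample_intents.json <<'EOF'
{"intents": [
  {"intent": "Greeting", "text": ["Hi", "Hello"], "responses": ["Hello there"]},
  {"intent": "Thanks", "text": ["Thanks", "Thank you"], "responses": ["You are welcome."]}
]}
EOF

# .intents[] emits each element of the array in turn;
# .intent then reads the same key from every element, one result per line.
jq '.intents[].intent' sample_intents.json
```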

Filtering out specific value(s)

select(boolean_expression)

One of the most useful functions of jq is select, which passes its input through only when the boolean expression is true.

We can use it to filter useful information. For example, let’s get the object of the Thanks intent.
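A sketch of select on a hypothetical sample file. The intents key and the values are assumptions about the dataset layout, chosen only for illustration.

```shell
# Create a tiny hypothetical stand-in for the dataset (assumed layout).
cat > sample_intents.json <<'EOF'
{"intents": [
  {"intent": "Greeting", "text": ["Hi", "Hello"], "responses": ["Hello there"]},
  {"intent": "Thanks", "text": ["Thanks", "Thank you"], "responses": ["You are welcome."]}
]}
EOF

# Iterate over the intents and keep only the object whose
# "intent" key equals the string "Thanks".
jq '.intents[] | select(.intent == "Thanks")' sample_intents.json
```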

Let’s get just the responses out of it.
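A sketch of piping the selected object into a key lookup. The sample file, the intents key, and the responses key are hypothetical stand-ins for the real dataset.

```shell
# Create a tiny hypothetical stand-in for the dataset (assumed layout).
cat > sample_intents.json <<'EOF'
{"intents": [
  {"intent": "Greeting", "text": ["Hi", "Hello"], "responses": ["Hello there"]},
  {"intent": "Thanks", "text": ["Thanks", "Thank you"], "responses": ["You are welcome."]}
]}
EOF

# Select the Thanks object, then pipe it into .responses
# to keep only that key's value (an array of strings).
jq '.intents[] | select(.intent == "Thanks") | .responses' sample_intents.json
```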

The intent key in the last example holds a single value. When a key holds multiple values, like text, we need the object value iterator .[] to compare against each of them.
For example, let’s see if the text in any object contains the literal “Can you see me?”.
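A sketch of matching against an array-valued key. The sample file and the SeeMe intent name are hypothetical; only the literal “Can you see me?” comes from the text above.

```shell
# Create a tiny hypothetical stand-in for the dataset (assumed layout).
cat > sample_intents.json <<'EOF'
{"intents": [
  {"intent": "Greeting", "text": ["Hi", "Hello"], "responses": ["Hello there"]},
  {"intent": "SeeMe", "text": ["Can you see me?", "Do you see me?"], "responses": ["I cannot see you"]}
]}
EOF

# .text[] emits every string in the text array; select keeps the
# object when one of those strings equals the literal.
jq '.intents[] | select(.text[] == "Can you see me?")' sample_intents.json

# Variant: any(...) yields a single true/false, so the object is
# emitted at most once even if the literal appears several times.
jq '.intents[] | select(any(.text[]; . == "Can you see me?"))' sample_intents.json
```

The first form emits the object once per matching string, which is fine when the literal appears only once in text. The any variant is a safer choice when duplicates are possible.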

Filtering nested objects from JSON

jq can reach nested objects by chaining key names, each preceded by '.'.

So .extension.responses is equivalent to | .extension.responses (the output of the previous filter is piped into the nested lookup), which is also equivalent to .extension | .responses
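The equivalence can be checked with a sketch like this one. The sample file, with an extension object that holds its own responses key, is a hypothetical stand-in for the real dataset.

```shell
# Create a tiny hypothetical stand-in for the dataset (assumed layout).
cat > sample_intents.json <<'EOF'
{"intents": [
  {"intent": "SeeMe", "text": ["Can you see me?"], "responses": ["Hmm"],
   "extension": {"function": "", "entities": false, "responses": ["I cannot see you"]}}
]}
EOF

# Three spellings of the same nested lookup; each prints the same array.
jq '.intents[0].extension.responses' sample_intents.json
jq '.intents[0] | .extension.responses' sample_intents.json
jq '.intents[0].extension | .responses' sample_intents.json
```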

Deleting specific keys from JSON

del(path_expression)

Let’s delete the context, extension, entityType, and entities keys:

Note here that multiple keys can be separated by commas:

del(.context,.extension,.entityType,.entities)
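A sketch of del on a hypothetical sample file. The four key names come from the text above, while the values and the surrounding intents array are assumptions about the dataset layout.

```shell
# Create a tiny hypothetical stand-in for the dataset (assumed layout).
cat > sample_intents.json <<'EOF'
{"intents": [
  {"intent": "Greeting", "text": ["Hi"], "responses": ["Hello there"],
   "extension": {"function": "", "entities": false, "responses": []},
   "context": {"in": "", "out": "", "clear": false},
   "entityType": "NA", "entities": []}
]}
EOF

# Emit each intent object without the four unwanted keys.
jq '.intents[] | del(.context, .extension, .entityType, .entities)' sample_intents.json

# Variant: delete the keys in place and keep the outer "intents" wrapper.
jq 'del(.intents[] | (.context, .extension, .entityType, .entities))' sample_intents.json
```

The first form outputs a stream of cleaned objects. The second keeps the document in one piece, which is easier to redirect into a new file.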

Final thoughts

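From our experiment with the chatbot intent JSON data, we learned how to clean JSON data by filtering with indexes, filtering out specific values with select, retrieving nested objects, and deleting specific keys with del.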

I first saw jq in Data Science at the Command Line. I love this book!

This book shows how capable the command line is for data science tasks: you can obtain your data, manipulate it, explore it, and make predictions on it using the command line. If you are a data scientist, aspire to be one, or want to know more about the field, I highly recommend this book. You can read it online for free on its website or order an ebook or paperback.


Take care, will see you in the next tutorials :)

Peace!


Resources

Published on Medium