jq is a lightweight command-line JSON processor written in C. It follows the Unix philosophy of doing one thing and doing it well. In this tutorial, we see how jq can be used to clean JSON data: retrieving the information we want and getting rid of what we don't.
Some data fits JSON better than CSV or any other format. Most modern APIs and NoSQL databases support JSON, and it is especially useful when your data is hierarchical. JSON can represent trees nested to any depth, whereas CSV is two-dimensional and can only hold tabular data.
Chatbots: Intent Recognition Dataset
Today, we’re investigating a JSON file (from Kaggle) that contains intent recognition data. Please download it as this is the file we’re working on in this tutorial.
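If you're using macOS, install jq with Homebrew by running brew install jq, or brew install --HEAD jq if you want the latest development version.
On Windows, you can use Chocolatey: choco install jq
If you need more information on how to install jq, please check out the installation page in the jq wiki.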
Filtering JSONs by indexing
This chatbot data lists the possible intents of a conversation. Each intent has multiple keys, such as the intent type, the text that the user can type, the responses that the chatbot should reply with, and so on.
Identity operator: .
Let's now experiment with jq, starting with the identity filter (.), which passes its input through unchanged.
Array index: .[0]
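Here is a minimal sketch of the identity filter. Since the downloaded file isn't reproduced here, it uses a tiny hypothetical sample held in a shell variable; the top-level intents key is an assumption about the file's shape, and the values are made up.

```shell
# Hypothetical two-intent sample; the top-level "intents" key is an assumption
# about the shape of the Kaggle file, and the values are invented.
sample='{"intents":[{"intent":"Greeting","text":["Hi"],"responses":["Hello!"]},{"intent":"Thanks","text":["Thank you"],"responses":["No problem!"]}]}'

# The identity filter returns the input unchanged; jq pretty-prints it by default.
printf '%s\n' "$sample" | jq '.'
```

With the downloaded file, you would pass its path as the last argument to jq instead of piping the sample in.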
And let’s see the content of the first object:
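A small sketch of array indexing, again on a hypothetical sample (the top-level intents key and the values are assumptions, not taken from the real file):

```shell
# Hypothetical sample; "intents" as the top-level key is an assumption.
sample='{"intents":[{"intent":"Greeting","text":["Hi"],"responses":["Hello!"]},{"intent":"Thanks","text":["Thank you"],"responses":["No problem!"]}]}'

# .intents[0] picks the first element of the intents array (indexes start at 0).
printf '%s\n' "$sample" | jq '.intents[0]'
```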
Object Identifier-Index: .foo.bar
We can also chain a key onto an index. Let's get the first intent type:
Array/Object Value Iterator: .[]
What if we want to get all the intent types that this chatbot can understand?
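As a sketch on a made-up sample (the intents key and the values are assumptions), chaining .intent after the index returns just that field:

```shell
# Hypothetical sample; "intents" as the top-level key is an assumption.
sample='{"intents":[{"intent":"Greeting","text":["Hi"]},{"intent":"Thanks","text":["Thank you"]}]}'

# Index into the array, then into the object: prints "Greeting" (with quotes).
printf '%s\n' "$sample" | jq '.intents[0].intent'

# -r prints the raw string without the JSON quotes.
printf '%s\n' "$sample" | jq -r '.intents[0].intent'
```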
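One way to sketch this, assuming the same hypothetical sample shape as before (a top-level intents array with made-up values), is to leave the brackets empty so jq iterates over every element:

```shell
# Hypothetical sample; "intents" as the top-level key is an assumption.
sample='{"intents":[{"intent":"Greeting","text":["Hi"]},{"intent":"Thanks","text":["Thank you"]}]}'

# .intents[] emits each element of the array; .intent is then read from each one.
printf '%s\n' "$sample" | jq '.intents[].intent'
```

Each intent type is printed on its own line, one result per array element.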
Filtering out specific value(s)
One of the most useful functions in jq is select, which passes its input through only when the given condition is true.
We can use it to filter out the information we need. For example, let's get the object of the Thanks intent:
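A minimal sketch, using a hypothetical sample (the intents key and the values are assumptions): iterate over the intents and keep only the one whose intent equals Thanks.

```shell
# Hypothetical sample; "intents" as the top-level key is an assumption.
sample='{"intents":[{"intent":"Greeting","text":["Hi"],"responses":["Hello!"]},{"intent":"Thanks","text":["Thank you"],"responses":["No problem!"]}]}'

# select() keeps only the objects for which the comparison is true.
printf '%s\n' "$sample" | jq '.intents[] | select(.intent == "Thanks")'
```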
Let’s just get the response out of it:
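Sketched on the same kind of made-up sample (the intents key is an assumption), piping the selected object into .responses keeps only that key's value:

```shell
# Hypothetical sample; "intents" as the top-level key is an assumption.
sample='{"intents":[{"intent":"Greeting","responses":["Hello!"]},{"intent":"Thanks","responses":["No problem!"]}]}'

# Select the Thanks object, then pipe it into .responses.
printf '%s\n' "$sample" | jq '.intents[] | select(.intent == "Thanks") | .responses'
```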
The intent key in the last example holds a single value. What if a key holds multiple values, like text? We then need the array/object value iterator .[] to look inside it.
For example, let’s see if a text in any object has the literal “Can you see me?”:
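Here is a rough sketch of that check. The sample is hypothetical: the intents key is an assumption, and the intent name Vision is invented for illustration.

```shell
# Hypothetical sample; the "intents" key and the "Vision" intent name are invented.
sample='{"intents":[{"intent":"Greeting","text":["Hi","Hello"]},{"intent":"Vision","text":["Can you see me?","Do you see me?"]}]}'

# .text[] iterates over every text value; select() keeps an object when one of them matches.
printf '%s\n' "$sample" | jq '.intents[] | select(.text[] == "Can you see me?")'
```

Note that select runs once per text value, so an object would be printed more than once if the same literal appeared in its text array several times.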
Filtering nested objects from JSON
jq can reach nested objects by chaining keys, each one preceded by a dot.
So .extension.responses appended directly to a filter is equivalent to piping into it with | .extension.responses (the output of the previous filter becomes the input of the next), which is also equivalent to .extension | .responses
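To see that these forms agree, here is a small sketch on a hypothetical sample (the intents key and the values inside extension are assumptions):

```shell
# Hypothetical sample; the "intents" key and the nested values are assumptions.
sample='{"intents":[{"intent":"Greeting","extension":{"responses":["Hi from the extension"]}}]}'

# Three ways to write the same path into the nested object.
a=$(printf '%s\n' "$sample" | jq -c '.intents[0].extension.responses')
b=$(printf '%s\n' "$sample" | jq -c '.intents[0] | .extension.responses')
c=$(printf '%s\n' "$sample" | jq -c '.intents[0].extension | .responses')

# All three lines printed here are identical.
printf '%s\n%s\n%s\n' "$a" "$b" "$c"
```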
Deleting specific keys from JSON
Let's delete the context, extension, entityType, and entities keys with the del function.
Note that del accepts multiple paths separated by commas.
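A minimal sketch of the deletion, on a hypothetical one-intent sample (the intents key and all the placeholder values are assumptions; only the key names come from the tutorial):

```shell
# Hypothetical sample; the "intents" key and the placeholder values are assumptions.
sample='{"intents":[{"intent":"Greeting","text":["Hi"],"responses":["Hello!"],"extension":{"responses":[]},"context":{},"entityType":"NA","entities":[]}]}'

# del() takes one or more comma-separated paths and removes them from every intent.
printf '%s\n' "$sample" | jq 'del(.intents[].context, .intents[].extension, .intents[].entityType, .intents[].entities)'
```

Only the intent, text, and responses keys remain in the output. jq writes the result to stdout and does not modify its input, so redirect the output to a new file if you want to keep it.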
From our experiment with the chatbot intent data, we learned how to clean JSON data by:
- filtering out specific information from a JSON file by indexing with the identity operator, array index, object identifier-index, and array/object value iterator
- filtering out specific values inside an object using the select function, and reaching nested objects by piping the output of one filter into the desired key(s)
- deleting specific keys from a JSON file using the del function
I first saw jq in Data Science at the Command Line. I love this book!
Disclosure: The Amazon links for the book (in this section) are paid links, so if you buy the book, I will earn a small commission.
This book draws your attention to the power of the command line for data science tasks - meaning you can obtain your data, manipulate it, explore it, and make predictions on it using the command line. If you are a data scientist, aspire to be one, or want to know more about the field, I highly recommend this book. You can read it online for free on its website or order an ebook or paperback.
Take care, will see you in the next tutorials :)
- Chatbots: Intent Recognition Dataset
- 4 Reasons You Should Use JSON Instead of CSV
- How to install jq
- jq repo by the author Stephen Dolan
- Guide to Linux jq Command for JSON Processing
- jq cookbook
- JSON on the command line with jq