How to Scrape Amazon Best Seller Data in Python

The easiest way to get the top 30 books from an Amazon Best Seller list

Reading is the fuel of so many skills you learn in life. You gain the most out of it by reading the best books.

The best-seller list is one of the best ways to get a sense of what is worth reading and buying on a certain topic.

You've probably copied and pasted book data like the title, author, reader rating, and ranking from the web into a spreadsheet by hand.

Copying and pasting are cumbersome, error-prone, and time-consuming.

Or you might need the data to decide which books to buy. That's actually my case.

I like to read the best books on my favorite topics. So I made this script to get the best books in a CSV file quickly and easily and I wanted to share it with you.

This quick tutorial will help you get started. You'll learn how to scrape data from an Amazon bestseller page and export it to a spreadsheet. You can apply the same strategy to any bestseller page.

Use the code in your own interest (Licensed as Creative Commons Zero).

Let's get started.

Requirements

All you need to install are two libraries: BeautifulSoup and pandas. I assume you have Python 3 and pip installed.

If you haven't already, please do so by running the following command on your terminal:

$ pip install beautifulsoup4 pandas

Then create a new Python script and import both, along with the standard library module urllib:

from urllib.request import urlopen, Request
from bs4 import BeautifulSoup
import pandas as pd

Here you use:

  • urlopen to open the page.
  • Request to create a request object to communicate with the server.
  • BeautifulSoup to parse the HTML.
  • pandas to export the data to a CSV file.

Requesting the Page

You need to request the page. First pass the URL to the urlopen function:

# Page for best sellers in writing (Authorship subcategory)
url = 'https://www.amazon.com/gp/bestsellers/books/11892'
request = Request(url, headers={'User-agent': 'Mozilla/5.0'})
html = urlopen(request)

Note: The User-agent header is required to prevent Amazon from blocking your request. If you don't pass it, you will get this error: urllib.error.HTTPError: HTTP Error 503: Service Unavailable.
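To make that failure mode easier to handle, you can wrap the request in a small helper that catches the HTTPError instead of crashing. This is a hedged sketch, not part of the original script; the fetch name and the fallback behavior are my own choices:

```python
from urllib.error import HTTPError
from urllib.request import Request, urlopen

def fetch(url):
    """Fetch a page with a browser-like User-agent to avoid a 503."""
    request = Request(url, headers={'User-agent': 'Mozilla/5.0'})
    try:
        return urlopen(request)
    except HTTPError as err:
        # Amazon answers 503 Service Unavailable when it rejects a request
        print(f"Request failed: HTTP {err.code} {err.reason}")
        return None
```

With this helper, a blocked request prints a message and returns None instead of raising, so the rest of your script can decide what to do.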

Parsing the HTML

Now the html variable contains the server's response. urlopen doesn't understand HTML; it just returns the raw markup.

So you need to parse it with BeautifulSoup. Doing so will let you scrape the data depending on the structure of the parsed HTML.

soup = BeautifulSoup(html, 'html.parser')

Next, go to the page in the browser and look at the structure of the HTML. Press Ctrl+Shift+C (Cmd+Shift+C on macOS) to open the element inspector. When you click the item whose structure you want to see, its HTML tag appears in the Elements tab.

Scraping Amazon Best Seller Page

You want to grab the list of best-seller items that you can loop over.

After you inspect the HTML of an item in the browser, you will find that each item has a div tag with the id attribute gridItemRoot. Use the find_all() method to get all the div tags with that id. This is what we want to loop over to get information about each book.

books = soup.find_all('div', id="gridItemRoot")

Side note: Amazon should have a better way to structure their HTML than this. This id attribute is not unique as it's supposed to be; it should be a class attribute.
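Despite the duplicate ids, find_all() still returns every match. Here is a minimal, self-contained illustration with made-up markup mimicking the page's structure:

```python
from bs4 import BeautifulSoup

# Made-up markup mimicking the page's repeated-id structure
html = '''
<div id="gridItemRoot">Book One</div>
<div id="gridItemRoot">Book Two</div>
<div id="gridItemRoot">Book Three</div>
'''

soup = BeautifulSoup(html, 'html.parser')
books = soup.find_all('div', id="gridItemRoot")

print(len(books))  # 3 -- find_all returns every match, not just the first
```

So even though duplicate ids are invalid HTML, the parser doesn't mind and the scraping approach works.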

Anyway, let's continue.

Now, you loop over each book. Focus on one item at a time and get the information you want to get from the HTML structure.

for book in books:
    rank = book.find('span', class_="zg-bdg-text").get_text().replace('#', '')
    print(rank)
    title = book.find(
        'div',
        class_="_p13n-zg-list-grid-desktop_truncationStyles_p13n-sc-css-line-clamp-1__1Fn1y"
    ).get_text(strip=True)
    print(f"Title: {title}")

In this example, you get the rank from the span tag with the class zg-bdg-text. You then do a small cleanup on the result (e.g. #1), removing the leading # with replace().

Similar to getting the rank result, you get the title with the HTML tag div with the associated long class.

Then you get the author and rating info:

for book in books:
    ...
    author = book.find('div', class_="a-row a-size-small").get_text(strip=True)
    print(f"Author: {author}")
    r = book.find('div', class_="a-icon-row").find('a', class_="a-link-normal")
    rating = r.get_text(strip=True).replace(' out of 5 stars', '') if r else None

You want just the rating number, so you strip the trailing text from the result string.
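For example, with a hypothetical rating string in the format the page uses:

```python
# A hypothetical rating string in the format the page uses
raw_rating = '4.7 out of 5 stars'
rating = raw_rating.replace(' out of 5 stars', '')
rating_value = float(rating)  # convert if you want to sort or filter numerically

print(rating)  # '4.7'
```

Converting to float is optional, but it makes the column usable for numeric sorting and filtering later.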

Export to CSV

To export the data to a CSV file, you need to declare empty lists before the for loop and append the scraped values to each list inside the loop.
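The pattern looks like this. The placeholder rows below stand in for the values scraped in the loop above; in your script, the append calls go inside that loop:

```python
# Accumulator lists, declared before the loop
ranks, titles, authors, ratings = [], [], [], []

# Placeholder rows standing in for the scraped values
scraped_rows = [
    ('1', 'Example Book A', 'Author A', '4.8'),
    ('2', 'Example Book B', 'Author B', '4.6'),
]

for rank, title, author, rating in scraped_rows:
    ranks.append(rank)
    titles.append(title)
    authors.append(author)
    ratings.append(rating)

print(ranks)  # ['1', '2']
```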

Then you create a pandas DataFrame from those lists and export it to a CSV file:

pd.DataFrame({
    'Rank': ranks,
    'Title': titles,
    'Author': authors,
    'Rating': ratings
}).to_csv('best_seller.csv', index=False)

Final Thoughts

This is a quick tutorial to get you started with scraping data in general. We've seen how to get the data from the Amazon bestseller page, parse the HTML, and export the result to a CSV file.

Note: There is a limitation to this code because we've used BeautifulSoup which is just an HTML parser. You may notice that the page contains 50 books in the bestseller list while we get only 30.

That's because the page uses JavaScript: the books after the first 30 are only loaded when you scroll down, so the initial HTML contains just 30.

If you want the whole 50-book list, you need a tool that can execute JavaScript. Consider using Selenium, which drives a real browser, or Scrapy with a JavaScript-rendering plugin.

But this is a quick tutorial to get you started. 30 books is a good number.

Published on Medium
