Press "Enter" to skip to content

Recipe Scraping with Python

I am working on a new project with the aim of scraping a wide variety of data sources, with goals as various as arrest data and recipe websites.

I am going to catalog the project’s process through the various issues, and try to encapsulate as much as possible so that if there are components other’s want to use they can.

Here’s the process I am taking:

  1. Decide on a language.
    • What languages are well supported for web scraping?
    • Plenty, but of them I am the most familiar with python.
  2. Decide on feature set.
    • We are going with a headless HTML scraper for now, javascript will be supported in future versions.
    • We need to populate some basic datapoints for a recipe, but we’ll progressively enhance sites that don’t have the things we want.
  3. Build initial targeted list of sites to scrape
    • This was done by my partner in crime, more on that later.
    • >200 sites listed to target specific delicious recipes.
  4. Find some useful resources for scraping sites with python.
  5. Construct simple loop for testing/validatign data.

This Sunday’s execution:

  1. Get an isolated VM up.
  2. Install python3 and python3-venv on debian.
  3. Find there’s actually a few good tools which implement this in a more general way: https://github.com/hhursev/recipe-scrapers
  4. Test that out on a specific site (sorry, none of those links just yet!)
  5. Find that recipe-scrapers actually supports it… automagically, even though its not on their supported site list.
  6. Well… that was easy.

So I didn’t expect a great package to be available out of the box for the problem at hand, so kudos to these wonderful devs.

For my next post I will test combining spidering and downloading to create a cohesive “cookbook” from a target site.

Be First to Comment

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.