I am working on a new project with the aim of scraping a wide variety of data sources, ranging from arrest data to recipe websites.
I am going to document the project’s process through its various issues, and try to encapsulate as much as possible so that if there are components others want to use, they can.
Here’s the process I am taking:
- Decide on a language.
- What languages are well supported for web scraping?
- Plenty, but of them python is the one I am most familiar with.
- Decide on feature set.
- We need to populate some basic datapoints for a recipe, but we’ll progressively enhance sites that don’t have the things we want.
- Build an initial targeted list of sites to scrape.
- This was done by my partner in crime, more on that later.
- >200 sites listed to target specific delicious recipes.
- Find some useful resources for scraping sites with python.
- Why go further than the tutorial for the tool you are using? https://docs.scrapy.org/en/latest/intro/tutorial.html
- Construct a simple loop for testing/validating data.
- For each site, download a recipe manually, format it to match our expectations for the intermediate format, and wire up a test that fails until we produce that output correctly.
- Ensure the scraped recipe correctly matches our expectations in that test.
- Find the right granularity for storing rules for custom sites – some sort of lookup + fallback scheme.
- Research the common formats for recipes – schema.org/Recipe markup (most often embedded as JSON-LD) is the big one.
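The lookup + fallback idea and the testing loop above can be sketched together. Everything here is hypothetical and illustrative – the site names, parser functions, and fixture data are invented, and a real version would load downloaded HTML from disk rather than inline strings:

```python
# A minimal sketch of the lookup + fallback scheme: site-specific parser
# functions keyed by domain, with a generic parser as the default.
# All names (parse_tastysite, tastysite.example, ...) are invented.
from urllib.parse import urlparse

def parse_generic(html: str) -> dict:
    """Fallback parser: pull out a bare-bones recipe."""
    return {"title": html.strip().splitlines()[0], "ingredients": []}

def parse_tastysite(html: str) -> dict:
    """Site-specific rules for a made-up site with a known layout."""
    lines = [line.strip() for line in html.strip().splitlines()]
    return {"title": lines[0], "ingredients": lines[1:]}

# The lookup table: domain -> parser, falling back to the generic one.
PARSERS = {
    "tastysite.example": parse_tastysite,
}

def parse_recipe(url: str, html: str) -> dict:
    domain = urlparse(url).netloc
    parser = PARSERS.get(domain, parse_generic)
    return parser(html)

# The testing loop: compare each manually prepared fixture against
# what the parser currently produces.
FIXTURES = [
    (
        "https://tastysite.example/chili",
        "Best Chili\n2 cans beans\n1 onion",
        {"title": "Best Chili", "ingredients": ["2 cans beans", "1 onion"]},
    ),
]

for url, html, expected in FIXTURES:
    actual = parse_recipe(url, html)
    assert actual == expected, f"{url}: {actual!r} != {expected!r}"
```

Adding a site then becomes: drop a fixture into the list, watch the test fail, and add or refine a parser until it passes.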
This Sunday’s execution:
- Get an isolated VM up.
- Install python3 and python3-venv on Debian.
- Find that there are actually a few good tools which implement this in a more general way: https://github.com/hhursev/recipe-scrapers
- Test that out on a specific site (sorry, none of those links just yet!)
- Find that recipe-scrapers actually supports it… automagically, even though it’s not on their supported site list.
- Well… that was easy.
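My guess as to why the “automagic” support works: most recipe sites embed a schema.org/Recipe object as JSON-LD in the page, so a generic scraper can read it without site-specific rules. Here is a stdlib-only sketch of that idea – this is not recipe-scrapers’ actual implementation, and the HTML page below is an invented fixture:

```python
# Extract a schema.org/Recipe object from JSON-LD in a page, using only
# the standard library. Illustrates why generic recipe scraping is
# possible; not how recipe-scrapers itself is implemented.
import json
from html.parser import HTMLParser
from typing import Optional

class JsonLdExtractor(HTMLParser):
    """Collects the contents of <script type="application/ld+json"> tags."""
    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_jsonld = False

    def handle_data(self, data):
        if self._in_jsonld:
            self.blocks.append(data)

def extract_recipe(html: str) -> Optional[dict]:
    """Return the first schema.org/Recipe object found in the page, if any."""
    extractor = JsonLdExtractor()
    extractor.feed(html)
    for block in extractor.blocks:
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            continue
        candidates = data if isinstance(data, list) else [data]
        for item in candidates:
            if isinstance(item, dict) and item.get("@type") == "Recipe":
                return item
    return None

# An invented fixture standing in for a downloaded recipe page.
PAGE = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Recipe",
 "name": "Weeknight Chili",
 "recipeIngredient": ["2 cans beans", "1 onion"]}
</script>
</head><body>...</body></html>
"""

recipe = extract_recipe(PAGE)
assert recipe["name"] == "Weeknight Chili"
```

In practice you would just use the library itself – its entry point is `scrape_me(url)` in the versions I have seen – but it is nice to know roughly what the magic rests on.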
I didn’t expect a solid package to be available out of the box for the problem at hand, so kudos to these wonderful devs.
For my next post I will test combining spidering and downloading to create a cohesive “cookbook” from a target site.