RecipeRadar Crawler

The RecipeRadar crawler provides an abstraction layer over external recipe websites, returning data in a format which can be ingested into the RecipeRadar search engine.

Much of this is possible thanks to the open source recipe-scrapers library; any improvements, fixes, and site coverage added there will benefit the crawler service.
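
For example, here is a minimal sketch of scraping a single recipe page with the library; the URL is illustrative, and the exact entry point varies between recipe-scrapers releases:

import requests
from recipe_scrapers import scrape_html

# Fetch a recipe page; the URL here is purely illustrative
url = "https://example.com/recipes/tofu-stir-fry"
html = requests.get(url, timeout=10).text

# Parse the page into structured recipe data
scraper = scrape_html(html, org_url=url)
print(scraper.title())         # recipe title
print(scraper.ingredients())   # list of ingredient strings
print(scraper.instructions())  # preparation steps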

In addition, scripts are provided to crawl from two readily-available sources of recipe URLs:

  • openrecipes - a set of ~175k public recipe URLs
  • reciperadar - the set of recipe URLs already known to RecipeRadar

The reciperadar set is useful during changes to the crawling and indexing components of the RecipeRadar application itself; it provides a quick way to recrawl and reindex existing recipes.

Outbound requests are routed via squid to avoid burdening origin recipe sites with repeated content retrieval requests.
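
As a minimal sketch of this routing, assuming the crawler fetches pages with the Python requests library and that squid listens locally on its default port (3128):

import requests

# Route outbound requests through a local squid cache so that repeated
# retrievals of the same page are served from cache rather than the origin
proxies = {
    "http": "http://localhost:3128",
    "https": "http://localhost:3128",
}
response = requests.get("https://example.com/recipes/tofu-stir-fry",
                        proxies=proxies, timeout=10)
print(response.status_code)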

Install dependencies

Make sure to follow the RecipeRadar infrastructure setup to ensure all cluster dependencies are available in your environment.

Development

To install development tools and run linting and tests locally, execute the following command:

$ make lint tests

Local Deployment

To deploy the service to the local infrastructure environment, execute the following commands:

$ make
$ make deploy

Operations

Initial data load

To crawl and index openrecipes from scratch, execute the following commands:

$ cd openrecipes
$ make
$ venv/bin/python crawl.py

NB: This requires you to download the openrecipes dataset and extract it to a file named 'recipes.json'.
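
As a rough sketch of the crawl, assuming the dataset is newline-delimited JSON in which each record carries a url field; the field name and the submission step are assumptions rather than the actual crawl.py implementation:

import json

with open("recipes.json") as f:
    for line in f:
        record = json.loads(line)
        url = record.get("url")
        if url:
            print(url)  # in practice: submit the URL to the crawler service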

Recrawling and reindexing

To recrawl and reindex the entire known reciperadar recipe set, execute the following commands:

$ cd reciperadar
$ make
$ venv/bin/python crawl_urls.py --recrawl

To reindex reciperadar recipes containing products named tofu, execute the following commands:

$ cd reciperadar
$ make
$ venv/bin/python recipes.py --reindex --where "exists (select * from recipe_ingredients as ri join ingredient_products as ip on ip.ingredient_id = ri.id where ri.recipe_id = recipes.id and ip.product = 'tofu')"

NB: Running either of these commands without the --reindex / --recrawl argument runs in a 'safe mode', reporting the entities that match your query without performing any actions on them.
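
For example, the following reports the recipes matched by the tofu query above without reindexing them:

$ venv/bin/python recipes.py --where "exists (select * from recipe_ingredients as ri join ingredient_products as ip on ip.ingredient_id = ri.id where ri.recipe_id = recipes.id and ip.product = 'tofu')"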

Proxy selection

Individual websites may sometimes block or rate-limit the crawler; it's best to avoid making too many requests to any one website, and to be as respectful as possible of their operational and network costs.

In such cases it can be worth temporarily switching the crawler to use an anonymized proxy service. Until this is available as a configuration setting, it can be done by updating the crawler application code and redeploying the service.
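
As an illustration of the kind of change involved, here is a hypothetical sketch; the variable name and proxy URLs are assumptions and not the service's actual configuration:

# Default: route via the local squid cache
PROXIES = {
    "http": "http://localhost:3128",
    "https": "http://localhost:3128",
}

# Temporary switch: route via an anonymized proxy service, then redeploy
# PROXIES = {
#     "http": "http://anonymizing-proxy.example:8080",
#     "https": "http://anonymizing-proxy.example:8080",
# }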
