What is ScrapeIt?

ScrapeIt is a configuration-driven web scraping framework for Python. Its goal is to make it easy to retrieve data from a web page and return the content.

Config Example

example.yaml

python_website:
  'urls':
    homepage: https://www.python.org/
  'selector': ['div', "class": "jobs-widget"]
  'selector_is_unique': true
  'use_regex': true
  'match_after': ['Jobs']
  'stop_matching_at': ['']
#  match_after_strftime: ['%A, %b %d, %Y', '', 0]
#  stop_matching_at_strftime: ['%A, %b, %d %Y', 1]
#  timezone: 'America/Los_Angeles'
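
The `match_after` and `stop_matching_at` keys suggest that the scraped text is trimmed between two markers. The sketch below shows one plausible reading of that behavior using only the standard library. The function `trim_between` is hypothetical and is not part of ScrapeIt; treating an empty marker as "no marker" is also an assumption.

```python
import re


def trim_between(text, match_after, stop_matching_at, use_regex=True):
    """Keep the text after the first start marker and before the first stop marker.

    Hypothetical helper, not ScrapeIt's implementation.
    Assumption: empty markers mean "no marker".
    """
    def find(marker, haystack):
        # Return the (start, end) span of the marker, or None if it is absent.
        pattern = marker if use_regex else re.escape(marker)
        found = re.search(pattern, haystack)
        return found.span() if found else None

    # Drop everything up to and including the first start marker that matches.
    for marker in match_after:
        if not marker:
            continue
        span = find(marker, text)
        if span:
            text = text[span[1]:]
            break

    # Drop everything from the first stop marker that matches onward.
    for marker in stop_matching_at:
        if not marker:
            continue
        span = find(marker, text)
        if span:
            text = text[:span[0]]
            break

    return text.strip()


sample = "Jobs Looking for work? Visit the job board. More"
print(trim_between(sample, ["Jobs"], [""]))
print(trim_between(sample, ["Jobs"], ["More"]))
```

Under these assumptions, the first call keeps everything after "Jobs", and the second also cuts the text off at "More".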

Usage Example

import os

from ruamel.yaml import YAML

import scrapeit


# Load the scraper settings with ruamel.yaml's safe loader.
yaml = YAML(typ='safe')
config_file = os.path.join('config', 'example.yaml')
with open(config_file) as file:
    config = yaml.load(file)

text = scrapeit.get_text('homepage', **config['python_website'])
print(text)

Out: "Looking for work or have a Python related position that you're trying to hire for? Our relaunched community-run job board is the place to go.jobs.python.org"
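
The call above passes the whole `python_website` section to `get_text` as keyword arguments. The sketch below illustrates that unpacking with a stand-in function so it runs without a network connection or ScrapeIt installed. `get_text_stub` and its signature are hypothetical, and the dictionary is an assumption about what loading example.yaml should produce.

```python
def get_text_stub(url_name, urls, selector, **options):
    """Hypothetical stand-in for scrapeit.get_text; it makes no network request."""
    # Look up the named URL in the 'urls' mapping from the config.
    url = urls[url_name]
    return {"url": url, "selector": selector, "options": options}


# Assumed result of loading example.yaml (written out by hand here).
config = {
    "python_website": {
        "urls": {"homepage": "https://www.python.org/"},
        "selector": ["div", {"class": "jobs-widget"}],
        "selector_is_unique": True,
        "use_regex": True,
        "match_after": ["Jobs"],
        "stop_matching_at": [""],
    }
}

# Each top-level key of the section becomes a keyword argument.
call = get_text_stub("homepage", **config["python_website"])
print(call["url"])
print(sorted(call["options"]))
```

Because of the `**` unpacking, every key in the config section must be a name the function accepts, which is why the stub collects the remaining keys in `**options`.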

Limitations

ScrapeIt cannot currently handle authentication.

ScrapeIt can currently retrieve only text.

JonnyFb421/scrapeit

Folders and files

Latest commit

History

Repository files navigation

What is ScrapeIt?

Config Example

Usage Example

Limitations

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages