Skip to content

Simple hypertext and hyperlink web crawler that outputs a sitemap file. PyCon 2016 Poster.

Notifications You must be signed in to change notification settings

jshields/py_sitemapper

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

23 Commits
 
 
 
 
 
 
 
 

Repository files navigation

py_sitemapper

Simple hypertext and hyperlink web crawler that outputs a sitemap.xml file.

URL parameters, query string, and document fragment identifier (hash navigation) do not currently carry to the output sitemap file.

Compatibility: Python 2.7+

Requires requests module: pip install requests

Example: python cli.py http://jshields.github.io/showcase/sitemapper_target/home.htm

How it Works:

  • Make a request to a URL to get its markup.
  • Search the HTML for anchor tags and retrieve their hypertext references / link URLs.
  • Appropriately filter and save those links in sitemap file format.
  • Future enhancements: warn users about 404 Not Found links picked up by the crawler.

Resources:

About

Simple hypertext and hyperlink web crawler that outputs a sitemap file. PyCon 2016 Poster.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages