PCF - Nutch on Wrangler

A Portable Crawling Framework (PCF) for running Apache Nutch 1.x on TACC Wrangler, an NSF-funded supercomputer.

This started as part of another project, "Crawl Evaluation", in which we evaluated Apache Nutch v1.12 on Wrangler in both Hadoop and local mode, pushing the crawler to its limits to get the best throughput. It also covers some of the more challenging tasks: broad crawling, focused crawling, intelligent crawling, domain discovery, and more.

PCF provides a crawling workspace for Wrangler that is both automated and portable. It is now also integrated with Apache Kafka. More details can be found in the respective README files.

Quick Links