Webscrapping-Indeed.com

The purpose of this project is to design a course curriculum for a new “Master of Business and Management in Data Science and Artificial Intelligence” program at the University of Toronto, with a focus not only on technical skills but also on business and soft skills. To achieve this goal, we extracted the skills that are in demand in the job market from job vacancies posted on the Indeed web portal, then applied clustering algorithms to group those skills into courses.

Part 1 uses geckodriver with the Arsenic library in Python to set up the web scraping process. It consists of the following general steps:

1. Checking whether the website allows web scraping
2. Obtaining the source code (HTML file) by using the website URL
3. Downloading the website content
4. Parsing the content using keywords and tags for the elements of interest
5. Extracting relevant data/features
6. Organizing raw data in a structured format (e.g., CSV)
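
The steps above can be sketched with the Python standard library alone. This is a minimal offline illustration, not the code from the notebook: it swaps Arsenic and geckodriver for an inline HTML string so that it runs without a browser or network access. The `robots.txt` rules, the `example.com` URL, the CSS class names (`job`, `title`, `skills`) and the helper names are all hypothetical; a real job page will use different markup.

```python
import csv
import io
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt and page markup, standing in for downloaded content.
ROBOTS_TXT = """User-agent: *
Disallow: /private/
"""
SAMPLE_HTML = """
<div class="job"><h2 class="title">Data Scientist</h2>
<p class="skills">Python, SQL, Communication</p></div>
<div class="job"><h2 class="title">ML Engineer</h2>
<p class="skills">Python, Docker</p></div>
"""


def is_allowed(robots_txt, url, agent="*"):
    """Step 1: check whether the robots.txt rules permit fetching the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


class JobParser(HTMLParser):
    """Steps 4 and 5: pick out elements of interest by their class attribute."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._field = None  # field currently being read, if any

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if cls == "job":
            # A new job card starts a new output row.
            self.rows.append({"title": "", "skills": ""})
        elif cls in ("title", "skills") and self.rows:
            self._field = cls

    def handle_endtag(self, tag):
        self._field = None

    def handle_data(self, data):
        if self._field:
            self.rows[-1][self._field] += data.strip()


def to_csv(rows):
    """Step 6: organize the extracted rows as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["title", "skills"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if is_allowed(ROBOTS_TXT, "https://example.com/jobs"):
    job_parser = JobParser()
    job_parser.feed(SAMPLE_HTML)
    print(to_csv(job_parser.rows))
```

In a live scraper, steps 2 and 3 would replace `SAMPLE_HTML` with the page source returned by the browser session.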

Part 2 covers exploratory data analysis (EDA) and feature engineering, after which unsupervised machine learning methods such as Hierarchical Clustering and K-Means are applied to group the skills. The results are then collated and used to provide domain-specific, evidence-based insights.
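
To illustrate the grouping idea, here is a small pure-Python sketch of K-Means applied to skills. It is an assumption-laden toy rather than the notebook's implementation, which may well rely on a library and different features. Each skill is represented by a 0/1 vector marking the postings that mention it, so skills that appear together end up close to each other. The postings, the helper names and the farthest-point initialization are all hypothetical choices made for a deterministic example; hierarchical clustering is not shown.

```python
# Hypothetical postings: each set lists the skills mentioned in one vacancy.
POSTINGS = [
    {"python", "sql", "statistics"},
    {"python", "sql"},
    {"communication", "leadership"},
    {"communication", "leadership", "negotiation"},
]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20):
    """Plain K-Means with a deterministic farthest-point initialization."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Next centroid: the point farthest from all centroids chosen so far.
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    labels = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def group_skills(postings, k):
    """Cluster skills by the postings they occur in; return label -> skills."""
    skills = sorted(set().union(*postings))
    vectors = [[1 if s in post else 0 for post in postings] for s in skills]
    groups = {}
    for skill, label in zip(skills, kmeans(vectors, k)):
        groups.setdefault(label, []).append(skill)
    return groups


print(group_skills(POSTINGS, 2))
```

On this toy data the two clusters separate the soft skills from the technical ones, which mirrors how clusters could be read as candidate courses.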

pitchika_curriculum_design.pdf contains the project summary and the learning outcomes from this project.

ayushi-p/webscrapping-indeed

Folders and files

Latest commit

History

Repository files navigation

Webscrapping-Indeed.com

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages