kzkedzierska/ngs22_unsupervised

This repository contains the materials for the workshop and lecture on Unsupervised Learning taught at NGSchool2022: Machine Learning in Computational Biology on 16-17.09 in Jablonna, Poland.

Authors: Kasia Kędzierska and Kaspar Märtens

Tutors

This tutorial, together with the preceding lecture [slides can be found here], was jointly prepared and taught by Kaspar Märtens and Kasia Kędzierska. Kasia is a final-year PhD student in Genomic Medicine and Statistics at the Wellcome Centre for Human Genetics at the University of Oxford. Kaspar recently finished his PhD in Statistical Machine Learning at the University of Oxford. Since then he has been a postdoctoral Research Fellow at the Alan Turing Institute and has worked at Apple Health AI. Currently, he is based at the Big Data Institute at the University of Oxford.

Outline

We split the two 90-minute sessions into a lecture and a workshop. In that time it is quite difficult to cover an area as vast as Unsupervised Learning. Our goal was to introduce the methods, explain their applications and give some intuition for how they work. To understand them fully, we recommend exploring each method in more detail in the materials linked below.

What do we cover/explore?

  • Dimensionality reduction:
    • Linear: PCA
    • Non-linear: tSNE, UMAP
  • Clustering:
    • K-means
    • Hierarchical clustering
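To give a flavour of these methods, here is a minimal sketch in base R. It is not taken from the tutorial: it uses the built-in iris data set rather than the tutorial data, the choice of three clusters is an assumption made for illustration, and tSNE and UMAP are left out because they need additional packages.

```r
# Illustrative only: built-in iris data, not the data used in the tutorial.
# Keep the four numeric columns and scale them so no variable dominates.
x <- scale(iris[, 1:4])

# Linear dimensionality reduction: PCA
pca <- prcomp(x)
var_explained <- pca$sdev^2 / sum(pca$sdev^2)  # proportion of variance per PC

# K-means clustering; k = 3 is an assumption for this example
set.seed(42)  # k-means starts from random centres
km <- kmeans(x, centers = 3, nstart = 25)

# Hierarchical clustering on Euclidean distances, cut into 3 groups
hc <- hclust(dist(x), method = "ward.D2")
hc_clusters <- cutree(hc, k = 3)

# Compare the two cluster assignments
print(table(kmeans = km$cluster, hierarchical = hc_clusters))

# First two principal components, coloured by k-means cluster
plot(pca$x[, 1:2], col = km$cluster, pch = 19)
```

The cross-table shows how far the two clustering methods agree; cluster labels are arbitrary, so only the grouping matters, not the numbers themselves.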

The tutorial is self-contained, so you should be able to run it at home as well. It includes questions and exercises, as well as some points to ponder. We will discuss them all at the workshop.

Requirements

In order to be able to run this tutorial you need:

  • RStudio v1.0.136 or later
  • R >= 4.0
  • and a few R packages.

The packages you need are all listed in the R_packages_list.txt file in the scripts directory.

We also prepared the prep_help.R script, which checks whether you have all the necessary packages and, if not, tries to install them. You can either open the script in RStudio and click Run, or use the command line:

Rscript --vanilla scripts/prep_help.R

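It is also a good idea to save the output of the script for potential debugging. You can use tee, for example, to copy the command-line output to a file. The |& operator is Bash shorthand for 2>&1 |, so both standard output and error messages end up in the log.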

Rscript --vanilla scripts/prep_help.R |& tee prep_help.log

You are all done if the last message you see is:

SUCCESS: Fantastic! All packages installed and ready.
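The kind of check that prep_help.R performs can be pictured roughly as follows. This is a hedged sketch, not the script itself: the helper find_missing() and the placeholder package list are hypothetical, and the real list should be taken from scripts/R_packages_list.txt.

```r
# Hypothetical helper (not from prep_help.R): return the packages in `pkgs`
# that cannot be loaded in the current R installation.
find_missing <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Placeholder list for illustration; the real list lives in
# scripts/R_packages_list.txt.
required <- c("stats", "utils")

not_installed <- find_missing(required)

if (length(not_installed) == 0) {
  message("All listed packages are installed.")
} else {
  message("Missing packages: ", paste(not_installed, collapse = ", "))
  # Installation is left commented out on purpose. CRAN packages can be
  # installed with install.packages(); Bioconductor packages such as
  # TCGAbiolinks are installed with BiocManager::install() instead.
  # install.packages(not_installed)
}
```

Running a check like this before the workshop shows which packages are missing without changing your library.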

NGSchool2022

You don’t have to worry about your setup: just follow the NGSchool2022 IT instructions and pull the appropriate Docker image. All the packages are already installed there.

Content of the repository

The repository contains a few files:

  • notebooks/00_data_preparation - the code used to download, prepare and normalise the data for the tutorial. We used the TCGAbiolinks package, which can access GDC data and download and read it directly into your R session.
  • notebooks/01_unsupervised_learning_in_R - the slides on unsupervised learning in R, which use the Palmer Penguins data set, together with the Quarto file containing the code that generated them.
  • notebooks/tutorial/tutorial.Rmd - the self-contained tutorial, in which we look at the TCGA BRCA data set and the annotation from this paper.

Additional materials and further reading

Related reading

Generative art

Controversies around the UMAP, tSNE and PCA

Various advanced topics

Here are some pointers to literature on topics that Kaspar briefly mentioned
