Projects on COVID-19 topic of genomic sequencing - mostly DataViz

Mike-Honey/covid-19-genomes

covid-19-genomes

DataViz Projects on topic of COVID-19 genomic sequencing.

Most pages show Australia by default; other countries are available for selection.

Most pages now show lineages from nextclade.org, an alternative lineage classification tool. Alternates of most pages are also available showing the original GISAID lineages if preferred, but most experts recommend nextclade for quicker and more precise calls, especially of newer lineages. Pages with (nextclade) in the page title show nextclade lineages; otherwise GISAID lineages are shown. The page navigation is at bottom-centre, e.g. < 2 of 30 >.

gisaid.org with nextclade lineages - top Lineages by Country/Location

Link to interactive DataViz

Click to view and interact with the report

gisaid.org with nextclade lineages - top Countries for a selected Lineage

Link to interactive DataViz

Click to view and interact with the report

gisaid.org with nextclade lineages - top Locations for a selected Lineage

Link to interactive DataViz

Click to view and interact with the report

gisaid.org with nextclade lineages - Lineage growth comparison (log)

Link to interactive DataViz

Click to view and interact with the report

gisaid.org with nextclade lineages - map

Visualise the geographical spread of a selected lineage. Use the play control at bottom for an animated view of the spread.

Locations are approximate - typically by reporting state/province or country. Bubble sizes are driven by the % of the total set of samples selected.

Link to interactive DataViz

Click to view and interact with the report

gisaid.org with nextclade lineages - sankey

Rolls up the evolutionary tree of lineages from the highest level ancestors (far left) to the most evolved descendants (far right). Each segment of a vertical column shows the count of that lineage plus all its descendants. Slicers for the range of Levels and Minimum # of Samples can be used to produce a more focussed output.

The rollup logic is a bit heavy, so please be patient with this page.
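As a rough illustration of the rollup described above, here is a minimal Python sketch. It assumes lineage names have already been expanded to their full, unaliased Pango form so that each dot marks one level of the tree; the function names and the sample counts are hypothetical, and the report's own rollup runs inside Power BI and may work differently.

```python
from __future__ import annotations

from collections import defaultdict


def parent_of(lineage: str) -> str | None:
    """Drop the last dot-separated level; return None for a root lineage."""
    head, sep, _ = lineage.rpartition(".")
    return head if sep else None


def rollup_counts(sample_counts: dict[str, int]) -> dict[str, int]:
    """Add each lineage's own count to itself and to every ancestor."""
    totals: dict[str, int] = defaultdict(int)
    for lineage, count in sample_counts.items():
        node: str | None = lineage
        while node is not None:
            totals[node] += count
            node = parent_of(node)
    return dict(totals)


def apply_min_samples(totals: dict[str, int], min_samples: int) -> dict[str, int]:
    """Mimic a 'Minimum # of Samples' slicer by dropping small nodes."""
    return {name: n for name, n in totals.items() if n >= min_samples}


# Hypothetical counts of samples assigned directly to each lineage.
counts = {"BA.2": 10, "BA.2.86": 5, "BA.2.86.1": 3}
rolled = rollup_counts(counts)
focused = apply_min_samples(rolled, min_samples=5)
```

Walking up from every lineage touches each ancestor once per descendant, which gives a sense of why the rollup gets heavy on a large tree.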

Jeff Gilchrist wrote an excellent thread explaining how to drive this Sankey page.

Link to interactive DataViz

Click to view and interact with the report

gisaid.org with nextclade lineages - geography frequencies

Track the weekly progress of a selected lineage for any combination of Continents and Countries. Shows the counts of that lineage vs the overall total, by week collected, also as a %.
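The weekly figures on this page can be pictured with a short Python sketch. It is only an illustration: weeks are assumed to start on Monday, a sample is matched on its exact lineage name rather than including descendants, and the function names and sample records are hypothetical.

```python
from collections import Counter
from datetime import date, timedelta


def week_start(d: date) -> date:
    """Return the Monday of the week containing d (assumed week boundary)."""
    return d - timedelta(days=d.weekday())


def weekly_share(samples, lineage):
    """samples: iterable of (collection_date, lineage_name) pairs.

    Returns {week_start: (lineage_count, total_count, percent)}.
    """
    totals = Counter()
    hits = Counter()
    for collected, name in samples:
        wk = week_start(collected)
        totals[wk] += 1
        if name == lineage:
            hits[wk] += 1
    return {
        wk: (hits[wk], totals[wk], 100.0 * hits[wk] / totals[wk])
        for wk in sorted(totals)
    }


# Hypothetical records: (collection date, lineage).
samples = [
    (date(2024, 1, 1), "JN.1"),
    (date(2024, 1, 3), "EG.5"),
    (date(2024, 1, 8), "JN.1"),
    (date(2024, 1, 9), "JN.1"),
]
shares = weekly_share(samples, "JN.1")
```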

Link to interactive DataViz

Click to view and interact with the report

gisaid.org - archive

Link to interactive DataViz

Click to view and interact with the report

Reference:

International data on COVID-19 genomic sequencing, for analysis and reporting on variant prevalence by country, by region, or globally.

Global data gathered from GISAID. Sequence data is processed through the Nextclade CLI to produce the generally preferred Nextclade Lineage classifications.

I'm mainly following the visualisation style I first saw presented by Trevor Bedford. The main features are clean, simple line charts, filtered by default to the top 7 series in the selected data. For each chart point, the frequency of that lineage in the last 7 days is calculated, always comparing to all the sequencing data available for that country/location.
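The per-point frequency can be sketched in Python as below. This is an assumption-laden illustration rather than the report's Power BI measure: the window is taken to be the 7 days ending on (and including) the chart date, and the function name and sample records are hypothetical.

```python
from datetime import date, timedelta


def trailing_frequency(samples, lineage, as_of, window_days=7):
    """Share of `lineage` among all samples collected in the trailing window.

    samples: iterable of (collection_date, lineage_name) pairs for one
    country/location. Returns None when the window holds no samples.
    """
    start = as_of - timedelta(days=window_days - 1)
    in_window = [name for collected, name in samples if start <= collected <= as_of]
    if not in_window:
        return None
    return sum(1 for name in in_window if name == lineage) / len(in_window)


# Hypothetical records: (collection date, lineage).
samples = [
    (date(2024, 1, 1), "A"),
    (date(2024, 1, 5), "B"),
    (date(2024, 1, 7), "A"),
]
freq = trailing_frequency(samples, "A", as_of=date(2024, 1, 7))
empty = trailing_frequency(samples, "A", as_of=date(2024, 1, 20))
```

The denominator is every sample in the window, not just the lineages plotted, which matches the "always comparing to all the sequencing data" rule above.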

Other pages presented include showing a single Lineage by Country or by Location. The top 7 lineages in the selected Continent/Countries/Locations will be shown, with frequency calculated as above.

The Lineage growth comparison (log) page was suggested by Uffe Poulsen, based on a chart produced by Alex Selby.

The main gisaid dataset only presents data for the last few months, to save processing time. It is typically refreshed weekly. The "gisaid - archive" dataviz presents all the historical data, but is only refreshed monthly at best.

Summary

The available sites presenting data on genomic sequencing are typically limited to country or global perspectives, with limited interactivity and often using overly complex visualisations. Each site has its own visualisation style. They are each updated independently.

In this project, the data from those sources is presented in an interactive data visualisation tool: Power BI. This allows interactive filtering of the data in the table, for easier analysis.

A page is presented for each data source (now only gisaid, but formerly also microreact, nextstrain, UCSC and cdgn), and the gisaid data has alternate pages showing either the Nextclade lineage classifications, or GISAID's own lineage classifications.

Earlier lineages are translated into the commonly known variant names (e.g. Delta) following the WHO naming. More recent lineages are grouped into "clans", roughly following the work of the Variant Trackers group, e.g. T. Ryan Gregory. These are grouped using the field Lineage L2; for example, the Lineage L2 "clan" BA.2.86.* includes the BA.2.86 lineage and all its descendants. The Lineage L2 "clans" are mutually exclusive, so XBB.1.9.* excludes all of the EG.5.* lineages.
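One way to get mutually exclusive clans is to assign each lineage to the most specific clan root that contains it. The Python sketch below illustrates that idea under stated assumptions: the small alias table is an illustrative subset of the Pango alias key, the clan roots and the "Other" fallback label are hypothetical, and the report's actual Lineage L2 logic may be built differently.

```python
# Illustrative subset of Pango aliases (alias prefix -> full prefix).
ALIASES = {
    "BA": "B.1.1.529",
    "JN": "B.1.1.529.2.86.1",
    "EG": "XBB.1.9.2",
}


def unalias(lineage):
    """Expand the leading alias, e.g. EG.5.1 -> XBB.1.9.2.5.1."""
    head, _, rest = lineage.partition(".")
    full = ALIASES.get(head, head)
    return f"{full}.{rest}" if rest else full


def is_within(full, root_full):
    """True if `full` is the root itself or one of its descendants."""
    return full == root_full or full.startswith(root_full + ".")


def assign_clan(lineage, clan_roots):
    """Pick the deepest clan root containing the lineage, else 'Other'."""
    full = unalias(lineage)
    matches = [root for root in clan_roots if is_within(full, unalias(root))]
    if not matches:
        return "Other"  # hypothetical fallback label
    deepest = max(matches, key=lambda root: len(unalias(root).split(".")))
    return deepest + ".*"


# Hypothetical list of clan roots.
roots = ["XBB.1.9", "EG.5", "BA.2.86"]
```

With these roots, `assign_clan("EG.5.1", roots)` falls under EG.5.* even though EG.5 descends from XBB.1.9, because the deeper root wins; `assign_clan("XBB.1.9.1", roots)` stays in XBB.1.9.*.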

The default country selection for most pages is Australia. As well as being where I live, genomic sequencing for Australia has a relatively high proportion of genomes sequenced vs total COVID-19 cases.

The user can choose any alternative country, and also filter the date range or Lineages included. It is possible to combine multiple countries, even all data for a continent or globally. However, note that the sampling in most datasets is heavily skewed to a handful of countries.

The primary visual on each page is a line chart showing the Lineage Frequency, calculated as a moving average over the prior 7 days and compared to all the other lineages present in the data, regardless of selections. To keep the line charts clean, only the seven most frequently occurring Lineages are shown (dynamically determined). Alternate pages compare Countries or Locations for a selected Lineage, again typically showing the top seven.
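The interaction between the top-seven filter and the frequency denominator can be shown with a small Python sketch. The function name and sample data are hypothetical; the point is only that the series shown are limited to the top N while each share is still computed against every lineage in the data.

```python
from collections import Counter


def top_n_shares(lineages, n=7):
    """Return {lineage: share} for the n most frequent lineages.

    Shares are computed against all samples, so they need not sum to 1
    when more than n lineages are present.
    """
    counts = Counter(lineages)
    total = sum(counts.values())
    return {name: count / total for name, count in counts.most_common(n)}


# Hypothetical lineage calls for ten samples.
calls = ["A"] * 5 + ["B"] * 3 + ["C"] * 2
shares = top_n_shares(calls, n=2)
```

Here only A and B are kept, with shares of 0.5 and 0.3; the remaining 0.2 belongs to C, which is counted in the denominator but not plotted.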

The gray inverted column chart below each line chart shows the counts of all genomes sequenced over the same period. A typical pattern is that the sample volume drops for more recent periods.

An interactive table at the bottom right lists the individual observations presented by each dataset.

From gisaid.org we gather their EpiCoV metadata dataset. For most countries, this dataset is the most complete and up-to-date available.

Elbe, S., and Buckland-Merrett, G. (2017) Data, disease and diplomacy: GISAID’s innovative contribution to global health. Global Challenges, 1:33-46. DOI:10.1002/gch2.1018 PMCID: 31565258

From nextclade we classify the gisaid samples to obtain the nextclade pango lineage (using the Nextclade CLI tool). These offer an alternative to the pango lineages presented by gisaid. Typically new lineages are defined first in nextclade, and the nextclade calls are preferred by some experts.
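For readers who want to reproduce the classification step, a minimal Python sketch follows. It assumes the sequences are available locally as a FASTA file and that the Nextclade CLI is installed; the file paths are hypothetical, and the `seqName` and `Nextclade_pango` column names reflect my understanding of the SARS-CoV-2 dataset output, so check them against your Nextclade version. It is not a description of this project's actual pipeline.

```python
import csv
import io


def nextclade_commands(fasta_path, dataset_dir, output_tsv):
    """Build the two CLI invocations: fetch the dataset, then classify."""
    return [
        ["nextclade", "dataset", "get",
         "--name", "sars-cov-2", "--output-dir", dataset_dir],
        ["nextclade", "run",
         "--input-dataset", dataset_dir, "--output-tsv", output_tsv, fasta_path],
    ]


def read_lineages(tsv_text):
    """Map each sequence name to its Nextclade pango lineage call."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {row["seqName"]: row["Nextclade_pango"] for row in reader}


# Hypothetical paths, used only to show the command shape.
commands = nextclade_commands("sequences.fasta", "data/sars-cov-2", "nextclade.tsv")

# Hypothetical two-row excerpt of a Nextclade TSV output.
example_tsv = "seqName\tclade\tNextclade_pango\nsample-1\t23I\tBA.2.86\n"
lineages = read_lineages(example_tsv)
```

Each command list can be passed to `subprocess.run(cmd, check=True)`; the resulting lineage map could then be joined to the metadata by sequence name.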

THIS REPORT IS NOT HEALTH ADVICE - REFER TO YOUR LOCAL HEALTH AUTHORITY.

🤝 Support

Contributions, issues, feature requests and sponsorship are all welcome!

Give a ⭐️ if you like this project!
