diff --git a/README.md b/README.md index 3cf13b80..4a0996e7 100644 --- a/README.md +++ b/README.md @@ -1,275 +1,123 @@ -# Sequence Tube Maps +# Sequence Tube Map ![Header Graphic](/images/header.png) -### A JavaScript module for the visualization of genomic sequence graphs. It automatically generates a "tube map"-like visualization of sequence graphs which have been created with [vg](https://github.com/vgteam/vg). - -### Link to working demo: [https://vgteam.github.io/sequenceTubeMap/](https://vgteam.github.io/sequenceTubeMap/) - -## Biological Background - -Recent scientific advances have lead to a huge increase in the amount of available genomic sequence information. In the past this sequence information consisted of a single reference sequence, which can be relatively easily visualized in a linear way. Today we often know multiple variants of a particular DNA sequence. These could be sequences from different individuals of the same species, but also homologous (= having shared ancestry) sequences from different species. The differences between the individual sequences are called polymorphisms and can range in size from variations of a single base pair to variations involving long stretches of DNA. These polymorphisms are a key focus point for all kinds of sequence analysis, since analyzing the differences between sequences and correlating them to possible differences in phenotype allows to make conclusions about the function of the analyzed sequence. - -Graph data structures allow the encoding of multiple related sequences in a single data structure. The intention is to simplify the comparison of multiple sequences by making it easy to find the sequences' similarities and differences. There are a number of approaches (and file formats) for formally encoding variants of genomic sequences and their relationships in the form of graphs. 
Unfortunately it is often difficult to visualize these graphs in a way which conveys the complex information yet is easy to understand. - -## Functionality - -The purpose of this module is to generate visual representations of genomic sequence graphs. The visualization aims to display the information about all sequence variants in an intuitive way and as elegantly as possible. - -Genomic sequence graphs consist of nodes and paths: - -- A **Node** represents a specific sequence of bases. The length of this sequence determines the node's width in the graphical display. -- A **Path** connects multiple nodes. Each path represents one of the sequences underlying the graph data structure and its walk along multiple nodes. - -This simple example shows two paths along three nodes: - -![Simple Example 1](/images/example1.png) - -Since both paths connect the same nodes, their sequences are identical (and the three nodes could actually be merged into a single one). If the two sequences would differ somewhere in the middle, this would result in the following image: - -![Simple Example 2](/images/example2.png) - -The way genomic sequences change in living organisms can lead to subsequences being inverted. For these cases, instead of creating two different nodes, a single node is traversed in two different directions: - -![Simple Example 3](/images/example3.png) - -The sequenceTubeMap module uses these elements as building blocks and automatically lays out and draws visualizations of graphs which are a lot bigger and more complicated. - -There already exist various JavaScript tools for the visualization of graphs (see [D3.js](https://d3js.org/) [force field graphs](https://bl.ocks.org/mbostock/4062045) or the [hierarchy-based approach](http://www.graphviz.org/content/fsm) by [Graphviz](http://www.graphviz.org/)). These tools are great at displaying typical graphs as they are usually defined. 
But these regular graphs consist of nodes and edges instead of paths and have significant differences compared to genomic sequence graphs. Regular graphs have edges connecting two nodes each and not continuous paths connecting multiple nodes sequentially, nor do their nodes have a forward or backward orientation. We therefore need a specialized solution for displaying genomic sequence graphs (nevertheless the sequenceTubeMap module uses [D3.js](https://d3js.org/) for the actual drawing of svg graphics after calculating the coordinates of the various components). - -## Usage - -### Online Version: Explore Without Installing Anything - -The easiest way to have a look at some graph visualizations is to check out the online demo at [https://vgteam.github.io/sequenceTubeMap/](https://vgteam.github.io/sequenceTubeMap/). There you can play with visualizations from a few different data sets as well as look at some examples showcasing different structural features of variation graphs. You can even provide your own [vg](https://github.com/vgteam/vg)-generated data as an input (limited to small file sizes only). - -### Run Sequence Tube Maps On Your Own - -If you are using vg and want visualize the graphs it generates, the online version is limited to small file sizes. For visualizing bigger data sets you can run Sequence Tube Maps on your own. You can either run Tube Maps completely on your local (Linux) machine or use your local browser to access Tube Maps running on any other (Linux) machine you have access to. - -(Previously we provided a docker image at [https://hub.docker.com/r/wolfib/sequencetubemap/](https://hub.docker.com/r/wolfib/sequencetubemap/), which contained the build of this repo as well as a vg executable for data preprocessing and extraction. We now recommend a different installation approach.) 
- -#### Prerequisites - -* The NodeJS version [specified in the `.nvmrc` file](https://github.com/vgteam/sequenceTubeMap/blob/master/.nvmrc), which as of this writing is **18.7.0**. Other several other NodeJS versions will work, or at least mostly work, but only this version is tested. THis version of NodeJS can be installed on most systems with [nvm](https://github.com/nvm-sh/nvm). -* NPM or `yarn`. NPM comes included in most NodeJS installations. Ubuntu packages it as a separate `npm` package. -* [vg](https://github.com/vgteam/vg) (vg can be tricky to compile. If you run into problems, there are docker images for vg at [https://github.com/vgteam/vg_docker](https://github.com/vgteam/vg_docker).) - -The directory containing the vg executable needs to be added to your environment path: - -``` -PATH=/:$PATH -``` - -#### Installation - -- Clone the repo: - ``` - git clone https://github.com/vgteam/sequenceTubeMap.git - ``` -- Switch to the `sequenceTubeMap` folder -- Install npm dependencies: - ``` - yarn install - ``` - or - ``` - npm install - ``` -- Build the frontend: - ``` - yarn build - ``` - or - ``` - npm run build - ``` - -#### Execution - -- Start the node server: - ``` - yarn serve - ``` - or - ``` - npm run serve - ``` -- If the node server is running on your local machine, open a browser tab and go to `localhost:3001`. -- If the node server is running on a different machine, open a local browser tab and go to the server's URL on port 3001 `http://:3001/`. - If you cannot access the server's port 3001 from the browser, instead of configuring firewall rules etc., it's probably easiest to set up an SSH tunnel. - -``` -ssh -N -L 3001:localhost:3001 @ -``` - -#### Setting Up a Visualization +**Generates a "tube map"-like visualization of sequence graphs which have been created with [vg](https://github.com/vgteam/vg).** + +*[See the online demo!](https://vgteam.github.io/sequenceTubeMap/)* + +No idea what those squiggles are supposed to be? 
Read the [Introduction](doc/intro.md). + +## Online Version +**Explore Without Installing Anything** + +The easiest way to have a look at some graph visualizations is to check out the [online demo](https://vgteam.github.io/sequenceTubeMap/). There you can play with visualizations from a few different data sets as well as look at some examples showcasing different structural features of variation graphs. You can even provide your own [vg](https://github.com/vgteam/vg)-generated data as an input (limited to small file sizes only). + +## Local Version +**Run the Sequence Tube Map on Your Own** + +If you are using vg and want to visualize the graphs it generates, the online version is limited to small file sizes. For visualizing bigger data sets you can run the Sequence Tube Map on your own Linux or Mac computer. You can either run the Tube Map completely on your local machine, or use your local browser to access a Tube Map server running on any other machine you have access to. + +### Installation + +1. Open your terminal. On Linux, you can usually hit `Ctrl` + `Alt` + `T`. On Mac, hit `Command` + `Space`, type `terminal.app`, and hit `Enter`. +2. If you don't already have Git installed, [install Git](https://git-scm.com/book/en/v2/Getting-Started-Installing-Git). +3. Clone the Git repository by typing: + ``` + git clone https://github.com/vgteam/sequenceTubeMap.git + ``` + Then press `Enter`. +4. Switch to the `sequenceTubeMap` directory: + ``` + cd sequenceTubeMap + ``` +5. If you don't already have vg installed, [install vg](https://github.com/vgteam/vg?tab=readme-ov-file#installation). + - For Linux: you can drop the `vg` program file into the `sequenceTubeMap` directory and the Sequence Tube Map will find it. + 1. If you don't have `curl` installed, you may need to do something like `sudo apt update && sudo apt install curl`. + 2. Download the `vg` program and make it executable.
+ ``` + curl -L -o vg https://github.com/vgteam/vg/releases/latest/download/vg + chmod +x vg + ``` + If you have an ARM computer, use `https://github.com/vgteam/vg/releases/latest/download/vg-arm64` instead. + 3. To use the data preparation scripts in `sequenceTubeMap/scripts/`, you will need to have the directory with vg in it in your `PATH` environment variable: + ``` + echo 'export PATH="${PATH}:'"$(pwd)"'"' >>~/.bashrc + . ~/.bashrc + ``` + - For Mac: Open a new terminal, and follow the [vg instructions for building on MacOS](https://github.com/vgteam/vg?tab=readme-ov-file#building-on-macos). Make sure to do the part about adding vg to your `PATH` environment variable. When you come back to your original terminal, run: + ``` + . ~/.zshrc + ``` +6. If you don't already have the right version of NodeJS, [install nvm](https://github.com/nvm-sh/nvm?tab=readme-ov-file#install--update-script), which can install NodeJS: + ``` + curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.40.1/install.sh | bash + export NVM_DIR="$([ -z "${XDG_CONFIG_HOME-}" ] && printf %s "${HOME}/.nvm" || printf %s "${XDG_CONFIG_HOME}/nvm")" + [ -s "$NVM_DIR/nvm.sh" ] && \. "$NVM_DIR/nvm.sh" + ``` + (If you don't have `curl` installed, you may need to do something like `sudo apt update && sudo apt install curl`.) +7. Install the version of NodeJS that the Sequence Tube Map [asks for in its `.nvmrc` file](https://github.com/vgteam/sequenceTubeMap/blob/master/.nvmrc). As of this writing that is **18.7.0**. You can install the right version automatically with `nvm`: + ``` + nvm install + ``` +8. Activate the appropriate version of NodeJS: + ``` + nvm use + ``` +9. Install the exact versions of NPM dependencies that the Sequence Tube Map is tested against: + ``` + npm ci + ``` + Note that this is using **npm**, not **nvm** as in the previous step. +10. Build the frontend: + ``` + npm run build + ``` + +### Execution + +After installation, you can run the Sequence Tube Map: + +1.
Open your terminal. On Linux, you can usually hit `Ctrl` + `Alt` + `T`. On Mac, hit `Command` + `Space`, type `terminal.app`, and hit `Enter`. +2. Switch to the `sequenceTubeMap` directory: + ``` + cd sequenceTubeMap + ``` + If you didn't clone the Git repository immediately inside your home directory, you may need to navigate to another directory first. +3. Activate the appropriate version of NodeJS. If you installed `nvm` to manage NodeJS versions, you can run: + ``` + nvm use + ``` +4. Start the Sequence Tube Map server: + ``` + npm run serve + ``` + Note that this is using **npm**, not **nvm** as in the previous step. +5. Open the Sequence Tube Map in your browser. + - If you are running the Sequence Tube Map on your local computer, you can visit [http://localhost:3001](http://localhost:3001). + - If you are running the Sequence Tube Map on a *different* computer (for example, one accessed by SSH), you will need to connect to it there. You can try browsing to port 3001 on that machine's hostname. For example, if you connected with `ssh yourname@bigserver.example.edu`, then `bigserver.example.edu` is the hostname, and you want to visit `http://bigserver.example.edu:3001`. If that doesn't work, you can try setting up an SSH tunnel by making a second SSH connection with: + ``` + ssh -L 3001:localhost:3001 yourname@bigserver.example.edu + ``` + While that SSH connection is open, you will be able to see the Sequence Tube Map at [http://localhost:3001](http://localhost:3001). + +### Setting Up a Visualization The application comes with pre-set demos that you can use to learn the tool's visual language and basic features. To set up a custom visualization of particular files, you will need to configure a set of "tracks" describing the files you want to visualize, using the "Configure Tracks" dialog in "custom (mounted files)" mode. For information on how to do this, click on the "?" help button, or [read the help documentation online](public/help/help.md). 
-#### Adding Your Own Data - -- The vg files you want to visualize need to contain haplotype/path info. Generating visualizations for the graph itself only is not supported. In addition to the haplotype graph, you can optionally visualize aligned reads from a gam file. -- Your data needs to be indexed by vg. To generate an index of your vg file, go to the `sequenceTubeMap/scripts/` directory and run - ``` - ./prepare_vg.sh - ``` - `` is the file name of your vg file including path information. - If there are `.vcf.gz` and `.vcf.gz.tbi` files next to your `.vg`, they will be used to generate a GBWT index of haplotypes from the VCF. In this case, the `.vg` file must contain alt paths, from the `-a` option of vg construct. -- To generate an index of your gam file (optional, you can view vg only too): - ``` - ./prepare_gam.sh - ``` - `` is the file name of your gam file including path information. -- The output files will be generated in the same folder as the original files. To tell Sequence Tube Maps this location, edit `sequenceTubeMpas/src/config.json` and modify the entry for `dataPath`: - ``` - "dataPath": "/", - ``` - If you want to use a relative path, this path should be relative to the `sequenceTubeMaps/` folder. -- restart the server and choose `custom (mounted files)` from the data dropdown in the UI to be able to pick from the files in your data folder. - -#### Preparing subgraphs in advance - -The sequenceTubeMap will fetch the necessary data when a region is queried. -That can sometimes up to 10-20 seconds. -If you already know of regions/subgraphs that you will be looking at, you can pre-fetch the data in advance. -This will save some time during the interactive visualization, especially if there are a lot of regions to visualize. - -The net result needs to be one or more chunk directories on disk, referenced from a BED file. - -To generate each chunk, you can use the `prepare_chunks.sh` script. 
You ought to run it from the directory containing your input files and where your output chunks will be stored (i.e. the `dataPath` in `sequenceTubeMaps/src/config.json`), which defaults to the `exampleData` directory in the repo. - -For example: - -``` -cd exampleData/ -../scripts/prepare_chunks.sh -x mygraph.xg -h mygraph.gbwt -r chr1:1-100 -d 'Region A' -o chunk-chr1-1-100 -g mygam1.gam -g mygam2.gam >> mychunks.bed -../scripts/prepare_chunks.sh -x mygraph.xg -h mygraph.gbwt -r chr1:101-200 -d 'Region B' -o chunk-chr1-100-200 -g mygam1.gam -g mygam2.gam >> mychunks.bed -``` - -The BED file linking to the chunks has two additional nonstandard columns: - -- a description of the region (column 4) -- the path to the output directory of the chunk, `chunk-chr1-1-100` in the example above, (column 5). - -``` -chr1 1 100 Region A chunk-chr1-1-100 -chr1 101 200 Region B chunk-chr2-101-200 -``` -Note each column is seperated by tabs - -This BED file needs to be in the `dataPath` directory, or it can be hosted on the web along with its chunk directories and accessed via URL. - -If you want certain nodes of the graph to be colored, place the node names to be colored in a `nodeColors.tsv` file, with a node name on each line, within output directory of the chunk. When rendered, these specified nodes will be colored differently than other nodes. - -You can use `prepare_chunks.sh` script to generate this additional `nodeColors.tsv` by adding an additional option. Here is an example: - -``` -cd exampleData/ -../scripts/prepare_chunks.sh -x mygraph.xg -h mygraph.gbwt -r chr1:1-100 -d 'Region A' -o chunk-chr1-1-100 -g mygam1.gam -g mygam2.gam -n "1 2 3" >> mychunks.bed -``` - -Adding this additional `n` flag will allow a string space delimited input of node names which will be outputted to `nodeColors.tsv`. - -``` -1 -2 -3 -``` - -##### Pre-made subgraphs - -You may want to look at a graph that has already been extracted from a larger graph. 
-To support this, there is a `prepare_local_chunk.sh` script, which takes a subgraph rather than a full graph. -It supports most of the options that `prepare_chunks.sh` does, with the notable exception of haplotype files. -It assumes that the graph represents some region along some reference path that is present in the graph, and expects that region to be provided with the `-r` option. -It assumes that path names in the subgraph *don't* use subregion suffixes (bracket-enclosed numbers). -The path name used in the region should *exactly* match the name of one of the paths in the graph. - -`prepare_local_chunk.sh` also accepts `.gaf` files, which will automatically be converted into a gam file using `vg convert`. - -For example, you can run it like: - -``` -cd exampleData/ -../scripts/prepare_local_chunk.sh -x subgraph.gbz -r chr5:1023911-1025911 -g subgraph_reads.gam -g other_sample_reads.gam -g another_sample_reads.gaf -o subgraph1 >> subgraphs.bed -``` - -Your graph can be a `.vg`, `.xg`, `.gfa`, or any other graph format understood by vg, but it *must* be in the same node ID space as your reads, and the script does *not* check this for you! In particular, indexing a GFA graph and mapping to it with `vg giraffe` can result in the original GFA nodes being cut into manageable pieces and assigned new numbers in the graph that the reads actually are aligned to, meaning the original GFA won't work here. You can check your reads against your graph with `vg validate subgraph.gfa --gam subgraph_reads.gam`. If your read alignments look completely absurd and jump all over the place, this is likely the problem. - -If the original subgraph file does not remain in place under the configured `dataPath` and accessible by the tube map, errors may occur complaining that it couldn't be accessed when the tube map attempts to list ist contained paths. 
- -The net result will be that you can select the BED file, select the region it specifies, and view a precomputed view of the subgraph, with coordinates computed assuming it covers the region provided to `prepare_local_chunk.sh`. - -A note on naming node IDs when using `.gfa` files: -VG keeps node IDs the same when all node names are strictly positive integers. However, node IDs are renamed upon encountering string-named nodes. Renaming begins at the first encounter of a string-named node, using the highest integer encountered so far (+1), or 1 if the first node is string-named in the GFA file. Future nodes are renamed in a +1 manner regardless of their datatype. - -Here's an example of a rename: - -``` -Original -> Renamed -3 -> 3 -1 -> 1 -five -> 4 -7 -> 5 -four -> 6 -``` - -#### Development Mode - -The `build`/`serve` pipeline can only produce minified code, which can be difficult to debug. In development, you should instead use: - ``` - yarn start - ``` - or - ``` - npm run start - ``` -This will use React's development mode server to serve the frontend, and run the backend in a separate process, behind React's proxy. Local ports 3000 (or set a different SERVER_PORT in .env) and 3001 must both be free. - -Running in this mode allows the application to produce human-readable stack traces when something goes wrong in the browser. - -#### Running Tests - -For interactive development, you can use: - ``` - yarn test - ``` - or - ``` - npm run test - ``` - -This will start the tests in a watching mode, where files that are changed will prompt apparently-dependent tests to rerun. Note that this only looks for changes versus the currently checked-out Git commit; if you have committed your changes, you cannot test them this way. On Mac, it also requires that the `watchman` package be installed, because it needs to watch the jillions of files in `node_modules` for changes. 
- -If you want to run all the tests, you can run: - ``` - yarn test -- --watchAll=false - ``` - or - ``` - npm run test -- --watchAll=false - ``` - -You can also set the environment variable `CI=true`, or [look sufficiently like a kind of CI environment known to `reach-scripts`](https://create-react-app.dev/docs/running-tests/#command-line-interface). - -If you want to run just a single test, based on its `describe` or `it` name argument, you can do something like: +### Adding Your Own Data - ``` - npm run test -- --watchAll=false -t "can retrieve the list of mounted xg files" - ``` +To load your own data into the Sequence Tube Map, see the guide to [Adding Your Own Data](doc/data.md). -#### Running Prettier Formatter +## Docker -In order to format all `.js` and `.css` files you can run: +Previously we provided a Docker image at [https://hub.docker.com/r/wolfib/sequencetubemap/](https://hub.docker.com/r/wolfib/sequencetubemap/), which contained the build of this repo as well as a vg executable for data preprocessing and extraction. We now recommend a different installation approach, either using the [online version](#online-version) or a full installation of the [local version](#local-version). However, if you would like to Dockerize the Sequence Tube Map, the repository includes a `Dockerfile`. -``` -npm run format -``` -Currently, this repo currently uses [Prettier's default options](https://prettier.io/docs/en/options.html), including double quotes and 2 space tab width for JS. +## Contributing +For information on how to develop on the Sequence Tube Map codebase, please see the [Development Guide](doc/development.md).
## License diff --git a/doc/data.md b/doc/data.md new file mode 100644 index 00000000..ab33acef --- /dev/null +++ b/doc/data.md @@ -0,0 +1,115 @@ +# Adding Your Own Data + +You can add your own data to the Sequence Tube Map by placing it in the data path (by default, the `sequenceTubeMap/exampleData/` directory) of a local copy, uploading it through the web interface, or hosting it online and providing a track or BED URL. + +## Adding Full Graphs + +- The vg files you want to visualize need to contain haplotype or path info. Generating visualizations for a graph without any haplotypes or paths is not supported; only nodes covered by at least one haplotype or path will be displayed. +- If you have a `.vg` file, you can index it into an `.xg` for faster access. Go to the `sequenceTubeMap/scripts/` directory and run + ``` + ./prepare_vg.sh + ``` + `` is the file name of your vg file including path information. + If there are `.vcf.gz` and `.vcf.gz.tbi` files next to your `.vg`, they will be used to generate a GBWT index of haplotypes from the VCF. In this case, the `.vg` file must contain alt paths, from the `-a` option of `vg construct`. +- You can also visualize aligned reads from GAM or indexed GAF files. +- To generate an index of your GAM file, go to the `sequenceTubeMap/scripts/` directory and run: + ``` + ./prepare_gam.sh + ``` + `` is the path to your GAM file. +- The output files from the preparation scripts will be generated in the same folder as the original files. You will probably need to move everything into the `sequenceTubeMap/exampleData/` directory, which is the default data path. +- You can change the data path where the Sequence Tube Map looks for data files. To do this, edit `sequenceTubeMap/src/config.json` and modify the entry for `dataPath`: + ``` + "dataPath": "/", + ``` + If you want to use a relative path, this path should be relative to the `sequenceTubeMap/` folder. Make sure to restart the server afterwards.
+- To actually use your data, make sure to choose `custom (mounted files)` from the data dropdown in the UI, and then click the gear icon to add tracks. For more information on selecting data files in the Sequence Tube Map, see the [Usage Guide](../public/help/help.md). + +## Fetching Subgraphs in Advance + +The sequenceTubeMap will fetch the necessary data when a region is queried. +That can sometimes take up to 10-20 seconds. +If you already know which regions/subgraphs you will be looking at, you can pre-fetch the data. +This will save some time during the interactive visualization, especially if there are a lot of regions to visualize. + +The net result needs to be one or more chunk directories on disk, referenced from a BED file. + +To generate each chunk, you can use the `sequenceTubeMap/scripts/prepare_chunks.sh` script. You ought to run it from the directory containing your input files and where your output chunks will be stored (i.e. the `dataPath` in `sequenceTubeMap/src/config.json`), which defaults to the `sequenceTubeMap/exampleData/` directory in the repo. + +For example: + +``` +cd exampleData/ +../scripts/prepare_chunks.sh -x mygraph.xg -h mygraph.gbwt -r chr1:1-100 -d 'Region A' -o chunk-chr1-1-100 -g mygam1.gam -g mygam2.gam >> mychunks.bed +../scripts/prepare_chunks.sh -x mygraph.xg -h mygraph.gbwt -r chr1:101-200 -d 'Region B' -o chunk-chr1-100-200 -g mygam1.gam -g mygam2.gam >> mychunks.bed +``` + +The BED file linking to the chunks has two additional nonstandard columns: + +- a description of the region (column 4) +- the path to the output directory of the chunk, `chunk-chr1-1-100` in the example above (column 5). + +``` +chr1 1 100 Region A chunk-chr1-1-100 +chr1 101 200 Region B chunk-chr1-100-200 +``` +Note that the columns are separated by tabs. + +This BED file needs to be in the `dataPath` directory, or it can be hosted on the web along with its chunk directories and accessed via URL.
+ +If you want certain nodes of the graph to be colored, place the node names to be colored in a `nodeColors.tsv` file, one node name per line, within the output directory of the chunk. When rendered, these specified nodes will be colored differently than other nodes. + +You can use the `prepare_chunks.sh` script to generate this additional `nodeColors.tsv` by adding an additional option. Here is an example: + +``` +cd exampleData/ +../scripts/prepare_chunks.sh -x mygraph.xg -h mygraph.gbwt -r chr1:1-100 -d 'Region A' -o chunk-chr1-1-100 -g mygam1.gam -g mygam2.gam -n "1 2 3" >> mychunks.bed +``` + +The additional `-n` flag takes a space-delimited string of node names, which is written to `nodeColors.tsv` with one name per line: + +``` +1 +2 +3 +``` + +## Custom Local Subgraphs + +You may want to look at a graph that has already been extracted from a larger graph. +To support this, there is a `prepare_local_chunk.sh` script, which takes a subgraph rather than a full graph. +It supports most of the options that `prepare_chunks.sh` does, with the notable exception of haplotype files. +It assumes that the graph represents some region along some reference path that is present in the graph, and expects that region to be provided with the `-r` option. +It assumes that path names in the subgraph *don't* use subregion suffixes (bracket-enclosed numbers). +The path name used in the region should *exactly* match the name of one of the paths in the graph. + +`prepare_local_chunk.sh` also accepts `.gaf` files, which will automatically be converted into a GAM file using `vg convert`.
+ +For example, you can run it like: + +``` +cd exampleData/ +../scripts/prepare_local_chunk.sh -x subgraph.gbz -r chr5:1023911-1025911 -g subgraph_reads.gam -g other_sample_reads.gam -g another_sample_reads.gaf -o subgraph1 >> subgraphs.bed +``` + +Your graph can be a `.vg`, `.xg`, `.gfa`, or any other graph format understood by vg, but it *must* be in the same node ID space as your reads, and the script does *not* check this for you! In particular, indexing a GFA graph and mapping to it with `vg giraffe` can result in the original GFA nodes being cut into manageable pieces and assigned new numbers in the graph that the reads actually are aligned to, meaning the original GFA won't work here. You can check your reads against your graph with `vg validate subgraph.gfa --gam subgraph_reads.gam`. If your read alignments look completely absurd and jump all over the place, this is likely the problem. + +The original subgraph file must remain in place under the configured `dataPath` and accessible to the tube map; otherwise, errors may occur complaining that it couldn't be accessed when the tube map attempts to list the paths it contains. + +The net result will be that you can select the BED file, select the region it specifies, and view a precomputed view of the subgraph, with coordinates computed assuming it covers the region provided to `prepare_local_chunk.sh`. + +A note on naming node IDs when using `.gfa` files: +VG keeps node IDs the same when all node names are strictly positive integers. However, node IDs are renamed upon encountering string-named nodes. Renaming begins at the first encounter of a string-named node, using the highest integer encountered so far (+1), or 1 if the first node is string-named in the GFA file. Every subsequent node is then assigned the next integer in sequence, regardless of whether its original name was a string or an integer.
+ +Here's an example of a rename: + +``` +Original -> Renamed +3 -> 3 +1 -> 1 +five -> 4 +7 -> 5 +four -> 6 +``` + +You will need to account for the graph nodes having been renamed when interpreting the visualization. diff --git a/doc/development.md b/doc/development.md new file mode 100644 index 00000000..f81ed911 --- /dev/null +++ b/doc/development.md @@ -0,0 +1,57 @@ +# Development Guide + +This document describes how to work on the Sequence Tube Map code as a developer. + +## Development Server + +The `npm run build`/`npm run serve` pipeline can only produce minified code, which can be difficult to debug. In development, you should instead use: + ``` + yarn start + ``` + or + ``` + npm run start + ``` +This will use React's development mode server to serve the frontend, and run the backend in a separate process, behind React's proxy. Local ports 3000 (or set a different SERVER_PORT in .env) and 3001 must both be free. + +Running in this mode allows the application to produce human-readable stack traces when something goes wrong in the browser. + +## Running Tests + +For interactive development, you can use: + ``` + yarn test + ``` + or + ``` + npm run test + ``` + +This will start the tests in a watching mode, where files that are changed will prompt apparently-dependent tests to rerun. Note that this only looks for changes versus the currently checked-out Git commit; if you have committed your changes, you cannot test them this way. On Mac, it also requires that the `watchman` package be installed, because it needs to watch the jillions of files in `node_modules` for changes. + +If you want to run all the tests, you can run: + ``` + yarn test -- --watchAll=false + ``` + or + ``` + npm run test -- --watchAll=false + ``` + +You can also set the environment variable `CI=true`, or [look sufficiently like a kind of CI environment known to `react-scripts`](https://create-react-app.dev/docs/running-tests/#command-line-interface). 
+ +If you want to run just a single test, based on its `describe` or `it` name argument, you can do something like: + + ``` + npm run test -- --watchAll=false -t "can retrieve the list of mounted xg files" + ``` + +## Running Prettier Formatter + +In order to format all `.js` and `.css` files you can run: + +``` +npm run format +``` +This repo currently uses [Prettier's default options](https://prettier.io/docs/en/options.html), including double quotes and 2-space tab width for JS. + diff --git a/images/example1.png b/doc/images/example1.png similarity index 100% rename from images/example1.png rename to doc/images/example1.png diff --git a/images/example2.png b/doc/images/example2.png similarity index 100% rename from images/example2.png rename to doc/images/example2.png diff --git a/images/example3.png b/doc/images/example3.png similarity index 100% rename from images/example3.png rename to doc/images/example3.png diff --git a/doc/intro.md b/doc/intro.md new file mode 100644 index 00000000..2f97b778 --- /dev/null +++ b/doc/intro.md @@ -0,0 +1,33 @@ +# Introduction to the Sequence Tube Map + +## Biological Background + +Recent scientific advances have led to a huge increase in the amount of available genomic sequence information. In the past this sequence information consisted of a single reference sequence, which could be visualized relatively easily in a linear way. Today we often know multiple variants of a particular DNA sequence. These could be sequences from different individuals of the same species, but also homologous (= having shared ancestry) sequences from different species. The differences between the individual sequences are called polymorphisms and can range in size from variations of a single base pair to variations involving long stretches of DNA. 
These polymorphisms are a key focus for all kinds of sequence analysis, since analyzing the differences between sequences and correlating them with possible differences in phenotype allows conclusions to be drawn about the function of the analyzed sequence. + +Graph data structures allow the encoding of multiple related sequences in a single data structure. The intention is to simplify the comparison of multiple sequences by making it easy to find the sequences' similarities and differences. There are a number of approaches (and file formats) for formally encoding variants of genomic sequences and their relationships in the form of graphs. Unfortunately, it is often difficult to visualize these graphs in a way which conveys the complex information yet is easy to understand. + +## Functionality + +The purpose of the Sequence Tube Map is to generate visual representations of genomic sequence graphs. The visualization aims to display the information about all sequence variants in an intuitive way and as elegantly as possible. + +Genomic sequence graphs consist of nodes and paths: + +- A **Node** represents a specific sequence of bases. The length of this sequence determines the node's width in the graphical display. +- A **Path** connects multiple nodes. Each path represents one of the sequences underlying the graph data structure and its walk along multiple nodes. + +This simple example shows two paths along three nodes: + +![Simple Example 1](images/example1.png) + +Since both paths connect the same nodes, their sequences are identical (and the three nodes could actually be merged into a single one). If the two sequences differed somewhere in the middle, this would result in the following image: + +![Simple Example 2](images/example2.png) + +The way genomic sequences change in living organisms can lead to subsequences being inverted. 
For these cases, instead of creating two different nodes, a single node is traversed in two different directions: + +![Simple Example 3](images/example3.png) + +The Sequence Tube Map uses these elements as building blocks and automatically lays out and draws visualizations of graphs which are a lot bigger and more complicated. + +There already exist various JavaScript tools for the visualization of graphs (see [D3.js](https://d3js.org/) [force field graphs](https://bl.ocks.org/mbostock/4062045) or the [hierarchy-based approach](http://www.graphviz.org/content/fsm) by [Graphviz](http://www.graphviz.org/)). These tools are great at displaying typical graphs as they are usually defined. But these regular graphs consist of nodes and edges instead of paths and have significant differences compared to genomic sequence graphs. Regular graphs have edges connecting two nodes each and not continuous paths connecting multiple nodes sequentially, nor do their nodes have a forward or backward orientation. We therefore need a specialized solution for displaying genomic sequence graphs. (Internally, the Sequence Tube Map uses [D3.js](https://d3js.org/) for the actual drawing of svg graphics, after calculating the coordinates of the various components). 
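The node and path model from the introduction, including a node traversed in reverse, can be illustrated with a small JavaScript sketch. The object shapes (`nodes`, `paths`, `steps`, `reverse`) and the helper names are hypothetical and chosen only for illustration; they are not the Sequence Tube Map's actual input format. The sketch assumes that walking a node backward reads the reverse complement of its sequence.

```javascript
// Hypothetical data shapes for illustration; not the tube map's real input format.
const nodes = {
  A: "ACGT",
  B: "TTG",
  C: "CA",
};

// Each path is a walk over nodes; `reverse: true` marks a node visited backward.
const paths = [
  { id: "forward", steps: [{ node: "A" }, { node: "B" }, { node: "C" }] },
  {
    id: "inverted",
    steps: [{ node: "A" }, { node: "B", reverse: true }, { node: "C" }],
  },
];

const COMPLEMENT = { A: "T", C: "G", G: "C", T: "A" };

// Read a DNA sequence from the opposite strand.
function reverseComplement(seq) {
  return seq
    .split("")
    .reverse()
    .map((base) => COMPLEMENT[base])
    .join("");
}

// Spell out the sequence a path represents by concatenating its nodes,
// flipping any node that is traversed in reverse.
function spellPath(path, nodeTable) {
  return path.steps
    .map(({ node, reverse }) =>
      reverse ? reverseComplement(nodeTable[node]) : nodeTable[node]
    )
    .join("");
}

for (const path of paths) {
  console.log(path.id, spellPath(path, nodes));
}
```

Both paths share the same three nodes, so an inversion costs no extra node; only the orientation of one step differs.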
+ diff --git a/docker/config.json b/docker/config.json index 2fe19e36..764046fb 100644 --- a/docker/config.json +++ b/docker/config.json @@ -47,7 +47,7 @@ "dataType": "built-in" } ], - "vgPath": "", + "vgPath": [""], "dataPath": "/data", "internalDataPath": "exampleData/internal/", "tempDirPath": "temp", diff --git a/package-lock.json b/package-lock.json index 63fd399f..02274af9 100644 --- a/package-lock.json +++ b/package-lock.json @@ -64,6 +64,7 @@ "webpack": "^5.82.0", "webpack-dev-server": "4.11.1", "websocket": "^1.0.34", + "which": "^4.0.0", "worker-rpc": "^0.2.0" }, "devDependencies": { @@ -7054,6 +7055,20 @@ "node": ">= 8" } }, + "node_modules/cross-spawn/node_modules/which": { + "version": "2.0.2", + "resolved": "https://registry.npmjs.org/which/-/which-2.0.2.tgz", + "integrity": "sha512-BLI3Tl1TW3Pvl70l3yq3Y64i+awpwXqsGBYWkkqMtnbXgrMD+yj7rhW0kuEDxzJaYXGjEW5ogapKNMEKNMjibA==", + "dependencies": { + "isexe": "^2.0.0" + }, + "bin": { + "node-which": "bin/node-which" + }, + "engines": { + "node": ">= 8" + } + }, "node_modules/crypto-random-string": { "version": "2.0.0", "resolved": "https://registry.npmjs.org/crypto-random-string/-/crypto-random-string-2.0.0.tgz", @@ -19448,17 +19463,17 @@ "dev": true }, "node_modules/which": { - "version": "2.0.2", - "resolved": "https://registry.npmjs.org/which/-/which-2.0.2.tgz", - "integrity": "sha512-BLI3Tl1TW3Pvl70l3yq3Y64i+awpwXqsGBYWkkqMtnbXgrMD+yj7rhW0kuEDxzJaYXGjEW5ogapKNMEKNMjibA==", + "version": "4.0.0", + "resolved": "https://registry.npmjs.org/which/-/which-4.0.0.tgz", + "integrity": "sha512-GlaYyEb07DPxYCKhKzplCWBJtvxZcZMrL+4UkrTSJHHPyZU4mYYTv3qaOe77H7EODLSSopAUFAc6W8U4yqvscg==", "dependencies": { - "isexe": "^2.0.0" + "isexe": "^3.1.1" }, "bin": { - "node-which": "bin/node-which" + "node-which": "bin/which.js" }, "engines": { - "node": ">= 8" + "node": "^16.13.0 || >=18.0.0" } }, "node_modules/which-boxed-primitive": { @@ -19509,6 +19524,14 @@ "url": "https://github.com/sponsors/ljharb" } }, + 
"node_modules/which/node_modules/isexe": { + "version": "3.1.1", + "resolved": "https://registry.npmjs.org/isexe/-/isexe-3.1.1.tgz", + "integrity": "sha512-LpB/54B+/2J5hqQ7imZHfdU31OlgQqx7ZicVlkm9kzg9/w8GKLEcFfJl/t7DCEDueOyBAD6zCCwTO6Fzs0NoEQ==", + "engines": { + "node": ">=16" + } + }, "node_modules/word-wrap": { "version": "1.2.3", "resolved": "https://registry.npmjs.org/word-wrap/-/word-wrap-1.2.3.tgz", @@ -24925,6 +24948,16 @@ "path-key": "^3.1.0", "shebang-command": "^2.0.0", "which": "^2.0.1" + }, + "dependencies": { + "which": { + "version": "2.0.2", + "resolved": "https://registry.npmjs.org/which/-/which-2.0.2.tgz", + "integrity": "sha512-BLI3Tl1TW3Pvl70l3yq3Y64i+awpwXqsGBYWkkqMtnbXgrMD+yj7rhW0kuEDxzJaYXGjEW5ogapKNMEKNMjibA==", + "requires": { + "isexe": "^2.0.0" + } + } } }, "crypto-random-string": { @@ -34023,11 +34056,18 @@ } }, "which": { - "version": "2.0.2", - "resolved": "https://registry.npmjs.org/which/-/which-2.0.2.tgz", - "integrity": "sha512-BLI3Tl1TW3Pvl70l3yq3Y64i+awpwXqsGBYWkkqMtnbXgrMD+yj7rhW0kuEDxzJaYXGjEW5ogapKNMEKNMjibA==", + "version": "4.0.0", + "resolved": "https://registry.npmjs.org/which/-/which-4.0.0.tgz", + "integrity": "sha512-GlaYyEb07DPxYCKhKzplCWBJtvxZcZMrL+4UkrTSJHHPyZU4mYYTv3qaOe77H7EODLSSopAUFAc6W8U4yqvscg==", "requires": { - "isexe": "^2.0.0" + "isexe": "^3.1.1" + }, + "dependencies": { + "isexe": { + "version": "3.1.1", + "resolved": "https://registry.npmjs.org/isexe/-/isexe-3.1.1.tgz", + "integrity": "sha512-LpB/54B+/2J5hqQ7imZHfdU31OlgQqx7ZicVlkm9kzg9/w8GKLEcFfJl/t7DCEDueOyBAD6zCCwTO6Fzs0NoEQ==" + } } }, "which-boxed-primitive": { diff --git a/package.json b/package.json index 7d870d43..ac128460 100644 --- a/package.json +++ b/package.json @@ -58,6 +58,7 @@ "webpack": "^5.82.0", "webpack-dev-server": "4.11.1", "websocket": "^1.0.34", + "which": "^4.0.0", "worker-rpc": "^0.2.0" }, "scripts": { diff --git a/src/config.json b/src/config.json index 6d7dbc2b..ad54a359 100644 --- a/src/config.json +++ b/src/config.json @@ 
-77,7 +77,7 @@ "dataType": "built-in" } ], - "vgPath": "", + "vgPath": ["./", "./bin", "./vg/bin", ""], "dataPath": "exampleData", "internalDataPath": "exampleData/internal/", "tempDirPath": "temp", diff --git a/src/server.mjs b/src/server.mjs index f90c6ea8..4d8d8e57 100644 --- a/src/server.mjs +++ b/src/server.mjs @@ -35,13 +35,65 @@ import sanitize from "sanitize-filename"; import { createHash } from "node:crypto"; import cron from "node-cron"; import { RWLock, combine } from "readers-writer-lock"; +import which from "which"; if (process.env.NODE_ENV !== "production") { // Load any .env file config dotenv.config(); } -const VG_PATH = config.vgPath; +/// Return the command string to execute to run vg. +/// Checks config.vgPath. +/// An entry of "" in config.vgPath means to check PATH. +function find_vg() { + if (find_vg.found_vg !== null) { + // Use the cached answer instead of re-checking every time. + // Nobody should be deleting vg. + return find_vg.found_vg; + } + for (let prefix of config.vgPath) { + if (prefix === "") { + // Empty string has special meaning of "use PATH". + console.log("Check for vg on PATH"); + try { + find_vg.found_vg = which.sync("vg"); + console.log("Found vg at:", find_vg.found_vg); + return find_vg.found_vg; + } catch (e) { + // vg is not on PATH + continue; + } + } + if (prefix.length > 0 && prefix[prefix.length - 1] !== "/") { + // Add trailing slash + prefix = prefix + "/"; + } + let vg_filename = prefix + "vg"; + console.log("Check for vg at:", vg_filename); + if (fs.existsSync(vg_filename)) { + if (!fs.statSync(vg_filename).isFile()) { + // This is a directory or something, not a binary we can run. + continue; + } + try { + // Check that we have permission to execute it + fs.accessSync(vg_filename, fs.constants.X_OK); + } catch (e) { + // Not executable + continue; + } + // If we get here it is executable. 
+ find_vg.found_vg = vg_filename; + console.log("Found vg at:", find_vg.found_vg); + return find_vg.found_vg; + } + } + // If we get here we don't see vg at all. + throw new InternalServerError("The vg command was not found. Install vg to use the Sequence Tube Map: https://github.com/vgteam/vg?tab=readme-ov-file#installation"); +} +find_vg.found_vg = null; + + const MOUNTED_DATA_PATH = config.dataPath; const INTERNAL_DATA_PATH = config.internalDataPath; // THis is where we will store uploaded files @@ -274,7 +326,7 @@ function indexGamSorted(req, res) { const sortedGamFile = fs.createWriteStream(prefix + ".sorted.gam", { encoding: "binary", }); - const vgIndexChild = spawn(`${VG_PATH}vg`, [ + const vgIndexChild = spawn(find_vg(), [ "gamsort", "-i", prefix + ".sorted.gam.gai", @@ -683,23 +735,22 @@ async function getChunkedData(req, res, next) { console.log(`vg ${vgChunkParams.join(" ")}`); console.time("vg chunk"); - const vgChunkCall = spawn(`${VG_PATH}vg`, vgChunkParams); + const vgChunkCall = spawn(find_vg(), vgChunkParams); // vg simplify for gam files let vgSimplifyCall = null; if (req.simplify) { - vgSimplifyCall = spawn(`${VG_PATH}vg`, ["simplify", "-"]); + vgSimplifyCall = spawn(find_vg(), ["simplify", "-"]); console.log("Spawning vg simplify call"); } - const vgViewCall = spawn(`${VG_PATH}vg`, ["view", "-j", "-"]); + const vgViewCall = spawn(find_vg(), ["view", "-j", "-"]); let graphAsString = ""; req.error = Buffer.alloc(0); vgChunkCall.on("error", function (err) { console.log( "Error executing " + - VG_PATH + - "vg " + + find_vg() + " " + vgChunkParams.join(" ") + ": " + err @@ -732,7 +783,7 @@ async function getChunkedData(req, res, next) { vgViewCall.stdin.end(); } if (code !== 0) { - console.log("Error from " + VG_PATH + "vg " + vgChunkParams.join(" ")); + console.log("Error from " + find_vg() + " " + vgChunkParams.join(" ")); // Execution failed if (!sentResponse) { sentResponse = true; @@ -745,7 +796,7 @@ async function getChunkedData(req, res, 
next) { if (req.simplify) { vgSimplifyCall.on("error", function (err) { console.log( - "Error executing " + VG_PATH + "vg " + "simplify " + "- " + ": " + err + "Error executing " + find_vg() + " simplify " + "- " + ": " + err ); if (!sentResponse) { sentResponse = true; @@ -767,7 +818,7 @@ async function getChunkedData(req, res, next) { console.log(`vg simplify exited with code ${code}`); vgViewCall.stdin.end(); if (code !== 0) { - console.log("Error from " + VG_PATH + "vg " + "simplify - "); + console.log("Error from " + find_vg() + " " + "simplify - "); // Execution failed if (!sentResponse) { sentResponse = true; @@ -835,14 +886,14 @@ async function getChunkedData(req, res, next) { let vgSimplifyCall = null; let vgViewArguments = ["view", "-j"]; if (req.simplify) { - vgSimplifyCall = spawn(`${VG_PATH}vg`, ["simplify", filename]); + vgSimplifyCall = spawn(find_vg(), ["simplify", filename]); vgViewArguments.push("-"); console.log("Spawning vg simplify call"); } else { vgViewArguments.push(filename); } - let vgViewCall = spawn(`${VG_PATH}vg`, vgViewArguments); + let vgViewCall = spawn(find_vg(), vgViewArguments); let graphAsString = ""; req.error = Buffer.alloc(0); @@ -852,8 +903,7 @@ async function getChunkedData(req, res, next) { vgSimplifyCall.on("error", function (err) { console.log( "Error executing " + - VG_PATH + - "vg " + + find_vg() + " " + "simplify " + filename + ": " + @@ -879,7 +929,7 @@ async function getChunkedData(req, res, next) { console.log(`vg simplify exited with code ${code}`); vgViewCall.stdin.end(); if (code !== 0) { - console.log("Error from " + VG_PATH + "vg " + "simplify " + filename); + console.log("Error from " + find_vg() + " simplify " + filename); // Execution failed if (!sentResponse) { sentResponse = true; @@ -1213,11 +1263,11 @@ function processGamFile(req, res, next, gamFile, gamFileNumber) { vgViewParams.push(gamFile); } - const vgViewChild = spawn(`${VG_PATH}vg`, vgViewParams); + const vgViewChild = spawn(find_vg(), 
vgViewParams); if (gamFile.endsWith(".gaf")) { // if input was a GAF, run vg convert and pipe stdout to vg view - const vgConvertChild = spawn(`${VG_PATH}vg`, vgConvertParams); + const vgConvertChild = spawn(find_vg(), vgConvertParams); vgConvertChild.stdout.on("data", function (data) { vgViewChild.stdin.write(data); @@ -1232,7 +1282,7 @@ function processGamFile(req, res, next, gamFile, gamFileNumber) { console.log(`vg convert exited with code ${code}`); vgViewChild.stdin.end(); if (code !== 0) { - console.log("Error from " + VG_PATH + "vg " + vgConvertParams.join(" ")); + console.log("Error from " + find_vg() + " " + vgConvertParams.join(" ")); // Execution failed if (!sentResponse) { sentResponse = true; @@ -1608,7 +1658,7 @@ api.post("/getPathNames", (req, res, next) => { ); } - const vgViewChild = spawn(`${VG_PATH}vg`, ["paths", "-L", "-x", graphFile]); + const vgViewChild = spawn(find_vg(), ["paths", "-L", "-x", graphFile]); vgViewChild.stderr.on("data", (data) => { console.log(`err data: ${data}`);