The XML/TXT files from the different data sources used for this specific import were downloaded on 12/03/2014 and can be retrieved from the following requester-pays bucket:
s3://eu-west-1.raw.bio4j.com/
If you are not using AWS, go directly to Step 3.
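As a minimal sketch of how the bucket could be read with the AWS CLI: the `--request-payer requester` option acknowledges that the caller pays for the transfer. The function name, the target folder `bio4j-raw` and the `DRY_RUN` switch are assumptions made for illustration, and since the layout of the bucket is not described here, the sketch simply copies everything.

```shell
BUCKET="s3://eu-west-1.raw.bio4j.com/"

# Copy the whole bucket into the folder given as first argument.
# With DRY_RUN=1 (the default in this sketch) the command is only printed.
fetch_raw_data() {
  dest=$1
  set -- aws s3 cp "$BUCKET" "$dest" --recursive --request-payer requester
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$@"
  else
    "$@"
  fi
}

# bio4j-raw is a hypothetical local folder name.
fetch_raw_data bio4j-raw
```

Run it with `DRY_RUN=0` once you have checked the printed command and configured your AWS credentials.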
- Launch a new AWS EC2 instance
Preferably hi1.4xlarge or similar.
- Format and mount the ephemeral storage devices
$ mkfs -t ext4 /dev/sdb
$ mkdir -p /mnt/sources
$ mount /dev/sdb /mnt/sources
$ mkfs -t ext4 /dev/sdc
$ mkdir -p /mnt/bio4jtitan
$ mount /dev/sdc /mnt/bio4jtitan
- Download and install official Java 8 JDK
If yum is available, you can use it for that:
yum -y remove java-1.7.0-openjdk
yum -y install java-1.8.0-openjdk
Otherwise, download the Java 8 JDK from the official Java website.
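After installing, it may be worth confirming which Java version is active. The following is a small sketch; the helper name `is_java8` is made up for this example, and it relies on Java 8 reporting itself as version `1.8.x` in the output of `java -version`.

```shell
# Succeeds if the given `java -version` output reports a 1.8 (Java 8) runtime.
is_java8() {
  case "$1" in
    *'version "1.8.'*) return 0 ;;
    *) return 1 ;;
  esac
}

# Only query the real JVM when java is on the PATH.
if command -v java >/dev/null 2>&1; then
  if is_java8 "$(java -version 2>&1)"; then
    echo "Java 8 detected"
  else
    echo "A different Java version is active"
  fi
fi
```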
- Download bio4j-titan jars and configuration files
You need the executable jar for the current release and some xml and properties files. For bio4j-titan 0.4.0 these are:
  - bio4j-titan-0.4.0-fat.jar
    The jar that you will need to run.
  - executionsBio4jTitan.xml
    This file contains the mapping between raw data and the corresponding modules; it can be changed in order to import only a subset of the available data.
  - uniprotData.xml
    This file is only used if you want to import the Uniprot module. Set the boolean flags included in the XML file to true/false depending on which Uniprot data you want to import.
  - property files
    All the .properties files under this folder; you will need the ones corresponding to the modules you want to import.
For the xml and properties files, you can just clone the bio4j/bio4j-titan repo and check out the 0.4.0 tag.
- Get the raw input data
Download and execute (from /mnt/sources/) the following bash script:
It will download and decompress all the raw data needed to build a full Bio4j distribution (Swissprot, TrEMBL, GO, etc.).
Once the script has finished, make sure that the final file names coincide with those specified in your XML file executionsBio4jTitan.xml.
If you are only interested in some of the data sources, there is no point in downloading the ones you are not going to include. You can avoid that by simply removing the corresponding lines from this shell script.
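The file-name check can be partly automated. The sketch below lists every `<argument>` path under the sources folder that does not exist on disk. The function name `missing_inputs` is made up for this example, and it assumes one `<argument>` element per line, as in the executions file shown in this guide.

```shell
# Print the <argument> paths below the sources folder that are missing on disk.
#   $1: executions XML file    $2: sources folder (e.g. /mnt/sources)
missing_inputs() {
  executions_file=$1
  sources_dir=${2%/}   # drop a trailing slash, if any
  sed -n 's:.*<argument>\(.*\)</argument>.*:\1:p' "$executions_file" | sort -u |
  while IFS= read -r path; do
    case "$path" in
      # Paths outside the sources folder (such as the database folder) are skipped.
      "$sources_dir"/*) [ -e "$path" ] || echo "$path" ;;
    esac
  done
}

if [ -f executionsBio4jTitan.xml ]; then
  missing_inputs executionsBio4jTitan.xml /mnt/sources
fi
```

An empty output means that every input file named in the XML was found.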
- (Optional) Customize data to be imported
Bio4j is divided into modules, and so is the importing process, so you don't have to import everything when you are interested in only some of the data sources (Gene Ontology, NCBI taxonomy tree, etc.). However, the set of modules you import must be coherent: for example, it's not possible to import the Uniref clusters without previously importing Uniprot KB, since otherwise there would be no proteins for the clusters to link to.
In order to customize the modules that will be imported, modify the file executionsBio4jTitan.xml.
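Let's imagine that we want a database including only the Gene Ontology, NCBI taxonomy tree and Uniprot KB (only Swiss-Prot entries).
The corresponding executionsBio4jTitan.xml file should look like this. Note that the listing also contains the executions that take uniprot_trembl.xml as input; leave those out for a database with Swiss-Prot entries only: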
<scheduled_executions>
  <execution>
    <class_full_name>com.bio4j.titan.model.go.programs.ImportGOTitan</class_full_name>
    <arguments>
      <argument>/mnt/sources/go.xml</argument>
      <argument>/mnt/bio4jtitan/bio4j</argument>
    </arguments>
  </execution>
  <execution>
    <class_full_name>com.bio4j.titan.model.uniprot.programs.ImportUniprotTitan</class_full_name>
    <arguments>
      <argument>/mnt/sources/uniprot_sprot.xml</argument>
      <argument>/mnt/bio4jtitan/bio4j</argument>
      <argument>/mnt/sources/uniprot_data.xml</argument>
    </arguments>
  </execution>
  <execution>
    <class_full_name>com.bio4j.titan.model.uniprot.programs.ImportUniprotTitan</class_full_name>
    <arguments>
      <argument>/mnt/sources/uniprot_trembl.xml</argument>
      <argument>/mnt/bio4jtitan/bio4j</argument>
      <argument>/mnt/sources/uniprot_data.xml</argument>
    </arguments>
  </execution>
  <execution>
    <class_full_name>com.bio4j.titan.model.uniprot_go.programs.ImportUniprotGoTitan</class_full_name>
    <arguments>
      <argument>/mnt/sources/uniprot_sprot.xml</argument>
      <argument>/mnt/bio4jtitan/bio4j</argument>
    </arguments>
  </execution>
  <execution>
    <class_full_name>com.bio4j.titan.model.uniprot_go.programs.ImportUniprotGoTitan</class_full_name>
    <arguments>
      <argument>/mnt/sources/uniprot_trembl.xml</argument>
      <argument>/mnt/bio4jtitan/bio4j</argument>
    </arguments>
  </execution>
  <execution>
    <class_full_name>com.bio4j.titan.model.ncbiTaxonomy.programs.ImportNCBITaxonomyTitan</class_full_name>
    <arguments>
      <argument>/mnt/sources/nodes.dmp</argument>
      <argument>/mnt/sources/names.dmp</argument>
      <argument>/mnt/sources/merged.dmp</argument>
      <argument>/mnt/bio4jtitan/bio4j</argument>
    </arguments>
  </execution>
  <execution>
    <class_full_name>com.bio4j.titan.model.uniprot_ncbiTaxonomy.programs.ImportUniprotNCBITaxonomyTitan</class_full_name>
    <arguments>
      <argument>/mnt/sources/uniprot_sprot.xml</argument>
      <argument>/mnt/bio4jtitan/bio4j</argument>
    </arguments>
  </execution>
  <execution>
    <class_full_name>com.bio4j.titan.model.uniprot_ncbiTaxonomy.programs.ImportUniprotNCBITaxonomyTitan</class_full_name>
    <arguments>
      <argument>/mnt/sources/uniprot_trembl.xml</argument>
      <argument>/mnt/bio4jtitan/bio4j</argument>
    </arguments>
  </execution>
</scheduled_executions>
- Launch importing process
For example:
java -d64 -Xmx40G -jar bio4j-titan-0.4.0-fat.jar executionsBio4jTitan.xml &
Several log files will be created in the folder that contains the jar.
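Since an import can run for hours or days, you may want the process to survive the end of your SSH session. One possible way, sketched here with the standard `nohup` utility, is to detach the process and send its console output to a file. The function name `launch_import` and the console log name `import.out` are assumptions made for this example; the Java options are the ones from the command above.

```shell
# Start the import detached from the terminal and print the PID of the background job.
#   $1: fat jar    $2: executions XML file    $3: console log (defaults to import.out)
launch_import() {
  jar=$1
  executions=$2
  console_log=${3:-import.out}
  nohup java -d64 -Xmx40G -jar "$jar" "$executions" > "$console_log" 2>&1 &
  echo $!
}

# Example usage:
#   launch_import bio4j-titan-0.4.0-fat.jar executionsBio4jTitan.xml
#   tail -f import.out
```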
The import times below were measured on a hi1.4xlarge instance using 40 GB of memory for the Java process. The only configuration value that was changed was "autotype" = "none".
Module | Time |
---|---|
Gene Ontology | 1m 14s |
Enzyme DB | 3s |
NCBI Taxonomy | 8m 13s |
Uniprot (SwissProt) | 2h 22m 22s |
Uniprot (TrEMBL) | - |
UniRef | 10h 37m 21s |
Protein Interactions (SwissProt) | 8m 53s |
Protein Interactions (TrEMBL) | 17h 20m |
Time spent by the following programs when using SwissProt or TrEMBL XML as source file:
Program | SwissProt | TrEMBL |
---|---|---|
UniprotGo | 2h 20m 35s | 2d 12h |
UniprotEnzymeDB | 6m 28s | 12h 40m |
UniprotNCBITaxonomy | 6m 32s | 1d 8h 40m |
Total time | 2h 33m 35s | 4d 1h 20m |
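The totals can be recomputed from the individual rows. The sketch below adds up durations written in the `1d 8h 40m` style used in the tables; the helper names `to_seconds` and `format_duration` are made up for this example.

```shell
# Convert a duration such as "1d 8h 40m" or "2h 20m 35s" to seconds.
to_seconds() {
  total=0
  for part in $1; do
    n=${part%?}   # numeric value without the unit letter
    case "$part" in
      *d) total=$((total + n * 86400)) ;;
      *h) total=$((total + n * 3600)) ;;
      *m) total=$((total + n * 60)) ;;
      *s) total=$((total + n)) ;;
    esac
  done
  echo "$total"
}

# Format a number of seconds as "Xd Xh Xm Xs".
format_duration() {
  s=$1
  printf '%dd %dh %dm %ds\n' $((s / 86400)) $((s % 86400 / 3600)) $((s % 3600 / 60)) $((s % 60))
}

# Sum of the SwissProt column.
sum=0
for d in "2h 20m 35s" "6m 28s" "6m 32s"; do
  sum=$((sum + $(to_seconds "$d")))
done
format_duration "$sum"   # prints: 0d 2h 33m 35s
```

The same sum over the TrEMBL rows (2d 12h, 12h 40m and 1d 8h 40m) comes to 4d 9h 20m, whereas the table lists 4d 1h 20m, so one of the TrEMBL figures may be rounded or mistyped.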