\documentclass[11pt]{article}
\usepackage{underscore}
%\usepackage{geometry} % See geometry.pdf to learn the layout options. There are lots.
%\geometry{letterpaper} % ... or a4paper or a5paper or ...
%\geometry{landscape} % Activate for rotated page geometry
%\usepackage[parfill]{parskip} % Activate to begin paragraphs with an empty line rather than an indent
%\usepackage{graphicx}
%\usepackage{amssymb}
%\usepackage{epstopdf}
%\DeclareGraphicsRule{.tif}{png}{.png}{`convert #1 `dirname #1`/`basename #1 .tif`.png}
\usepackage{ifpdf}
\ifpdf
\pdfcompresslevel=9
\pdfpagewidth=8.5 true in
\pdfpageheight=11 true in
\pdfhorigin=.25 true in
\pdfvorigin=.25 true in
\fi
\usepackage{enumerate}
\usepackage{hyperref}
\usepackage{xspace}
\textwidth=7.50in
\textheight=10.in
\topmargin=0.0in
\headsep=0.0in
\headheight=0.0in
\oddsidemargin=0.25in
\evensidemargin=0.25in
\title{Livermore Metagenomics Analysis Toolkit - LMAT}
%\author{The Author}
%\date{} % Activate to display a given date or no date
\newcommand{\lmatver}{VERSIONNO}
\begin{document}
\maketitle
\tableofcontents
\section{What's New in Version \lmatver}
Several bug fixes for the Gene Label application.
{\it Earlier updates:}
Version 1.2.5
\begin{itemize}
\item
Support for ``Compacted'' ML databases. Run get\_db.sh to see the list. get\_db.sh creates ``sparse'' files.
\item
For genes reported from the gene database, report new k-mer statistics.
\item
Scripts to convert taxonomic output to MEGAN format and to convert gene calls to SEED and KEGG IDs for use in MEGAN.
\item
New ``runtime\_inputs'' download available. This download is required to support the above-mentioned new features of Compacted ML databases and gene reporting.
\end{itemize}
Version 1.2.4
\begin{itemize}
\item
Updated reference databases.
\item
Support cross links between taxonomy profiles and gene label profiles.
\item
Support for world-region classification.
\item
Fixed bug in fs-report of genus for consistent output.
\item
Fixed bug in reported k-mer count of input database.
\item
New utility for output formatted for use with other tools.
\end{itemize}
Version 1.2.3
\begin{itemize}
\item
Database build supports human and select k-mer inputs.
\item
Updated content summarization.
\item
Permissive match feature.
\item
Summary reporting at specific ranks. Additionally, genus and species/strain level reports generated by default are combined and formatted in html.
\item
run\_lmat.sh has been split into three shell scripts for each subcomponent of LMAT query processing: run\_rl.sh for read labeling; run\_gl.sh for gene labeling; run\_cs.sh for content summarization.
\end{itemize}
Version 1.2.2
\begin{itemize}
\item
Support for reading query input from stdin in the read\_label module. Use ``-i -'' to enable this feature.
\item
Fixed a bug where incorrect header information was included in LMAT raw output (individual read labels).
\item New aggressive plasmid assignment. Previously, plasmids were under-reported due to their placement in the NCBI taxonomy hierarchy: the same plasmid associated with multiple strains will normally be classified as a species (or higher rank) call according to the LMAT lowest common ancestor placement logic. Now, in addition to the standard LMAT classification procedure, the highest scoring matches are saved. If the highest scoring match list includes a plasmid consistent with the initial LMAT classification, the final classification is switched to the plasmid. The drawback
to this procedure is that multiple plasmids with tie scores are not explicitly represented in the final output label. LMAT, however, saves the scores of the competing matched sequences, so a post-processing step (not yet included) can easily tag plasmid calls that are non-specific to the initially reported taxonomy.
\end{itemize}
Version 1.2.1
\begin{itemize}
\item
Support for full database download from the LMAT ftp site.
\item
read\_label no longer re-prints the query input filename for every execution thread.
\end{itemize}
Version 1.2~includes a number of major changes, which should be described in more technical detail in forthcoming publications. Changes include:
\begin{itemize}
\item The k-mer/taxonomy search database (DB) index structure was changed from gnu\_hash to a custom-designed index.
This dramatically improves speed and reduces memory requirements.
\item NCBI taxonomy identifiers are used instead of our own local database representation (should simplify interpretation)
\item The DB stores NCBI taxonomy identifiers in a 16-bit representation to substantially reduce memory requirements
(our current Full database is 372 GB compared to the previous size of 540 GB)
\item The read binning step is now included in the initial database search application to increase speed.
\item Added a content summarization step. Output from initial read labeling is re-processed to estimate abundance profiles
and to assign each read (including higher ranked reads) to the nearest most likely genome of origin. A more detailed description
of the method is given later in the documentation.
\item A gene annotation option is included.
\item The null-model has been improved to select a GC-context-specific null model based on the GC content of the query read.
\item The run script for LMAT has been refined to better allow individually customized runs.
\item Input sets can now be ``load-balanced'' among multiple cores.
\item The module to label reads includes techniques for improving runtime, with slightly reduced accuracy, by selectively pruning low-rank taxonomy identifiers.
\item The module to index the database can be enabled to use the same pruning technique (as above) to build databases with reduced numbers of taxonomy identifiers that should give even better runtime performance than runtime pruning.
\item Use of the latest jemalloc memory management library, which simplifies file management.
\item A utility is provided for retrieving organism-specific reads and placing them into a single fasta file.
\item The runtime-input files are now provided in a separate download from our ftp site. As a result, the source tarball is significantly smaller.
\item A script is provided to download databases and the runtime-input files from the ftp site.
\item Pre-compiled binary files are no longer being provided.
\end{itemize}
Version 1.1.2 differs from LMAT-1.1.1 in only the following:
We add support for null-model generation (Section \ref{sec:nullmodel})
where random reads are now explicitly drawn from a range of GC content
values.
\section{Overview}
This software package (LMAT) contains the source code for the approach initially described in
``Scalable metagenomic taxonomy classification using a reference genome
database''.
The web site for LMAT is \texttt{http://computation-rnd.llnl.gov/lmat}.
The basic requirement is x86\_64 Linux (tested with 2.6.32-358 or later kernels). Running on other 64-bit platforms is unconfirmed but feasible (please share your experiences).
Our approach is divided into two steps:
\begin{enumerate}[A.]
\label{sec:over-a}
\item
{
%\setlength{\itemindent{15pt}}
{\bf Searchable K-mer/Taxonomy Database Preparation:}\\
\\
In somewhat simplistic
terms, the database is a hashtable that maps k-mers to the genomes in which they
appear, and the genomes to nodes in a taxonomy tree. (Hereafter
we abbreviate "Searchable K-mer/Taxonomy Database" as DB.)
Preparing a full DB involves several steps and takes many hours.
("Full" refers to a database that contains k-mers from
all viral, prokaryote, fungal and protist genomes.)
Our current ``Full'' database is 460 GB in size, smaller than
what we used for our July 2013 publication in Bioinformatics due
to several of the key optimizations within this release, yet
larger than previous database releases due to a large increase in
available reference sequences. We have several smaller
``marker'' databases available for ftp download. See the
\nameref{sec:downloads} section for more information about the
various LMAT databases available for download.
An additional download required for running LMAT is the
collection of ``runtime-input'' files needed to go along with the
database (about 300MB in total). For convenience, we include a
script to perform these ftp downloads, details below.
Our databases are available for download from the ftp site:
\texttt{gdo-bioinformatics.ucllnl.org/pub/lmat}.
First, download the external data files needed to run LMAT; these include the required NCBI taxonomy information, gene annotation and other support data.
The most recent marker library is available with or without human k-mers included; both versions are under 20 GB.
\begin{verbatim}
cd <LMAT-root-dir>/bin
# <name-of-input> reflects current taxonomy data used - current version is: 04072014
./get_db.sh --dtype=inputs --name=<name-of-input> --outdir=<LMAT-root-dir>/runtime_inputs
# set the environment variable to point to the location (csh shown;
# in bash use: export LMAT_DIR=<LMAT-root-dir>/runtime_inputs)
setenv LMAT_DIR <LMAT-root-dir>/runtime_inputs
# Download the marker library, set <db-name> to kML.v4-14.20.g10.db
# or kML+Human.v4-14.20.g10.db
./get_db.sh --dtype=db --name=<db-name> <data-dir>
# Download the gene annotation library, set <db-name> to
# lmat.genes.7-14.db (for gene annotation only; latest requires
# 128 GB DRAM or SSD)
./get_db.sh --dtype=db --name=<db-name> <data-dir>
# Note that LMAT uses a .db extension for its database files,
# omitted in the command syntax above
\end{verbatim}
Due to the time and effort involved in constructing a reference
DB, we suggest that first-time users begin with our Marker
database. In this case you can skip the sections "Searchable
K-mer/Taxonomy Database Preparation" and "Custom Null Model
Generation". Our \nameref{sec:quick-start} instructions give a
brief example of how to perform classification using the
downloadable Marker DB and included application binary
executable.
We include a shell script to download and assemble the larger
marker library as it is broken into 5 separate compressed files.
Run \texttt{./get\_db.sh} from LMAT-\lmatver/bin.
Alternatively, each file can be downloaded separately to reduce
the risk of incomplete downloads, then uncompressed and merged to
create the whole marker database file.
We highly recommend the use of a "ramdisk", Solid-State Drive, or
other flash-based storage (e.g., a PCIe card) to store the
uncompressed DB files. Make sure that there is enough space on
the device to hold the relatively large file(s). Save these
file(s) to a directory on that storage device and use
\texttt{gunzip} to uncompress the \texttt{*.gz}.
The files must remain with read-write ("rw") file access
permissions set or LMAT cannot open them.
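For example, uncompressing a downloaded database and confirming its permissions might look like the following (a minimal sketch; \texttt{/ssd/lmat} and the filename are placeholders for your own storage location and download):
\begin{verbatim}
cd /ssd/lmat
gunzip kML.v4-14.20.g10.db.gz
chmod u+rw kML.v4-14.20.g10.db   # ensure read-write access for the owner
\end{verbatim}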
Additionally, we encourage users to install
the DI-MMAP runtime (optional) if considering use of LMAT
databases on flash-based storage. Please see
\texttt{https://bitbucket.org/vanessen/di-mmap} for more information.
Accessing the database at run-time over a NFS mounted file system
is not recommended.
\begin{quote}
{\it Tip:} If a ramdisk is not available on a large (512GB+) server,
we recommend that the database be ``precopied'' into the system's
buffer cache (from network attached or local HDD storage) prior to
running LMAT. LMAT uses memory mapped files so pages in the
database are loaded on demand. Regular system prefetching of pages
under the memory-mapped approach is slow unless run from an SSD.
The effectiveness of precopy can depend on your system's buffer
cache policy: some evict pages early so the buffer will not hold the
LMAT database from the precopy to the following LMAT execution. To
perform the ``precopy'':
\texttt{\% cat <database-filename> > /dev/null}.
\end{quote}
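For example, to precopy every database file in a directory (a sketch; \texttt{/data/lmat} is a placeholder path):
\begin{verbatim}
for f in /data/lmat/*.db; do cat "$f" > /dev/null; done
\end{verbatim}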
}
\item
{
%\setlength{\itemindent{15pt}}
{\bf Scoring and Classification:}\\
\\
K-mers in queries are searched for in the DB
and taxonomic lineages are written to file; a PERL script subsequently
analyzes this file and outputs classification and abundance statistics. For more details, refer to the section \nameref{sec:scoring}.
}
\end{enumerate}
\subsection{Quick-Start}
\label{sec:quick-start}
This section gives brief instructions on how to perform scoring and classification on an input query set using the Marker DB. The example in this section is covered in more detail in subsequent sections of this manual. Be sure you have successfully downloaded the reference database files as described in \nameref{sec:over-a} (A) above. Once that is complete, you can run ``make'' to build the binary executable applications and run the scoring portion. To complete the example execution below, you need to provide:
\begin{enumerate}[(1)]
\item
{
%\setlength{\itemindent{15pt}}
an input .fasta file to classify.
}
\item
{
%\setlength{\itemindent{15pt}}
a directory for output data files.
}
\end{enumerate}
\noindent Recommended configuration setting to use LMAT:
disable address space randomization (if not already set).\\
As super-user: \texttt{echo 0 > /proc/sys/kernel/randomize\_va\_space}\\
-or-\\
Use the ``setarch'' command with the -R option (see the man page or setarch documentation online for more details).
Note this step is recommended to reduce any potential memory management conflicts.
So far this step does not appear to be required, and if address space randomization cannot be disabled, LMAT should
still run.
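For example, a read-labeling run launched with randomization disabled might look like this (a sketch; the \texttt{run\_rl.sh} arguments are covered in the example below):
\begin{verbatim}
setarch `uname -m` -R ./run_rl.sh --db_file=<db> --query_file=<fasta> --odir=<dir>
\end{verbatim}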
Example scoring and classification using LMAT with the Marker DB:
\begin{verbatim}
# The environment variable LMAT_DIR must be set and point to the directory containing the files given in the runtime_inputs directory (see above for download information)
# for example in bash:
export LMAT_DIR=../runtime_inputs
# or csh:
#setenv LMAT_DIR ../runtime_inputs
cd LMAT-VERSIONNO/
# build binaries
make
# create a temporary directory to store LMAT output
mkdir tmp
# unpack the simple example, includes 1000 reads from 3 organisms, and example LMAT output
cd example
tar zxvf example.tgz
cd ..
cd bin
There are three main analysis commands:
1) assign taxonomic labels to reads
2A) assign gene labels to reads
2B) compute k-mer abundance statistics
Step 1 must be run prior to running the optional steps 2A and 2B. Steps 2A and 2B can be run independently of each other.
Note that output can vary slightly depending on the choice of the database.
## 1) Assign taxonomic labels to each read and report summary of organism contents
./run_rl.sh --db_file=<path-to-library>/kML+Human.v4-14.20.g10.db --query_file=../example/simple_list.1000.fna --odir=../tmp --threads=8
# Output files:
simple_list.1000.fna.kML+Human.v4-14.20.g10.db.lo.rl_output.0.30.fastsummary.species - Bins reads by species rank and reports the top strain match (where possible)
simple_list.1000.fna.kML+Human.v4-14.20.g10.db.lo.rl_output.0.30.fastsummary.plasmid - Bins reads assigned to plasmids.
simple_list.1000.fna.kML+Human.v4-14.20.g10.db.lo.rl_output.0.30.fastsummary.genus - Bin reads by genus rank
simple_list.1000.fna.kML+Human.v4-14.20.g10.db.lo.rl_output.0.30.fastsummary - Bins for the raw LMAT search results, read bins are in rank flexible categories
simple_list.1000.fna.kML+Human.v4-14.20.g10.db.lo.rl_output.[0-7].out - individually taxonomically labeled reads; NOTE these files will take up substantial disk space as more runs are completed. They are retained for quick access for pulling reads of interest for further analysis
## optionally run the following to convert output to MEGAN format
LMAT2multi-fastsummaryTable.pl -in flst -out outfile -min_score 0.5
where flst is a file containing a list of the LMAT output files e.g. simple_list.1000.fna.kML+Human.v4-14.20.g10.db.lo.rl_output.[0-7].out
## 2A) Assign gene identifiers to each read and report a summary of gene contents
find ../tmp -name simple_list.1000.fna.kML+Human.v4-14.20.g10.db.lo.rl_output[0-9]\*.out > ../tmp/rl_output.flst
./run_gl.sh --db_file=<path-to-gene-database>/gene.20mer.db --ilst=../tmp/rl_output.flst --odir=../tmp --filesum=../tmp/simple_list.1000.fna.kML+Human.v4-14.20.g10.db.lo.rl_output.0.30.fastsummary
# Output files:
simple_list.1000.fna.kML+Human.v4-14.20.g10.db.lo.rl_output.ras.flst.gene.20mer.db.rl_output.0.1.20.genesummary - gene content summary
simple_list.1000.fna.kML+Human.v4-14.20.g10.db.lo.rl_output.ras.flst.gene.20mer.db.rl_output[0-7].out - individually gene labeled reads
# modifies .genus, .species, etc.
#optionally run the following to generate KEGG and SEED profiles for input to MEGAN.
LMAT2multi-genesummaryTable.pl -in flst -out outfile -min_reads 10 -min_score 0.7
where flst is a file listing the raw output files e.g. simple_list.1000.fna.kML+Human.v4-14.20.g10.db.lo.rl_output.ras.flst.gene.20mer.db.rl_output[0-7].out
IMPORTANT - LMAT2multi-genesummaryTable.pl requires mapping files to retrieve KEGG and SEED IDs.
In the LMAT_DIR directory, download and unzip the following mapping files:
wget ftp://gdo-bioinformatics.ucllnl.org/lmat/gi2kegg.map.gz
wget ftp://gdo-bioinformatics.ucllnl.org/lmat/gi2seed.map.gz
wget ftp://gdo-bioinformatics.ucllnl.org/pub/lmat/geneID2proteinGI.gz
## 2B) Run content summarization (k-mer abundance statistics)
find ../tmp -name simple_list.1000.fna.kML+Human.v4-14.20.g10.db.lo.rl_output[0-9]\*.out > ../tmp/rl_output.flst
./run_cs.sh --db_file=<path-to-gene-database>/gene.20mer.db --ilst=../tmp/rl_output.flst --odir=../tmp --filesum=../tmp/simple_list.1000.fna.kML+Human.v4-14.20.g10.db.lo.rl_output.0.30.fastsummary
# Output files with supplementary information
simple_list.1000.fna.kML.18mer.16bit.db.lo.rl_output.0.30.nomatchsum - Count of reads not assigned a taxonomic ID
simple_list.1000.fna.kML+Human.v4-14.20.g10.db.lo.rl_output.0.30.fastsummary.summ.leftover - Count of reads / taxonomic IDs excluded from the final *.summ file (due to not meeting minimum score/abundance thresholds)
# Description of tab delimited columns for each type of output file
*.fastsummary file - Weighted (by taxid assigned score) read count, non-weighted read count, NCBI taxonomy identifier, taxonomy rank and name
*.species - Average read score, Total weighted read count, Count of reads assigned to organism, NCBI taxonomic species ID, taxonomic name, Total strain read score, Count of reads assigned to top strain, NCBI taxonomy strain ID, taxonomy name
(When strain information is not reported it means no strain information could be recovered)
*.genus - Average read score, Total weighted read count, Count of reads assigned to organism, NCBI taxonomic genus ID, taxonomic name
*.plasmid - Average read score, Total weighted read count, Count of reads assigned to organism, taxonomic plasmid ID (frequently a custom LMAT ID), taxonomic name (frequently a fasta header from the original plasmid source sequence)
*.nomatchsum - gives basic information on all reads not included in the taxonomic profile.
ReadTooShort -- Gives number of reads with too few valid k-mers (minimum of 30 required)
NoDbHits -- Number of reads with no matching k-mers in the database.
LowScore -- Number of reads assigned a taxonomic label with a score below the minimum threshold (0)
*.genesummary - read count assigned to gene, LMAT NCBI taxonomic ID, NCBI taxonomic ID for source gene, NCBI Gene ID, NCBI Locus tag, description, gene type, protein accession, # of distinct k-mers, total number of k-mers, # of distinct gene k-mers, % of gene k-mers found (column 10 divided by column 12), total number of k-mers (column 11) divided by # of distinct k-mers (column 10) - reflects k-mer coverage.
*.lo.rl_output[0-9]*.out - query read identifier, query read sequence, scoring statistics (average,standard deviation, # of k-mers), list of taxid,score pairs, final LMAT taxid call with score and match type. Match types are MultiMatch (lowest common ancestor), DirectMatch (best match), NoMatch (no database match).
Note: when run_gl.sh is run and the *.fastsummary file is provided as input, the script will regenerate the *.species file with additional gene information as follows:
Column 4) Pcnt. rRNA - Percentage of reads in bin that were assigned to an rRNA gene.
Column 5) No. Genes - Number of genes associated with the taxonomic bin. Note this number must be viewed with caution, since multiple closely related genes may be reported yet reflect a single gene.
Column 6) No. Gene Reads - Number of reads in bin that were assigned to a gene.
The original columns starting at column 4 are shifted over to the right.
simple_list.1000.fna.kML+Human.v4-14.20.g10.db.lo.rl_output.0.30.fastsummary.summ.RANK_kmer_cov
Reports k-mer abundance distribution profiles for each taxonomic bin for the selected rank - RANK (e.g. species, genus etc.).
The file reports, for each taxonomic bin, the number of distinct k-mers observed in the reads assigned to the bin and the total number of k-mers observed in the bin.
With enough reads in a taxonomic bin, a distribution is also reported with 4 columns, for example:
10995 14 1 50
Here the first column is the NCBI identifier of the bin, the second column is the k-mer value (here k=14), the third column reports the number of times the k-mers were observed, and the fourth column is the number of k-mers that were observed. In the example above, there were 50 distinct 14-mers that were each found once.
Collectively these k-mer statistics are meant to provide a measurement tool for assessing the coverage of a taxonomic bin.
simple_list.1000.fna.kML+Human.v4-14.20.g10.db.lo.rl_output.0.30.fastsummary.ordered.RANK
provides a modified version of the *.fastsummary.RANK files (where RANK is species, genus, etc.)
Additional columns are included reporting k-mer coverage statistics. Each additional tab delimited column is a 4-tuple (a,b,c,d), where a is the value of k for the k-mer, b is the second peak in the k-mer abundance profile taken from the *.RANK_kmer_cov file described above, c is the distinct k-mer count, and d is the total count of k-mers assigned to the taxonomic bin.
# Additional example run configurations
# run just the full database
./run_rl.sh --db_file=<path-to-database>/lmat-4-14.20mer.db --query_file=<your-fasta-file> --odir=<path-to-output-directory>
\end{verbatim}
Note that the marker libraries have a smaller range of score values, making 0 (or values close to 0) the best cutoff. For the full libraries the range of values is larger, giving the user some flexibility to choose higher cutoff values of 1 or 2 to select only the highest confidence taxonomic calls (although 0 is still a reasonable default cutoff). The choice of cutoff depends on the type of sample and analysis. The default (threshold of 0) is meant to maximize sensitivity to detect more distantly related organisms. It is recommended to run with a 0 cutoff first and sort the *.species file by the first column (average read score), which will give an indication of the range of read scores and where to set the cutoff higher to reduce the potential for false positive calls.
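For example, to inspect the read score range in the species report from the quick-start run (a sketch; the filename follows the output naming shown above):
\begin{verbatim}
sort -g -r -k1,1 \
  simple_list.1000.fna.kML+Human.v4-14.20.g10.db.lo.rl_output.0.30.fastsummary.species | head
\end{verbatim}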
Check https://sourceforge.net/p/lmat/wiki/ for additional example information.
If the pre-built executable does not load on your system, proceed with the \nameref{sec:compile} step below.
\section{Required Software Packages}
gcc 4.4.X\\
python 2.6\\
OpenMP\\
MPI (optional, for use in building a Reference Database)
\section{Compilation}
\label{sec:compile}
\begin{enumerate}
%\item
%
% Set up environment variables using your shell syntax; typically:
%
%\begin{verbatim}
%
% export WITH_PJMALLOC=1
% export HAVE_MPI=0
%
% or:
%
% setenv WITH_PJMALLOC 1
% setenv HAVE_MPI 0
%\end{verbatim}
%
% (note: MPI is optional, and only used in some drivers when building
% a DB from scratch; it is not used in the Scoring and Classification
% steps)
\item
If using MPI (optional; only used in some drivers when building
a DB from scratch) set up environment variable using your shell syntax; typically:
\begin{verbatim}
export HAVE_MPI=1
or:
setenv HAVE_MPI 1
\end{verbatim}
(note: MPI is not used in the Scoring and Classification
steps)
\item
%Compile by running (in the \texttt{LMAT-\lmatver} directory): \texttt{make}
% Note: the \texttt{LMAT-\lmatver/lib} includes pre-built libraries for \emph{perm-je},
% \emph{gzstream}, and \emph{gzlib}.
\texttt{LMAT-\lmatver/bin} contains prebuilt executables for all C++ drivers.
If these do not work on your system,
cd to \texttt{LMAT-\lmatver} and run {\em make}.
%cd to \texttt{LMAT-\lmatver/third-party} and rebuild the three packages
%y running '\texttt{make}.'
\item
Disable address space randomization (if not already set).\\
As super-user: \texttt{echo 0 > /proc/sys/kernel/randomize\_va\_space}
\end{enumerate}
\section{Downloadable Databases}
\label{sec:downloads}
LMAT has several prebuilt databases available for download. Use the \texttt{bin/get\_db.sh} utility to automatically perform downloads from the ftp site, mentioned above, using any of the following databases.
Note that the database organization changed between releases. Current databases use the same file and key names.
Current databases:
\begin{tabular}{| c | c | c |}
\hline
Keyname/filename & Description & File size (GB) \\
\hline
lmat-4-14.20mer.db & Full-sized database for extensive read binning (LMAT-Grand) & 460 \\
\hline
kML.v4-14.20.g10.db & Microbial marker database (small database for fast microbial profiling) & 17 \\
\hline
kML+Human.v4-14.20.g10.db & Microbial marker database with explicit human read tagging (small database for fast microbial profiling) & 18 \\
\hline
lmat.genes.7-14.db & Gene database for gene name binning & 120 \\
\hline
lmat-world-region & Database for binning human reads by world region & 2 \\
\hline
\end{tabular}
There are additionally several ``compacted'' databases. For these databases, part of the index table has been reduced in size, so the index has a smaller footprint in your system's filesystem buffer cache. This is helpful for systems with 16 GB of main memory running from an SSD. get\_db.sh displays these database names. These databases are ``sparse'' files, so the reported file size for each will be 32 GB, but the actual storage used in the file system will be less. Use \texttt{du -h} to report actual file storage.
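For example, with a hypothetical compacted database file \texttt{compact.db}:
\begin{verbatim}
ls -lh compact.db   # apparent size of the sparse file (32 GB)
du -h compact.db    # actual storage consumed on disk
\end{verbatim}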
Previous databases (using 9/4/13 sequence data):
\begin{tabular}{| c | c | c |}
\hline
Database key-name (for get\_db script) & Filename & File size (GB) \\
\hline
kML-18mer-large & kML.18mer.no\_prune.16.db & 80 \\
\hline
kML-18mer-medium & kML.18mer.16bit.db & 52 \\
\hline
kML-18mer-small & kML.18mer.16bit.reduced.db & 16 \\
\hline
kFull-20mer & m9.20mer.16bit.g1000.db & 373 \\
\hline
gene-20mer & gene.20mer.db & 52 \\
\hline
\end{tabular}
\section{Searchable K-mer/Taxonomy Database Preparation}
Note: some drivers and scripts described below contain hardcoded paths,
so execute from within the src directory.
\subsection{summary of steps required for building a database}
This section gives a step-by-step overview of the steps required to build a searchable reference database for use with LMAT.
This example assumes that you have already compiled LMAT-\lmatver{} and that your working directory as specified below is also suitable for placing the database (e.g., ramdisk or SSD). If not, it is fine to substitute ``\texttt{WORK}'' below with any suitable directory. The subsequent sections \ref{sec:upd} - \ref{sec:make-db} expand upon these steps in more detail.\\
\\
\texttt{cd <filepath>/LMAT-\lmatver/data}\\
\\
\begin{itemize}
\item
download \texttt{gi\_taxid\_nucl.dmp.gz} from ftp://ftp.ncbi.nih.gov/pub/taxonomy
\end{itemize}
\texttt{cd <filepath>/LMAT-\lmatver/src}\\
\begin{verbatim}
mkdir WORK
#assume your reference genome fasta file is: WORK/test.fa.
#Note to get started we provide a list of genome files from fall 2012, which we used for our initial database
#the identifiers can be obtained here: wget ftp://gdo-bioinformatics.ucllnl.org/pub/lmat/microbe2.gi.header_table.gz
#currently the actual genomes sequences must be retrieved from NCBI
# Individual genome sequences must be mapped to a taxonomy identifier.
# The mapping is specified as a tab delimited file: the first column contains the tax id,
# and the second column contains the header associated with the sequence stored in the
# input fasta file (WORK/test.fa below). For example:
# 418127 >ref|NC_009782.1|gnl|NCBI_GENOMES|21340|gi|156978331|Staphylococcus aureus subsp. aureus Mu3, complete genome
build_header_table.py WORK/test.fa GenomeToTaxID.txt WORK
## WORK/test.fa.int file will be generated as part of the output
# The current version of LMAT requires 16-bit taxonomy identifiers (to save space),
# so a 32-to-16 bit mapping must be created.
# If the mapping from sequence fasta headers to tax ids is stored in a file tax_gen_map.txt:
cut -f1 tax_gen_map.txt > tax_32bit_ids.txt
Tid16_getMapping.py tax_32bit_ids.txt ../runtime_inputs/ncbi_taxonomy.dat tax_32to16.txt
# The mapping will now be contained in the file tax_32to16.txt
kmerPrefixCounter -i WORK/test.fa.int -k 20 -o WORK/kmerDB -l 1 -f 0
kmerPrefixCounter -i WORK/test.fa.int -k 20 -o WORK/kmerDB -l 1 -f 1
kmerPrefixCounter -i WORK/test.fa.int -k 20 -o WORK/kmerDB -l 1 -f 2
kmerPrefixCounter -i WORK/test.fa.int -k 20 -o WORK/kmerDB -l 1 -f 3
tax_histo -o WORK/out.tax_hist.0 -d WORK/kmerDB.0 -g WORK/test.fa.gi_to_tid_map -t ../runtime_inputs/ncbi_taxonomy.dat -f 32
tax_histo -o WORK/out.tax_hist.1 -d WORK/kmerDB.1 -g WORK/test.fa.gi_to_tid_map -t ../runtime_inputs/ncbi_taxonomy.dat -f 32
tax_histo -o WORK/out.tax_hist.2 -d WORK/kmerDB.2 -g WORK/test.fa.gi_to_tid_map -t ../runtime_inputs/ncbi_taxonomy.dat -f 32
tax_histo -o WORK/out.tax_hist.3 -d WORK/kmerDB.3 -g WORK/test.fa.gi_to_tid_map -t ../runtime_inputs/ncbi_taxonomy.dat -f 32
# generate the file containing list of binary source data files
ls WORK/*.bin > WORK/table_input_list
make_db_table -i WORK/table_input_list -o WORK/my_table -k 20 -f tax_32to16.txt
\end{verbatim}
\subsection{How to update the taxonomy}
\label{sec:upd}
This release contains a prebuilt file, \texttt{LMAT-\lmatver/runtime\_inputs/ncbi\_taxonomy.dat},
so you only need to rebuild if you're using sequences with gi numbers
that are not in \texttt{ncbi\_taxonomy.dat}. (\texttt{LMAT-\lmatver/runtime\_inputs/ncbi\_taxonomy.dat}
was built from NCBI files downloaded on September 4, 2013)
To rebuild:
download \texttt{taxdmp.zip} from ftp://ftp.ncbi.nih.gov/pub/taxonomy and unzip;
you can download these to a separate directory if you wish.
\texttt{> parse\_ncbi\_taxonomy.py <dir>}
where \texttt{<dir>} is the directory containing the uncompressed files from \texttt{taxdmp.zip}.
This will create three files:
\begin{verbatim}
../runtime_inputs/ncbi_taxonomy_rank.txt
../runtime_inputs/depth_for_ncbi_taxonomy.dat
../runtime_inputs/ncbi_taxonomy.dat
\end{verbatim}
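A complete rebuild might look like the following (a sketch; the \texttt{taxdump} download directory is a placeholder, and the script is run from \texttt{src} so that the \texttt{../runtime\_inputs} output paths resolve):
\begin{verbatim}
mkdir taxdump && cd taxdump
wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdmp.zip
unzip taxdmp.zip
cd ../src
./parse_ncbi_taxonomy.py ../taxdump
\end{verbatim}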
\subsection{prepare fasta file, and generate gi to tax ID mapping}
\begin{enumerate}[a)]
\item
Download \texttt{gi\_taxid\_nucl.dmp.gz} from ftp://ftp.ncbi.nih.gov/pub/taxonomy
and place it in the \texttt{LMAT-\lmatver/data} directory. Note: this is a large file,
approximately 700M. Do not uncompress this file.
\item
run: \texttt{src/prepare\_fasta.py fasta.fn <output-directory>}\\
\\
where '\texttt{fasta.fn}' contains your fasta formatted reference genomes.
The sequence headers should contain NCBI gi numbers, else they
will be discarded.
Note users can retrieve bacterial, viral and plasmid data as follows:
\begin{verbatim}
wget ftp://ftp.ncbi.nih.gov/genomes/Bacteria/all.fna.tar.gz
wget ftp://ftp.ncbi.nih.gov/genomes/Viruses/all.fna.tar.gz
wget ftp://ftp.ncbi.nih.gov/genomes/Plasmids/all.fna.tar.gz
\end{verbatim}
A list of files used in our original database, which includes eukaryote genomes is provided here:
\begin{verbatim}
wget ftp://gdo-bioinformatics.ucllnl.org/pub/lmat/microbe2.gi.header_table.gz
\end{verbatim}
The \texttt{<output-directory>} will contain several files. \texttt{fasta.fn.int}
is the input to the next step: \texttt{kmerPrefixCounter}.
\texttt{all\_virus.gi\_to\_tid\_map} should replace \texttt{gid\_to\_tid\_kpath.txt}
for the classification phase.
\end{enumerate}
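Putting these steps together, a typical invocation might look like the following (a sketch; \texttt{genomes/} is a hypothetical directory holding the unpacked reference sequences, and the script is run from within \texttt{src} because of the hardcoded paths noted earlier):
\begin{verbatim}
cat genomes/*.fna > all_genomes.fa
cd src
./prepare_fasta.py ../all_genomes.fa WORK
# WORK/all_genomes.fa.int is the input for kmerPrefixCounter
\end{verbatim}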
\begin{itemize}
\item
\textbf{Nice to know:} \texttt{prepare\_fasta.py} executes the following three codes:
\begin{verbatim}
get_gi_numbers.py
function: parses fasta header files to extract NCBI gi numbers;
input: fasta file
input: data/gi_taxid_nucl.dmp.gz
output: file with one gi number per line
build_gi_to_tid_map.cpp
function: maps NCBI gi numbers to NCBI tax_id numbers
input: output file from get_gi_numbers.py
input: gi_taxid_nucl.dmp.gz
output: file with 'gi tid' per line
build_header_table.py
\end{verbatim}
transforms a fasta file containing reference genomes into a format with a
single id for the header; the header is the NCBI taxonomy node ID associated
with the sequence's gi number. If the input file is '\texttt{fasta.fa},'
the output files will be:
\begin{verbatim}
fasta.fa.int - same as fasta.fa, but with transformed headers
fasta.fa.gi.table - each two-line entry contains
(1) gi number
(2) header from the input file
fasta.fa.gi.table - each two-line entry contains
(1) tax node ID;
(2) header from the input file
\end{verbatim}
\end{itemize}
\subsection{run 'kmerPrefixCounter' to build reference genome database files}
Usage for \texttt{kmerPrefixCounter}:
\begin{verbatim}
-i <string> - input fasta_fn
-k <int> - k-mer length
-o <string> - output filename
-l <int> - prefix length (as in: the string representation of the prefix)
-f <int> [optional] first prefix
\end{verbatim}
This driver can build DB files for a large set of reference genomes by
incrementally building files that only contain k-mers that begin with
a given prefix. This driver can be run either sequentially or in
parallel mode.
If you invoke with '\texttt{-l 1}' you will have four output files; the first will
contain (binary encoded) k-mers that begin with '\texttt{a}', the second k-mers that
begin with '\texttt{c},' and so on. For '\texttt{-l 2}' there will be 16 output files.
In general, there will be $4^l$ output files.
The output files contain k-mers, and each k-mer has a list of genome IDs
(gi numbers) in which it appears.
\begin{description}
\item[Sequential Example:]
\begin{verbatim}
kmerPrefixCounter -i my_fasta_file.int -k 18 -o kmer_and_genomes -l 1 -f 0
kmerPrefixCounter -i my_fasta_file.int -k 18 -o kmer_and_genomes -l 1 -f 1
kmerPrefixCounter -i my_fasta_file.int -k 18 -o kmer_and_genomes -l 1 -f 2
kmerPrefixCounter -i my_fasta_file.int -k 18 -o kmer_and_genomes -l 1 -f 3
\end{verbatim}
this will produce four output files: \\
\\
\texttt{kmer\_and\_genomes.0, kmer\_and\_genomes.1}, etc.
\item[Parallel Example 1:]
Use your favorite utility to run kmerPrefixCounter jobs in parallel, eg. gnu parallel (in your path).
\begin{verbatim}
seq 0 3 | parallel kmerPrefixCounter -i my_fasta_file.int -k 18 -o kmer_and_genomes -l 1 -f {}
\end{verbatim}
\item[Parallel Example 2:]
assuming you have a cluster with eight nodes:\\
\begin{verbatim}
mpirun -np 8 kmerPrefixCounter -i my_fasta_file.int -k 18 -o kmer_and_genomes -l 2 -f 0
mpirun -np 8 kmerPrefixCounter -i my_fasta_file.int -k 18 -o kmer_and_genomes -l 2 -f 8
\end{verbatim}
the first run outputs files \texttt{kmer\_and\_genomes.0} through \texttt{kmer\_and\_genomes.7},
and the second outputs files \texttt{kmer\_and\_genomes.8} through \texttt{kmer\_and\_genomes.15}.
\end{description}
\subsection{Tax Histo}
\label{tax-histo}
Run \texttt{tax\_histo} to construct the k-mer/taxonomy database
(discussed in Section 2.1 of our paper); run with no options for usage
information. The output files (specified with \texttt{-o}) are used as input when running \texttt{make\_db\_table}.
\subsection{Make DB Table}
\label{sec:make-db}
Run '\texttt{make\_db\_table};' run with no options for usage information.
For the required argument to specify input, '\texttt{-i <list-file>}', \texttt{<list-file>} contains the list of binary data (.bin) files generated in step \ref{tax-histo} above, e.g. \texttt{my\_tax\_histo\_output.bin}.
Once the execution has completed, the output file specified by -o must remain with read-write (``rw'') file access permissions set, or the subsequent applications cannot open it.
When running \nameref{sec:scoring} you would specify "\texttt{-d my\_ref\_db}"
To use make\_db\_table with 18-mers, you must use the ``-e 1827'' command line argument. This properly configures the indexing for $k=18$.
The -e option also enables ``compacted'' databases, which can be helpful for running databases on systems with limited main memory, e.g. 16 GB. For these databases use ``-e 2025''. For 24 GB, using the default value has shown the best measured runtime performance.
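For example (a sketch reusing the hypothetical file names from the build walk-through above):
\begin{verbatim}
# 18-mer database
make_db_table -i WORK/table_input_list -o WORK/my_table -k 18 -e 1827
# compacted 20-mer database for low-memory (~16 GB) systems
make_db_table -i WORK/table_input_list -o WORK/my_table -k 20 -e 2025
\end{verbatim}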
For values of $k$ other than 18 or 20, set the environment variable
USE_SORTED_DB=0 prior to compilation. All LMAT binaries must be
built with this variable set to use values of $k$ other than 18 or
20. This enables LMAT indexing to use a hash table rather than two-level indexing. Future releases may support different $k$ databases without need for
these particular compile-time changes.
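For example, in bash (a sketch; it assumes the build honors the variable at compile time, as described above):
\begin{verbatim}
export USE_SORTED_DB=0
make clean && make    # rebuild all LMAT binaries with hash-table indexing
\end{verbatim}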
\section{Custom Null Model Generation}
\label{sec:nullmodel}
\begin{enumerate}
\item
Files needed for this step are placed in the rmodel subdirectory. Two
scripts must be run to generate the model files. First, you need to
prepare a count of k-mers for every taxonomy identifier.
\item
\texttt{frequency\_counter}: this executable takes as input the ascii output file from step \ref{tax-histo}.
\item
If run for multiple input files, use \texttt{combine\_counts.py} to aggregate them into a single file, e.g.:
\begin{verbatim}
ls *.kcnt | python combine_counts.py > my_counts.kcnt
\end{verbatim}
\item
Sort the output by count (column 2) in descending order, as shown in the sketch after this list.
\item
Run:
\begin{verbatim}
gen_rand_mod.sh --db_file=<reference-database> --read_len=<integer> --tax_histo_count=<count-kmers.py-output> --filter=<cutoff-param>
\end{verbatim}
Generated files will have the prefix ``null'' and will be of the form:
\texttt{null.bin.10.<database-name>.<read-length>.<number-of-reads>.rl\_output.rand\_lst.gz}
Lines of the file appear as:\\
\\
\texttt{<kmers-per-line> <corresponding file>}\\
\\
and are ranked in ascending order.\\
\\
E.g., for a read length of 80 and using k=20, a line would appear as:\\
\\
\texttt{61 null.bin.10.20merDB.80.480000.rl\_output.rand\_lst.gz}\\
\\
\end{enumerate}
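For the sorting step above, a plain \texttt{sort} invocation suffices (a sketch reusing the aggregated filename from the \texttt{combine\_counts.py} example):
\begin{verbatim}
sort -k2,2 -n -r my_counts.kcnt > my_counts.sorted.kcnt
\end{verbatim}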
\section{Scoring and Classification}
\label{sec:scoring}
\subsection{Query input files}
Input query (read) files should be in fasta format, and sequences should
be contained on a single line. If your sequences span multiple lines,
they should be preprocessed with the \texttt{reformatQueryFile.py} script.
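If you prefer a standard tool, an awk one-liner can perform the same flattening (a sketch; \texttt{reformatQueryFile.py} remains the supported route, and the filenames here are placeholders):
\begin{verbatim}
awk '/^>/  { if (seq) print seq; print; seq="" }
     !/^>/ { seq = seq $0 }
     END   { if (seq) print seq }' multi_line.fa > single_line.fa
\end{verbatim}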
The PhymmBL dataset reported on in our paper can be downloaded from:\\
http://www.cbcb.umd.edu/software/phymm/exp\_datasets/01\_\_10\_X\_100\_bp\_test\_sets/10\_X\_100\_bp.tgz
\subsection{Scoring (Labeling)}
The \texttt{read\_label} driver searches the Reference Database for all k-mers
in the queries, and outputs taxonomic lineages, as discussed in Section 2.2 and
Figure 2 in our paper. Output files have the format: \texttt{<prefix>N.out} - where N is 0 to
$(numthreads-1)$ - even in single threaded execution.
Input to read\_label can come from stdin (supporting piped input) instead of an input file by using: \texttt{-i -}.
\texttt{run\_rl.sh} is the wrapper script for running the LMAT read labeling pipeline; it takes several required
and optional parameters; run with no arguments for usage and an example invocation.
\subsection{Classification}
LMAT bins the reads that were assigned a taxonomic label with a score greater than or equal to the value specified by -x in the \texttt{read\_label} binary (the \texttt{min\_score} variable in \texttt{run\_rl.sh}); the value defaults to 0.
The output of the initial binning is in the file \texttt{<prefix>.fastsummary}. The file \texttt{<prefix>.lineage} gives the output as human-readable lineages that Krona (\texttt{http://sourceforge.net/p/krona/home/krona/})
can process. \texttt{<prefix>.lineage.html} gives the Krona output (when Krona is installed on the user's system).
The user has the option to re-bin the reads applying different thresholds, without re-running the \texttt{read\_label} binary.
\begin{enumerate}
\item
\texttt{losummary\_fast\_mc.sh} is a script to re-bin reads using different thresholds;
run \texttt{losummary\_fast\_mc.sh} with no arguments for usage.
\end{enumerate}
Example usage:
\begin{verbatim}
losummary_fast_mc.sh --file_lst=<list_of_output_files_from_read_label.txt> --min_score=<new minimum score>
# <min_score>: to pick up more distantly related hits, try a value of -0.5 or -1.0;
# to focus on high confidence hits, try a value of 1.0
\end{verbatim}
A similar script is available for re-binning genes with a different threshold;
run \texttt{gsummary\_mc.sh} with no arguments for usage.
\subsection{Read Analysis}
A common post-processing step is to retrieve reads for a specific organism or set of taxonomic IDs.
\texttt{pull\_reads\_mc.sh} (run without command line arguments for usage) retrieves reads from LMAT output.
Output can come either from the initial \texttt{read\_label} output (the *.out output files) or from the reads that
were re-assigned to potentially more specific organisms during the content summary stage (these are the *.ras.1
output files).
Example usage:
\begin{verbatim}
pull_reads_mc.sh --file_lst=<list of files containing read label output> --idfile=<file listing the taxids that specify the reads to store in fasta files>
\end{verbatim}
You can specify multiple taxids to be placed in one fasta file, and/or place multiple taxids in distinct fasta files.
Using the example included with the distribution, try the following in your test working directory:
\begin{verbatim}
echo 136845 > tid.txt
../bin/pull_reads_mc.sh --idfile=tid.txt --file_lst=simple_list.1000.fna.kML+Human.v4-14.20.g10.db.lo.rl_output.flst
\end{verbatim}
The script will create a fasta file with the selected reads, deriving the filename from the input information. The output should be a fasta file with the name (names differ with the choice of database):
\texttt{simple_list.1000.fna.kML+Human.v4-14.20.g10.db.tid.txt.136845.fna}
A second utility script is included called \texttt{build_taxid_lst.sh}, which allows the user to build a taxid list using an NCBI taxonomy string keyword of the form rank,rank_value.
For example, superkingdom,Viruses will retrieve all reads explicitly tagged as viral reads in the user-specified *.fastsummary files (output generated by LMAT's \texttt{read_label} and \texttt{losummary_fast_mc.sh}).
Example usage:
\begin{verbatim}
build_taxid_lst.sh --file_lst=fastsummary.file_lst --sstr=superkingdom,Viruses --idfile=viruses
\end{verbatim}
Using the example included with the distribution, try the following in your test working directory:
\begin{verbatim}
ls -1 *.fastsummary > fastsummary.file_lst
../bin/build_taxid_lst.sh --file_lst=fastsummary.file_lst --sstr=superkingdom,Viruses --idfile=viruses
\end{verbatim}
The script will create a text file with the taxonomic IDs, deriving the filename from the input information. For the example above, the output file should be:
\texttt{simple_list.1000.fna.kML+Human.v4-14.20.g10.db.lo.rl_output.0.30.fastsummary.viruses.idtxt}
Most of the utility scripts that operate on sequencer reads require
GNU parallel (\texttt{http://www.gnu.org/software/parallel/}), which is included with the LMAT distribution for convenience.
\texttt{LMAT2multi-fastsummaryTable.pl}: use this script to convert .fastsummary, .species,
.genus, or .plasmid files into tabular files for analysis with tools
such as MEGAN or custom R scripts. It converts a list
of LMAT summary files into a table with organisms as rows and
metagenome samples as columns, with entries as either the fraction of
reads or read counts. Run it without arguments for further
instructions.
\section{Content Summarization}
There are major changes to the content summary module (called \texttt{content\_summ} in
\texttt{run\_cs.sh}). \texttt{read\_label} generates the rank-flexible list of
taxonomic IDs (TIDs), reporting the number of reads assigned to each
TID and the weighted read count (the sum of scores from each read
assignment). Reads initially assigned to higher ranks are no longer
aggressively re-assigned to lower-order ranks based on the first-pass
assessment of sample contents. This approach has so far proven to be overly aggressive,
over-reaching the amount of taxonomic information that can be extracted from
individual reads. Instead, a more empirical report is attempted: k-mer
coverage distributions for various values of k are reported for each species call. The working
hypothesis is that this information can be used to infer the coverage of individual genomes,
obviating the need to map the reads to individual genomes and compute coverage statistics.
This option is made available for further consideration.