Dependencies
Reflexiv is written in Java and built on top of the Apache Spark platform. To use Reflexiv, make sure Java 1.7 or a higher version is installed on your operating system. The Spark framework is also required (Reflexiv can also install a pre-built Spark for you). A Spark cluster should be properly configured before running Reflexiv in cluster mode.
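To verify the Java installation, you can check the version from the command line; any output reporting version 1.7 or higher should be sufficient (the exact wording varies between Java distributions):
$ java -version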
Spark installation
There are different ways to install Spark on your computer. Please visit the Spark Download web page and choose a download method of your preference. If you would like to build Spark from source, please visit the Building Spark web page.
You can also jump directly to Reflexiv installation, where Reflexiv will download a pre-built Spark package (version 2.0.0) for you.
Here is a simple way to download the pre-built Spark package (version 2.0.0) using the wget command:
$ wget http://archive.apache.org/dist/spark/spark-2.0.0/spark-2.0.0-bin-hadoop2.6.tgz
After downloading, untar the package:
$ tar zxvf spark-2.0.0-bin-hadoop2.6.tgz
Go to the unpacked Spark directory. Under the ./bin folder, you should see an executable file named spark-submit:
$ cd ./spark-2.0.0-bin-hadoop2.6
$ ./bin/spark-submit
Options:
  --master MASTER_URL         spark://host:port, mesos://host:port, yarn, or local.
  --deploy-mode DEPLOY_MODE   Whether to launch the driver program locally ("client") or
                              on one of the worker machines inside the cluster ("cluster")
                              (Default: client).
  --class CLASS_NAME          Your application's main class (for Java / Scala apps).
  --name NAME                 A name of your application.
  --jars JARS                 Comma-separated list of local jars to include on the driver
                              and executor classpaths.
  --packages                  Comma-separated list of maven coordinates of jars to include
                              on the driver and executor classpaths. Will search the local
                              maven repo, then maven central and any additional remote
                              repositories given by --repositories. The format for the
                              coordinates should be groupId:artifactId:version.
  --exclude-packages          Comma-separated list of groupId:artifactId, to exclude while
                              resolving the dependencies provided in --packages to avoid
                              dependency conflicts.
  --repositories              Comma-separated list of additional remote repositories to
                              search for the maven coordinates given with --packages.
  --py-files PY_FILES         Comma-separated list of .zip, .egg, or .py files to place
                              on the PYTHONPATH for Python apps.
  --files FILES               Comma-separated list of files to be placed in the working
                              directory of each executor.
  --conf PROP=VALUE           Arbitrary Spark configuration property.
  --properties-file FILE      Path to a file from which to load extra properties. If not
                              specified, this will look for conf/spark-defaults.conf.
  --driver-memory MEM         Memory for driver (e.g. 1000M, 2G) (Default: 1024M).
  --driver-java-options       Extra Java options to pass to the driver.
  --driver-library-path       Extra library path entries to pass to the driver.
  --driver-class-path         Extra class path entries to pass to the driver. Note that
                              jars added with --jars are automatically included in the
                              classpath.
  --executor-memory MEM       Memory per executor (e.g. 1000M, 2G) (Default: 1G).
  --proxy-user NAME           User to impersonate when submitting the application.
                              This argument does not work with --principal / --keytab.
  --help, -h                  Show this help message and exit.
  --verbose, -v               Print additional debug output.
  --version,                  Print the version of current Spark.

 Spark standalone with cluster deploy mode only:
  --driver-cores NUM          Cores for driver (Default: 1).

 Spark standalone or Mesos with cluster deploy mode only:
  --supervise                 If given, restarts the driver on failure.
  --kill SUBMISSION_ID        If given, kills the driver specified.
  --status SUBMISSION_ID      If given, requests the status of the driver specified.

 Spark standalone and Mesos only:
  --total-executor-cores NUM  Total cores for all executors.

 Spark standalone and YARN only:
  --executor-cores NUM        Number of cores per executor. (Default: 1 in YARN mode,
                              or all available cores on the worker in standalone mode)

 YARN-only:
  --driver-cores NUM          Number of cores used by the driver, only in cluster mode
                              (Default: 1).
  --queue QUEUE_NAME          The YARN queue to submit to (Default: "default").
  --num-executors NUM         Number of executors to launch (Default: 2).
                              If dynamic allocation is enabled, the initial number of
                              executors will be at least NUM.
  --archives ARCHIVES         Comma separated list of archives to be extracted into the
                              working directory of each executor.
  --principal PRINCIPAL       Principal to be used to login to KDC, while running on
                              secure HDFS.
  --keytab KEYTAB             The full path to the file that contains the keytab for the
                              principal specified above. This keytab will be copied to
                              the node running the Application Master via the Secure
                              Distributed Cache, for renewing the login tickets and the
                              delegation tokens periodically.
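If you want an end-to-end sanity check of the Spark installation, you can optionally submit the bundled SparkPi example in local mode. This is a generic Spark smoke test rather than part of the Reflexiv setup, and the examples jar path below assumes the pre-built Spark 2.0.0 package (adjust it if your download differs):
$ ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master 'local[2]' examples/jars/spark-examples_2.11-2.0.0.jar 10
A line such as "Pi is roughly 3.14..." near the end of the log indicates that spark-submit works.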
Set the environment variable (ENV) SPARK_HOME to your unpacked Spark directory:
$ pwd
Then copy the full path from the output.
Set the environment variable in your bash profile:
$ vi /home/ec2-user/.bash_profile
or:
$ vi /home/ec2-user/.profile
Here, ec2-user refers to your own user name. Paste the full path into the file and assign it to the SPARK_HOME environment variable:
# bash_profile
export SPARK_HOME="/vol/ec2-user/spark-2.0.0-bin-hadoop2.6"
Reload changes in the file:
$ source /home/ec2-user/.bash_profile
Now Reflexiv should be able to use the Spark framework.
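As an optional sanity check, you can print the variable and query the Spark version; the exact output depends on your installation:
$ echo $SPARK_HOME
$ $SPARK_HOME/bin/spark-submit --version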
- If you do not have wget installed, use the curl command instead:
  $ curl http://archive.apache.org/dist/spark/spark-2.0.0/spark-2.0.0-bin-hadoop2.6.tgz -o spark-2.0.0-bin-hadoop2.6.tgz
- Reflexiv uses the spark-submit executable to submit Reflexiv applications to the Spark cluster. As long as Reflexiv can find the full path of spark-submit, it will work properly.
Reflexiv installation
Download the Reflexiv package from GitHub as a zipball:
$ wget https://github.com/rhinempi/Reflexiv/archive/latest.zip -O ./Reflexiv.zip
If you do not have wget installed:
$ curl -fsSL https://github.com/rhinempi/Reflexiv/archive/latest.zip -o ./Reflexiv.zip
Or as a tarball:
$ wget https://github.com/rhinempi/Reflexiv/archive/latest.tar.gz -O ./Reflexiv.tar.gz
Alternatively:
$ curl -fsSL https://github.com/rhinempi/Reflexiv/archive/latest.tar.gz -o ./Reflexiv.tar.gz
Or press the download button at the top, or click Download.
Once downloaded, unzip or untar the package:
$ unzip ./Reflexiv.zip
$ tar zxvf ./Reflexiv.tar.gz
Go to the Reflexiv directory. You should see an executable file named reflexiv in the ./bin folder. Make it executable:
$ cd ./Reflexiv-latest
$ chmod 755 ./bin/reflexiv
Now, you should be able to run Reflexiv:
$ ./bin/reflexiv
Reflexiv - on the cloud.
Version: 0.3

Commands:
  run           Run the entire assembly pipeline
  counter       counting Kmer frequency
  reassembler   re-assemble and extend genome fragments

Type each command to view its options, eg. Usage: ./reflexiv run

Spark cluster configuration:
  --spark-conf    Spark cluster configuration file or spark input parameters
  --spark-param   Spark cluster parameters in quotation marks "--driver-memory 4G --executor-memory 16G"
  --spark-help    View spark-submit options. You can include spark`s options directly.

Usage:
  reflexiv [commands] --spark-conf spark_cluster_default.conf [option...]
  reflexiv [commands] --spark-param "--driver-memory 4G --executor-memory 16G" [option...]
  reflexiv [commands] --driver-memory 4G --executor-memory 16G --executor-cores 2 [option...]

For detailed cluster submission, please refer to scripts located in: ./sbin
- The executable file reflexiv is a shell script that wraps the spark-submit executable with the Reflexiv jar file. Examples of full commands to submit Reflexiv applications can be found in the ./sbin folder.
- The input parameters for Reflexiv consist of options for both the Spark framework and the corresponding Reflexiv application. The Spark options start with two dashes (--) and configure the Spark cluster, whereas the Reflexiv options start with one dash (-) and set the parameters of each Reflexiv application; see the annotated command below.
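For illustration, here is the test-run command from the next section with the two groups of options annotated (the file paths and values are only placeholders):
# --driver-memory and --executor-memory (two dashes) are Spark framework options;
# -fastq, -kmer, -cover and -outfile (one dash) are Reflexiv assembly options.
$ ./bin/reflexiv run --driver-memory 3G --executor-memory 3G -fastq './example/paired_dat*.fq.gz' -kmer 31 -cover 3 -outfile ./example/result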
Test run
To test run Reflexiv, we prepared a small sequencing dataset from the Human Microbiome Project (HMP). We will try to assemble a part of the E. coli genome:
Run Reflexiv with a specific command to print out its help information:
$ ./bin/reflexiv run
Reflexiv 15:42:12 Reflexiv main initiating ...
Reflexiv 15:42:12 interpreting parameters.

Name: Reflexiv Main

Options:
  -fastq <input fastq file>            Input NGS data, fastq file format, four line per unit
  -fasta <input fasta file>            Also input NGS data, but in fasta file format, two line per unit
  -outfile <output file>               Output assembly result
  -kmer <kmer size>                    Kmer length for reads mapping
  -overlap <kmer overlap>              Overlap size between two adjacent kmers
  -miniter <minimum iterations>        Minimum iterations for contig construction
  -maxiter <maximum iterations>        Maximum iterations for contig construction
  -clipf <clip front nt>               Clip N number of nucleotides from the beginning of the reads
  -clipe <clip end nt>                 Clip N number of nucleotides from the end of the reads
  -cover <minimal kmer coverage>       Minimal coverage to filter low freq kmers
  -maxcov <maximal kmer coverage>      Maximal coverage to filter high freq kmers
  -minlength <minimal read length>     Minimal read length required for assembly
  -mincontig <minimal contig length>   Minimal contig length to be reported
  -partition <re-partition number>     re generate N number of partitions
  -bubble                              Set to NOT remove bubbles.
  -cache                               weather to store data in memory or not
  -version                             show version information
  -h -help                             print and show this information

Usage:
  run de novo genome assembly :
  spark-submit [spark parameter] --class uni.bielefeld.cmg.reflexiv.main.Main reflexiv.jar [parameters] -fastq input.fq -kmer 63 -outfile output_file
  spark-submit [spark parameter] --class uni.bielefeld.cmg.reflexiv.main.Main reflexiv.jar [parameters] -fasta input.txt -kmer 63 -outfile output_file
  reflexiv run [spark parameter] [parameters] -fastq input.fq -kmer 63 -outfile output_file
Following the instructions, set the input sequencing data and the k-mer length accordingly:
$ ./bin/reflexiv run --driver-memory 3G --executor-memory 3G -fastq './example/paired_dat*.fq.gz' -outfile ./example/result -kmer 31 -cover 3
Reflexiv 13:58:38 Reflexiv main initiating ...
Reflexiv 13:58:38 interpreting parameters.
Reflexiv 13:58:38 Initiating Spark context ...
Reflexiv 13:58:38 Start Spark framework
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
18/01/09 13:58:39 INFO SparkContext: Running Spark version 2.0.0
18/01/09 13:58:39 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
After it finishes, you can check the result in the folder ./example/result:
$ ls -al ./example/result
drwxr-xr-x  2 ec2-user users  179 Jan  9 14:00 .
drwxr-xr-x 25 ec2-user users 2146 Jan  9 14:00 ..
-rw-r--r--  1 ec2-user users 4619 Jan  9 14:00 part-00000
-rw-r--r--  1 ec2-user users   48 Jan  9 14:00 .part-00000.crc
-rw-r--r--  1 ec2-user users 4619 Jan  9 14:00 part-00001
-rw-r--r--  1 ec2-user users   48 Jan  9 14:00 .part-00001.crc
-rw-r--r--  1 ec2-user users    0 Jan  9 14:00 _SUCCESS
-rw-r--r--  1 ec2-user users    8 Jan  9 14:00 ._SUCCESS.crc
To view the result:
$ less -S ./example/result/part-00000
>Contig-4558-0
CTGAAAGGGGCGAAAGCCCCTCTGATTATCGGGTTTAGCGCGCTATTGCCTGGCTACCGCTGAGCTCCAGATTTTGAGGTGAAAACAATGAAAATGAATA
AAAGTCTCATCGTCCTCTGTTTATCAGCAGGGTTACTGGCAAGCGCGCCTGGAATTAGCCTTGCCGATGTTAACTACGTACCGCAAAACACCAGCGACGC
GCCAGCCATTCCATCTGCTGCGCTGCAACAACTCACCTGGACACCGGTCGATCAATCTAAAACCCAGACCACCCAACTGGCGACCGGCGGCCAACAACTG
AACGTTCCCGGCATCAGTGGTCCGGTTGCTGCGTACAGCGTCCCGGCAAACATTGGCGAACTGACCCTGACGCTGACCAGCGAAGTGAACAAACAAACCA
GCGTTTTTGCGCCGAACGTGCTGATTCTTGATCAGAACATGACCCCATCAGCCTTCTTCCCCAGCAGTTATTTCACCTACCAGGAACCAGGCGTGATGAG
TGCAGATCGGCTGGAAGGCGTTATGCGCCTGACACCGGCGTTGGGGCAGCAAAAACTTTATGTTCTGGTCTTTACCACGGAAAAAGATCTCCAGCAGACG
ACCCAACTGCTCGACCCGGCTAAAGCCTATGCCAAGGGCGTCGGTAACTCGATCCCGGATATCCCCGATCCGGTTGCTCGTCATACCACCGATGGCTTAC
TGAAACTGAAAGTGAAAACGAACTCCAGCTCCAGCGTGTTGGTAGGACCTTTATTTGGTTCTTCCGCTCCAGCTCCGGTTACGGTAGGTAACACGGCGGC
ACCAGCTGTGGCTGCACCCGCTCCGGCACCGGTGAAGAAAAGCGAGCCGATGCTCAACGACACGGAAAGTTATTTTAATACCGCGATCAAAAACGCTGTC
GCGAAAGGTGATGTTGATAAGGCGTTAAAACTGCTTGATGAAGCTGAACGCCTGGGATCGACATCTGCCCGTTCCACCTTTATCAGCAGTGTAAAAGGCA
AGGGGTAATTACGCCCCACAGTGCTGATTTTGCAACAACTGGTGCGTCTCCTGGCGCACCTTTTTTTATGCTTCCTTCCTGGGATATGAGCGATTTTTTA
TAGTAACTCACTTCTTCTTCACTAAGAATATCCATTATCTCAATGCCTTATCAGAGATTCTTTTCCTTTCGCCGGTAGTGTCTGGACATTCAGGCTACTT
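The assembly is written as a set of part-* files, one per Spark partition. If a single FASTA file is more convenient, one simple approach (a sketch; the output name assembly.fasta is just an example) is to concatenate the parts after the run has finished:
$ cat ./example/result/part-* > ./example/result/assembly.fasta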
- In case you want to do a test run with a larger dataset, increase the memory configuration by setting --driver-memory and --executor-memory to higher values, for example:
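A hypothetical command for a larger input might look like the following (the memory values and file paths are placeholders to be adapted to your data and machine):
$ ./bin/reflexiv run --driver-memory 8G --executor-memory 16G -fastq './your_reads*.fq.gz' -outfile ./your_result -kmer 31 -cover 3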
Where to go from here
Want to know how to use each module of Reflexiv?
Read the user manual to see the specific options of each function.
Want to use Reflexiv on a local cluster?
Try setting up a Spark cluster on the Sun Grid Engine (SGE).
Want to use Reflexiv on the Amazon Elastic Compute Cloud (EC2)?
Try setting up a Spark cluster on the Amazon AWS cloud.
Or check out some examples on how to use different modules for various analyses.