Dependencies

Sparkhit is written in Java and built on top of the Apache Spark platform. To use Sparkhit, make sure you have Java 1.7 or higher installed on your operating system. The Spark framework is also required and must be installed beforehand. A Spark cluster should be properly configured before running Sparkhit in cluster mode. If you want to parallelize your own tools with Sparkhit, make sure they are available on all compute nodes (e.g. copy your scripts to each node or place them on a shared file system). See use your own tool.
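A quick way to check which Java version is available on your PATH (the exact output depends on your JDK distribution; any version of 1.7 or higher is fine):
$ java -version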

Spark installation

There are different ways to install Spark on your computer. Please visit the Spark Download web page and choose a download method of your preference. If you would like to build Spark from source, please visit the Building Spark web page.

You can directly jump to Sparkhit installation, where Sparkhit will download a pre-built Spark package (version 2.0.0) for you.

Here is a simple way to download the pre-built Spark package (version 2.0.0) from the command line:
$ wget http://archive.apache.org/dist/spark/spark-2.0.0/spark-2.0.0-bin-hadoop2.6.tgz

After downloading, untar the package:
$ tar zxvf spark-2.0.0-bin-hadoop2.6.tgz

Go to the unpacked Spark directory. Under the ./bin folder, you should see an executable file spark-submit:
$ cd ./spark-2.0.0-bin-hadoop2.6
$ ./bin/spark-submit

Options:
  --master MASTER_URL         spark://host:port, mesos://host:port, yarn, or local.
  --deploy-mode DEPLOY_MODE   Whether to launch the driver program locally ("client") or
                              on one of the worker machines inside the cluster ("cluster")
                              (Default: client).
  --class CLASS_NAME          Your application's main class (for Java / Scala apps).
  --name NAME                 A name of your application.
  --jars JARS                 Comma-separated list of local jars to include on the driver
                              and executor classpaths.
  --packages                  Comma-separated list of maven coordinates of jars to include
                              on the driver and executor classpaths. Will search the local
                              maven repo, then maven central and any additional remote
                              repositories given by --repositories. The format for the
                              coordinates should be groupId:artifactId:version.
  --exclude-packages          Comma-separated list of groupId:artifactId, to exclude while
                              resolving the dependencies provided in --packages to avoid
                              dependency conflicts.
  --repositories              Comma-separated list of additional remote repositories to
                              search for the maven coordinates given with --packages.
  --py-files PY_FILES         Comma-separated list of .zip, .egg, or .py files to place
                              on the PYTHONPATH for Python apps.
  --files FILES               Comma-separated list of files to be placed in the working
                              directory of each executor.

  --conf PROP=VALUE           Arbitrary Spark configuration property.
  --properties-file FILE      Path to a file from which to load extra properties. If not
                              specified, this will look for conf/spark-defaults.conf.

  --driver-memory MEM         Memory for driver (e.g. 1000M, 2G) (Default: 1024M).
  --driver-java-options       Extra Java options to pass to the driver.
  --driver-library-path       Extra library path entries to pass to the driver.
  --driver-class-path         Extra class path entries to pass to the driver. Note that
                              jars added with --jars are automatically included in the
                              classpath.

  --executor-memory MEM       Memory per executor (e.g. 1000M, 2G) (Default: 1G).

  --proxy-user NAME           User to impersonate when submitting the application.
                              This argument does not work with --principal / --keytab.

  --help, -h                  Show this help message and exit.
  --verbose, -v               Print additional debug output.
  --version,                  Print the version of current Spark.

 Spark standalone with cluster deploy mode only:
  --driver-cores NUM          Cores for driver (Default: 1).

 Spark standalone or Mesos with cluster deploy mode only:
  --supervise                 If given, restarts the driver on failure.
  --kill SUBMISSION_ID        If given, kills the driver specified.
  --status SUBMISSION_ID      If given, requests the status of the driver specified.

 Spark standalone and Mesos only:
  --total-executor-cores NUM  Total cores for all executors.

 Spark standalone and YARN only:
  --executor-cores NUM        Number of cores per executor. (Default: 1 in YARN mode,
                              or all available cores on the worker in standalone mode)

 YARN-only:
  --driver-cores NUM          Number of cores used by the driver, only in cluster mode
                              (Default: 1).
  --queue QUEUE_NAME          The YARN queue to submit to (Default: "default").
  --num-executors NUM         Number of executors to launch (Default: 2).
                              If dynamic allocation is enabled, the initial number of
                              executors will be at least NUM.
  --archives ARCHIVES         Comma separated list of archives to be extracted into the
                              working directory of each executor.
  --principal PRINCIPAL       Principal to be used to login to KDC, while running on
                              secure HDFS.
  --keytab KEYTAB             The full path to the file that contains the keytab for the
                              principal specified above. This keytab will be copied to
                              the node running the Application Master via the Secure
                              Distributed Cache, for renewing the login tickets and the
                              delegation tokens periodically.

Set the environment variable SPARK_HOME to the full path of your unpacked Spark directory. Print the path of the current directory and copy it:
$ pwd
Then open your bash profile in an editor:
$ vi /home/ec2-user/.bash_profile or $ vi /home/ec2-user/.profile
where ec2-user refers to your own user name. Paste the copied path into the file and export it as SPARK_HOME.

# bash_profile
export SPARK_HOME="/vol/ec2-user/spark-2.0.0-bin-hadoop2.6"

Reload changes in the file:
$ source /home/ec2-user/.bash_profile

Now Sparkhit should be able to use the Spark framework.
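To double-check that the variable is set and that the spark-submit executable is reachable, you can run (the path follows the example above):
$ echo $SPARK_HOME
$ $SPARK_HOME/bin/spark-submit --version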

☕  Notes
  1. If you do not have wget installed, use the curl command instead: curl http://archive.apache.org/dist/spark/spark-2.0.0/spark-2.0.0-bin-hadoop2.6.tgz -o spark-2.0.0-bin-hadoop2.6.tgz
  2. Sparkhit uses the spark-submit executable to submit Sparkhit applications to the Spark cluster. As long as Sparkhit can find the full path of spark-submit, it will work properly.

Sparkhit installation

Download the Sparkhit package from GitHub as a zipball:
$ wget https://github.com/rhinempi/sparkhit/archive/latest.zip -O ./sparkhit.zip
If you do not have wget installed:
$ curl -fsSL https://github.com/rhinempi/sparkhit/archive/latest.zip -o ./sparkhit.zip

Or as a tarball:
$ wget https://github.com/rhinempi/sparkhit/archive/latest.tar.gz -O ./sparkhit.tar.gz
Alternatively:
$ curl -fsSL https://github.com/rhinempi/sparkhit/archive/latest.tar.gz -o ./sparkhit.tar.gz
Or press the download button at the top of the page, or click Download.

Once downloaded, unzip or untar the package:
$ unzip ./sparkhit.zip
$ tar zxvf ./sparkhit.tar.gz

Go to the Sparkhit directory. You should see an executable file sparkhit in the ./bin folder. Set its permissions so that it is executable:
$ cd ./sparkhit-latest
$ chmod 755 ./bin/sparkhit

Now, you should be able to run Sparkhit:
$ ./bin/sparkhit

sparkhit - on the cloud.
Version: 1.0

Commands:
  recruiter      Fragment recruitment
  mapper         NGS short read mapping
  reporter       Summarize recruitment result
  piper          Send data to external tools, eg. bwa, bowtie2 and fr-hit
  parallelizer   Parallel a task to each worker node
  cluster        Run cluster to a table
  tester         Run Chi-square test
  converter      Convert different file format: fastq, fasta or line based
  correlationer  Run Correlation test
  decompresser   Parallel decompression to splitable compressed files, eg. bzip2
  reductioner    Run Principle component analysis
  regressioner   Run logistic regression
  statisticer    Run Hardy–Weinberg Equilibrium
  variationer    Genotype with samtools mpileup
Type each command to view its options, eg. Usage: ./sparkhit mapper

Spark cluster configuration:
  --spark-conf       Spark cluster configuration file or spark input parameters
  --spark-param      Spark cluster parameters in quotation marks "--driver-memory 4G --executor-memory 16G"
  --spark-help       View spark-submit options. You can include spark`s options directly.

Usage: sparkhit [commands] --spark-conf spark_cluster_default.conf [option...]
       sparkhit [commands] --spark-param "--driver-memory 4G --executor-memory 16G" [option...]
       sparkhit [commands] --driver-memory 4G --executor-memory 16G --executor-cores 2 [option...]

For detailed cluster submission, please refer to scripts located in:
./sbin
☕  Notes
  1. The executable file sparkhit is a shell script that wraps the spark-submit executable with the Sparkhit jar file. Examples of full commands to submit Sparkhit applications can be found in the ./sbin folder.
  2. The input parameters for Sparkhit consist of options for both the Spark framework and the corresponding Sparkhit application. The Spark options start with two dashes (--) and configure the Spark cluster, whereas the Sparkhit options start with one dash (-) and set the parameters of the corresponding Sparkhit application (see the example below).
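     For example, the following command (the input and output names are placeholders for illustration) passes --driver-memory and --executor-memory to the Spark framework, while -fastq, -reference and -outfile are interpreted by the Sparkhit recruiter:
     $ ./bin/sparkhit recruiter --driver-memory 2G --executor-memory 2G -fastq input.fq -reference reference.fa -outfile ./result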

Test run

For a test run of Sparkhit, we prepared a small sequencing dataset from the Human Microbiome Project (HMP). We will try to map these reads to an E. coli genome:

Run sparkhit with a specific command to print out its help information:
$ ./bin/sparkhit recruiter

Name:
	SparkHit recruiter

Options:                             
  -fastq <input fastq file>          Input Next Generation Sequencing (NGS) data,
                                     fastq file format, four line per unit
  -line <input line file>            Input NGS data, line based text file format, one
                                     line per unit
  -reference <input reference>       Input genome reference file, usually fasta
                                     format file, as input file
  -outfile <output file>             Output line based file in text format
  -kmer <kmer size>                  Kmer length for reads mapping
  -evalue <e-value>                  e-value threshold, default 10
  -global <global or not>            Use global alignment or not. 0 for local, 1 for
                                     global, default 0
  -unmask <unmask>                   whether mask repeats of lower case nucleotides:
                                     1: yes; 0 :no; default=1
  -overlap <kmer overlap>            small overlap for long read
  -identity <identity threshold>     minimal identity for recruiting a read, default
                                     75 (sensitive mode, fast mode starts from 94)
  -coverage <coverage threshold>     minimal coverage for recruiting a read, default
                                     30
  -minlength <minimal read length>   minimal read length required for processing
  -attempts <number attempts>        maximum number of alignment attempts for one
                                     read to a block, default 20
  -hits <hit number>                 how many hits for output: 0:all; N: top N hits
  -strand <strand +/->
  -thread <number of threads>        How many threads to use for parallelizing
                                     processes,default is 1 cpu. set to 0 is the
                                     number of cpus available!local mode only, for
                                     Spark version, use spark parameter!
  -partition <re-partition number>   re generate number of partitions for .gz data,
                                     as .gz data only have one partition (spark
                                     parallelization)
  -version                           show version information
  -help                              print and show this information
  -h

Usage:
	run fragment recruitment : 
spark-submit [spark parameters] --class uni.bielefeld.cmg.sparkhit.main.Main [parameters] -fastq query.fq -reference reference.fa -outfile output_file.txt
spark-submit [spark parameters] --class uni.bielefeld.cmg.sparkhit.main.Main [parameters] -line query.txt -reference reference.fa -outfile output_file.txt

Following these instructions, set the input sequencing data and the reference genome accordingly:
$ ./bin/sparkhit recruiter --driver-memory 1G --executor-memory 1G -fastq ./example/Stool-SRS016203.fq.gz -reference ./example/Ecoli.fa -outfile ./example/stool-result

SparkHit 16:54:44 SparkHit main initiating ... 
SparkHit 16:54:44 interpreting parameters.
SparkHit 16:54:44 Initiating Spark context ...
SparkHit 16:54:44 Start Spark framework
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/11/11 16:54:45 INFO SparkContext: Running Spark version 2.0.0
16/11/11 16:54:45 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

After it finishes, you can check the result in the folder ./example/stool-result:
$ ls -al ./example/stool-result

total 130
drwxr-xr-x 2 ec2-user users   118 Nov 11 17:45 .
drwxr-xr-x 3 ec2-user users   135 Nov 11 17:45 ..
-rw-r--r-- 1 ec2-user users 22246 Nov 11 17:45 part-00000
-rw-r--r-- 1 ec2-user users   184 Nov 11 17:45 .part-00000.crc
-rw-r--r-- 1 ec2-user users     0 Nov 11 17:45 _SUCCESS
-rw-r--r-- 1 ec2-user users     8 Nov 11 17:45 ._SUCCESS.crc

To view the result:
$ less -S ./example/stool-result/part-00000

@HWI-EAS324_102408434:3:100:10079:21137/2       91nt    4.26e-31        88      4       91      +       89.01   NC_000913.3     2726600 2726687
@HWI-EAS324_102408434:3:100:10079:21137/2       91nt    2.30e-29        88      4       91      +       87.91   NC_000913.3     3424199 3424286
@HWI-EAS324_102408434:3:100:10079:21137/2       91nt    4.26e-31        88      1       88      -       89.01   NC_000913.3     228256  228343
@HWI-EAS324_102408434:3:100:10079:21137/2       91nt    4.26e-31        88      1       88      -       89.01   NC_000913.3     4040017 4040104
@HWI-EAS324_102408434:3:100:10079:21137/2       91nt    4.26e-31        88      1       88      -       89.01   NC_000913.3     4171138 4171225
@HWI-EAS324_102408434:3:100:10079:21137/2       91nt    4.26e-31        88      1       88      -       89.01   NC_000913.3     4212540 4212627
@HWI-EAS324_102408434:3:100:10079:21137/2       91nt    2.30e-29        88      1       88      -       87.91   NC_000913.3     3946201 3946288
@HWI-EAS324_102408434:3:100:10083:8631/2        100nt   4.08e-24        88      9       96      +       77.00   NC_000913.3     224064  224151
@HWI-EAS324_102408434:3:100:10083:8631/2        100nt   4.08e-24        88      9       96      +       77.00   NC_000913.3     3942101 3942188
@HWI-EAS324_102408434:3:100:10083:8631/2        100nt   4.08e-24        88      9       96      +       77.00   NC_000913.3     4035824 4035911
@HWI-EAS324_102408434:3:100:10083:8631/2        100nt   4.08e-24        88      9       96      +       77.00   NC_000913.3     4166952 4167039

This file is just an intermediate result containing the mapping information for each read. To learn more about the result, please go to the user manual page.

☕  Notes
  1. In case you want to do a test run with a larger dataset, increase the memory configuration by setting --driver-memory and --executor-memory to higher values, as shown below.
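     A sketch of such a command, reusing the example input files from above; the memory sizes and the output path are placeholders to adjust to your machine:
     $ ./bin/sparkhit recruiter --driver-memory 4G --executor-memory 8G -fastq ./example/Stool-SRS016203.fq.gz -reference ./example/Ecoli.fa -outfile ./example/stool-result-highmem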

Where to go from here

Want to know how to use each module of Sparkhit?
Read the user manual to see the specific options of each function.

Want to use Sparkhit on a local cluster?
Try setting up a Spark cluster on the Sun Grid Engine (SGE).

Want to use Sparkhit on the Amazon Elastic Compute Cloud (EC2)?
Try setting up a Spark cluster on the Amazon AWS cloud.

Or check out some examples of how to use different modules for various analyses.