Welcome to Reflexiv
Reflexiv is an open-source parallel De novo genome assembler. It addresses the challenge of high memory consumption during De novo genome assembly by leveraging distributed computational resources, and it improves run time with a parallel assembly algorithm. Having trouble fitting a 500 GB De Bruijn graph into memory? Here, you can use ten 64 GB compute nodes to solve the problem, and faster.
How it works
We developed a new data structure called Reflexible Distributed K-mer (RDK). Built on top of the Apache Spark platform, it uses Spark RDDs (resilient distributed datasets) to distribute large amounts of K-mers across the cluster and assembles the genome recursively.
Compared with the conventional (state-of-the-art) De Bruijn graph, RDK stores only the nodes of the graph and discards all the edges. Since the K-mers are distributed across different compute nodes, RDK uses a random K-mer reflecting method to reconnect the nodes across the cluster (a reduce step of the MapReduce paradigm). This method iteratively balances the workload between the nodes and assembles the genome in parallel.
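The sketch below is a minimal, hypothetical illustration of this map/reduce pattern on Spark, not Reflexiv's actual implementation and not its random K-mer reflecting method: K-mers are kept in an RDD as graph nodes with no edges stored, and one extension round reconnects overlapping K-mers through a cluster-wide shuffle (here a join standing in for the reduce step). All names, the toy reads, and the local master setting are illustrative assumptions.

import org.apache.spark.sql.SparkSession

object RdkSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("RDK-sketch")
      .master("local[*]")   // illustrative local run; point to a cluster master in practice
      .getOrCreate()
    val sc = spark.sparkContext

    val k = 5
    // Toy reads, purely for illustration.
    val reads = Seq("ACGTACGTGG", "CGTACGTGGA")

    // Map step: extract K-mers from the reads; the RDD keeps them distributed
    // across the cluster as graph nodes, with no edges stored.
    val kmers = sc.parallelize(reads)
      .flatMap(read => read.sliding(k))
      .distinct()

    // Reduce/shuffle step: key K-mers by their (k-1)-mer suffix and prefix and
    // join them, so overlapping K-mers meet on the same partition and can be
    // merged into longer fragments. Reflexiv repeats such a reconnection step
    // iteratively; its random K-mer reflecting method is not reproduced here.
    val bySuffix = kmers.keyBy(kmer => kmer.takeRight(k - 1))
    val byPrefix = kmers.keyBy(kmer => kmer.take(k - 1))
    val extendedOnce = bySuffix.join(byPrefix)
      .map { case (_, (left, right)) => left + right.last }
      .distinct()

    extendedOnce.collect().foreach(println)
    spark.stop()
  }
}

In Reflexiv the reconnection step is repeated until the fragments cannot be extended further, which is what keeps the assembly workload distributed and balanced across the cluster.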
Getting started
Follow the tutorial to run a simple Reflexiv application on your laptop.
Re-assembly: Extending assembled fragments
Reflexiv can also re-assemble pre-assembled or probe-targeted genome/gene fragments. This is useful for improving the quality of existing assemblies, e.g., completing a gene from a gene domain using whole-genome sequencing data.
Command:
/usr/bin/Reflexiv reassembler \
    --driver-memory 6G \                                     ## Spark parameter
    --executor-memory 60G \                                  ## Spark parameter
    -fastq '/vol/human-microbiome-project/SRS*.tar.bz2' \    # Reflexiv parameter
    -frag /vol/private/gene_fragment.fa \                    # Reflexiv parameter: gene fragments/domains
    -outfile /vol/mybucket/Reflexiv/assembly \               # Reflexiv parameter
    -kmer 31                                                 # Reflexiv parameter

Read more.
Setup cluster
A Spark cluster is essential for scaling out the assembly, i.e., distributing the workload to multiple compute nodes.
Support or Contact
Having trouble using Reflexiv? Please contact the developers or open an issue on the project repository.