
Fast and Scalable Genomics Workflows with Spark

Every seven months, the volume of genomics data doubles. As a result, scientists in this field face big data challenges in data management and in designing efficient algorithms. In this post, we apply Glow to build a big data architecture for genomics data. Glow is built on Apache Spark and Delta Lake, two popular big data technologies for distributed data processing and storage.

source: databricks.com

We will read the latest version of the GRCh38 dbSNP VCF and store it in Delta format. To read VCF data, we need to open a Spark session with spark-shell and add the required dependencies. To add the Glow and Delta dependencies to the Spark session, go to the Spark installation directory and run the command below.

./bin/spark-shell --master local[20] \
  --packages io.projectglow:glow-spark3_2.12:1.1.2,io.delta:delta-core_2.12:1.0.1

The Delta Lake version must be compatible with your Spark version; the compatibility matrix for each Delta release is available in the Delta Lake documentation.
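If you are not sure which Spark build you have, you can print it before picking a Delta release:

./bin/spark-shell --version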

Spark downloads the JAR dependencies into the .ivy2/jars directory under your home folder. If you do not want to download them again, it is sufficient to point Spark at that jars directory.

./bin/spark-shell --master local[20] --jars "~/.ivy2/jars/*"

Because some fields of the SNP data contain very large character content, it is better to give Spark additional off-heap memory. In the conf folder of the Spark directory, add the following to spark-defaults.conf:

spark.memory.offHeap.enabled     true
spark.memory.offHeap.size        2048m
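If you prefer not to edit spark-defaults.conf, the same settings can be passed on the command line with --conf; this is an equivalent way to launch the shell:

./bin/spark-shell --master local[20] \
  --conf spark.memory.offHeap.enabled=true \
  --conf spark.memory.offHeap.size=2048m \
  --packages io.projectglow:glow-spark3_2.12:1.1.2,io.delta:delta-core_2.12:1.0.1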

Running spark-shell opens a Scala REPL with a Spark session already created. Although the Spark shell can be used to write Spark Scala code, it lacks the user-friendliness of an IDE, so it is much easier to write the code elsewhere and then paste it into the REPL. The :paste command provides a quick way to paste multi-line code.

scala> :paste
// Entering paste mode (ctrl-D to finish)

To add the VCF reader format to Spark, import the Glow library and register it with the Spark session.

import io.projectglow.Glow
Glow.register(spark)

Below, the SNP data is read, and a few columns are renamed and selected. Exploding the names column produces one row per variant identifier (rsID).

import org.apache.spark.sql.functions.{col, explode}

val vcf_df = spark.read.format("vcf")
  .load("./data/00-All.vcf.gz")
  .withColumn("ID", explode(col("names")))
  .drop("names", "genotypes")
  .select(
    col("contigName").as("CH"),
    col("start").as("POS"),
    col("ID"),
    col("end").as("END"),
    col("referenceAllele").as("REF"),
    col("alternateAlleles").as("ALT")
  )

Spark transformations are lazy. To see results, you need to trigger an action, such as a simple show.

vcf_df.show
vcf_df.printSchema
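As a quick sanity check, the DataFrame can also be filtered before writing it out. The snippet below is just a sketch; the exact CH values depend on the contig naming convention of your dbSNP build:

vcf_df
  .filter(col("CH") === "1" && col("POS").between(100000L, 200000L))
  .show(5)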

Delta Lake is a storage format based on Apache Parquet and optimized for analytic workloads. To write the data in Delta format, use the code below.

vcf_df
  .write
  .mode("overwrite")
  .format("delta")
  .save("./data/SNP_GRCh38.delta")

Now the data is ready to play with!

val snp = spark
  .read
  .format("delta")
  .load("./data/SNP_GRCh38.delta")

snp.groupBy("CH").count.show(30)