The volume of genomics data doubles roughly every seven months. As a result, scientists in this field face big data challenges such as data management and the design of efficient algorithms. In this post, we will use Glow to build a big data architecture for genomics data. Glow is built on Apache Spark and integrates with Delta Lake, two popular big data technologies for distributed data processing and storage.
We will read the latest GRCh38 build of dbSNP and store it in Delta format. To read VCF data, we need to open a Spark session with spark-shell and add the required dependencies. To add the Glow and Delta dependencies to the Spark session, go to the Spark installation directory and run the command below.
./bin/spark-shell --master local[20] \
--packages io.projectglow:glow-spark3_2.12:1.1.2,io.delta:delta-core_2.12:1.0.1
The Delta Lake version has to be compatible with your Spark version; the table below shows which Delta release matches which Spark release.
Delta Lake version | Apache Spark version
1.1.x | 3.2.x
1.0.x | 3.1.x
0.7.x and 0.8.x | 3.0.x
Below 0.7.0 | 2.4.2 - 2.4.<latest>
Source: docs.delta.io
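If you are unsure which Spark build you have, you can check the version before picking the matching Delta release, either from the command line:
./bin/spark-shell --version
or inside an already open shell session:
scala> spark.version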
Spark downloads the jar file dependencies into the .ivy2/jars directory in your home folder. If you do not want to download the dependencies again, it is enough to point Spark at that jars directory.
./bin/spark-shell --master local[20] --jars "~/.ivy2/jars/*"
Because some fields of the SNP data contain very large character content, it is better to give Spark extra off-heap memory. Go to the conf folder in the Spark directory and add the following to spark-defaults.conf:
spark.memory.offHeap.enabled true
spark.memory.offHeap.size 2048m
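If you prefer not to edit spark-defaults.conf, the same settings can also be passed as --conf flags when launching the shell, for example:
./bin/spark-shell --master local[20] \
--conf spark.memory.offHeap.enabled=true \
--conf spark.memory.offHeap.size=2048m \
--packages io.projectglow:glow-spark3_2.12:1.1.2,io.delta:delta-core_2.12:1.0.1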
After running spark-shell, we get a Scala REPL with a Spark session already created.
Although the Spark shell can be used to write Spark Scala API code, it lacks the user-friendliness of an IDE. It is much easier to write the code elsewhere and then paste it into the REPL. The :paste command in the REPL provides a quick way to paste multi-line code.
scala> :paste
// Entering paste mode (ctrl-D to finish)
To add the VCF reader format to Spark, import the Glow library and register it on the Spark session.
import io.projectglow.Glow
Glow.register(spark)
Below, the SNP data is read: the names array is exploded into an ID column, the genotypes column is dropped, and the remaining columns are renamed and selected.
import org.apache.spark.sql.functions.{col, explode}

val vcf_df = spark.read.format("vcf")
  .load("./data/00-All.vcf.gz")
  .withColumn("ID", explode(col("names")))   // one row per rsID
  .drop("names", "genotypes")
  .select(
    col("contigName").as("CH"),
    col("start").as("POS"),
    col("ID"),
    col("end").as("END"),
    col("referenceAllele").as("REF"),
    col("alternateAlleles").as("ALT")
  )
Spark transformations are lazy. To see results, we need an action, such as a simple show.
vcf_df.show
vcf_df.printSchema
Delta Lake is a storage format based on Apache Parquet that has been optimized for analytic workloads. To write the data in Delta format, use the code below.
vcf_df
  .write
  .mode("overwrite")
  .format("delta")
  .save("./data/SNP_GRCh38.delta")
Now the data is ready to play with!
val snp = spark
  .read
  .format("delta")
  .load("./data/SNP_GRCh38.delta")
snp.groupBy("CH").count.show(30)