HLSGUtils contains functions that run the classes of the GenomicsUtils Scala project. Each of these functions opens a Spark session and runs a Scala class with input arguments passed from the R function. First, we need Apache Spark on our machine. GenomicsUtils is built against Spark 3.1.2 and Hadoop 2.7. This page helps you choose your version and download the source.

On Linux, Spark can be downloaded and extracted with the following commands:

wget https://www.apache.org/dyn/closer.lua/spark/spark-3.1.3/spark-3.1.3-bin-hadoop2.7.tgz
tar xvf spark-3.1.3-bin-hadoop2.7.tgz
mv spark-3.1.3-bin-hadoop2.7 /opt/spark 

Insert the following line into the ~/.bashrc file to add the Spark binaries to the PATH variable.

export PATH=$PATH:/opt/spark/bin

To activate the new setting, source the ~/.bashrc file:

source ~/.bashrc
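Note that ~/.bashrc is sourced by every new shell, so the export line above appends /opt/spark/bin to PATH again on each login. A small guard keeps PATH free of duplicates; this is a minimal POSIX-shell sketch, not part of HLSGUtils:

```shell
# Add /opt/spark/bin to PATH only if it is not already there
SPARK_BIN=/opt/spark/bin
case ":$PATH:" in
  *":$SPARK_BIN:"*) ;;                       # already present: nothing to do
  *) PATH="$PATH:$SPARK_BIN"; export PATH ;; # append once
esac
echo "$PATH"
```

The `case ":$PATH:"` trick matches the entry with delimiters on both sides, so a partial match such as /opt/spark/bin2 does not count as present.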

It is also necessary to define SPARK_HOME as an environment variable:

export SPARK_HOME="/opt/spark"
echo $SPARK_HOME # check the result
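An export issued in a terminal lasts only for that session. To make SPARK_HOME persist, the export line can be appended to ~/.bashrc as well, guarded so it is only added once. The sketch below runs against a scratch file created with mktemp so it is safe to try; swap in ~/.bashrc for real use:

```shell
# Append the SPARK_HOME export to a shell rc file exactly once
RC=$(mktemp)                          # stand-in for ~/.bashrc in this sketch
LINE='export SPARK_HOME="/opt/spark"'
grep -qxF "$LINE" "$RC" || echo "$LINE" >> "$RC"
grep -qxF "$LINE" "$RC" || echo "$LINE" >> "$RC"  # second run adds nothing
cat "$RC"
```

`grep -qxF` matches the whole line literally, so rerunning the setup never duplicates the entry.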

The same environment variable can also be set and read from within R:

Sys.setenv(SPARK_HOME = "/opt/spark")
Sys.getenv("SPARK_HOME")

Run the following command to see if everything is working properly:

spark-shell
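spark-shell drops you into an interactive REPL. For a quick non-interactive check, it can also just report its version and exit; the sketch below degrades gracefully when the binary is not on PATH:

```shell
# Print the Spark version without entering the interactive REPL
if command -v spark-shell >/dev/null 2>&1; then
  spark-shell --version
else
  echo "spark-shell not found on PATH"
fi
```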

If Spark is active on your system, we can run the Spark-based HLSGUtils functions. VCF2Delta reads a VCF file and converts it to Delta format.

library(HLSGUtils)
vcf_file <- system.file("data","1KG_SNP_chr19.vcf.gz", package = "HLSGUtils")
VCF2Delta(vcf_file, savePath = "~/Desktop/1KG_SNP_chr19.delta")
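A Delta table is a directory rather than a single file: it holds Parquet part files plus a _delta_log subdirectory of JSON commit logs. The sketch below mocks that layout with empty files purely to illustrate what to expect at savePath; the real contents are written by Spark:

```shell
# Mock of the on-disk layout a Delta write produces (illustrative only)
D=$(mktemp -d)/1KG_SNP_chr19.delta
mkdir -p "$D/_delta_log"
touch "$D/part-00000-example.snappy.parquet"          # data file (Parquet)
touch "$D/_delta_log/00000000000000000000.json"       # first commit log
find "$D" -type f
```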

The sparklyr package can read delta files, as shown in the code below.

library(sparklyr)
sc <- spark_connect(master = "local", packages = "io.delta:delta-core_2.12:1.0.1")
snp <- spark_read_delta(sc, "~/Desktop/1KG_SNP_chr19.delta")
head(snp)

CSVJoinDelta is another useful function in the GenomicsUtils library. It joins CSV files with datasets stored in Delta format.

csv_path <- system.file("data","1KG_p_value.csv", package = "HLSGUtils")
CSVJoinDelta(csvPath = csv_path, deltaPath = "~/Desktop/1KG_SNP_chr19.delta",
             byX = "ID", byY = "names", 
             savePath = "~/Desktop/1KG_SNP_chr19_pvalue.csv"
             )
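As a rough illustration of the key-based join CSVJoinDelta performs (matching the byX column of the CSV against the byY column of the Delta table), POSIX join over two toy CSV files behaves similarly. The file names and rows below are hypothetical example data, not package output:

```shell
# Toy stand-in for the CSVJoinDelta key join: match column 1 ("ID") of the
# p-value file against column 1 ("names") of the variant file.
# join requires both inputs sorted on the join field, which these are.
printf 'rs114256401,0.299\nrs143060471,0.201\n'        > /tmp/pvalues.csv
printf 'rs114256401,19,275198\nrs143060471,19,330327\n' > /tmp/snps.csv
join -t, -1 1 -2 1 /tmp/pvalues.csv /tmp/snps.csv
# -> rs114256401,0.299,19,275198
# -> rs143060471,0.201,19,330327
```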

The final result is:

library(readr)    # read_csv()
library(magrittr) # %>% pipe

read_csv("~/Desktop/1KG_SNP_chr19_pvalue.csv/part-00000-2c471c7c-e42e-4943-8afc-637c18b6254c-c000.csv") %>% 
  head()
#> Rows: 939 Columns: 9
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (3): ID, names, referenceAllele
#> dbl (5): p_value, contigName, start, end, qual
#> lgl (1): splitFromMultiAllelic
#> 
#>  Use `spec()` to retrieve the full column specification for this data.
#>  Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 6 × 9
#>   ID          p_value contigName  start    end names       referenceAllele  qual
#>   <chr>         <dbl>      <dbl>  <dbl>  <dbl> <chr>       <chr>           <dbl>
#> 1 rs114256401   0.299         19 275198 275199 rs114256401 C                 100
#> 2 rs143060471   0.201         19 330327 330328 rs143060471 C                 100
#> 3 rs139380406   0.277         19 340354 340355 rs139380406 T                 100
#> 4 rs143987486   0.948         19 358004 358005 rs143987486 G                 100
#> 5 rs4919869     0.702         19 450179 450180 rs4919869   G                 100
#> 6 rs8104683     0.820         19 458934 458935 rs8104683   A                 100
#> # … with 1 more variable: splitFromMultiAllelic <lgl>