genomicsUtils.Rmd
HLSGUtils
contains functions that run the objects of the GenomicsUtils Scala project. Each of these functions opens a Spark session and runs a Scala class with input arguments passed from the R function. First, we need to have Apache Spark on our machine. GenomicsUtils
is built against Spark 3.1.2 and Hadoop 2.7. The Spark downloads page lets you choose a version and download it.
On Linux, downloading and extracting the Spark files can be done with the following commands:
wget https://www.apache.org/dyn/closer.lua/spark/spark-3.1.3/spark-3.1.3-bin-hadoop2.7.tgz
tar xvf spark-3.1.3-bin-hadoop2.7.tgz
mv spark-3.1.3-bin-hadoop2.7 /opt/spark
Insert the following line into the ~/.bashrc
file to add the Spark bin directory to the PATH variable.
export PATH=$PATH:/opt/spark/bin
To activate the new setting, source the ~/.bashrc
file:
source ~/.bashrc
It is also necessary to define SPARK_HOME
as an environment variable:
export SPARK_HOME="/opt/spark"
echo $SPARK_HOME # check the result
The same environment variable can also be set and read from within R:
Sys.setenv(SPARK_HOME = "/opt/spark")
Sys.getenv("SPARK_HOME")
Run the following command to see if everything is working properly:
spark-shell
If Spark is active on your system, we can run the Spark-based HLSGUtils
functions. VCF2Delta
reads VCF files and converts them to Delta format:
library(HLSGUtils)
vcf_file <- system.file("data","1KG_SNP_chr19.vcf.gz", package = "HLSGUtils")
VCF2Delta(vcf_file, savePath = "~/Desktop/1KG_SNP_chr19.delta")
The sparklyr
package can read delta files, as shown in the code below.
library(sparklyr)
sc <- spark_connect(master = "local", packages = "io.delta:delta-core_2.12:1.0.1")
snp <- spark_read_delta(sc, "~/Desktop/1KG_SNP_chr19.delta")
head(snp)
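Because `snp` is a lazy Spark table, it can also be queried with dplyr verbs before any data is pulled into R. A minimal sketch, assuming the `contigName` column from the VCF schema shown in the output further below:

```r
library(dplyr)

# The count is computed by Spark; only the small summary is collected into R
snp %>%
  count(contigName) %>%
  collect()
```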
CSVJoinDelta
is another useful function in the GenomicsUtils
library. It is used to join CSV files with datasets that are stored in Delta format.
csv_path <- system.file("data","1KG_p_value.csv", package = "HLSGUtils")
CSVJoinDelta(csvPath = csv_path, deltaPath = "~/Desktop/1KG_SNP_chr19.delta",
byX = "ID", byY = "names",
savePath = "~/Desktop/1KG_SNP_chr19_pvalue.csv"
)
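Note that Spark writes the output as a directory of part files rather than a single CSV. A minimal sketch for locating them, assuming the savePath used above:

```r
# savePath is a directory; Spark splits the result across part-*.csv files
list.files("~/Desktop/1KG_SNP_chr19_pvalue.csv",
           pattern = "^part-.*\\.csv$", full.names = TRUE)
```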
The final result is:
library(readr)
library(dplyr) # for %>%
read_csv("~/Desktop/1KG_SNP_chr19_pvalue.csv/part-00000-2c471c7c-e42e-4943-8afc-637c18b6254c-c000.csv") %>%
  head()
#> Rows: 939 Columns: 9
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (3): ID, names, referenceAllele
#> dbl (5): p_value, contigName, start, end, qual
#> lgl (1): splitFromMultiAllelic
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 6 × 9
#> ID p_value contigName start end names referenceAllele qual
#> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl>
#> 1 rs114256401 0.299 19 275198 275199 rs114256401 C 100
#> 2 rs143060471 0.201 19 330327 330328 rs143060471 C 100
#> 3 rs139380406 0.277 19 340354 340355 rs139380406 T 100
#> 4 rs143987486 0.948 19 358004 358005 rs143987486 G 100
#> 5 rs4919869 0.702 19 450179 450180 rs4919869 G 100
#> 6 rs8104683 0.820 19 458934 458935 rs8104683 A 100
#> # … with 1 more variable: splitFromMultiAllelic <lgl>