Suppose we want to fit several different linear models to a dataset. We can write this as a for loop in R, but because many R functions use only one core, the whole loop runs on a single core and is time-consuming. Another solution is to use a multiprocessing package in R, such as parallel. These packages are very good, but execution time does not decrease linearly as we add computation cores. A third option is to run the code in parallel in separate R sessions manually, as the Jobs option in RStudio does. At first glance, this idea is hard to implement because we need several R scripts and have to run them simultaneously. The HLSGUtils package provides functions that simplify this work, with dynamic management of system resources through memory and thread options.
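To make the baseline concrete, here is a minimal sketch of the single-session approaches mentioned above: a plain for loop, then the same work through parallel::mclapply. The simulated data, the formulas vector, and the fit_one helper are illustrative assumptions and are not part of HLSGUtils.

```r
# Illustrative data; the column names mirror the example used later in this text.
set.seed(1)
n <- 200
df <- data.frame(y = rnorm(n, mean = 3), x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
formulas <- c("y~x1", "y~x1+x2", "y~x1+x2+x3")

# Hypothetical helper: fit one model and return its coefficients.
fit_one <- function(f) coef(lm(as.formula(f), data = df))

# 1) Sequential baseline: one model after another in a single R session.
seq_fits <- vector("list", length(formulas))
for (i in seq_along(formulas)) {
  seq_fits[[i]] <- fit_one(formulas[i])
}

# 2) The same work with the parallel package. mclapply forks the session,
#    which is not supported on Windows, so fall back to one core there.
cores <- if (.Platform$OS.type == "windows") 1L else 2L
par_fits <- parallel::mclapply(formulas, fit_one, mc.cores = cores)

# Both approaches should produce the same coefficients.
all.equal(seq_fits, par_fits)
```

Both variants live inside one R session (plus forked workers in the second case), which is the setup the separate-session approach below is meant to replace.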
We will describe parallelization in HLSGUtils step by step. First, we need a base script that we want to run concurrently. A simple example is shown below.
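# Simulate n observations, fit the model given by `formula`,
# and save the tidy coefficient table to disk.
linear_fitter <- function(formula, n){
  df <- data.frame(
    y = rnorm(n, mean = 3),
    x1 = rnorm(n),
    x2 = rnorm(n),
    x3 = rnorm(n)
  )
  fit <- lm(as.formula(formula), data = df)
  # write_rds() comes from readr, so readr must be loaded before calling this function
  write_rds(broom::tidy(fit), sprintf("~/Desktop/fittel_lm_%s.rds", n))
  print(sprintf("%s modeling was done with %s samples!", formula, n))
}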
Save the script to a path such as ~/Desktop/modeling.R. As written, linear_fitter can only be called from within an R session, so we need to convert it into a form that can be run from the command line. function_to_Rscript converts a function to command-line format. It needs:
function_from_source: The path of the saved R function
function_name: The name of the function in the source file
packages: The packages that need to be loaded
arguments: The names of the function arguments
arguments_class: The class types of the function arguments
script_save_path: The path of the generated R script
library(HLSGUtils)
function_to_Rscript(
  function_from_source = "~/Desktop/modeling.R",
  function_name = "linear_fitter",
  packages = c("readr", "broom"),
  arguments = c("formula", "n"),
  arguments_class = c("character", "integer"),
  script_save_path = "~/Desktop/modeling_r_script.R"
)
The resulting script is ready to run from the command line. The converted code is shown below.
############################################################
# linear_fitter #
############################################################
args <- commandArgs(trailingOnly = TRUE)
if (length(args) < 2){stop('I think you forgot your parameters')}
formula <- args[1]
n <- as.integer(args[2])
flush.console()
# Load Libraries
suppressMessages(library(readr))
suppressMessages(library(broom))
source("~/Desktop/modeling.R")
# Add Function Its Arguments
linear_fitter(
formula = formula,
n = n
)
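Command-line arguments always arrive as character strings, which is why the generated script converts n with as.integer. The sketch below illustrates that coercion step in general form. coerce_args is a hypothetical helper written for this example and is not a function of HLSGUtils; the raw vector stands in for what commandArgs(trailingOnly = TRUE) would return.

```r
# Hypothetical helper (not part of HLSGUtils): coerce raw command-line
# strings to the classes declared for each argument.
coerce_args <- function(raw, arg_names, arg_classes) {
  if (length(raw) < length(arg_names)) {
    stop("Expected ", length(arg_names), " arguments, got ", length(raw))
  }
  values <- Map(function(value, cls) methods::as(value, cls),
                raw[seq_along(arg_names)], arg_classes)
  names(values) <- arg_names
  values
}

# Simulate what commandArgs(trailingOnly = TRUE) would return for
#   Rscript --vanilla modeling_r_script.R y~x1 100
raw <- c("y~x1", "100")
parsed <- coerce_args(raw, c("formula", "n"), c("character", "integer"))
str(parsed)
```

After coercion, parsed$formula is still a character string and parsed$n is an integer, matching the classes passed as arguments_class.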
modeling_r_script.R can be run from the command line with the Rscript --vanilla command, followed by the input arguments.
Rscript --vanilla ~/Desktop/modeling_r_script.R y~x1 100
## [1] "y~x1 modeling was done with 100 samples!"
After running the command, the result table is:
readr::read_rds("~/Desktop/fittel_lm_100.rds")
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 3.08 0.100 30.7 4.73e-52
## 2 x1 -0.0125 0.0941 -0.133 8.95e- 1
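An Rscript call like the one shown earlier can also be issued from inside an R session with base R's system2, which is one way to start a script in a separate R process. The sketch below is only an illustration: it uses a small stand-in script written to a temporary file instead of modeling_r_script.R, so it does not depend on readr, broom, or the Desktop paths.

```r
# Write a tiny stand-in script to a temporary file (illustrative only).
script <- tempfile(fileext = ".R")
writeLines(c(
  "args <- commandArgs(trailingOnly = TRUE)",
  "cat(sprintf('formula=%s n=%d\\n', args[1], as.integer(args[2])))"
), script)

# Locate the Rscript binary that belongs to the running R installation.
rscript <- file.path(R.home("bin"), "Rscript")

# Run it the same way as on the command line and capture standard output.
out <- system2(rscript,
               c("--vanilla", shQuote(script), shQuote("y~x1"), "100"),
               stdout = TRUE)
print(out)
```

With stdout = TRUE the call waits for the child process and returns its output as a character vector; passing wait = FALSE instead lets the calling session continue while the script runs.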
Finally, we want to run multiple models in parallel. parallel_rscripts runs command-line R scripts in parallel. It takes the input arguments and the thresholds used for system resource management:
rscript_path: The path of the command-line format script
args: A list of function arguments
used_memory_treshold: The threshold for the total percentage of system memory that is in use
used_cpu_treshold: The threshold for the total percentage of threads that are in use
sleep_time: The sleep time between two jobs, in seconds
sample_size = c(100, 200, 300)
formulas = c("y~x1", "y~x1+x2", "y~x1+x2+x3")
parallel_rscripts(
  rscript_path = "~/Desktop/modeling_r_script.R",
  args = list(formula = formulas, n = sample_size),
  used_memory_treshold = 80,
  used_cpu_treshold = 80,
  sleep_time = 5
)
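For intuition, the general idea of launching several Rscript processes side by side can be sketched with base R alone. This is not the implementation of parallel_rscripts. build_jobs and launch_jobs are hypothetical helpers, the element-wise pairing of argument values is an assumption of this sketch, and no memory or CPU threshold check is performed.

```r
# Hypothetical helper: pair the i-th value of every argument into one job.
# Element-wise pairing is an assumption of this sketch.
build_jobs <- function(script, args) {
  lapply(seq_along(args[[1]]), function(i) {
    c("--vanilla", shQuote(path.expand(script)),
      vapply(args, function(a) shQuote(as.character(a[i])), character(1),
             USE.NAMES = FALSE))
  })
}

# Hypothetical launcher: start each job without waiting for it to finish,
# pausing sleep_time seconds between launches.
launch_jobs <- function(jobs, sleep_time = 5) {
  rscript <- file.path(R.home("bin"), "Rscript")
  for (job in jobs) {
    system2(rscript, job, wait = FALSE)
    Sys.sleep(sleep_time)
  }
  invisible(length(jobs))
}

# Build the three command lines for the example above (nothing is launched here).
jobs <- build_jobs(
  "~/Desktop/modeling_r_script.R",
  list(formula = c("y~x1", "y~x1+x2", "y~x1+x2+x3"), n = c(100, 200, 300))
)
length(jobs)
```

Calling launch_jobs(jobs) would then start the three scripts as independent background processes. A resource-aware launcher would additionally check memory and CPU usage before each launch, which this sketch leaves out.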