In my work, I usually deal with datasets of products from different customers across different marketplaces. Each product has its own time series dataset. None of these datasets is large, but we have millions of them. Until we find a convincing reason to combine data from different products, we treat them all as independent datasets. And since they are independent, parallel computing is a natural way to push our machine to its limit and make the code run efficiently. This post is a lite version of how I do parallel computing in R.
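To make the setup concrete, here is a minimal sketch (with made-up names such as `sales` and `product_id`, not data from my work) of how a combined table of records can be split into independent per-product time series, which is the kind of list a parallel loop would then iterate over.

```r
# Hypothetical example: split one combined table into independent
# per-product time series (the units each worker would process).
set.seed(1)
sales = data.frame(
  product_id = rep(c("A", "B", "C"), each = 12),
  month      = rep(1:12, times = 3),
  units      = rpois(36, lambda = 20)
)
per_product = split(sales, sales$product_id)  # list of independent datasets
length(per_product)                           # one element per product
```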
Run in Parallel
```r
#### Import Libraries ####
.packages = c("foreach", "doParallel", "doSNOW")
.inst <- .packages %in% installed.packages()
if (length(.packages[!.inst]) > 0)
  install.packages(.packages[!.inst], repos = "http://cran.us.r-project.org")
notshow = lapply(.packages, require, character.only = TRUE)

#### Define the function ####
CreateDataFrame = function(value, times) {
  d = rep(value, times = times)
  return(d)
}

#### Run in Parallel ####
num_core = detectCores() - 2              # detect the number of CPU cores
cl = makeCluster(num_core, outfile = "")  # create the cluster
registerDoSNOW(cl)

# print the progress of every iteration
progress <- function(n) cat(sprintf("task %d is complete\n", n))
opts <- list(progress = progress)

start.time = proc.time()                  # start the execution timer
output_par = foreach(i = 1:5, .options.snow = opts,
                     .errorhandling = 'pass') %dopar% {  # the default .combine = list
  out = CreateDataFrame(c(1:i), 3)
  out
}
```
```
## task 1 is complete
## task 2 is complete
## task 3 is complete
## task 4 is complete
## task 5 is complete
```
```r
stopCluster(cl)                        # stop the cluster in the end
(end.time = proc.time() - start.time)  # total execution time
```
```
##    user  system elapsed 
##   0.037   0.003   0.058
```
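For comparison, the same loop can be run sequentially by swapping `%dopar%` for `%do%`. This is only a sketch for checking results and timing on your own machine; it was not part of the original run, and the numbers will differ from the ones above.

```r
# Sequential version of the same loop (no cluster needed)
start.time = proc.time()
output_seq = foreach(i = 1:5) %do% {
  CreateDataFrame(c(1:i), 3)
}
proc.time() - start.time  # compare with the parallel timing
```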
```r
output_par
```
```
## [[1]]
## [1] 1 1 1
## 
## [[2]]
## [1] 1 2 1 2 1 2
## 
## [[3]]
## [1] 1 2 3 1 2 3 1 2 3
## 
## [[4]]
## [1] 1 2 3 4 1 2 3 4 1 2 3 4
## 
## [[5]]
## [1] 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
```
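As the comment inside the loop notes, `.combine` defaults to collecting results in a list. If you would rather have a flat vector or a stacked data frame, `.combine` can be set explicitly. The variants below are illustrative sketches, not part of the run above.

```r
# Flatten results into a single vector
output_vec = foreach(i = 1:5, .combine = "c") %do% {
  CreateDataFrame(c(1:i), 3)
}

# Stack per-iteration data frames row-wise
output_df = foreach(i = 1:5, .combine = rbind) %do% {
  data.frame(i = i, value = CreateDataFrame(c(1:i), 3))
}
```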