In my work, I usually deal with datasets of products from different customers across different marketplaces. Each product has its own time series dataset. None of these datasets is big on its own, but we have millions of them. Until we find a convincing reason to combine data from different products, we treat them all as independent datasets. And since they are independent, parallel computing is a natural way to push our machine to its limit and make the code run efficiently. This post is a lite version of how I do parallel computing in R.

Details

  • References

    1. Getting Started with doParallel and foreach
    2. Using The foreach Package
    3. Monitoring progress of a foreach parallel job
    4. Package ‘foreach’
    5. Package ‘doParallel’
    6. Package ‘doSNOW’
  • Run in Parallel

    #### Import Libraries ####
    .packages = c("foreach", "doParallel", "doSNOW")
    .inst <- .packages %in% rownames(installed.packages()) # flag packages that are already installed
    if (length(.packages[!.inst]) > 0) install.packages(.packages[!.inst], repos = "http://cran.us.r-project.org")
    notshow = lapply(.packages, require, character.only = TRUE) # attach them all (assigning the result keeps require's TRUE/FALSE out of the output)
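    # foreach provides the looping construct, doSNOW the back end registered below;
    # detectCores() comes from the parallel package, attached along with doParallel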
    
    #### Define the function ####
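    # despite its name, this helper just replicates the input vector with rep(),
    # e.g. CreateDataFrame(1:2, 3) returns 1 2 1 2 1 2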
    CreateDataFrame = function(value, times){
      d = rep(value, times = times)
      return(d)
    }
    
    #### Run in Parallel ####
    num_core = detectCores() - 2 # use all but two of the available CPU cores
    cl = makeCluster(num_core, outfile = "") # create the cluster; outfile = "" sends worker output to the console
    registerDoSNOW(cl) # register the doSNOW parallel back end
    
    # report progress as each task completes (.options.snow is honored by the doSNOW back end)
    progress <- function(n) cat(sprintf("task %d is complete\n", n))
    opts <- list(progress = progress)
    
    start.time = proc.time() # start the timer for the execution time
    # the default .combine = list; .errorhandling = 'pass' returns any error as that task's result
    output_par = 
      foreach(i = 1:5, .options.snow = opts, .errorhandling = 'pass') %dopar% 
      {
        out = CreateDataFrame(1:i, 3)
        out
      }
    ## task 1 is complete
    ## task 2 is complete
    ## task 3 is complete
    ## task 4 is complete
    ## task 5 is complete
    stopCluster(cl) # stop the cluster when done
    (end.time = proc.time() - start.time) # total execution time
    ##    user  system elapsed 
    ##   0.037   0.003   0.058
    output_par
    ## [[1]]
    ## [1] 1 1 1
    ## 
    ## [[2]]
    ## [1] 1 2 1 2 1 2
    ## 
    ## [[3]]
    ## [1] 1 2 3 1 2 3 1 2 3
    ## 
    ## [[4]]
    ##  [1] 1 2 3 4 1 2 3 4 1 2 3 4
    ## 
    ## [[5]]
    ##  [1] 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
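
  • Flatten the Results

    The default .combine = list is why output_par comes back as a list of five vectors. Below is a minimal sketch, assuming the same packages as above, of passing .combine = 'c' so foreach returns one flattened vector instead; swapping %dopar% for %do% would run the same loop sequentially.

    #### Flatten with .combine ####
    cl = makeCluster(2) # a small two-worker cluster is enough here
    registerDoSNOW(cl)
    output_vec = foreach(i = 1:3, .combine = 'c') %dopar% rep(i, times = 2)
    stopCluster(cl)
    output_vec # a single vector: 1 1 2 2 3 3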