My goal is to generate a large sparse 0/1 matrix in which the vast majority (~99%) of entries are zero. Ideally, it would have 10,000 rows and 10,000,000 columns. Each column is generated as a sequence of Bernoulli samples with a **column-specific** probability. So far, I have implemented three ways to generate the data:

**Function 1**

Creating a basic dense 0/1 matrix:

```r
spMat_dense <- function(ncols, nrows, col_probs){
  # col_probs recycles across each row of draws; with byrow = TRUE,
  # probability j therefore applies to column j
  matrix(rbinom(nrows * ncols, 1, col_probs),
         ncol = ncols, byrow = TRUE)
}
```

**Function 2**

Using `Rcpp`:

```cpp
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]
using namespace Rcpp;

// [[Rcpp::export]]
arma::sp_mat spMat_cpp(const int& ncols, const int& nrows,
                       const NumericVector& col_probs){
  IntegerVector binom_draws = no_init(nrows);
  IntegerVector row_pos;
  IntegerVector col_pos;
  int nz_counter = 0;

  // Generate (row, col) coordinates of the non-zero values
  for(int j = 0; j < ncols; ++j){
    binom_draws = rbinom(nrows, 1, col_probs[j]);
    for(int i = 0; i < nrows; ++i){
      if(binom_draws[i] == 1){
        row_pos.push_back(i);
        col_pos.push_back(j);
        nz_counter += 1;
      }
    }
  }

  // 2 x N matrix of row/col positions for the N non-zero entries
  arma::umat loc_mat(2, nz_counter);
  for(int i = 0; i < nz_counter; ++i){
    loc_mat(0, i) = row_pos[i];
    loc_mat(1, i) = col_pos[i];
  }

  IntegerVector x_tmp = rep(1, nz_counter);
  arma::colvec x = Rcpp::as<arma::colvec>(x_tmp);

  // Batch sparse-matrix constructor; pass the dimensions explicitly
  // so trailing all-zero rows/columns are not silently dropped
  arma::sp_mat out(loc_mat, x, nrows, ncols);
  return out;
}
```
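One thing I suspect (but have not confirmed) is the cost of `push_back` on `Rcpp::IntegerVector`, which as far as I understand copies the whole vector on every call. For comparison, the same triplet collection can be done with plain `std::vector` buffers that grow by amortized doubling. A minimal standalone C++ sketch of that pattern, independent of R (the `Triplets` struct, function name, and RNG seeding are just for illustration):

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Row/column indices of the non-zero entries (all values are 1)
struct Triplets {
    std::vector<std::size_t> rows, cols;
};

// Collect Bernoulli draws column by column, as in spMat_cpp,
// but into std::vector buffers with reserved capacity.
Triplets bernoulli_triplets(std::size_t ncols, std::size_t nrows,
                            const std::vector<double>& col_probs,
                            unsigned seed) {
    std::mt19937 gen(seed);
    Triplets t;

    // Reserve roughly the expected number of non-zeros up front
    // to avoid repeated reallocation
    double expected = 0.0;
    for (double p : col_probs) expected += p * static_cast<double>(nrows);
    t.rows.reserve(static_cast<std::size_t>(1.5 * expected) + 16);
    t.cols.reserve(t.rows.capacity());

    for (std::size_t j = 0; j < ncols; ++j) {
        std::bernoulli_distribution draw(col_probs[j]);
        for (std::size_t i = 0; i < nrows; ++i) {
            if (draw(gen)) {           // non-zero entry at (i, j)
                t.rows.push_back(i);   // amortized O(1) append
                t.cols.push_back(j);
            }
        }
    }
    return t;
}
```

In the Rcpp version, the equivalent change would be collecting into `std::vector<arma::uword>` buffers and copying into `loc_mat` once at the end, rather than appending to `IntegerVector`s element by element.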

**Function 3**

Using `dgCMatrix` construction from the `Matrix` package:

```r
spMat_dgC <- function(ncols, nrows, col_probs){
  # Credit to Andrew Guster (https://stackoverflow.com/a/56348978/4321711)
  require(Matrix)
  mat <- Matrix(0, nrows, ncols, sparse = TRUE)  # blank matrix for template
  i <- vector(mode = "list", length = ncols)     # each element of i contains the '1' rows
  p <- rep(0, ncols)                             # p will be cumsum no of 1s by column
  for(r in 1:nrows){
    row <- rbinom(ncols, 1, col_probs)           # random row
    p <- p + row                                 # add to column identifier
    if(any(row == 1)){
      for(j in which(row == 1)){
        i[[j]] <- c(i[[j]], r - 1)               # append row identifier
      }
    }
  }
  p <- c(0, cumsum(p))                           # this is the format required
  i <- unlist(i)
  x <- rep(1, length(i))
  mat@i <- as.integer(i)
  mat@p <- as.integer(p)
  mat@x <- x
  return(mat)
}
```
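For context, the `i`/`p`/`x` slots being filled here are the standard compressed-sparse-column (CSC) layout: `i` holds the 0-based row indices of all non-zeros in column-major order, `x` the values, and `p[j]..p[j+1]` delimits column `j`'s entries. A small standalone C++ sketch of the column-pointer bookkeeping (the function name is made up for illustration):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Build the CSC column-pointer array p from per-column non-zero counts,
// mirroring the p <- c(0, cumsum(p)) step in spMat_dgC.
std::vector<std::size_t> csc_col_pointers(const std::vector<std::size_t>& counts) {
    std::vector<std::size_t> p(counts.size() + 1, 0);
    for (std::size_t j = 0; j < counts.size(); ++j)
        p[j + 1] = p[j] + counts[j];  // running total of non-zeros
    return p;
}
```

Column `j`'s row indices then live in `i[p[j] .. p[j+1]-1]`, which is exactly the slice `dgCMatrix` methods read.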

**Benchmarking**

```r
ncols <- 100000
nrows <- 1000
col_probs <- runif(ncols, 0.001, 0.002)

microbenchmark::microbenchmark(
  spMat_dense(ncols = ncols, nrows = nrows, col_probs = col_probs),
  spMat_cpp(ncols = ncols, nrows = nrows, col_probs = col_probs),
  spMat_dgC(ncols = ncols, nrows = nrows, col_probs = col_probs),
  times = 5L)

Unit: seconds
                                                             expr       min        lq      mean    median        uq       max neval
 spMat_dense(ncols = ncols, nrows = nrows, col_probs = col_probs)  6.527836  6.673515  7.260482  7.132410  7.813596  8.155053     5
   spMat_cpp(ncols = ncols, nrows = nrows, col_probs = col_probs) 56.726238 57.038976 57.841693 57.244350 58.325564 59.873333     5
   spMat_dgC(ncols = ncols, nrows = nrows, col_probs = col_probs)  6.541939  6.599228  6.938952  6.624520  7.402208  7.526867     5
```

Interestingly, my `Rcpp` code is not as fast as I thought it would be, and I'm not entirely sure why it is less efficient than the basic dense construction. The advantage of the `Rcpp` and `dgCMatrix` constructions, however, is that they never create a dense matrix first, so they use far less memory:

```r
ncols <- 100000
nrows <- 1000
col_probs <- runif(ncols, 0.001, 0.002)

mat1 <- spMat_dense(ncols = ncols, nrows = nrows, col_probs = col_probs)
mat2 <- spMat_cpp(ncols = ncols, nrows = nrows, col_probs = col_probs)
mat3 <- spMat_dgC(ncols = ncols, nrows = nrows, col_probs = col_probs)

> object.size(mat1)
400000216 bytes
> object.size(mat2)
2199728 bytes
> object.size(mat3)
2205920 bytes
```
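The dense figure is easy to sanity-check: R stores the 0/1 entries as 4-byte integers, so the 1000 × 100000 matrix needs 4 × 10⁸ bytes, matching the ~400 MB above. The sparse objects pay roughly one 4-byte row index plus one 8-byte double per non-zero, plus a column-pointer array; with probabilities uniform on [0.001, 0.002], the expected non-zero count is about 0.0015 × 10⁸ = 150,000. A back-of-the-envelope sketch of that arithmetic (the byte-count formulas are my own approximation, not exact `object.size` accounting):

```cpp
#include <cassert>

// Bytes for an nrows x ncols dense integer matrix (4 bytes per cell)
long long dense_matrix_bytes(long long nrows, long long ncols) {
    return nrows * ncols * 4;
}

// Approximate bytes for a dgCMatrix-style CSC matrix with nnz non-zeros:
// a 4-byte int row index plus an 8-byte double value per non-zero,
// plus an (ncols + 1)-entry 4-byte column-pointer array
long long sparse_matrix_bytes(long long nnz, long long ncols) {
    return nnz * (4 + 8) + (ncols + 1) * 4;
}
```

Plugging in nnz ≈ 150,000 and ncols = 100,000 gives about 2.2 MB, in line with the `object.size()` values above.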

**Question**

What is it about my `Rcpp` code that makes it slower than the other two? Can it be optimized, or is the well-written R code using `dgCMatrix` as good as it gets?