This function lets the user select one or more statistical learning algorithms (penalized regression and generalized boosted regression). It is intended for survival data, and the methods can handle left-truncated survival data with a large number of predictors. The input data should be a data frame in which columns represent the variables of interest and rows correspond to patients. Each selected method returns a list whose length equals the number of cross-validation iterations performed; each element includes the predicted risk score for the patients that fell in the test set of that iteration.

OncoCast(
  data,
  family = "cox",
  formula,
  method = c("ENET"),
  runs = 100,
  cores = 1,
  pathResults = ".",
  studyType = "",
  save = T,
  nonPenCol = NULL,
  nTree = 500,
  interactions = c(1, 2),
  shrinkage = c(0.001, 0.01),
  min.node = c(10, 20),
  rf_gbm.save = F,
  out.ties = F,
  cv.folds = 5,
  rf.node = 5,
  mtry = floor(ncol(data)/3),
  replace = T,
  sample.fraction = 1,
  max.depth = NULL,
  epsilon.svm = seq(0, 0.2, 0.02),
  cost.svm = 2^(2:9),
  layers = NULL,
  norm.nn = T
)

Arguments

data

Data frame with variables as columns and patients as rows. It must have no missing data and should contain only the outcome and the predictors to be used. We recommend expressing time variables in months. Note also that these methods cannot handle categorical variables: binary dummy variables must be created before the data are passed in.
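As an illustration of the preparation described above, the following sketch expands a categorical column into binary dummy variables with base R. The data frame `patients` and its column names are hypothetical and not part of the package; this is one possible approach, not a required one.

```r
# Hypothetical toy data: `stage` is categorical and must be expanded before use.
patients <- data.frame(
  time   = c(12.5, 30.1, 7.8, 22.0),   # follow-up time, in months
  status = c(1, 0, 1, 0),
  age    = c(61, 54, 70, 48),
  stage  = factor(c("I", "II", "III", "II"))
)

# model.matrix() expands factors into 0/1 indicator columns;
# dropping the first column removes the intercept.
dummies <- model.matrix(~ stage, data = patients)[, -1, drop = FALSE]

# Replace the factor column with its binary indicators.
prepared <- cbind(patients[, setdiff(names(patients), "stage")], dummies)

# Check the two requirements stated above: no missing data, no categorical columns.
stopifnot(!anyNA(prepared), all(sapply(prepared, is.numeric)))
```

The reference level (`"I"` here) is absorbed into the baseline, so a factor with three levels yields two dummy columns.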

family

Name of the distribution of the outcome. Options are "cox", "binomial". Default is "cox".

formula

A formula with the names of the variables to be used in the data frame provided in the first argument, e.g. Surv(time,status)~. for cox, or y ~ . for binomial. (Note that all the variables available will be used regardless of the right-hand side of the formula.)

method

Character vector of the names of the method(s) to be used. Options are: 1) LASSO ("LASSO"), 2) Ridge ("RIDGE"), 3) Elastic Net ("ENET"), 4) Random Forest ("RF"), 5) Support vector machine ("SVM"), 6) Boosted Forest ("GBM"), 7) Neural network ("NN"). Note that GBM requires tuning in order to perform optimally; the arguments for that purpose are listed below. Default is ENET.

runs

Number of cross validation iterations to be performed. Default is 100. We highly recommend performing at least 50.

cores

To run this function in parallel, set the number of cores to a value greater than 1. Default is 1. CAUTION: overloading your computer can lead to serious issues; check how many cores are available on your machine before selecting a value!
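One way to check the available cores before choosing a value is `detectCores()` from the `parallel` package that ships with R. The sketch below is only a suggestion; the variable names and the "leave one core free" rule are assumptions, not package requirements.

```r
library(parallel)

# detectCores() may return NA on platforms where the count is unknown.
available <- detectCores()

# Leave one core free for the rest of the system; fall back to 1 if unknown.
n_cores <- if (is.na(available)) 1 else max(1, available - 1)
n_cores
```

The resulting `n_cores` could then be passed as the `cores` argument.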

pathResults

String giving the directory where the results should be written. Note that paths are relative in this context. Default is the current directory.

studyType

String that will be used as the prefix of the names of the output files. Default is empty.

save

Boolean value. If TRUE (the default), the results are saved with the specified name in the specified path. If FALSE, the results are returned directly from the function and are not saved, so be sure to assign them to an object in your environment.

nonPenCol

Names of the variables you do not wish to penalize (available only for the LASSO, RIDGE and ENET methods). Default is NULL.

nTree

Argument required for RF and GBM. Designates the number of trees to be built in each forest. Default is 500.

interactions

For GBM only. Integer specifying the maximum depth of each tree (i.e., the highest level of variable interactions allowed). A value of 1 implies an additive model, a value of 2 implies a model with up to 2-way interactions, etc. Default is 1 and 2 (as a vector c(1,2)).

shrinkage

For GBM only. A shrinkage parameter applied to each tree in the expansion. Also known as the learning rate or step-size reduction; values from 0.001 to 0.1 usually work, but a smaller learning rate typically requires more trees. Default is c(0.001,0.01).

min.node

For GBM only. Integer specifying the minimum number of observations in the terminal nodes of the trees. Note that this is the actual number of observations, not the total weight. Beware that nodes must be small if n is small. Default is c(10,20).
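To get a feel for how many GBM settings the default vectors imply, the sketch below enumerates every combination of the three tuning arguments. It assumes the candidate values are combined as a full grid; this page does not say how the function actually searches them, so treat the count as an upper bound under that assumption. The name `gbm_grid` is illustrative.

```r
# Every combination of the default GBM tuning values (assumed full grid).
gbm_grid <- expand.grid(
  interactions = c(1, 2),
  shrinkage    = c(0.001, 0.01),
  min.node     = c(10, 20)
)

# Number of candidate settings: 2 * 2 * 2
nrow(gbm_grid)
```

Under this assumption, each extra value added to any of the three vectors multiplies the number of settings, and therefore the running time, accordingly.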

rf_gbm.save

For RF, GBM, SVM and NN only. In order to perform proper validation we must save the entire fit. This will require more memory space to save. If you plan to perform validation set to TRUE. Default is FALSE.

out.ties

phcpe argument to calculate the concordance index. If out.ties is set to FALSE, pairs of observations tied on covariates will be used to calculate the CPE. Otherwise, they will not be used.

cv.folds

Number of internal cross-validations to be performed in GBM.

rf.node

The minimal size of terminal nodes for random forest. Default is 5 (recommended for regression trees).

mtry

Number of features to include in each tree (for random forest). Default is 1/3 of all features.

replace

Boolean to sample with or without replacement for the samples (for random forest). Default is T.

sample.fraction

Fraction of samples to be used in random forest (default is 1).

max.depth

Maximum depth of the trees to be grown. Default is NULL, in which case trees are grown as deep as possible.

epsilon.svm

Sequence of epsilon values for SVM. Default is seq(0,0.2,0.02).

cost.svm

Sequence of cost values for SVM. Default is 2^(2:9).

layers

A vector input for the neural network method, each entry giving the number of nodes to be used in the corresponding hidden layer. The number of hidden layers is determined by the length of the vector. Default is NULL, in which case a single hidden layer is used with a range of neuron numbers (recommended).

norm.nn

Boolean value deciding whether the data for the neural network should be normalized. Default is TRUE (recommended).

Value

CI : For each iteration the concordance index of the generated model will be calculated on the testing set

fit : For the LASSO, RIDGE and ENET methods, the refitted Cox proportional hazards model with the beta coefficients found in the penalized regression model.

The formula used to perform the analysis.

predicted : The predicted risk for all the patients in the testing set.

means : The mean value of the predictors that were not shrunken to zero in the penalized regression method.

method : The name of the method that was used to generate the output.

data : The data used to fit the model (available only in the first element of the list).
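Since the returned object is a list with one element per cross-validation iteration, it can be summarized with ordinary list tools. The sketch below uses a mock object rather than real output: `mock_results`, the patient names, and the assumption that `predicted` is a named numeric vector are all illustrative, and the real structure may differ in detail.

```r
# Mock of the documented return structure: one list element per run,
# each holding a concordance index (CI) and test-set predictions.
set.seed(1)
mock_results <- lapply(1:5, function(i) {
  list(
    CI        = runif(1, 0.6, 0.8),
    predicted = setNames(rnorm(3), sample(paste0("patient", 1:10), 3)),
    method    = "LASSO"
  )
})

# Concordance index of each run, then its distribution across runs.
ci <- vapply(mock_results, function(run) run$CI, numeric(1))
summary(ci)

# Average predicted risk per patient over the runs in which that
# patient fell in the test set.
all_pred  <- unlist(lapply(mock_results, function(run) run$predicted))
mean_risk <- tapply(all_pred, names(all_pred), mean)
```

Averaging per patient in this way only covers patients who appeared in at least one test set, which is one reason to use a sufficiently large number of runs.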

Examples

library(OncoCast)
test <- OncoCast(data=survData[1:100,],formula = Surv(time,status)~., method = "LASSO",runs = 30, save = FALSE,nonPenCol = NULL,cores =2)
#> Warning: We do not recommend using a number of cross-validation/bootstraps lower than 50.
#> [1] "Data check performed, ready for analysis."
#> [1] "LASSO SELECTED"