library(OncoCast)
library(dplyr)
library(knitr)

Data format

OncoCast() expects as input a dataframe with samples (or patients) as rows and features of interest as columns. Each row name must be a sample ID that is unique to the observation of interest. Similarly, each column name must be unique to the feature it represents.
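
The layout described above can be prepared and checked before calling OncoCast(). The following is a minimal sketch in base R, assuming the sample IDs start out in an ordinary column; the column name `id`, the object names `raw` and `input`, and the toy values are illustrative assumptions, not part of the package.

```r
# Toy data (made up for illustration): sample IDs are stored in a column.
raw <- data.frame(
  id      = c("Patient1", "Patient2", "Patient3"),
  time    = c(9.9, 32.9, 2.3),
  status  = c(0, 1, 1),
  ImpCov1 = c(0, 0, 0),
  stringsAsFactors = FALSE
)

# Sample IDs and feature names must both be unique.
stopifnot(anyDuplicated(raw$id) == 0)
stopifnot(anyDuplicated(colnames(raw)) == 0)

# Move the IDs into the row names and keep only outcome and feature columns.
input <- raw[, setdiff(colnames(raw), "id")]
rownames(input) <- raw$id

input
```

Checking the IDs before assigning them matters because R refuses to set duplicated row names on a dataframe, so an explicit check gives a clearer failure point.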

Several example datasets are included in the package and are loaded with it: survData, survData.LT, binData and normData. If you are unsure what the input data should look like, please refer to these tables. As the code below shows, they are stored as dataframes.

typeof(survData)
#> [1] "list"
is.data.frame(survData)
#> [1] TRUE

Outcome of interest

OncoCast() currently handles 3 common types of outcome:

  • Time to event data, the original target of the OncoCast package; most features of the software are designed with this outcome as the primary target.
kable(survData[1:5,1:5])
               time  status  ImpCov1  ImpCov2  ImpCov3
Patient1   9.933107       0        0        1        1
Patient2  32.944307       1        0        0        0
Patient3   2.264857       1        0        0        0
Patient4   3.658523       1        0        0        0
Patient5  12.197182       1        0        1        0
  • Binary data, for outcomes recorded as success or failure.
kable(binData[1:5,1:5])
          y       Covs1       Covs2       Covs3       Covs4
Patient1  1  -0.6264538   0.8936737  -0.3410670  -1.5414026
Patient2  0   0.1836433  -1.0472981   1.5024245   0.1943211
Patient3  1  -0.8356286   1.9713374   0.5283077   0.2644225
Patient4  1   1.5952808  -0.3836321   0.5421914  -1.1187352
Patient5  1   0.3295078   1.6541453  -0.1366734   0.6509530
  • Normal data, for outcomes that follow a Gaussian distribution.
kable(normData[1:5,1:5])
                  y       Covs1       Covs2       Covs3       Covs4
Patient1   5.760664  -0.6264538   0.8936737  -0.3410670  -1.5414026
Patient2  -5.447879   0.1836433  -1.0472981   1.5024245   0.1943211
Patient3   6.074560  -0.8356286   1.9713374   0.5283077   0.2644225
Patient4   5.380191   1.5952808  -0.3836321   0.5421914  -1.1187352
Patient5   5.014123   0.3295078   1.6541453  -0.1366734   0.6509530
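
The three outcome layouts differ only in their outcome columns: `time` and `status` for time to event data, and a single `y` column for binary and normal data. The sketch below illustrates that distinction on toy dataframes. The helper `guess_outcome()` is hypothetical and is not how OncoCast() itself identifies the outcome; it only relies on the column names shown in the example tables.

```r
# Hypothetical helper (not part of OncoCast): guesses the outcome layout
# from the column names used in the package's example datasets.
guess_outcome <- function(df) {
  stopifnot(is.data.frame(df))
  if (all(c("time", "status") %in% colnames(df))) {
    "time-to-event"
  } else if ("y" %in% colnames(df) && all(df$y %in% c(0, 1))) {
    "binary"
  } else if ("y" %in% colnames(df) && is.numeric(df$y)) {
    "normal"
  } else {
    "unknown"
  }
}

# Toy dataframes (values made up for illustration).
surv_toy <- data.frame(time = c(9.9, 32.9), status = c(0, 1), ImpCov1 = c(0, 0))
bin_toy  <- data.frame(y = c(1, 0), Covs1 = c(-0.63, 0.18))
norm_toy <- data.frame(y = c(5.76, -5.45), Covs1 = c(-0.63, 0.18))

guess_outcome(surv_toy)
guess_outcome(bin_toy)
guess_outcome(norm_toy)
```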

Feature coding

OncoCast() treats every feature that is not part of the outcome as a covariate (independent variable) and includes it in the ensemble model as a predictor. Because of restrictions in the underlying machine learning algorithms, only numeric predictors can be used. Continuous and binary predictors therefore pose no problem. Categorical variables with more than 2 levels, however, require the creation of “dummy” binary variables, one for each level of the character vector except one level that is left out as the baseline. OncoCast() throws a warning when it recognizes and recodes such a variable, as shown in the example below. We recommend that users do this recoding themselves before creating the ensemble learning model, in order to control which level becomes the baseline.

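# Toy data: two character columns with three levels each
data <- as.data.frame(matrix(c("a","b","c","x","y","z"), nrow = 3, ncol = 2))

# Flag columns that cannot be converted to numeric (coercion yields NA)
dums <- apply(data,2,function(x){anyNA(as.numeric(as.character(x)))})
data.save <- FALSE
if(sum(dums) > 0){
  # Build dummy columns for the flagged variables, dropping the first level
  # of each as the baseline, then remove the original character columns
  tmp <- data %>%
    select(which(dums)) %>%
    fastDummies::dummy_cols(remove_first_dummy = TRUE) %>%
    select(-one_of(names(which(dums))))
  # Recombine with the untouched columns and make every column numeric
  data <- as.data.frame(cbind(
    data %>% select(-one_of(names(which(dums)))),
    tmp
  ) %>% mutate_all(as.character) %>%
    mutate_all(as.numeric)
  )
  data.save <- TRUE
  warning("Character variables were transformed to dummy numeric variables. If you didn't have any character variables make sure all columns in your input data are numeric. The transformed data will be saved as part of the output.")
}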

kable(data, row.names = TRUE)
   V1_b  V1_c  V2_y  V2_z
1     0     0     0     0
2     1     0     1     0
3     0     1     0     1

Note that a warning is thrown, and that the modified data are saved and returned with the ensemble model.
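
To choose the baseline level yourself, the dummy variables can be created before the data are passed to OncoCast(). The following is a minimal sketch using base R's `relevel()` and `model.matrix()` rather than the fastDummies approach shown above; the feature name `group`, its levels and the object names are illustrative assumptions.

```r
# Toy data (made up for illustration): one categorical feature with 3 levels.
feat <- data.frame(
  group = c("a", "b", "c", "b"),
  row.names = paste0("Patient", 1:4),
  stringsAsFactors = FALSE
)

# Choose the baseline explicitly: level "b" becomes the reference here.
feat$group <- relevel(factor(feat$group), ref = "b")

# model.matrix() creates one 0/1 column per non-baseline level;
# drop the intercept column to keep only the dummy variables.
dummies <- as.data.frame(model.matrix(~ group, data = feat)[, -1, drop = FALSE])

dummies
```

Samples at the baseline level have 0 in every dummy column, so the choice made in `relevel()` determines which level the remaining columns are compared against.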