OncoCast-Data.Rmd
`OncoCast()` expects as input a data frame with samples (or patients) as rows and features of interest as columns. Each row name must be a sample ID that is unique to the observation of interest. Similarly, each column name must be unique to the feature it represents.
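As a minimal sketch of how a table could be brought into this layout, the snippet below moves an ID column into the row names after checking that sample IDs and feature names are unique. The helper `to_oncocast_layout()`, the `sample_id` column and the toy values are hypothetical and are not part of the package.

```r
# Hypothetical raw table: the sample IDs still sit in an ordinary column.
raw <- data.frame(
  sample_id = c("Patient1", "Patient2", "Patient3"),
  time      = c(9.9, 32.9, 2.3),
  status    = c(0, 1, 1),
  ImpCov1   = c(0, 0, 1),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not part of OncoCast): moves the IDs into the row names
# and stops if sample IDs or feature names are duplicated.
to_oncocast_layout <- function(df, id_col) {
  ids <- as.character(df[[id_col]])
  if (anyDuplicated(ids) > 0) {
    stop("Sample IDs must be unique.")
  }
  out <- df[, setdiff(colnames(df), id_col), drop = FALSE]
  if (anyDuplicated(colnames(out)) > 0) {
    stop("Feature names must be unique.")
  }
  rownames(out) <- ids
  out
}

input <- to_oncocast_layout(raw, "sample_id")
```

Checking the IDs before assigning them matters because R refuses duplicated row names outright, so an explicit check gives a clearer error message than the one raised by `rownames<-`.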
Several example datasets are included in the package and are loaded with it: `survData`, `survData.LT`, `binData` and `normData`. If you are unsure what the input data should look like, please refer to these tables. As the code below shows, they are stored as data frames.
```r
typeof(survData)
#> [1] "list"
is.data.frame(survData)
#> [1] TRUE
```
`OncoCast()` can currently handle three common types of outcome.

Time-to-event (survival) outcomes, as in `survData`. This is the primary target of the `OncoCast` package, and most of the features in this software are oriented toward this outcome.

```r
kable(survData[1:5, 1:5])
```
| | time | status | ImpCov1 | ImpCov2 | ImpCov3 |
|---|---|---|---|---|---|
| Patient1 | 9.933107 | 0 | 0 | 1 | 1 |
| Patient2 | 32.944307 | 1 | 0 | 0 | 0 |
| Patient3 | 2.264857 | 1 | 0 | 0 | 0 |
| Patient4 | 3.658523 | 1 | 0 | 0 | 0 |
| Patient5 | 12.197182 | 1 | 0 | 1 | 0 |
Binary outcomes, as in `binData`:

```r
kable(binData[1:5, 1:5])
```
| | y | Covs1 | Covs2 | Covs3 | Covs4 |
|---|---|---|---|---|---|
| Patient1 | 1 | -0.6264538 | 0.8936737 | -0.3410670 | -1.5414026 |
| Patient2 | 0 | 0.1836433 | -1.0472981 | 1.5024245 | 0.1943211 |
| Patient3 | 1 | -0.8356286 | 1.9713374 | 0.5283077 | 0.2644225 |
| Patient4 | 1 | 1.5952808 | -0.3836321 | 0.5421914 | -1.1187352 |
| Patient5 | 1 | 0.3295078 | 1.6541453 | -0.1366734 | 0.6509530 |
Continuous outcomes, as in `normData`:

```r
kable(normData[1:5, 1:5])
```
| | y | Covs1 | Covs2 | Covs3 | Covs4 |
|---|---|---|---|---|---|
| Patient1 | 5.760664 | -0.6264538 | 0.8936737 | -0.3410670 | -1.5414026 |
| Patient2 | -5.447879 | 0.1836433 | -1.0472981 | 1.5024245 | 0.1943211 |
| Patient3 | 6.074560 | -0.8356286 | 1.9713374 | 0.5283077 | 0.2644225 |
| Patient4 | 5.380191 | 1.5952808 | -0.3836321 | 0.5421914 | -1.1187352 |
| Patient5 | 5.014123 | 0.3295078 | 1.6541453 | -0.1366734 | 0.6509530 |
`OncoCast()` treats every feature that is not part of the outcome as a covariate (independent variable) and includes it in the ensemble model as a predictor. Because of restrictions in the underlying machine learning algorithms, only numeric predictors can be used. Continuous and binary predictors pose no problem. Categorical variables with more than two levels, however, must be recoded as “dummy” binary variables, one for each level of the character vector except one level that is left out as the baseline. If `OncoCast()` recognizes such a variable, it recodes it and throws a warning, as shown in the example below. We recommend that users do this recoding themselves before creating the ensemble learning model, in order to control which level becomes the baseline.
```r
library(dplyr)  # %>%, select(), one_of(), mutate_all()
library(knitr)  # kable()

data <- as.data.frame(matrix(c("a", "b", "c", "x", "y", "z"), nrow = 3, ncol = 2))

# Flag every column that cannot be coerced to numeric (coercion produces NA).
dums <- apply(data, 2, function(x) {
  anyNA(as.numeric(as.character(x)))
})
data.save <- FALSE

if (sum(dums) > 0) {
  # Dummy-code the flagged columns; the first level of each is dropped as baseline.
  tmp <- data %>%
    select(which(dums)) %>%
    fastDummies::dummy_cols(remove_first_dummy = TRUE) %>%
    select(-one_of(names(which(dums))))

  # Recombine with the untouched columns and force every column to numeric.
  data <- as.data.frame(
    cbind(data %>% select(-one_of(names(which(dums)))), tmp) %>%
      mutate_all(as.character) %>%
      mutate_all(as.numeric)
  )
  data.save <- TRUE
  warning("Character variables were transformed to dummy numeric variables. If you didn't have any character variables make sure all columns in your input data are numeric. The transformed data will be saved as part of the output.")
}

kable(data, row.names = TRUE)
```
| | V1_b | V1_c | V2_y | V2_z |
|---|---|---|---|---|
| 1 | 0 | 0 | 0 | 0 |
| 2 | 1 | 0 | 1 | 0 |
| 3 | 0 | 1 | 0 | 1 |
Note that a warning is thrown and the modified data will be saved and returned with the ensemble model.
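To choose the baseline yourself, one option is to dummy-code character variables before calling `OncoCast()`. The sketch below uses only base R (`relevel()` and `model.matrix()`), which is a different route from the `fastDummies` code shown above. The toy data frame, the `stage` variable and the choice of `"II"` as baseline are hypothetical.

```r
# Toy predictors; `stage` is a hypothetical character variable with three levels.
toy <- data.frame(
  age   = c(61, 47, 70, 55),
  stage = c("I", "III", "II", "III"),
  row.names = paste0("Patient", 1:4),
  stringsAsFactors = FALSE
)

# Pick the baseline explicitly: "II" becomes the level that is left out.
toy$stage <- relevel(factor(toy$stage), ref = "II")

# model.matrix() creates one 0/1 column per non-baseline level;
# the first column is the intercept, which is dropped here.
dummies <- model.matrix(~ stage, data = toy)[, -1, drop = FALSE]

# Combine the numeric predictor with the dummy columns.
toy_numeric <- cbind(toy["age"], as.data.frame(dummies))
```

Every column of `toy_numeric` is numeric, so no automatic recoding should be needed, and the left-out level is the one you selected rather than whichever level happens to come first.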