library(OncoCast)
library(dplyr)
library(knitr)

Data format

OncoCast() expects a data frame with samples (or patients) as rows and features of interest as columns. Each row name must be a sample ID that is unique to the observation of interest. Similarly, each column name must be unique to the feature it represents.

Several example datasets are included in the package and are loaded with it: survData, survData.LT, binData and normData. If you are unsure what the input data should look like, refer to these tables. As the code below shows, they are stored as data frames (typeof() reports "list" because a data frame is a list of columns internally).

typeof(survData)
#> [1] "list"
is.data.frame(survData)
#> [1] TRUE
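To make the layout described above concrete, here is a minimal sketch in base R that turns a table with an ID column into a data frame with unique row and column names. The toy values and the names `raw`, `sample_id` and `input` are assumptions made for illustration; they do not come from the package.

```r
# Hypothetical raw table: one row per patient, IDs stored in a column.
raw <- data.frame(
  sample_id = c("Patient1", "Patient2", "Patient3"),
  time      = c(9.9, 32.9, 2.3),
  status    = c(0, 1, 1),
  GeneA     = c(1, 0, 0)
)

# Stop early if a sample ID or a column name appears more than once.
stopifnot(
  !anyDuplicated(raw$sample_id),
  !anyDuplicated(colnames(raw))
)

# Move the IDs into the row names and keep only outcome and feature columns.
input <- raw[, setdiff(colnames(raw), "sample_id")]
rownames(input) <- raw$sample_id
```

Checking the ID column before assigning it matters because R refuses duplicated row names, so the explicit check gives a clearer failure point.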

Outcome of interest

OncoCast() currently handles three common outcome types:

Time-to-event data. This was the original target of the OncoCast package, and most of the software's features are designed with this outcome in mind.

kable(survData[1:5,1:5])

	time	status	ImpCov2	ImpCov3
Patient1	9.933107	0	1	1
Patient2	32.944307	1	0	0
Patient3	2.264857	1	0	0
Patient4	3.658523	1	0	0
Patient5	12.197182	1	1	0

Binary data, for outcomes recorded as success or failure.

kable(binData[1:5,1:5])

	y	Covs1	Covs2	Covs3	Covs4
Patient1	1	-0.6264538	0.8936737	-0.3410670	-1.5414026
Patient2	0	0.1836433	-1.0472981	1.5024245	0.1943211
Patient3	1	-0.8356286	1.9713374	0.5283077	0.2644225
Patient4	1	1.5952808	-0.3836321	0.5421914	-1.1187352
Patient5	1	0.3295078	1.6541453	-0.1366734	0.6509530

Normal data, for outcomes that follow a Gaussian distribution.

kable(normData[1:5,1:5])

	y	Covs1	Covs2	Covs3	Covs4
Patient1	5.760664	-0.6264538	0.8936737	-0.3410670	-1.5414026
Patient2	-5.447879	0.1836433	-1.0472981	1.5024245	0.1943211
Patient3	6.074560	-0.8356286	1.9713374	0.5283077	0.2644225
Patient4	5.380191	1.5952808	-0.3836321	0.5421914	-1.1187352
Patient5	5.014123	0.3295078	1.6541453	-0.1366734	0.6509530
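Judging only from the three example tables above, the outcome type can be read off the column layout: `time` and `status` for time-to-event data, and a single `y` column for binary or Gaussian data. The sketch below encodes that reading in a hypothetical helper, `guess_outcome()`, which is not part of OncoCast; the column names it looks for are an assumption taken from the example tables.

```r
# Hypothetical helper (not an OncoCast function): infer the outcome type
# from the column names used in the example datasets.
guess_outcome <- function(df) {
  if (all(c("time", "status") %in% colnames(df))) {
    "time-to-event"
  } else if ("y" %in% colnames(df) && all(df$y %in% c(0, 1))) {
    "binary"
  } else if ("y" %in% colnames(df) && is.numeric(df$y)) {
    "gaussian"
  } else {
    "unknown"
  }
}

# Toy tables mimicking the first rows shown above (values rounded).
surv_toy <- data.frame(time = c(9.93, 32.94), status = c(0, 1), ImpCov2 = c(1, 0))
bin_toy  <- data.frame(y = c(1, 0), Covs1 = c(-0.63, 0.18))
norm_toy <- data.frame(y = c(5.76, -5.45), Covs1 = c(-0.63, 0.18))

guess_outcome(surv_toy)
guess_outcome(bin_toy)
guess_outcome(norm_toy)
```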

Features coding

The OncoCast() function treats every feature that is not part of the outcome as a covariate (independent variable) and includes it in the ensemble model as a predictor. Because of restrictions in the underlying machine learning algorithms, only numeric predictors can be used. Continuous and binary predictors therefore need no special handling. Categorical variables with more than 2 levels, however, must be recoded as “dummy” binary variables, one per level of the character vector (except one level, which is left out as the baseline). OncoCast() throws a warning when it recognizes and recodes such a variable, as shown in the example below. We recommend that users do this recoding themselves before creating the ensemble learning model, in order to control which level becomes the baseline.

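# Toy data with two character columns (V1: a, b, c and V2: x, y, z)
data <- as.data.frame(matrix(c("a","b","c","x","y","z"), nrow = 3, ncol = 2))

# Flag every column that cannot be converted to numeric without producing NAs
dums <- apply(data, 2, function(x){anyNA(as.numeric(as.character(x)))})
data.save <- FALSE
if(sum(dums) > 0){
  # Create dummy variables for the flagged columns, dropping the first level
  # of each, then remove the original character columns
  tmp <- data %>%
    select(which(dums)) %>%
    fastDummies::dummy_cols(remove_first_dummy = TRUE) %>%
    select(-all_of(names(which(dums))))
  # Recombine with the untouched columns and coerce everything to numeric
  data <- as.data.frame(cbind(
    data %>% select(-all_of(names(which(dums)))),
    tmp
  ) %>% mutate_all(as.character) %>%
    mutate_all(as.numeric)
  )
  data.save <- TRUE
  warning("Character variables were transformed to dummy numeric variables. If you didn't have any character variables make sure all columns in your input data are numeric. The transformed data will be saved as part of the output.")
}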

kable(data, row.names = TRUE)

	V1_b	V1_c	V2_y	V2_z
1	0	0	0	0
2	1	0	1	0
3	0	1	0	1

Note that a warning is thrown and the modified data will be saved and returned with the ensemble model.
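To control which level becomes the baseline, the recoding can be done by hand before the data is passed on. The following is a minimal sketch using only base R (`relevel()` and `model.matrix()`); it is one possible approach, not the method OncoCast uses internally, and the data frame `df` with its `stage` and `age` columns is a made-up example.

```r
# Hypothetical input: one categorical feature (stage) and one numeric feature.
df <- data.frame(
  stage = c("I", "II", "III", "II"),
  age   = c(61, 54, 70, 66),
  row.names = paste0("Patient", 1:4),
  stringsAsFactors = FALSE
)

# Choose the baseline explicitly: here stage "II" is the reference level.
df$stage <- relevel(factor(df$stage), ref = "II")

# model.matrix() builds one 0/1 column per non-baseline level;
# the first column is the intercept, which is dropped.
dummies <- model.matrix(~ stage, data = df)[, -1, drop = FALSE]

# Combine the numeric feature with the dummy columns.
recoded <- cbind(df["age"], dummies)
```

In the result, patients with stage "II" have 0 in both `stageI` and `stageIII`, which is what makes "II" the baseline.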
