datasets

deepsurvk.datasets.load_metabric(partition='complete', **kwargs)

Data from the Molecular Taxonomy of Breast Cancer International Consortium (METABRIC), which uses gene and protein expression profiles to determine new breast cancer subgroups.

The dataset consists of clinical features for 1980 patients, of whom 57.72% have an observed death due to breast cancer, with a median survival time of 116 months. However, the file only contains data for 1904 patients.

For more information, see [1] as well as the accompanying README.

Parameters

partition (string) – Partition of the data to load.
Possible values are:
- complete - The whole dataset (default)
- training or train - Training partition as used in the original DeepSurv
- testing or test - Testing partition as used in the original DeepSurv
data_type (string) – Data type of the data.
Possible values are:
- pandas or pd or dataframe or df - pandas DataFrame (default)
- numpy or np - NumPy array
Note

NumPy is supported as an option, but DeepSurvK is built with pandas in mind.

Returns

X - Features
Y - Target variable
E - Event variable

Return type

tuple of pandas DataFrames

References

1: Curtis, Christina, et al. “The genomic and transcriptomic architecture of 2,000 breast tumours reveals novel subgroups.” Nature 486.7403 (2012): 346-352.
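
Example

A minimal sketch of loading the training and testing partitions in one call. The helper `load_train_test` is hypothetical and not part of DeepSurvK; it only assumes that the loader accepts the `partition` and `data_type` keyword arguments documented above and returns an (X, Y, E) tuple.

```python
from typing import Callable, Tuple


def load_train_test(loader: Callable[..., Tuple], data_type: str = "pandas"):
    """Call a dataset loader once per partition and return both splits.

    `loader` is any function that accepts `partition` and `data_type`
    keyword arguments and returns an (X, Y, E) tuple. This helper is
    illustrative only; it is not provided by DeepSurvK.
    """
    train = loader(partition="training", data_type=data_type)
    test = loader(partition="testing", data_type=data_type)
    return train, test
```

With DeepSurvK installed, passing `deepsurvk.datasets.load_metabric` as `loader` should return the two partitions as `(X_train, Y_train, E_train)` and `(X_test, Y_test, E_test)`.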

deepsurvk.datasets.load_rgbsg(partition='complete', **kwargs)

The training partition belongs to the Rotterdam tumor bank dataset [2]. It contains records of 1546 patients with node-positive breast cancer. Nearly 90% of the patients have an observed death time.

The testing partition belongs to the German Breast Cancer Study Group (GBSG) [3]. It contains records for 686 patients (of which 56% are censored) in a randomized clinical trial that studied the effects of chemotherapy and hormone treatment on survival rate.

For more information, see [2] and [3], as well as the accompanying README.

Parameters

partition (string) – Partition of the data to load.
Possible values are:
- complete - The whole dataset (default)
- training or train - Training partition as used in the original DeepSurv
- testing or test - Testing partition as used in the original DeepSurv
data_type (string) – Data type of the data.
Possible values are:
- pandas or pd or dataframe or df - pandas DataFrame (default)
- numpy or np - NumPy array
Note

NumPy is supported as an option, but DeepSurvK is built with pandas in mind.

Returns

X - Features
Y - Target variable
E - Event variable

Return type

tuple of pandas DataFrames

References

2: Foekens, John A., et al. “The urokinase system of plasminogen activation and prognosis in 2780 breast cancer patients.” Cancer research 60.3 (2000): 636-643.
3: Schumacher, M., et al. “Randomized 2 x 2 trial evaluating hormonal treatment and the duration of chemotherapy in node-positive breast cancer patients. German Breast Cancer Study Group.” Journal of Clinical Oncology 12.10 (1994): 2086-2093.
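
Example

Because the two partitions come from different cohorts, it can be useful to compare them before training. The sketch below summarises one partition from its survival times and event flags. The function `describe_split` is hypothetical, and the coding of the event variable (1 for an observed event, 0 for a censored record) is an assumption that should be checked against the accompanying README.

```python
from statistics import median


def describe_split(times, events):
    """Summarise one partition from its survival times and event flags.

    Assumes `events` holds 1 for an observed event and 0 for a
    censored record.
    """
    times = [float(t) for t in times]
    events = [int(e) for e in events]
    if len(times) != len(events):
        raise ValueError("times and events must have the same length")
    if not events:
        raise ValueError("at least one record is required")
    n = len(events)
    observed = sum(events)
    return {
        "n": n,
        "observed": observed,
        "censored_fraction": (n - observed) / n,
        "median_time": median(times),
    }
```

With pandas output, and assuming Y and E each hold a single column, the inputs could be obtained with `Y.to_numpy().ravel()` and `E.to_numpy().ravel()`.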

deepsurvk.datasets.load_simulated_gaussian(partition='complete', **kwargs)

Synthetic data with a Gaussian (non-linear) log-risk function.

For more information, see [4] as well as the accompanying README.

Parameters

partition (string) – Partition of the data to load.
Possible values are:
- complete - The whole dataset (default)
- training or train - Training partition as used in the original DeepSurv
- testing or test - Testing partition as used in the original DeepSurv
data_type (string) – Data type of the data.
Possible values are:
- pandas or pd or dataframe or df - pandas DataFrame (default)
- numpy or np - NumPy array
Note

NumPy is supported as an option, but DeepSurvK is built with pandas in mind.

Returns

X - Features
Y - Target variable
E - Event variable

Return type

tuple of pandas DataFrames

References

4: Katzman, Jared L., et al. “DeepSurv: personalized treatment recommender system using a Cox proportional hazards deep neural network.” BMC medical research methodology 18.1 (2018): 24.
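
Example

To make "Gaussian log-risk" concrete, the sketch below evaluates a log-risk of the form log(lambda_max) * exp(-(x0^2 + x1^2) / (2 r^2)) on the first two covariates. The functional form and the default values of `lambda_max` and `r` are assumptions based on the description in the DeepSurv paper; they have not been checked against the files shipped with DeepSurvK.

```python
import math


def gaussian_log_risk(x0, x1, lambda_max=5.0, r=0.5):
    """Gaussian-shaped log-risk over the first two covariates.

    The default constants are assumed values, not read from DeepSurvK.
    """
    return math.log(lambda_max) * math.exp(-(x0 ** 2 + x1 ** 2) / (2.0 * r ** 2))


# The log-risk peaks at the origin and decays towards zero away from it.
peak = gaussian_log_risk(0.0, 0.0)
far = gaussian_log_risk(1.0, 1.0)
```

The log-risk is highest at the origin and falls off symmetrically, which is what makes the relationship non-linear in the covariates.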

deepsurvk.datasets.load_simulated_linear(partition='complete', **kwargs)

Synthetic data with a linear log-risk function.

For more information, see [5] as well as the accompanying README.

Parameters

partition (string) – Partition of the data to load.
Possible values are:
- complete - The whole dataset (default)
- training or train - Training partition as used in the original DeepSurv
- testing or test - Testing partition as used in the original DeepSurv
data_type (string) – Data type of the data.
Possible values are:
- pandas or pd or dataframe or df - pandas DataFrame (default)
- numpy or np - NumPy array
Note

NumPy is supported as an option, but DeepSurvK is built with pandas in mind.

Returns

X - Features
Y - Target variable
E - Event variable

Return type

tuple of pandas DataFrames

References

5: Katzman, Jared L., et al. “DeepSurv: personalized treatment recommender system using a Cox proportional hazards deep neural network.” BMC medical research methodology 18.1 (2018): 24.
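
Example

A minimal sketch of how survival times with a linear log-risk could be simulated. The log-risk x0 + 2 * x1, the ten uniform covariates, and the `baseline_hazard` parameter are assumptions based on the description in the DeepSurv paper; both function names are hypothetical and the code does not reproduce the files shipped with DeepSurvK.

```python
import math
import random


def linear_log_risk(x):
    """Linear log-risk that depends only on the first two covariates (assumed form)."""
    return x[0] + 2.0 * x[1]


def sample_death_time(log_risk, baseline_hazard=1.0, rng=random):
    """Draw an exponential death time with hazard baseline_hazard * exp(log_risk)."""
    return rng.expovariate(baseline_hazard * math.exp(log_risk))


# One simulated patient: covariates drawn uniformly from [-1, 1].
rng = random.Random(0)
covariates = [rng.uniform(-1.0, 1.0) for _ in range(10)]
death_time = sample_death_time(linear_log_risk(covariates), rng=rng)
```

A higher log-risk raises the hazard and therefore shortens the expected death time.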

deepsurvk.datasets.load_simulated_treatment(partition='complete', **kwargs)

Synthetic data similar to the simulated_gaussian dataset, with an additional column representing treatment.

For more information, see [6] as well as the accompanying README.

Parameters

partition (string) – Partition of the data to load.
Possible values are:
- complete - The whole dataset (default)
- training or train - Training partition as used in the original DeepSurv
- testing or test - Testing partition as used in the original DeepSurv
data_type (string) – Data type of the data.
Possible values are:
- pandas or pd or dataframe or df - pandas DataFrame (default)
- numpy or np - NumPy array
Note

NumPy is supported as an option, but DeepSurvK is built with pandas in mind.

Returns

X - Features
Y - Target variable
E - Event variable

Return type

tuple of pandas DataFrames

References

6: Katzman, Jared L., et al. “DeepSurv: personalized treatment recommender system using a Cox proportional hazards deep neural network.” BMC medical research methodology 18.1 (2018): 24.

deepsurvk.datasets.load_support(partition='complete', **kwargs)

Data from the Study to Understand Prognoses and Preferences for Outcomes and Risks of Treatments (SUPPORT), which studied the survival time of seriously ill hospitalized adults.

The original dataset consists of 14 clinical features for 9105 patients. Patients with missing features were dropped, leaving a total of 8873 patients.

For more information, see [7] as well as the accompanying README.

Parameters

partition (string) – Partition of the data to load.
Possible values are:
- complete - The whole dataset (default)
- training or train - Training partition as used in the original DeepSurv
- testing or test - Testing partition as used in the original DeepSurv
data_type (string) – Data type of the data.
Possible values are:
- pandas or pd or dataframe or df - pandas DataFrame (default)
- numpy or np - NumPy array
Note

NumPy is supported as an option, but DeepSurvK is built with pandas in mind.

Returns

X - Features
Y - Target variable
E - Event variable

Return type

tuple of pandas DataFrames

References

7: Knaus, William A., et al. “The SUPPORT prognostic model: Objective estimates of survival for seriously ill hospitalized adults.” Annals of internal medicine 122.3 (1995): 191-203.
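
Example

The row-wise filtering described above (dropping every patient with at least one missing feature) can be sketched as follows. The function names and the feature names in the sample records are illustrative only, and this is not necessarily how the shipped file was prepared.

```python
import math


def is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def drop_incomplete(rows):
    """Keep only the rows (dicts) in which every feature has a value."""
    return [row for row in rows if not any(is_missing(v) for v in row.values())]


# Illustrative records; the feature names are made up for this example.
patients = [
    {"age": 62.0, "sex": 1, "creatinine": 1.2},
    {"age": 71.0, "sex": 0, "creatinine": float("nan")},
    {"age": None, "sex": 1, "creatinine": 0.9},
]
complete = drop_incomplete(patients)
```

On a pandas DataFrame, `DataFrame.dropna()` with its default arguments applies the same rule: any row containing a missing value is removed.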

deepsurvk.datasets.load_whas(partition='complete', **kwargs)

Data from the Worcester Heart Attack Study (WHAS), which investigates the effects of a patient’s factors on acute myocardial infarction (MI) survival.

It consists of 1638 observations and 5 features: age, sex, body mass index (BMI), left heart failure complications (CHF), and order of MI (MIORD).

For more information, see [8] as well as the accompanying README.

Parameters

partition (string) – Partition of the data to load.
Possible values are:
- complete - The whole dataset (default)
- training or train - Training partition as used in the original DeepSurv
- testing or test - Testing partition as used in the original DeepSurv
data_type (string) – Data type of the data.
Possible values are:
- pandas or pd or dataframe or df - pandas DataFrame (default)
- numpy or np - NumPy array
Note

NumPy is supported as an option, but DeepSurvK is built with pandas in mind.

Returns

X - Features
Y - Target variable
E - Event variable

Return type

tuple of pandas DataFrames

References

8: Hosmer Jr, David W., Stanley Lemeshow, and Susanne May. Applied survival analysis: regression modeling of time-to-event data. Vol. 618. John Wiley & Sons, 2011.