Libraries like lifelines provide a plethora of example datasets that one can work with. However, for many tasks you need to simulate specific behaviour in survival curves.

In this post, we demonstrate a simple algorithm to generate survival data in a format comparable to the one used in lifelines example datasets such as `load_leukemia()`.

The generation algorithm is based on the following assumptions:

- There is a strict **survival plateau** with a given survival probability, starting at a given point in time.
- The **progression** from 100% survival at t=0 to the survival plateau is **approximately linear** (i.e. if you generated an infinite number of datapoints, the survival curve would be linear).
- **No censoring events** are generated, except for censoring all surviving participants at the end point of the timeline.
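The assumptions above can also be written as a formula. The symbols are introduced here purely for illustration: $s_p$ is the survival probability of the plateau, $t_p$ the time at which the plateau starts, and $t_{\mathrm{end}}$ the end of the timeline. Under these assumptions the idealized survival function is:

$$
S(t) =
\begin{cases}
1 - (1 - s_p)\,\dfrac{t}{t_p}, & 0 \le t \le t_p \\
s_p, & t_p < t < t_{\mathrm{end}}
\end{cases}
$$

One way to sample from this curve is to censor a participant at $t_{\mathrm{end}}$ with probability $s_p$ and otherwise draw the event time as $T = U \cdot t_p$ with $U$ uniform on $(0, 1)$. For $t \le t_p$ this gives $P(T > t) = s_p + (1 - s_p)\left(1 - \dfrac{t}{t_p}\right) = S(t)$, which is the linear decline described above.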

Code:

    import numpy as np
    import random
    from lifelines import KaplanMeierFitter

    def simulate_survival_data_linear(N, survival_plateau, t_plateau, t_end):
        """
        Generate random simulated survival data using a linear model

        Keyword parameters
        ------------------
        N : integer
            Number of entries to generate
        survival_plateau : float
            The survival probability of the survival plateau
        t_plateau : float
            The time point where the survival plateau starts
        t_end : float
            The time point where all surviving participants will be censored.

        Returns
        -------
        A dict with "Time" and "Event" numpy arrays: 0 is censored, 1 is event
        """
        data = {"Time": np.zeros(N), "Event": np.zeros(N)}

        for i in range(N):
            r = random.random()
            if r <= survival_plateau:
                # Event is censoring at the end of the time period
                data["Time"][i] = t_end
                data["Event"][i] = 0
            else:
                # Event occurs
                # Normalize where we are between 100% and the survival plateau
                p = (r - survival_plateau) / (1 - survival_plateau)
                # Linear model: Time of event linearly depends on uniformly & randomly
                # chosen position in range (0...t_plateau)
                t = p * t_plateau
                data["Time"][i] = t
                data["Event"][i] = 1
        return data

    # Example usage
    data1 = simulate_survival_data_linear(250, 0.2, 18, 24)
    data2 = simulate_survival_data_linear(250, 0.4, 17.2, 24)

Given `data1` and `data2` (see the usage example at the end of the code), you can plot them using:

    # Plot bad subgroup
    kmf1 = KaplanMeierFitter()
    kmf1.fit(data1["Time"], event_observed=data1["Event"], label="Bad subgroup")
    ax = kmf1.plot()

    # Plot good subgroup
    kmf2 = KaplanMeierFitter()
    kmf2.fit(data2["Time"], event_observed=data2["Event"], label="Good subgroup")
    ax = kmf2.plot(ax=ax)

    # Set Y axis to fixed scale
    ax.set_ylim([0.0, 1.0])
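If you want to check numerically that simulated data follows the intended linear model, the following dependency-free sketch may help. It is not part of lifelines: `theoretical_survival`, `simulate_event_times` and `empirical_survival` are hypothetical helper names, and `simulate_event_times` is a plain-Python stand-in that mirrors the sampling step of the generator. Because no participant is censored before `t_end`, the fraction of recorded times later than `t` should match the Kaplan-Meier estimate for any `t` before `t_end`.

```python
import random


def theoretical_survival(t, survival_plateau, t_plateau):
    """Survival probability implied by the linear model at time t (t < t_end)."""
    if t >= t_plateau:
        return survival_plateau
    return 1.0 - (1.0 - survival_plateau) * t / t_plateau


def simulate_event_times(n, survival_plateau, t_plateau, t_end, rng):
    """Plain-Python stand-in for the sampling step; returns a list of times."""
    times = []
    for _ in range(n):
        r = rng.random()
        if r <= survival_plateau:
            # Survivor: censored at the end of the timeline
            times.append(t_end)
        else:
            # Event: uniformly distributed position between 0 and t_plateau
            p = (r - survival_plateau) / (1.0 - survival_plateau)
            times.append(p * t_plateau)
    return times


def empirical_survival(times, t):
    """Fraction of participants whose recorded time is later than t."""
    return sum(1 for x in times if x > t) / len(times)


rng = random.Random(0)  # fixed seed (arbitrary choice) for reproducibility
times = simulate_event_times(100_000, 0.2, 18, 24, rng)

# Compare simulated and theoretical survival at a few time points
for t in (0, 6, 12, 18, 21):
    print(t, round(empirical_survival(times, t), 3),
          round(theoretical_survival(t, 0.2, 18), 3))
```

With a large sample such as this one, the two printed columns should agree closely. With only a few hundred participants, expect visible deviations from the straight line.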

#### Do not want a survival plateau?

Just set `t_plateau = t_end`:

    # Example usage
    data1 = simulate_survival_data_linear(250, 0.2, 24, 24)
    data2 = simulate_survival_data_linear(250, 0.4, 24, 24)
    # Code to plot: See above

#### What happens if you have a low number of participants?

Let’s use `25` instead of `250` as above:

    # Example usage
    data1 = simulate_survival_data_linear(25, 0.2, 24, 24)
    data2 = simulate_survival_data_linear(25, 0.4, 24, 24)
    # Plot code: See above

Although we generated the data with the same parameters, the difference is much less clear in this example, especially towards the end of the timeline (note, however, that the data is generated randomly, so you might see a different result). A large portion of the confidence intervals overlap near `t=24`. In other words, based on this data it is not clear that the two groups of patients are significantly different (i.e. $P \geq 0.05$).
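A rough back-of-the-envelope calculation shows why the sample size matters so much here. In the generator, each participant survives to the end of the timeline with probability `survival_plateau`, so the number of survivors is binomially distributed. The sketch below uses a normal approximation to compare the two surviving fractions at the end of the timeline only. The helper names are hypothetical, and this is a simplification rather than a replacement for a proper test over the whole curve, such as a log-rank test.

```python
import math


def plateau_standard_error(n, survival_plateau):
    """Binomial standard error of the observed surviving fraction."""
    return math.sqrt(survival_plateau * (1.0 - survival_plateau) / n)


def z_score_of_difference(n, s1, s2):
    """Difference of two surviving fractions in units of its standard error.

    Assumes both groups have n participants and are independent.
    """
    se_diff = math.sqrt(plateau_standard_error(n, s1) ** 2
                        + plateau_standard_error(n, s2) ** 2)
    return (s2 - s1) / se_diff


for n in (25, 250):
    print(n, round(z_score_of_difference(n, 0.2, 0.4), 2))
# prints:
# 25 1.58
# 250 5.0
```

With 25 participants per group, the expected difference is only about 1.58 standard errors, below the roughly 1.96 needed for a two-sided 5% level under this approximation. With 250 per group it is about 5 standard errors, which matches the much clearer separation seen in the earlier plots.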