Method description: survival models

The object of primary interest in survival analysis is the survival function, which is defined as

$$S(t) = \Pr(T > t)$$

where $t$ denotes time and $T$ is the random variable representing the time of death. That is, the survival function is the probability that death occurs after a specified moment in time. The survival function is also called the survivor function or survivorship function in problems of biological survival, and the reliability function in mechanical survival problems (the reliability function is usually denoted $R(t)$).

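The hazard function is defined as the event rate at time $t$ conditional on survival until time $t$ or later:

$$\lambda(t) = \lim_{dt \to 0} \frac{\Pr(t \le T < t + dt \mid T \ge t)}{dt} = -\frac{S'(t)}{S(t)}$$

The force of mortality is a synonym for the hazard function which is used particularly in demography. The term hazard rate is also used. The hazard function can alternatively be represented in terms of the cumulative hazard function:

$$\Lambda(t) = -\log S(t), \qquad \lambda(t) = \frac{d\Lambda(t)}{dt}$$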

Other functions are defined in terms of the survival and hazard functions. The lifetime distribution function is defined as the complement of the survival function:

$$F(t) = \Pr(T \le t) = 1 - S(t)$$

and the derivative of $F$, the probability density function of the lifetime distribution, $f(t) = F'(t)$, is sometimes called the event density; it is the rate of death or failure events per unit of time.
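The relationships between these functions can be checked numerically. The sketch below assumes an exponential lifetime with a constant rate (`RATE` is an illustrative value, not taken from the text) and the standard identities $F = 1 - S$, $f = F'$, $\lambda = f / S$ and $\Lambda = -\log S$; all function names are hypothetical.

```python
import math

RATE = 0.5  # assumed constant hazard of an exponential lifetime (illustrative)

def survival(t):
    """S(t) = P(T > t) for an exponential lifetime."""
    return math.exp(-RATE * t)

def lifetime_cdf(t):
    """F(t) = 1 - S(t), the complement of the survival function."""
    return 1.0 - survival(t)

def event_density(t, eps=1e-6):
    """f(t) = F'(t), approximated by a central finite difference."""
    return (lifetime_cdf(t + eps) - lifetime_cdf(t - eps)) / (2 * eps)

def hazard(t):
    """lambda(t) = f(t) / S(t): event rate given survival up to t."""
    return event_density(t) / survival(t)

def cumulative_hazard(t):
    """Lambda(t) = -log S(t)."""
    return -math.log(survival(t))

t = 2.0
print(survival(t), lifetime_cdf(t), event_density(t), hazard(t), cumulative_hazard(t))
```

For the exponential case the hazard is constant and equal to the rate, which makes it a convenient sanity check.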

Censored observations

Observations are called censored when the dependent variable represents the time to a terminal event and the duration of the study is limited in time, so that for some observations the event has not occurred by the end of the study.

Type I censoring describes the situation when a process is terminated at a particular point in time and the remaining individuals are known to have survived up to that moment. In this case the censoring time is fixed and the number of events is a random variable.

Type II censoring concerns cases in which the experiment is continued until a fixed proportion of events has occurred, so the number of events is fixed and the survival time is the random variable.

There are situations in which censoring can occur at different times (multiple censoring), or only at a particular point in time (single censoring).

A distinction can also be made to reflect the "side" of the timescale at which censoring occurs. In most cases the researcher knows exactly when the experiment started, and the censoring always occurs on the right side of the timescale (right censoring). Sometimes the researcher does not know the starting time (for example, when exactly the symptoms of a disease first occurred) and in this case the censoring occurs on the left side (left censoring).

The censoring mechanism in the Survival Analysis module is assumed to be type I single censoring on the right side of the timescale.
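As an illustration of type I single right censoring, the sketch below turns a list of true lifetimes into observed (time, event) pairs. The function name, the simulated lifetimes and the study end time are all hypothetical and serve only to show the mechanism.

```python
import random

def censor_type1_right(lifetimes, study_end):
    """Apply type I single right censoring at a fixed time `study_end`.

    Returns (observed_time, event) pairs: event is 1 if the death was
    observed, 0 if the individual was still alive at `study_end`.
    """
    observed = []
    for t in lifetimes:
        if t <= study_end:
            observed.append((t, 1))          # event seen before the study ended
        else:
            observed.append((study_end, 0))  # censored: only T > study_end is known
    return observed

random.seed(0)
lifetimes = [random.expovariate(0.5) for _ in range(8)]  # hypothetical true lifetimes
data = censor_type1_right(lifetimes, study_end=3.0)
n_events = sum(event for _, event in data)  # a random quantity under type I censoring
print(data, n_events)
```

The censoring time is fixed in advance, while the number of observed events varies from sample to sample.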

Nonparametric models

The two nonparametric estimators, Kaplan-Meier and Life-Table, are the simplest cases of survival analysis. The distribution of the survival time is estimated for the population as a whole, solely on the basis of the distribution of the dependent variable and the values of the censor attribute. The outcome of the non-parametric survival analysis is an empirical (baseline) survivorship function and an empirical (baseline) hazard function.

Let $t_k$ be the k-th ordered time point (event time) in the data, $n_k$ be the number of observations alive, i.e. those for which the event time or censoring time is at least $t_k$, $d_k$ be the number of deaths (uncensored events) in the interval $[t_k, t_{k+1})$, $c_k$ be the number of censored observations in the interval $[t_k, t_{k+1})$, $s_k$ be the number of observations that survived the interval $[t_k, t_{k+1})$, and

$$b_k = t_{k+1} - t_k$$

be the length of the k-th time interval.

The Kaplan-Meier model

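In the Kaplan-Meier model we estimate the conditional probability of an event (death) occurring in the interval $[t_k, t_{k+1})$ by

$$\hat q_k = \frac{d_k}{n_k}$$

where $d_k$ is the number of deaths in the interval and $n_k$ is the number of observations alive at $t_k$, and let the probability of a non-event be equal to

$$\hat p_k = 1 - \hat q_k$$

Then the non-parametric survivorship function estimator can be expressed as

$$\hat S(t_k) = \prod_{i=1}^{k} \hat p_i$$

and its standard error is

$$\mathrm{SE}\bigl(\hat S(t_k)\bigr) = \hat S(t_k) \sqrt{\sum_{i=1}^{k} \frac{\hat q_i}{n_i \hat p_i}}$$

The non-parametric hazard function estimator can be expressed as

$$\hat\lambda_k = \frac{2 \hat q_k}{b_k (1 + \hat p_k)}$$

where $b_k$ is the length of the k-th interval, and its standard error is

$$\mathrm{SE}(\hat\lambda_k) = \hat\lambda_k \sqrt{\frac{1 - (b_k \hat\lambda_k / 2)^2}{n_k \hat q_k}}$$

The survival probability density function can be expressed as

$$\hat f_k = \frac{\hat S(t_{k-1})\, \hat q_k}{b_k}, \qquad \hat S(t_0) = 1$$

and its standard error is

$$\mathrm{SE}(\hat f_k) = \hat f_k \sqrt{\sum_{i=1}^{k-1} \frac{\hat q_i}{n_i \hat p_i} + \frac{\hat p_k}{n_k \hat q_k}}$$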
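A minimal sketch of the Kaplan-Meier survivorship estimate follows. It assumes the product-limit form (a running product of conditional survival probabilities at the distinct event times) with Greenwood's formula for the standard error; the function name and the toy data are hypothetical and this is not the module's implementation.

```python
import math

def kaplan_meier(times, events):
    """Kaplan-Meier survival estimate with Greenwood standard errors.

    times  : observed times (event or censoring)
    events : 1 if the event (death) was observed, 0 if censored
    Returns a list of (t_k, n_k, d_k, S_hat, se), one per distinct event time.
    """
    event_times = sorted({t for t, e in zip(times, events) if e == 1})
    surv, greenwood_sum, rows = 1.0, 0.0, []
    for t_k in event_times:
        n_k = sum(1 for t in times if t >= t_k)  # alive just before t_k
        d_k = sum(1 for t, e in zip(times, events) if t == t_k and e == 1)
        q_k = d_k / n_k      # conditional probability of death
        p_k = 1.0 - q_k      # conditional probability of survival
        surv *= p_k
        if p_k > 0:
            greenwood_sum += q_k / (n_k * p_k)
            se = surv * math.sqrt(greenwood_sum)
        else:
            se = float("nan")  # Greenwood's formula is undefined once S_hat reaches 0
        rows.append((t_k, n_k, d_k, surv, se))
    return rows

times = [1, 2, 2, 3, 4, 5]
events = [1, 1, 0, 1, 0, 1]
km_rows = kaplan_meier(times, events)
for row in km_rows:
    print(row)
```

Censored observations contribute to the number alive up to their censoring time but never count as deaths, which is how the estimator uses the censor attribute.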

The Life-Table model

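The Life-Table model takes into account the censoring mechanism. Let

$$n'_k = n_k - \frac{c_k}{2}$$

be the estimator of the number of observations alive in the interval $[t_k, t_{k+1})$, where $n_k$ is the number of observations alive at $t_k$ and $c_k$ is the number of observations censored in the interval. The conditional probability of an event (death) occurring in the interval $[t_k, t_{k+1})$ can be estimated by

$$\hat q_k = \frac{d_k}{n'_k}$$

where $d_k$ is the number of deaths in the interval, and the opposite probability is

$$\hat p_k = 1 - \hat q_k$$

Then the non-parametric survivorship function estimator can be expressed as

$$\hat S(t_k) = \prod_{i=1}^{k} \hat p_i$$

and its standard error is

$$\mathrm{SE}\bigl(\hat S(t_k)\bigr) = \hat S(t_k) \sqrt{\sum_{i=1}^{k} \frac{\hat q_i}{n'_i \hat p_i}}$$

The non-parametric hazard function estimator can be expressed as

$$\hat\lambda_k = \frac{2 \hat q_k}{b_k (1 + \hat p_k)}$$

where $b_k$ is the length of the k-th interval, and its standard error is

$$\mathrm{SE}(\hat\lambda_k) = \hat\lambda_k \sqrt{\frac{1 - (b_k \hat\lambda_k / 2)^2}{n'_k \hat q_k}}$$

The survival probability density function can be expressed as

$$\hat f_k = \frac{\hat S(t_{k-1})\, \hat q_k}{b_k}, \qquad \hat S(t_0) = 1$$

and its standard error is

$$\mathrm{SE}(\hat f_k) = \hat f_k \sqrt{\sum_{i=1}^{k-1} \frac{\hat q_i}{n'_i \hat p_i} + \frac{\hat p_k}{n'_k \hat q_k}}$$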
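The life-table calculation can be sketched as follows. The sketch assumes the usual actuarial conventions: half of the observations censored in an interval are counted as alive, the hazard is evaluated at the interval midpoint, and the density uses the survival up to the start of the interval. The function name, argument layout and the example counts are hypothetical.

```python
import math

def life_table(boundaries, deaths, censored, n_start):
    """Actuarial life-table estimates on fixed intervals.

    boundaries : interval end points (one more entry than there are intervals)
    deaths     : number of deaths in each interval
    censored   : number of censored observations in each interval
    n_start    : number of observations alive at the first boundary
    """
    rows = []
    n_k = n_start
    surv = 1.0      # survival up to the start of the current interval
    gw_sum = 0.0    # running sum used in the Greenwood-type standard error
    for k, (d_k, c_k) in enumerate(zip(deaths, censored)):
        width = boundaries[k + 1] - boundaries[k]
        n_eff = n_k - c_k / 2.0                      # effective number alive
        q_k = d_k / n_eff
        p_k = 1.0 - q_k
        hazard = 2.0 * q_k / (width * (1.0 + p_k))   # midpoint hazard estimate
        density = surv * q_k / width
        surv *= p_k                                  # survival past the interval end
        if p_k > 0:
            gw_sum += q_k / (n_eff * p_k)
            se_surv = surv * math.sqrt(gw_sum)
        else:
            se_surv = float("nan")
        rows.append({"start": boundaries[k], "n_eff": n_eff, "q": q_k,
                     "survival": surv, "se_survival": se_surv,
                     "hazard": hazard, "density": density})
        n_k -= d_k + c_k                             # survivors entering the next interval
    return rows

lt_rows = life_table([0, 1, 2, 3], deaths=[2, 1, 1], censored=[0, 2, 1], n_start=10)
for row in lt_rows:
    print(row)
```

Compared with the Kaplan-Meier sketch, the only structural difference is the adjusted denominator, which reflects the assumption that censoring happens on average halfway through the interval.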

Confidence levels for the estimated functions

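A 100(1-Confidence Level)% confidence interval for each of the above estimators is calculated for Confidence Level = $\alpha$ specified in the non-parametric survival algorithm settings. The upper and lower bounds are calculated as:

$$\hat\theta \pm z_{1-\alpha/2}\, \mathrm{SE}(\hat\theta)$$

where $\hat\theta$ is the given estimator, $z_{1-\alpha/2}$ is the $100(1-\alpha/2)$-th percentile of the standard normal distribution, and $\mathrm{SE}(\hat\theta)$ is the standard error of the given estimator.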
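The bounds can be computed as in the sketch below, which assumes a symmetric two-sided interval based on the standard normal quantile. The function name and the example values are hypothetical.

```python
from statistics import NormalDist

def normal_confidence_interval(estimate, std_error, alpha=0.05):
    """Two-sided 100(1 - alpha)% interval: estimate -/+ z * std_error."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # standard normal quantile
    return estimate - z * std_error, estimate + z * std_error

# Hypothetical survival estimate 0.80 with standard error 0.05.
lower, upper = normal_confidence_interval(0.80, 0.05, alpha=0.05)
print(lower, upper)
```

Because the interval is symmetric, bounds for a survival probability close to 0 or 1 can fall outside the unit interval; whether the module truncates them is not stated in the text.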

The Cox model

The Cox model, or the proportional hazards model, is the most general regression model used for survival time modeling. The Cox model tries to predict the distribution of the survival time for individuals in a given population, investigate the strength of influence of particular variables on the expected survival time, and compare survival time distributions among different subpopulations. The model is not based on any assumptions about the underlying survival distribution. It assumes that the hazard is a function of the independent variables (covariates); no assumptions are made about the nature or shape of the hazard function.

Semi-parametric survival (survivorship) function

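In the Cox model the survivorship function for an individual $i$ with covariate vector $x_i$ can be expressed as a combination of a non-parametric baseline survivorship function and a parametric part (in the form analogous to that found in regression models):

$$S(t \mid x_i) = S_0(t)^{\exp(\beta^{\top} x_i)}$$

where $S_0(t)$ is the baseline survivorship function and $\beta$ is the vector of model parameters.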
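As a small illustration, the sketch below evaluates an individual survivorship value from a baseline value by raising it to the power of the exponentiated linear predictor, which is the standard Cox form assumed here. The function name, coefficients and covariate values are hypothetical.

```python
import math

def individual_survival(baseline_survival, beta, x):
    """S(t | x) = S0(t) ** exp(beta . x) for one individual at one time point."""
    linear_predictor = sum(b * xi for b, xi in zip(beta, x))
    return baseline_survival ** math.exp(linear_predictor)

beta = [0.7, -0.2]  # hypothetical coefficients
x = [1.0, 3.0]      # hypothetical covariate values
s_individual = individual_survival(0.9, beta, x)
print(s_individual)
```

A positive linear predictor lowers the survival probability relative to the baseline, and a zero linear predictor reproduces the baseline exactly.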

Semi-parametric hazard function

The Cox model provides an estimate of the treatment effect on survival after adjustment for other explanatory variables. It allows us to estimate the hazard (or risk) of death or another event of interest for individuals, given their prognostic variables. The Cox model is a semi-parametric model: it incorporates the effect of covariates, relies on the proportional hazards assumption and does not require a specification of the parametric survival distribution.

The hazard function of the survival time for each individual is assumed to have the form

$$\lambda(t \mid x_i) = \lambda_0(t) \exp(\beta^{\top} x_i)$$

where $\beta$ is a vector of unknown parameters, $\lambda_0(t)$ is the baseline hazard function and $x_i$ is a vector of explanatory variables characterizing the given individual. The baseline hazard function characterizes how the hazard function changes as a function of survival time.

It is possible to linearize this model by dividing both sides of the equation by $\lambda_0(t)$ and taking the logarithm of both sides:

$$\log \frac{\lambda(t \mid x_i)}{\lambda_0(t)} = \beta^{\top} x_i$$

Assumptions

The Cox model has no implicit assumptions about the shape of the underlying hazard function, but the model equations do imply two assumptions:

  1. Proportionality assumption: there is a multiplicative relationship between the underlying hazard function and the log-linear function of the covariates.

  2. There is a log-linear relationship between the independent variables and the underlying hazard function.

The proportionality assumption means that for any two observations with different values of the independent variables, the ratio of the hazard functions for those two observations does not depend on time.
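A short derivation makes this explicit. It assumes the multiplicative model form with a baseline hazard $\lambda_0(t)$ and coefficient vector $\beta$ (the symbols are chosen here for illustration), applied to two observations with covariate vectors $x_1$ and $x_2$:

```latex
\frac{\lambda(t \mid x_1)}{\lambda(t \mid x_2)}
  = \frac{\lambda_0(t)\,\exp(\beta^{\top} x_1)}{\lambda_0(t)\,\exp(\beta^{\top} x_2)}
  = \exp\!\bigl(\beta^{\top}(x_1 - x_2)\bigr)
```

The baseline hazard cancels, so the ratio depends only on the difference between the covariate vectors and not on $t$.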

Partial likelihood function

To estimate the parameter vector $\beta$, a partial likelihood function that depends only on the parameters of interest was introduced by Cox (1972, 1975) (see also Cox and Oakes 1984). In the present settings the exact partial likelihood function is:

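In the present settings the Breslow approximation of the log partial likelihood function is:

$$\ell(\beta) = \sum_{i} \Bigl[ \sum_{j \in D_i} \beta^{\top} x_j - d_i \log \sum_{l \in R_i} \exp(\beta^{\top} x_l) \Bigr]$$

where the outer sum runs over the distinct event times $t_i$, $D_i$ is the group of subjects who died at time $t_i$, $R_i$ is the group of all subjects in the risk set at time $t_i$, $d_i$ is the number of events (deaths) in the i-th time interval, $x_j$ is the covariate vector of subject $j$ ($j = 1, \dots, N$), and $N$ is the number of observations.

In order to find a solution of the maximization problem for the partial likelihood the module applies the widely used Newton-Raphson algorithm, which tries to zero the first order partial derivatives of the log-likelihood function:

$$\frac{\partial \ell(\beta)}{\partial \beta} = \sum_{i} \Bigl[ \sum_{j \in D_i} x_j - d_i\, \bar x_i(\beta) \Bigr] = 0$$

where

$$\bar x_i(\beta) = \frac{\sum_{l \in R_i} x_l \exp(\beta^{\top} x_l)}{\sum_{l \in R_i} \exp(\beta^{\top} x_l)}$$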
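The maximization can be sketched for the simplest case of a single covariate. The code below assumes the Breslow form of the log partial likelihood (each death contributes its own linear predictor minus the log of the summed exponentiated predictors over the risk set) and a plain Newton-Raphson update without step-size control. The function names and the toy data are hypothetical; this is an illustration, not the module's code.

```python
import math

def breslow_loglik(beta, times, events, x):
    """Breslow log partial likelihood with its first and second derivative
    for a single covariate (scalar beta)."""
    loglik = grad = hess = 0.0
    for t_i in sorted({t for t, e in zip(times, events) if e == 1}):
        dead = [j for j, (t, e) in enumerate(zip(times, events)) if t == t_i and e == 1]
        risk = [j for j, t in enumerate(times) if t >= t_i]
        d_i = len(dead)
        w = [math.exp(beta * x[j]) for j in risk]   # exp(beta * x) over the risk set
        s0 = sum(w)
        s1 = sum(wj * x[j] for wj, j in zip(w, risk))
        s2 = sum(wj * x[j] ** 2 for wj, j in zip(w, risk))
        mean = s1 / s0                              # risk-set weighted mean of x
        loglik += beta * sum(x[j] for j in dead) - d_i * math.log(s0)
        grad += sum(x[j] for j in dead) - d_i * mean
        hess -= d_i * (s2 / s0 - mean ** 2)         # minus the weighted variance of x
    return loglik, grad, hess

def newton_raphson(times, events, x, beta=0.0, tol=1e-10, max_iter=50):
    """Iterate beta <- beta - grad / hess until the gradient is numerically zero."""
    for _ in range(max_iter):
        _, grad, hess = breslow_loglik(beta, times, events, x)
        if abs(grad) < tol:
            break
        beta -= grad / hess
    return beta

times = [1, 2, 3, 4, 5, 6]
events = [1, 1, 0, 1, 1, 0]
x = [1.0, 0.0, 1.0, 0.0, 1.0, 0.0]  # hypothetical binary covariate
beta_hat = newton_raphson(times, events, x)
print(beta_hat, breslow_loglik(beta_hat, times, events, x))
```

The log partial likelihood is concave in the parameter, so the plain update is usually sufficient on well-behaved data; a production implementation would typically add step halving and a check for monotone likelihood.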

Partial likelihood allows time-dependent variables (i.e. variables whose value for any given individual can change over time) to be employed in the survival model. On the one hand, this makes it possible to include and describe more complex factors (such as blood pressure); on the other hand, time-dependent variables can be used to test the validity of the proportional hazards assumption. If hazards among different groups (subpopulations) are proportional, then the plots of the log hazard function estimates for each group versus log(time) should be parallel.

Automatic or semi-automatic variable selection methods, which help to choose the optimal subset of available variables (the one which describes the target variable most accurately), are analogous to those known from regression models. Among the implemented heuristics are: Forward Selection, Backward Elimination and Stepwise.

The Berndt, Hall, Hall, and Hausman estimator (so-called BHHH estimator: for details see Berndt et al. 1974) is used for the estimation of the model parameters:

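The information matrix is approximated by the outer product of the gradient, calculated as:

$$\hat I(\beta) = \sum_{i=1}^{N} g_i(\beta)\, g_i(\beta)^{\top}$$

where

$$g_i(\beta) = \frac{\partial \ell_i(\beta)}{\partial \beta}$$

is the contribution of the i-th observation to the gradient of the log-likelihood function, and the BHHH estimator uses the inverse matrix of $\hat I(\beta)$.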
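The outer-product approximation can be sketched as below. The code assumes that per-observation gradient vectors are already available and simply accumulates their outer products; the score values are made up for illustration, and the 2x2 inverse is a hypothetical helper that is only adequate for this two-parameter example.

```python
def outer_product_information(scores):
    """Approximate the information matrix by the sum of g_i g_i^T.

    scores : list of per-observation gradient vectors (lists of equal length)
    """
    p = len(scores[0])
    info = [[0.0] * p for _ in range(p)]
    for g in scores:
        for r in range(p):
            for c in range(p):
                info[r][c] += g[r] * g[c]
    return info

def invert_2x2(m):
    """Inverse of a 2x2 matrix (enough for this two-parameter illustration)."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

# Hypothetical per-observation score vectors for a two-parameter model.
scores = [[0.5, -0.2], [-0.3, 0.4], [0.1, 0.1], [-0.4, -0.2]]
info = outer_product_information(scores)
cov = invert_2x2(info)  # inverse of the outer-product information matrix
print(info, cov)
```

The appeal of this approximation is that it needs only first derivatives and is positive semi-definite by construction.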

Model fit

The quality measures of the survival model are, to some extent, similar to those of non-linear regression and are based on residual error analysis (for details see the descriptions of Lift and ROC). Among the most frequently used residuals are the Schoenfeld, score, and martingale residuals. The residual is defined as the difference between the actual and predicted survival time. The mean value of the residual is also assessed to check whether the model fit is "skewed". The mean values of the Schoenfeld residuals and the scaled Schoenfeld residuals are calculated. Under the proportional hazards assumption the Schoenfeld residuals have the sample path of a random walk.

Model significance

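For testing the global significance of the estimated parameters (null hypothesis that $\beta = 0$) three statistics are calculated, namely the likelihood ratio, score and Wald statistics:

$$\chi^2_{LR} = 2\,\bigl[\ell(\hat\beta) - \ell(0)\bigr], \qquad \chi^2_{S} = g(0)^{\top} I(0)^{-1} g(0), \qquad \chi^2_{W} = \hat\beta^{\top} I(\hat\beta)\, \hat\beta$$

where $\ell$ is the log partial likelihood, $g$ is its gradient and $I$ is the information matrix. All three statistics have an asymptotic chi-squared distribution with the number of degrees of freedom equal to the dimension of the vector $\beta$.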
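For a single parameter the three global tests reduce to scalar expressions, sketched below. The forms used (twice the log-likelihood difference, squared score over information at zero, squared estimate times information at the estimate) are the textbook likelihood ratio, score and Wald statistics and are assumed here; the input numbers are invented for illustration.

```python
import math

def global_tests_1d(loglik_hat, loglik_null, beta_hat, score_null, info_null, info_hat):
    """Likelihood ratio, score and Wald statistics for H0: beta = 0 (one parameter)."""
    lr = 2.0 * (loglik_hat - loglik_null)
    score = score_null ** 2 / info_null
    wald = beta_hat ** 2 * info_hat
    return lr, score, wald

def chi2_sf_1df(stat):
    """Upper-tail probability of a chi-squared variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))

# Hypothetical fitted quantities for a one-covariate model.
lr, score, wald = global_tests_1d(
    loglik_hat=-5.0, loglik_null=-6.5, beta_hat=0.9,
    score_null=1.6, info_null=0.9, info_hat=0.8,
)
for name, stat in (("LR", lr), ("score", score), ("Wald", wald)):
    print(name, stat, chi2_sf_1df(stat))
```

With more than one parameter the scalar products become quadratic forms and the reference distribution has as many degrees of freedom as there are parameters.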

For testing a linear hypothesis about the estimated parameters (a null hypothesis that $L\beta = 0$, where $L$ is the matrix of linear coefficients for the null hypothesis), the Wald statistic for the parameter estimator $\hat\beta$ is calculated. Under the null hypothesis the Wald statistic has an asymptotic chi-squared distribution with the number of degrees of freedom equal to the rank of $L$.

Confidence limits for the estimated parameters

A 100(1-Confidence Level)% confidence interval for the parameter estimates is calculated for Confidence Level = $\alpha$ specified in the current algorithm settings. The upper and lower bounds are calculated as:

$$\hat\beta_i \pm z_{1-\alpha/2} \sqrt{\hat v_{ii}}$$

where $z_{1-\alpha/2}$ is the $100(1-\alpha/2)$-th percentile of the standard normal distribution, and $\hat v_{ii}$ is the i-th diagonal element of the estimated covariance matrix.