Clinical Trials & Design of Experiments


Introduction to Biostatistics for Clinical and Translational Researchers

KUMC Departments of Biostatistics & Internal Medicine
University of Kansas Cancer Center
FRONTIERS: The Heartland Institute of Clinical and Translational Research

Course Information

 Jo A. Wick, PhD
 Office Location: 5028 Robinson
 Email: [email protected]

 Lectures are recorded and posted at http://biostatistics.kumc.edu under 'Events & Lectures'

Objectives

 Understand the role of statistics in the scientific process and how it is a core component of evidence-based medicine
 Understand the features, strengths, and limitations of descriptive, observational, and experimental studies
 Distinguish between association and causation
 Understand the roles of chance, bias, and confounding in the evaluation of research

Course Calendar

 July 5: Introduction to Statistics: Core Concepts
 July 12: Quality of Evidence: Considerations for Design of Experiments and Evaluation of Literature
 July 19: Hypothesis Testing & Application of Concepts to Common Clinical Research Questions
 July 26: Hypothesis Testing & Application of Concepts to Common Clinical Research Questions (cont.)

Why is there conflicting evidence?

Answer: There is no perfect research study.

 Every study has limitations.
 Every study has context.
 Medicine (and research!) is an art as well as a science.
 Unfortunately, the literature is full of poorly designed, poorly executed, and improperly interpreted studies; it is up to you, the consumer, to critically evaluate its merit.

Critical Evaluation

Validity and Relevance

 Is the article from a peer-reviewed journal?

 How does the location of the study reflect the larger context of the population?

 Does the sample reflect the targeted population?

 Is the study sponsored by an organization that may influence the study design or results?

 Is the intervention feasible and available?

Critical Evaluation

Intent

 Therapy: testing the efficacy of drug treatments, surgical procedures, alternative methods of delivery, etc. (RCT)
 Diagnosis: demonstrating whether a new diagnostic test is valid (Cross-sectional survey)
 Screening: demonstrating the value of tests which can be applied to large populations and which pick up disease at a presymptomatic stage (Cross-sectional survey)
 Prognosis: determining what is likely to happen to someone whose disease is picked up at an early stage (Longitudinal cohort)

Critical Evaluation

 Causation: determining whether a harmful agent is related to development of illness (Cohort or case-control)

Critical Evaluation

Validity based on intent

 What is the study design? Is it appropriate and optimal for the intent?

 Are all participants who entered the trial accounted for in the conclusion?

 What protections against bias were put into place? Blinding? Controls? Randomization?

 If there were treatment groups, were the groups similar at the start of the trial?

 Were the groups treated equally (aside from the actual intervention)?

Critical Evaluation

 If statistically significant, are the results clinically meaningful?

 If negative, was the study powered prior to execution?

 Were there other factors not accounted for that could have affected the outcome?

Miser WF, Primary Care, 2006.

Experimental Design

 Statistical analysis, no matter how intricate, cannot rescue a poorly designed study.

 No matter how efficient, statistical analysis cannot be done overnight.

 A researcher should plan and state what they are going to do, do it, and then report those results.
 Be transparent!

Types of Samples

 Random sample: each person has an equal chance of being selected; this is the principal way to guarantee that the sample represents the population.
 Convenience sample: persons are selected because they are convenient or readily available.
 Systematic sample: persons are selected based on a pattern.
 Stratified sample: persons are selected from within subgroups.

Random Sampling

 For studies, it is optimal (but not always possible) for the sample providing the data to be representative of the population under study.
 Simple random sampling provides a representative sample (theoretically) and protection against selection bias.
 A simple random sample is a sampling scheme in which every possible sub-sample of size n from the population is equally likely to be selected.
 • Assuming the sample is representative, the summary statistics (e.g., mean) should be 'good' estimates of the true quantities in the population. The larger n is, the better the estimates will be (see the sketch below).

Random Samples

 The Fundamental Rule of Using Data for Inference requires the use of random sampling or random assignment.
 Random sampling or random assignment ensures control over "nuisance" variables.
 We can randomly select individuals to ensure that the population is well represented:
 • Equal sampling of males and females
 • Equal sampling from a range of ages
 • Equal sampling from a range of BMI, weight, etc.

Random Samples

 Randomly assigning subjects to treatment levels ensures that the levels differ only by the treatment administered, not by:
 • weights
 • ages
 • risk factors

Nuisance Variation

 Nuisance variation is any undesired source of variation that affects the outcome.
 It can systematically distort results in a particular direction; this is referred to as bias.
 It can also increase the variability of the outcome being measured, which results in a less powerful test because of too much 'noise' in the data.

Example: Albino Rats

 It is hypothesized that exposing albino rats to microwave radiation will decrease their food consumption.

 Intervention: exposure to radiation
 • Levels: exposure or non-exposure
 • Levels: 0; 20,000; 40,000; 60,000 µW
 Measurable outcome: amount of food consumed
 Possible nuisance variables: sex, weight, temperature, previous feeding experiences

Experimental Design

 Types of data collected in a clinical trial:
 • Treatment: the patient's assigned treatment and the actual treatment received
 • Response: measures of the patient's response to treatment, including side effects
 • Prognostic factors (covariates): details of the patient's initial condition and previous history upon entry into the trial

Experimental Design

 Three basic types of outcome data:
 • Qualitative: nominal or ordinal; e.g., success/failure, CR, PR, stable disease, progression of disease
 • Quantitative: interval or ratio; e.g., raw score, difference, ratio, %
 • Time to event: survival or disease-free time, etc.

Experimental Design

 Formulate statistical hypotheses that are germane to the scientific hypothesis.
 Determine:
 • the experimental conditions to be used (independent variable(s))
 • the measurements to be recorded
 • the extraneous conditions to be controlled (nuisance variables)

Experimental Design

 Specify the number of subjects required and the population from which they will be sampled (power, Type I & II errors).
 Specify the procedure for assigning subjects to the experimental conditions.
 Determine the statistical analysis that will be performed.

Experimental Design

 Considerations:
 • Does the design permit the calculation of a valid estimate of treatment effect?
 • Does the data-collection procedure produce reliable results?
 • Does the design possess sufficient power to permit an adequate test of the hypotheses?

Experimental Design

 Considerations:
 • Does the design provide maximum efficiency within the constraints imposed by the experimental situation?
 • Does the experimental procedure conform to accepted practices and procedures used in the research area? This facilitates comparison of findings with the results of other investigations.

Types of Studies

 Purposes of research:
 1) To explore
 2) To describe or classify
 3) To establish relationships
 4) To establish causality
 Strategies for accomplishing these purposes:
 1) Naturalistic observation
 2) Case study
 3) Survey
 4) Quasi-experiment
 5) Experiment

Generating Evidence

[Diagram: hierarchy of study designs, increasing in complexity and confidence. Descriptive studies (of populations or individuals): case reports, case series, cross-sectional studies. Analytic studies: observational (case-control, cohort) and experimental (RCT).]

Observation versus Experiment

 A designed experiment involves the investigator assigning (preferably randomly) some or all conditions to subjects.
 An observational study includes conditions that are observed, not assigned.

Example: Heart Study

Question: How does serum total cholesterol vary by age, gender, education, and use of blood pressure medication? Does smoking affect any of the associations?

 Recruit n = 3000 subjects over two years
 Take blood samples and have subjects answer a CVD risk factor survey
 Outcome: serum total cholesterol
 Factors: BP meds (observed, not assigned)
 Confounders?

Example: Diabetes

Question: Will a new treatment help overweight people with diabetes lose weight?

 N = 40 obese adults with Type II (non-insulin-dependent) diabetes (20 female/20 male)
 Randomized, double-blind, placebo-controlled study of treatment versus placebo
 Outcome: weight loss
 Factor: treatment versus placebo

Cross-Sectional Studies

 Designed to assess the association between an independent variable (exposure?) and a dependent variable (disease?)
 Selection of study subjects is based on both their exposure and outcome status; thus there is no direction of inquiry

Cross-Sectional Studies

[Diagram: gather data on exposure and disease simultaneously, classifying subjects into four groups: exposed/diseased, exposed/no disease, not exposed/diseased, not exposed/no disease.]

Cross-Sectional Studies

 Cannot determine causal relationships between exposure and outcome
 Cannot determine the temporal relationship between exposure and outcome
 • Supported: "Exposure is associated with Disease"
 • Not supported: "Exposure causes Disease"; "Disease follows Exposure"

Analysis of Cross-Sectional Data

Using the standard 2x2 layout (rows: exposure status; columns: disease status; cell counts a, b, c, d):

            Disease +   Disease -
Exposed +       a           b
Exposed -       c           d

Prevalence of disease compared in exposed versus non-exposed groups:

P(D+ | E+) = a / (a + b)
P(D+ | E-) = c / (c + d)

Prevalence of exposure compared in diseased versus non-diseased groups:

P(E+ | D+) = a / (a + c)
P(E+ | D-) = b / (b + d)

Case-Control Studies

 Designed to assess the association between disease and past exposures
 Selection of study subjects is based on their disease status
 Direction of inquiry is backward

Case-Control Studies

[Diagram: starting from disease status (diseased cases versus no-disease controls), gather data backward in time on exposure (exposed versus unexposed); the direction of inquiry runs from disease back to exposure.]

Analysis of Case-Control Data

            Disease +   Disease -   Total
Exposed +       a           b       a + b
Exposed -       c           d       c + d
Total         a + c       b + d

Odds ratio: the odds of case exposure versus the odds of control exposure:

OR = (a / c) / (b / d) = ad / bc

Cohort Studies

 Designed to assess the association between exposures and disease occurrence
 Selection of study subjects is based on their exposure status
 Direction of inquiry is forward

Cohort Studies

[Diagram: starting from exposure status (exposed versus not exposed), gather data forward in time on disease occurrence (disease versus no disease); the direction of inquiry runs from exposure to disease.]

Cohort Studies

 Attrition or loss to follow-up
 Time and money!
 Inefficient for very rare outcomes
 Bias
 • Outcome ascertainment
 • Information bias
 • Non-response bias

Analysis of Cohort Data

            Disease +   Disease -   Total
Exposed +       a           b       a + b
Exposed -       c           d       c + d
Total         a + c       b + d

Relative risk: the risk of disease in the exposed versus the risk of disease in the unexposed:

RR = [a / (a + b)] / [c / (c + d)]

Randomized Controlled Trials

 Designed to test the association between exposures and disease
 Selection of study subjects is based on their assigned exposure status
 Direction of inquiry is forward

Randomized Controlled Trials

[Diagram: subjects are randomized to exposure (treated versus control), then followed forward in time for disease versus no disease; the direction of inquiry runs from randomized exposure to disease.]

Why do we randomize?

 Suppose we wish to compare surgery for CAD to a drug used to treat CAD. We know that such major heart surgery is invasive and complex; some people die during surgery. We may assign the patients with less severe CAD (on purpose or not) to the surgery group.
 If we see a difference in patient survival, is it due to surgery versus drugs, or to less severe disease versus more severe disease?
 Such a study would be inconclusive and a waste of time, money, and patients.

How could we fix it?

Randomize!

 Randomization is critical because there is no way for a researcher to be aware of all possible confounders.
 Observational studies have little to no formal control for confounders; thus we cannot conclude cause and effect based on their results.
 Randomization forms the basis of inference. A sketch of a simple randomization scheme follows.
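A minimal sketch of randomization in Python (subject IDs and arm names are invented): shuffling the subject list and splitting it in half gives a 1:1 allocation that no investigator preference, conscious or not, can influence.

```python
# Sketch: randomly assigning N = 40 subjects to two arms of equal size,
# so that disease severity (or any other confounder) cannot drive assignment.
import random

random.seed(42)
subjects = [f"subject_{i:02d}" for i in range(1, 41)]   # hypothetical IDs
random.shuffle(subjects)                                 # random permutation
surgery, drug = subjects[:20], subjects[20:]             # 1:1 allocation
print("surgery arm:", surgery[:3], "...")
print("drug arm:   ", drug[:3], "...")
```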

Other Protections Against Bias

 Blinding
 • Single (patient only), double (patient and evaluator), and triple (patient, evaluator, statistician) blinding is possible
 • Eliminates biases that can arise from knowledge of treatment
 Control
 • Null (no treatment), placebo (no active treatment), and active (current standard of care) controls are used
 • Eliminates biases that can arise from the natural progression of disease (null control) or simply from the act of being treated (placebo control)

Analysis of RCT Data

 What kind of outcome do you have?

 Continuous? Categorical?

 How many samples (groups) do you have?

 Are they related or independent?

Types of Tests

 Parametric methods: make assumptions about the distribution of the data (e.g., normally distributed) and are suited for sample sizes large enough to assess whether the distributional assumption is met
 Nonparametric methods: make no assumptions about the distribution of the data and are suitable for small sample sizes or for large samples where parametric assumptions are violated
 • Use ranks of the data values rather than the actual data values themselves
 • Lose power when a parametric test would be appropriate

Analysis of RCT Data

 Two independent percentages? Fisher's exact test, chi-square test, logistic regression
 Two independent means? Mann-Whitney test, two-sample t-test, analysis of variance, linear regression
 Two independent time-to-event outcomes? Log-rank test, Wilcoxon test, Cox regression
 Any adjustments for other prognostic factors can be accomplished with the appropriate regression models (e.g., logistic for yes/no outcomes, linear for continuous, Cox for time-to-event); see the sketch below.

Threats to Valid Inference

 Statistical Conclusion Validity
 • Low statistical power: failing to reject a false hypothesis because of inadequate sample size, irrelevant sources of variation that are not controlled, or the use of inefficient test statistics.
 • Violated assumptions: test statistics have been derived conditional on the truth of certain assumptions. If their tenability is questionable, incorrect inferences may result.
 Many methods are based on approximations to a normal distribution or another probability distribution that become more accurate as sample size increases; using these methods with small sample sizes may produce unreliable results.

Threats to Valid Inference

 Statistical Conclusion Validity
 • Reliability of measures and treatment implementation
 • Random variation in the experimental setting and/or subjects
 • Inflation of variability may result in not rejecting a false hypothesis (loss of power).

Threats to Valid Inference

 Internal Validity
 • Uncontrolled events: events other than the administration of treatment that occur between the time the treatment is assigned and the time the outcome is measured.
 • The passing of time: processes not related to treatment that occur simply as a function of the passage of time and that may affect the outcome.

Threats to Valid Inference

 Internal Validity
 • Instrumentation: changes in the calibration of a measuring instrument, the use of more than one instrument, shifts in the subjective criteria used by observers, etc.
 • The "John Henry" effect: compensatory rivalry by subjects receiving less desirable treatments.
 • The "placebo" effect: a subject behaves in a manner consistent with his or her expectations.

Threats to Valid Inference

 External Validity (Generalizability)
 • Reactive arrangements: subjects who are aware that they are being observed may behave differently than subjects who are not aware.
 • Interaction of testing and treatment: pretests may sensitize subjects to a topic and enhance the effectiveness of a treatment.

Threats to Valid Inference

 External Validity (Generalizability)
 • Self-selection: the results may only generalize to volunteer populations.
 • Interaction of setting and treatment: results obtained in a clinical setting may not generalize to the outside world.

Clinical Trials—Purpose

 Prevention trials look for more effective or safer ways to prevent a disease in individuals who have never had it, or to prevent a disease from recurring in individuals who have.
 Screening trials attempt to identify the best methods for detecting diseases or health conditions.
 Diagnostic trials are conducted to find better tests or procedures for diagnosing a particular disease or condition.

Clinical Trials—Purpose

 Treatment trials assess experimental treatments, new combinations of drugs, or new approaches to surgery or radiation therapy for efficacy and safety.
 Quality of life (supportive care) trials explore means to improve comfort and quality of life for individuals with chronic illness.

Classification according to the U.S. National Institutes of Health.

Clinical Trials—Phases

Pre-clinical studies involve in vivo and in vitro testing of promising compounds to obtain preliminary efficacy, toxicity, and pharmacokinetic information to assist in making decisions about future studies in humans.

Clinical Trials—Phases

Phase 0 studies are exploratory, first-in-human trials designed to establish very early on whether the drug behaves in human subjects as was anticipated from preclinical studies.

 Typically use N = 10 to 15 subjects to assess pharmacokinetics and pharmacodynamics.
 Allow the go/no-go decision, usually made from animal studies, to be based on preliminary human data.

Clinical Trials—Phases

Phase I studies assess the safety, tolerability, pharmacokinetics, and pharmacodynamics of a drug in healthy volunteers (industry standard) or patients (academic/research standard).

 Involve dose-escalation studies which attempt to identify an appropriate therapeutic dose.
 Utilize small samples, typically N = 20 to 80 subjects.

Clinical Trials—Phases

Phase II studies assess the efficacy of the drug and continue the safety assessments from phase I.

 Larger groups are usually used, N = 20 to 300.
 Their purpose is to confirm efficacy (i.e., estimation of effect), not necessarily to compare the experimental drug to placebo or an active comparator.

Clinical Trials—Phases

Phase III studies are the definitive assessment of a drug's effectiveness and safety in comparison with the current gold-standard treatment.

 Much larger sample sizes are utilized, N = 300 to 3,000, and multiple sites can be used to recruit patients.
 Because they are quite an investment, they are usually randomized, controlled studies.

Clinical Trials—Phases

Phase IV studies are also known as post-marketing surveillance trials and involve the ongoing or long-term assessment of safety in drugs that have been approved for human use.

 Detect any rare or long-term adverse effects in a much broader patient population.

The Size of a Clinical Trial

 Lasagna's Law: once a clinical trial has started, the number of suitable patients dwindles to a tenth of what was calculated before the trial began.

The Size of a Clinical Trial

 "How many patients do we need?"
 Statistical methods can be used to determine the required number of patients to meet the trial's principal scientific objectives.

 Other considerations that must be accounted for include availability of patients and resources and the ethical need to prevent any patient from receiving inferior treatment.

 We want the minimum number of patients required to achieve our principal scientific objective.

The Size of a Clinical Trial

 Estimation trials involve the use of point and interval estimates to describe an outcome of interest.
 Hypothesis testing is typically used to detect a difference between competing treatments.

The Size of a Clinical Trial

 Type I error rate (α): the risk of concluding that a significant difference exists between treatments when the treatments are actually equally effective.
 Type II error rate (β): the risk of concluding that no significant difference exists between treatments when the treatments are actually different.

The Size of a Clinical Trial

 Power (1 – β): the probability of correctly detecting a difference between treatments; more commonly referred to as the power of the test.

                    Truth
Conclusion      H1          H0
H1            1 – β         α
H0              β         1 – α

The Size of a Clinical Trial

 Setting three determines the fourth:
 • For a chosen level of significance (α), a clinically meaningful difference (δ) can be detected with a minimally acceptable power (1 – β) with n subjects.
 • Depending on the nature of the outcome, the same applies to estimation: for a chosen level of significance (α), an outcome can be estimated within a specified margin of error (ME) with n subjects.

Example: Detecting a Difference

 The Anturane Reinfarction Trial Research Group (1978) describes the design of a randomized double-blind trial comparing anturan and placebo in patients after a myocardial infarction.

 What is the main purpose of the trial?

 What is the principal measure of patient outcome?

 How will the data be analyzed to detect a treatment difference?

 What type of results does one anticipate with standard treatment?

 How small a treatment difference is it important to detect and with what degree of uncertainty?

Example: Detecting a Difference

 Primary objective: To see if anturan is of value in preventing mortality after a myocardial infarction.

 Primary outcome: Treatment failure is indicated by death within one year of first treatment (0/1).

 Data analysis: Comparison of the percentages of patients dying within the first year on anturan (π1) versus placebo (π2) using a χ² test at the α = 0.05 level of significance.

Example: Detecting a Difference

 Expected results under placebo: One would expect about 10% of patients to die within a year (i.e., π2 = 0.1).
 Difference to detect (δ): It is clinically interesting to be able to determine whether anturan can halve the mortality (i.e., 5% of patients die within a year), and we would like to be 90% sure that we detect this difference as statistically significant.

Example: Detecting a Difference

 We have:
 • H0: π1 = π2 versus H1: π1 ≠ π2 (two-sided test)
 • α = 0.05
 • 1 – β = 0.90
 • δ = π2 – π1 = 0.05
 n = 583 patients per group is required.

 The estimate of power for this test is a function of sample size:

Power = 1 − Φ(z_α/2 − δ/SE) + Φ(−z_α/2 − δ/SE), where SE = sqrt(p1q1/n1 + p2q2/n2) and q = 1 − p.

Example: Detecting a Difference

[Figure: standard normal sampling distribution with two-sided rejection regions of area α/2 below −z_α/2 and above z_α/2. Reject H0 and conclude a difference in either tail; fail to reject H0 and conclude no difference in between. β and 1 − β are the corresponding areas under the alternative.]

Power and Sample Size

 n is roughly inversely proportional to δ²; for fixed α and β, halving the difference in rates requiring detection results in a fourfold increase in sample size.
 n depends on the choice of β: an increase in power from 0.5 to 0.95 requires around 3 times the number of patients.
 Reducing α from 0.05 to 0.01 results in an increase in sample size of around 40% when β is around 10%.
 Using a one-sided test reduces the required sample size.

Example: Detecting a Difference

 Primary objective: To see if treatment A increases outcome W.

 Primary outcome: The primary outcome, W, is continuous.

 Data analysis: Comparison of the mean response of patients on treatment A (μ1) versus placebo (μ2) using a two-sided t-test at the α = 0.05 level of significance.

Example: Detecting a Difference

 Expected results under placebo: One would expect a mean response of 10 (i.e., μ2 = 10).
 Difference to detect (δ): It is clinically interesting to be able to determine whether treatment A can increase the response by 10% (i.e., a mean response of 11 = 10 + 1 in patients getting treatment A), and we would like to be 80% sure that we detect this difference as statistically significant.

Example: Detecting a Difference

 We have:
 • H0: μ1 = μ2 versus H1: μ1 ≠ μ2 (two-sided test)
 • α = 0.05
 • 1 – β = 0.80
 • δ = 1
 For continuous outcomes we need to determine what difference would be clinically meaningful, specified in the form of an effect size, which takes into account the variability of the data.

Example: Detecting a Difference

 Effect size is the difference in the means divided by the standard deviation, usually that of the control or comparison group, or the pooled standard deviation of the two groups:

d = (μ1 − μ2) / σ_pooled, where σ_pooled = sqrt([(n1 − 1)σ1² + (n2 − 1)σ2²] / (n1 + n2 − 2))
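A small sketch computing a pooled-SD effect size and applying the common normal-approximation rule of thumb n per group ≈ 2(z_α/2 + z_β)²/d². The standard deviation of 2.0 and the group sizes are assumed values for illustration, since the slides do not state them.

```python
# Sketch: Cohen's-d-style effect size with a pooled SD, plus the usual
# normal-approximation sample-size rule of thumb (illustrative numbers).
from math import sqrt
from statistics import NormalDist

z = NormalDist().inv_cdf

def pooled_sd(s1, n1, s2, n2):
    return sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))

def n_per_group(d, alpha=0.05, power=0.80):
    return 2 * (z(1 - alpha / 2) + z(power)) ** 2 / d**2

d = (11 - 10) / pooled_sd(2.0, 20, 2.0, 20)   # effect size = 0.5 under assumed SD of 2
print(f"d = {d:.2f}, n per group = {n_per_group(d):.0f}")   # about 63 per group
```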


Example: Detecting a Difference

 Power calculations: an interactive web-based tool can show the relationship between power and the sample size, variability, and difference to detect.
 A decrease in the variability of the data results in an increase in power for a given sample size.
 An increase in the effect size results in a decrease in the required sample size to achieve a given power.
 Decreasing α results in an increase in the required sample size to achieve a given power.

Prognostic Factors

 It is reasonable, and sometimes essential, to collect information on personal characteristics and past history at baseline when enrolling patients onto a clinical trial.
 These variables allow us to determine how generalizable the results are.

Prognostic Factors

 Prognostic factors known to be related to the desired outcome of the clinical trial must be collected, and in some cases randomization should be stratified on these variables.
 Many baseline characteristics may not be known to be related to outcome, but may be associated with outcome for a given trial.

Comparable Treatment Groups

 All baseline prognostic and descriptive factors of interest should be summarized between the treatment groups to ensure that they are comparable between treatments. It is generally recommended that these be descriptive comparisons only, not inferential.
 Note: Just because a factor is balanced does not mean it will not affect outcome, and vice versa.

Subgroup Analysis

 Does response differ for differing types of patients? This is a natural question to ask.
 To answer it, one should test whether the factor that determines the type of patient interacts with treatment.
 Separate significance tests for different subgroups do not provide direct evidence of whether a prognostic factor affects the treatment difference: a test for interaction is much more valid.
 Tests for interactions may also be designed a priori.

Multiplicity of Data

 Multiple treatments: the number of possible treatment comparisons increases rapidly with the number of treatments. (A Newman-Keuls, Tukey's HSD, or other adjustment should be planned.)
 Multiple end-points: there may be multiple ways to evaluate how a patient responds. (Options include a Bonferroni adjustment, a multivariate test, a combined score, or reducing the number of primary end-points; see the sketch below.)

Multiplicity of Data

 Repeated measurements: a patient's progress may be recorded at several fixed time points after the start of treatment. One should aim for a single summary measure for each patient outcome so that only one significance test is necessary.
 Subgroup analyses: patients may be grouped into subgroups and each subgroup analyzed separately.
 Interim analyses: repeated interim analyses may be performed on accumulating data while the trial is in progress.

Summary

 Statistics plays a key role in pre-clinical and clinical research.
 Statistics helps us determine how 'confident' we should be in the results of a study.
 Confidence in a study is based on (1) the size of the study, (2) its safeguards against biases (complexity), and (3) how it was actually undertaken.
 Statistical support is available and should be sought out as early as possible in the process of designing a study.

Next Time . . .

 Basic Descriptive and Inferential Methods
 Hypothesis Testing
 P-values
 Confidence Intervals
 Interpretation
 Examples