Descriptive statistics

[MG1:Chp2, p7-p18]

Summaries of sample data (statistics) are denoted by Roman letters (e.g. x-bar for the sample mean, s for the sample standard deviation)

Summaries of population data (parameters) are denoted by Greek letters (e.g. mu for the mean, sigma squared for the variance)

Central tendency = The extent to which observations cluster around a central value

Degree of dispersion = The spread of the observations about a central location

Measures of central tendency

  • Mode = The most common value
  • Median = The middle value
  • (Arithmetic) Mean = The average value
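
The three measures above can be checked on a small data set. This is a minimal sketch using Python's standard `statistics` module; the observations are hypothetical values chosen for illustration, not data from [MG1].

```python
import statistics

# Hypothetical sample of observations (illustrative values only)
observations = [2, 3, 3, 4, 5, 6, 12]

mode_value = statistics.mode(observations)      # most common value
median_value = statistics.median(observations)  # middle value of the ranked data
mean_value = statistics.mean(observations)      # sum divided by the number of observations

print("mode:", mode_value, "median:", median_value, "mean:", mean_value)
```

Note how the single large value (12) pulls the mean above the median, while the mode and median are unaffected.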

 

Degree of dispersion

  • Range = Difference between the maximum and minimum values
  • Percentiles = Values that divide the ranked observations into 100 equal parts
    * Median = 50th percentile
    * Interquartile range = 25th to 75th percentile
  • Sample variance = Sum of squares divided by the degrees of freedom
    * Sum of squares = Sum of the squared differences between each observation and the mean
    * Degrees of freedom = Number of observations minus 1
  • Population variance = Sum of squares divided by the number of observations
  • Standard deviation (SD) = Square root of the variance
  • Coefficient of variation (CV) = (SD / mean) x 100%
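
The definitions in this list can be worked through step by step. The sketch below computes them by hand on hypothetical observations; the standard library's `statistics.variance` and `statistics.pvariance` can be used to cross-check the two variances.

```python
import math
import statistics

# Hypothetical observations (illustrative values only)
observations = [4.0, 6.0, 8.0, 10.0, 12.0]
n = len(observations)
mean = sum(observations) / n

# Sum of squares: squared difference between each observation and the mean
sum_of_squares = sum((x - mean) ** 2 for x in observations)

sample_variance = sum_of_squares / (n - 1)   # divide by degrees of freedom
population_variance = sum_of_squares / n     # divide by number of observations
sample_sd = math.sqrt(sample_variance)       # standard deviation
cv_percent = sample_sd / mean * 100          # coefficient of variation, in %
value_range = max(observations) - min(observations)

print(sample_variance, population_variance, round(sample_sd, 3), round(cv_percent, 1))
```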

NB:

  • Degrees of freedom (n - 1) are used when calculating the variance of a sample
    * Because each observation is free to vary except the last one, which must take a defined value for the observations to average to the fixed sample mean
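
One way to see why only n - 1 observations are free to vary: the deviations from the sample mean always sum to zero, so once n - 1 of them are known the last is fixed. A small sketch with hypothetical values:

```python
# Hypothetical observations (illustrative values only)
observations = [4.0, 6.0, 8.0, 10.0, 12.0]
mean = sum(observations) / len(observations)

# Deviation of each observation from the sample mean
deviations = [x - mean for x in observations]

# Knowing the first n - 1 deviations fixes the last one,
# because all n deviations must sum to zero.
implied_last_deviation = -sum(deviations[:-1])

print(deviations[-1], implied_last_deviation)
```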

Sources of variability

  • Biological variability
  • Measurement imprecision
    --> Resulting in random error
  • Mistakes or biases in measurement
    --> Systematic error

Standard error (SE)

[MG1:p9]

  • Standard error (SE)
    = aka standard error of the mean
  • SE = SD / square root of n
  • SE is NOT meant to be used to describe the variability of sample data
  • SE is a measure of precision, i.e. of how well the sample mean estimates the population mean (a population parameter)
    * Used to calculate confidence intervals
    * Often derived from one sample
    * Reliability of the sample mean in predicting the population mean [Chris Flynn]
  • SE is the standard deviation of the sample means (i.e. of the means that repeated samples of the same size would give)
  • Increasing the sample size is one way of reducing SE
    * But the sample size must increase fourfold to halve SE (because SE is proportional to 1 / square root of n)
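
The relationship between sample size and SE can be checked with a short calculation. This sketch assumes a hypothetical SD of 10; the function name `standard_error` is chosen here for illustration.

```python
import math

def standard_error(sd, n):
    """SE = SD / square root of n."""
    return sd / math.sqrt(n)

# Hypothetical SD of 10, with sample sizes of 25 and 100 (a fourfold increase)
se_n25 = standard_error(10.0, 25)
se_n100 = standard_error(10.0, 100)

print(se_n25, se_n100)  # the second SE is half the first
```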

Confidence interval

  • Derived from SE
  • 95% confidence interval of the mean = sample mean +/- (1.96 x SE)
  • 99% confidence interval = sample mean +/- (2.58 x SE)
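  • Definition of 95% CI
    = The range within which the true population mean is expected to lie with 95% confidence
    * Strictly, if sampling were repeated many times, about 95% of the intervals calculated this way would contain the true population mean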
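
As a worked example of the interval formulas above, the sketch below uses a hypothetical sample (mean 120, SD 10, n = 100). The multipliers 1.96 and 2.58 are the ones given in these notes; the function name is an assumption made for illustration.

```python
import math

def confidence_interval(sample_mean, sd, n, multiplier=1.96):
    """Return (lower, upper) = sample mean +/- (multiplier x SE)."""
    se = sd / math.sqrt(n)
    margin = multiplier * se
    return sample_mean - margin, sample_mean + margin

# Hypothetical sample: mean 120, SD 10, n = 100, so SE = 1
low95, high95 = confidence_interval(120.0, 10.0, 100, multiplier=1.96)
low99, high99 = confidence_interval(120.0, 10.0, 100, multiplier=2.58)

print(round(low95, 2), round(high95, 2))
print(round(low99, 2), round(high99, 2))
```

The 99% interval is wider than the 95% interval: more confidence is bought at the cost of a less precise range.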

NB:

  • In a normal distribution, 95% of the observations lie within 1.96 standard deviation of the mean

Frequency distributions

  • Kurtosis describes how peaked the distribution is
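    * Excess kurtosis of a normal distribution = 0 (the unadjusted kurtosis is 3)
  • Median is a better measure of central tendency than the mean in a skewed distribution
    * In a distribution skewed to the right (long tail of high values), the median will usually be smaller than the mean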
  • Bimodal distribution = Distribution with two peaks
    --> Suggests that the sample is not homogeneous and may represent two different populations
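
The effect of skew on the mean and median can be seen with a small right-skewed data set. The values below are hypothetical and chosen only to show the pattern.

```python
import statistics

# Hypothetical right-skewed data: most values are small, one is very large
right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]

median_value = statistics.median(right_skewed)
mean_value = statistics.mean(right_skewed)

# The long right tail drags the mean upwards, but not the median
print("median:", median_value, "mean:", mean_value)
```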

Normal distribution

  • Sometimes referred to as a Gaussian distribution
  • Two parameters define the curve, mu (the mean), and sigma (the standard deviation)
  • Mode = median = mean
  • Formula at [MG1:p13]

NB:

  • Mean +/- 1 SD includes 68% of total area 
  • Mean +/- 1.96 SD includes 95% of total area
  • Mean +/- 2 SD includes 95.4% of total area
  • Mean +/- 3 SD includes 99.7% of total area
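
These areas can be reproduced from the cumulative distribution function of the normal curve. The sketch below uses `statistics.NormalDist` from the Python standard library (available from Python 3.8); the helper name `area_within` is chosen here for illustration.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)

def area_within(k):
    """Fraction of the total area lying within mean +/- k SD."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 1.96, 2, 3):
    print(f"mean +/- {k} SD: {area_within(k) * 100:.1f}% of total area")
```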

Z distribution

In a STANDARD normal distribution

  • Mean = 0
  • Standard deviation = 1
  • aka the z distribution
  • A z transformation converts any normal distribution curve (with different mean and SD) to a standard normal distribution curve (mean = 0, SD = 1)
    * z = (x - mu) / sigma, where sigma is the standard deviation (SD)
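
A short sketch of the z transformation, assuming a hypothetical variable with mean 100 and SD 15. The function name `z_score` is chosen here for illustration.

```python
def z_score(x, mu, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mu) / sd

# Hypothetical normal distribution with mean 100 and SD 15
z_high = z_score(130.0, 100.0, 15.0)   # two SDs above the mean
z_mean = z_score(100.0, 100.0, 15.0)   # the mean itself maps to 0
z_low = z_score(85.0, 100.0, 15.0)     # one SD below the mean

print(z_high, z_mean, z_low)
```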

Central limit theorem

[MG1:p14]

  • As the sample size increases (n>100)
    --> The shape of the sampling distribution of the mean will approximate a normal distribution curve
    * Even if the distribution of the variable itself is not normal
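
The theorem can be illustrated by simulation. This is only a sketch under stated assumptions: the underlying variable is drawn from an exponential distribution (strongly right-skewed, with mean 1 and SD 1), and the sample size of 120 and the 2000 repetitions are arbitrary choices. If the theorem holds, the sample means should centre on 1 with an SD close to 1 / square root of 120, and their mean and median should nearly coincide.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so that the run is repeatable

SAMPLE_SIZE = 120         # observations per sample (n > 100)
NUMBER_OF_SAMPLES = 2000  # how many samples are drawn

def one_sample_mean(n):
    # Exponential distribution with rate 1: right-skewed, mean 1, SD 1
    sample = [random.expovariate(1.0) for _ in range(n)]
    return statistics.fmean(sample)

sample_means = [one_sample_mean(SAMPLE_SIZE) for _ in range(NUMBER_OF_SAMPLES)]

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)   # an empirical standard error
median_of_means = statistics.median(sample_means)
expected_se = 1.0 / math.sqrt(SAMPLE_SIZE)

print(round(mean_of_means, 3), round(median_of_means, 3))
print(round(sd_of_means, 3), round(expected_se, 3))
```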

Binomial distribution

[MG1:p14-p15]

Formula at [MG1:p15]

A binomial distribution exists if a population contains items which belong to one of two mutually exclusive categories
* e.g. gender, complication

Conditions include:

  • Fixed number of observations (trials)
  • Only two outcomes are possible
  • Trials are independent
  • The probability of each outcome is constant from trial to trial
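
When these conditions hold, the probability of exactly k events in n trials can be calculated. The sketch below uses the standard binomial probability formula, whose notation may differ from the one printed at [MG1:p15]; the function name and the example numbers are assumptions made for illustration.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k events in n independent trials,
    each with the same event probability p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

# Hypothetical example: 10 independent trials, event probability 0.5
p_exactly_5 = binomial_pmf(5, 10, 0.5)

# The probabilities over all possible counts (0 to 10) sum to 1
total_probability = sum(binomial_pmf(k, 10, 0.5) for k in range(11))

print(p_exactly_5, total_probability)
```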

Poisson distribution

  • A binomial distribution approximates a Poisson distribution when
    * The number of observations is very large, AND
    * The probability of an event is small (<0.05)
  • Defined by a single parameter (lambda), which is both the mean and the variance

Conditions:

  • Events occur randomly
  • Events occur independently
  • Events occur uniformly (same probability) and singly

The example used in [MG1:p15] calculates the probability of more than one late-night admission
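
A calculation of this kind can be sketched as follows. The rate used here (lambda = 2 events per period) is hypothetical and is not the figure from [MG1:p15]. The last lines also compare the Poisson probabilities with a binomial distribution that has a very large n (1000) and a small p (0.002), to show how closely the two agree.

```python
from math import comb, exp, factorial

def poisson_pmf(k, lam):
    """Probability of exactly k events when the mean number of events is lam."""
    return exp(-lam) * lam ** k / factorial(k)

def binomial_pmf(k, n, p):
    """Probability of exactly k events in n independent trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

lam = 2.0  # hypothetical mean number of events per period

# P(more than one event) = 1 - P(0 events) - P(1 event)
p_more_than_one = 1 - poisson_pmf(0, lam) - poisson_pmf(1, lam)
print(round(p_more_than_one, 3))

# Binomial with n very large and p small, keeping the same mean (n x p = lam)
n, p = 1000, lam / 1000
largest_gap = max(abs(binomial_pmf(k, n, p) - poisson_pmf(k, lam)) for k in range(6))
print(largest_gap)
```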

Incidence and prevalence

  • Incidence = the number of individuals who develop a condition (i.e. new cases) in a given time period
    --> An estimation of probability of developing a disease in a specified time period
  • Prevalence = the number of individuals with a condition at a point in time (i.e. total cases, pre-existing and new)

Presentation of data

[MG1:p17]

  • For a normal distribution, mean and standard deviation are the best statistics to describe data
    * But mean can be affected by extreme values
  • A bimodal distribution is best described with mode
  • Ordinal data should be described with mode or median

Box and whisker plot

  • Used to depict median, interquartile range and range
  • Middle line = median
  • Box = 25th to 75th percentiles
  • Whiskers = minimum and maximum, or 5th and 95th percentiles
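
The values that such a plot displays can be computed directly. This sketch uses `statistics.quantiles` from the Python standard library (Python 3.8 or later) on hypothetical data. The `method="inclusive"` setting is one of several quartile conventions, so plotting software may give slightly different box edges.

```python
import statistics

# Hypothetical observations (illustrative values only)
observations = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# Cut points that divide the ranked data into four equal parts
q1, median_value, q3 = statistics.quantiles(observations, n=4, method="inclusive")

lower_whisker = min(observations)  # whiskers drawn at minimum and maximum
upper_whisker = max(observations)
interquartile_range = q3 - q1      # height of the box

print(lower_whisker, q1, median_value, q3, upper_whisker)
```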

 


