Descriptive statistics
[MG1:Chp2, p7-p18]
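Summaries of sample data (statistics) are denoted by Roman letters (e.g. x-bar for the sample mean, s for the sample standard deviation)
Summaries of population data (parameters) are denoted by Greek letters (e.g. mu for the population mean, sigma for the population standard deviation)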
Central tendency = The extent to which observations cluster around a central value
Degree of dispersion = The spread of the observations about a central location
Measures of central tendency
- Mode = The most common value
- Median = The middle value
- (Arithmetic) Mean = The average value
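The three measures above can be illustrated with a short sketch. The sample values are hypothetical and chosen only so that the mode, median and mean differ; the functions come from Python's standard statistics module.

```python
import statistics

# Hypothetical sample (not from the source), with one large value
observations = [2, 3, 3, 4, 5, 7, 18]

mode = statistics.mode(observations)      # most common value
median = statistics.median(observations)  # middle value of the sorted data
mean = statistics.mean(observations)      # sum of values divided by their count

print("mode:", mode, "median:", median, "mean:", mean)
```

In this sample the single large value (18) pulls the mean above the median, while the mode and median are unaffected.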
Degree of dispersion
- Range = Difference between the maximum and minimum value
- Percentile = Value below which a given percentage of the ranked observations fall (the ranked observations are divided into 100 equal parts)
* Median = 50th percentile
* Interquartile range = 25th to 75th percentile
- Sample variance = Sum of squares divided by degrees of freedom
* Sum of squares = Sum of the squared differences between each observation and the mean
* Degrees of freedom = Number of observations minus 1
- Population variance = Sum of squares divided by number of observations
- Standard deviation = Square root of variance
- Coefficient of variation (CV) = SD / mean x 100%
NB:
- Degrees of freedom (n - 1) are used when calculating the variance of a sample
* Because each observation is free to vary except for the last one, which must take a defined value in order for the mean to match the fixed sample mean
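The dispersion formulas above can be worked through step by step. This is a minimal sketch on a hypothetical five-value sample; the variable names are illustrative. It also shows why one degree of freedom is lost: the deviations from the sample mean always sum to zero, so the last deviation is fixed by the others.

```python
import math
import statistics

# Hypothetical sample (not from the source)
observations = [4, 6, 8, 10, 12]
n = len(observations)
mean = sum(observations) / n

data_range = max(observations) - min(observations)

# Deviations from the sample mean sum to zero, so only n - 1 of them are free to vary
deviations = [x - mean for x in observations]
sum_of_deviations = sum(deviations)

sum_of_squares = sum(d ** 2 for d in deviations)
sample_variance = sum_of_squares / (n - 1)  # divide by degrees of freedom
population_variance = sum_of_squares / n    # divide by number of observations
sample_sd = math.sqrt(sample_variance)
cv_percent = sample_sd / mean * 100         # coefficient of variation

# Cross-check against the standard library implementations
library_sample_variance = statistics.variance(observations)
library_population_variance = statistics.pvariance(observations)

print(data_range, sample_variance, population_variance, sample_sd, cv_percent)
```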
Sources of variability
- Biological variability
- Measurement imprecision
--> Resulting in random error
- Mistakes or biases in measurement
--> Resulting in systematic error
Standard error (SE)
[MG1:p9]
- Standard error (SE)
= Also known as the standard error of the mean (SEM)
- SE = SD / square root of n
- SE is NOT meant to be used to describe variability of sample data
- SE is a measure of precision, i.e. how well the sample mean estimates the population mean (a population parameter)
* Used to calculate confidence interval
* Often derived from one sample
* Reliability of sample mean in predicting population mean [Chris Flynn]
- SE is the standard deviation of the distribution of sample means (i.e. of the means that would be obtained from repeated samples of the same size)
- Increasing sample size can be a way of reducing SE
* But the sample size must be increased 4 times to reduce SE by half, because SE is inversely proportional to the square root of n
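The relationship between SE and sample size can be checked numerically. This is a minimal sketch; the function name standard_error and the SD of 10 are hypothetical, chosen for illustration.

```python
import math

def standard_error(sd, n):
    """SE = SD divided by the square root of n."""
    return sd / math.sqrt(n)

# Hypothetical SD of 10, with the sample size increased 4 times (25 to 100)
se_n25 = standard_error(10.0, 25)
se_n100 = standard_error(10.0, 100)

print(se_n25, se_n100)
```

Quadrupling n from 25 to 100 halves the SE from 2.0 to 1.0, while the SD of the data itself is unchanged.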
Confidence interval
- Derived from SE
- 95% confidence interval of the mean = sample mean +/- (1.96 x SE)
- 99% confidence interval = sample mean +/- (2.58 x SE)
- Definition of 95% CI
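= The range expected to contain the true population mean with 95% confidence (i.e. if sampling were repeated, about 95% of intervals calculated this way would contain the population mean)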
NB:
- In a normal distribution, 95% of the observations lie within 1.96 standard deviations of the mean
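The confidence interval formulas above can be sketched as a small function. The function name confidence_interval and the example figures (mean 120, SD 15, n = 100) are hypothetical; the multipliers 1.96 and 2.58 are the ones given in these notes.

```python
import math

def confidence_interval(sample_mean, sd, n, z=1.96):
    """Return (lower, upper) = sample mean +/- z x SE, where SE = SD / sqrt(n)."""
    se = sd / math.sqrt(n)
    margin = z * se
    return sample_mean - margin, sample_mean + margin

# Hypothetical sample: mean 120, SD 15, n = 100, so SE = 1.5
ci95 = confidence_interval(120.0, 15.0, 100)          # z = 1.96
ci99 = confidence_interval(120.0, 15.0, 100, z=2.58)  # wider interval

print(ci95, ci99)
```

The 99% interval is wider than the 95% interval because a larger multiple of the same SE is used.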
Frequency distributions
- Kurtosis describes how peaked the distribution is
* Excess kurtosis of a normal distribution = 0 (equivalently, kurtosis = 3 before 3 is subtracted)
- Median is a better measurement of central tendency in a skewed distribution
* Skew to the right (positive skew, long right tail): median will be smaller than the mean
- Bimodal distribution = Distribution with two peaks
--> Suggests that the sample is not homogeneous and may represent two different populations
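The effect of right skew on the mean and median can be illustrated with a hypothetical sample in which one large value forms the right tail. This is only a sketch; the data are invented for illustration.

```python
import statistics

# Hypothetical right-skewed sample: most values are small, one is very large
right_skewed = [1, 2, 2, 3, 3, 3, 4, 30]

mean = statistics.mean(right_skewed)
median = statistics.median(right_skewed)

# Removing the extreme value changes the mean markedly but not the median
without_extreme = right_skewed[:-1]
mean_without = statistics.mean(without_extreme)
median_without = statistics.median(without_extreme)

print(mean, median, mean_without, median_without)
```

The median stays at 3 with or without the extreme value, whereas the mean drops from 6 to about 2.6, which is why the median is preferred for skewed data.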
Normal distribution
- Sometimes referred to as a Gaussian distribution
- Two parameters define the curve, mu (the mean), and sigma (the standard deviation)
- Mode = median = mean
- Formula at [MG1:p13]
NB:
- Mean +/- 1 SD includes 68% of total area
- Mean +/- 1.96 SD includes 95% of total area
- Mean +/- 2 SD includes 95.4% of total area
- Mean +/- 3 SD includes 99.7% of total area
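The areas listed above can be reproduced from the cumulative distribution function of a normal curve. This sketch uses statistics.NormalDist from the Python standard library (Python 3.8 or later); the helper name area_within is hypothetical.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)

def area_within(k):
    """Proportion of the total area lying within mean +/- k SD."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 1.96, 2, 3):
    print(f"mean +/- {k} SD: {area_within(k) * 100:.1f}% of total area")
```

Because the area depends only on the number of SDs from the mean, the same proportions hold for any mu and sigma.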
Z distribution
In a STANDARD normal distribution
- Mean = 0
- Standard deviation = 1
- aka the z distribution
- A z transformation converts any normal distribution curve (with different mean and SD) to a standard normal distribution curve (mean = 0, SD = 1)
* z = (x - mu)/SD
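The z transformation can be sketched as follows. The function name z_score and the three example values are hypothetical; the population SD (statistics.pstdev) is used so that the transformed values have SD exactly 1.

```python
import statistics

def z_score(x, mu, sd):
    """z = (x - mu) / SD."""
    return (x - mu) / sd

# Hypothetical values, treated here as a whole population
values = [85.0, 100.0, 115.0]
mu = statistics.mean(values)
sd = statistics.pstdev(values)

z_values = [z_score(x, mu, sd) for x in values]
z_mean = statistics.mean(z_values)
z_sd = statistics.pstdev(z_values)

# A single observation 2 SD above the mean (mu = 100, SD = 15) has z = 2
z_single = z_score(130.0, 100.0, 15.0)

print(z_values, z_mean, z_sd, z_single)
```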
Central limit theorem
[MG1:p14]
- As the number of observations in each sample increases (n>100)
--> The shape of the sampling distribution of the mean will approximate a normal distribution curve
* Even if the distribution of the variable itself is not normal
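The theorem can be illustrated with a small simulation. This is a sketch under stated assumptions: the underlying variable is drawn from an exponential distribution (clearly not normal, with mean 1 and SD 1), and the sample size, number of samples and random seed are arbitrary choices, not values from the source.

```python
import random
import statistics

random.seed(0)  # fixed seed so that the sketch is repeatable

SAMPLE_SIZE = 100      # observations per sample (n)
NUM_SAMPLES = 2000     # number of repeated samples
POPULATION_MEAN = 1.0  # exponential distribution with rate 1: mean 1, SD 1
POPULATION_SD = 1.0

def draw_sample_mean(n):
    """Draw n observations from a skewed distribution and return their mean."""
    sample = [random.expovariate(1.0) for _ in range(n)]
    return sum(sample) / n

sample_means = [draw_sample_mean(SAMPLE_SIZE) for _ in range(NUM_SAMPLES)]

mean_of_means = statistics.mean(sample_means)
sd_of_means = statistics.stdev(sample_means)
expected_se = POPULATION_SD / SAMPLE_SIZE ** 0.5  # SD / square root of n

# If the sample means are roughly normal, about 95% lie within 1.96 SE of the population mean
proportion_within = sum(
    abs(m - POPULATION_MEAN) <= 1.96 * expected_se for m in sample_means
) / NUM_SAMPLES

print(mean_of_means, sd_of_means, expected_se, proportion_within)
```

The SD of the simulated sample means should come out close to SD / square root of n, which is the standard error.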
Binomial distribution
[MG1:p14-p15]
Formula at [MG1:p15]
A binomial distribution exists if a population contains items which belong to one of two mutually exclusive categories
* e.g. gender, complication
Conditions include:
- Fixed number of observations (trials)
- Only two outcomes are possible
- Trials are independent
- Constant probability for occurrence of each event
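Under the conditions above, the probability of exactly k events in n trials is given by the standard binomial formula, C(n, k) x p^k x (1 - p)^(n - k). The sketch below uses that standard form, which may be laid out differently from the formula at [MG1:p15]; the function name binomial_pmf and the example (10 patients, complication probability 0.1) are hypothetical.

```python
import math

def binomial_pmf(k, n, p):
    """Probability of exactly k events in n independent trials, each with probability p."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

# Hypothetical example: 10 patients, each with a 0.1 probability of a complication
N_TRIALS = 10
P_EVENT = 0.1

p_exactly_two = binomial_pmf(2, N_TRIALS, P_EVENT)
p_at_least_one = 1 - binomial_pmf(0, N_TRIALS, P_EVENT)

# The probabilities of all possible outcomes (0 to n events) sum to 1
total_probability = sum(binomial_pmf(k, N_TRIALS, P_EVENT) for k in range(N_TRIALS + 1))

print(p_exactly_two, p_at_least_one, total_probability)
```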
Poisson distribution
- A binomial distribution approximates a Poisson distribution when
* The number of observations is very large, AND
* Probability of an event is small (<0.05)
- Defined by a single parameter (lambda), which is both the mean and the variance
Conditions:
- Events occur randomly
- Events occur independently
- Events occur uniformly (same probability) and singly
Example used in [MG1:p15] is the calculation of the probability of more than one late-night admission
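A calculation of this kind can be sketched with the standard Poisson formula, P(k) = e^(-lambda) x lambda^k / k!. The value lambda = 0.5 is a hypothetical assumption, not the figure used in [MG1:p15]; the function names are illustrative. The second half compares the result with a binomial distribution that has many trials and a small event probability.

```python
import math

def poisson_pmf(k, lam):
    """Probability of exactly k events when the mean number of events is lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def binomial_pmf(k, n, p):
    """Probability of exactly k events in n independent trials, each with probability p."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

LAMBDA = 0.5  # hypothetical mean number of events per period

# P(more than one event) = 1 - P(0) - P(1)
p_more_than_one = 1 - poisson_pmf(0, LAMBDA) - poisson_pmf(1, LAMBDA)

# Binomial with a large number of trials and a small probability, same mean (n x p = lambda)
N_TRIALS = 1000
P_EVENT = LAMBDA / N_TRIALS
max_gap = max(
    abs(binomial_pmf(k, N_TRIALS, P_EVENT) - poisson_pmf(k, LAMBDA))
    for k in range(6)
)

print(p_more_than_one, max_gap)
```

With these assumed values the two distributions agree closely for every k checked, which is the sense in which one approximates the other.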
Incidence and prevalence
- Incidence = the number of individuals who develop a condition (i.e. new cases) in a given time period
--> An estimation of probability of developing a disease in a specified time period
- Prevalence = the number of individuals with a condition at a point of time (i.e. total cases, pre-existing and new)
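The two definitions can be contrasted with a small worked example. The figures are hypothetical, and expressing each count as a proportion of the population is an assumption made here for illustration (the definitions above are stated as counts).

```python
def incidence_proportion(new_cases, population_at_risk):
    """New cases arising during the period, as a proportion of those at risk."""
    return new_cases / population_at_risk

def point_prevalence(total_cases, population):
    """All cases (pre-existing and new) at one point in time, as a proportion of the population."""
    return total_cases / population

# Hypothetical figures: 20 new cases in one year among 1000 people at risk,
# and 50 people with the condition on a given day in a population of 1000
incidence = incidence_proportion(20, 1000)
prevalence = point_prevalence(50, 1000)

print(incidence, prevalence)
```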
Presentation of data
[MG1:p17]
- For a normal distribution, mean and standard deviation are the best statistics to describe data
* But mean can be affected by extreme values
- A bimodal distribution is best described with mode
- Ordinal data should be described with mode or median
Box and whisker plot
- Used to depict median, interquartile range and range
- Middle line = median
- Box = 25th to 75th percentiles
- Whiskers = minimum and maximum, or 5th and 95th percentiles
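The values drawn in a box and whisker plot can be computed without plotting. This sketch uses statistics.quantiles (Python 3.8 or later) with the "inclusive" method, which is one of several conventions for calculating quartiles and is an assumption here; the data are hypothetical.

```python
import statistics

# Hypothetical data (not from the source)
observations = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# Cut points dividing the ranked data into 4 equal parts: 25th, 50th, 75th percentiles
q1, median, q3 = statistics.quantiles(observations, n=4, method="inclusive")

interquartile_range = q3 - q1       # height of the box
lower_whisker = min(observations)   # whiskers drawn at minimum and maximum
upper_whisker = max(observations)

print(lower_whisker, q1, median, q3, upper_whisker, interquartile_range)
```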