Introduction to Biostatistics
- Statistics is simply a collection of tools that researchers employ to help answer research questions.
MEASURES OF CENTRAL TENDENCY
- A measure of central tendency is a single number used to represent the centre of a set of data.
- The basic measures are the mean, the median, and the mode.
- For any symmetrical, unimodal distribution, the mean, median, and mode will be identical.
- Each measure is designed to represent a typical score.
- The choice of which measure to use depends on:
- the shape of the distribution (whether normal or skewed), and
- the variable’s “level of measurement” (data are nominal, ordinal or interval).
- The mean (or average) is found by adding all the numbers and then dividing by how many numbers you added together.
- Most common measure of central tendency.
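- Formula for calculation of mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$, where $x_i$ are the scores and $n$ is the number of scores.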
- Best for making predictions.
- Applicable under two conditions:
- scores are measured at the interval level, and
- distribution is more or less normal [symmetrical].
- Example: 3 + 4 + 5 + 6 + 7 = 25
- 25 divided by 5 = 5
- The mean is 5
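- The calculation above can be sketched in Python. This is a minimal illustration rather than a reference implementation; the function name `mean` is chosen here for convenience and is not from the notes.

```python
def mean(scores):
    """Arithmetic mean: add all the scores, then divide by how many there are."""
    if not scores:
        raise ValueError("the mean of an empty data set is undefined")
    return sum(scores) / len(scores)


scores = [3, 4, 5, 6, 7]   # the values from the worked example
print(sum(scores))         # 25
print(mean(scores))        # 5.0
```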
- Advantages of mean
- Mathematical center of a distribution.
- Good for interval and ratio data.
- Does not ignore any information.
- Inferential statistics is based on mathematical properties of the mean.
- Disadvantages of mean
- Influenced by extreme scores and skewed distributions.
- May not exist in the data.
- When the numbers are arranged in numerical order, the middle one is the median.
- 50% of observations are above the Median, 50% are below it.
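- Formula: position of the median = $(n + 1)/2$ in the ordered data; with an even number of observations, the median is the average of the two middle values.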
- Arrange in order: 2, 3, 5, 6, 7
- The number in the middle is 5
- The median is 5
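- A minimal Python sketch of the same procedure: sort the scores, then take the middle one. The ordered values are those of the example; the unsorted input order and the even-length list are hypothetical, added only to show the case where the median falls between two observations.

```python
def median(scores):
    """Middle value of the ordered scores (mean of the two middle values if n is even)."""
    ordered = sorted(scores)
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty data set is undefined")
    middle = n // 2
    if n % 2 == 1:
        # Odd n: the value at 1-based position (n + 1) / 2.
        return ordered[middle]
    # Even n: average the two values on either side of the centre.
    return (ordered[middle - 1] + ordered[middle]) / 2


print(median([5, 2, 7, 3, 6]))  # 5
print(median([2, 3, 5, 6]))     # 4.0, a value that does not occur in the data
```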
- Advantages of median
- Not influenced by extreme scores or skewed distributions.
- Good with ordinal data.
- Easier to compute than the mean.
- Considered as the typical observation.
- Disadvantages of median
- May not exist in the data.
- Does not take the actual values of all observations into account.
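- The contrast between the two measures can be seen with a short Python sketch using the standard library's `statistics` module. The extreme score of 95 is a hypothetical value added for illustration.

```python
from statistics import mean, median

scores = [3, 4, 5, 6, 7]
with_outlier = scores + [95]   # hypothetical extreme score

# The mean is pulled strongly toward the extreme score...
print(f"{mean(scores):.1f} -> {mean(with_outlier):.1f}")      # 5.0 -> 20.0

# ...while the median barely moves.
print(f"{median(scores):.1f} -> {median(with_outlier):.1f}")  # 5.0 -> 5.5
```

Neither 20 nor 5.5 occurs in the data set, which illustrates how both measures "may not exist in the data".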
- The number that occurs most frequently is the mode.
- We usually find the mode by creating a frequency distribution in which we count how often each value occurs.
- If we find that every value occurs only once, the distribution has no mode.
- If we find that two or more values are tied as the most common, the distribution has more than one mode.
- The number that occurs most frequently is 7
- The mode is 7
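- A minimal Python sketch of finding the mode from a frequency distribution, including the "no mode" and "more than one mode" cases. The data values are hypothetical (the original data for this example are not given in these notes), and the function name `modes` is an assumption made here.

```python
from collections import Counter


def modes(values):
    """Return the most frequent value(s); an empty list means there is no mode."""
    counts = Counter(values)            # frequency distribution: value -> count
    if not counts:
        return []
    highest = max(counts.values())
    if highest == 1:
        return []                       # every value occurs only once
    return sorted(v for v, c in counts.items() if c == highest)


print(modes([2, 7, 3, 7, 5, 7, 3]))    # [7]
print(modes([1, 2, 3]))                # []
print(modes([1, 1, 2, 2, 3]))          # [1, 2]
```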
- Advantages of mode
- Good with nominal data.
- A bimodal distribution might verify clinical observations (e.g., pre- and post-menopausal breast cancer).
- Easy to compute and understand.
- The score exists in the data set.
- Disadvantages of mode
- Ignores most of the information in a distribution.
- Small samples may not have a mode.
- More than one mode might exist.
Appropriate Measures of Central Tendency
- Nominal variables - Mode
- Ordinal variables - Median
- Interval-level variables - Mean, if the distribution is normal (the median is better with a skewed distribution)
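- The selection rule above can be written as a small lookup. This is only a sketch of the rule of thumb stated in these notes; the function name `recommended_measure` and its `skewed` flag are assumptions made for illustration, and ratio data are treated like interval data.

```python
def recommended_measure(level, skewed=False):
    """Suggest a measure of central tendency from the level of measurement."""
    level = level.lower()
    if level == "nominal":
        return "mode"
    if level == "ordinal":
        return "median"
    if level in ("interval", "ratio"):
        # The mean suits roughly normal data; the median copes better with skew.
        return "median" if skewed else "mean"
    raise ValueError(f"unknown level of measurement: {level!r}")


print(recommended_measure("nominal"))                 # mode
print(recommended_measure("interval"))                # mean
print(recommended_measure("interval", skewed=True))   # median
```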
MEASURES OF VARIABILITY
“If there is no variability within populations there would be no need for statistics.”
- Three indices are used to measure variation or dispersion among scores:
- range,
- variance, and
- standard deviation (Cozby, 2000).
- These indices answer the question: How spread out is the distribution?
- Dispersion/Deviation/Spread tells us a lot about how a variable is distributed.
- The range is the simplest method of examining variation among scores.
- It refers to the difference between the highest and lowest values produced.
- For continuous variables, the range is the arithmetic difference between the highest and lowest observations in the sample. In the case of counts or measurements, 1 should be added to the difference because the range is inclusive of the extreme observations.
- Another statistic, known as the interquartile range, describes the interval of scores bounded by the 25th and 75th percentiles, that is, the range covered by the middle 50 percent of the distribution.
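- A minimal Python sketch of the range, with an optional flag for the "add 1" convention described above. The function name `value_range` and the `inclusive` parameter are assumptions made for illustration; the scores are reused from the mean example.

```python
def value_range(scores, inclusive=False):
    """Difference between the highest and lowest scores.

    With inclusive=True, 1 is added so that both extreme
    observations are counted, as described for counts.
    """
    if not scores:
        raise ValueError("the range of an empty data set is undefined")
    spread = max(scores) - min(scores)
    return spread + 1 if inclusive else spread


scores = [3, 4, 5, 6, 7]
print(value_range(scores))                  # 4
print(value_range(scores, inclusive=True))  # 5
```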
Percentiles (or quartiles)
- The first quartile is the 25th percentile (denoted Q1),
- the median value is the 50th percentile (denoted Median), and
- the third quartile is the 75th percentile (denoted Q3).
- “A percentile is a value at or below which a given percentage or fraction of the variable values lie.
- The p-th percentile is the value that has p% of the measurements below it and (100-p)% above it.
- Thus, the 20th percentile is the value such that one fifth of the data lie below it. It is higher than 20% of the data values and lower than 80% of the data values.”
- E.g., if you are in the 80th percentile on a section of the GMAT, you scored better on that section than 80% of the students taking the GMAT.
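- Quartiles, the interquartile range, and the percentage of scores below a given value can be sketched in Python as follows. The exam scores are hypothetical, and `percent_below` is a helper name introduced here. Several conventions exist for computing percentiles from a sample; this sketch uses the "inclusive" interpolation method of the standard library's `statistics.quantiles`, and other methods may give slightly different values.

```python
from statistics import quantiles


def percent_below(scores, value):
    """Percentage of scores that lie strictly below the given value."""
    if not scores:
        raise ValueError("no scores given")
    return 100 * sum(1 for s in scores if s < value) / len(scores)


# Hypothetical exam scores, invented for illustration.
scores = [52, 55, 60, 61, 64, 68, 70, 75, 90]

# n=4 splits the data into quarters and returns the three cut points.
q1, q2, q3 = quantiles(scores, n=4, method="inclusive")
print(q1, q2, q3)   # 60.0 64.0 70.0  (Q1, median, Q3)
print(q3 - q1)      # 10.0  (interquartile range)

# A score of 75 is higher than about 78% of the scores in this sample.
print(round(percent_below(scores, 75), 1))  # 77.8
```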
- The standard deviation is the most widely applied measure of variability.
- It shows how much variation there is from the "average" (mean).
- Large standard deviations suggest that scores are probably widely scattered.
- Small standard deviations suggest that there is very little difference among scores.
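- Computational formula for S.D.: $\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}$, where $\mu$ is the population mean and $N$ is the number of values.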
Example: (Adapted from Wikipedia)
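- Consider a population consisting of the following values: 2, 4, 4, 4, 5, 5, 7, 9
- There are eight data points in total, with a mean (or average) value of 5: $\mu = \frac{2 + 4 + 4 + 4 + 5 + 5 + 7 + 9}{8} = \frac{40}{8} = 5$
- To calculate the population standard deviation, first compute the difference of each data point from the mean, and square the result: $(2-5)^2 = 9$, $(4-5)^2 = 1$, $(4-5)^2 = 1$, $(4-5)^2 = 1$, $(5-5)^2 = 0$, $(5-5)^2 = 0$, $(7-5)^2 = 4$, $(9-5)^2 = 16$
- Next divide the sum of these values by the number of values and take the square root to give the standard deviation: $\sigma = \sqrt{\frac{9 + 1 + 1 + 1 + 0 + 0 + 4 + 16}{8}} = \sqrt{\frac{32}{8}} = \sqrt{4} = 2$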
- Therefore, the above population has a standard deviation of 2.
- The square of the standard deviation is the variance.
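- The steps of the example can be sketched in Python. The eight values used here are assumed to be those of the Wikipedia example that these notes adapt (2, 4, 4, 4, 5, 5, 7, 9); they agree with the stated mean of 5 and population standard deviation of 2. The function names are chosen for illustration.

```python
import math


def population_variance(values):
    """Mean of the squared differences from the mean (divides by N)."""
    n = len(values)
    if n == 0:
        raise ValueError("the variance of an empty data set is undefined")
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n


def population_sd(values):
    """Population standard deviation: the square root of the variance."""
    return math.sqrt(population_variance(values))


values = [2, 4, 4, 4, 5, 5, 7, 9]    # assumed data, see the note above
print(sum(values) / len(values))     # 5.0  (mean)
print(population_variance(values))   # 4.0  (variance)
print(population_sd(values))         # 2.0  (standard deviation)
```

This sketch divides by N, which is the population form used in the example. For a sample drawn from a larger population, the sum of squares is commonly divided by N - 1 instead.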