Descriptive Statistics

CSC 223 - Advanced Scientific Programming

Overview

  • Statistics is the science of collecting, organizing, analyzing, and interpreting data in order to make decisions.

  • Data consist of information coming from observations, counts, measurements, or responses.

  • A population is the collection of all outcomes, responses, measurements, or counts that are of interest.

  • A sample is a subset of the population.

  • A parameter is a numerical description of a population characteristic.

  • A statistic is a numerical description of a sample characteristic.
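The distinction between a parameter and a statistic can be illustrated with a short Python sketch. The population below is made up (1000 randomly generated scores), and the sample size of 30 and the variable names are assumptions chosen only for illustration.

```python
import random
import statistics

random.seed(223)  # fixed seed so the sketch is repeatable

# Hypothetical population: 1000 exam scores between 50 and 100.
population = [random.randint(50, 100) for _ in range(1000)]

# A sample is a subset of the population (drawn here without replacement).
sample = random.sample(population, 30)

mu = statistics.mean(population)   # parameter: describes the whole population
x_bar = statistics.mean(sample)    # statistic: describes only the sample

print(f"population mean (parameter): {mu:.2f}")
print(f"sample mean (statistic):     {x_bar:.2f}")
```

The two means are usually close but not equal, which is why inferential statistics is needed to reason from a sample back to the population.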

Branches of Statistics

  • Descriptive statistics is the branch of statistics that involves the organization, summarization, and display of data.

  • Inferential statistics is the branch of statistics that involves using a sample to draw conclusions about a population. A basic tool in the study of inferential statistics is probability.

Data Classification

  • Types of data:

    • Qualitative data consist of attributes, labels, or nonnumerical entries.

    • Quantitative data consist of numerical measurements or counts.

  • Levels of measurement:

    • Nominal: categorized using names, labels, or qualities.

    • Ordinal: can be arranged in order or ranked.

    • Interval: can be ordered and meaningful differences between entries can be calculated.

    • Ratio: similar to interval, but there is a zero entry that is an inherent zero (implies none).

Measures of Central Tendency

  • The mean of a data set is the sum of the data entries divided by the number of entries.

    • Population mean: \[\mu = \frac{\sum x}{N}\]

    • Sample mean: \[\bar{x} = \frac{\sum x}{n}\]

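  • The median of a data set is the value that lies in the middle of the data when the entries are in sorted order. If the data set has an even number of entries, the median is the mean of the two middle entries.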

  • The mode of a data set is the data entry that occurs with the greatest frequency.

  • An outlier is a data entry that is far removed from the other entries in the data set.

  • A weighted mean is the mean of a data set whose entries have varying weights. A weighted mean is given by: \[\bar{x} = \frac{\sum x \cdot w}{\sum w}\] where \(w\) is the weight of each entry \(x\).
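The measures of central tendency above can be sketched in Python with the standard library `statistics` module. The quiz scores, the outlier value, the grade weights, and the helper name `weighted_mean` are all hypothetical and chosen only to keep the arithmetic easy to check by hand.

```python
import statistics


def weighted_mean(values, weights):
    """Return sum(x * w) / sum(w) for paired entries and weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(x * w for x, w in zip(values, weights)) / total_weight


# Hypothetical data set: quiz scores for seven students.
scores = [7, 9, 6, 9, 8, 10, 7]
mean_score = statistics.mean(scores)      # sum of entries / number of entries
median_score = statistics.median(scores)  # middle entry of the sorted data
modes = statistics.multimode(scores)      # every entry tied for greatest frequency

# Adding one outlier moves the mean far more than the median.
with_outlier = scores + [100]
mean_with_outlier = statistics.mean(with_outlier)
median_with_outlier = statistics.median(with_outlier)

# Hypothetical weighted grade: homework, midterm, and final exam scores
# of 90, 80, and 70 with integer weights 2, 3, and 5.
course_grade = weighted_mean([90, 80, 70], [2, 3, 5])

print(mean_score, median_score, modes)
print(mean_with_outlier, median_with_outlier)
print(course_grade)
```

Note that `statistics.multimode` returns a list because a data set may have more than one entry with the greatest frequency.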

Measures of Variation

  • The range of a data set is the difference between the maximum and minimum data entries in the set.

  • The deviation of an entry \(x\) in a population data set is the difference between the entry and the mean \(\mu\) of the data set. \[\text{Deviation of x} = x - \mu\]

  • The population variance of a population data set of \(N\) entries is \[\text{Population variance} = \sigma^2 = \frac{\sum (x - \mu)^2}{N}\] where the symbol \(\sigma\) is the lowercase Greek letter sigma.

  • The population standard deviation of a population data set of \(N\) entries is the square root of the population variance \[\sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum (x - \mu)^2}{N}}\]
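As a rough illustration of the range, deviation, population variance, and population standard deviation, the sketch below uses Python's standard library `statistics` module. The eight-entry population is made up so that the results are whole numbers.

```python
import statistics

# Hypothetical population of eight entries.
population = [2, 4, 4, 4, 5, 5, 7, 9]

data_range = max(population) - min(population)    # maximum minus minimum
mu = statistics.mean(population)                  # population mean
deviations = [x - mu for x in population]         # x - mu for each entry
sigma_squared = statistics.pvariance(population)  # sum of squares divided by N
sigma = statistics.pstdev(population)             # square root of the variance

print(data_range, mu, deviations)
print(sigma_squared, sigma)
```

The deviations always sum to zero, which is why they are squared before being averaged.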

Finding Population Variance and Standard Deviation

1. Find the mean of the population data set. \(\mu = \frac{\sum x}{N}\)
2. Find the deviation of each entry. \(x - \mu\)
3. Square each deviation. \((x - \mu)^2\)
4. Add the squared deviations to get the sum of squares. \(SS_x = \sum (x - \mu)^2\)
5. Divide by \(N\) to get the population variance. \(\sigma^2 = \frac{\sum (x - \mu)^2}{N}\)
6. Find the square root of the variance to get the population standard deviation. \(\sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}}\)
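The six steps above can be written out by hand in Python, one line per step. This is a minimal sketch: the function name `population_variance_and_std` and the five-entry population are hypothetical, and in practice `statistics.pvariance` and `statistics.pstdev` give the same results.

```python
import math


def population_variance_and_std(data):
    """Follow the six steps for a population data set; return (variance, std)."""
    n_entries = len(data)
    if n_entries == 0:
        raise ValueError("data must contain at least one entry")
    mu = sum(data) / n_entries              # step 1: population mean
    deviations = [x - mu for x in data]     # step 2: deviation of each entry
    squared = [d ** 2 for d in deviations]  # step 3: square each deviation
    sum_of_squares = sum(squared)           # step 4: sum of squares
    variance = sum_of_squares / n_entries   # step 5: divide by N
    std_dev = math.sqrt(variance)           # step 6: square root of the variance
    return variance, std_dev


# Hypothetical population of five entries.
variance, std_dev = population_variance_and_std([1, 3, 5, 7, 9])
print(variance, std_dev)
```

For this population the mean is 5, the squared deviations are 16, 4, 0, 4, and 16, and the sum of squares is 40, so the variance is 40 / 5 = 8.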

Measures of Variation

  • The sample variance and sample standard deviation of a sample data set of \(n\) entries are \[\text{Sample variance} = s^2 = \frac{\sum (x - \bar{x})^2}{n-1}\] \[\text{Sample standard deviation} = s = \sqrt{\frac{\sum (x - \bar{x})^2}{n-1}}\]
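The only change from the population formulas is the divisor, \(n - 1\) instead of \(N\). The sketch below computes the sample variance and sample standard deviation by hand and compares them with `statistics.variance` and `statistics.stdev` from the Python standard library. The five-entry sample is hypothetical.

```python
import math
import statistics

# Hypothetical sample of five entries.
sample = [1, 3, 5, 7, 9]

n = len(sample)
x_bar = sum(sample) / n                                 # sample mean
sum_of_squares = sum((x - x_bar) ** 2 for x in sample)  # sum of squared deviations
s_squared = sum_of_squares / (n - 1)                    # sample variance
s = math.sqrt(s_squared)                                # sample standard deviation

# The library functions use the same n - 1 divisor.
library_s_squared = statistics.variance(sample)
library_s = statistics.stdev(sample)

# Treating the same entries as a population divides by N instead.
population_sigma_squared = statistics.pvariance(sample)

print(s_squared, library_s_squared, population_sigma_squared)
print(s, library_s)
```

For the same entries, dividing by \(n - 1\) gives a larger variance than dividing by \(N\) (here 10 versus 8).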

Measures of Variation Symbols

Quantity             Population               Sample
Variance             \(\sigma^2\)             \(s^2\)
Standard deviation   \(\sigma\)               \(s\)
Mean                 \(\mu\)                  \(\bar{x}\)
Number of entries    \(N\)                    \(n\)
Deviation            \(x - \mu\)              \(x - \bar{x}\)
Sum of squares       \(\sum (x - \mu)^2\)     \(\sum (x - \bar{x})^2\)