Descriptive Statistics

CSC 223 - Advanced Scientific Programming

Statistics is the science of collecting, organizing, analyzing, and interpreting data in order to make decisions.
Data consists of information coming from observations, counts, measurements, or responses.
A population is the collection of all outcomes, responses, measurements, or counts that are of interest.
A sample is a subset of the population.
A parameter is a numerical description of a population characteristic.
A statistic is a numerical description of a sample characteristic.

Descriptive statistics is the branch of statistics that involves the organization, summarization, and display of data.
Inferential statistics is the branch of statistics that involves using a sample to draw conclusions about a population. A basic tool in the study of inferential statistics is probability.

Types of data:
- Qualitative data consist of attributes, labels, or nonnumerical entries.
- Quantitative data consist of numerical measurements or counts.

Levels of measurement:
- Nominal: categorized using names, labels, or qualities.
- Ordinal: can be arranged in order or ranked.
- Interval: can be ordered and meaningful differences between entries can be calculated.
- Ratio: similar to interval, but a zero entry is an inherent zero (it implies "none"), so ratios of entries are also meaningful.

The mean of a data set is the sum of the data entries divided by the number of entries.
- Population mean: \[\mu = \frac{\sum x}{N}\]
- Sample mean: \[\bar{x} = \frac{\sum x}{n}\]
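The median of a data set is the value that lies in the middle of the data when the entries are in sorted order. If the data set has an even number of entries, the median is the mean of the two middle entries.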
The mode of a data set is the data entry that occurs with the greatest frequency.
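
As a minimal sketch of the three measures above, the following uses Python's standard `statistics` module alongside a direct computation of the mean. The list `data` is a hypothetical data set invented for illustration; it is not taken from the course material.

```python
import statistics

# Hypothetical data set, invented for illustration only.
data = [2, 3, 3, 5, 7, 10]

# Mean: sum of the entries divided by the number of entries.
mean_by_hand = sum(data) / len(data)
mean = statistics.mean(data)

# Median: middle value of the sorted data. With an even number of
# entries, statistics.median averages the two middle entries (3 and 5).
median = statistics.median(data)

# Mode: the entry that occurs with the greatest frequency (3 occurs twice).
mode = statistics.mode(data)

print("mean:", mean)
print("median:", median)
print("mode:", mode)
```

Note that `statistics.mode` returns a single value; when several entries tie for the greatest frequency, `statistics.multimode` returns all of them.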

An outlier is a data entry that is far removed from the other entries in the data set.
A weighted mean is the mean of a data set whose entries have varying weights. A weighted mean is given by: \[\bar{x} = \frac{\sum x \cdot w}{\sum w}\] where \(w\) is the weight of each entry \(x\).
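
The weighted mean formula can be sketched in a few lines of Python. The function name `weighted_mean` and the example scores and weights below are assumptions made for illustration only.

```python
def weighted_mean(values, weights):
    """Return sum(x * w) / sum(w) for paired values and weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("the weights must not sum to zero")
    return sum(x * w for x, w in zip(values, weights)) / total_weight


# Hypothetical example: three scores that count 5, 2, and 3 parts of a grade.
scores = [86, 96, 82]
weights = [5, 2, 3]

# (86*5 + 96*2 + 82*3) / (5 + 2 + 3) = 868 / 10
result = weighted_mean(scores, weights)
print(result)
```

When every weight is equal, the result reduces to the ordinary mean of the entries.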

The range of a data set is the difference between the maximum and minimum data entries in the set.
The deviation of an entry \(x\) in a population data set is the difference between the entry and the mean \(\mu\) of the data set. \[\text{Deviation of } x = x - \mu\]
The population variance of a population data set of \(N\) entries is \[\text{Population variance} = \sigma^2 = \frac{\sum (x - \mu)^2}{N}\] where the symbol \(\sigma\) is the lowercase Greek letter sigma.

The population standard deviation of a population data set of \(N\) entries is the square root of the population variance \[\sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum (x - \mu)^2}{N}}\]

To find the population variance and standard deviation:
1. Find the mean of the population data set.	\(\mu = \frac{\sum x}{N}\)
2. Find the deviation of each entry.	\(x - \mu\)
3. Square each deviation.	\((x - \mu)^2\)
4. Add to get the sum of squares.	\(SS_x = \sum (x - \mu)^2\)
5. Divide by \(N\) to get the population variance.	\(\sigma^2 = \frac{\sum (x - \mu)^2}{N}\)
6. Find the square root of the variance to get the population standard deviation.	\(\sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}}\)
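
The six steps above can be sketched directly in Python. The data set `population` is hypothetical and chosen so the arithmetic is easy to follow by hand; the final lines cross-check the result against `statistics.pvariance` and `statistics.pstdev` from the standard library.

```python
import math
import statistics

# Hypothetical population data set, invented for illustration only.
population = [1, 3, 5, 7]
N = len(population)

# Step 1: mean of the population data set.
mu = sum(population) / N                      # 16 / 4 = 4.0

# Step 2: deviation of each entry from the mean.
deviations = [x - mu for x in population]     # [-3.0, -1.0, 1.0, 3.0]

# Step 3: square each deviation.
squared = [d ** 2 for d in deviations]        # [9.0, 1.0, 1.0, 9.0]

# Step 4: add to get the sum of squares.
ss_x = sum(squared)                           # 20.0

# Step 5: divide by N to get the population variance.
variance = ss_x / N                           # 5.0

# Step 6: square root of the variance gives the standard deviation.
sigma = math.sqrt(variance)

# Cross-check against the standard library.
assert math.isclose(variance, statistics.pvariance(population))
assert math.isclose(sigma, statistics.pstdev(population))

print("variance:", variance)
print("standard deviation:", sigma)
```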

The sample variance and sample standard deviation of a sample data set of \(n\) entries are \[\text{Sample variance} = s^2 = \frac{\sum (x - \bar{x})^2}{n-1}\] \[\text{Sample standard deviation} = s = \sqrt{\frac{\sum (x - \bar{x})^2}{n-1}}\]
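
For comparison with the population formulas, here is a minimal sketch of the sample versions, which divide the sum of squares by \(n - 1\) instead of \(N\). The data set `sample` is hypothetical; `statistics.variance` and `statistics.stdev` are the standard-library functions that use the \(n - 1\) divisor.

```python
import math
import statistics

# Hypothetical sample data set, invented for illustration only.
sample = [1, 3, 5, 7]
n = len(sample)

# Sample mean.
x_bar = sum(sample) / n

# Sum of squared deviations from the sample mean.
ss_x = sum((x - x_bar) ** 2 for x in sample)

# Sample variance and standard deviation divide by n - 1, not n.
s_squared = ss_x / (n - 1)
s = math.sqrt(s_squared)

# Cross-check against the standard library.
assert math.isclose(s_squared, statistics.variance(sample))
assert math.isclose(s, statistics.stdev(sample))

print("sample variance:", s_squared)
print("sample standard deviation:", s)
```

For the same entries, the sample variance is larger than the population variance because the divisor is smaller.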