Expert reviewed • 22 November 2024 • 18 minute read
Note:
Video coming soon!
A sample is a subset of a larger group that is used to represent and analyse the characteristics of that population. When displaying data, samples are used to make assumptions about the general population, without having to collect data from every individual. The sample should ideally be representative of the population to ensure accurate generalisations.
Grouping data involves organising raw numeric data, discrete or continuous, into classes or intervals to make it easier to analyse and interpret. This is particularly useful when dealing with large datasets. It is an essential step that enhances the clarity, efficiency and effectiveness of mathematical calculations and statistical analysis.
Class intervals divide a dataset into non-overlapping groups or ranges. Each interval contains a subset of the data values, allowing us to count how many data points fall within each range. As such, it can be highly useful when we are required to create frequency distributions and histograms.
The following points, are components of class intervals:
The following example presents a dataset and a corresponding frequency table that refers to its class intervals and their components. The point of creating a table like this is to help us see the bigger picture and give us a better understanding of the information, to assist us in answering a given question.
The dataset below explores the ages of 30 people that are employed to work for a shoe store.
Interval | Class Centre | Frequency | Cumulative Frequency |
---|---|---|---|
10-14 | 12 | 1 | 1 |
15-19 | 17 | 5 | 6 |
20-24 | 22 | 6 | 12 |
25-29 | 27 | 7 | 19 |
30-34 | 32 | 4 | 23 |
35-39 | 37 | 5 | 28 |
40-44 | 42 | 2 | 30 |
45-49 | 47 | 1 | 31 |
50-54 | 52 | 1 | 32 |
A frequency histogram is a type of bar graph that represents the frequency distribution of a dataset. It displays the frequency of each class interval or range of values occurring in each dataset. An example of a frequency histogram is displayed below:
where,
In the graph each bar represents a seperate class interval. They must be shown to be touching to indicate that the data is continuous.
A frequency polygon is a line graph that represents the frequencies of the data. It is often used in conjunction with histograms, where the graph includes bars and lines. The bars represent the class intervals. The line (a feature of the polygon) travels to the midpoint of each bar. This means that the line travels to the class centre of each group. An example of a frequency polygon is displayed below:
A cumulative histogram is a histogram that shows the cumulative frequency distribution of a dataset. Instead of displaying the frequency of each class interval, it displays the cumulative frequency up to and including each class interval. Thus, the only graphical difference from a frequency histogram is seen on the y-axis. In this case, it measures the cumulative frequency rather than the frequency of each class. These graphs still use bars to represent each class.
A cumulative frequency polygon is an expansion on the graphical representation of a cumulative frequency distribution. It is similar to a frequency polygon but uses cumulative frequencies instead of individual class frequencies. This graph still uses bars and lines to represent the class intervals. The following graph shows the aspects of a cumulative frequency polygon and cumulative histogram.
where,
As seen in the graph, the bars continuously get larger in value, indicating that cumulative frequency of each class being added together.
A Pareto chart is a type of graph that combines both a bar graph and a line graph. This is seen as a frequency histogram arranged in descending order, combined with a cumulative frequency polygon. An example of a Pareto chart is displayed below.
The mean of a sample provides an average value of the data points in the sample, indicating where the centre of the data distribution lies. This is different to calculating the mean of a distribution, which will be explored in the following module.
To calculate the mean of a sample, we first need the values of the sample. For example, the frequency table below provides information to calculate the mean of a sample.
Interval | Class Centre (x) | Frequency (f) | Sum (Σfx) |
---|---|---|---|
10-14 | 12 | 1 | 12 |
15-19 | 17 | 5 | 85 |
20-24 | 22 | 6 | 132 |
25-29 | 27 | 7 | 189 |
30-34 | 32 | 4 | 128 |
35-39 | 37 | 5 | 185 |
40-44 | 42 | 2 | 84 |
45-49 | 47 | 1 | 47 |
50-54 | 52 | 1 | 52 |
Total | 32 | 914 |
The formula used to calculate the mean of this sample is thus show below:
Where,
For the information provided in the table above, calculate the mean of the provided sample.
Looking at the formula above, we can see that it is easy to calculate the mean. We must first however, determine all the variables in the formula using the table above.
Finding the sum of products of class centres and frequencies:
As we can see from the table, the total number of frequencies is:
Now, we can calculate the mean using the given formula
The mean of the sample is approximately
The variance of a sample, represents the average of the squared differences between each data point and the sample mean. Variance is essential for understanding how much the data points differ from the mean and from each other.
The sample variance is denoted by and is calculated using the following formula:
The standard deviation of a sample quantifies how spread out the data points are around the mean. A low standard deviation indicates that the data points are close to the mean, while a high standard deviation indicates that the data points are spread over a larger range. It is found by taking the square root of the variance.
Using the frequency table provided, calculate the variance of the sample. Use the mean found in the previous example to assist your calculations.
Interval | Class Centre (x) | Frequency (f) | Sum (Σfx) | Sum (Σfx²) |
---|---|---|---|---|
10-14 | 12 | 1 | 12 | 144 |
15-19 | 17 | 5 | 85 | 1445 |
20-24 | 22 | 6 | 132 | 2904 |
25-29 | 27 | 7 | 189 | 5103 |
30-34 | 32 | 4 | 128 | 4096 |
35-39 | 37 | 5 | 185 | 6845 |
40-44 | 42 | 2 | 84 | 3528 |
45-49 | 47 | 1 | 47 | 2209 |
50-54 | 52 | 1 | 52 | 2704 |
Total | 32 | 914 | 29058 |
Seen from the previous example question, we know that the mean of this sample is approximately . Additionally from the table, we know that the sum of the class centre squared, multiplied by the frequency is . The total number of frequencies is .