Calculating and Graphing Grouped Data

Expert reviewed 22 November 2024 18 minute read


  • classify data relating to a single random variable
  • organise, interpret and display data into appropriate tabular and/or graphical representations including Pareto charts, cumulative frequency distribution tables or graphs, parallel box-plots and two-way tables
    • compare the suitability of different methods of data presentation in real-world contexts
  • summarise and interpret grouped and ungrouped data through appropriate graphs and summary statistics

What is a Sample?

A sample is a subset of a population that is used to represent and analyse the characteristics of that population. Samples allow us to make inferences about the whole population without having to collect data from every individual. A sample should be representative of the population so that these generalisations are accurate.

What is the Importance of Grouping Data?

Grouping data involves organising raw numeric data, discrete or continuous, into classes or intervals to make it easier to analyse and interpret. This is particularly useful when dealing with large datasets. It is an essential step that enhances the clarity, efficiency and effectiveness of mathematical calculations and statistical analysis.

What are Class Intervals and Their Components?

Class intervals divide a dataset into non-overlapping groups or ranges. Each interval contains a subset of the data values, allowing us to count how many data points fall within each range. This makes class intervals useful when creating frequency distributions and histograms.

The components of a class interval are:

  • Lower Class Limit: The smallest value that can belong to a class interval.
  • Upper Class Limit: The largest value that can belong to a class interval.
  • Class Width: The difference between the lower limits of two consecutive classes. Equivalently, it is the difference between the upper and lower boundaries of the same class.
$$\text{Class Width} = \text{Upper Boundary} - \text{Lower Boundary}$$
  • Class Midpoint: The average of the upper and lower limits of a class.
$$\text{Class Midpoint} = \frac{\text{Lower Limit} + \text{Upper Limit}}{2}$$
  • Class Boundaries: The points that separate adjacent classes so that there are no gaps between them. For example, the classes 10-14 and 15-19 share the boundary 14.5, which lies halfway between the upper limit of one class and the lower limit of the next.
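
The components above can be sketched in a few lines of Python. This is a minimal illustration for a single class such as 10-14, assuming whole-number data so that adjacent classes are one unit apart. The function name `class_components` and the `gap` parameter are illustrative choices, not part of the original notes.

```python
def class_components(lower_limit, upper_limit, gap=1):
    """Return the midpoint, boundaries and width of one class interval.

    `gap` is the distance between the upper limit of this class and the
    lower limit of the next class (assumed to be 1 for whole-number data).
    """
    midpoint = (lower_limit + upper_limit) / 2
    lower_boundary = lower_limit - gap / 2
    upper_boundary = upper_limit + gap / 2
    width = upper_boundary - lower_boundary
    return {
        "midpoint": midpoint,
        "boundaries": (lower_boundary, upper_boundary),
        "width": width,
    }


components = class_components(10, 14)
print(components)
# {'midpoint': 12.0, 'boundaries': (9.5, 14.5), 'width': 5.0}
```

Note that the width is 5, not 4, because the class 10-14 covers the five whole numbers 10, 11, 12, 13 and 14.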

The following example presents a dataset and a corresponding frequency table showing its class intervals and their components. A table like this summarises the data so that patterns are easier to see when answering a given question.

The dataset below lists the ages of 30 people employed at a shoe store.

12, 15, 16, 17, 18, 19, 20, 21,
22, 23, 23, 24, 25, 25, 26, 27,
27, 28, 29, 30, 32, 33, 35, 36,
37, 38, 40, 42, 45, 50

| Interval | Class Centre | Frequency | Cumulative Frequency |
|----------|--------------|-----------|----------------------|
| 10-14    | 12           | 1         | 1                    |
| 15-19    | 17           | 5         | 6                    |
| 20-24    | 22           | 6         | 12                   |
| 25-29    | 27           | 7         | 19                   |
| 30-34    | 32           | 3         | 22                   |
| 35-39    | 37           | 4         | 26                   |
| 40-44    | 42           | 2         | 28                   |
| 45-49    | 47           | 1         | 29                   |
| 50-54    | 52           | 1         | 30                   |
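
As a rough sketch of how such a table can be built, the Python below groups the 30 listed ages into classes of width 5 starting at 10, then accumulates the frequencies. The variable names are illustrative. A useful check is that the final cumulative frequency equals the number of data values.

```python
from itertools import accumulate

ages = [
    12, 15, 16, 17, 18, 19, 20, 21,
    22, 23, 23, 24, 25, 25, 26, 27,
    27, 28, 29, 30, 32, 33, 35, 36,
    37, 38, 40, 42, 45, 50,
]

class_width = 5
first_lower_limit = 10

# Count how many ages fall in each class; index 0 is the class 10-14.
num_classes = (max(ages) - first_lower_limit) // class_width + 1
frequencies = [0] * num_classes
for age in ages:
    frequencies[(age - first_lower_limit) // class_width] += 1

# Each cumulative frequency is the running total of the frequencies so far.
cumulative = list(accumulate(frequencies))

rows = []
for i, (f, cf) in enumerate(zip(frequencies, cumulative)):
    lower = first_lower_limit + i * class_width
    upper = lower + class_width - 1
    centre = (lower + upper) // 2
    rows.append((f"{lower}-{upper}", centre, f, cf))
    print(f"{lower}-{upper}  centre={centre}  f={f}  cf={cf}")

# The last cumulative frequency should equal the number of data values.
print(cumulative[-1] == len(ages))  # True
```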

Histograms and Polygons

A frequency histogram is a type of bar graph that represents the frequency distribution of a dataset. It displays the frequency of each class interval (range of values) in the dataset. An example of a frequency histogram is displayed below:

[Figure: example frequency histogram]

where,

  • $f$ (the y-axis) is the frequency of each class interval
  • $x$ (the x-axis) is the class interval

In the graph, each bar represents a separate class interval. The bars must touch, with no gaps between them, to indicate that the data is continuous.

A frequency polygon is a line graph that represents the frequencies of the data. It is often drawn on top of a histogram, so that the graph includes both bars and a line. The bars represent the class intervals, and the line joins the midpoint of the top of each bar. This means that each point on the line sits above the class centre of its group. An example of a frequency polygon is displayed below:

[Figure: example frequency polygon drawn over a frequency histogram]

A cumulative histogram is a histogram that shows the cumulative frequency distribution of a dataset. Instead of displaying the frequency of each class interval, it displays the cumulative frequency up to and including each class interval. Thus, the only graphical difference from a frequency histogram is seen on the y-axis. In this case, it measures the cumulative frequency rather than the frequency of each class. These graphs still use bars to represent each class.

A cumulative frequency polygon (also called an ogive) is a line graph of a cumulative frequency distribution. It is similar to a frequency polygon but uses cumulative frequencies instead of individual class frequencies. When drawn over a cumulative histogram, the line starts at the bottom-left corner of the first bar and joins the top-right corner of each bar, because each cumulative frequency is only reached at the upper end of its class. The following graph shows a cumulative histogram together with a cumulative frequency polygon.

[Figure: cumulative histogram with cumulative frequency polygon]

where,

  • $cf$ (the y-axis) is the cumulative frequency up to and including each class interval
  • $x$ (the x-axis) is the class interval

As seen in the graph, the bars never decrease in height, because each bar adds the frequency of its own class to the total of all the previous classes.

What is a Pareto Chart?

A Pareto chart is a type of graph that combines a bar graph and a line graph. The bars show the frequencies arranged in descending order, and the line is a cumulative frequency polygon, often expressed as a percentage of the total. An example of a Pareto chart is displayed below.

[Figure: example Pareto chart]
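
The two ingredients of a Pareto chart, the descending order of the bars and the cumulative line, can be sketched as follows. The categories and counts here are hypothetical, invented purely for illustration, and the variable names are not from the original notes.

```python
from itertools import accumulate

# Hypothetical counts of reasons for returned shoes (illustrative data only).
return_reasons = {
    "Wrong size": 18,
    "Changed mind": 9,
    "Faulty": 14,
    "Wrong colour": 5,
    "Other": 4,
}

# Bars: sort the categories from most frequent to least frequent.
ordered = sorted(return_reasons.items(), key=lambda item: item[1], reverse=True)

# Line: running total of the sorted counts, as a percentage of the total.
total = sum(return_reasons.values())
running_totals = list(accumulate(count for _, count in ordered))

pareto_rows = [
    (reason, count, round(100 * running / total, 1))
    for (reason, count), running in zip(ordered, running_totals)
]
for reason, count, cumulative_percent in pareto_rows:
    print(f"{reason:<13}{count:>3}  {cumulative_percent:>5}%")
```

In this made-up example, the first two categories account for 64% of all returns, which is the kind of pattern a Pareto chart is designed to highlight.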

How to Calculate the Mean of a Sample

The mean of a sample provides an average value of the data points in the sample, indicating where the centre of the data distribution lies. This is different to calculating the mean of a distribution, which will be explored in the following module.

To calculate the mean of a sample, we first need the values of the sample. For example, the frequency table below summarises a sample of 32 observations and provides the information needed to calculate its mean.

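| Interval | Class Centre ($x$) | Frequency ($f$) | $xf$ |
|----------|--------------------|-----------------|------|
| 10-14    | 12                 | 1               | 12   |
| 15-19    | 17                 | 5               | 85   |
| 20-24    | 22                 | 6               | 132  |
| 25-29    | 27                 | 7               | 189  |
| 30-34    | 32                 | 4               | 128  |
| 35-39    | 37                 | 5               | 185  |
| 40-44    | 42                 | 2               | 84   |
| 45-49    | 47                 | 1               | 47   |
| 50-54    | 52                 | 1               | 52   |
| Total    |                    | 32              | 914  |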

The formula used to calculate the mean of this sample is shown below:

$$\overline{x}=\frac{\sum xf}{n}$$

Where,

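  • $\overline{x}$ is the sample mean,
  • $x$ is the class centre of each interval (or each individual data point for ungrouped data),
  • $f$ is the frequency of each class,
  • $n$ is the total number of observations in the sample (this is the sum of the frequencies),
  • $\sum xf$ is the sum of the products of each class centre and its frequency.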

Practice Question 1

For the information provided in the table above, calculate the mean of the provided sample.

The formula above makes the mean straightforward to calculate. First, however, we must determine each quantity in the formula using the table above.

Finding the sum of products of class centres and frequencies:

$$\sum xf = 12+85+132+189+128+185+84+47+52 = 914$$

As we can see from the table, the sum of the frequencies is:

$$n=32$$

Now, we can calculate the mean using the given formula:

$$\overline{x}=\frac{914}{32}=28.5625$$

$\therefore$ The mean of the sample is approximately $28.56$.
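
The same calculation can be checked with a short Python sketch that works from the class centres and frequencies in the table. The variable names are illustrative.

```python
class_centres = [12, 17, 22, 27, 32, 37, 42, 47, 52]
frequencies = [1, 5, 6, 7, 4, 5, 2, 1, 1]

# n is the sum of the frequencies.
n = sum(frequencies)

# Sum of each class centre multiplied by its frequency.
sum_xf = sum(x * f for x, f in zip(class_centres, frequencies))

mean = sum_xf / n
print(n, sum_xf, mean)  # 32 914 28.5625
```

Because every value in a class is treated as if it were equal to the class centre, the mean of grouped data is an estimate of the mean of the original values.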

How to Calculate the Variance and Standard Deviation of a Sample

The variance of a sample represents the average of the squared differences between each data point and the sample mean. Variance is essential for understanding how much the data points differ from the mean and from each other.

The sample variance is denoted by $s^2$ and is calculated using the following formula:

$$s^2=\frac{\sum x^2f}{n}-\overline{x}^2$$

The standard deviation of a sample quantifies how spread out the data points are around the mean. A low standard deviation indicates that the data points are close to the mean, while a high standard deviation indicates that the data points are spread over a larger range. It is found by taking the square root of the variance.

Practice Question 2

Using the frequency table provided, calculate the variance of the sample. Use the mean found in the previous example to assist your calculations.

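| Interval | Class Centre ($x$) | Frequency ($f$) | $xf$ | $x^2f$ |
|----------|--------------------|-----------------|------|--------|
| 10-14    | 12                 | 1               | 12   | 144    |
| 15-19    | 17                 | 5               | 85   | 1445   |
| 20-24    | 22                 | 6               | 132  | 2904   |
| 25-29    | 27                 | 7               | 189  | 5103   |
| 30-34    | 32                 | 4               | 128  | 4096   |
| 35-39    | 37                 | 5               | 185  | 6845   |
| 40-44    | 42                 | 2               | 84   | 3528   |
| 45-49    | 47                 | 1               | 47   | 2209   |
| 50-54    | 52                 | 1               | 52   | 2704   |
| Total    |                    | 32              | 914  | 28978  |

From the previous practice question, we know that the mean of this sample is $28.5625$ (approximately $28.56$). From the table, the sum of each class centre squared multiplied by its frequency is $\sum x^2f = 28978$, and the sum of the frequencies is $n = 32$. Using the unrounded mean avoids introducing a rounding error into the result.

$$s^2=\frac{28978}{32}-28.5625^2 = 905.5625-815.8164 \approx 89.75$$

$\therefore$ The variance of the sample is approximately $89.75$.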
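
A quick way to check the column totals and the final arithmetic is to recompute them from the class centres and frequencies. The sketch below follows the formula given in this section, which divides by $n$. That is the same divisor used by Python's `statistics.pvariance`, whereas `statistics.variance` divides by $n-1$. The variable names are illustrative.

```python
import math

class_centres = [12, 17, 22, 27, 32, 37, 42, 47, 52]
frequencies = [1, 5, 6, 7, 4, 5, 2, 1, 1]

n = sum(frequencies)
mean = sum(x * f for x, f in zip(class_centres, frequencies)) / n

# Sum of each class centre squared, multiplied by its frequency.
sum_x2f = sum(x**2 * f for x, f in zip(class_centres, frequencies))

# Variance: mean of the squares minus the square of the mean.
variance = sum_x2f / n - mean**2

# Standard deviation: square root of the variance.
std_dev = math.sqrt(variance)

print(sum_x2f, round(variance, 2), round(std_dev, 2))  # 28978 89.75 9.47
```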
