A histogram displays measurement data, grouped into categories (classes),
in bar graph format.
A HISTOGRAM IS USED FOR:
Making decisions about a process, product, or procedure that could
be improved after examining the variation (example: Should the school invest
in a computer-based tutoring program for low-achieving students in Algebra
I after examining the grade distribution? Are more of the out-of-specification
shafts being produced too big rather than too small?)
Displaying the variation in a process at a glance (example: Which units
are causing the most difficulty for students? Is the variation in a process
due to parts that are too long or parts that are too short?)
STEPS IN CONSTRUCTING A HISTOGRAM:
Gather and tabulate data on a process, product, or procedure. This
could be time, weight, size, frequency of occurrences, test scores, GPAs,
pass/fail rates, number of days to complete a cycle, diameter of shafts
built, etc.
Calculate the range of the data by subtracting the smallest number
in the data set from the largest. Call this value R.
Decide approximately how many bars (or classes) you want to display in
your eventual histogram. Call this number K. This number should never be
less than four and seldom exceeds 12. With 100 pieces of data, K=7 generally
works well. With 1000 pieces of data, K=11 works well.
Determine the fixed width of each class by dividing the range, R, by the
number of classes, K. This value should be rounded to a "nice" number,
generally a number ending in a zero. For example, 11.3 would not be a
"nice" number; 10 would be. Call this number i, for interval width. It is
important to use "nice" numbers; otherwise the histogram will have weird
scales on the X axis.
Create a table of upper and lower class limits. The first "nice" number
less than the lowest value in the data set becomes the lower limit of the
first class. Add the interval width i to it to determine the upper limit
of the first class. The upper limit of the first class becomes the lower
limit of the second class. Adding the interval width (i) to the lower limit
of the second class determines the upper limit for the second class. Repeat
this process until the largest upper limit exceeds the biggest piece of
data. You should have approximately K classes or categories in total.
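The range, interval width, and class limits described in the steps above can be sketched in Python. This is only an illustration: the function names nice_width and class_limits are hypothetical, and the rounding rule (pick the nearest of 1, 2, 5, or 10 times a power of ten) is an assumption standing in for the looser notion of a "nice" number used here.

```python
import math

def nice_width(raw_width):
    """Round a raw class width to a 'nice' value.

    Assumption: 'nice' means 1, 2, 5, or 10 times a power of ten,
    choosing whichever candidate is closest to the raw width.
    """
    if raw_width <= 0:
        raise ValueError("raw_width must be positive")
    magnitude = 10 ** math.floor(math.log10(raw_width))
    candidates = [factor * magnitude for factor in (1, 2, 5, 10)]
    return min(candidates, key=lambda c: abs(c - raw_width))

def class_limits(data, k):
    """Return (R, i, limits) for roughly k classes."""
    r = max(data) - min(data)           # range R
    i = nice_width(r / k)               # interval width i
    lower = math.floor(min(data) / i) * i   # "nice" starting limit
    limits = [lower]
    # Keep adding i until the largest upper limit exceeds the biggest value.
    while limits[-1] <= max(data):
        limits.append(limits[-1] + i)
    return r, i, limits

# Made-up test scores, for illustration only.
scores = [52, 61, 67, 70, 73, 75, 78, 81, 84, 88, 93, 99]
print(class_limits(scores, 5))
# -> (47, 10, [50, 60, 70, 80, 90, 100])
```

Note that when the smallest value is itself a multiple of i, this sketch uses it as the first lower limit rather than stepping down one more interval.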
Sort, organize, or categorize the data in such a way that you can count
or tabulate how many pieces of data fall into each of the classes or categories
in your table above. These are the frequency counts and will be plotted
on the Y axis of the histogram.
Create the framework for the horizontal and vertical axes of the
histogram. On the horizontal axis, plot the lower and upper limits of each
class determined above. The scale on the vertical axis should run from zero
to the first "nice" number greater than the largest frequency count
determined above.
Plot the frequency data on the histogram framework by drawing vertical
bars for each class. The height of each bar represents the number or
frequency of values occurring between the lower and upper limits of
that class.
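As a rough sketch of the counting and plotting steps, the following Python tallies how many values fall in each class and draws a text-only histogram. The helper names are hypothetical, the class limits are supplied by hand, and two choices are assumptions not stated in the text: a value equal to a shared limit is counted in the upper class (each class is lower <= x < upper), and the bars are drawn sideways for simplicity rather than vertically.

```python
def frequency_counts(data, limits):
    """Count values per class; class j covers limits[j] <= x < limits[j + 1]."""
    counts = [0] * (len(limits) - 1)
    for x in data:
        for j in range(len(counts)):
            if limits[j] <= x < limits[j + 1]:
                counts[j] += 1
                break
    return counts

def text_histogram(limits, counts):
    """Render one row per class, with one '#' per counted value."""
    rows = []
    for j, c in enumerate(counts):
        rows.append(f"{limits[j]:>4} to {limits[j + 1]:<4} | {'#' * c} ({c})")
    return "\n".join(rows)

# Made-up test scores and hand-chosen class limits, for illustration only.
scores = [52, 61, 67, 70, 73, 75, 78, 81, 84, 88, 93, 99]
limits = [50, 60, 70, 80, 90, 100]
counts = frequency_counts(scores, limits)
print(counts)  # -> [1, 2, 4, 3, 2]
print(text_histogram(limits, counts))
```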
Interpret the histogram for skew and clustering problems:
Interpreting skew problems:
Data may be skewed to the left or right. If the histogram shows a long
tail of data on the left side, the data is termed left or negatively
skewed. If a tail appears on the right side, the data is termed right or
positively skewed. Most process data should not typically appear skewed.
Data that is seriously skewed either to the left or right may be an
indication that there are inconsistencies in the process or procedures,
etc. Decisions may need to be made to determine the appropriateness of
the direction of the skew.
It should be noted, however, that some process data is, by its very
nature, skewed. This situation occurs in arrival processes (for example,
people arriving at a McDonald's within a fixed unit of time) and in service
processes (for example, the time it takes to wait on a customer in a bank).
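Skew is read off the histogram by eye here, but a numerical check can back up that reading. The sketch below computes the moment coefficient of skewness (third central moment divided by the 1.5 power of the second); this statistic is a swapped-in technique, not something the text prescribes, and the function name and sample data are hypothetical. A negative result suggests a longer left tail, a positive result a longer right tail.

```python
def skewness(data):
    """Moment coefficient of skewness: m3 / m2 ** 1.5."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in data) / n  # third central moment
    return m3 / m2 ** 1.5

# Made-up data: waiting times with a long right tail.
wait_minutes = [1, 1, 2, 2, 3, 10]
print(skewness(wait_minutes) > 0)  # -> True
```

The value is undefined when every data point is identical (the denominator is zero), so a real check would need to guard against that case.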
Interpreting clustering problems:
Data may be clustered on opposite ends of the scale or display two or
more peaks, indicating serious inconsistencies in the process or procedure,
or the measurement of a mixture of two or more distinct groups or processes
that behave very differently.
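One simple way to flag this pattern is to count the peaks in the frequency counts. The following is a minimal sketch with a hypothetical helper name: it treats a class as a peak when its count is higher than both neighbours, after merging runs of equal counts so that a flat top counts once. It makes no attempt to ignore small random bumps, so with few data points it can report peaks that are not meaningful.

```python
def count_peaks(counts):
    """Count local maxima in a list of class frequency counts."""
    # Merge runs of equal counts so a flat-topped peak is counted once.
    collapsed = [c for j, c in enumerate(counts) if j == 0 or c != counts[j - 1]]
    # Pad with zeros so the first and last classes can qualify as peaks.
    padded = [0] + collapsed + [0]
    return sum(
        1
        for j in range(1, len(padded) - 1)
        if padded[j] > padded[j - 1] and padded[j] > padded[j + 1]
    )

# Made-up frequency counts, for illustration only.
print(count_peaks([1, 4, 9, 4, 1]))   # -> 1 (single central peak)
print(count_peaks([6, 2, 1, 2, 7]))   # -> 2 (clustered at opposite ends)
```

A result of two or more is a prompt to look for mixed groups or processes in the data, not proof that they exist.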
A discussion of histograms and exploratory data analysis at NIST