Exploratory Data Analysis (EDA)

What are some of the statistical tools used by EDA?
Some of the graphical summaries used by EDA include Bar Graphs, Pie Charts, Stemplots, Histograms, Time Plots, Boxplots, and Scatter Plots (static or dynamic), among others. To learn more about the graphical summaries, go to Statistical Graphs. The numerical summaries usually include the mean, median, mode, standard deviation, variance, minimum, maximum, and frequency tables, among others. Please scroll down if you would like to read details on how to find the mean, median, mode, variance, and standard deviation, and how to form frequency tables.
Suppose the goal of our study is to learn about the potential relationship between student attendance and poor academic performance at a certain school. As analysts, however, we should first study the possible patterns in the distribution of missed school days among students over the past few months. The numerical values listed below represent the number of days each student missed. What we want to do is describe that list using EDA.

Data (days out): 0, 0, 1, 1, 2, 3, 4, 4, 4, 5, 6, 6, 6, 6, 7, 7, 8, 8, 8, 10, 11, 30
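As a cross-check on the list above, here is a minimal Python sketch of the numerical summaries. It uses only the standard `statistics` module. The quartile convention (medians of the lower and upper halves) and the 1.5 x IQR outlier rule are assumptions chosen for illustration; other calculators and packages may use slightly different rules.

```python
import statistics

# Missed-school-day counts copied from the list above.
days_out = [0, 0, 1, 1, 2, 3, 4, 4, 4, 5, 6, 6, 6, 6,
            7, 7, 8, 8, 8, 10, 11, 30]

mean = statistics.mean(days_out)
median = statistics.median(days_out)
sample_sd = statistics.stdev(days_out)   # sample SD, divides by n - 1

# Quartiles taken as medians of the lower and upper halves of the sorted
# data (an assumed convention; software packages differ on this rule).
ordered = sorted(days_out)
half = len(ordered) // 2
q1 = statistics.median(ordered[:half])
q3 = statistics.median(ordered[-half:])
iqr = q3 - q1

# Assumed 1.5 * IQR rule for flagging possible outliers.
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr
outliers = [x for x in ordered if x < low_fence or x > high_fence]

print("mean:", round(mean, 3), "median:", median, "sd:", round(sample_sd, 3))
print("min:", min(days_out), "max:", max(days_out))
print("Q1:", q1, "Q3:", q3, "possible outliers:", outliers)
```

Under these assumptions the only value flagged is the 30 days out, which agrees with the isolated point described for the boxplot.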
Descriptive Measures from a TI-83 Plus graphing calculator:
Some of the descriptive measures include: mean number of days missed of about 6, standard deviation = 6.133, min = 0 days, median = 6, max = 30, and one possible outlier (30). The last histogram does not include the outlier.

Was this analysis worth the effort involved? It depends on what output we intended to see. We definitely detected one possible outlier, 30 days out. This outlier can be seen on both the boxplot (isolated point) and the first histogram (isolated bar). We also saw a dramatic change in the shape of the histograms above: the first histogram included the outlier and the second did not. The distribution in both cases did not seem to be symmetric, suggesting non-normality of the data set.

JMP Output
Using Distribution of Y, we were able to generate results similar to those from the TI-83 Plus graphing calculator. The general layout is different, since we used different software, but we generated the same graphs, as well as some new ones. First we see a histogram and a boxplot, side by side. Next, there are the quartiles with the median, maximum and minimum. In the Moments box, there are the mean and standard deviation. There is also a stem-and-leaf plot. We also generated a test for normality called the Shapiro-Wilk W Test.

Here is something worth mentioning when comparing the TI-83 Plus output with the JMP output:
1. 30 and 40 are both outliers;
2. medians remain the same;
3. none of the histograms indicate normality of distributions;
4. many descriptive measures differ (e.g., mean, standard deviation).
The Shapiro-Wilk W Test for normality starts from the assumption that the data is normal. The null hypothesis of this test, then, states that the data is normal. The alternative hypothesis negates that statement and says that the data is not normal. The test measures the strength of the evidence against the null hypothesis. In this case, there is enough evidence to claim that the null hypothesis is false. Our data is not normal. Why? Well, if this data set really came from a normal distribution, there would be less than a 0.01% chance of obtaining a W statistic as small as the one we observed (look in the Test for Normality box under Prob<W; the P-value is <.0001).

In practice, you must determine the maximum risk you are willing to take of rejecting a null hypothesis that is in fact true. This risk is called the significance level. It depends on the importance of your results. If you have to be very careful about your analysis because of the consequences that your decision will have on others, especially if you are in a medical field, you want to choose a very low level of significance, 1% or so. Generally speaking, the most common levels of significance are 1%, 5% and 10%. The decision, however, is yours. Your level of significance could be 2.333% or even 40%. Everything is up to you.
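The decision rule described here can be sketched in a few lines of Python. The function name `decide` is hypothetical, and the value 0.0001 is used only as a stand-in upper bound for the reported "P-value <.0001"; the exact P-value is not given in the output.

```python
def decide(p_value, alpha):
    """Compare a P-value with a chosen significance level (both as proportions)."""
    if p_value < alpha:
        return "reject the null hypothesis"
    return "fail to reject the null hypothesis"

# Stand-in for the reported "P-value <.0001" (an assumption, not the exact value).
p_value_bound = 0.0001

# The most common significance levels mentioned in the text.
decisions = {alpha: decide(p_value_bound, alpha) for alpha in (0.01, 0.05, 0.10)}
for alpha, decision in decisions.items():
    print(f"alpha = {alpha:.2f}: {decision}")
```

With a P-value this small, the null hypothesis of normality is rejected at all three common levels.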
Useful Details
Arithmetic Mean
The mean of n numbers in a sample or a population is their sum divided by n. For example, the mean of numbers 1, 2, 3, 4, 5, 6, 7, 8, 9, and 10 is (1+2+3+4+5+6+7+8+9+10) / 10, which equals 55/10=5.5. Why find the mean? The mean is a simple and familiar measure as it represents the average value of a sample. It can be calculated for any set of numerical data. The means of several samples can be combined into the overall mean of all the data. And finally, it is a reliable measure in the sense that sample means drawn from the same population generally do not vary as widely as other statistical measures.
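The calculation above, and the remark about combining sample means, can be checked with a short Python sketch. The split into two samples is hypothetical and is there only to show that weighting each sample mean by its sample size recovers the overall mean.

```python
values = list(range(1, 11))            # 1, 2, ..., 10
mean = sum(values) / len(values)       # 55 / 10

# Hypothetical split of the same data into two samples of different sizes.
sample_a = values[:4]                  # 1, 2, 3, 4
sample_b = values[4:]                  # 5, 6, 7, 8, 9, 10
mean_a = sum(sample_a) / len(sample_a)
mean_b = sum(sample_b) / len(sample_b)

# Combine the sample means, weighting each by its sample size.
combined = (len(sample_a) * mean_a + len(sample_b) * mean_b) / (
    len(sample_a) + len(sample_b)
)

print(mean, mean_a, mean_b, combined)
```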
What is a weighted mean? A weighted mean is calculated when we want to give the measurements being studied their proper degree of importance. To do so, we assign a weight to each measurement. How do we find a weighted mean? We first find the sum of the products obtained by multiplying each measurement by its corresponding weight. Then we divide this sum by the sum of the weights. Please note that if the weights are all equal, this formula simplifies to that of the ordinary sample mean, described in the first paragraph.
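The two steps above can be sketched in Python as follows. The scores and weights are made-up illustration values, and `weighted_mean` is a hypothetical helper name.

```python
def weighted_mean(measurements, weights):
    """Sum of measurement * weight, divided by the sum of the weights."""
    if len(measurements) != len(weights):
        raise ValueError("each measurement needs exactly one weight")
    total = sum(m * w for m, w in zip(measurements, weights))
    return total / sum(weights)

# Hypothetical example: three scores with weights 2, 3 and 5.
scores = [80, 90, 70]
weights = [2, 3, 5]
result = weighted_mean(scores, weights)

# With equal weights the result is the ordinary mean.
equal = weighted_mean(scores, [1, 1, 1])
ordinary = sum(scores) / len(scores)

print(result, equal, ordinary)
```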

Median and Mode
For information on the median and mode, go to Statistical Graphs.

1. Range
The range is the difference between the highest value and the lowest value. Please note that when studying dispersion the range may be misleading because it depends only on the maximum and minimum values. Go to Statistical Graphs for information on IQR.
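To see how a single extreme value can make the range misleading, here is a small Python sketch that reuses the missed-school-day data from the first example on this page, with and without the largest value.

```python
days_out = [0, 0, 1, 1, 2, 3, 4, 4, 4, 5, 6, 6, 6, 6,
            7, 7, 8, 8, 8, 10, 11, 30]

# Range = highest value - lowest value.
full_range = max(days_out) - min(days_out)

# Same calculation after dropping the single largest observation (30).
trimmed = sorted(days_out)[:-1]
trimmed_range = max(trimmed) - min(trimmed)

print(full_range, trimmed_range)
```

One observation out of twenty-two accounts for most of the range, which is why the range alone says little about how the bulk of the data is spread.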
2. Standard Deviation
The standard deviation measures spread by looking at how far the observations are from their mean. When the standard deviation is 0, then there is no spread and all values are the same.
Technically speaking, standard deviation is the square root of variance, where variance is the average of the squares of the deviations of the observations from their mean. As shown above, to find an average we take the sum of all values and divide that number by the number of elements in our set. That is how the variance of a whole population is computed. For a sample, however, it is best to divide by 1 less than the number of observations, (n-1), where n is the total number of observations in our sample. Only (n-1) of the deviations can vary freely because the sum of all deviations is 0: knowing the first (n-1) deviations allows us to figure out the last one. The number (n-1) is called the number of degrees of freedom of the variance or standard deviation. When given a choice for sample data, choose the (n-1) option, since dividing by (n-1) gives an unbiased estimate of the population variance.
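The difference between dividing by n and by (n-1) can be sketched in Python. The data set is the numbers 1 through 10 used in the Arithmetic Mean example; the variable names are illustrative only.

```python
import math

values = list(range(1, 11))            # 1, 2, ..., 10
n = len(values)
mean = sum(values) / n

# Squared deviations of each observation from the mean.
squared_devs = [(x - mean) ** 2 for x in values]

population_variance = sum(squared_devs) / n        # divide by n
sample_variance = sum(squared_devs) / (n - 1)      # divide by n - 1
sample_sd = math.sqrt(sample_variance)

print(population_variance, round(sample_variance, 4), round(sample_sd, 4))
```

The (n-1) version is always a little larger, and the gap shrinks as n grows.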
Example 2
Here is an example prepared on a TI-83 Plus graphing calculator, JMP, and SAS. Consider the following sixteen observations. Suppose the INDEX values represent your friend's sixteen consecutive, bi-weekly food shopping expenses in dollars ($). Let us use different software packages to do some calculations for us.
OBS 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
INDEX 48 75 69 58 60 68 59 66 71 52 49 60 54 55 70 57
Why does the sum of the differences between each observation in the set and the mean equal 0? Because the mean is the sum of the observations divided by n, subtracting the mean n times removes exactly that sum. Try adding the following differences: (48-60.6875)+(75-60.6875)+...+(70-60.6875)+(57-60.6875), where 60.6875 is the mean of the observations. The sum will be 0.
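The zero sum can be checked directly with a short Python sketch using the sixteen INDEX values from this example.

```python
index = [48, 75, 69, 58, 60, 68, 59, 66, 71, 52, 49, 60, 54, 55, 70, 57]

mean = sum(index) / len(index)

# Deviation of each observation from the mean.
deviations = [x - mean for x in index]
total_deviation = sum(deviations)

print(mean, total_deviation)
```

Since the deviations must add up to 0, any fifteen of them determine the sixteenth, which is the idea behind the (n-1) degrees of freedom.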
TI-83 PLUS OUTPUT
The unbiased standard deviation, Sx, equals 8.260094834.

SAS OUTPUT

Variable        Mean      Std Dev    Std Error            T   Prob>|T|
----------------------------------------------------------------------
INDEX     60.6875000    8.2600948    2.0650237   29.3882824     0.0001
----------------------------------------------------------------------

The standard deviation is 8.2600948, the same as above.
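As a rough cross-check of the reported figures, the following Python sketch recomputes the mean, standard deviation, standard error and T value from the sixteen INDEX values. It assumes that the T column is the mean divided by its standard error (a test of the hypothesis that the mean is 0); that reading is an assumption, not something stated in the output itself.

```python
import math
import statistics

index = [48, 75, 69, 58, 60, 68, 59, 66, 71, 52, 49, 60, 54, 55, 70, 57]

n = len(index)
mean = statistics.mean(index)
std_dev = statistics.stdev(index)        # sample SD, divides by n - 1
std_error = std_dev / math.sqrt(n)       # standard deviation of the sample mean
t_stat = mean / std_error                # assumes the null hypothesis mean = 0

print(f"{mean:.7f} {std_dev:.7f} {std_error:.7f} {t_stat:.7f}")
```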
JMP OUTPUT
The Moments part of this output shows the same standard deviation and mean as the SAS and TI-83 Plus outputs. We also included other methods used in EDA such as a histogram, a boxplot, a stem-and-leaf plot, the interquartile range points, and others.
To generate this output you must do the following:
Just like before, with Example 1, we performed a test for normality to see whether our data was normally distributed.
Null hypothesis: Data is normal.
Alternative hypothesis: Data is not normal.
Conclusion: At the 5% level of significance, there is not enough evidence to reject the null hypothesis, so we have no reason to doubt that our data is normally distributed. Why? The maximum risk we were willing to take, in this case, was 5%. According to the Test for Normality (Shapiro-Wilk W Test), the
When we make a histogram or a stemplot, we can
see the shape of the distribution of our data set. We do that by looking
at the peaks of the vertical bars in a histogram or horizontal rows in
a stemplot. For example, consider the following three histograms. Notice
that only one of them has all of the following characteristics: symmetry,
single peak, and bell shape. It is the last one.
[Histogram panels: Right-Skewed; Normal]
All normal distributions share the same overall bell-shaped curve. Each normal curve is completely described by its mean and standard deviation. The mean is in the "middle" of all the values. One of the reasons why we study normal distributions is their application in statistical inference. Many statistical tests require data sets to be normally distributed or else the results will not be reliable, unless the sample size is large or we are dealing with sets of sample means. EDA allows us to assess normality.
*EXAMPLES 1 AND 2 ABOVE SHOW TESTS FOR NORMALITY*
EXAMPLE 3
Consider a normally distributed data set with a mean of 0 and a standard deviation of 1 (graph shown below). According to the 68-95-99.7 Rule, in any normally distributed population, about 68% of the observations fall within one standard deviation of the mean, about 95% fall within two standard deviations of the mean, and about 99.7% fall within three standard deviations of the mean. That last statement of 99.7% says that almost all observations are within three standard deviations of the mean. Now, how does that apply to our graph?
The cursor located
in the right corner of the density curve indicates that the x-value at
that point is 2.9965. Since the mean of the population represented by the
above graph is 0 and the standard deviation is 1, three standard deviations
away from 0 will indicate points of -3 and 3, one on each side of the mean.
So, according to the 68-95-99.7 Rule, 99.7% of all observations are between
-3 and 3. We agree with that statement because x equal to 2.9965 is very
close to 3, and it is evidently located on the far "end" of the curve.
It is then clear that most of the area under the curve (99.7%) is between
-3 and 3.
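The percentages in the 68-95-99.7 Rule can be reproduced with a short Python sketch. It uses `statistics.NormalDist` from the standard library (Python 3.8 or later); the helper name `share_within` is hypothetical.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Proportion of the distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} SD: {share_within(k):.4f}")

# Proportion of the area lying to the left of the cursor position x = 2.9965.
area_left_of_cursor = standard_normal.cdf(2.9965)
print(f"area to the left of 2.9965: {area_left_of_cursor:.4f}")
```

The rule's figures are rounded values of these areas, and nearly all of the area under the curve lies to the left of x = 2.9965.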