
STATISTICAL CONSULTING PROGRAM

Exploratory Data Analysis (EDA)

Department of Science and Mathematics
Montclair State University

 

Frequently Asked Questions
Examples with TI-83 Plus and JMP
Measures of Dispersion 
Standard Deviation details with examples
Normal Distribution
Test for Normality

 

 

What is EDA and how is it useful?

Exploratory Data Analysis (EDA) is a general heading given to some descriptive techniques developed in statistics. When we collect data that seems interesting, we usually decide to examine it further. Prior to the examination, we do not know what the data will reveal, and what questions will arise in the process. EDA is an "informal" examination of data which allows us to reveal interesting features of the variables in a data set.

The goal of EDA is to gain a deeper understanding of the nature of the data and to seek patterns that raise interesting questions for further study. For those purposes, EDA uses graphical and numerical summaries as well as some formal statistical procedures to determine the distribution and structure of the data set. With EDA we explore data rather than use a statistical analysis to confirm some claim.

Exploratory Data Analysis involves plotting the data and looking for patterns and interesting new features. EDA is primarily graphical, not numerical. However, EDA by itself cannot provide convincing evidence for conclusions. Its tools only prepare us for the next stage of analysis, formal statistical inference, which answers specific questions posed before the experiment or study is actually conducted.

 

What are some of the statistical tools used by EDA?

Some of the graphical summaries used by EDA include Bar Graphs, Pie Charts, Stemplots, Histograms, Time Plots, Boxplots, Scatter Plots (static or dynamic), among others. To learn more about the graphical summaries, go to Statistical Graphs.

The numerical summaries usually include the mean, median, mode, standard deviation, variance, minimum, maximum, and frequency tables, among others.

Please scroll down if you would like to read details on how to find the mean, median, mode, variance, standard deviation, and how to form frequency tables.

Example 1

 

Suppose the goal of our study is to learn about the potential relationships between student attendance and poor academic performance at a certain school. As analysts, however, we should first study the possible patterns in the distribution of missed school days among students in the past few months. The numerical values listed below represent the number of days each student had missed. What we want to do is describe that list using EDA.

Data: 0, 0, 1, 1, 2, 3, 4, 4, 4, 5, 6, 6, 6, 6, 7, 7, 8, 8, 8, 10, 11, 30 days out

Descriptive measures from a TI-83 Plus graphing calculator:

[Images: TI-83 Plus screenshots showing the descriptive measures, a boxplot, and two histograms]

Some of the descriptive measures include: mean number of days missed ≈ 6.2, standard deviation = 6.133, min = 0 days, median = 6, max = 30, and one possible outlier (30). The last histogram does not include the outlier.

Was this analysis worth the effort involved? It depends on what output we intended to see. We definitely detected one possible outlier, 30 days out. This outlier can be seen on both the boxplot (isolated point) and the first histogram (isolated bar). We also saw a dramatic change in the shape of the histograms above. The first histogram included the outlier and the second did not. The distribution in both cases did not seem to be symmetric, suggesting non-normality of the data set.
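For readers without a TI-83 Plus, the same measures can be checked with a short Python sketch. It uses only the standard library. Two choices below are our assumptions and are not stated in the calculator output: the quartiles are taken as the medians of the lower and upper halves of the sorted data, and a point is flagged as a possible outlier when it lies more than 1.5 × IQR beyond a quartile (a common rule of thumb).

```python
import statistics

# Missed-day counts from Example 1.
days = [0, 0, 1, 1, 2, 3, 4, 4, 4, 5, 6, 6, 6, 6, 7, 7, 8, 8, 8, 10, 11, 30]


def quartiles(values):
    """Q1 and Q3 as medians of the lower and upper halves (assumed convention)."""
    ordered = sorted(values)
    half = len(ordered) // 2          # the middle value is skipped when n is odd
    return statistics.median(ordered[:half]), statistics.median(ordered[-half:])


mean = statistics.mean(days)
median = statistics.median(days)
sd = statistics.stdev(days)           # sample standard deviation (divides by n - 1)

q1, q3 = quartiles(days)
iqr = q3 - q1
# Assumed rule of thumb: flag points more than 1.5 * IQR outside the quartiles.
low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in days if x < low_fence or x > high_fence]

print(f"mean={mean:.3f} median={median} sd={sd:.3f}")
print(f"min={min(days)} max={max(days)} Q1={q1} Q3={q3}")
print("possible outliers:", outliers)
```

Under these assumptions the sketch flags only the value 30, which agrees with the isolated point on the boxplot.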

JMP Output

Here is yet another way of describing a similar sample (the outlier 30 is now 40), using a statistical software package known as JMP. With the Distribution of Y command in the Analysis menu of JMP, we were able to generate results similar to those from the TI-83 Plus graphing calculator.
 

The general layout is different, since we used different software, but we generated the same graphs, as well as some new ones. First we see a histogram and a boxplot, side by side. Next, there are the quartiles with the median, maximum and minimum. In the Moments box, there are the mean and standard deviation. There is also a stem-and-leaf plot. We also ran a test for normality called the Shapiro-Wilk W test.

Here is something worth mentioning when comparing the TI-83 Plus output with the JMP output:

1. 30 and 40 are both outliers
2. the medians remain the same
3. none of the histograms indicate normality of the distributions
4. many descriptive measures differ (e.g. mean, standard deviation)
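The second and fourth points can be checked with a small Python sketch. We assume here that the JMP sample is the TI-83 Plus sample with the single value 30 replaced by 40; the text only says that the two samples are similar, so the variable with_40 is a hypothetical reconstruction.

```python
import statistics

# The 21 values the two samples are assumed to share.
base = [0, 0, 1, 1, 2, 3, 4, 4, 4, 5, 6, 6, 6, 6, 7, 7, 8, 8, 8, 10, 11]
with_30 = base + [30]   # sample analyzed on the TI-83 Plus
with_40 = base + [40]   # hypothetical JMP sample: only the outlier changed

for label, sample in (("max 30", with_30), ("max 40", with_40)):
    print(label,
          "median:", statistics.median(sample),
          "mean:", round(statistics.mean(sample), 3),
          "sd:", round(statistics.stdev(sample), 3))
```

The median depends only on the middle values, so it stays at 6, while the mean and the standard deviation both grow when the largest value is pushed further out.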

Test for Normality

The Shapiro-Wilk W test for normality starts from the assumption that the data is normal. The null hypothesis of this test, then, states that the data is normal. The alternative hypothesis negates that statement and says that the data is not normal. The test measures how much evidence the sample provides against the null hypothesis. In this case, there is enough evidence to claim that the null hypothesis is false. Our data is not normal.

Why? Well, if this data set really were normally distributed, a W statistic as small as the one we observed would occur less than 0.01% of the time (look in the Test for Normality box under Prob<W, where the p-value is <.0001). A result that unlikely is hard to blame on chance, so we reject the null hypothesis. In practice, you must decide the maximum risk you are willing to take of rejecting a null hypothesis that is in fact true. That maximum risk is called the significance level.

The risk factor depends on the importance of your results. If you have to be very careful about your analysis because of the consequences that your decision will have on others, especially if you are in a medical field, you want to choose a very low level of significance, 1% or so. Generally speaking, the most common levels of significance are 1%, 5% and 10%.

The decision, however, is yours. Your level of significance could be 2.333% or even 40%, as long as you choose it before you look at the test results.

Links to:


Hypothesis Testing

P-values

Scroll down for information on the Normal Distribution

 

Useful Details

Arithmetic Mean

The mean of n numbers in a sample or a population is their sum divided by n. For example, the mean of numbers 1, 2, 3, 4, 5, 6, 7, 8, 9, and 10 is (1+2+3+4+5+6+7+8+9+10) / 10, which equals 55/10=5.5. Why find the mean? The mean is a simple and familiar measure as it represents the average value of a sample. It can be calculated for any set of numerical data. The means of several samples can be combined into the overall mean of all the data. And finally, it is a reliable measure in the sense that sample means drawn from the same population generally do not vary as widely as other statistical measures.

What is a weighted mean? A weighted mean is calculated when we want to give measurements being studied their proper degree of importance. It is necessary then to assign weights to each measurement, and calculate a weighted mean. How do we find a weighted mean? We first find the sum of the products obtained by multiplying each measurement by its corresponding weight. Then we divide this sum by the sum of the weights. Please note that if the weights are all equal, this formula simplifies to that of the ordinary sample mean, described in the first paragraph.
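The two steps just described can be sketched in Python as follows. The function name weighted_mean and the grade data are hypothetical and serve only as an illustration.

```python
def weighted_mean(values, weights):
    """Sum of value * weight, divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total_weight


# Hypothetical grades: homework, midterm, final exam, weighted 20/30/50.
scores = [90, 80, 70]
weights = [20, 30, 50]
print(weighted_mean(scores, weights))               # 77.0

# With equal weights the result is the ordinary mean of 1, 2, ..., 10.
numbers = list(range(1, 11))
print(weighted_mean(numbers, [1] * len(numbers)))   # 5.5
```

The second call reproduces the value 5.5 from the arithmetic mean example, which illustrates that equal weights reduce the formula to the ordinary mean.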

 

Median and Mode

For information on the median and mode, go to Statistical Graphs.


 
Basic measures of dispersion:


1. Range

The range is the difference between the highest value and the lowest value. Please note that when studying dispersion the range may be misleading because it depends only on the maximum and minimum values. Go to Statistical Graphs for information on IQR.

2. Standard Deviation

The standard deviation measures spread by looking at how far the observations are from their mean. When the standard deviation is 0, then there is no spread and all values are the same.

Technically speaking, the standard deviation is the square root of the variance, where the variance is the average of the squared deviations of the observations from their mean. As shown above, to find an average we take the sum of all values and divide that number by the number of elements in our set. For the variance and standard deviation of a sample, however, it is best to divide by one less than the number of observations, that is by (n-1), where n is the total number of observations in our sample. Only (n-1) of the deviations can vary freely because the sum of all deviations is 0: knowing the first (n-1) deviations allows us to figure out the last one. The number (n-1) is called the number of degrees of freedom of the variance or standard deviation. When your data is a sample and you are given a choice, choose the (n-1) option, which makes the sample variance an unbiased estimate of the population variance.
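To see the degrees-of-freedom argument in numbers, here is a small Python sketch that reuses the values 1 through 10 from the arithmetic mean example. The variable names are ours; the last two lines use the standard library functions statistics.pvariance (divide by n) and statistics.variance (divide by n-1) as a cross-check.

```python
import statistics

sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
n = len(sample)
mean = sum(sample) / n                        # 5.5
deviations = [x - mean for x in sample]

# The deviations sum to 0, so the last one is fixed by the other n - 1.
last_from_others = -sum(deviations[:-1])
print(last_from_others, deviations[-1])

sum_sq = sum(d * d for d in deviations)       # sum of squared deviations
var_n = sum_sq / n                            # divide by n
var_n_minus_1 = sum_sq / (n - 1)              # divide by n - 1
print(var_n, var_n_minus_1)
print(var_n ** 0.5, var_n_minus_1 ** 0.5)     # the two standard deviations

# Cross-check against the standard library.
print(statistics.pvariance(sample), statistics.variance(sample))
```

Dividing by (n-1) always gives the larger of the two values, and the difference shrinks as the sample grows.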


 
Example 2
 

Here is an example prepared on a TI-83 Plus graphing calculator, JMP, and SAS. Consider the following sixteen observations. Suppose the INDEX values represent your friend's sixteen consecutive, bi-weekly food shopping expenses in dollars. Let us use the different tools to do some calculations for us.

OBS 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

INDEX 48 75 69 58 60 68 59 66 71 52 49 60 54 55 70 57

 
Why does the sum of the differences between each observation in the set and the mean equal 0? Try adding the following differences: (48-60.6875)+(75-60.6875)+...+(70-60.6875)+(57-60.6875), where 60.6875 is the mean of the observations. The sum will be 0, because the sixteen observations add up to 971, which is exactly 16 times the mean.

TI-83 PLUS OUTPUT [calculator screenshot]

The sample standard deviation, Sx (computed with the n-1 divisor), equals 8.260094834.

SAS OUTPUT

Variable    Mean          Std Dev      Std Error    T             Prob>|T|
--------------------------------------------------------------------------
INDEX       60.6875000    8.2600948    2.0650237    29.3882824    0.0001
--------------------------------------------------------------------------

The standard deviation is 8.2600948, the same as above.
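The figures in the SAS row can be reproduced by hand or with a few lines of Python. This is only a sketch: we assume that the standard error is the standard deviation divided by the square root of n, and that the T column is the mean divided by that standard error (a test of whether the population mean is 0). Both assumptions match the printed values, but the SAS output itself does not say so.

```python
import math
import statistics

# The sixteen INDEX values from Example 2.
index = [48, 75, 69, 58, 60, 68, 59, 66, 71, 52, 49, 60, 54, 55, 70, 57]
n = len(index)

mean = statistics.mean(index)                  # sum / n
deviation_sum = sum(x - mean for x in index)   # should be 0
std_dev = statistics.stdev(index)              # n - 1 in the denominator
std_error = std_dev / math.sqrt(n)             # assumed: standard error of the mean
t_stat = mean / std_error                      # assumed: t for H0 "mean = 0"

print(f"mean={mean:.7f} std_dev={std_dev:.7f}")
print(f"std_error={std_error:.7f} t={t_stat:.7f}")
print("sum of deviations:", deviation_sum)
```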

JMP OUTPUT

The Moments part of this output shows the same standard deviation and mean as the SAS and TI-83 Plus output. We also included other summaries used in EDA, such as a histogram, a boxplot, a stem-and-leaf plot, the quartiles, and others.

To generate this output you must do the following:
1. input data into a table
2. go to the Analysis menu
3. choose Distribution of Y
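If JMP is not available, a rough stem-and-leaf plot of the same sixteen values can be built with a short Python sketch. The function stem_and_leaf is our own helper, it assumes two-digit values, and its layout will not necessarily match the one JMP draws.

```python
from collections import defaultdict

# The sixteen INDEX values from Example 2.
index = [48, 75, 69, 58, 60, 68, 59, 66, 71, 52, 49, 60, 54, 55, 70, 57]


def stem_and_leaf(values):
    """Group two-digit values by tens digit (stem); the units digits are the leaves."""
    stems = defaultdict(list)
    for value in sorted(values):
        stems[value // 10].append(value % 10)
    return dict(stems)


plot = stem_and_leaf(index)
for stem in sorted(plot):
    print(stem, "|", "".join(str(leaf) for leaf in plot[stem]))
```

Each row is one stem, so the lengths of the rows give the same picture of the distribution as the bars of a histogram with class width 10.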
 

Just like before, with Example 1, we performed a test for normality to see whether our data was normally distributed.

Null hypothesis: Data is normal.

Alternative hypothesis: Data is not normal.

Conclusion: At the 5% level of significance, there is not enough evidence to claim that our data is not normally distributed, so normality is a reasonable working assumption.

Why?

The maximum risk we were willing to take, in this case, was 5%. According to the Test for Normality (Shapiro-Wilk W test), the risk involved in rejecting the null hypothesis (the p-value) is 57.30%. It is greater than 5%, and thus we do not have sufficient evidence to say that the null hypothesis is false.
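The reasoning used in both examples comes down to comparing a p-value with the chosen level of significance. Here is a minimal Python sketch of that comparison. The function normality_decision is a hypothetical helper, and because Example 1 only reports its p-value as "< .0001", the value 0.0001 below is an upper bound rather than the exact figure.

```python
def normality_decision(p_value, alpha=0.05):
    """Compare a normality-test p-value with the chosen significance level."""
    if not 0 <= p_value <= 1 or not 0 < alpha < 1:
        raise ValueError("p_value must be in [0, 1] and alpha in (0, 1)")
    if p_value < alpha:
        return "reject H0: evidence that the data is not normal"
    return "fail to reject H0: no evidence against normality"


# P-values reported in the JMP output for the two examples.
example_1_p = 0.0001   # reported as "< .0001", so this is an upper bound
example_2_p = 0.5730

for alpha in (0.01, 0.05, 0.10):
    print(f"alpha={alpha:.2f}")
    print("  Example 1:", normality_decision(example_1_p, alpha))
    print("  Example 2:", normality_decision(example_2_p, alpha))
```

At all three common levels of significance the decisions are the same: reject normality in Example 1 and fail to reject it in Example 2.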
 
 

Links to:

Hypothesis Testing

P-values

NORMAL DISTRIBUTION


When we make a histogram or a stemplot, we can see the shape of the distribution of our data set. We do that by looking at the peaks of the vertical bars in a histogram or horizontal rows in a stemplot. For example, consider the following three histograms. Notice that only one of them has all of the following characteristics: symmetry, single peak, and bell shape. It is the last one.
 
 

[Histogram 1: Right-Skewed]

[Histogram 2: Left-Skewed]

[Histogram 3: Normal]
 

 

All normal distributions share the same overall bell-shaped curve. Each normal curve is completely described by its mean and standard deviation. The mean is in the "middle" of all the values. One of the reasons why we study normal distributions is their application in statistical inference. Many statistical tests require data sets to be approximately normally distributed, or else the results will not be reliable, unless the sample size is large or we are dealing with sets of sample means. EDA allows us to check whether normality is a reasonable assumption.

*EXAMPLES 1 AND 2 ABOVE SHOW TESTS FOR NORMALITY*

EXAMPLE 3

Consider a normally distributed data set with a mean of 0 and a standard deviation of 1 (graph shown below). According to the 68-95-99.7 Rule, in any normally distributed population, about 68% of the observations fall within one standard deviation of the mean, about 95% fall within two standard deviations of the mean, and about 99.7% fall within three standard deviations of the mean. That last figure of 99.7% says that almost all observations are within three standard deviations of the mean. Now, how does that apply to our graph?

[Graph: standard normal density curve with the cursor at x = 2.9965]

The cursor located in the right corner of the density curve indicates that the x-value at that point is 2.9965. Since the mean of the population represented by the above graph is 0 and the standard deviation is 1, the points three standard deviations away from the mean are -3 and 3, one on each side of the mean. So, according to the 68-95-99.7 Rule, about 99.7% of all observations are between -3 and 3. The graph agrees with that statement: x = 2.9965 is very close to 3, and it is evidently located at the far "end" of the curve. It is then clear that almost all of the area under the curve (99.7%) is between -3 and 3.
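The percentages in the 68-95-99.7 Rule are rounded areas under the normal curve, and they can be checked with a short Python sketch. It uses statistics.NormalDist from the standard library (Python 3.8 or later); the helper within is our own name.

```python
from statistics import NormalDist

# Normal distribution with mean 0 and standard deviation 1, as in Example 3.
standard_normal = NormalDist(mu=0, sigma=1)


def within(k):
    """Probability that a standard normal value lies between -k and k."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {within(k):.4f}")

# Area to the left of the cursor position read from the graph.
print(f"area below x = 2.9965: {standard_normal.cdf(2.9965):.4f}")
```

The three areas come out close to 0.68, 0.95 and 0.997, and only a tiny fraction of the area lies to the right of the cursor position.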

