Fall Semester 2003

Lecture Notes One: Getting started.
Searching for truth is an on-going struggle.

Historically four methods have been employed to acquire knowledge and thus settle our questions and bring us closer to what is true and what is not. We summarize these methods below:

1. AUTHORITY

When using this method we believe something is true because an authority says so. For example, physicists claim there are electrons, and I believe them, although I haven't seen an electron myself. Likewise, the Surgeon General says smoking is bad for your health, and I believe him (although that's not the real reason I don't smoke).

2. RATIONALISM

The method of rationalism uses reasoning alone to arrive at knowledge. This is the way to go, but reason alone can't always take you all the way. In most cases you need some evidence (data) as well. So reasoning is only part of the process, and not synonymous with it.

3. INTUITION

Where do all the (crazy) ideas come from? By intuition we mean sudden insight. This, however, is a very mysterious process, about which we have only the most rudimentary understanding.

4. SCIENTIFIC METHOD

This method uses reasoning and intuition, but relies on objective assessment. By rationalism and intuition a scientist forms a hypothesis about some fact, or some reality. An experiment is then designed, resulting in measurements. The data from the experiment are analyzed, and the hypothesis is either supported or rejected.

This will be a course in data analysis.

There are seven parts to this class, fourteen lectures, about as many lab assignments, five homework assignments, one midterm exam, two practical exams, and a final written exam. To learn the material described in the notes you don't need to be a whiz in calculus or differential equations. To be successful you must be able to do elementary algebra and a few other mathematical operations. To help you review, the lab notes for tomorrow will cover the prerequisite mathematics for this class. Most (if not all) of the material should be pretty basic, but please review it. As they say, it's better to be sure than sorry.
Where do all the numbers come from?

Scientific research may be divided into two categories:

A. OBSERVATIONAL STUDIES

In these studies all you can do is take notes. Included in this category of research are:

1. Naturalistic Observation. Much anthropological and etiological research is of this type. In this research the main goal is to find out what's going on (that is, to obtain an accurate description of the situation that is being studied).

2. Parameter Estimation. This kind of research is conducted on samples to estimate the level of one or more population characteristics. Surveys, public opinion polls, and much market research fall into this category.

3. Correlational Studies. In these studies the investigator focuses attention on two or more variables to determine whether they are related.

B. TRUE EXPERIMENTS

In this type of research an attempt is made to determine if changes in one variable produce changes in one or more other variables. In this case you have the freedom to make changes and observe results.

Let's make a summary: we keep searching for truth. We need knowledge, as much as we can acquire. Rationalism, intuition, authority and scientific experiments are our tools. Of paramount importance is the data (evidence) that we collect. We do that through a process of measurement.

Let's now look at measurement scales.

NOMINAL SCALES

This is the lowest level of measurement, used with variables that are qualitative in nature. Objects are measured by the category that they belong to. On a used car lot we have all the Mazdas, Toyotas, Chevys, and so forth. Or, we could sort the cars by the type of car they are: small sedans, family sedans, SUVs, minivans, trucks, and so forth. Or we can sort them by the year they were produced in.

There's no direct relationship between categories.

ORDINAL SCALES

This is the next higher level of measurement. On such a scale we could say that Michael Jordan was a better basketball player than Rik Smits, and Rik Smits was a better basketball player than your lab instructor. Chances are that the difference between MJ and Rik Smits is not as big as between Rik Smits and your lab instructor, but on an ordinal scale, this does not matter.

For another example the Sears Tower in Chicago is taller than the Empire State Building in NY, and the Empire State Building is taller than Ballantine. An ordinal scale only cares about who's taller, but not by how much.

INTERVAL SCALES and RATIO SCALES

The Celsius and Fahrenheit scales of temperature are interval scales. On such a scale we would be able to say that a temperature of 93F is greater than one of 91F, but the difference is not as big as when we compare a temperature of 91F to one of 80F. Same goes for Celsius.

A ratio scale is one that has an absolute zero point. The Kelvin scale of temperature is such a scale. As a consequence a temperature of 200K is twice as hot as a temperature of 100K. The Celsius and Fahrenheit scales have their zeros in various places and are not absolute in any way (the Celsius scale is ideal for cooking, while the Fahrenheit scale is mostly oriented towards human body temperatures and weather temperatures).

The Kelvin scale, though, is an absolute (ratio) scale of measurement.
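To see the difference between an interval scale and a ratio scale in numbers, here is a small Python sketch (the two temperatures and the helper name `celsius_to_kelvin` are my own choices for illustration). Differences come out the same on both scales, but only the Kelvin ratio means anything physically.

```python
def celsius_to_kelvin(c):
    """Convert degrees Celsius to kelvins."""
    return c + 273.15

# Two illustrative temperatures, in degrees Celsius.
a_c, b_c = 10.0, 20.0
a_k, b_k = celsius_to_kelvin(a_c), celsius_to_kelvin(b_c)

# Differences agree on both scales: interval information is preserved.
diff_c = b_c - a_c
diff_k = b_k - a_k

# Ratios do not agree: 20C is not "twice as hot" as 10C.
ratio_c = b_c / a_c   # 2.0, but this number has no physical meaning
ratio_k = b_k / a_k   # only slightly above 1, and this one is meaningful

print("difference:", diff_c, "C,", round(diff_k, 2), "K")
print("ratio:", ratio_c, "on Celsius,", round(ratio_k, 3), "on Kelvin")
```

The Celsius ratio changes if you move the zero point, which is exactly why an interval scale cannot support "twice as much" statements.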

Data analysis (or statistical analysis) has been divided into two areas:

• descriptive statistics
• inferential statistics
Both involve analyzing data. If the analysis is done for the purpose of describing or characterizing the data that have been collected, then we are in the area of descriptive statistics. For example, when we record the scores from an exam, such as the one we talked about last time, we hand the tests back and then we want to describe the scores. We might decide to

1. calculate the average of the distribution so as to describe its central tendency.

2. determine its range, so as to characterize its variability.

3. plot the scores on a graph (histogram) so as to show the shape of the distribution.

Since all of these procedures are for the purpose of describing or characterizing the data already collected, they fall within the realm of descriptive statistics. Inferential statistics, on the other hand, is not concerned with just describing the obtained data. Rather, it embraces techniques that allow one to use obtained sample data to infer to or draw conclusions about populations.
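As a rough illustration of the three descriptive steps listed above, here is a Python sketch. The exam scores are made up, and the text histogram with bins of width 10 is just one simple way to show the shape.

```python
from collections import Counter
from statistics import mean

# Made-up exam scores, purely for illustration.
scores = [72, 85, 91, 68, 85, 77, 94, 85, 60, 77]

average = mean(scores)                    # central tendency
score_range = max(scores) - min(scores)   # variability

# Shape: count the scores in bins of width 10 (60-69, 70-79, ...).
bins = Counter((s // 10) * 10 for s in scores)
for low in sorted(bins):
    print(f"{low}-{low + 9}: {'*' * bins[low]}")

print("mean =", average, "range =", score_range)
```

Nothing here goes beyond the ten scores themselves, which is what makes it descriptive rather than inferential.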

Descriptive Statistics
is concerned with techniques that are used to describe or characterize data.

Inferential Statistics
involves techniques that use the obtained sample data to infer to populations.

In our discussions this semester we shall be using certain technical terms.

The terms and their definitions will be given in due time.

Here are some of the definitions to get us started:

1. the mean

The mean is the sum of the scores divided by the number of scores.

2. the median

The median is the scale value below which 50% of the scores fall.

It is therefore the same thing as the percentile point for 50% (P50).

3. the mode

The mode is the most frequent score in the distribution.

Homework One, which will be posted today (due on Friday in lab), will help you clearly distinguish the relative merits of each of these three measures of central tendency.

I started last year with the following minute paper question:

The deVoe Report (June 2, 1980) quoted then U.S. President Jimmy Carter as saying "half the people in this country are living below the median income -- and this is intolerable." What is disputable and what is true in this quote?
Here are some of the correct answers received then:
"President Carter must have meant that whatever the actual value of the median income was, it was intolerable that half of the nation lived below that income."
Indeed, that was the answer I was looking for.

The Carter quote was obviously taken out of context. We don't know what he said right before the quoted text, but he might have actually expressed the median income in dollars. Here's an extremely contrived version of this hypothesis, for illustration purposes:

"We have calculated the median yearly income and we found it to be (say) \$600. We think this is a problem that needs to be addressed immediately; half the people in this country are living below the median income -- and this is intolerable."
There were some answers saying that Carter probably meant the mean. He did not mention the (arithmetical) mean, and that's probably because he did not want to say anything about it. He only wanted to make a comment about the median. He found it too low, and he expressed a concern that the value is too low to be the upper limit of income for half of the nation.
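The point that half the scores fall below the median no matter what its value is can be checked with a short Python sketch. The incomes below are invented (chosen so that the median matches the contrived \$600 above), and the check assumes no income is exactly equal to the median.

```python
from statistics import median

# Invented yearly incomes in dollars, purely for illustration.
incomes = [300, 450, 500, 550, 650, 700, 900, 25000]

med = median(incomes)
share_below = sum(1 for x in incomes if x < med) / len(incomes)
print("median:", med, "share below:", share_below)

# Make everybody ten times richer: the median moves, the share does not.
richer = [10 * x for x in incomes]
med_richer = median(richer)
share_below_richer = sum(1 for x in richer if x < med_richer) / len(richer)
print("median:", med_richer, "share below:", share_below_richer)
```

So the "half the people" part of the quote is true by definition; only the value of the median can be tolerable or intolerable.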

What makes this example intriguing is that, taken out of its context, it drastically polarizes our assumptions about what is said, placing the focus of our understanding on the wrong aspect. To see how this happens in another example (for your enjoyment) witness the following English sentence.

The ship sailed past the harbor sank.
How does this sound? Well, here's the same sentence in its original context:
A small part of Napoleon's fleet tried to run the English blockade at the entrance to the harbor. Two ships, a sloop and a frigate, ran straight for the harbor while a third ship tried to sail past the harbor in order to draw enemy fire. The ship sailed past the harbor sank.
I hope this makes the point. The question, and the quote, were tricky.

You need to watch for tricks like this in real life too.

So we have now defined:

• mode
• median
• arithmetical mean

The homework is asking you to compare them.

Now let's list more properties and definitions, and describe an experiment.

1. The Arithmetical Mean.

The arithmetical mean is defined as the sum of the scores divided by the number of scores.

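In equation form that is

$$\bar{X} = \frac{\sum_{i=1}^{N} X_i}{N}$$

where the $X_i$ are the scores and $N$ is the number of scores.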

Properties of the mean:
1. The mean is sensitive to the exact value of all the scores in the distribution.

2. The sum of the deviations about the mean equals zero.
3. The mean is very sensitive to extreme scores.

4. The sum of the squared deviations of all the scores about their mean is a minimum.

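In other words, this formula (in which zeta is an unknown)

$$\sum_{i=1}^{N} (X_i - \zeta)^2$$

admits a minimum when zeta has this value

$$\zeta = \bar{X} = \frac{\sum_{i=1}^{N} X_i}{N}$$

that is, when zeta equals the mean. We need to verify that.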
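One way to verify property 4 with nothing more than algebra is sketched below. It is my own short argument, not necessarily the one intended for the lab. Here $X_i$ are the scores, $N$ is the number of scores, $\bar{X}$ is their mean, and $\zeta$ is any candidate value.

```latex
\begin{aligned}
\sum_{i=1}^{N} (X_i - \zeta)^2
  &= \sum_{i=1}^{N} \bigl[(X_i - \bar{X}) + (\bar{X} - \zeta)\bigr]^2 \\
  &= \sum_{i=1}^{N} (X_i - \bar{X})^2
     + 2(\bar{X} - \zeta)\sum_{i=1}^{N} (X_i - \bar{X})
     + N(\bar{X} - \zeta)^2 \\
  &= \sum_{i=1}^{N} (X_i - \bar{X})^2 + N(\bar{X} - \zeta)^2 .
\end{aligned}
```

The middle term drops out because the sum of the deviations about the mean is zero (property 2). The remaining term $N(\bar{X} - \zeta)^2$ is never negative and is zero only when $\zeta = \bar{X}$, so the sum is smallest exactly at the mean.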

5. Under most circumstances, of the measures used for central tendency, the mean is least subject to sampling variation. If we were repeatedly to take samples from a population on a random basis, the mean would vary from sample to sample. The same is true for the median and the mode. However, the mean varies less than these other measures of central tendency. This is very important in inferential statistics, and is a major reason why the mean is used in inferential statistics whenever possible.
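Property 5 can be explored with a small simulation. The Python sketch below is an illustration under assumptions of my own: a made-up, roughly bell-shaped population, samples of size 25, and 2,000 repetitions. Other populations can behave differently, which is why the property says "under most circumstances".

```python
import random
from statistics import mean, median, pvariance

random.seed(113)  # fixed seed so the run is repeatable

# Hypothetical population: 10,000 roughly bell-shaped scores.
population = [random.gauss(70, 10) for _ in range(10_000)]

sample_means, sample_medians = [], []
for _ in range(2_000):
    sample = random.sample(population, 25)   # one random sample of 25 scores
    sample_means.append(mean(sample))
    sample_medians.append(median(sample))

# How much does each measure vary from sample to sample?
var_of_means = pvariance(sample_means)
var_of_medians = pvariance(sample_medians)
print("variance of the sample means:  ", round(var_of_means, 2))
print("variance of the sample medians:", round(var_of_medians, 2))
```

For a population like this one the sample means cluster more tightly than the sample medians.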
2. The Median.

The median is defined as the scale value below which 50% of the scores (or measurements) fall.

Properties of the median:

1. The median is less sensitive than the mean to extreme scores.

2. Under usual circumstances, the median is more subject to sampling variability than the mean but less subject to sampling variability than the mode.

3. The Mode.

The mode is defined as the most frequent score in the distribution.

Usually distributions are unimodal. When a distribution has two modes it is bimodal.
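Here is a quick Python sketch of how the three measures react to one extreme score, using the `statistics` module from the standard library (`multimode` needs Python 3.8 or later). The extra score of 100 and the bimodal list are made up for illustration.

```python
from statistics import mean, median, mode, multimode

scores = [1, 2, 3, 2, 4]
with_outlier = scores + [100]   # the same scores plus one extreme score

for label, data in (("original", scores), ("with outlier", with_outlier)):
    print(label, "mean:", mean(data), "median:", median(data), "mode:", mode(data))

# A bimodal distribution has two most frequent scores.
two_modes = multimode([1, 1, 2, 3, 3])
print("modes:", two_modes)
```

The mean jumps from 2.4 to well above every score but one, the median barely moves, and the mode does not change at all.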

MEASURES OF VARIABILITY

1. The Range.

The range is defined as the difference between the highest and lowest score in the distribution.

2. Deviation Scores.

A deviation score tells how far away the raw score is from the mean of its distribution.

3. The Standard Deviation.

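For a population of scores we have:

$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}}$$

where $\mu$ is the population mean and $N$ is the number of scores in the population.

For a sample we have:

$$s = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \bar{X})^2}{N - 1}}$$

where $\bar{X}$ is the sample mean and $N$ is the number of scores in the sample.
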
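Alternative formula for the standard deviation:

$$s = \sqrt{\frac{\sum_{i=1}^{N} X_i^2 - \frac{\left(\sum_{i=1}^{N} X_i\right)^2}{N}}{N - 1}}$$

where the $X_i$ are the scores and $N$ is the number of scores in the sample.
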
Properties of the standard deviation:

1. The standard deviation gives us a measure of dispersion relative to the mean. This differs from the range, which tells us directly the spread of the two most extreme scores.

2. Like the mean, the standard deviation is sensitive to each score in the distribution. If a score is moved closer to the mean, then the standard deviation will become smaller. If a score shifts away from the mean, then the standard deviation will increase.

3. Like the mean, the standard deviation is stable with regard to sampling fluctuations.

4. Both the mean and the standard deviation can be manipulated algebraically. This is an important aspect, as it allows mathematics to be done with them for use in inferential statistics.
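The population and sample versions of the standard deviation can be compared in a few lines of Python. This is a sketch that borrows the five scores from the lab below. I am assuming that the "alternative formula" is the usual computational form, in which the sum of squared deviations is computed as the sum of the squared scores minus the squared sum of the scores divided by N.

```python
from math import isclose, sqrt
from statistics import pstdev, stdev

scores = [1, 2, 3, 2, 4]
n = len(scores)
m = sum(scores) / n

# Sum of squared deviations about the mean (definitional form).
ss = sum((x - m) ** 2 for x in scores)

pop_sd = sqrt(ss / n)            # population formula: divide by N
sample_sd = sqrt(ss / (n - 1))   # sample formula: divide by N - 1

# Assumed computational form of the same sum of squares.
ss_alt = sum(x * x for x in scores) - sum(scores) ** 2 / n

print("sum of squares:", round(ss, 4), "alternative:", round(ss_alt, 4))
print("population sd:", round(pop_sd, 4), "library:", round(pstdev(scores), 4))
print("sample sd:    ", round(sample_sd, 4), "library:", round(stdev(scores), 4))
```

The sample value is always a bit larger than the population value because it divides the same sum of squares by a smaller number.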

In lab tomorrow you should check the following experiment:

Squared deviations and the mean

We will "prove" today (in lab) that the sum of the squared deviations of all the scores about their mean is a minimum. In other words, the formula below, in which zeta is an unknown (or variable)

$$\sum_{i=1}^{N} (X_i - \zeta)^2$$

admits a minimum for

$$\zeta = \bar{X} = \frac{\sum_{i=1}^{N} X_i}{N}$$

Let's prove that (and in the process calculate other things as well).

Here are the steps:

1. Open up Excel. New worksheet.

2. Enter these numbers: 1, 2, 3, 2, 4 in cells A1:A5.

That's our data (the scores).

3. In E1 write this formula
=average(a1:a5)
For me that is 2.4 (the arithmetical mean of these 5 numbers).

4. Let's calculate the deviations to the mean.

In B1 write the formula for the first deviation:

=A1-\$E\$1
Notice that the second element has an absolute reference to column E row 1.

When we paste this formula that will become relevant.

The first deviation is -1.4 so the formula works fine.

5. Select cell B1. Drag the lower right corner of the cell to B5. Release mouse button.

The deviations are calculated.

6. Select cell B5. The formula inside it should be:
=A5-\$E\$1
Excel updated only the relative components of the formula.

7. In cell B6 enter this formula
=sum(b1:b5)
The value should be 0 (zero).

We have just verified that the sum of all the deviations about the mean is zero.

8. Let's square the deviations one by one.

In cell C1 write this formula

=b1^2
The value is 1.96 for me, the square of -1.4 as expected.

9. Now let's paste this formula throughout.

Select C1. Drag lower right corner to C5. Release mouse button.

Squared deviations have been calculated.

10. Let's finish this first part.

In cell C6 enter the following formula

=sum(c1:c5)
That's the sum of the squared deviations about the mean, and it's 5.2 for me.

11. In e2 write this formula:
=count(a1:a5)
That's 5. I leave it to you to label your spreadsheet nicely.

12. Calculate the standard deviation in E3:
=sqrt(c6/e2)
13. Do it again like this in E4
=sqrt(sum(c1:c5)/count(c1:c5))
14. Do it again in E5 as follows
=stdevp(a1:a5)
15. Do it again in E6 as follows
=stdev(a1:a5)
That should clarify the difference between STDEVP and STDEV.

16. Now let's work on the main point of this lab.

In A9 through F9 enter 0, 1, 2, 3, 4, 5.

We'll calculate the sum of squared deviations around each of these numbers.

17. Enter this formula in A10:
=(\$A1-A\$9)^2
18. Paste this formula to F10 (drag lower right corner to F10).

19. Select cells A10:F10. Drag lower right corner to F14.

The cells from A10 to F14 should now contain squared deviations.

20. Think a bit about it.

21. Now let's sum the squared deviations.

In cell A15 enter this formula

=sum(a10:a14)
22. Paste this formula through F15.

23. With A15:F15 selected click the Chart Wizard button.

Choose "Line" and click "Finish". Another way would be to

1. select A9:F9, then
2. press and hold the Control key, then
3. select A15:F15, and then
4. release the Control key. After this
5. push the Chart Wizard button and
6. choose the Scatter Plot type of chart.

24. You're done. Notice the minimum around 2.4 (the arithmetical mean).

25. You can now change the numbers and see the chart change.

There will be a new mean, but that's where the minimum will also be.

26. Please work through this on Wednesday in labs (and also follow the lab notes).

27. You can use a longer sequence, with a different range.

28. Please let me know if you have any questions or if you need help.
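If you want to double-check your spreadsheet, the main part of the experiment can be mirrored in Python. This is only a sketch alongside the Excel steps, not a replacement for them. The variable names are mine, and the finer grid in steps of 0.1 is an extra that is not part of the lab.

```python
scores = [1, 2, 3, 2, 4]               # the data in A1:A5
mean = sum(scores) / len(scores)       # E1

# Column B: deviations about the mean; their sum should be zero.
deviations = [x - mean for x in scores]
print("sum of deviations:", round(sum(deviations), 10))

# Rows 9 to 15: sum of squared deviations about each candidate value.
candidates = [0, 1, 2, 3, 4, 5]        # A9:F9
sums = {z: sum((x - z) ** 2 for x in scores) for z in candidates}   # A15:F15
for z in candidates:
    print(z, sums[z])

# The smallest sum on this coarse grid is at the candidate nearest the mean.
best = min(sums, key=sums.get)

# Extra: a finer grid (0.0, 0.1, ..., 5.0) puts the minimum at the mean.
fine = [i / 10 for i in range(51)]
best_fine = min(fine, key=lambda z: sum((x - z) ** 2 for x in scores))
print("mean:", mean, "coarse minimum at:", best, "fine minimum at:", best_fine)
```

Change the scores and run it again: the minimum moves with the mean, just as the chart does in Excel.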

Lab notes for tomorrow will be posted in the morning.

Reading assignment for this week is Carey and Berk up to p. 168 (roughly).

Last updated: Oct 28, 2003 by Adrian German for `A113`