PSY 737.01: Descriptive and Statistical Data Analysis
This course introduces the use of computer software to manage and manipulate data sets, produce descriptive statistics, graphs, or other output that appropriately summarize patterns and relationships in the data, and produce inferential statistics that appropriately test hypotheses and support substantive interpretations and conclusions. Inferential statistics include bivariate and multivariate models. Prerequisites: PSY 769. 30 hours, 3 credits.
1. Students will gain a basic understanding of data management concepts used in the behavioral sciences.
2. Students will gain a degree of familiarity and comfort with two statistical software packages.
3. Students will gain hands-on experience running common statistical analyses.
Field, A. (2009). Discovering statistics using SPSS (3rd ed.). Thousand Oaks, CA: Sage Publications.
This course requires access to three software packages: Excel, SPSS (11.0 or higher, the current version is IBM SPSS 19.0), and R (2.12.0 or higher). You can access Excel in computer labs and will only need it for the first few weeks of the class. You can also access SPSS in most computer labs. However, this software will occupy the bulk of the course and you may find it helpful to have a copy at home. The Statistics Base Grad Pack is less expensive, but does not include logistic regression (a commonly used procedure in forensic psychology where dichotomous outcomes are common). For logistic regression, you will need the Standard Statistics Grad Pack. Finally, we will devote the last few weeks to R, a free, open-source package that implements a statistical programming language called S. You can download R at no cost from the Comprehensive R Archive Network (CRAN). (R includes logistic regression.) If you are at all uncertain about purchasing SPSS, you may want to postpone your decision until after the first class meeting. You will not need SPSS for the first two weeks.
To download R, go to the CRAN Web site and select a mirror location geographically close to you for a more efficient download. Clicking that link should give you a page that looks just like the CRAN page (it mirrors it) but has a different URL. Choose "Windows" from "precompiled binary distributions." Click "base" to download the setup file (with a name that begins with "rw" and ends with ".exe"). Depending upon the speed of your connection, this could take from a few minutes to an hour. Once you have downloaded the file, run it and follow the on-screen instructions. I recommend putting R in its own directory (folder). You can run a demo to confirm that it installed correctly. (For instance, typing 'demo()' with no quotes at the '>' prompt should open a new window with a list of demos.)
Warning: Different versions of SPSS will read the same data files (SAV) and syntax files (SPS) but not the same output files (SPO or SPV).
Each week we will use the appropriate software in class to demonstrate the material from the reading. If you have questions about the reading, bring them to class and I will try to answer them there. If you have trouble with the previous homework, we can also go over that in class. I want to leave some flexibility to use the class time in the manner you will find most useful. This course is designed such that the text supplements and supports the lectures rather than the other way around.
All assignments come due at the beginning of class for the date noted.
Grading: I will grade assignments on the following scale.
0 = Completely missing.
6 = Not missing, but clearly wrong (demonstrates poor understanding).
7 = Partially correct (demonstrates partial understanding).
8 = Almost completely correct (minor mistakes only, but mistakes indicate incomplete understanding).
9 = Completely correct (demonstrates full understanding).
10 = Clearly exemplary (demonstrates effort beyond a minimally correct response).
I will then average the component scores, divide
by 9, and multiply by 100 to assign a percent grade. (This
should never come up, but should I receive a late assignment it
will receive a grade no higher than 80%.)
The following assignments may not make much sense until you
have done the corresponding reading. If they still do not
make sense after you do the reading, ask questions in class
before the assignments come due. Chances are others have
the same questions.
Assignment 1. Make a
data file in Excel with each of your courses this semester as
cases. Include the following variables: Course
Title, Course Prefix, Course Number, Day of Week, Credits. Print
out the data set and turn in the hard copy.
Assignment 2. Read your Excel data file from Assignment 1 into SPSS and save it as an SPSS SAV file. Print both the output file displaying the SPSS commands used to read the Excel file and a copy of the SPSS data file. (Recall the warning given above. Old and new versions of SPSS tend to be compatible with respect to SAV and SPS files, but not output files. If you save your work in one location to print it in another, you may encounter difficulties opening the file. You can save SPSS output as HTML or use a free PDF print driver to save it as a PDF file. If you need to continue working, the best strategy may be to save everything as syntax and then recreate the entire output by running the full syntax at the second location.)
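If you prefer syntax to the menus, reading an Excel file can be scripted with GET DATA. In the sketch below, the path, file name, and sheet name are placeholders only; substitute your own:

```spss
* Placeholders throughout: adjust the path, file name, and sheet name.
GET DATA
  /TYPE=XLS
  /FILE='C:\data\courses.xls'
  /SHEET=NAME 'Sheet1'
  /READNAMES=ON .
SAVE OUTFILE='C:\data\courses.sav' .
```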
Assignment 3. Add start-hour and
start-minute variables to your data set. Code start hour
using a 24-hour clock with midnight as zero and noon as
12. Code start minute as a number between 0 and 59.
Paste and run a Compute command that computes start time in
minutes counting from midnight (= zero, not 1440). Compute
an end-time variable from your start-time variable. Use a
conditional compute to make sure that any courses starting after
10 PM have an end time less than 24 * 60 = 1440. Write the
compute statements in a general way that would work for any
start time values. Paste the transformation(s) and save
them all in one SPSS syntax file. Turn in both your syntax
and your resulting data set. (Many students find this the
hardest assignment. You may find it helpful to draw a flow chart
of your computations and test it out with some examples to
confirm that it gives the correct answer. I strongly recommend
that you also test your SPSS code by creating a fake course that
starts after 10 PM. Failure to handle post 10 PM courses
correctly is a very common mistake.)
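As a starting point, the time computations might be sketched in syntax as below. The variable names and the 75-minute class length are assumptions, not part of the assignment; adapt them to your own data:

```spss
* Partial sketch; StartHour, StartMinute, and the 75-minute
* class length are assumed -- adapt them to your own data.
COMPUTE StartTime = StartHour * 60 + StartMinute .
COMPUTE EndTime = StartTime + 75 .
IF (EndTime GE 1440) EndTime = EndTime - 1440 .
EXECUTE .
```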
Assignment 4. Make a data set with 50 test scores ranging from 0 to
10. Compute descriptive statistics for the test scores
including the skewness. Recode the test scores into a new
variable called "pass" such that scores of 6 or higher
correspond to a 1 and lower scores correspond to a zero.
Run a crosstabulation of the new and old variables to check your
work. Save the SPSS output (SPV on new versions, SPO on
old versions) file containing both parts of the assignment. Turn
in a printed copy of the SPSS output file (I do not need your
data file or syntax file). Be sure that you have SPSS set to
include your commands in the output (this is typically the
default but can be shut off).
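If you use syntax rather than the menus, the descriptives, recode, and check might be sketched as below; the variable name Score is an assumption:

```spss
* Sketch only; "Score" is an assumed variable name.
DESCRIPTIVES VARIABLES=Score
  /STATISTICS=MEAN STDDEV MIN MAX SKEWNESS .
RECODE Score (6 THRU HIGHEST = 1) (LOWEST THRU 5 = 0) INTO Pass .
EXECUTE .
CROSSTABS /TABLES=Score BY Pass .
```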
Assignment 5. Make a data set with 180 cases. Make a group variable that
divides the 180 cases into three groups of 60 by assigning the
value 1 to 60 cases, 2 to 60 cases, and 3 to 60 cases.
Call this variable "iv1." Use the RV.NORMAL function to
compute values for a dependent variable called "dv." Give
all three groups a standard deviation of 10. Give group 1
a mean of zero, group 2 a mean of 7 and group 3 a mean of
9. Compute an "iv2" variable that evenly divides each iv1
group (1, 2, and 3) in half with values of zero and 1.
Only for cases with iv2 = 1, add 5 to the dv in the iv1 = 1
group and subtract 1 from the dv in the iv1 = 3 group (both of
these apply only to iv2 = 1). Run descriptives on your
three variables. Confirm that you have means consistent
with the table below (see note below table). Then run t tests
comparing each pair of groups defined by iv1. Finally, run
a one-way ANOVA for iv1 with dv as the DV. Copy
and paste your SPSS output into a word processing program.
Insert a few sentences describing the results. List
the appropriate p-values for each t test and interpret
them. Save your data for future use.
Turn in the annotated output.

Expected cell means for dv (implied by the simulation above, within sampling error):

iv1    iv2 = 0    iv2 = 1
 1        0          5
 2        7          7
 3        9          8
Note: 30 cases per cell is a common rule of thumb for
factorial designs. However, it still allows for a lot of
sampling variability. Therefore, a good strategy to check your
work is as follows. First, simulate the data with N = 18000
(3000 per cell) instead of N = 180 (30 per cell). Then confirm
the means in the above table using your N=18000 data set. When
the means match closely, then use the same computations to rerun
the simulation with N = 180 and use the smaller data set for the
Assignment 6. Use the data set from Assignment 5. Run a 3x2 ANOVA using
both iv variables to predict the dv variable. Copy and paste the
output into a word processing program and insert a few sentences
describing the results. List the p value for the F tests
and interpret the results of the ANOVA with respect to the two
main effects and the interaction. Print the annotated output.
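In syntax form, the factorial ANOVA amounts to a single GLM command, sketched here:

```spss
* Minimal sketch of the 3x2 factorial ANOVA.
UNIANOVA dv BY iv1 iv2
  /PRINT=DESCRIPTIVE
  /DESIGN=iv1 iv2 iv1*iv2 .
```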
Assignment 7. Use SPSS RV.NORMAL to create three random variables with means
of zero and SDs of one. N = 100. Compute a variable called
z that equals the first random variable plus 6. Compute a
variable called x which equals z plus the second random variable
and compute a variable called y which equals z plus the third
random variable. Use the WITH syntax to create a
correlation matrix with the random variables across the columns
and x, y and z down the rows. Next, compute a correlation
matrix for just x, y and z. Then run a linear regression
predicting y from x and z. Annotate the output to describe
(a) the correlations between z and the second two random
variables, (b) the correlation between x and y, and (c) the
regression coefficient for x predicting y with z in the
equation. Print the annotated output.
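A syntax sketch of the steps above. The names r1, r2, and r3 for the three random variables are assumptions, and the sketch assumes the 100 cases already exist:

```spss
* Sketch; r1, r2, r3 are assumed names, and 100 cases must already exist.
COMPUTE r1 = RV.NORMAL(0, 1) .
COMPUTE r2 = RV.NORMAL(0, 1) .
COMPUTE r3 = RV.NORMAL(0, 1) .
COMPUTE z = r1 + 6 .
COMPUTE x = z + r2 .
COMPUTE y = z + r3 .
EXECUTE .
CORRELATIONS /VARIABLES=x y z WITH r1 r2 r3 .
CORRELATIONS /VARIABLES=x y z .
REGRESSION /DEPENDENT y /METHOD=ENTER x z .
```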
Assignment 8. Use SPSS to open the survey_sample.sav data set from SPSS. (This file can be found inside the IBM SPSS program directory; a typical path begins with C:/program.)
Look at the variable labels in the variable-view window so that
you know the questions. Run a factor analysis with the
following variables: educ, paeduc, maeduc, speduc, confinan,
conbus, coneduc, conpress, conmedic, contv. Use the following
options: univariate descriptive statistics, a scree plot,
direct oblimin rotation, factor loading plot, let SPSS choose
the number of factors. Look at the scree plot and compare
it to the number of factors in the solution chosen by SPSS that
appears below the scree plot. Insert a text box comparing
these. Look at the Component Matrix output for the
unrotated solution and the Pattern Matrix output for the rotated
solution. Insert a text box describing the pattern of
loadings for each solution. Consider which survey items
(i.e., variables) load similarly in each solution. Look in
the Component Correlation Matrix to see the correlation between
the two factors and describe this in a text box. Next, run
SPSS Reliabilities for the confidence items. Select items
statistics, scale statistics, and scale statistics with each
item deleted. Look at the scale alpha value and the alpha
values with each item deleted. Copy the output to a word
processor. Add annotations describing the output noted above.
Indicate if deletion of any items would improve the scale
reliability. Print the annotated output.
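Selecting the menu options listed above pastes syntax along these lines. Treat this as an approximate sketch rather than the exact output of your menu choices; details may differ across SPSS versions:

```spss
* Approximate sketch of the pasted syntax; details vary by version.
FACTOR
  /VARIABLES educ paeduc maeduc speduc confinan conbus coneduc
      conpress conmedic contv
  /PRINT UNIVARIATE INITIAL EXTRACTION ROTATION
  /PLOT EIGEN ROTATION
  /CRITERIA MINEIGEN(1)
  /EXTRACTION PC
  /ROTATION OBLIMIN .
RELIABILITY
  /VARIABLES=confinan conbus coneduc conpress conmedic contv
  /STATISTICS=DESCRIPTIVE SCALE
  /SUMMARY=TOTAL .
```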
Assignment 9. In R, use the c() function to assign to a variable named Artist
a list of your five favorite recording artists (use quotes for
string values). Use the same method to assign
corresponding values of the number of CDs, MP3s, or other
recordings you own by each artist to a variable called CD.
Similarly, create a variable called Time indicating how many
years you have listened to each artist, and a variable called
Rating indicating, on a scale of 1 to 10 where ten is highest,
how much you like each artist's music. Assign all four
variables to a data frame called Music. Type "Music" to
have R show the data set on the screen. Use the summary()
function to print a summary of Music to the screen. Take a
moment to appreciate the fact that you just did something
relatively few psychologists know how to do. Use "Save to
file" from the File menu to save your R session as a text file
and print it.
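One way the steps above might look, with made-up example values (the artists, counts, years, and ratings below are placeholders, not suggestions):

```r
# Placeholder values for illustration only -- substitute your own.
Artist <- c("Artist A", "Artist B", "Artist C", "Artist D", "Artist E")
CD     <- c(12, 8, 30, 5, 2)    # recordings owned by each artist
Time   <- c(10, 3, 15, 1, 7)    # years listening to each artist
Rating <- c(9, 7, 10, 6, 8)     # liking, on a 1-10 scale
Music  <- data.frame(Artist, CD, Time, Rating)
Music           # display the data frame on the screen
summary(Music)  # summary statistics for each variable
```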
Assignment 10. Use c(), scan(), or any method you prefer to create two vectors
of numbers. In the first vector, rate the difficulty of
each of the first 10 assignments on a scale from 1 = "What a
breeze" to 10 = "I thought I would die." Call this vector
Actual. Now rate each of the assignments again for how
difficult you expected them to be before you did them. Use
the same scale and call this vector Expected. Type the
name of each variable to check for errors in the data. Use
the boxplot() function to have a look at your data (do not worry
that the plot opens in a separate window). Run a
paired-sample t test
(hint: use the paired parameter).
Next, make a new variable called diff1 that contains actual
- expected for the first 5 assignments and diff2 that contains
the same quantity for assignments 6 to 10. Print the
variables to the screen to check your work. Now use the
var.test() function to test for equal variances in the two sets
of ratings (i.e., diff1 and diff2). Next, use t.test() to
run an independent-samples t
test two ways: first with equal variances assumed and then
without. (Hint: Use the var.equal parameter.)
Save your R session to a text file. Copy and
paste the graph to the word processing document by right
clicking and copying as a bitmap. (For better results, use the
File menu to save as a 100% quality jpeg file, then insert the
saved image.) Add a paragraph describing the
results (use a word processing program or text editing
program). Based on the first analysis, on
average, did you find the assignments easier or harder than you
expected, or about the same? Based on the
second analysis, indicate which t test you would use, on the basis of the F test, and interpret the
result of that t
test. Print the edited output file including the graph.
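As a sketch of the analysis sequence, with invented placeholder ratings (enter your own ten ratings instead):

```r
# Invented ratings for illustration; substitute your own ten ratings.
Actual   <- c(2, 3, 5, 7, 4, 6, 8, 5, 3, 6)
Expected <- c(3, 4, 4, 8, 6, 5, 9, 7, 4, 5)
boxplot(Actual, Expected, names = c("Actual", "Expected"))
t.test(Actual, Expected, paired = TRUE)   # paired-sample t test
diff1 <- Actual[1:5]  - Expected[1:5]     # assignments 1-5
diff2 <- Actual[6:10] - Expected[6:10]    # assignments 6-10
var.test(diff1, diff2)                    # F test for equal variances
t.test(diff1, diff2, var.equal = TRUE)    # equal variances assumed
t.test(diff1, diff2)                      # Welch test (the default)
```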
Assignment 11. Open the juges.sav file from SPSS as a data frame in R. According to the IBM SPSS Statistics Brief Guide, each of the rows represents a gymnastics performance. Each of the first 7 columns represents scores from a professional judge. The eighth column represents scores from an enthusiast. The data are hypothetical. The remainder of the assignment assumes that you have read the data as an R data frame called Judges.df. You can use the following function to test your work (because a mistake here will cause problems in everything that follows):

is.data.frame(Judges.df)

If the function fails to find an object called Judges.df, you probably read the data into an object with a different name. If the function returns FALSE, then you probably forgot to read the data as a data frame. Remember that any changes you make after attaching the data will not be reflected in the attached data; it is like a snapshot. Next, use the following code to format the data that you just read. The code modifies the SPSS variable labels and uses them as variable names in the data frame. Running the lines one at a time will make it easier to follow what they are doing.
names(Judges.df)                                # See the default names
attr(Judges.df, "variable.labels")              # See the variable labels
make.names(attr(Judges.df, "variable.labels"))  # Remove spaces
names(Judges.df) <- make.names(attr(Judges.df, "variable.labels"))
names(Judges.df)                                # See the new names
Use the pairs() function to draw a scatterplot matrix of all the judges' scores. Copy the graph into a word processing program, and add some text describing what you see in the graph. Overall, how would you describe the strength, direction, and shape of the relationships? How do the enthusiast scores differ from the other scores? Do all the relationships look linear?
Use the following code to compute a vector of unique correlations between sets of scores.

JudgeCors <- cor(Judges.df)   # Correlation matrix of all eight sets of scores
JudgeTri <- upper.tri(JudgeCors, diag=FALSE)
JudgeCorList <- JudgeCors[JudgeTri]
Use the stem() function to draw a stem-and-leaf plot of
these correlations and copy it into your word processing
document. Which set of correlations stands out from the rest?
Compute a new R object called Professional that equals the
average of the scores of the first 7 judges (all but
enthusiast). Remember that a data frame is like a matrix: The
first column is Judges.df[,1] and the first two rows are
Judges.df[1:2,]. Also, the function rowMeans(...) will compute a
vector containing the mean of each row of a matrix. Using this
function can make this step much easier.
You can test your work with the following line of code. The result should be TRUE. If you get FALSE, you made a mistake computing Professional.

isTRUE(all.equal(Professional, rowMeans(Judges.df[, 1:7])))
Use the following code to plot each set of scores against
the Professional average. (Use help('for') to obtain a
description of the for() function.)
# Set some parameters
CEX <- c(rep(.5, 4), rep(.25, 4))
LWD <- 3
COL <- rainbow(8)
PCH <- c(15:18, 22:25)
LIM <- c(6.75, 10.25)
# Draw an empty plot
plot(7:10, 7:10, type='n', xlab='Professional Average', ylab='',
     xlim=LIM, ylim=LIM)
# Use a loop to draw each set of points onto the plot
for(Rep in 1:8)
  points(Professional, Judges.df[,Rep], col=COL[Rep], pch=PCH[Rep], cex=CEX[Rep])
legend('bottomright', legend = names(Judges.df), lwd=LWD, col=COL[1:8], cex=.75, pch=PCH)
Use the title() function to add a title to the plot. Finally, use the abline() function to draw a diagonal line (lower left to upper right) using lty=2 and lwd=1.
Copy the resulting graph into your word processing document.
Which country's judges tend to be more lenient in their scoring?
Which country's judges tend to be more stringent? Italy and
Enthusiast fall closest to the diagonal line, but what is the
big difference between these two sets of scores?
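The finishing touches might look like this; the title text is only a placeholder:

```r
title('Judge Scores Against the Professional Average')  # placeholder title
abline(0, 1, lty = 2, lwd = 1)  # diagonal y = x reference line
```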
The Examinations will contain tasks similar to the assignments but in some cases a little more complex. Each examination covers the chapters indicated on the course schedule. Examinations come due at the start of class because we will discuss them in class after you have turned them in. On average, it takes about four hours to complete a take-home examination. Take-home examinations will not be accepted unless a student has returned a signed copy of the take-home examination agreement.
Make up any study you like involving at least four variables. Be sure to include each of the following sections in your paper. Use this as a checklist before you turn in your paper. The paper is relatively short, but writing concisely can require more work than writing something long. Give yourself plenty of time to review and revise what you write. Make sure that everything is presented clearly and accurately.
1. APA formatted title page. Identify your thesis advisor at
the bottom of the title page. (If you do not have a thesis
advisor, identify your advisor as "None".)
2. In 1-2 pages, summarize some current literature related to your hypotheses. In a subsection entitled "Current Study" state your hypothesis or hypotheses. (Feel at liberty to use a topic area that you have researched for another course or your thesis.)
3. In a method section of 1-2 pages, summarize the research design. Do not use this section to describe data simulation. Describe the research design that you would use to collect non-simulated data.
4. Simulate some data that fits your hypothesis and analyze it (if you have questions about the appropriate analysis, ask ahead of time). In a results section, first describe the method that you used to generate the simulated data and explain how the simulation is designed to fit the hypotheses in a subsection entitled "Data Simulation" (1-2 pages). You can use any software covered in this course to create the data set. However, you must use the software to generate the values. Do not simply make up numbers in your head and type the entire data set into the software. Then, in a subsection entitled "Data Analysis", describe the analysis and findings. If you do not obtain statistically significant results that match your hypotheses, tweak the simulation until you successfully simulate the hypothesized relationships (2-3 pages).
5. Attach as an appendix a code book containing the following columns and rows for each variable in your study. Include an identification number, at least two demographic variables, and the four or more variables from your study as variables in the code book. For each variable, give the name in one column, a brief description in a second column, the possible values including missing-value codes in a third column, and any additional information required to correctly interpret the data set in a fourth column (1-2 pages).
6. Attach a second appendix with a printed data set containing at least 10 representative cases with values for all the variables in your code book, including missing-value codes (1 page).
7. Attach a third appendix displaying the code used to simulate the data (1-2 pages). If you use Excel, follow these steps: Open the spreadsheet. Type <ctrl>~ (Control tilde) to display the formulas in the spreadsheet. Select the first 10 rows of the spreadsheet. Copy and paste into your word processing program. The result should be a table displaying the formulas used to simulate the variables. If you use SPSS, copy and paste your syntax file. If you use R, copy and paste your script file.
Note that this comes due before the end of the
semester. I encourage you to get started on it early in
the semester. You can turn it in early if you prefer.
However, I will grade them all together for purposes of consistency.
Your final grade comprises your examination grades (40%, 20% each), your assignment grades (40%, 3.64% each), and your paper (20%). I will return all grades in percent form, so to compute your final grade you multiply by the above percents and add them up: Final Grade = .20(E1) + .20(E2) + .20(P) + .40(A1 + A2 + A3 + A4 + A5 + A6 + A7 + A8 + A9 + A10 + A11)/11. (E means exam, P means paper, and A means assignment.) I will assign letter grades as indicated below. I will round x.5 and above up and anything below x.5 down. Note that I will do the above computation before rounding off, so your results may differ by rounding error in the reporting of individual grades.
|Letter Grade||Percent Grade|
Contact Information: Professor Markus
Hours: Tuesdays 5:45 PM to 6:15 PM, GC room 3204.02; Wednesdays 4:30 PM to 5 PM, JJ room 10.63.11; and Thursdays 2 PM to 3 PM, JJ room 10.63.11. (It usually works best to email me.)
Course Schedule:
|Week 1: 2/2|| ||Overview, data file basics, Excel basics, coding variables, code books and documentation. Data level and structure, sorting data in Excel, searching data in Excel.|
|Week 2: 2/9||Field Chapter 3 (F3). Assignment 1. Turn in your take-home exam agreement form as early as possible.||SPSS basics, SPSS data files, reading ASCII data into Excel, reading Excel files into SPSS, saving Excel files and text files from SPSS.|
|Week 3: 2/16||F5. Assignment 2.||SPSS data menu and transform menu functions. Computing, recoding, and correcting variables. Matching and merging, concatenating, and aggregating data sets. Using SPSS syntax.|
|Week 4: 2/23||F4 & Handout 2. Assignment 3.||Graphs, Frequencies, Descriptives & Crosstabs. Data cleaning and checking.|
|Week 5: 3/1||F9 & 10. (F18 optional.) Assignment 4.||Means, t tests, and one-way ANOVA.|
|Week 6: 3/8||F11 & 12. (F13 & 14 optional.) Assignment 5.||ANOVA & ANCOVA using SPSS GLM. Review.|
|Week 7: 3/15||Midterm Examination (F3-5 & 9-12, Handouts). (This is also your last chance to turn in the take-home exam agreement form.)||We will go over the exam in class. Review paper assignment (don't let me forget). Methods for simulating data.|
|Week 8: 3/22||F6 & 7. Assignment 6.||Correlation and regression in SPSS.|
|Week 9: 3/29||F17. Assignment 7.||Item analysis and factor analysis in SPSS.|
|Week 10: 4/5||Venables & Smith Sections 1, 2 & 6 (VS 1, 2 & 6). Assignment 8.||Different software, different metaphors. Finding your bearings in R. Managing data in R. (Last day to drop without academic penalty.)|
|Week 11||VS 8. Assignment 9. Paper due.||Computing basic statistics in R. Searching help for statistics in R.|
|Week 12|| ||Plotting basic graphs in R.|
|Week 13: 5/3||VS 11 (11.6 to 11.8 optional).||Estimating linear models in R. The generalized linear model (logistic regression, negative binomial, and beyond).|
|Week 14: 5/10|| ||Dealing with missing data. Review.|
|Week 15: 5/24||Final Examination (covers everything, but with emphasis on the second half of the course).||We will go over the exam in class.|