

PSY 737.01:  Descriptive and Statistical Data Analysis in Psychology
Spring 2012

Professor Keith A. Markus

Time: Thursdays 4:05-6:05 PM
Room:  436T

Course Description
This course introduces the use of computer software to manage and manipulate data sets, produce descriptive statistics, graphs, or other output that appropriately summarize patterns and relationships in the data, and produce inferential statistics that appropriately test hypotheses and support substantive interpretations and conclusions.  Inferential statistics include bivariate and multivariate models.  Prerequisites:  PSY 769.  30 hours, 3 credits.

Course Objectives
1. Students will gain a basic understanding of data management concepts used in the behavioral sciences.
2. Students will gain a degree of familiarity and comfort with two statistical software packages.
3. Students will gain hands-on experience running common statistical analyses.

Required Reading
   Field, A. (2009). Discovering statistics using SPSS (3rd ed.). Thousand Oaks, CA: Sage Publications. 

    Venables, W. N., Smith, D. N., & R Development Core Team (2003).  An introduction to R.  PDF document distributed with the R software package.

    In addition, there will be two handouts used as reading early in the course available on Blackboard.

    This course requires access to three software packages:  Excel, SPSS (11.0 or higher, the current version is IBM SPSS 19.0), and R (2.12.0 or higher).  You can access Excel in computer labs and will only need it for the first few weeks of the class.  You can also access SPSS in most computer labs. However, this software will occupy the bulk of the course and you may find it helpful to have a copy at home. The Statistics Base Grad Pack is less expensive, but does not include logistic regression (a commonly used procedure in forensic psychology where dichotomous outcomes are common). For logistic regression, you will need the Standard Statistics Grad Pack. Finally, we will devote the last few weeks to R, a free, open-source package that implements a statistical programming language called S. You can download R at no cost from the Comprehensive R Archive Network (CRAN). (R includes logistic regression.) If you are at all uncertain about purchasing SPSS, you may want to postpone your decision until after the first class meeting. You will not need SPSS for the first two weeks.

    To download R, go to the CRAN Web page.  From the side menu, select "mirrors" and then select a location geographically close to you for a more efficient download.  Clicking that link should give you a page that looks just like the CRAN page (it mirrors it) but has a different URL.  Choose "Windows" from "precompiled binary distributions."  Click "base" to download the setup file (with a name that begins with "rw" and ends with ".exe").  Depending upon the speed of your connection, this could take from a few minutes to an hour.  Once you have downloaded the file, run it and follow the on-screen instructions.  I recommend putting R in its own directory (folder).  You can run a demo to confirm that it installed correctly.  (For instance, typing 'demo()' with no quotes at the '>' prompt should open a new window with a list of demos.)

Warning: Different versions of SPSS will read the same data files (SAV) and syntax files (SPS) but not the same output files (SPO or SPV).

Class Time
    Each week we will use the appropriate software in class to demonstrate the material from the reading.  If you have questions about the reading, bring them to class and I will try to answer them there.  If you have trouble with the previous homework, we can also go over that in class.  I want to leave some flexibility to use the class time in the manner you will find most useful. This course is designed such that the text supplements and supports the lectures rather than the other way around.


Assignments
    All assignments come due at the beginning of class on the date noted.

Grading:  I will grade assignments on the following scale.

0 = Completely missing.
6 = Not missing, but clearly wrong (demonstrates poor understanding).
7 = Partially correct (demonstrates partial understanding).
8 = Almost completely correct (minor mistakes only, but mistakes indicate incomplete understanding).
9 = Completely correct (demonstrates full understanding).
10 = Clearly exemplary (demonstrates effort beyond a minimally correct response).

I will then average the component scores, divide by 9, and multiply by 100 to assign a percent grade.  (This should never come up, but should I receive a late assignment it will receive a grade no higher than 80%.)
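As a worked example of the grade arithmetic, with hypothetical component scores:

```r
# Hypothetical component scores for four assignments
scores <- c(9, 8, 10, 7)

# Average, divide by 9, and multiply by 100 to get a percent grade
percent <- mean(scores) / 9 * 100
round(percent, 1)  # about 94.4
```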

The following assignments may not make much sense until you have done the corresponding reading.  If they still do not make sense after you do the reading, ask questions in class before the assignments come due.  Chances are others have the same questions.

  Assignment 1.  Make a data file in Excel with each of your courses this semester as cases.  Include the following variables:  Course Title, Course Prefix, Course Number, Day of Week, Credits. Print out the data set and turn in the hard copy.

  Assignment 2.  Read your Excel file from Assignment 1 into SPSS and save it as an SPSS SAV file. Print both the output file displaying the SPSS commands used to read the Excel file and a copy of the SPSS data file. (Recall the warning given above. Old and new versions of SPSS tend to be compatible with respect to SAV and SPS files, but not output files. If you save your work in one location to print it in another, you may encounter difficulties opening the file. You can save SPSS output as HTML or use a free pdf print driver to save it as a pdf file. If you need to continue working, the best strategy may be to save everything as syntax and then recreate the entire output by running the full syntax at the second location.)

  Assignment 3.  Add start-hour and start-minute variables to your data set.  Code start hour using a 24-hour clock with midnight as zero and noon as 12.  Code start minute as a number between 0 and 59.  Paste and run a Compute command that computes start time in minutes counting from midnight (= zero, not 1440).  Compute an end-time variable from your start-time variable.  Use a conditional compute to make sure that any courses starting after 10 PM have an end time less than 24 * 60 = 1440.  Write the compute statements in a general way that would work for any start time values.  Paste the transformation(s) and save them all in one SPSS syntax file.  Turn in both your syntax and your resulting data set. (Many students find this the hardest assignment. You may find it helpful to draw a flow chart of your computations and test it out with some examples to confirm that it gives the correct answer. I strongly recommend that you also test your SPSS code by creating a fake course that starts after 10 PM. Failure to handle post 10 PM courses correctly is a very common mistake.)
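The wrap-around logic can be sketched in R (the variable names and the 115-minute class length here are hypothetical; the assignment itself asks for SPSS COMPUTE and IF statements):

```r
# Hypothetical start times: 4:05 PM and 10:30 PM
StartHour <- c(16, 22)
StartMinute <- c(5, 30)

# Minutes counting from midnight (midnight = 0)
StartTime <- StartHour * 60 + StartMinute

# End time 115 minutes later, wrapped so a post-10-PM course stays below 1440
EndTime <- (StartTime + 115) %% 1440

StartTime  # 965 1350
EndTime    # 1080   25  (the late course wraps past midnight)
```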

  Assignment 4.  Make a data set with 50 test scores ranging from 0 to 10.  Compute descriptive statistics for the test scores including the skewness.  Recode the test scores into a new variable called "pass" such that scores of 6 or higher correspond to a 1 and lower scores correspond to a zero.  Run a crosstabulation of the new and old variables to check your work.  Save the SPSS output (SPV on new versions, SPO on old versions) file containing both parts of the assignment. Turn in a printed copy of the SPSS output file (I do not need your data file or syntax file). Be sure that you have SPSS set to include your commands in the output (this is typically the default but can be shut off).
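The recode logic, shown here in R for illustration (the assignment uses the SPSS RECODE command), is just a threshold, and the crosstab check looks like this:

```r
# Hypothetical test scores
score <- c(3, 6, 9, 5, 10)

# 1 = pass (6 or higher), 0 = fail
pass <- as.integer(score >= 6)

# Crosstabulate new against old to check the recode
table(pass, score)
```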

  Assignment 5.  Make a data set with 180 cases.  Make a group variable that divides the 180 cases into three groups of 60 by assigning the value 1 to 60 cases, 2 to 60 cases, and 3 to 60 cases.  Call this variable "iv1."  Use the RV.NORMAL function to compute values for a dependent variable called "dv."  Give all three groups a standard deviation of 10.  Give group 1 a mean of zero, group 2 a mean of 7 and group 3 a mean of 9.  Compute an "iv2" variable that evenly divides each iv1 group (1, 2, and 3) in half with values of zero and 1.  Only for cases with iv2 = 1, add 5 to the dv in the iv1 = 1 group and subtract 1 from the dv in the iv1 = 3 group (both of these apply only to iv2 = 1).  Run descriptives on your three variables.  Confirm that you have means consistent with the table below (see note below table). Then run t tests comparing each pair of groups defined by iv1.  Finally, run a one-way ANOVA for iv1 with dv as the DV.  Copy and paste your SPSS output into a word processing program. Insert a few sentences describing the results.  List the appropriate p-values for each t test and interpret them.  Save your data for future use.  Turn in the annotated output.

Expected approximate cell means for dv:

              iv2 = 0    iv2 = 1
  iv1 = 1        0          5
  iv1 = 2        7          7
  iv1 = 3        9          8


Note: 30 cases per cell is a common rule of thumb for factorial designs. However, it still allows for a lot of sampling variability. Therefore, a good strategy to check your work is as follows. First, simulate the data with N = 18000 (3000 per cell) instead of N = 180 (30 per cell). Then confirm the means in the above table using your N = 18000 data set. When the means match closely, use the same computations to rerun the simulation with N = 180 and use the smaller data set for the assignment.
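In R the same simulation can be sketched as follows (rnorm() plays the role of SPSS RV.NORMAL; the names here are hypothetical):

```r
set.seed(1)  # for reproducibility

n <- 180
iv1 <- rep(1:3, each = n / 3)                  # three groups of 60
iv2 <- rep(rep(0:1, each = n / 6), times = 3)  # halves each iv1 group

# Group means 0, 7, and 9 with a common SD of 10
dv <- rnorm(n, mean = c(0, 7, 9)[iv1], sd = 10)

# Conditional adjustments for iv2 = 1 cases only
dv[iv1 == 1 & iv2 == 1] <- dv[iv1 == 1 & iv2 == 1] + 5
dv[iv1 == 3 & iv2 == 1] <- dv[iv1 == 3 & iv2 == 1] - 1

tapply(dv, list(iv1, iv2), mean)  # compare to the expected cell means
```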

  Assignment 6.  Use the data set from Assignment 5.  Run a 3x2 ANOVA using both iv variables to predict the dv variable. Copy and paste the output into a word processing program and insert a few sentences describing the results.  List the p value for the F tests and interpret the results of the ANOVA with respect to the two main effects and the interaction.  Print the annotated output.

   Assignment 7.  Use SPSS RV.NORMAL to create three random variables with means of zero and SDs of one.  N = 100. Compute a variable called z that equals the first random variable plus 6.  Compute a variable called x which equals z plus the second random variable and compute a variable called y which equals z plus the third random variable.  Use the WITH syntax to create a correlation matrix with the random variables across the columns and x, y and z down the rows.  Next, compute a correlation matrix for just x, y and z.  Then run a linear regression predicting y from x and z.  Annotate the output to describe (a) the correlations between z and the second two random variables, (b) the correlation between x and y, and (c) the regression coefficient for x predicting y with z in the equation.  Print the annotated output.
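An R sketch of the same construction (rnorm() in place of RV.NORMAL; all names hypothetical) shows why the regression coefficient behaves as it does:

```r
set.seed(2)
n <- 100
r1 <- rnorm(n); r2 <- rnorm(n); r3 <- rnorm(n)

z <- r1 + 6   # shared component
x <- z + r2   # x and y each add independent noise to z
y <- z + r3

cor(cbind(x, y, z))  # x and y correlate only through z
coef(lm(y ~ x + z))  # with z in the equation, x adds little
```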

   Assignment 8.  Use SPSS to open the survey_sample.sav data set from SPSS. (This file can be found inside the IBM SPSS program directory. A typical path may look like this: C:/program files/IBM/SPSS/Statistics/19/Samples/English/survey_sample.sav.) Look at the variable labels in the variable-view window so that you know the questions.  Run a factor analysis with the following variables: educ, paeduc, maeduc, speduc, confinan, conbus, coneduc, conpress, conmedic, contv. Use the following options:  univariate descriptive statistics, a scree plot, direct oblimin rotation, factor loading plot, let SPSS choose the number of factors.  Look at the scree plot and compare it to the number of factors in the solution chosen by SPSS that appears below the scree plot.  Insert a text box comparing these.  Look at the Component Matrix output for the unrotated solution and the Pattern Matrix output for the rotated solution.  Insert a text box describing the pattern of loadings for each solution.  Consider which survey items (i.e., variables) load similarly in each solution.  Look in the Component Correlation Matrix to see the correlation between the two factors and describe this in a text box.  Next, run SPSS Reliabilities for the confidence items.  Select item statistics, scale statistics, and scale statistics with each item deleted.  Look at the scale alpha value and the alpha values with each item deleted.  Copy the output to a word processor. Add annotations describing the output noted above. Indicate if deletion of any items would improve the scale reliability.  Print the annotated output.

   Assignment 9.  In R, use the c() function to assign to a variable named Artist a list of your five favorite recording artists (use quotes for string values).  Use the same method to assign corresponding values of the number of CDs, MP3s, or other recordings you own by each artist to a variable called CD.  Similarly, create a variable called Time indicating how many years you have listened to each artist, and a variable called Rating indicating, on a scale of 1 to 10 where ten is highest, how much you like each artist's music.  Assign all four variables to a data frame called Music.  Type "Music" to have R show the data set on the screen.  Use the summary() function to print a summary of Music to the screen.  Take a moment to appreciate the fact that you just did something relatively few psychologists know how to do.  Use "Save to file" from the File menu to save your R session as a text file and print it.
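A minimal sketch of the c()/data.frame()/summary() workflow, with made-up values (the assignment asks for your own):

```r
# Hypothetical entries; quotes are needed for string values
Artist <- c("Artist A", "Artist B", "Artist C")
CD     <- c(12, 4, 7)
Time   <- c(10, 3, 6)
Rating <- c(9, 7, 8)

Music <- data.frame(Artist, CD, Time, Rating)
Music            # show the data set on the screen
summary(Music)   # print a summary of each variable
```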

   Assignment 10.  Use c(), scan(), or any method you prefer to create two vectors of numbers.  In the first vector, rate the difficulty of each of the first 10 assignments on a scale from 1 = "What a breeze" to 10 = "I thought I would die."  Call this vector Actual.  Now rate each of the assignments again for how difficult you expected them to be before you did them.  Use the same scale and call this vector Expected.  Type the name of each variable to check for errors in the data.  Use the boxplot() function to have a look at your data (do not worry that the plot opens in a separate window).  Run a paired-sample t test (hint:  use the paired parameter).

Next, make a new variable called diff1 that contains actual - expected for the first 5 assignments and diff2 that contains the same quantity for assignments 6 to 10.  Print the variables to the screen to check your work.  Now use the var.test() function to test for equal variances in the two sets of ratings (i.e., diff1 and diff2).  Next, use t.test() to run an independent-samples t test two ways:  first with equal variances assumed and then without.  (Hint:  Use the var.equal parameter.)  Save your R session to a text file.  Copy and paste the graph to the word processing document by right clicking and copying as a bitmap. (For better results, use the File menu to save as a 100% quality jpeg file, then insert the saved image.)  Add a paragraph describing the results (use a word processing program or text editing program).  Based on the first analysis, on average, did you find the assignments easier or harder than you expected, or about the same?  Based on the second analysis, indicate which t test you would use, on the basis of the F test, and interpret the result of that t test.  Print the edited output file including the graph.
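The function calls involved can be sketched with hypothetical ratings (the assignment asks for your own):

```r
# Hypothetical difficulty ratings for 10 assignments
Actual   <- c(2, 4, 7, 5, 6, 8, 3, 9, 6, 7)
Expected <- c(3, 3, 5, 6, 6, 7, 4, 8, 7, 6)

t.test(Actual, Expected, paired = TRUE)  # paired-samples t test

diff1 <- (Actual - Expected)[1:5]        # first five assignments
diff2 <- (Actual - Expected)[6:10]       # assignments 6 to 10

var.test(diff1, diff2)                   # F test for equal variances
t.test(diff1, diff2, var.equal = TRUE)   # equal variances assumed
t.test(diff1, diff2, var.equal = FALSE)  # Welch test (not assumed)
```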

    Assignment 11.  Open the juges.sav file from SPSS as a data frame in R. According to the IBM SPSS Statistics Brief Guide, each of the rows represents a gymnastics performance. Each of the first 7 columns represents scores from a professional judge. The eighth column represents scores from an enthusiast. The data is hypothetical. The remainder of the assignment assumes that you have read the data as an R data frame called Judges.df. You can use the following function to test your work (because a mistake here will cause problems later).

is.data.frame(Judges.df)  # should print TRUE

If the function fails to find an object called Judges.df, you probably read the data into an object with a different name. If the function returns FALSE, then you probably forgot to read the data as a data frame. Remember that any changes you make after attaching the data will not be reflected in the attached copy; attaching takes a snapshot of the data. Next, use the following code to format the data that you just read. The code modifies the SPSS variable labels and uses them as variable names in the data frame. Running the lines one at a time will make it easier to follow what they are doing.

names(Judges.df) # See the default names
attr(Judges.df, "variable.labels") # See the variable labels
make.names(attr(Judges.df, "variable.labels")) # Remove spaces
names(Judges.df) <- make.names(attr(Judges.df, "variable.labels"))
names(Judges.df) # See the new names

Use the pairs() function to draw a scatterplot matrix of all the judges' scores. Copy the graph into a word processing program, and add some text describing what you see in the graph. Overall, how would you describe the strength, direction, and shape of the relationships? How do the enthusiast scores differ from the other scores? Do all the relationships look linear?

Use the following code to compute a vector of unique correlations between sets of scores.

JudgeCors <- cor(Judges.df, use='complete.obs')
JudgeTri <- upper.tri(JudgeCors, diag=FALSE)
JudgeCorList <- JudgeCors[JudgeTri]

Use the stem() function to draw a stem-and-leaf plot of these correlations and copy it into your word processing document. Which set of correlations stands out from the rest?

Compute a new R object called Professional that equals the average of the scores of the first 7 judges (all but enthusiast). Remember that a data frame is like a matrix: The first column is Judges.df[,1] and the first two rows are Judges.df[1:2,]. Also, the function rowMeans(...) will compute a vector containing the mean of each row of a matrix. Using this function can make this step much easier.
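For example, on a small hypothetical matrix:

```r
m <- matrix(1:6, nrow = 2)  # columns are filled first
m
rowMeans(m)  # mean of each row: 3 and 4
```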


You can test your work with the following line of code. The result should be TRUE. If you get FALSE, you made a mistake computing Professional.

all.equal(length(Professional), length(Judges.df[,1]))

Use the following code to plot each set of scores against the Professional average. (Use help('for') to obtain a description of the for() function.)

# Set some parameter values
CEX <- c(rep(.5, 4), rep(.25, 4))
LWD <- 3
COL <- rainbow(8)
PCH <- c(15:18, 22:25)
LIM <- c(6.75, 10.25)

# Draw an empty plot
plot(7:10, 7:10, type='n', xlab='Professional Average', ylab='',
 xlim=LIM, ylim=LIM)

# Use a loop to draw points and lines onto the plot
# (the braces are needed so that both lines run on every pass)
for(Rep in 1:8) {
   points(Professional, Judges.df[,Rep], col=COL[Rep], pch=PCH[Rep], cex=CEX[Rep])
   lines(lowess(Judges.df[,Rep] ~ Professional), col=COL[Rep])
}
legend('bottomright', legend=names(Judges.df), lwd=LWD, col=COL[1:8], cex=.75, pch=PCH)

Use the title() function to add a title to the plot. Finally, use the abline() function to draw a diagonal line (lower left to upper right) using lty=2 and lwd=1.

Copy the resulting graph into your word processing document. Which country's judges tend to be more lenient in their scoring? Which country's judges tend to be more stringent? Italy and Enthusiast fall closest to the diagonal line, but what is the big difference between these two sets of scores?

Examinations
    The examinations will contain tasks similar to the assignments, but in some cases a little more complex.  Each examination covers the chapters indicated on the course schedule.  Examinations come due at the start of class because we will discuss them in class after you have turned them in.  On average, a take-home examination takes about four hours to complete.  Take-home examinations will not be accepted unless a student has returned a signed copy of the take-home examination agreement.

Paper
    Make up any study you like involving at least four variables.  Be sure to include each of the following sections in your paper, and use this list as a checklist before you turn the paper in.  The paper is relatively short, but writing concisely can require more work than writing something long.  Give yourself plenty of time to review and revise what you write.  Make sure that everything is presented clearly and accurately.

1. APA formatted title page. Identify your thesis advisor at the bottom of the title page. (If you do not have a thesis advisor, identify your advisor as "None".)
2. In 1-2 pages, summarize some current literature related to your hypotheses. In a subsection entitled "Current Study" state your hypothesis or hypotheses. (Feel at liberty to use a topic area that you have researched for another course or your thesis.)
3. In a method section of 1-2 pages, summarize the research design. Do not use this section to describe data simulation. Describe the research design that you would use to collect non-simulated data.
4. Simulate some data that fits your hypothesis and analyze it (if you have questions about the appropriate analysis, ask ahead of time).  In a results section, first describe the method that you used to generate the simulated data and explain how the simulation is designed to fit the hypotheses in a subsection entitled "Data Simulation" (1-2 pages). 
You can use any software covered in this course to create the data set. However, you must use the software to generate the values. Do not simply make up numbers in your head and type the entire data set into the software. Then, in a subsection entitled "Data Analysis", describe the analysis and findings. If you do not obtain statistically significant results that match your hypotheses, tweak the simulation until you successfully simulate the hypothesized relationships (2-3 pages).
5. Attach as an appendix a code book containing the following columns and rows for each variable in your study.  Include an identification number, at least two demographic variables, and the four or more variables from your study as variables in the code book.  For each variable, give the name in one column, a brief description in a second column, the possible values including missing-value codes in a third column, and any additional information required to correctly interpret the data set in a fourth column (1-2 pages). 
6. Attach a second appendix with a printed data set containing at least 10 representative cases with values for all the variables in your code book, including missing-value codes (1 page).
7. Attach a third appendix displaying the code used to simulate the data (1-2 pages). If you use Excel, follow these steps: Open the spreadsheet. Type <ctrl>~ (Control tilde) to display the formulas in the spreadsheet. Select the first 10 rows of the spreadsheet. Copy and paste into your word processing program. The result should be a table displaying the formulas used to simulate the variables. If you use SPSS, copy and paste your syntax file. If you use R, copy and paste your script file. 

Note that this comes due before the end of the semester.  I encourage you to get started on it early in the semester.  You can turn it in early if you prefer. However, I will grade them all together for purposes of consistency.

Your final grade comprises your examination grades (40%, 20% each), your assignment grades (40%, 3.64% each), and your paper (20%).   I will return all grades in percent form, so to compute your final grade you multiply by the above percents and add them up:  Final Grade = .20(E1) + .20(E2) + .20(P) + .40(A1 + A2 + A3 + A4 + A5 + A6 + A7 + A8 + A9 + A10 + A11)/11.  (E means exam, P means paper, and A means assignment.)  I will assign letter grades as indicated below.  I will round x.5 and above up and anything below x.5 down.  Note that I will do the above computation before rounding off, so your results may differ by rounding error in the reporting of individual grades.

Letter Grade  Percent Grade
A 95-100
A- 90-94
B+ 85-89
B 80-84
B- 75-79
C+ 70-74
C 65-69
C- 60-64
F 0-59


Contact Information: Professor Markus

  Office Hours:  Tuesdays 5:45 PM to 6:15 PM, GC room 3204.02; Wednesdays 4:30 PM to 5:00 PM, JJ room 10.63.11; and Thursdays 2:00 PM to 3:00 PM, JJ room 10.63.11.
(It usually works best to email me.)

  Office:  10.63.11

  Phone:  212-237-8784

  Email:  KMarkus@AOL.COM



Course Schedule (Section 01)

Week 1: 2/2
  Reading/Assignments: Handout 1.
  Topics: Overview, data file basics, Excel basics, coding variables, code books and documentation.  Data level and structure, sorting data in Excel, searching data in Excel.

Week 2: 2/9
  Reading/Assignments: Field Chapter 3 (F3).  Assignment 1.  Turn in your take-home exam agreement form as early as possible.
  Topics: SPSS basics, SPSS data files, reading ASCII data into Excel, reading Excel files into SPSS, saving Excel files and text files from SPSS.

Week 3: 2/16
  Reading/Assignments: F5.  Assignment 2.
  Topics: SPSS data menu and transform menu functions.  Computing, recoding, and correcting variables.  Matching and merging, concatenating, and aggregating data sets.  Using SPSS syntax.

Week 4: 2/23
  Reading/Assignments: F4 & Handout 2.  Assignment 3.
  Topics: Graphs, frequencies, descriptives, and crosstabs.  Data cleaning and checking.  Checking transformations.

Week 5: 3/1
  Reading/Assignments: F9 & 10. (F18 optional.)  Assignment 4.
  Topics: Means, t tests, and one-way ANOVA.

Week 6: 3/8
  Reading/Assignments: F11 & 12. (F13 & 14 optional.)  Assignment 5.
  Topics: ANOVA & ANCOVA using SPSS GLM.  Review.

Week 7: 3/15
  Midterm Examination (F3-5 & 9-12, Handouts).  (This is also your last chance to turn in the take-home exam agreement form.)  We will go over the exam in class.  Review paper assignment (don't let me forget).  Methods for simulating data.

Week 8: 3/22
  Reading/Assignments: F6 & 7.  Assignment 6.
  Topics: Correlation and regression in SPSS.

Week 9: 3/29
  Reading/Assignments: F17.  Assignment 7.
  Topics: Item analysis and factor analysis in SPSS.

Week 10: 4/5
  Reading/Assignments: Venables & Smith Sections 1, 2 & 6 (VS 1, 2 & 6).  Assignment 8.
  Topics: Different software, different metaphors.  Finding your bearings in R.  Managing data in R.

Week 11: 4/19 (Last day to drop without academic penalty.)
  Reading/Assignments: VS 8.  Assignment 9.  Paper due.
  Topics: Computing basic statistics in R.  Searching help for statistics in R.

Week 12: 4/26
  Reading/Assignments: VS 12.  Assignment 10.
  Topics: Plotting basic graphs in R.

Week 13: 5/3
  Reading/Assignments: VS 11 (11.6 to 11.8 optional).  Assignment 11.
  Topics: Estimating linear models in R.  The generalized linear model (logistic regression, negative binomial, and beyond).

Week 14: 5/10
  Topics: Dealing with missing data.  Review.

Week 15: 5/24
  Final Examination (covers everything, but with emphasis on the second half of the course).  We will go over the exam in class.



Created 8 January 2012
Updated 26 January 2012