Introduction to Survey Analysis
Written by John F Hall   
10 Mar 2006


© Copyright  2006 John F Hall 

1:  The nature of survey data    

A sample survey studies part of a group (the sample) in order  to make inferences about the whole group (the population) from which the sample is drawn.  We usually try to ensure that the sample is representative of the population; but a sample which is known to be biased can sometimes be useful if sufficient is known about the bias to be able to make an allowance for it, or if it can be used to set an upper or lower limit to some population value. 

Social surveys study  social  entities (persons, families, business firms, political parties, clubs etc.).  The survey data consist of information about each entity in a sample selected from the whole population of such entities.  Each entity in the sample is known as a CASE.

Data are obtained by measuring, observing, or asking questions about various characteristics (height, colour of eyes, voting behaviour, number of children) of the cases in the sample.  The characteristics studied are called VARIABLES and the descriptions of the characteristics for each case (e.g. height in metres - 1.63m; colour of eyes - blue; party voted for at last election - Labour; number of children - 2) are called their VALUES.

Thus survey data typically consist of the value of each variable for each case, and can be represented in a rectangular data matrix  (just like a spreadsheet).   For example:

                 V A R I A B L E S  →  →  →

CASES  SEX     AGE  MARITAL   CLASS  VOTE     HEIGHT      HAPPY     KIDS
  ↓
1      Male    26   Single    C1     LibDem   1.82        Fairly    None
2      Female  35   Married   C2     Labour   1.63        Very      2
3      Female  56   Married   D      Tory     1.55        Very      3
4      Male    42   Divorced  AB     No vote  1.74        Not very  1
5      Female  83   Widowed   E      Refused  Don't know  Fairly    4

VALUES (each of the boxes above) …etc. etc. until all cases are entered

Spreadsheets such as Excel and survey analysis packages such as SPSS operate with just such a matrix of data values (actually just the cells in black in the main body of the table above).
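As a rough illustration of the same idea outside a spreadsheet, the example matrix can be held in Python as a list of variable names plus one row per case. This is only a sketch: the names VARIABLES, CASES, value and column are invented for the illustration and do not come from Excel or SPSS.

```python
# Column headings: one name per variable.
VARIABLES = ["SEX", "AGE", "MARITAL", "CLASS", "VOTE", "HEIGHT", "HAPPY", "KIDS"]

# One row per case, copied from the example matrix above.
CASES = [
    ["Male", 26, "Single", "C1", "LibDem", 1.82, "Fairly", "None"],
    ["Female", 35, "Married", "C2", "Labour", 1.63, "Very", 2],
    ["Female", 56, "Married", "D", "Tory", 1.55, "Very", 3],
    ["Male", 42, "Divorced", "AB", "No vote", 1.74, "Not very", 1],
    ["Female", 83, "Widowed", "E", "Refused", "Don't know", "Fairly", 4],
]


def value(case_number, variable):
    """Return the value of one variable for one case (cases are numbered from 1)."""
    return CASES[case_number - 1][VARIABLES.index(variable)]


def column(variable):
    """Return the values of one variable across all cases."""
    index = VARIABLES.index(variable)
    return [row[index] for row in CASES]


print(value(2, "VOTE"))  # Labour
print(column("AGE"))     # [26, 35, 56, 42, 83]
```

Reading across a row gives everything known about one case; reading down a column gives the values of one variable for the whole sample.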

It is also typical of social surveys that for some cases the value of a particular variable is missing because the question was not answered, either because it is inapplicable (e.g. job description of someone who is not working), because the respondent refused to answer, could not decide or did not know, or sometimes because the question was missed altogether (see blank cell above).  Such missing values need to be entered in the matrix, but the way they are treated in analysis will depend on a variety  of factors.

2:  What’s a Data Matrix?

Example of a simple data matrix containing information on respondents’ answers to precoded or post-coded open-ended questions, in which the columns represent VARIABLES e.g.

                 V A R I A B L E S  →  →  →

CASES  SEX     AGE  MARITAL   CLASS  VOTE     HEIGHT      HAPPY     KIDS
  ↓
1      Male    26   Single    C1     LibDem   1.82        Fairly    None
2      Female  35   Married   C2     Labour   1.63        Very      2
3      Female  56   Married   D      Tory     1.55        Very      3
4      Male    42   Divorced  AB     No vote  1.74        Not very  1
5      Female  83   Widowed   E      Refused  Don't know  Fairly    4

… the rows represent individual CASES, e.g.

                 V A R I A B L E S  →  →  →

CASES  SEX     AGE  MARITAL   CLASS  VOTE     HEIGHT      HAPPY     KIDS
  ↓
1      Male    26   Single    C1     LibDem   1.82        Fairly    None
2      Female  35   Married   C2     Labour   1.63        Very      2
3      Female  56   Married   D      Tory     1.55        Very      3
4      Male    42   Divorced  AB     No vote  1.74        Not very  1
5      Female  83   Widowed   E      Refused  Don't know  Fairly    4

… and the cells contain the initial VALUES on each variable for each case e.g.

                 V A R I A B L E S  →  →  →

CASES  SEX     AGE  MARITAL   CLASS  VOTE     HEIGHT      HAPPY     KIDS
  ↓
1      Male    26   Single    C1     LibDem   1.82        Fairly    None
2      Female  35   Married   C2     Labour   1.63        Very      2
3      Female  56   Married   D      Tory     1.55        Very      3
4      Male    42   Divorced  AB     No vote  1.74        Not very  1
5      Female  83   Widowed   E      Refused  Don't know  Fairly    4

Although computers can handle alphabetic data obtained from questionnaire surveys (and it is sometimes easier for anxious beginners) alphabetic values are normally changed to numeric values before or immediately after data entry.  Numeric data can be processed much more quickly and efficiently on some computers.  Thus the matrix of data will eventually look something like this, in which numeric values have been left unchanged, numeric codes for valid alphabetic responses are indicated in italics and those for responses to be treated as missing in bold.

                 V A R I A B L E S  →  →  →

CASES  SEX  AGE  MARITAL  CLASS  VOTE  HEIGHT  HAPPY  KIDS
  ↓
1      1    26   3        2      2     1.82    2      0
2      2    35   2        3      8     1.63    3      2
3      2    56   2        4      3     1.55    3      3
4      1    42   4        1      1     1.74    1      1
5      2    83   5        5      7     9       .      4

Note that the blank cell has been replaced with a full stop.  This is a default missing value in SPSS and will be explained later.  The matrix is now ready to be fed into a spreadsheet such as Excel or into a computer package such as SPSS ready for editing and analysis.  These days it will often already be in a computer readable format generated by a research agency or after being entered by the researcher directly from the questionnaire. 
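The coding step can be sketched in Python as a lookup from each text response to its numeric code. The codes below are read off the two example matrices in this section; the names CODING_FRAME, MISSING_CODES, encode and valid_values are invented for the illustration, and the choice of 7 (Refused) as a code to be treated as missing is an assumption made for the sketch. Python's None stands in here for the full stop used for a blank cell.

```python
# Coding frame inferred from the example matrices; illustrative only.
CODING_FRAME = {
    "SEX": {"Male": 1, "Female": 2},
    "MARITAL": {"Married": 2, "Single": 3, "Divorced": 4, "Widowed": 5},
    "VOTE": {"No vote": 1, "LibDem": 2, "Tory": 3, "Refused": 7, "Labour": 8},
    "HAPPY": {"Not very": 1, "Fairly": 2, "Very": 3},
}

# Assumption for this sketch: codes to be excluded from analysis.
MISSING_CODES = {"VOTE": {7}}


def encode(variable, response):
    """Translate a text response into its numeric code; a blank becomes None."""
    if response is None or response == "":
        return None
    return CODING_FRAME[variable][response]


def valid_values(variable, codes):
    """Drop blanks and any codes declared missing for this variable."""
    missing = MISSING_CODES.get(variable, set())
    return [code for code in codes if code is not None and code not in missing]


votes = ["LibDem", "Labour", "Tory", "No vote", "Refused"]
vote_codes = [encode("VOTE", response) for response in votes]

print(vote_codes)                        # [2, 8, 3, 1, 7]
print(valid_values("VOTE", vote_codes))  # [2, 8, 3, 1]
```

Keeping the missing codes in the data, and excluding them only at the analysis stage, mirrors the point made in section 1 that missing values must be entered in the matrix even though their treatment varies.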

3:  Levels of measurement

For ease of computer processing, the values of variables are usually, but not always, coded as numbers, most often as integers (numbers with no decimals), but it is important to remember that these numeric codes frequently do not have all the properties of integers.  This can affect the kind of statistical presentation and manipulation which is appropriate or permissible.

One important quality is the level of measurement.

The basic category is nominal (or categorical).   All that is necessary is that the categories are properly defined (precise, mutually exclusive, exhaustive of all cases).   Religious affiliation  is such a variable: so are marital status and parliamentary constituency.   Surveys usually ensure that categories are exhaustive by including a residual category 'Other'.  Numeric codes are arbitrarily assigned to categories.

Ordinal implies that, in addition, the categories can be ranked, i.e. placed in order from highest to lowest on some defined criterion (e.g. Very satisfied, Quite satisfied, Neither satisfied nor dissatisfied, Quite dissatisfied, Very dissatisfied).   Numeric codes cannot be arbitrarily assigned to categories, but they can be reversed.

Interval has all the characteristics of nominal and ordinal plus a defined unit of measurement.  Thus, for instance, the distance from 2 to 4 is the same as that from 4 to 6.  Examples include age, height, income in £, number of children.  Numerical codes are neither arbitrary nor reversible.   If the scale has a true zero point it is a ratio scale (e.g. 4 is twice 2 for number of children, but not for temperature in degrees Celsius).
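One way to see why the level of measurement matters is to compute summaries on coded data. The Python sketch below uses the codes from the example matrix (HAPPY coded from the text responses, 1 = Not very, 2 = Fairly, 3 = Very). The relabelling dictionary is invented purely to show that a mean of nominal codes depends on the arbitrary numbers chosen, whereas the most frequent category does not.

```python
from statistics import mean, median, mode

marital = [3, 2, 2, 4, 5]    # nominal: codes are arbitrary labels
happy = [2, 3, 3, 1, 2]      # ordinal: codes carry rank order only
age = [26, 35, 56, 42, 83]   # interval: codes are real quantities

print(mode(marital))   # 2     (most frequent category)
print(median(happy))   # 2     (middle rank)
print(mean(age))       # 48.4  (arithmetic mean)

# Relabel the nominal categories with different, equally arbitrary numbers.
relabel = {2: 10, 3: 20, 4: 30, 5: 40}
marital_relabelled = [relabel[code] for code in marital]

# The "mean marital status" changes with the labels, so it carries no meaning...
print(mean(marital), mean(marital_relabelled))   # 3.2 22

# ...but the modal category is the same one under either labelling.
print(mode(marital_relabelled) == relabel[mode(marital)])   # True
```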

Note that in sociological discussion things like age or years of schooling are frequently used as indicators of something less precise such as "experience" or "level of education".  It is somewhat dubious whether they really ought to be treated as interval variables in such a context.

When all the cases are grouped into only two categories, according to whether they do, or do not, have a particular characteristic (e.g. Male - Female, Yes - No), they are known as dichotomies. These can always be treated as interval measurements.
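A common way of exploiting this, sketched below in Python, is to recode a dichotomy as 0 and 1: the mean of the recoded variable is then simply the proportion of cases with the characteristic. The SEX codes are those of the example matrix (1 = Male, 2 = Female); the variable name female is invented for the illustration.

```python
from statistics import mean

sex = [1, 2, 2, 1, 2]   # 1 = Male, 2 = Female, as in the example matrix

# Recode to 0/1: 1 if the case has the characteristic "female", else 0.
female = [1 if code == 2 else 0 for code in sex]

# The mean of a 0/1 variable is the proportion of cases coded 1.
print(mean(female))   # 0.6, i.e. three of the five cases are female
```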

The values of some interval variables are continuous (height, age) and have to be rounded.   It is important to remember how the rounding has been done: e.g. when height is measured in inches, 68 inches means from 67.5 inches up to, but not including, 68.5 inches, whereas 68 years old means from 68 up to, but not including, 69 years old.   Thus average age calculated on age last birthday of each case in a sample will need to be adjusted by adding six months to the result.
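The adjustment can be sketched in a few lines of Python. The ages are those of the example matrix, taken here to be ages last birthday; the function names are invented for the illustration, and the half-year correction rests on the usual assumption that people are, on average, halfway between birthdays.

```python
from statistics import mean

age_last_birthday = [26, 35, 56, 42, 83]


def mean_exact_age(ages):
    """Mean age in years, corrected for ages recorded as 'age last birthday'.

    A recorded age of 68 covers 68 up to (but not including) 69, so the
    midpoint of each interval is the recorded age plus half a year.
    """
    return mean(ages) + 0.5


def nearest_inch_interval(height):
    """Interval covered by a height rounded to the nearest inch."""
    return (height - 0.5, height + 0.5)


print(round(mean(age_last_birthday), 1))            # 48.4 (unadjusted)
print(round(mean_exact_age(age_last_birthday), 1))  # 48.9 (adjusted)
print(nearest_inch_interval(68))                    # (67.5, 68.5)
```

No correction is needed for the height example, because rounding to the nearest inch already centres each interval on the recorded value.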

The values of other interval variables are discrete (e.g. number of children) and can only increase in increments of 1.  Sometimes it is important for statistical calculation to bear this in mind.

4:  Author’s note:

These notes are based on the very first session of the part-time evening  post-graduate course Survey Analysis Workshop which I taught at PNL from 1976 until 1992.

We used to introduce ourselves briefly with name, institutional affiliation (if any), previous qualifications and experience, and reasons for coming on the course.  An all too frequent lament was “I’ve got a degree in Sociology and I want a job!”, but even the ones with jobs often had received little or no training in statistical or technical skills in their undergraduate courses (much of which was inadequate).  Some of these wanted a better job and/or had been sent by their employers (mainly central and local government or the voluntary sector).  Some were MPhil or PhD students.

Although I explained that there was no need for them to take notes, as everything was covered in the course booklets, this did not prevent some of them from scribbling furiously away.

I started by listing a few things typically measured by questionnaires (some solicited from the class) across the top of the (double-width) board.  I then divided the board into columns defined by the items listed and filled in the responses of imaginary respondents (with a running commentary on the kind of thing they might mutter before giving a response to a questionnaire-clutching stranger knocking at the door just as they’ve settled down to watch their favourite TV programme) to yield something like the small data matrix in section 1 above.  At this point I would write VARIABLES across the top, CASES down the side, note that the entries inside the matrix were VALUES and that we had just generated a DATA MATRIX.

This constituted the first introduction to formal terminology and to some keywords in the SPSS language.  I then explained that, whilst it is possible for computers to work with text responses, it is normal and a lot quicker to work with numbers and so the non-numeric responses needed to be coded using numeric codes to represent the original responses.

Leaving the original responses on the board, using a different colour chalk (yes, chalk!) and referring to an imaginary coding frame (perhaps pausing to explain what a coding frame was) I would write in a numeric code alongside each response.  After cleaning off the original text responses, leaving only the numeric codes, students were asked if they had any comments on, or noticed anything about, the numbers in the matrix, but this was usually met by blank expressions all round.  (Remember, these were mostly social science graduates!)  I then asked whether there was any difference in the way numbers were used between two variables such as vote and number of children, or between height in centimetres and number of children (using a joke about average families having 2.4 children as a hint), or anything that, say, two sets of numbers had, but another didn’t.  It took different amounts of time with different groups, but eventually someone would get the idea and by this “Socratic” process the class would arrive at the notion of levels of measurement without the phrase having once been mentioned, thus proving that they weren’t as innumerate as they thought.

This was followed by a session in the computer lab in which students familiarised themselves with the terminals and the line-printers, learned to log on to the Vax, copy a short pre-prepared SPSS job into their area, run it with a specially written front-end program [1] to make it easier to use SPSS on the Vax, print out the results and return to class with the printout for a brief explanation.  No student ever left empty-handed (but the spare copies always came in handy for one or two of them!) from this or later sessions, which greatly assisted motivation and subsequent learning.

Here endeth the first lesson.  Bulgarian Cabernet Sauvignon all round!  Let’s have a fun course.

Last updated 6 March 2006.  Feedback and enquiries welcome.


[1]   The program was written by Jim Ring while he was Senior Research Officer in the Survey Research Unit at PNL.  It limited SPSS output files to two editions (to avoid users running out of disk space) and had excellent error trapping.  If errors occurred it returned users to the point in their syntax file where they had left off, although SPSS didn’t always precisely identify the type of error.  It greatly assisted students and researchers with a series of prompts in editing SPSS syntax files, correcting and running of SPSS jobs and local printing of results and enabled a great deal of work to be completed in a very short time. 

Last Updated ( 26 Oct 2009 )
 