Confidence intervals are used to express the uncertainty associated with a population estimate. For example, imagine we wanted to use a survey to estimate the mean age of a population. The 95% confidence interval tells us that if we sampled this same population many times, and generated a CI each time, 95% of these CIs would contain the true mean age of the population.
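This interpretation can be sketched with a short simulation. The population mean and standard deviation below are made up for illustration; the code draws many samples, builds a normal-approximation 95% CI for each, and counts how often the interval covers the true mean:

```python
import math
import random

def mean_ci(sample, z=1.96):
    """Normal-approximation 95% confidence interval for the mean."""
    n = len(sample)
    m = sum(sample) / n
    sd = math.sqrt(sum((x - m) ** 2 for x in sample) / (n - 1))
    se = sd / math.sqrt(n)
    return m - z * se, m + z * se

random.seed(42)
true_mean = 40.0                 # hypothetical population mean age
trials = 1000
covered = 0
for _ in range(trials):
    sample = [random.gauss(true_mean, 12) for _ in range(100)]
    lo, hi = mean_ci(sample)
    if lo <= true_mean <= hi:
        covered += 1
coverage = covered / trials      # close to 0.95 by construction
```

The observed coverage hovers around 95%, which is exactly what the definition above promises over repeated sampling.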
Aggregate or macro data are data about populations, groups, regions or countries. These are data that have been averaged, totalled or otherwise derived from the individual-level data found in survey datasets.
Sampling bias occurs when a sample statistic does not accurately reflect the true value of the parameter in the target population. Sample estimates might be too high or too low compared to the true population values. This may arise where the sample is not representative of the population.
A survey case is a unit for which values are captured. Typically, surveys use individuals, families/households or institutions/organisations as observation units (cases). In survey datasets, cases are usually stored in rows.
A variable that can take on one value from a discrete and mutually exclusive list of responses. For example, a marital status variable can include the categories single (never married), married, civil partnership, divorced, widowed etc., and a respondent can be assigned only one value from this list.
The process of dividing a population into groups, then selecting a simple random sample of groups and sampling everyone in those groups. An example of this is geographical clustering, which is often efficiently applied in face-to-face surveys. Clustering of addresses limits travel for interviewers and so allows survey producers to sample more respondents for a given budget.
A codebook describes the contents, structure, and layout of a data collection. Codebooks begin with basic front matter, including the study title, name of the principal investigator(s), table of contents, and an introduction describing the purpose and format of the codebook. Some codebooks also include methodological details, such as how weights were computed, and data collection instruments, while others, especially with larger or more complex data collections, leave those details for a separate user guide and/or data collection instrument.
A control variable is a variable that is included in an analysis in order to control or eliminate its influence on the variables of interest. For example if we are looking at the relationship between having a university degree and smoking prevalence, we might need to consider the impact of age at the same time. Older generation respondents are more likely to smoke than a younger generation. If we control for age we can see whether graduates are less likely to smoke than non-graduates once age has been accounted for.
Cross-sectional data are collected from a sample at a single point in time. It is often likened to taking a snapshot. Cross-sectional studies are quick and relatively simple, but they cannot provide information about the change in the same individuals or units over time. They can however be used to look at aggregate changes in the population as a whole.
A variable that is created from one or more already existing variables by following some sort of calculation or other data processing techniques. For example, respondent’s estimated annual income from savings and investments could be derived from several reported income variables.
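As a minimal sketch of deriving a variable, the hypothetical income components below (not from any real survey) are combined into a new total-income variable:

```python
# Hypothetical reported income components for three respondents
savings_income = [120, 0, 450]
investment_income = [300, 80, 0]

# Derived variable: estimated annual income from savings and investments
total_income = [s + i for s, i in zip(savings_income, investment_income)]
```

The derived variable exists only in the dataset after processing; no respondent was ever asked for the total directly.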
Equal interval simply divides the data into equal-sized subranges. For example, if your data ranged from 0 to 300 and you specified three classes, the ranges would be: 0–100, 101–200, and 201–300.
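The calculation behind equal-interval classification is just a division of the overall range, as this small sketch shows:

```python
def equal_interval_breaks(lo, hi, classes):
    """Upper bound of each equal-width class over the range [lo, hi]."""
    width = (hi - lo) / classes
    return [lo + width * i for i in range(1, classes + 1)]

# Three classes over 0-300: bounds at 100, 200 and 300,
# i.e. the ranges 0-100, 101-200 and 201-300 for integer data.
breaks = equal_interval_breaks(0, 300, 3)
```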
In a survey setting, microdata are individual-level data stored as cases, usually with one case per respondent. In a business microdata setting, data are stored at the firm level, with one case per firm. Cases are usually stored in rows.
Some variables have values that are recorded as missing. These values may be missing unintentionally (due to data entry errors) or may stem from the survey design (e.g. if only part of the sample were asked a particular question). Sometimes non-substantive responses (such as ‘don’t know’) are also recorded as missing values. To draw accurate inferences about the data missing values need to be treated prior to the analyses, e.g. excluded.
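As a minimal sketch of treating missing values by exclusion, the numeric codes 8 and 9 below are hypothetical missing-value codes (e.g. 'don't know' and 'refused'), not codes from any specific survey:

```python
# Hypothetical coding scheme for an age-left-education variable
MISSING_CODES = {8, 9}           # 8 = 'don't know', 9 = 'refused' (assumed)

responses = [23, 45, 9, 31, 8, 52]

# Exclude missing values before computing any statistics
valid = [v for v in responses if v not in MISSING_CODES]
mean_value = sum(valid) / len(valid)
```

If the missing codes were left in, they would be treated as real ages and bias the mean, which is why treatment prior to analysis matters.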
This method attempts to model, mathematically or statistically, data from two or more variables measured on the same observations. Multivariate statistical modelling often involves a dependent variable and multiple independent variables. Examples of multivariate analyses are factor analysis, latent class analysis, and multivariate regression. In contrast, a univariate method involves the analysis of a single variable.
The Natural breaks (Jenks) method groups similar values together, and breaks are assigned where there are relatively large distances between the classes. This reduces variance within classes and maximises variance between classes.
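The idea of placing breaks to minimise within-class variance can be sketched with a brute-force search. Real implementations use Jenks's optimisation algorithm; this exhaustive version is only feasible for small datasets and is purely illustrative:

```python
from itertools import combinations

def natural_breaks(values, classes):
    """Partition sorted values into contiguous classes minimising the
    total within-class sum of squared deviations (brute force)."""
    data = sorted(values)
    n = len(data)

    def ssd(group):
        m = sum(group) / len(group)
        return sum((x - m) ** 2 for x in group)

    best_cost, best_split = float("inf"), None
    # each cut index marks where a new class starts
    for cuts in combinations(range(1, n), classes - 1):
        bounds = (0,) + cuts + (n,)
        groups = [data[bounds[i]:bounds[i + 1]] for i in range(classes)]
        cost = sum(ssd(g) for g in groups)
        if cost < best_cost:
            best_cost, best_split = cost, groups
    return best_split
```

Given two clearly separated clusters, the break lands in the gap between them, keeping similar values together.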
This is a categorical variable containing values that represent categories with no natural order. The values assigned to the categories can be presented in any order. For example, there is no natural order to a set of categories describing the religion a person follows.
Non-substantive responses are responses that do not offer a quantifiable value. Examples include responses such as: ‘Unsure / undecided’, ‘Cannot recall’, ‘Have no idea’, ‘Don’t know’ (DK). Unlike substantive responses, they cannot be used in analysis.
Precision is a measure of the variation of a survey estimator for a population parameter.
It refers to the size of deviations from a survey estimate (i.e. a survey statistic, such as a mean or percentage) that occur over repeated application of the same probability-based sampling procedures using the same sampling frame and sample size. Standard errors and confidence intervals are two examples of commonly used measures of precision.
A sample based on random selection of elements. It should be possible for the sample designer to calculate the probability with which an element in the population is selected for inclusion in the sample.
Quantile classification arranges data so there is the same number of features in each class. This results in an equal distribution of shading across the map. It can, however, produce a misleading map, as similar features can fall in different classes, and widely different features in the same class.
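A minimal sketch of quantile classification: each feature is ranked by value, and the ranks are split into classes of (nearly) equal size:

```python
def quantile_classes(values, classes):
    """Assign each value a class label so every class holds
    (nearly) the same number of features."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    per_class = len(values) / classes
    labels = [0] * len(values)
    for rank, i in enumerate(order):
        labels[i] = min(int(rank / per_class), classes - 1)
    return labels

# Six features, three classes: two features per class,
# regardless of how close or far apart their values are.
labels = quantile_classes([5, 1, 9, 3, 7, 2], 3)
```

Note how the values 3 and 5 end up in the same class while 5 and 7 do not, illustrating the distortion mentioned above.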
Standard error measures the uncertainty associated with the estimate. The standard error of the mean is a measure of how representative a sample is of the population from which it was drawn. It measures the amount that a sample statistic (such as a percentage) varies from the true population statistic.
Standard error is related to standard deviation and the standard deviation can be used to calculate it. For a given sample size, the standard error equals the standard deviation divided by the square root of the sample size.
The standard error is also inversely proportional to the square root of the sample size; the larger the sample size, the smaller the standard error.
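The relationship between standard deviation, sample size and standard error can be sketched directly (the heights below are made-up illustrative values):

```python
import math
import statistics

def standard_error(sample):
    """SE of the mean: sample standard deviation / sqrt(n)."""
    return statistics.stdev(sample) / math.sqrt(len(sample))

small = [64, 66, 70, 72]    # hypothetical measurements, n = 4
large = small * 25          # same spread of values, n = 100

se_small = standard_error(small)
se_large = standard_error(large)
```

Because the spread of values is held fixed while n grows from 4 to 100, the standard error shrinks, reflecting the greater precision of the larger sample.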
A statistical model is a theoretical construction of the relationship of explanatory variables to variables of interest created to better understand these relationships.
They typically consist of a collection of probability distributions and are used to describe patterns of variability that data may display.
The statistical model is expressed as a function. For example, a researcher may model a linear relationship using the regression function below:
y = b0 + b1x1 + b2x2 + ... + bixi
In this model, y represents the outcome variable and the xi terms represent the predictor variables. The term b0 is the intercept of the model. Each bi is a regression coefficient and represents the numerical relationship between the ith predictor variable and the outcome.
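As a small concrete illustration of the regression function above, the sketch below fits the single-predictor case y = b0 + b1x1 by ordinary least squares; the data are made up so that the true coefficients are known:

```python
def ols_simple(xs, ys):
    """Return (b0, b1) minimising the sum of squared residuals
    for the single-predictor model y = b0 + b1*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
          / sum((x - mx) ** 2 for x in xs))
    b0 = my - b1 * mx
    return b0, b1

xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]          # constructed to lie exactly on y = 1 + 2x
b0, b1 = ols_simple(xs, ys)
```

With noiseless data the fitted intercept and slope recover the constructed values exactly; real survey data would of course leave residual scatter around the line.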
Statistically modelling is a major topic and outside the scope of the module. Readers who want to know more will find extensive accounts of statistical models including linear regression and logistic regression in statistical texts and online.
The type of probability sampling where researchers divide the population into non-overlapping groups (strata) and collect a simple random sample of participants from each stratum. In contrast, cluster sampling uses simple random sampling to select clusters, and everyone in those clusters is sampled.
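A minimal sketch of stratified sampling, using a hypothetical sampling frame of ten people split across two regions:

```python
import random

def stratified_sample(population, strata_key, per_stratum, seed=0):
    """Divide the population into strata, then take a simple random
    sample of per_stratum units from each stratum."""
    rng = random.Random(seed)
    strata = {}
    for unit in population:
        strata.setdefault(strata_key(unit), []).append(unit)
    return {k: rng.sample(units, per_stratum) for k, units in strata.items()}

# Hypothetical frame: five people in each of two regions
frame = [{"id": i, "region": "north" if i < 5 else "south"}
         for i in range(10)]
sample = stratified_sample(frame, lambda u: u["region"], per_stratum=2)
```

Because every stratum is sampled, each region is guaranteed representation, which a simple random sample of the whole frame would not ensure.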
A structured interview follows a strict protocol using a set of defined questions administered in the same order to all interviewees. It allows for quick collection of focused data; however, there are limited opportunities for probing and further exploration of topics. The interviews are usually conducted face to face or over the phone.
Survey nonresponse can occur at both an item and unit level.
Item nonresponse occurs when a sample member responds to the survey, but fails to provide a valid response to a particular item (e.g. a question they refuse to answer).
Unit nonresponse occurs when eligible sample members either cannot be contacted, refuse to participate in the survey or do not provide sufficient information for their responses to be valid. Unit nonresponse can be a source of bias in survey estimates and reducing unit nonresponse is an important objective of good survey practice.
Univariate analysis models data that consist of a single variable. Examples of univariate analyses include descriptive statistics (mean, standard deviation, kurtosis), goodness-of-fit tests and the Student’s t-test.
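Descriptive statistics for a single variable can be computed directly with Python's standard library; the scores below are made-up illustrative data:

```python
import statistics

scores = [2, 4, 4, 4, 5, 5, 7, 9]   # a single hypothetical variable

# Univariate descriptive summary: each statistic involves only
# this one variable, not its relationship to any other.
summary = {
    "mean": statistics.mean(scores),
    "median": statistics.median(scores),
    "sd": statistics.stdev(scores),
}
```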
A description of the values a variable can take on. Sometimes nominal values are coded as numbers and the label helps to describe what each of these numbers means, e.g. for the variable ‘sex’, each numeric code may carry a label such as ‘Male’ or ‘Female’.