Confidence intervals are used to express the uncertainty associated with a
population estimate. For example, imagine we wanted to use a survey to estimate the mean age of a population. The 95% confidence interval tells us that if we sampled this same population lots of times, and generated a CI each time, 95% of these CIs would contain the true mean age of the population.
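As a minimal sketch, a 95% CI for a mean can be computed from a sample using the normal approximation; the ages below are hypothetical, and for small samples a t critical value would be more appropriate than z = 1.96.

```python
import math
import statistics

# Hypothetical sample of ages from a survey
ages = [34, 45, 29, 51, 38, 42, 36, 48, 31, 44, 39, 47]

n = len(ages)
mean = statistics.mean(ages)
se = statistics.stdev(ages) / math.sqrt(n)  # standard error of the mean

# 95% CI using the normal approximation (z = 1.96)
lower, upper = mean - 1.96 * se, mean + 1.96 * se
print(f"95% CI for mean age: ({lower:.1f}, {upper:.1f})")
```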
Aggregate or macro data are data about populations, groups, regions or countries. These are data that have been averaged, totalled or otherwise derived from the individual level data found in the survey datasets.
Sample attrition refers to the loss of study units from a sample after an initial wave of data collection. For example, individuals who take part in the first wave of a
longitudinal study may drop out at a subsequent wave.
Sampling bias occurs when a sample statistic does not accurately reflect the true value of the parameter in the target population. Sample estimates might be too high or too low compared to the true population values. This may arise where the sample is not representative of the population.
Source: SAGE Research Methods.
A survey case is a unit for which
values are captured.
Typically, surveys use individuals, families/households or institutions/organisations
as observation units (cases). In survey datasets, cases are usually stored in rows.
Within a data catalogue, a catalogue record provides essential
metadata for the
dataset(s) and access to the accompanying
documentation. These records typically include a title, details about the data creators, a descriptive overview of the dataset content, information about the
sample, and details about the data access conditions, facilitating efficient data discovery and understanding.
A
variable that can take on one
value from a discrete and mutually
exclusive set of responses. For example, a marital status variable
can include the categories of single (never married), married,
civil partnership, divorced, widowed etc. and respondents can
be assigned only one value from this list.
Choropleth maps colour or shade different areas
according to a range of values, e.g. population density
or per-capita income.
The process of dividing a population into groups,
then selecting a simple random sample of groups
and sampling everyone in those groups. An example
of this is geographical clustering, which is often
efficiently applied in face-to-face surveys.
Clustering of addresses limits travel for interviewers
and so allows survey producers to sample more respondents
for a given budget.
Sources: An Introduction to Statistical Methods and Data Analysis.
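The two stages described above can be sketched as follows; the population of clustered addresses is hypothetical.

```python
import random

random.seed(42)  # for a reproducible illustration

# Hypothetical population: 100 addresses spread across 10 postcode clusters
population = [(f"cluster_{c}", f"address_{c}_{i}")
              for c in range(10) for i in range(10)]

# Stage 1: simple random sample of clusters
clusters = sorted({c for c, _ in population})
chosen = random.sample(clusters, k=3)

# Stage 2: include every address in the chosen clusters
sample = [addr for c, addr in population if c in chosen]
print(f"{len(chosen)} clusters selected, {len(sample)} addresses sampled")
```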
A codebook describes the contents, structure, and
layout of a data collection. Codebooks begin with
basic front matter, including the study title, name
of the principal investigator(s), table of contents,
and an introduction describing the purpose and format
of the codebook. Some codebooks also include methodological
details, such as how weights were computed, and data
collection instruments, while others, especially with
larger or more complex data collections, leave those
details for a separate user guide and/or data collection
instrument.
Cohort studies chart the lives of groups of
individuals who experience the same life events
within a given time period.
Source: Closer Learning Hub.
A control variable is a
variable
that is included in an analysis in order to control or
eliminate its influence on the variables of interest.
For example, if we are looking at the relationship between
having a university degree and smoking prevalence, we might
need to consider the impact of age at the same time.
Older generation respondents are more likely to smoke
than a younger generation. If we control for age, we can
see whether graduates are less likely to smoke than
non-graduates once age has been accounted for.
Source: SAGE Research Methods.
Copyright is the exclusive and assignable legal
right to control all use of an original work, such
as a book, data etc., for a particular period of time.
Source: Cambridge Dictionary.
Cross-sectional data are collected from a sample at
a single point in time. It is often likened to taking
a snapshot. Cross-sectional studies are quick and relatively
simple, but they cannot provide information about the change
in the same individuals or units over time. Repeated cross-sectional data can however
be used to look at aggregate changes in the population as
a whole.
Source: SAGE Research Methods.
A data archive is a centralised database system that collects, manages, and stores datasets for later use. Similar to a
data repository.
Data licensing is a legal arrangement between the
creator of the data and the end-user specifying what
users can do with the data.
Source: How to FAIR.
Data linkage is the process of joining together
records from different sources that pertain to the same entity.
Source: ONS;
Understanding Society.
Data manipulation is the process of arranging and
organising data to make it easier to use, analyse and interpret.
Data mining is defined as the process of extracting
useful information from large data sets through the
use of any relevant data analysis techniques developed
to help people make better decisions.
Source: SAGE Research Methods.
A data repository is a centralised database system that collects, manages, and stores datasets for later use, similar to a
data archive.
Any computer file (or set of files) which is organised under a single title and is capable of being described as a coherent unit.
A
variable that is created from one or more already existing variables by following some sort of calculation or other data processing technique. For example, each respondent’s estimated annual income from savings and investments could be derived from several reported income variables.
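The income example can be sketched as a simple derivation, with hypothetical reported components summed into a new variable.

```python
# Hypothetical reported income components for three respondents
respondents = [
    {"savings_interest": 120.0, "dividends": 300.0, "rental_income": 0.0},
    {"savings_interest": 45.5,  "dividends": 0.0,   "rental_income": 2400.0},
    {"savings_interest": 0.0,   "dividends": 80.0,  "rental_income": 0.0},
]

# Derive a new variable by summing the existing ones
for r in respondents:
    r["income_from_savings_and_investments"] = (
        r["savings_interest"] + r["dividends"] + r["rental_income"]
    )

print([r["income_from_savings_and_investments"] for r in respondents])
```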
Descriptive statistics are those that describe data. Examples include means, medians, variances, standard deviations, correlation coefficients, etc.
Source: SAGE Research Methods.
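A few of these measures can be computed directly with Python's standard library; the data below are hypothetical.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values

print("mean:  ", statistics.mean(data))    # arithmetic average
print("median:", statistics.median(data))  # middle value
print("stdev: ", statistics.pstdev(data))  # population standard deviation
```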
Accompanying files that enable users to understand a
dataset, exactly how the research was carried out, and what the data mean. Documentation usually consists of data-level documentation (i.e. about individual databases or data files) and study-level documentation (i.e. high-level information on the research context and design, the data collection methods used, any data preparations and manipulations, plus summaries of findings based on the data).
This is a method of dividing the data displayed in a
choropleth map. Equal interval simply divides the data into equal-sized subranges. For example, if your data ranged from 0 to 300 and you specified three classes, the ranges would be 0–100, 101–200, and 201–300.
See also
Natural breaks (or Jenks) and
Quantile.
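The 0–300 example above can be sketched as a small function that returns the class boundaries.

```python
def equal_interval_breaks(low, high, n_classes):
    """Divide the range [low, high] into n_classes equal-width subranges."""
    width = (high - low) / n_classes
    return [(low + i * width, low + (i + 1) * width) for i in range(n_classes)]

# The 0-300 example from the text, split into three classes
print(equal_interval_breaks(0, 300, 3))
```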
Imputation involves replacing
missing values with an estimated value. It is one of three options for handling missing data. The general principle
is to delete when the data are expendable, impute when the data are precious, and segment for the less common situation in which a large data
set has a large fissure.
Source: SAGE Research Methods.
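One of the simplest forms of imputation replaces each missing value with the mean of the observed values; this sketch uses hypothetical income data with missingness coded as None.

```python
# Hypothetical income values with missing entries coded as None
incomes = [21000, None, 34000, 28000, None, 40000]

observed = [v for v in incomes if v is not None]
mean_income = sum(observed) / len(observed)

# Mean imputation: replace each missing value with the observed mean
imputed = [v if v is not None else mean_income for v in incomes]
print(imputed)
```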
Informed consent is a process used in research where individuals are provided with complete and adequate information about a study including its risks, benefits,
and alternatives, based on which the individual decides whether to participate in the study or not.
Source: SAGE Research Methods.
Data that contain information about the sampled units (e.g. respondents, households) measured on two or more occasions.
Source: Learn Statistics Easily.
In long format data, each row represents a single observation or measurement for a subject, often resulting
in multiple rows for each subject. For example, in a
longitudinal survey
tracking student performance over several years, each student will
have multiple rows corresponding to different years, with each row
recording their performance for that particular year.
Contrast with
wide-format data.
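The student-performance example can be sketched with hypothetical long-format records, pivoted into wide format so each student gets a single row.

```python
# Hypothetical long-format records: one row per student per year
long_rows = [
    {"student": "A", "year": 2021, "score": 61},
    {"student": "A", "year": 2022, "score": 67},
    {"student": "B", "year": 2021, "score": 55},
    {"student": "B", "year": 2022, "score": 60},
]

# Pivot to wide format: one row per student, one column per year
wide = {}
for row in long_rows:
    wide.setdefault(row["student"], {})[f"score_{row['year']}"] = row["score"]

print(wide)
```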
A longitudinal design is one that measures the characteristics of the same individuals on at least two, but ideally more, occasions over time. Its purpose is to directly address the study of individual change and variation. Longitudinal studies are expensive in terms of both time and money, but they provide many significant advantages relative to cross-sectional studies.
Source: SAGE Research Methods.
Machine learning is a subset of artificial intelligence (AI) that gives systems the ability to learn and improve automatically from experience without being explicitly programmed.
Metadata is a set of data that describes and gives information about other data. Information that describes significant aspects (e.g. content, context and structure of information) of a resource; metadata are created for the purposes of resource discovery, managing access and ensuring efficient preservation of resources.
Microdata are unit-level data obtained from sample surveys, censuses, and administrative systems. They provide information about characteristics of individual people or entities such as households, business enterprises, facilities, farms or even geographical areas such as villages or towns.
Source: The World Bank.
Some
variables have
values that are recorded as missing. These values may be missing unintentionally (due to data entry errors) or may stem from the survey design (e.g. if only part of the sample were asked a particular question). Sometimes
non-substantive responses (such as ‘don’t know’) are also recorded as missing values. To draw accurate inferences about the data, missing values need to be treated prior to analysis, e.g. excluded.
This is a method of dividing the data displayed in a
choropleth map. The Natural breaks (Jenks) method groups similar values together, and breaks are assigned where there are relatively large distances between the classes. This reduces variance within classes and maximises variance between classes.
See also
Equal interval and
Quantile.
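For small datasets the Jenks criterion can be computed exactly by brute force: try every split of the sorted values and keep the one that minimises total within-class variance. This is a sketch of the idea, not the optimised dynamic-programming algorithm used by GIS software.

```python
from itertools import combinations

def natural_breaks(values, n_classes):
    """Exact Jenks-style classes for small datasets: brute-force the split
    of the sorted values that minimises total within-class squared deviation."""
    data = sorted(values)

    def ssd(group):  # sum of squared deviations from the group mean
        m = sum(group) / len(group)
        return sum((x - m) ** 2 for x in group)

    best, best_classes = float("inf"), None
    # Choose n_classes - 1 cut points between adjacent sorted values
    for cuts in combinations(range(1, len(data)), n_classes - 1):
        bounds = [0, *cuts, len(data)]
        classes = [data[bounds[i]:bounds[i + 1]] for i in range(n_classes)]
        total = sum(ssd(c) for c in classes)
        if total < best:
            best, best_classes = total, classes
    return best_classes

# Two tight groups with a gap: the break should fall in the gap
print(natural_breaks([1, 2, 3, 10, 11, 12], 2))
```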
This is a type of
categorical variable that represents categories that do not have a natural order. The values assigned to the categories can be presented in any order. For example, there is no natural order to a set of categories describing the religion a person might follow.
Non-substantive responses in surveys are responses such as: ‘Not sure/ Do not recall’, ‘Don’t know (DK)’. Non-substantive responses are generally not used in analysis.
This is a type of
categorical variable that contains
values which represent categories which have a natural order. For example, a highest level of qualification variable might follow an order such as:
- higher degree
- first degree
- further education below degree
- GCSE or equivalent
- no qualification
There is a logical order that the values assigned to the categories can be presented in.
Panel studies follow the same individuals over time. Information is normally collected about the whole household at each wave. See also:
wave.
Source: Closer Learning Hub.
In survey design, a population is an entire collection of observation units, for example all 'residents in England and Wales in 2020', about which researchers seek to draw inferences.
Statistics produced using a sample of cases (sample statistics), which are designed to produce an estimate about the characteristics of the population (population parameter).
Source: SAGE Research Methods.
Precision refers to the size of deviations from a survey estimate (i.e. a survey statistic, such as a mean or percentage) that occurs over repeated application of the same probability-based
sampling procedures using the same sampling frame and sample size.
Standard errors and
confidence intervals are two examples of commonly used measures of precision.
Source: SAGE Research Methods.
Primary data is data collected first-hand for a specific research purpose or project.
Source: SAGE Research Methods.
A sample based on random selection of elements. It should be possible for the sample designer to calculate the probability with which an element in the population is selected for inclusion in the sample.
This is a method of dividing the data displayed in a
choropleth map.
Quantile classification arranges data so there is the same count of
features in each class. This will result in an equal distribution of
shading across the map. This can result in a misleading map, as
similar features can be in different classes, and widely different
features in the same class.
See also
Equal interval and
Natural breaks (or Jenks).
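Quantile classification can be sketched as sorting the values and slicing them into classes of (as near as possible) equal size; the values below are hypothetical.

```python
def quantile_classes(values, n_classes):
    """Assign features to classes so each class holds (roughly) the same count."""
    data = sorted(values)
    size, remainder = divmod(len(data), n_classes)
    classes, start = [], 0
    for i in range(n_classes):
        # Spread any remainder across the first few classes
        end = start + size + (1 if i < remainder else 0)
        classes.append(data[start:end])
        start = end
    return classes

# Nine values into three classes of three features each
print(quantile_classes([3, 1, 4, 1, 5, 9, 2, 6, 5], 3))
```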
A representative
sample is one that replicates the characteristics of the population.
In the context of scientific research, reproducibility refers to the ability of an independent researcher or team to recreate the results of a study using the same methods and data as the original study. This concept hinges on the provision of detailed methodology, context, and background information.
Research data management refers to the systematic organisation, storage, preservation and sharing of data resulting from a research project. It involves practices that span the entire data lifecycle, from planning the collection to wider data sharing.
Source: CODA.
Research methodology is a description of the approach followed to complete a
research project; the 'how' that helps the researcher address the research aims,
objectives and research questions.
A person, or other entity, who responds to a survey.
A sampling frame is a comprehensive list of all the members of the population from which a probability sample will be selected.
Source: SAGE Research Methods.
Standard error measures the uncertainty or variability associated with a sample estimate when compared to the true population parameter. The standard error of a statistic (like a mean or percentage) indicates how much that statistic is expected to vary from the true population value.
The standard error is also inversely proportional to the sample size; the larger the sample size, the smaller the standard error.
Source: Stat Trek Statistics Dictionary.
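The standard error of a mean is the sample standard deviation divided by the square root of the sample size, so it shrinks as n grows; the measurements below are hypothetical.

```python
import math
import statistics

sample = [12, 15, 9, 14, 10, 13, 11, 16]  # hypothetical measurements

# Standard error of the mean: sample SD divided by sqrt(n)
se = statistics.stdev(sample) / math.sqrt(len(sample))
print(f"standard error: {se:.3f}")

# A larger sample with the same spread yields a smaller standard error
bigger = sample * 4
se_big = statistics.stdev(bigger) / math.sqrt(len(bigger))
```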
A statistical model is a theoretical construction of the relationship of explanatory variables to variables of interest created to better understand these relationships.
They typically consist of a collection of probability distributions and are used to describe patterns of variability that data may display.
The statistical model is expressed as a function. For example, a researcher may model a linear relationship using the regression function below:
y = b0 + b1x1 + b2x2 + ... + bixi
In this model, y represents an outcome variable and xi represents its corresponding predictor variables. The term b0 is an intercept for the model. The term bi is a regression coefficient and represents the numerical relationship between the predictor variables and the outcome for the ith term.
Statistical modelling is a major topic. Readers who want to know more will find extensive accounts of statistical models including linear regression and logistic regression in statistical texts and online.
Sources: Science Direct;
Magoosh Statistics Blog.
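For the single-predictor case, the regression coefficients have a closed-form least-squares solution; this sketch fits y = b0 + b1*x to hypothetical data.

```python
# Hypothetical data roughly following y = 0 + 2x
x = [1, 2, 3, 4, 5]
y = [2.1, 4.0, 6.2, 7.9, 10.1]

n = len(x)
mx, my = sum(x) / n, sum(y) / n

# Closed-form least squares: slope = covariance / variance of x
b1 = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
      / sum((xi - mx) ** 2 for xi in x))
b0 = my - b1 * mx
print(f"y = {b0:.2f} + {b1:.2f}x")
```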
A structured interview follows a strict protocol using a set of defined questions administered in the same order to all interviewees. It allows for a quick collection of focused data, however there are limited opportunities for probing and further exploration of topics. The interviews are usually conducted face to face or over the phone.
Survey design involves a series of methodological steps
to create an effective survey, such as defining an objective, determining
a target population, designing the questionnaire etc.
Survey design can also refer to the structure or format of the survey,
such as a cross-sectional survey, longitudinal survey etc.
Survey nonresponse can occur at both an item and unit level.
Item nonresponse occurs when a sample member responds to the survey but fails to provide a valid response to a particular item (e.g. a question they refuse to answer).
Unit nonresponse occurs when eligible sample members either cannot be contacted, refuse to participate in the survey or do not provide sufficient information for their responses to be valid. Unit nonresponse can be a source of bias in survey estimates and reducing unit nonresponse is an important objective of good survey practice.
Source: SAGE Research Methods.
The target population, or simply the population, represents
the specific group we are interested in studying.
The unit which is being analysed. This is synonymous with
case.
Univariate analysis involves analysis of a single variable. Examples of univariate analyses include descriptive statistics (mean, standard deviation, kurtosis), goodness-of-fit tests, and the Student’s t-test.
A representation of a characteristic for one case. For one variable, values may vary from one case to another. E.g. for the variable ‘gender’ the values may be ‘male’, ‘female’ or ‘other’.
A description of the values a variable can take on. Sometimes nominal values are coded as numbers and the label helps to describe what each of these numbers means. E.g. for the variable ‘gender’ the values may be:
- female
- male
- other
A variable is an attribute that describes a person, place, thing, or idea. The value of the variable can "vary" from one entity to another. In surveys, this is usually a characteristic that varies between
cases.
Source: Stat Trek Statistics Dictionary.
A wave is a round of data collection in a particular longitudinal survey (for example, the age 7 wave of the National Child Development Study refers to the data collection that took place in 1965 when the participants were aged 7). Note that the term sweep often has the same meaning.
Source: Closer Learning Hub.
Weighting is a statistical adjustment made to survey data to improve accuracy of survey estimates. Weighting can correct for unequal probabilities of selection and survey non-response.
Source: SAGE Research Methods.
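A weighted estimate multiplies each response by its weight and divides by the sum of the weights. In this hypothetical sketch, under-represented younger respondents receive a larger weight, which changes the estimated smoking rate.

```python
# Hypothetical survey: young respondents are under-represented, so they
# receive a larger weight to restore their share of the population
responses = [
    {"age_group": "16-34", "smokes": 1, "weight": 2.0},
    {"age_group": "16-34", "smokes": 0, "weight": 2.0},
    {"age_group": "35+",   "smokes": 0, "weight": 0.8},
    {"age_group": "35+",   "smokes": 0, "weight": 0.8},
    {"age_group": "35+",   "smokes": 1, "weight": 0.8},
]

unweighted = sum(r["smokes"] for r in responses) / len(responses)
weighted = (sum(r["smokes"] * r["weight"] for r in responses)
            / sum(r["weight"] for r in responses))
print(f"unweighted: {unweighted:.4f}, weighted: {weighted:.4f}")
```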
In wide format data, each subject's responses are listed in a single row, with different variables spread across multiple columns.
Longitudinal data in wide format would contain one row of information per person, and measurements of the same variable at different time points would be contained in different variables.
Contrast with
long-format data.