Confidence intervals are used to express the uncertainty associated with a
population estimate. For example, imagine we wanted to use a survey to estimate the mean age of a population. The 95% confidence interval tells us that if we sampled this same population lots of times, and generated a CI each time, 95% of these CIs would contain the true mean age of the population.
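As a minimal sketch, a 95% CI for a mean can be computed from a sample using the normal approximation; the ages below are hypothetical, and for small samples a t critical value would be more appropriate than z = 1.96.

```python
import math
import statistics

# Hypothetical sample of ages from a survey
ages = [34, 45, 29, 51, 38, 42, 36, 48, 31, 44, 39, 47]

n = len(ages)
mean = statistics.mean(ages)
se = statistics.stdev(ages) / math.sqrt(n)  # standard error of the mean

# 95% CI using the normal approximation (z = 1.96)
lower, upper = mean - 1.96 * se, mean + 1.96 * se
print(f"95% CI for mean age: ({lower:.1f}, {upper:.1f})")
```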
Aggregate or macro data are data about populations, groups, regions or countries. These are data that have been averaged, totalled or otherwise derived from the individual level data found in the survey datasets.
Sample attrition refers to the loss of study units from a sample after an initial wave of data collection. For example, individuals who take part in the first wave of a
longitudinal study may drop out at a subsequent wave.
Sampling bias occurs when a sample statistic does not accurately reflect the true value of the parameter in the target population. Sample estimates might be too high or too low compared to the true population values. This may arise where the sample is not representative of the population.
Source: SAGE Research Methods.
A survey case is a unit for which
values are captured.
Typically, surveys use individuals, families/households or institutions/organisations
as observation units (cases). In survey datasets, cases are usually stored in rows.
Within a data catalogue, a catalogue record provides essential
metadata for the
dataset(s) and access to the accompanying
documentation. These records typically include a title, details about the data creators, a descriptive overview of the dataset content, information about the
sample, and details about the data access conditions, facilitating efficient data discovery and understanding.
A
variable that can take on one
value from a discrete and mutually
exclusive set of responses. For example, a marital status variable
can include the categories of single (never married), married,
civil partnership, divorced, widowed etc. and respondents can
be assigned only one value from this list.
Choropleth maps colour or shade different areas
according to a range of values, e.g. population density
or per-capita income.
The process of dividing a population into groups,
then selecting a simple random sample of groups
and sampling everyone in those groups. An example
of this is geographical clustering, which is often
efficiently applied in face-to-face surveys.
Clustering of addresses limits travel for interviewers
and so allows survey producers to sample more respondents
for a given budget.
Sources: An Introduction to Statistical Methods and Data Analysis.
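The two stages described above can be sketched as follows; the population of clustered addresses is hypothetical.

```python
import random

random.seed(42)  # for a reproducible illustration

# Hypothetical population: 100 addresses spread across 10 postcode clusters
population = [(f"cluster_{c}", f"address_{c}_{i}")
              for c in range(10) for i in range(10)]

# Stage 1: simple random sample of clusters
clusters = sorted({c for c, _ in population})
chosen = random.sample(clusters, k=3)

# Stage 2: include every address in the chosen clusters
sample = [addr for c, addr in population if c in chosen]
print(f"{len(chosen)} clusters selected, {len(sample)} addresses sampled")
```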
A codebook describes the contents, structure, and
layout of a data collection. Codebooks begin with
basic front matter, including the study title, name
of the principal investigator(s), table of contents,
and an introduction describing the purpose and format
of the codebook. Some codebooks also include methodological
details, such as how weights were computed, and data
collection instruments, while others, especially with
larger or more complex data collections, leave those
details for a separate user guide and/or data collection
instrument.
Cohort studies chart the lives of groups of
individuals who experience the same life events
within a given time period.
Source: Closer Learning Hub.
A control variable is a
variable
that is included in an analysis in order to control or
eliminate its influence on the variables of interest.
For example, if we are looking at the relationship between
having a university degree and smoking prevalence, we might
need to consider the impact of age at the same time.
Older generation respondents are more likely to smoke
than a younger generation. If we control for age, we can
see whether graduates are less likely to smoke than
non-graduates once age has been accounted for.
Source: SAGE Research Methods.
Copyright is the exclusive and assignable legal
right to control all use of an original work, such
as a book, data etc., for a particular period of time.
Source: Cambridge Dictionary.
Cross-sectional data are collected from a sample at
a single point in time. It is often likened to taking
a snapshot. Cross-sectional studies are quick and relatively
simple, but they cannot provide information about the change
in the same individuals or units over time. Repeated cross-sectional data can however
be used to look at aggregate changes in the population as
a whole.
Source: SAGE Research Methods.
A data archive is a centralised database system that collects, manages, and stores datasets for later use. Similar to a
data repository.
Data licensing is a legal arrangement between the
creator of the data and the end-user specifying what
users can do with the data.
Source: How to FAIR.
Data linkage is the process of joining together
records from different sources that pertain to the same entity.
Source: ONS;
Understanding Society.
Data manipulation is the process of arranging and
organising data to make it easier to use, analyse and interpret.
Data mining is defined as the process of extracting
useful information from large data sets through the
use of any relevant data analysis techniques developed
to help people make better decisions.
Source: SAGE Research Methods.
A data repository is a centralised database system that collects, manages, and stores datasets for later use, similar to a
data archive.
Any computer file (or set of files) which is organised under a single title and is capable of being described as a coherent unit.
A
variable that is created from one or more already existing variables by following some sort of calculation or other data processing technique. For example, each respondent’s estimated annual income from savings and investments could be derived from several reported income variables.
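The income example can be sketched as a simple derivation, with hypothetical reported components summed into a new variable.

```python
# Hypothetical reported income components for three respondents
respondents = [
    {"savings_interest": 120.0, "dividends": 300.0, "rental_income": 0.0},
    {"savings_interest": 45.5,  "dividends": 0.0,   "rental_income": 2400.0},
    {"savings_interest": 0.0,   "dividends": 80.0,  "rental_income": 0.0},
]

# Derive a new variable by summing the existing ones
for r in respondents:
    r["income_from_savings_and_investments"] = (
        r["savings_interest"] + r["dividends"] + r["rental_income"]
    )

print([r["income_from_savings_and_investments"] for r in respondents])
```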
Descriptive statistics are those that describe data. Examples include means, medians, variances, standard deviations, correlation coefficients, etc.
Source: SAGE Research Methods.
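A few of these measures can be computed directly with Python's standard library; the data below are hypothetical.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values

print("mean:  ", statistics.mean(data))    # arithmetic average
print("median:", statistics.median(data))  # middle value
print("stdev: ", statistics.pstdev(data))  # population standard deviation
```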
Accompanying files that enable users to understand a
dataset, exactly how the research was carried out, and what the data mean. Documentation usually consists of data-level documentation (i.e. about individual databases or data files) and study-level documentation (i.e. high-level information on the research context and design, the data collection methods used, any data preparations and manipulations, plus summaries of findings based on the data).
This is a method of dividing the data displayed in a
choropleth map. Equal interval simply divides the data into equal-sized subranges. For example, if your data ranged from 0 to 300 and you specified three classes, the ranges would be 0–100, 101–200, and 201–300.
See also
Natural breaks (or Jenks) and
Quantile.
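The 0–300 example above can be sketched as a small function that returns the class boundaries.

```python
def equal_interval_breaks(low, high, n_classes):
    """Divide the range [low, high] into n_classes equal-width subranges."""
    width = (high - low) / n_classes
    return [(low + i * width, low + (i + 1) * width) for i in range(n_classes)]

# The 0-300 example from the text, split into three classes
print(equal_interval_breaks(0, 300, 3))
```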
Imputation involves replacing
missing values with an estimated value. It is one of three options for handling missing data. The general principle
is to delete when the data are expendable, impute when the data are precious, and segment for the less common situation in which a large data
set has a large fissure.
Source: SAGE Research Methods.
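One of the simplest forms of imputation replaces each missing value with the mean of the observed values; this sketch uses hypothetical income data with missingness coded as None.

```python
# Hypothetical income values with missing entries coded as None
incomes = [21000, None, 34000, 28000, None, 40000]

observed = [v for v in incomes if v is not None]
mean_income = sum(observed) / len(observed)

# Mean imputation: replace each missing value with the observed mean
imputed = [v if v is not None else mean_income for v in incomes]
print(imputed)
```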
Informed consent is a process used in research where individuals are provided with complete and adequate information about a study including its risks, benefits,
and alternatives, based on which the individual decides whether to participate in the study or not.
Source: SAGE Research Methods.
Data that contain information about the sampled units (e.g. respondents, households) measured on two or more occasions.
Source: Learn Statistics Easily.
In long format data, each row represents a single observation or measurement for a subject, often resulting
in multiple rows for each subject. For example, in a
longitudinal survey
tracking student performance over several years, each student will
have multiple rows corresponding to different years, with each row
recording their performance for that particular year.
Contrast with
wide-format data.
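The student-performance example can be sketched with hypothetical long-format records, pivoted into wide format so each student gets a single row.

```python
# Hypothetical long-format records: one row per student per year
long_rows = [
    {"student": "A", "year": 2021, "score": 61},
    {"student": "A", "year": 2022, "score": 67},
    {"student": "B", "year": 2021, "score": 55},
    {"student": "B", "year": 2022, "score": 60},
]

# Pivot to wide format: one row per student, one column per year
wide = {}
for row in long_rows:
    wide.setdefault(row["student"], {})[f"score_{row['year']}"] = row["score"]

print(wide)
```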
A longitudinal design is one that measures the characteristics of the same individuals on at least two, but ideally more, occasions over time. Its purpose is to directly address the study of individual change and variation. Longitudinal studies are expensive in terms of both time and money, but they provide many significant advantages relative to cross-sectional studies.
Source: SAGE Research Methods.
Machine learning is a subset of artificial intelligence (AI) that gives systems the ability to learn and improve automatically from experience without being explicitly programmed.
Metadata is a set of data that describes and gives information about other data. Information that describes significant aspects (e.g. content, context and structure of information) of a resource; metadata are created for the purposes of resource discovery, managing access and ensuring efficient preservation of resources.
Microdata are unit-level data obtained from sample surveys, censuses, and administrative systems. They provide information about characteristics of individual people or entities such as households, business enterprises, facilities, farms or even geographical areas such as villages or towns.
Source: The World Bank.
Some
variables have
values that are recorded as missing. These values may be missing unintentionally (due to data entry errors) or may stem from the survey design (e.g. if only part of the sample were asked a particular question). Sometimes
non-substantive responses (such as ‘don’t know’) are also recorded as missing values. To draw accurate inferences about the data, missing values need to be treated prior to analysis, e.g. excluded.
This is a method of dividing the data displayed in a
choropleth map. The Natural breaks (Jenks) method groups similar values together, and breaks are assigned where there are relatively large distances between the classes. This reduces variance within classes and maximises variance between classes.
See also
Equal interval and
Quantile.
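For small datasets the Jenks criterion can be computed exactly by brute force: try every split of the sorted values and keep the one that minimises total within-class variance. This is a sketch of the idea, not the optimised dynamic-programming algorithm used by GIS software.

```python
from itertools import combinations

def natural_breaks(values, n_classes):
    """Exact Jenks-style classes for small datasets: brute-force the split
    of the sorted values that minimises total within-class squared deviation."""
    data = sorted(values)

    def ssd(group):  # sum of squared deviations from the group mean
        m = sum(group) / len(group)
        return sum((x - m) ** 2 for x in group)

    best, best_classes = float("inf"), None
    # Choose n_classes - 1 cut points between adjacent sorted values
    for cuts in combinations(range(1, len(data)), n_classes - 1):
        bounds = [0, *cuts, len(data)]
        classes = [data[bounds[i]:bounds[i + 1]] for i in range(n_classes)]
        total = sum(ssd(c) for c in classes)
        if total < best:
            best, best_classes = total, classes
    return best_classes

# Two tight groups with a gap: the break should fall in the gap
print(natural_breaks([1, 2, 3, 10, 11, 12], 2))
```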
This is a type of
categorical variable that represents categories that do not have a natural order. The values assigned to the categories can be presented in any order. For example, there is no natural order to a set of categories describing the religion a person might follow.
Non-substantive responses in surveys are responses such as: ‘Not sure/ Do not recall’, ‘Don’t know (DK)’. Non-substantive responses are generally not used in analysis.
This is a type of
categorical variable that contains
values which represent categories which have a natural order. For example, a highest level of qualification variable might follow an order such as:
- higher degree
- first degree
- further education below degree
- GCSE or equivalent
- no qualification
There is a logical order that the values assigned to the categories can be presented in.
Panel studies follow the same individuals over time. Information is normally collected about the whole household at each wave. See also:
wave.
Source: Closer Learning Hub.
In survey design, a population is an entire collection of observation units, for example all 'residents in England and Wales in 2020', about which researchers seek to draw inferences.
Statistics produced using a sample of cases (sample statistics), which are designed to produce an estimate about the characteristics of the population (population parameter).
Source: SAGE Research Methods.
Precision refers to the size of deviations from a survey estimate (i.e. a survey statistic, such as a mean or percentage) that occurs over repeated application of the same probability-based
sampling procedures using the same sampling frame and sample size.
Standard errors and
confidence intervals are two examples of commonly used measures of precision.
Source: SAGE Research Methods.
Primary data is data collected first-hand for a specific research purpose or project.
Source: SAGE Research Methods.
A sample based on random selection of elements. It should be possible for the sample designer to calculate the probability with which an element in the population is selected for inclusion in the sample.
This is a method of dividing the data displayed in a
choropleth map.
Quantile classification arranges data so there is the same count of
features in each class. This will result in an equal distribution of
shading across the map. This can result in a misleading map, as
similar features can be in different classes, and widely different
features in the same class.
See also
Equal interval and
Natural breaks (or Jenks).
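Quantile classification can be sketched as sorting the values and slicing them into classes of (as near as possible) equal size; the values below are hypothetical.

```python
def quantile_classes(values, n_classes):
    """Assign features to classes so each class holds (roughly) the same count."""
    data = sorted(values)
    size, remainder = divmod(len(data), n_classes)
    classes, start = [], 0
    for i in range(n_classes):
        # Spread any remainder across the first few classes
        end = start + size + (1 if i < remainder else 0)
        classes.append(data[start:end])
        start = end
    return classes

# Nine values into three classes of three features each
print(quantile_classes([3, 1, 4, 1, 5, 9, 2, 6, 5], 3))
```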
A representative
sample is one that replicates the characteristics of the population.
In the context of scientific research, reproducibility refers to the ability of an independent researcher or team to recreate the results of a study using the same methods and data as the original study. This concept hinges on the provision of detailed methodology, context, and background information.
Research data management refers to the systematic organisation, storage, preservation and sharing of data resulting from a research project. It involves practices that span the entire data lifecycle, from planning the collection to wider data sharing.
Source: CODA.
Research methodology is a description of the approach followed to complete a
research project; the 'how' that helps the researcher address the research aims,
objectives and research questions.
A person, or other entity, who responds to a survey.
A sampling frame is a comprehensive list of all the members of the population from which a probability sample will be selected.
Source: SAGE Research Methods.
Standard error measures the uncertainty or variability associated with a sample estimate when compared to the true population parameter. The standard error of a statistic (like a mean or percentage) indicates how much that statistic is expected to vary from the true population value.
The standard error is also inversely proportional to the sample size; the larger the sample size, the smaller the standard error.
Source: Stat Trek Statistics Dictionary.
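The standard error of a mean is the sample standard deviation divided by the square root of the sample size, so it shrinks as n grows; the measurements below are hypothetical.

```python
import math
import statistics

sample = [12, 15, 9, 14, 10, 13, 11, 16]  # hypothetical measurements

# Standard error of the mean: sample SD divided by sqrt(n)
se = statistics.stdev(sample) / math.sqrt(len(sample))
print(f"standard error: {se:.3f}")

# A larger sample with the same spread yields a smaller standard error
bigger = sample * 4
se_big = statistics.stdev(bigger) / math.sqrt(len(bigger))
```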
A statistical model is a theoretical construction of the relationship of explanatory variables to variables of interest created to better understand these relationships.
They typically consist of a collection of probability distributions and are used to describe patterns of variability that data may display.
The statistical model is expressed as a function. For example, a researcher may model a linear relationship using the regression function below:
y = b0 + b1x1 + b2x2 + ... + bixi
In this model, y represents an outcome variable and xi represents its corresponding predictor variables. The term b0 is an intercept for the model. The term bi is a regression coefficient and represents the numerical relationship between the predictor variables and the outcome for the ith term.
Statistical modelling is a major topic. Readers who want to know more will find extensive accounts of statistical models including linear regression and logistic regression in statistical texts and online.
Sources: Science Direct;
Magoosh Statistics Blog.
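For the single-predictor case, the regression coefficients have a closed-form least-squares solution; this sketch fits y = b0 + b1*x to hypothetical data.

```python
# Hypothetical data roughly following y = 0 + 2x
x = [1, 2, 3, 4, 5]
y = [2.1, 4.0, 6.2, 7.9, 10.1]

n = len(x)
mx, my = sum(x) / n, sum(y) / n

# Closed-form least squares: slope = covariance / variance of x
b1 = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
      / sum((xi - mx) ** 2 for xi in x))
b0 = my - b1 * mx
print(f"y = {b0:.2f} + {b1:.2f}x")
```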
A structured interview follows a strict protocol using a set of defined questions administered in the same order to all interviewees. It allows for a quick collection of focused data, however there are limited opportunities for probing and further exploration of topics. The interviews are usually conducted face to face or over the phone.
Survey design involves a series of methodological steps
to create an effective survey, such as defining an objective, determining
a target population, designing the questionnaire etc.
Survey design can also refer to the structure or format of the survey,
such as a cross-sectional survey, longitudinal survey etc.
Survey nonresponse can occur at both an item and unit level.
Item nonresponse occurs when a sample member responds to the survey but fails to provide a valid response to a particular item (e.g. a question they refuse to answer).
Unit nonresponse occurs when eligible sample members either cannot be contacted, refuse to participate in the survey or do not provide sufficient information for their responses to be valid. Unit nonresponse can be a source of bias in survey estimates and reducing unit nonresponse is an important objective of good survey practice.
Source: SAGE Research Methods.
The target population, or simply the population, represents
the specific group we are interested in studying.
The unit which is being analysed. This is synonymous with
case.
Univariate analysis involves analysis of a single variable. Examples of univariate analyses include descriptive statistics (mean, standard deviation, kurtosis), goodness-of-fit tests, and the Student’s t-test.
A representation of a characteristic for one case. For one variable, values may vary from one case to another. E.g. for the variable ‘gender’ the values may be ‘male’, ‘female’ or ‘other’.
A description of the values a variable can take on. Sometimes nominal values are coded as numbers and the label helps to describe what each of these numbers means. E.g. for the variable ‘gender’ the values may be:
- female
- male
- other
A variable is an attribute that describes a person, place, thing, or idea. The value of the variable can "vary" from one entity to another. In surveys, this is usually a characteristic that varies between
cases.
Source: Stat Trek Statistics Dictionary.
A wave is a round of data collection in a particular longitudinal survey (for example, the age 7 wave of the National Child Development Study refers to the data collection that took place in 1965 when the participants were aged 7). Note that the term sweep often has the same meaning.
Source: Closer Learning Hub.
Weighting is a statistical adjustment made to survey data to improve accuracy of survey estimates. Weighting can correct for unequal probabilities of selection and survey non-response.
Source: SAGE Research Methods.
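A weighted estimate multiplies each response by its weight and divides by the sum of the weights. In this hypothetical sketch, under-represented younger respondents receive a larger weight, which changes the estimated smoking rate.

```python
# Hypothetical survey: young respondents are under-represented, so they
# receive a larger weight to restore their share of the population
responses = [
    {"age_group": "16-34", "smokes": 1, "weight": 2.0},
    {"age_group": "16-34", "smokes": 0, "weight": 2.0},
    {"age_group": "35+",   "smokes": 0, "weight": 0.8},
    {"age_group": "35+",   "smokes": 0, "weight": 0.8},
    {"age_group": "35+",   "smokes": 1, "weight": 0.8},
]

unweighted = sum(r["smokes"] for r in responses) / len(responses)
weighted = (sum(r["smokes"] * r["weight"] for r in responses)
            / sum(r["weight"] for r in responses))
print(f"unweighted: {unweighted:.4f}, weighted: {weighted:.4f}")
```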
In wide format data, each subject's responses are listed in a single row, with different variables spread across multiple columns.
Longitudinal data in wide format would contain one row of information per person, and measurements of the same variable at different time points would be contained in different variables.
Contrast with
long-format data.