Research Programme 4: Demographic, Socio-Economic and Environmental Data Linkage

Prof Paul Boyle, Dr Chris Dibben

Aims

  1. To explore the feasibility of taking major genetic studies in Scotland back through time by linking historical vital events data for the >20,000 study members and their families
  2. To estimate and validate complex time-space exposures to various environmental agents through a linkage between environmental datasets, hospital admissions and the Scottish Longitudinal Study (SLS)

Exemplar Study 1: Longitudinal Research Support: Linking genetic studies back through time (Prof Paul Boyle, Prof Paul McKeigue)

Background

A variety of large-scale genetic databases now exist in Scotland and elsewhere. These studies provide the opportunity to identify genetic variants that account for variation in the quantitative traits underlying the major common complex diseases (such as cardiovascular disease, cognitive decline and mental illness). We propose a pilot study to explore the feasibility of taking major genetic studies in Scotland back through time by linking historical data for the study members and their families. It will make use of a unique set of vital events records (birth, death and marriage certificates) scanned for the whole of Scotland back to 1855. An index allows these scanned images to be searched, making it possible to build up complex family histories using genealogical methods.

Current challenges

Genealogical data have several uses in genetic epidemiology: they are the basis of genetic linkage studies to localize disease genes; they can yield estimates of the size of genetic effects on an outcome variable; they can be used to correct for relatedness as a confounder in association studies; and they can be used in linkage disequilibrium mapping to identify haplotypes shared by affected individuals.
The success of deCODE in Iceland in discovering genes influencing complex traits has been at least partly attributable to its ability to construct large genealogies. This is a real possibility in Scotland because vital events back to 1855 have been scanned and indexed. This resource could add considerable value to Scottish genetic studies; the challenge is to explore how reliably and cost-efficiently this could be done.

Research questions

This study will create a linked sample of around 1000 members of the Generation Scotland Scottish Family Health Study and the Wellcome Trust UK Type 2 Diabetes Case Control Collection, with pedigrees back to 1855 and cause-specific mortality for all pedigree members. From the pedigrees, we can compute the relationship matrix (and its inverse) for any subset of individuals on whom phenotypic data are available. We can use these matrices for several types of study:

  1. Adjustment for additive genetic effects in epidemiological studies, including genetic association studies in which tests of any locus under study must be adjusted for additive polygenic effects to control for relatedness.
  2. Identification of inbred individuals for studies of effects associated with homozygosity by descent.
  3. Estimation of the size of additive genetic effects on any measured trait or outcome. This can be applied to any recorded outcome or trait, including cause-specific mortality and longevity for which genetic effects on outcome can be modelled with Cox regression.
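
As an illustration of how a relationship matrix can be derived from a pedigree, the sketch below uses the standard tabular (recursive) method. It is a minimal example under stated assumptions, not the study's own software: the function names, the tuple layout of the pedigree and the toy family are all hypothetical. The diagonal of the matrix equals 1 plus the inbreeding coefficient, which is what allows inbred individuals (item 2) to be flagged.

```python
def relationship_matrix(pedigree):
    """Additive relationship matrix by the tabular method.

    pedigree: list of (person_id, father_id, mother_id) tuples, ordered so
    that parents appear before their offspring; None marks an unknown parent.
    Returns (ids, A), where A is a list of lists.
    """
    index = {}
    n = len(pedigree)
    A = [[0.0] * n for _ in range(n)]

    def position(parent_id):
        # Unknown parents are treated as unrelated founders.
        if parent_id is None:
            return None
        if parent_id not in index:
            raise ValueError(f"parent {parent_id!r} must be listed before offspring")
        return index[parent_id]

    for i, (person_id, father_id, mother_id) in enumerate(pedigree):
        f, m = position(father_id), position(mother_id)
        # Relationship to each earlier person j is the average of j's
        # relationships to this person's two parents.
        for j in range(i):
            a = 0.0
            if f is not None:
                a += 0.5 * A[j][f]
            if m is not None:
                a += 0.5 * A[j][m]
            A[i][j] = A[j][i] = a
        # Inbreeding coefficient: half the relationship between the parents.
        inbreeding = 0.5 * A[f][m] if f is not None and m is not None else 0.0
        A[i][i] = 1.0 + inbreeding
        index[person_id] = i
    return [row[0] for row in pedigree], A


def inbred_individuals(ids, A, tol=1e-12):
    """Return the ids whose diagonal entry exceeds 1 (inbreeding > 0)."""
    return [pid for k, pid in enumerate(ids) if A[k][k] > 1.0 + tol]


# Hypothetical toy pedigree: two founders, two full siblings, and a child
# of the siblings (chosen only because it makes the inbreeding visible).
pedigree = [
    ("gf", None, None),
    ("gm", None, None),
    ("son", "gf", "gm"),
    ("daughter", "gf", "gm"),
    ("child", "son", "daughter"),
]
ids, A = relationship_matrix(pedigree)
```

The inverse of the matrix, which the models need, could then be obtained with any numerical linear algebra library; that step is omitted here.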

Methodology

This effort builds upon the GROS DIGROS project (Digital-Imaging of the Genealogical Records of Scotland's People), which converted all statutory vital events records (births, marriages and deaths) since 1855 into digital imaging format. First, we will contact a sample of the study members (around 1000 of the >20,000 members) to gather information about the names and dates of birth of their relatives. Second, with this information we will use genealogical methods to identify the relevant records about family members in the DIGROS system. Third, we will transcribe relevant information (cause of death, age at death, etc.) from the electronically scanned images. Fourth, we will develop a database system for managing this information, including a system for monitoring the precise costs of conducting each part of the study.
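
The second step, finding index entries that plausibly correspond to a relative named by a study member, might look something like the sketch below. This is only an illustration: the record fields (`forename`, `surname`, `birth_year`), the thresholds and the sample entries are hypothetical and do not describe the DIGROS index. Name similarity is computed with Python's standard `difflib.SequenceMatcher`, a generic string comparison swapped in here for whatever genealogical matching rules would actually be used.

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    """Similarity in [0, 1] between two names, ignoring case and padding."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def candidate_matches(reported, index_records, year_tolerance=2, min_similarity=0.85):
    """Return (score, record) pairs for index records that resemble `reported`.

    A record qualifies if its birth year is within `year_tolerance` years and
    both forename and surname reach `min_similarity`. Results are sorted with
    the best score first; a human would still review each candidate.
    """
    matches = []
    for record in index_records:
        if abs(record["birth_year"] - reported["birth_year"]) > year_tolerance:
            continue
        score = min(
            name_similarity(reported["forename"], record["forename"]),
            name_similarity(reported["surname"], record["surname"]),
        )
        if score >= min_similarity:
            matches.append((score, record))
    matches.sort(key=lambda pair: pair[0], reverse=True)
    return matches


# Hypothetical example: a relative as recalled by a study member, and three
# made-up index entries.
reported = {"forename": "Margaret", "surname": "McDonald", "birth_year": 1902}
index_records = [
    {"forename": "Margaret", "surname": "MacDonald", "birth_year": 1901},
    {"forename": "Mary", "surname": "McDonald", "birth_year": 1902},
    {"forename": "Margaret", "surname": "McDonald", "birth_year": 1880},
]
candidates = candidate_matches(reported, index_records)
```

Recording how many candidates each search returns, and how many are confirmed on inspection of the scanned image, would be one simple way to quantify linkage accuracy and cost per record.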
The statistical modelling can be undertaken in a Bayesian framework, using MCMC simulation to generate the posterior distribution of additive genetic effects in linear, logistic or survival regression models. McKeigue has previously applied these methods to the analysis of the ORCADES study of a Scottish population isolate.
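
For the linear case, one common way to write such a model is the following. The notation is an assumption made for illustration, since the text does not give a specification: y is the vector of trait values, X the design matrix of covariates with coefficients β, a the vector of individual additive genetic effects, e the residuals, and A the relationship matrix computed from the pedigrees.

```latex
y = X\beta + a + e, \qquad
a \sim \mathcal{N}\!\left(0,\ \sigma_a^2 A\right), \qquad
e \sim \mathcal{N}\!\left(0,\ \sigma_e^2 I\right)

h^2 = \frac{\sigma_a^2}{\sigma_a^2 + \sigma_e^2}
```

Under this formulation the prior density of a depends on the inverse of A, and the posterior distribution of the variance components gives the proportion of trait variance attributable to additive genetic effects, h². Logistic and survival versions would replace the Gaussian likelihood with the appropriate link or hazard model.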

Deliverables

We will have explored the accuracy with which linkages can be made, tested the reliability of the sample and conducted preliminary analyses of the resulting data. We will then host a workshop of experts at which we will present the results of the study and discuss the relative costs and benefits of conducting a full-scale linkage for the entire sample of the two genetic studies used here, informing any future request for funding. This will be the first attempt in the UK to link information about respondents in genetic studies back through time, and it has the potential to become a unique resource for those interested in disease and genetic variation across populations.

Exemplar project 2: Linking demographic, socioeconomic, geospatial and environmental datasets within SHIP (Dr Chris Dibben, Prof Paul Boyle)

Aims

  1. To explore the practicalities of linking environmental pollution data to existing longitudinal datasets
  2. To estimate the effect of exposures to various environmental agents on health

Background

Understanding the relationship between environmental insults and health can require large study sizes because the effects are often relatively small. It is also often useful to know the long-term exposure of individuals. Such data can therefore be difficult and expensive to collect through survey methods.
The quality and depth of Scotland's EPRs, which encompass information on the health of the whole population for some 30 years, mean that they represent a powerful tool for understanding the relationship between environmental insults and health. In addition, the Scottish Longitudinal Study (SLS) ties these data to other non-medical data for a large sample. To exploit these data sources, however, environmental conditions need to be estimated for all members of these administrative datasets. This exemplar project will explore the practicalities of doing this and assess the quality of the results.

Current challenges

In Scotland there has been little linkage of environmental data to existing administrative health data. Methods for doing this, as well as further refinement of the sources of environmental data, need to be developed before this potential research resource can be fully exploited.

Research questions

  1. Is it possible to produce valid, reliable estimates of environmental conditions for individuals within the major Scottish administrative datasets and the SLS?
  2. Can long-term and time-space varying exposures be estimated?
  3. Can health effects be detected using these methods?

Methodology

We will use Geographical Information Systems (GIS) to create spatial surface models of environmental conditions, such as exposure to ambient pollution (both anthropogenic and natural in origin), for all individuals within the main health administrative datasets and the SLS. Estimates will be further refined by using information contained particularly in the SLS to model complex time-space exposures to various environmental agents. The SLS will also be used to model important potential confounders. The relationship between the environmental estimates and health outcomes will then be explored.
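
To make the idea of a spatial surface model concrete, the sketch below estimates a pollutant concentration at a residential location from nearby monitoring sites using inverse distance weighting (IDW). The choice of IDW is an assumption made purely for illustration; the text does not say which interpolation or dispersion method will be used, and the function name, coordinates and concentrations are hypothetical.

```python
import math


def idw_estimate(target, monitors, power=2.0):
    """Inverse-distance-weighted estimate at `target`.

    target: (x, y) coordinates of the location of interest.
    monitors: list of ((x, y), value) pairs for monitoring sites, with
    coordinates in the same projected units as `target`.
    power: exponent controlling how quickly influence decays with distance.
    """
    if not monitors:
        raise ValueError("at least one monitoring site is required")
    weighted_sum = 0.0
    weight_total = 0.0
    for (mx, my), value in monitors:
        distance = math.hypot(target[0] - mx, target[1] - my)
        if distance == 0.0:
            # The target coincides with a monitor: use its reading directly.
            return value
        weight = 1.0 / distance ** power
        weighted_sum += weight * value
        weight_total += weight
    return weighted_sum / weight_total


# Hypothetical monitors (coordinates in km, concentrations in µg/m³).
monitors = [((0.0, 0.0), 10.0), ((2.0, 0.0), 30.0)]
midpoint_estimate = idw_estimate((1.0, 0.0), monitors)  # equidistant from both
```

Repeating such an estimate for each address and year in an individual's residential history would give one route to a time-space varying exposure.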

Deliverables

The main health administrative datasets and the SLS will have been linked to environmental data from 1993, including estimates of air pollution (PM10, SO2, NO2 and O3), radon and sunlight exposure. A number of innovative methodologies will have been developed that will be applicable to other parts of the UK. A more nuanced method of estimating environmental exposure will also have been developed. Pollution, for example, is almost always measured with reference to an individual's place of residence, which represents only a part of their daily exposure; in this study we propose providing both home and work exposures.
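
One simple way to combine home and work estimates is a time-weighted average, sketched below. This is an assumption about how the two exposures might be combined, not a description of the project's method; the function name, the default of 40 working hours per week and the example concentrations are hypothetical.

```python
HOURS_PER_WEEK = 168.0


def time_weighted_exposure(home_conc, work_conc, hours_at_work_per_week=40.0):
    """Average concentration weighted by time spent at work and at home.

    All time not spent at work is attributed to the home location, which is
    itself a simplification (commuting and leisure are ignored).
    """
    if not 0.0 <= hours_at_work_per_week <= HOURS_PER_WEEK:
        raise ValueError("hours_at_work_per_week must be between 0 and 168")
    work_share = hours_at_work_per_week / HOURS_PER_WEEK
    return (1.0 - work_share) * home_conc + work_share * work_conc


# Hypothetical example: cleaner home location, more polluted workplace.
combined = time_weighted_exposure(home_conc=20.0, work_conc=41.0,
                                  hours_at_work_per_week=42.0)
```

Comparing the combined estimate with the home-only value for each individual would indicate how much a residence-based measure misstates exposure.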