Description
This track indicates the locations of the centromere sequences.
Centromeres are specialized chromatin structures that are required for cell division. These
genomic regions are normally defined by long tracts of tandem repeats, or satellite DNA, that
contain a limited number of sequence differences to distinguish the linear order of repeat copies.
The size and repetitive nature of these regions mean they are typically not represented in
reference assemblies. Unlike all previous versions of the human reference assembly, in which the
centromere regions were represented by multi-megabase gaps, GRCh38 incorporates centromere
reference models that provide an initial genomic description derived from chromosome-assigned whole
genome shotgun (WGS) read libraries of alpha satellite.
Each reference model provides an approximation of the true array sequence organization.
Although the long-range repeat ordering is not expected to represent the true organization,
the submissions are expected to provide a biologically rich description of array variants and
local-monomer organization as observed in the initial WGS read dataset. As a result, these
sequences serve as a useful mapping target to extend sequence-based studies to sites previously
omitted from the human reference genome.
Methods
The sequences are generated based on second-order Markov models of monomer
variants, and graphical models of larger scale higher order repeats.
The graphical models are based on an analysis of Sanger reads from the
HuRef sequencing project (Assembly
GCA_000002125.1; BioProject
PRJNA19621),
and their local-ordering is supported by observed same-read monomer
adjacencies. The Markov models are generated by the program linearSat, which
was written for this project and which also generates a linear representation
of monomer order. linearSat extends a second-order Markov chain until it
reaches the size of the given array, as estimated by sequence coverage
normalization. The sequence definitions of transposable element insertions are
limited to the sequences directly adjacent to alpha satellite within the read
database, and incomplete representations are noted with an adjacent
100 bp gap. Taken together, these sequences provide a more complete reference
for the sequence composition and higher order repeat variation inherent to a
given alpha satellite array, and are used to assemble the centromeric regions
of the human chromosomes.
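The second-order Markov chain idea described above can be illustrated with a minimal Python sketch. This is not the linearSat implementation: the function names, the toy monomer labels (m1 to m4), and the three example reads are all hypothetical, and the target array length is passed in directly rather than derived from coverage normalization. The sketch only shows the general technique, in which the next monomer is drawn according to how often it was observed after the two preceding monomers on the same read.

```python
import random
from collections import Counter, defaultdict


def build_second_order_model(reads):
    """Count which monomer follows each ordered pair of monomers on a read."""
    model = defaultdict(Counter)
    for read in reads:
        for first, second, third in zip(read, read[1:], read[2:]):
            model[(first, second)][third] += 1
    return model


def generate_array(model, start_pair, n_monomers, rng):
    """Extend the chain from start_pair until n_monomers are emitted.

    Stops early if the current pair was never followed by anything
    in the input reads (a dead end in the toy data).
    """
    array = list(start_pair)
    while len(array) < n_monomers:
        counts = model.get((array[-2], array[-1]))
        if not counts:
            break
        # Sort for a reproducible ordering, then sample by observed frequency.
        monomers, weights = zip(*sorted(counts.items()))
        array.append(rng.choices(monomers, weights=weights, k=1)[0])
    return array


# Hypothetical reads, each a list of monomer variant labels in read order.
reads = [
    ["m1", "m2", "m3", "m1", "m2"],
    ["m2", "m3", "m1", "m2", "m3"],
    ["m3", "m1", "m2", "m4", "m1"],
]
model = build_second_order_model(reads)

# In this sketch the array size (12 monomers) is simply assumed.
array = generate_array(model, ("m1", "m2"), 12, random.Random(0))
print(" ".join(array))
```

Every three-monomer window in the generated array corresponds to an adjacency seen on at least one input read, which mirrors the statement that local ordering is supported by same-read monomer adjacencies while long-range ordering is not expected to be accurate.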
Credits
The data for this track was supplied by
Karen Miga.
References
Miga KH, Newton Y, Jain M, Altemose N, Willard HF, Kent WJ.
Centromere reference models for human chromosomes X and Y satellite arrays.
Genome Res. 2014 Apr;24(4):697-707.
PMID: 24501022; PMC: PMC3975068