Gene expression and the three-dimensional organization of the genome

Proper control of gene expression is critical for cell differentiation and homeostasis, and aberrant gene expression leads to disease states. Most of our understanding of the mechanisms that control transcription comes from studies that examine the one-dimensional 10 nm chromatin fiber. Only in recent years has technology become available to examine the effect of the 3D organization of the 10 nm chromatin fiber on gene expression. Understanding how transcription and chromatin 3D organization are causally related, and how each affects the other, is important from a basic science perspective. In addition, recent evidence suggests that many human diseases, including laminopathies and cohesinopathies, arise as a consequence of alterations in processes related to chromatin 3D organization. Furthermore, recent results suggest that alterations in cohesin function result in acute myeloid leukemia (AML), that mutations in CTCF sites result in altered expression of cancer-related genes and tumorigenesis, and that mutations in CTCF cause autism and related disorders. Therefore, understanding the mechanisms leading to the establishment and maintenance of chromatin 3D organization is critical.

Figure 1. Classical view of 3D organization based on low resolution Hi-C data. A. Compartments defined by the Eigenvector of PCA. B. TADs defined computationally by a directionality index; note that not all TADs correspond to CTCF loops. C. Loops containing CTCF at their anchors are manifested by intense punctate signal. Note that interactions within a CTCF loop are not necessarily uniform.

The organization of the genetic material within the three-dimensional (3D) nuclear space affects and/or is affected by critical cellular processes such as gene expression, DNA replication, and recombination. Discerning the mechanistic principles responsible for this organization and the causal relationships between genome architecture and function is critical to understanding basic biological processes involved in cell differentiation and the establishment of disease states. Our current knowledge of nuclear 3D organization comes from studies using microscopy, biochemistry and, more recently, high throughput genomic approaches involving chromatin conformation capture (3C) methods. Each of these approaches has advantages and drawbacks but, together, they have given us a picture of how the chromatin fiber is packed in the eukaryotic nucleus. In particular, results from Hi-C experiments suggest that chromosomes are divided into large domains, several megabases (Mb) long, called compartments. These domains are identified by Principal Component Analysis (PCA) of Hi-C data analyzed at low resolution and are defined by the sign of the first component, PC1 or Eigenvector, which therefore has two states. As a consequence, compartments have been classified as A and B (Figure 1A, dark orange triangles). Comparison of the location of compartments in the genome with the distribution of covalent histone modifications, RNA Polymerase II (RNAPII) and RNA-seq data suggests that A compartments (Figure 1A, red bars) contain active genes and B compartments (Figure 1A, blue bars) contain inactive genes or are depleted of genes. However, this correlation is not perfect at this level of resolution, since B compartments contain some active genes and not all genes in A compartments are active (Figure 1A; compare Eigenvector and RNA-seq).
A and B compartments interact with other A and B compartments, respectively, which is visible in Hi-C heatmaps as the plaid pattern of signal away from the diagonal (Figure 1A, light orange rectangles), but A compartments do not interact with B compartments (Figure 1A, very light orange rectangles). This pattern of interactions agrees with classical microscopy studies, which show that active genes tend to interact with each other to form transcription factories, silenced genes containing H3K27me3 and the Polycomb (Pc) complex interact to form Pc bodies, and the H3K9me3-containing pericentromeric regions interact to form chromocenters. These results also agree with recent studies showing that H3K9me3/HP1a can form condensates by liquid-liquid phase separation, and that the same is true for H3K27me3/Pc and the carboxy-terminal domain of RNAPII. Importantly, while microscopy and biochemical experiments suggest the existence of three types of nuclear compartments, formed as a consequence of interactions among proteins present in various types of chromatin, Hi-C only detects two. We suggest that this discrepancy, which we discuss in more detail below, is due to the inability of the Eigenvector to identify more than two states in the first component of PCA.
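
As an illustration only, the PCA-based compartment call described above can be sketched in Python with NumPy. This is a minimal sketch under stated assumptions, not the pipeline used in the studies discussed here: the function name compartment_pc1, the distance normalization by diagonal means, and the toy contact matrix are all hypothetical choices made for the example. The sign of PC1 is arbitrary, so in practice it is usually oriented with an independent track such as gene density.

```python
import numpy as np

def compartment_pc1(contacts):
    """Return PC1 of the correlation matrix of observed/expected contacts.

    `contacts` is a symmetric, square Hi-C contact matrix (bins x bins).
    Illustrative sketch only; real data also need balancing and filtering.
    """
    contacts = np.asarray(contacts, dtype=float)
    n = contacts.shape[0]

    # Expected contacts at each genomic distance: the mean of each diagonal.
    expected = np.ones_like(contacts)
    for d in range(n):
        mean_d = np.diagonal(contacts, d).mean()
        if mean_d > 0:
            idx = np.arange(n - d)
            expected[idx, idx + d] = mean_d
            expected[idx + d, idx] = mean_d

    # Remove the distance decay, then correlate the interaction profiles.
    obs_over_exp = contacts / expected
    corr = np.corrcoef(obs_over_exp)

    # eigh returns eigenvalues in ascending order; the last column is PC1.
    eigenvalues, eigenvectors = np.linalg.eigh(corr)
    return eigenvectors[:, -1]

# Hypothetical toy chromosome: 40 bins in four blocks of two chromatin types.
labels = np.array([0] * 8 + [1] * 12 + [0] * 15 + [1] * 5)
pos = np.arange(labels.size)
distance = np.abs(pos[:, None] - pos[None, :])
same_type = labels[:, None] == labels[None, :]
# Contacts decay with distance and are enriched between bins of the same type.
toy = np.where(same_type, 2.0, 0.5) / (1.0 + distance)

pc1 = compartment_pc1(toy)
calls = (pc1 > 0).astype(int)  # two states only, given by the sign of PC1
```

Because the call is made from the sign of a single component, this kind of analysis can only return two states, which is the limitation noted above for the Eigenvector.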

In addition to compartments identified by PCA at low resolution, analysis of Hi-C data at higher resolution, using computational algorithms designed to detect changes in the directionality of interactions, allows the identification of domains smaller than compartments, named Topologically Associating Domains (TADs) (Figure 1B). To avoid confusion, we will use the term “TAD” exclusively to refer to domains identified by algorithms that measure directionality of interactions, as described in the original publications reporting the existence of these domains. TADs do not correlate with a specific transcriptional state, as compartments do. Instead, TADs are defined based on boundaries where sequences change the orientation of their interactions in the Hi-C heatmap. As a consequence, sequences within TADs interact preferentially with each other rather than with sequences in different TADs. Importantly, TAD boundaries are not uniform: some contain CTCF sites whereas others contain actively transcribed genes. However, some authors in the literature use the term TAD to refer to CTCF loops, although the two are not equivalent. CTCF loops are visible in Hi-C heatmaps as puncta of intense signal, whereas many TADs lack this signal at their summits (Figure 1B). The partial overlap between TADs and CTCF loops has resulted in the attribution of properties of CTCF loops to TADs. CTCF is an architectural protein (previously referred to as an insulator protein) that is able to form loops between convergently oriented sites via loop extrusion by the cohesin complex (Figure 1C). Using insulator assays with reporter genes, CTCF was shown to inhibit enhancer-promoter interactions when one of these two sequences was present inside the loop and the other outside.
More recent results based on analyses of Hi-C or related data suggest that sequences inside CTCF loops do indeed interact with each other more frequently than with sequences outside, probably due to the constant extrusion process via cohesin, which forces sequences on the two sides of the loop to come together. This explains the decrease in interactions between enhancers and promoters when they are separated by CTCF sites. However, depending on their arrangement, CTCF sites can also serve to bring enhancers close to their cognate promoters, hence the classification of CTCF as an architectural protein to reflect its more general role in the nucleus. Based on this, the broad consensus in the field is that chromosomes are organized in a hierarchical manner, with large compartments that roughly correlate with the transcriptional state of their sequences and smaller TADs, contained within the larger compartments, some of which correspond to CTCF loops. Since TADs are formed by sequences that preferentially self-interact, it is thought that TADs represent domains of regulation, i.e., a specific TAD contains genes and their regulatory sequences, and the TAD constrains interactions among regulatory elements and their target promoters to ensure co-regulation of genes within the TAD. Based on this view, gene expression is a consequence of 3D organization.
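
To make the notion of directionality concrete, the following Python sketch computes a chi-square-style directionality index for each bin and places a boundary where the index switches from upstream-biased to downstream-biased. It is a minimal sketch, assuming this particular form of the index; the function name directionality_index, the window size, and the toy matrix are hypothetical and chosen only for illustration.

```python
import numpy as np

def directionality_index(contacts, window):
    """Signed directionality index per bin (illustrative sketch).

    For bin i, A is the sum of contacts with the `window` bins upstream,
    B the sum with the `window` bins downstream, and E = (A + B) / 2.
    DI = sign(B - A) * ((A - E)**2 + (B - E)**2) / E
    Windows are truncated at the chromosome ends.
    """
    contacts = np.asarray(contacts, dtype=float)
    n = contacts.shape[0]
    di = np.zeros(n)
    for i in range(n):
        upstream = contacts[i, max(0, i - window):i].sum()    # A
        downstream = contacts[i, i + 1:i + 1 + window].sum()  # B
        expected = (upstream + downstream) / 2.0              # E
        if expected == 0 or upstream == downstream:
            continue  # no directional bias; DI stays 0
        sign = np.sign(downstream - upstream)
        di[i] = sign * ((upstream - expected) ** 2
                        + (downstream - expected) ** 2) / expected
    return di

# Hypothetical toy map: two self-interacting domains of 10 bins each.
domain = np.repeat([0, 1], 10)
toy = np.where(domain[:, None] == domain[None, :], 10.0, 1.0)

di = directionality_index(toy, window=5)

# A boundary is called where DI flips from negative (upstream bias)
# to positive (downstream bias); the first bin of the new domain is reported.
boundaries = np.where((di[:-1] < 0) & (di[1:] > 0))[0] + 1
```

In this toy map the only flip from negative to positive falls between the two domains. Nothing in the calculation refers to CTCF or to transcription, which is consistent with the point above that TADs defined this way need not coincide with CTCF loops.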

New view of 3D chromatin organization

Our laboratory has had a long-standing interest in understanding the relationship between nuclear organization and transcription. Over the years, we have used Drosophila as a model system to identify and characterize proteins involved in 3D nuclear organization using a combination of genetics and molecular methods. These studies resulted in the identification of the first sequence shown to interfere with enhancer-promoter interactions and the characterization of its associated proteins, which we first referred to as insulator proteins but for which we now prefer the broader term architectural proteins. Using microscopy to analyze the distribution of Drosophila architectural proteins, we found that they are present at specific nuclear locations and first proposed the idea of loop formation as an explanation for their punctate distribution and their ability to interfere with enhancer-promoter interactions. Using genomic approaches, we analyzed the distribution of 15 Drosophila architectural proteins at high resolution, showed that they colocalize at many genomic sites, and demonstrated that the strength of their functional effects correlates with the number and type of architectural proteins present. Using Hi-C we characterized what we called at the time “physical domains” (now called TADs) and showed that architectural proteins are present at the boundaries between TADs.

Figure 2. New view of 3D organization based on high resolution Hi-C data. A. Compartmental domains defined by the Eigenvector of PCA at high resolution. CTCF loops alter interactions within compartmental domains. B. Loops containing CTCF at their anchors and formed by cohesin extrusion can increase interactions between A and B compartmental domains, delimit single A or B domains, or decrease interactions between two halves of a uniform domain.

Using high resolution Hi-C data, we discovered that organisms such as Drosophila, C. elegans, and A. thaliana pack their chromosomes during interphase by forming small self-interacting domains containing genes in the same transcriptional state. We call these domains, which are identified by high resolution PCA and correspond precisely to the transcriptional state, compartmental domains. Based on these findings we have proposed a new view of chromatin 3D organization. In lower eukaryotes, including Neurospora, C. elegans, Arabidopsis, and Drosophila, which either lack CTCF or have a CTCF protein that is unable to stop loop extrusion by cohesin, the only type of self-interacting domain is the compartmental domain, which corresponds precisely to the transcriptional state of the sequences located within it. The main biochemical principle responsible for the formation of compartmental domains and 3D chromatin organization resides in the ability of multivalent proteins present at sequences in different transcriptional states to interact with each other. These results suggest that 3D chromatin organization is a consequence, rather than an effector, of the transcriptional state of genes, i.e., of the nature of the proteins associated with specific sequences, independent of whether those sequences are undergoing transcription. However, once established, compartmental domains interact to stabilize interactions that contribute to the maintenance of the active or silenced chromatin states. In vertebrates, this process is antagonized by continuous extrusion via cohesin, which separates sequences present in compartmental domains from their interacting partners (Figure 2A). The presence of CTCF stops the extrusion process when cohesin interacts with the CTCF protein in a specific orientation, which results in the formation of CTCF loops between anchors in convergent orientation (Figure 2B). This model points to several critical gaps in our understanding of 3D organization that we are currently addressing.
These gaps include the following: 1) Do the findings in Drosophila and other lower eukaryotes, which suggest that compartmental domains correspond to the transcriptional or chromatin state, also hold for mammalian cells? If so, which proteins mediate the interactions that result in the formation of compartmental domains? Can the 3D organization established by compartmental domains be predicted from one-dimensional epigenetic information, including the distribution of chromatin-bound proteins, using computational approaches? What is the contribution of interactions among compartmental domains to regulatory contacts that are significant for gene expression? 2) How does the process of loop extrusion mediated by cohesin, and the inhibition of this process by CTCF to form stable loops, alter interactions among compartmental domains, and what is the effect of these altered interactions on gene expression? How do cells regulate the genomic location of CTCF in order to alter interactions between compartmental domains and gene expression patterns? 3) Finally, how do these two apparently opposing processes, interactions between compartmental domains and loop extrusion, play out during cell differentiation to elicit specific patterns of gene expression?
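
The orientation rule in this model, in which extrusion halts only at CTCF sites facing the approaching cohesin so that loops form between convergent anchors, can be illustrated with a deterministic toy sketch in Python. Everything here is a simplifying assumption made for illustration: the function name extrude_loop, the single extruder, the one-bin steps, and the absolute (rather than probabilistic) stalling are hypothetical and do not describe a published simulation.

```python
def extrude_loop(ctcf_sites, load_position, chrom_length):
    """Extrude one loop from `load_position` until both legs stall.

    `ctcf_sites` maps a bin index to '+' (forward motif) or '-' (reverse
    motif). Toy assumption: the leg moving left stalls only at '+' sites
    and the leg moving right stalls only at '-' sites, so the resulting
    anchors are convergent. Legs also stop at the chromosome ends.
    Requires 0 <= load_position < chrom_length.
    """
    left = right = load_position
    left_stalled = right_stalled = False
    while not (left_stalled and right_stalled):
        if not left_stalled:
            if ctcf_sites.get(left) == '+' or left == 0:
                left_stalled = True
            else:
                left -= 1
        if not right_stalled:
            if ctcf_sites.get(right) == '-' or right == chrom_length - 1:
                right_stalled = True
            else:
                right += 1
    return left, right

# Hypothetical arrangement of CTCF sites on a 50-bin chromosome.
sites = {10: '+', 20: '-', 30: '-'}

loop_a = extrude_loop(sites, load_position=15, chrom_length=50)
# Loaded between the '+' site at 10 and the '-' site at 20.

loop_b = extrude_loop(sites, load_position=25, chrom_length=50)
# The left leg passes the '-' site at 20 because it faces the wrong way.
```

Where cohesin loads therefore determines which convergent pair of sites becomes the loop anchors, which is one way in which changes in CTCF occupancy could alter interactions between compartmental domains.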

Overview of current research interests on 3D organization

We are exploring the causal relationships between chromatin state, i.e., the transcription-related proteins present in the chromatin independent of whether the genes are being transcribed, and the formation of compartmental domains in mammals. We are also asking whether compartmental domain-based organization can be predicted from the one-dimensional distribution of chromatin-associated proteins using machine learning approaches. In vertebrates, this organization overlaps with a CTCF/cohesin-mediated organization that depends on cohesin extrusion and the presence of CTCF at specific sites in the genome. A second goal is to decipher the logic of CTCF occupancy in the genome and the principles that determine why cohesin extrusion stops at certain CTCF sites in the genome but not others. We are currently applying these concepts to study how chromatin 3D organization is controlled during cell differentiation in the context of normal development, by analyzing changes in one-dimensional epigenetic information and 3D organization during the differentiation of human embryonic stem cells into definitive endoderm (DE), primitive gut tube-like (PGT) cells, pancreatic progenitors (PP), and islet cells. In addition to their relevance to understanding basic mechanisms of transcription, the results may help explain why SNPs associated with type 2 diabetes (T2D) in non-coding regions of the genome presumed to be enhancers lead to beta-cell malfunction. Results from these experiments should provide critical answers to long-standing questions about the relationship between chromatin architecture and gene expression in the context of a disease-relevant system. These results will fill an important gap in our knowledge of fundamental principles of basic biology, with applications to many areas of biomedicine.