Regulatory grammar in human promoters uncovered by MPRA-based deep learning - Nature
Promoters act as the primary control switches for genes, dictating the timing and level of gene transcription. A persistent challenge in genomics is building computational models that predict a gene's expression level from its DNA sequence alone. In this study, we introduce the Promoter Activity Regulatory Model (PARM), a cell-type-specific deep learning model. PARM was trained on data from a Massively Parallel Reporter Assay (MPRA). The platform is efficient enough to build accurate predictive models for a diverse array of cell types and environmental conditions. We used PARM to identify functional transcription factor binding sites within promoters, track how these regulatory connections shift upon cellular stimulation, and uncover the rules governing interactions between binding sites. This methodology offers a powerful and cost-effective strategy for deciphering the dynamic regulation of human genes.
Promoters are specific DNA regions that govern gene transcription. They consist of a transcription start site flanked by surrounding sequences containing short motifs where transcription factors bind. Constructing computational models that can predict a promoter's activity solely from its DNA sequence remains a significant difficulty. While deep learning demonstrates considerable promise, it typically necessitates enormous training datasets, often compiled from hundreds of genome-wide maps of epigenetic features across numerous cell types. Training such models demands substantial computing power. Furthermore, epigenetic data serve as indirect measurements of activity and can be influenced by confounding patterns, complicating the establishment of a direct causal link between a specific sequence and its biological function.
Massively Parallel Reporter Assays (MPRAs) provide a more direct alternative. In these experiments, millions of short DNA fragments are tested individually for their capacity to drive gene expression within a specific cell type. Because each fragment is isolated, any measured activity can be unambiguously assigned to that specific DNA sequence. Early research has demonstrated that combining MPRA data with deep learning can generate powerful predictive models for regulatory elements in other organisms. However, applying this approach to human promoters across many cell types has remained a formidable challenge.
Here, we present PARM, a platform that integrates an optimized MPRA with deep learning to efficiently build predictive models for human promoters. This system is both experimentally and computationally lightweight, enabling the creation of models for ten distinct human cell types and cells exposed to various stimuli.
We initially constructed a sequence-based deep learning model trained on MPRA data that measured the autonomous activity of millions of genomic DNA fragments in human K562 and HepG2 cells. Crucially, each genomic position was covered by an average of 240 overlapping fragments. We trained a convolutional neural network (CNN) architecture, extensively optimizing its design without relying on any prior biological knowledge beyond the MPRA data.
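To make the idea of a sequence-based CNN concrete, the sketch below shows how a DNA fragment can be one-hot encoded and scanned by a single convolutional filter. It is only an illustration: the A, C, G, T channel order, the function names `one_hot` and `conv_scan`, and the hand-written `ggc_kernel` are assumptions for this example, not PARM's actual architecture, whose filters are learned from the MPRA data.

```python
BASES = "ACGT"  # assumed channel order; the source does not specify one

def one_hot(seq):
    """Encode a DNA string as one 4-element row per base (unknown bases become all zeros)."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]

def conv_scan(encoded, kernel):
    """Slide one filter (width x 4 weights) along the sequence, valid positions only."""
    width = len(kernel)
    scores = []
    for start in range(len(encoded) - width + 1):
        total = 0.0
        for offset in range(width):
            for channel in range(4):
                total += encoded[start + offset][channel] * kernel[offset][channel]
        scores.append(total)
    return scores

# Toy filter that rewards the 3-mer "GGC"; a trained network learns such weights.
ggc_kernel = [
    [0.0, 0.0, 1.0, 0.0],  # G
    [0.0, 0.0, 1.0, 0.0],  # G
    [0.0, 1.0, 0.0, 0.0],  # C
]

scores = conv_scan(one_hot("ATGGCA"), ggc_kernel)
print(scores)  # one score per window; the "GGC" window scores highest
```

A real model stacks many such filters with nonlinearities and pooling, so that later layers can combine motif matches into a single activity prediction.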
The model demonstrated remarkable accuracy. For a set of 5,204 promoters withheld from the training phase, PARM's predictions correlated strongly with measured activities (Pearson's R = 0.92 for K562). It also accurately predicted the activity of individual fragments within promoters and reliably estimated the activity of promoters integrated into the genome, as verified by a different MPRA method. PARM enabled us to perform in silico saturation mutagenesis (ISM), a process predicting the effect of every possible single-nucleotide change within a promoter. When applied to the promoter of the TERT oncogene, PARM correctly predicted that two cancer-associated mutations increased activity. It also identified other crucial positions. Compared to other state-of-the-art models trained on vast epigenomic datasets, PARM performed similarly to one and was slightly outperformed by another, more computationally demanding model. However, PARM achieves this performance with far fewer parameters, rendering it significantly more efficient.
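The ISM procedure described above can be sketched in a few lines. This is a minimal illustration, assuming a model exposed as a function from sequence to predicted activity. Here `toy_predict` is a hypothetical stand-in that simply counts GGAA occurrences, and `in_silico_mutagenesis` is an illustrative name rather than part of any PARM interface.

```python
BASES = "ACGT"

def toy_predict(seq):
    """Hypothetical stand-in for a trained model: counts 'GGAA' occurrences."""
    return float(sum(seq[i:i + 4] == "GGAA" for i in range(len(seq) - 3)))

def in_silico_mutagenesis(seq, predict):
    """Return {(position, alt_base): predicted change versus the reference}."""
    reference = predict(seq)
    effects = {}
    for pos, ref_base in enumerate(seq):
        for alt in BASES:
            if alt == ref_base:
                continue  # only true substitutions are scored
            mutant = seq[:pos] + alt + seq[pos + 1:]
            effects[(pos, alt)] = predict(mutant) - reference
    return effects

effects = in_silico_mutagenesis("TTGGAATT", toy_predict)

# Positions where every substitution lowers the prediction mark a putative site.
sensitive = sorted({pos for (pos, _), delta in effects.items() if delta < 0})
print(sensitive)
```

Each sequence of length n yields 3n mutants, so the cost of ISM grows linearly with promoter length and with the number of promoters analyzed, which is why a small, fast model matters for genome-scale scans.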
To further validate PARM's understanding, we investigated whether it could design completely new, synthetic human promoters. Utilizing a genetic algorithm that iteratively mutates sequences and selects those PARM predicts will be strongest, we generated diverse, highly active synthetic sequences from random starting points. We experimentally tested 455 synthetic sequences alongside 42 natural promoter fragments. The measured activities of both natural and synthetic sequences correlated strongly with PARM's predictions. The strongest synthetic promoters were as active as the strongest natural ones. Importantly, mutating the 12 to 18 nucleotides that PARM identified as most critical in the top synthetic promoters caused, on average, a more than threefold drop in activity, confirming the model's precision.
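The mutate-and-select loop described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: `toy_predict` is a hypothetical scoring function (GC fraction) standing in for PARM, and the population size, number of generations, and the choice to keep the top five sequences each round are arbitrary values picked for the example.

```python
import random

BASES = "ACGT"

def toy_predict(seq):
    """Hypothetical stand-in for a trained model: fraction of G/C bases."""
    return sum(base in "GC" for base in seq) / len(seq)

def mutate(seq, rng):
    """Replace one randomly chosen position with a different base."""
    pos = rng.randrange(len(seq))
    alt = rng.choice([b for b in BASES if b != seq[pos]])
    return seq[:pos] + alt + seq[pos + 1:]

def evolve(predict, length=30, population=20, generations=40, keep=5, seed=0):
    """Evolve random sequences toward higher predicted activity."""
    rng = random.Random(seed)  # seeded for reproducibility
    pool = ["".join(rng.choice(BASES) for _ in range(length))
            for _ in range(population)]
    start_best = max(predict(s) for s in pool)
    for _ in range(generations):
        # Keep the top-scoring sequences and refill the pool with their mutants.
        parents = sorted(pool, key=predict, reverse=True)[:keep]
        children = [mutate(rng.choice(parents), rng)
                    for _ in range(population - keep)]
        pool = parents + children
    best = max(pool, key=predict)
    return best, start_best

best_seq, start_best = evolve(toy_predict)
print(best_seq, toy_predict(best_seq), start_best)
```

Because the best parents are carried over unchanged, the top predicted score can never decrease between generations. Running the loop from several random seeds is one simple way to obtain diverse solutions rather than a single optimum.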
These synthetic sequences showed no significant similarity to any human genomic sequence, which argues that PARM is not simply copying existing DNA. The active synthetic promoters contained motifs for known activator transcription factors in K562 cells, such as FOS–JUN and ETS, but each combined these motifs in a unique arrangement. This indicates that PARM has learned biologically relevant rules regarding how transcription factor binding sites work together to drive expression.
PARM's efficiency allowed us to apply ISM to analyze over 30,000 human promoters. For K562 cells, this analysis identified patches of nucleotides predicted to strongly enhance or reduce activity. We matched these patterns to known transcription factor binding motifs, creating maps of predicted functional regulatory sites (RSs), not just potential binding sites. Most motifs linked to RSs corresponded to transcription factors expressed in K562 cells. After accounting for highly similar motifs from related factors, nearly all detected motifs could be linked to an expressed protein. In total, 20,543 promoters had at least one predicted RS; the remaining promoters with no RSs were generally less active, as expected.
We also searched for RSs that did not match any known motif. Out of tens of thousands of sites, we found 1,402 such sequences. Clustering these revealed ten groups with similar sequences. For one cluster with a TCTCTATGGT consensus, we used biochemical methods to identify ZNF48 as the likely binding factor, a prediction we confirmed with additional experiments. Another cluster's motif was recently linked to ZFP91. This demonstrates that PARM can identify rare, functional regulatory interactions that are not yet fully annotated.
Standard genome-wide MPRAs using high-complexity libraries are technically challenging and require vast numbers of cells, limiting scalability. Since PARM was trained only on fragments overlapping promoters, we reasoned that a library enriched for just those fragments would suffice. We developed a capture-based strategy to create MPRA libraries where 90% of fragments originated from promoter regions.
This focused library contained millions of unique fragments, providing high coverage of transcription start sites with far fewer total sequences. It required up to 240-fold fewer cells per experiment. When applied to K562 and HepG2 cells, the predictive power of PARM models trained on this focused data was comparable to models trained on genome-wide data. This advancement allowed us to efficiently generate promoter-focused MPRA datasets and PARM models for seven additional human cell lines and a patient-derived colon cancer organoid culture. Each experiment used only about 10 million cells and produced highly reproducible data and accurate models. Training a PARM model for one cell line took about a day on a single high-end graphics card. This promoter-focused design represents a highly economical and versatile experimental and computational strategy.
To rigorously test the cell-type-specific predictions of PARM models, we created a synthetic MPRA library containing systematic mutations across ten human promoters. We measured the effects of these mutations in multiple cell lines and compared them to predictions from PARM and other models. For example, analyzing the CXCR4 promoter in MCF7 breast cancer cells showed that PARM's predictions of mutation effects closely matched the experimental measurements. Across seven promoters tested in seven cell lines, PARM consistently showed high correlation with the measured data, often performing as well as or better than the much larger models trained on epigenomic data. PARM also demonstrated high recall and precision in identifying true regulatory sites compared to simply scanning sequences for motif matches.
With efficient models for many cell types in hand, we investigated the rules, or "grammar," of promoter regulation. A key question is whether transcription factors have positional preferences relative to the transcription start site and whether these differ for activators versus repressors. Analyzing PARM's predictions across promoters in K562 cells, we found clear positional biases. For instance, binding sites for the activator SP1 were most effective around 50 to 100 base pairs upstream of the start site. In contrast, sites for the repressor REST showed a strong preference for positions directly over the start site itself. Experimental validation using MPRAs with designed promoter variants confirmed these positional rules. This demonstrates that a transcription factor's function is not determined by its binding motif alone but also by where it binds relative to the start site.
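One way to probe positional preference with a sequence-to-activity model is to insert the same motif at every offset of a fixed background and record the predicted gain. The sketch below illustrates that idea only. `toy_predict` is a hypothetical model hard-coded to favor a GGGCGG site 50 bp upstream of an assumed start-site index, and the background sequence and all names are assumptions made for the example.

```python
def toy_predict(seq, tss=60):
    """Hypothetical stand-in model: a 'GGGCGG' site counts more the closer
    it sits to 50 bp upstream of an assumed TSS index."""
    score = 0.0
    for i in range(len(seq) - 5):
        if seq[i:i + 6] == "GGGCGG":
            distance = abs((tss - i) - 50)
            score += max(0.0, 1.0 - distance / 50.0)
    return score

def positional_scan(background, motif, predict):
    """Insert the motif at every offset and record the predicted gain."""
    baseline = predict(background)
    gains = []
    for start in range(len(background) - len(motif) + 1):
        variant = background[:start] + motif + background[start + len(motif):]
        gains.append(predict(variant) - baseline)
    return gains

background = "AT" * 40  # 80 bp motif-free background (illustrative)
gains = positional_scan(background, "GGGCGG", toy_predict)
best_offset = max(range(len(gains)), key=gains.__getitem__)
print(best_offset, gains[best_offset])
```

With a trained model, averaging such scans over many different backgrounds would help separate a factor's positional preference from the context of any single promoter.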
We also explored the complex interactions between multiple transcription factor binding sites, known as motif-motif grammar. PARM's analysis revealed that certain motif pairs show strong synergy or antagonism depending on their spacing and orientation. For example, pairs of binding sites for certain activator families showed highly cooperative effects when placed at specific distances apart, greatly enhancing promoter activity beyond the sum of their individual effects. These interaction rules were often conserved across promoters and contributed significantly to predicting a promoter's overall strength.
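The notion of an effect "beyond the sum of individual effects" can be made explicit as a synergy score: the predicted effect of the pair minus the summed effects of each site alone. The sketch below computes it against a hypothetical `toy_predict` that adds a bonus when two sites are within 10 bp of each other. The motifs, spacing threshold, and function names are assumptions for illustration, and orientation is not modeled.

```python
def toy_predict(seq):
    """Hypothetical stand-in model: each site adds 1.0, and a 'TGACTCA' site
    followed within 10 bp by a 'GGAA' site adds a cooperative bonus of 0.5."""
    a = seq.find("TGACTCA")
    b = seq.find("GGAA")
    score = 0.0
    if a >= 0:
        score += 1.0
    if b >= 0:
        score += 1.0
    if a >= 0 and b >= 0 and 0 < b - (a + 7) <= 10:
        score += 0.5
    return score

def place(background, motif, start):
    """Overwrite part of the background with a motif at the given offset."""
    return background[:start] + motif + background[start + len(motif):]

def synergy(background, motif_a, pos_a, motif_b, pos_b, predict):
    """Pair effect minus the sum of single-site effects (additive null model)."""
    base = predict(background)
    only_a = predict(place(background, motif_a, pos_a)) - base
    only_b = predict(place(background, motif_b, pos_b)) - base
    with_both = place(place(background, motif_a, pos_a), motif_b, pos_b)
    both = predict(with_both) - base
    return both - (only_a + only_b)

background = "AT" * 30  # 60 bp motif-free background (illustrative)
near = synergy(background, "TGACTCA", 10, "GGAA", 22, toy_predict)
far = synergy(background, "TGACTCA", 10, "GGAA", 40, toy_predict)
print(near, far)
```

A positive score indicates cooperation, a negative score antagonism, and a score near zero independent action. Repeating the calculation over a range of spacings gives a distance profile for a motif pair.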
Finally, we used PARM to study how promoter regulation changes dynamically in response to cellular signals. We generated promoter-focused MPRA data and PARM models for K562 cells before and after stimulation with interferon-gamma, a key immune signaling molecule. Comparison of the models revealed widespread "rewiring" of regulatory interactions. Many transcription factor binding sites changed their predicted impact on promoter activity after stimulation. Some sites switched from being neutral or repressive to becoming strong activators, and vice versa. This rewiring involved well-known interferon-responsive factors as well as other factors not classically associated with this pathway. These computational predictions were validated by tracking changes in the activity of mutated promoter variants after interferon-gamma treatment, confirming that PARM could accurately capture the dynamic reassignment of functional roles to specific DNA sequences.
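Comparing two condition-specific models site by site can be sketched as below. This is an illustrative outline only. The two predictors are hypothetical toy functions, the motifs are arbitrary placeholders rather than known interferon-responsive elements, and masking a site with N bases is just one assumed way to knock it out in silico.

```python
def predict_unstimulated(seq):
    """Hypothetical baseline-condition model: only 'GGAA' sites contribute."""
    return 1.0 * seq.count("GGAA")

def predict_stimulated(seq):
    """Hypothetical stimulated-condition model: a second site type becomes active."""
    return 1.0 * seq.count("GGAA") + 2.0 * seq.count("TTCACTGAA")

def site_effect(seq, start, length, predict):
    """Predicted activity lost when a site is masked with 'N' bases."""
    knocked_out = seq[:start] + "N" * length + seq[start + length:]
    return predict(seq) - predict(knocked_out)

def rewiring(seq, sites, predict_before, predict_after):
    """Per-site change in effect between two condition-specific models."""
    return {
        name: site_effect(seq, start, length, predict_after)
        - site_effect(seq, start, length, predict_before)
        for name, (start, length) in sites.items()
    }

promoter = "ACGT" + "GGAA" + "ACGTAC" + "TTCACTGAA" + "ACGT"
sites = {"site_1": (4, 4), "site_2": (14, 9)}  # (start, length), illustrative
delta = rewiring(promoter, sites, predict_unstimulated, predict_stimulated)
print(delta)
```

In this toy case the first site contributes equally in both conditions, while the second gains an activating effect only after stimulation, which is the kind of change the text calls rewiring.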
This ability to model condition-specific regulation from sequence alone provides a powerful framework for understanding how cells interpret signals through changes in their regulatory DNA landscape.
The PARM platform demonstrates that coupling focused, economical MPRA experiments with efficient deep learning can produce highly accurate, cell-type-specific models of promoter activity. These models move beyond correlation to provide causal insights, enabling the identification of functional transcription factor binding sites, the design of synthetic promoters, and the discovery of fundamental regulatory grammar rules. Most importantly, this approach scales efficiently, allowing for the systematic exploration of gene regulation across many cell types and conditions. By making the modeling of promoter logic more accessible, PARM opens a path toward a deeper, more dynamic understanding of the sequence basis of human gene expression.