Computational Biology Final Project

Unnamed Gene Expression Visualizer [WIP]

My graduate independent study so far

Reading Time: 6 Minutes

Understanding how organisms develop, function, and behave fundamentally begins with the genome. The genome encodes the biological instructions that govern everything from embryonic development to physiological function and, in some cases, susceptibility to disease. It reflects both our evolutionary history and the genetic material we pass on to future generations. As such, the genome represents the most information-dense blueprint of biological identity.

Central Dogma of Molecular Biology

At a molecular level, genetic information is stored in DNA, a double-stranded helical structure composed of nucleotide base pairs. During gene expression, DNA is transcribed into a single-stranded molecule known as RNA. This RNA is then read in groups of three nucleotides, called codons, each of which corresponds to one of twenty amino acids. These amino acids are sequentially linked to form proteins, which serve as the primary functional units within biological systems. A significant proportion of observable biological structures such as hair, muscle, and skin are composed largely of proteins. Moreover, proteins act as enzymes that catalyze the synthesis of carbohydrates, fats, and other complex molecules, and they regulate the storage and release of these compounds within the body. In short, genes encode proteins, and proteins largely determine biological form and function.
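The codon-to-amino-acid step described above can be illustrated with a toy translation function. The codon table here is deliberately truncated to a handful of the 64 codons, and the three-letter amino acid abbreviations are used for readability:

```python
# Minimal sketch of translation: reading an mRNA sequence in codons
# (triplets) and mapping each to an amino acid. A real codon table
# has 64 entries; only a few are included here for illustration.
CODON_TABLE = {
    "AUG": "Met",  # start codon
    "UUU": "Phe",
    "GGC": "Gly",
    "UAA": "STOP",
}

def translate(mrna):
    """Translate an mRNA string into amino acids, halting at a stop codon."""
    peptide = []
    for i in range(0, len(mrna) - 2, 3):
        aa = CODON_TABLE.get(mrna[i:i + 3], "?")
        if aa == "STOP":
            break
        peptide.append(aa)
    return peptide

print(translate("AUGUUUGGCUAA"))  # ['Met', 'Phe', 'Gly']
```

In a cell, of course, translation is carried out by the ribosome with tRNAs doing the codon matching; the dictionary lookup is only a conceptual stand-in.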

Because genes play a central role in maintaining health and biological stability, disruptions in the genome can substantially increase the likelihood of disease. Many genetic disorders arise not from single mutations, but from complex interactions among multiple genetic variants. Genome-Wide Association Studies (GWAS) aim to identify statistical associations between specific genetic variations, most commonly single nucleotide polymorphisms (SNPs), and particular traits or disorders. While GWAS have successfully mapped thousands of SNPs to diseases, the biological interpretation of these results remains challenging, especially since most disorders are influenced by combinations of variants rather than isolated SNPs.

While genes often influence our likelihood of developing specific diseases, it can be difficult to attribute specific genes to specific diseases. This is especially true when groups of genes (referred to as gene modules) are jointly responsible for a disease. As a result, network representations of gene co-expression become a meaningful tool for understanding the roles of gene modules in disease. Many methodologies have been published for finding important gene modules. Here, we will explore whether we can use those methodologies to let users identify gene modules and find targets for their own research.

In the last decade, the GTEx consortium has compiled gene expression values by sequencing RNA (via bulk RNA-seq) for a large selection of tissue samples. To quantify how actively any given gene is being expressed in a tissue, GTEx reports expression levels in units of transcripts per million (TPM). TPM is calculated by first counting the number of RNA sequencing reads that map to a given gene, normalizing for the length of that gene (since longer genes naturally capture more reads), and then scaling the result so that all values within a sample sum to one million. This normalization strategy makes TPM values directly comparable across both genes and samples: a gene with a TPM of 100 in liver tissue and a TPM of 1 in brain tissue is being transcribed at substantially higher levels in the liver, indicating that it is more actively expressed there. A gene with near-zero TPM across all tissues is considered transcriptionally silent or lowly expressed, while genes with high TPM values in specific tissues are considered tissue-enriched or tissue-specific markers. Critically, because TPM normalizes within each sample, it does not reflect the absolute number of RNA molecules but rather the relative proportion of transcriptional activity devoted to a given gene in that tissue at the time of sampling.
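The two-step TPM calculation described above (length normalization, then per-sample scaling) can be sketched in a few lines. The counts and gene lengths below are hypothetical, not GTEx values:

```python
def tpm(counts, lengths_bp):
    """Compute transcripts per million (TPM) for one sample."""
    # Step 1: normalize each gene's read count by its length in
    # kilobases (reads per kilobase), since longer genes naturally
    # capture more reads.
    rpk = [c / (length / 1000) for c, length in zip(counts, lengths_bp)]
    # Step 2: rescale so all values within the sample sum to one million.
    scale = sum(rpk) / 1_000_000
    return [r / scale for r in rpk]

# Hypothetical 3-gene sample: the first gene has the most raw reads,
# but after length normalization it is not the most highly expressed.
values = tpm(counts=[500, 300, 200], lengths_bp=[5000, 1000, 2000])
print([round(v) for v in values])  # [200000, 600000, 200000]
print(round(sum(values)))          # 1000000
```

The fixed one-million total is exactly why TPM values are proportions of transcriptional activity rather than absolute molecule counts.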

The GTEx v8 dataset, as described in the consortium's landmark 2020 publication in Science, encompasses 15,201 RNA-sequencing samples from 49 tissues of 838 postmortem donors, with each tissue's mRNA sequenced to a median depth of 82.6 million reads; gene expression and splicing were quantified from bulk RNA-seq of heterogeneous tissue samples. By aggregating these TPM-level expression profiles across dozens of tissues and hundreds of individuals, GTEx provides a comprehensive reference atlas of where, and how much, each gene is expressed across the human body. This resource is particularly powerful for interpreting GWAS findings: when a SNP is associated with a disease, researchers can consult GTEx to determine whether that SNP coincides with a region that regulates gene expression in a disease-relevant tissue; such variants are known as expression quantitative trait loci (eQTLs). The GTEx project was established to characterize genetic effects on the transcriptome across human tissues and to link these regulatory mechanisms to trait and disease associations, and the v8 analysis comprehensively characterizes genetic associations for gene expression and splicing in cis and trans, showing that regulatory associations are found for almost all genes. In doing so, GTEx bridges the gap between statistical genetic associations and the underlying molecular biology, enabling researchers to move from a list of disease-associated SNPs toward a mechanistic understanding of how those variants alter gene activity and, ultimately, biological function.

With a dataset of this scale and resolution, the challenge shifts from data availability to interpretation: how do we surface meaningful gene relationships from hundreds of thousands of pairwise expression signals? That question motivates the network construction approach described below.

Representing The Graph

To create a co-expression network, we must begin by understanding how genes co-express. We can calculate pairwise correlation coefficients over all tissue types and use those signals to decide whether a link should exist. Networks built directly from these raw correlation signals are often noisy and contain too much information, and existing techniques have their own drawbacks. For example, Weighted Gene Co-expression Network Analysis (WGCNA) enforces connectivity constraints so that the network exhibits a power-law degree distribution, while unweighted approaches yield noisy networks with many false positives. To address some of these issues, the authors of the MEGENA paper adopted a network embedding paradigm on a topological sphere. The Planar Maximally Filtered Graph (PMFG) was developed to extract the most relevant information from similarity matrices via embedding on a topological sphere, and it has been applied mostly in financial domains. PMFG is an ideal platform for constructing co-expression networks thanks to the following attractive features: i) the preservation of hierarchy by retaining the Minimum Spanning Tree (MST) as a subgraph, ii) the correspondence between a coherent cluster (if any) and a connected subnetwork, and iii) the abundance of 3- and 4-cliques, which gives rise to rich clustering structures.
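The core PMFG loop is greedy: try edges in decreasing similarity order, keep each one only if the graph stays planar, and stop once the planar edge budget of 3(n − 2) is exhausted. A simplified sketch using networkx's planarity test (the function name and data layout are illustrative, not MEGENA's actual implementation, which uses a much faster incremental planarity check):

```python
import networkx as nx
from itertools import combinations

def pmfg(corr_edges, n_nodes):
    """Sketch of Planar Maximally Filtered Graph construction.

    corr_edges: iterable of (similarity, u, v) tuples, e.g. pairwise
    correlation coefficients. Edges are tried in decreasing similarity
    order; a planar graph holds at most 3 * (n - 2) edges.
    """
    g = nx.Graph()
    g.add_nodes_from(range(n_nodes))
    for w, u, v in sorted(corr_edges, reverse=True):
        g.add_edge(u, v, weight=w)
        is_planar, _ = nx.check_planarity(g)
        if not is_planar:
            g.remove_edge(u, v)  # this edge would break planarity
        if g.number_of_edges() == 3 * (n_nodes - 2):
            break  # the planar graph is maximally filled
    return g

# Toy example: all 10 edges of a 5-node complete graph. K5 itself is
# non-planar, so PMFG keeps 3 * (5 - 2) = 9 of them.
edges = [(1.0 - 0.01 * i, u, v)
         for i, (u, v) in enumerate(combinations(range(5), 2))]
print(pmfg(edges, 5).number_of_edges())  # 9
```

Note how the MST-like hierarchy emerges for free: the strongest links are always admitted first, so the densest correlations form the network's backbone.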

While PMFG serves as an effective tool, the enforcement of connectivity and edge creation based on correlation criteria will still lead to false positives. To deal with this, the authors created an analysis framework named Multiscale Embedded Gene Co-expression Network Analysis (MEGENA).

Fast Planar Filtered Network construction (FPFNC): rather than running the traditional PMFG algorithm directly, the authors introduce parallelization, early termination, and prior quality control to efficiently generate a PMFG over all tissue types.

Network Layout

The Fruchterman-Reingold algorithm produces aesthetically pleasing, two-dimensional visualizations of graphs by performing simplified simulations of physical systems. It presents a modification of the spring-embedder model originally proposed by Eades, designed for drawing undirected graphs with straight edges, and develops its heuristic in analogy to forces in natural systems. The physical intuition is as follows: vertices are treated as steel rings and edges as springs connecting them, forming a mechanical system. The vertices are placed in some initial layout and then released, so that the spring forces move the system toward a minimal energy configuration. Crucially, the algorithm makes only neighboring vertices attract each other, while all vertices repel each other regardless of connectivity. This asymmetry ensures that connected nodes cluster together while unconnected nodes spread apart, producing layouts that intuitively reflect the community structure of the graph. Each iteration of the algorithm proceeds in three steps: computing attractive forces between adjacent nodes, computing repulsive forces between all pairs of nodes, and then limiting total displacement by a global temperature parameter. This temperature controls the step width of node movements. When the temperature is high, nodes move faster and over larger distances; the temperature then cools at each iteration until the nodes stop moving and the system reaches its equilibrium state. This simulated annealing-like schedule allows the layout to settle into a stable configuration without becoming trapped in poor local minima early in the optimization. Vertices are typically initialized at random positions within a bounding frame, and the algorithm iterates until convergence. The result is a layout in which edge lengths tend toward uniformity and nodes are distributed with roughly equal spacing (all aesthetic properties that make the graph easier to interpret visually). 
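The three per-iteration steps above can be sketched as a minimal, unoptimized implementation. The constants (ideal edge length k, initial temperature, cooling rate) are illustrative choices, not the paper's exact values:

```python
import math
import random

def fruchterman_reingold(nodes, edges, width=1.0, iters=50, seed=0):
    """Minimal sketch of the Fruchterman-Reingold layout.

    Attraction acts only along edges, repulsion acts between all node
    pairs, and a cooling temperature caps per-iteration displacement.
    """
    rng = random.Random(seed)
    pos = {v: [rng.random() * width, rng.random() * width] for v in nodes}
    k = width / math.sqrt(len(nodes))  # ideal pairwise distance
    temp = width / 10                  # initial temperature
    for _ in range(iters):
        disp = {v: [0.0, 0.0] for v in nodes}
        # Step 1: repulsive forces between every pair of vertices.
        for i, u in enumerate(nodes):
            for v in nodes[i + 1:]:
                dx, dy = pos[u][0] - pos[v][0], pos[u][1] - pos[v][1]
                d = max(math.hypot(dx, dy), 1e-9)
                f = k * k / d
                disp[u][0] += dx / d * f; disp[u][1] += dy / d * f
                disp[v][0] -= dx / d * f; disp[v][1] -= dy / d * f
        # Step 2: attractive forces only between adjacent vertices.
        for u, v in edges:
            dx, dy = pos[u][0] - pos[v][0], pos[u][1] - pos[v][1]
            d = max(math.hypot(dx, dy), 1e-9)
            f = d * d / k
            disp[u][0] -= dx / d * f; disp[u][1] -= dy / d * f
            disp[v][0] += dx / d * f; disp[v][1] += dy / d * f
        # Step 3: limit movement by the temperature, then cool.
        for v in nodes:
            d = max(math.hypot(*disp[v]), 1e-9)
            step = min(d, temp)
            pos[v][0] += disp[v][0] / d * step
            pos[v][1] += disp[v][1] / d * step
        temp *= 0.95  # annealing-like cooling schedule
    return pos
```

In practice we would not hand-roll this: networkx's spring_layout implements the same force-directed scheme (with better performance) and exposes the iteration count and ideal distance k as parameters.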
While Fruchterman-Reingold produces layouts by simulating physical forces, an alternative family of approaches derives node positions from learned low-dimensional embeddings of graph structure. The most influential of these is node2vec, introduced by Grover and Leskovec in 2016. node2vec is an algorithmic framework for learning continuous feature representations for nodes in networks. It learns a mapping of nodes to a low-dimensional space of features that maximizes the likelihood of preserving network neighborhoods of nodes, defines a flexible notion of a node's network neighborhood, and designs a biased random walk procedure that efficiently explores diverse neighborhoods. The key innovation of node2vec over prior work lies in this flexibility of neighborhood definition. The algorithm accommodates various definitions of network neighborhoods by simulating biased random walks, and specifically provides a way of balancing the exploration-exploitation tradeoff that leads to representations obeying a spectrum of equivalences from homophily to structural equivalence.

The Search Bias Parameters

Return parameter (p): Controls the likelihood of immediately revisiting the previous node in the walk. High values make just-visited nodes less likely to be sampled again, pushing the walk to explore further afield.

In-out parameter (q): Governs the search strategy. High values keep the walk close to its starting neighborhood, approximating a Breadth-First Search (BFS) that captures structural equivalence, while low values push the walk outward, approximating a Depth-First Search (DFS) that captures community homophily.

By tuning these two parameters, node2vec can interpolate between capturing nodes that belong to the same community and nodes that occupy analogous structural positions in different parts of the graph. Once random walks are generated for each node, node2vec optimizes a custom graph-based objective function using stochastic gradient descent, motivated by prior work in natural language processing, treating sequences of nodes in random walks analogously to sequences of words in sentences. The result is a vector embedding for each node in the graph, where nodes with similar neighborhoods occupy nearby positions in the embedding space.
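A single biased walk can be sketched as follows. The adjacency representation, parameter defaults, and function name are illustrative; real node2vec implementations also precompute alias tables so each step is O(1):

```python
import random

def biased_walk(adj, start, length, p=1.0, q=1.0, seed=0):
    """Sketch of one node2vec-style second-order random walk.

    adj maps each node to a set of neighbors. The unnormalized weight
    for stepping back to the previous node is 1/p, for stepping to a
    neighbor of the previous node is 1, and for stepping to a more
    distant node is 1/q.
    """
    rng = random.Random(seed)
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        neighbors = sorted(adj[cur])
        if not neighbors:
            break
        if len(walk) == 1:
            walk.append(rng.choice(neighbors))  # first step is unbiased
            continue
        prev = walk[-2]
        weights = []
        for nxt in neighbors:
            if nxt == prev:
                weights.append(1 / p)   # return to the previous node
            elif nxt in adj[prev]:
                weights.append(1.0)     # stay in the local neighborhood
            else:
                weights.append(1 / q)   # move outward, DFS-like
        walk.append(rng.choices(neighbors, weights=weights)[0])
    return walk

# Toy path graph 0-1-2-3; low p discourages backtracking.
print(biased_walk({0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}, start=0, length=6, p=4.0))
```

In the full algorithm, many such walks are generated per node and fed to a skip-gram model (as in word2vec), yielding the final embedding vectors.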

Graph Coarsening

Now that we have our graph layout, we need to address the sheer size of the network, on the order of 10^4 unique nodes. One way to do this and produce a meaningful visual is to find highly connected subgraphs and render each as a single node placed at the average x and y of its community's members. Our visual will dynamically coarsen and resolve based on the user's zoom level, so we need a highly parameterized graph community detection algorithm. The objective function the algorithm seeks to maximize is modularity, a scalar quantity that measures the quality of a given partition.
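Given a layout and a partition, the coarsening step itself is simple: collapse each community into one supernode at the centroid of its members. A minimal sketch (the data shapes and function name are illustrative):

```python
def coarsen(positions, communities):
    """Collapse each community into a supernode at its centroid.

    positions: node -> (x, y) from the layout step.
    communities: list of node sets from community detection.
    Returns supernode id -> (x, y, member_count); the count can drive
    the rendered size of the supernode.
    """
    supernodes = {}
    for cid, members in enumerate(communities):
        xs = [positions[v][0] for v in members]
        ys = [positions[v][1] for v in members]
        supernodes[cid] = (sum(xs) / len(xs), sum(ys) / len(ys), len(members))
    return supernodes

# Toy example: two communities, one with two members.
positions = {0: (0.0, 0.0), 1: (2.0, 0.0), 2: (4.0, 4.0)}
print(coarsen(positions, [{0, 1}, {2}]))  # {0: (1.0, 0.0, 2), 1: (4.0, 4.0, 1)}
```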

Modularity ($Q$)

A value between −1 and 1 that measures the relative density of edges inside communities with respect to edges outside communities. High modularity indicates dense internal connections and sparse external connections, suggesting a "natural" community structure.
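Formally, for an undirected graph with adjacency matrix $A$, modularity is commonly written as:

$$Q = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j)$$

where $m$ is the total edge weight, $k_i$ is the (weighted) degree of node $i$, $c_i$ is the community assigned to node $i$, and $\delta(c_i, c_j)$ is 1 when both nodes share a community and 0 otherwise. Intuitively, each term compares the observed edge weight between two same-community nodes against the weight expected under a random rewiring that preserves degrees.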

The algorithm proceeds in two alternating phases that are repeated iteratively. In the first phase, the method looks for small communities by optimizing modularity in a local way. In the second phase, it aggregates nodes of the same community and builds a new network whose nodes are the communities. These steps are repeated iteratively until a maximum of modularity is attained, producing several partitions as output. More concretely, in Phase 1 every node begins as its own singleton community. The algorithm then cycles through all nodes and, for each one, evaluates the modularity gain that would result from moving it into each of its neighboring communities. The node is assigned to the neighboring community that yields the greatest positive modularity gain; if no positive gain is achievable, the node remains in its original community. This phase continues until no individual move can improve the modularity. In Phase 2, all nodes belonging to the same community are collapsed into a single supernode. The edge weights between supernodes are set to the sum of all edge weights between the member nodes of the corresponding communities, and the algorithm then re-enters Phase 1 on this reduced graph. The partition found after the first iteration typically consists of many communities of small sizes, while at subsequent steps larger and larger communities emerge due to the aggregation mechanism, naturally leading to a hierarchical decomposition of the network. This hierarchical, multi-resolution character is precisely what makes the Louvain method well-suited to our dynamic visualization. The method seems to run in time O(n log n), with most of the computational effort concentrated in the optimization at the first level. 
For our purposes, each level of the hierarchy corresponds to a different zoom resolution: at low zoom, the graph is rendered in its most coarsened form, with large supernodes representing entire gene modules; as the user zooms in, the algorithm's intermediate partitions are progressively resolved, eventually revealing individual gene nodes at full zoom. The key parameter governing this behavior is the resolution parameter γ, which appears in the modularity gain formula. When the resolution is set to less than 1, the algorithm favors larger communities, while values greater than 1 favor smaller communities, giving us fine-grained control over the granularity of coarsening at each zoom level. By precomputing the full hierarchy of Louvain partitions across a sweep of resolution values and storing the community assignments alongside their centroid coordinates, the visualization can transition fluidly between coarse and fine-grained representations entirely client-side. It is worth noting one known limitation of the method: the resolution limit of modularity can cause multiple small communities to be grouped together into a larger one, hiding the smaller communities and making it impossible to differentiate their nodes from nodes already present in the absorbing community. For the purposes of a first-pass visualization of gene co-expression structure, this tradeoff between resolution and computational tractability is acceptable, though downstream analyses requiring more precise community boundaries may benefit from the Leiden algorithm, a refined successor that guarantees well-connected communities at each level of the hierarchy.
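The precomputation step can be sketched with networkx's Louvain implementation and its resolution parameter. The resolution values and the karate-club stand-in graph below are illustrative; the real input would be the PMFG over gene co-expression:

```python
import networkx as nx

def precompute_levels(g, resolutions=(0.5, 1.0, 2.0), seed=7):
    """Sketch of precomputing one Louvain partition per zoom level.

    Lower resolution values yield fewer, larger communities (coarser
    zoom levels); higher values yield more, smaller communities.
    """
    levels = {}
    for gamma in resolutions:
        parts = nx.community.louvain_communities(g, resolution=gamma, seed=seed)
        levels[gamma] = [sorted(c) for c in parts]
    return levels

# Stand-in for the gene co-expression graph.
g = nx.karate_club_graph()
levels = precompute_levels(g)
for gamma, parts in levels.items():
    print(f"resolution={gamma}: {len(parts)} communities")
```

Pairing each partition with the centroid coordinates from the coarsening step gives the client everything it needs to render any zoom level without recomputation.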