全面解读scVelo实用指南

最编程 2024-07-25 14:42:08

...

image.png

通过动力学建模将RNA速率推广到瞬时细胞状态

RNA速率开辟了利用scRNA测序数据研究细胞分化的新途径。RNA速率基于剪接和非剪接mRNA的比率描述了在一个给定时间点上，一个基因表达变化的速率。然而，如果共同剪接率的核心假设和对稳定状态mRNA水平的完整剪接动力学的观察被违反，速率估计就会出现误差。在这里，作者提出了scVelo方法，通过使用基于可能性的动力学模型求解剪接动力学的完整转录动力学来克服这些限制。这将RNA速率推广到具有瞬时细胞状态的系统，这种状态在发育和对扰动的反应中是常见的。

背景简介

单细胞转录学使得在单细胞分辨率下对细胞分化和谱系选择等生物过程的无偏见研究成为可能。由此产生的计算问题被称为轨迹推断。轨迹推断算法从处于发育过程不同阶段的一组细胞开始，旨在重建导致潜在细胞命运的转录变化的发育序列。轨迹推断的一个中心挑战是单细胞RNA测序(scRNA-seq)的破坏性，它只揭示细胞的某一瞬时静止状态。为了从描述性轨迹模型转向预测性轨迹模型，就需要用额外的信息来约束可能导致相同轨迹的可能动力的空间。因此，譬如血统追踪分析便可以通过修改基因添加信息，以重建血统关系。然而，这些分析并不容易建立，而且在许多系统中受到技术限制，如人类组织。

RNA速率的概念可以通过新转录出来的未剪接的前mRNA和成熟的剪接后的mRNA来恢复定向动力信息得以确定。

假设一个简单的每个基因的反应模型来将未剪接和剪接的mRNA的丰度联系起来，就可以推断出mRNA丰度的变化，称为RNA速率。正的RNA速率表明一个基因表达上调，这种情况发生在细胞中，该基因的未剪接mRNA的丰度比在稳定状态下预期的更高。相反，负的RNA速率表明一个基因的表达下调。然后，跨基因的速率组合起来就可以用来估计单个细胞的未来状态。

原始的模型评估RNA速率是在假设诱导和抑制基因表达的转录阶段持续足够长时间使其达到活跃转录和不活跃转录平衡状态的前提下。在推断处于恒定转录稳定状态的未剪接和剪接mRNA丰度的比率后，速度被确定为观察到的比率与其稳定状态比率的偏差。推测稳态比率有两个基本假设，即(1)在基因水平上，捕捉到转录诱导、抑制和稳态mRNA水平的完整剪接动态；(2)在细胞水平上，所有基因都有共同的剪接率。这些假设经常被违反，特别是当一个种群包括多个具有不同动力学的异质性亚群时。我们将这种原始的建模方法称为“稳态模型”。

scVelo，基于可能性的动力学模型，可以解决完整的基因转录动力学。从而将RNA速率预测推广到瞬时系统和具有异质性亚群动力学的系统。我们在一个有效的期望最大化(EM)框架中推导出转录、剪接和降解的基因特异性反应速度和潜在的基因共享潜伏期。推断的潜伏时间代表了细胞的内部时钟，它准确地描述了细胞在潜在生物过程中的位置。与现有的基于相似性的伪时间方法不同，这种潜在时间仅基于转录动力学，并考虑了运动的速度和方向。推断的潜伏时间能够重建转录事件和细胞命运的时间序列。与稳态模型相比，动态模型通常产生更一致的跨相邻细胞的速度估计，并准确地识别转录状态。此外，scVelo还能确定调控例如过渡状态和细胞命运转变的阶段变化的机制。scVelo能确定这些转录态变化的假定驱动基因。驱动基因表现出明显的动态行为，并通过动力模型中的高概率特征被系统地检测到。这一过程提供了一种基于动力学的标准差异表达图谱的替代。

如何在单细胞分辨率下求解完整的基因转录动力学？

image.png

scVelo提出的基于可能性的模型，由两组参数控制：
(1)转录αk(t)、剪接β和降解 γ的反应速度
(2)细胞特定的转录状态ki和时间ti的潜在变量。通过最小化细胞到当前相轨迹的距离来为每个细胞分配时间点ti。转录状态ki是通过将可能性与轨迹上的各个片段相关联来分配的--即【诱导】、【抑制】和【活跃和不活跃的稳定状态】
通过EM算法，极大似然估计，反复迭代来推断参数。

scVelo计算RNA速率——基本分析

*引自https://scvelo.readthedocs.io/VelocityBasics/

import scvelo as scv
scv.logging.print_version()

scv.logging.print_version()

scv.settings.verbosity = 3  # show errors(0), warnings(1), info(2), hints(3)
scv.settings.presenter_view = True  # set max width size for presenter view
scv.set_figure_params('scvelo')  # for beautified visualization

加载数据

scVelo包的示例数据是胰腺癌数据。
To run velocity analysis on your own data, read your file (loom, h5ad, csv …) to an AnnData object with adata = scv.read('path/file.loom', cache=True). If you want to merge your loom file into an already existing AnnData object, use scv.utils.merge(adata, adata_loom).

adata = scv.datasets.pancreas()
adata

adata

scVelo is based on adata:
an object that stores a data matrix adata.X,
annotation of observations adata.obs,
variables adata.var,
and unstructured annotations adata.uns.
Names of observations and variables can be accessed via adata.obs_names and adata.var_names, respectively.
AnnData objects can be sliced like dataframes, for example, adata_subset = adata[:, list_of_gene_names].
For more details, see the anndata docs.

scv.pl.proportions(adata,save=True)

scv.pl.proportions(adata,save=True)

数据预处理

预处理包括以下几个步骤：
gene selection by detection (with a minimum number of counts) and high variability (dispersion), normalizing every cell by its total size and logarithmizing X. Filtering and normalization is applied in the same vein to spliced/unspliced counts and X. Logarithmizing is only applied to X. If X is already preprocessed from former analysis, it will not be touched.
预处理的过程使用的函数为scv.pp.filter_and_normalize，其执行以下四个操作

#这四行代码无需单独执行，只是展示函数scv.pp.filter_and_normalize（）具体做了那些操作
scv.pp.filter_genes(adata, min_shared_counts=20)
scv.pp.normalize_per_cell(adata)
scv.pp.filter_genes_dispersion(adata, n_top_genes=2000)
scv.pp.log1p(adata)

Further, we need the first and second order moments (means and uncentered variances) computed among nearest neighbors in PCA space, summarized in scv.pp.moments, which internally computes scv.pp.pca and scv.pp.neighbors. First order is needed for deterministic velocity estimation, while stochastic estimation also requires second order moments.

#预处理过程执行这两行
scv.pp.filter_and_normalize(adata, min_shared_counts=20, n_top_genes=2000)
scv.pp.moments(adata, n_pcs=30, n_neighbors=30)

预处理

Further preprocessing (such as batch effect correction) may be used to remove unwanted sources of variability. See the best practices for further details. Note, that any additional preprocessing step only affects X and is not applied to spliced/unspliced counts.

估计RNA速率

Velocities are vectors in gene expression space and represent the direction and speed of movement of the individual cells. The velocities are obtained by modeling transcriptional dynamics of splicing kinetics, either stochastically (default) or deterministically (by setting mode='deterministic').
For each gene, a steady-state-ratio of pre-mature (unspliced) and mature (spliced) mRNA counts is fitted, which constitutes a constant transcriptional state. Velocities are then obtained as residuals from this ratio. Positive velocity indicates that a gene is up-regulated, which occurs for cells that show higher abundance of unspliced mRNA for that gene than expected in steady state. Conversely, negative velocity indicates that a gene is down-regulated.
The solution to the full dynamical model is obtained by setting mode='dynamical', which requires to run scv.tl.recover_dynamics(adata) beforehand. We will elaborate more on the dynamical model in the next tutorial.

scv.tl.velocity(adata)

scv.tl.velocity(adata)

The computed velocities are stored in adata.layers just like the count matrices.

The combination of velocities across genes can then be used to estimate the future state of an individual cell. In order to project the velocities into a lower-dimensional embedding, transition probabilities of cell-to-cell transitions are estimated. That is, for each velocity vector we find the likely cell transitions that are accordance with that direction. The transition probabilities are computed using cosine correlation between the potential cell-to-cell transitions and the velocity vector, and are stored in a matrix denoted as velocity graph. The resulting velocity graph has dimension nobs×nobs and summarizes the possible cell state changes that are well explained through the velocity vectors (for runtime speedup it can also be computed on reduced PCA space by setting approx=True).

scv.tl.velocity_graph(adata)

scv.tl.velocity_graph(adata)

For a variety of applications, the velocity graph can be converted to a transition matrix by applying a Gaussian kernel to transform the cosine correlations into actual transition probabilities. You can access the Markov transition matrix via scv.utils.get_transition_matrix.

As mentioned, it is internally used to project the velocities into a low-dimensional embedding by applying the mean transition with respect to the transition probabilities, obtained with scv.tl.velocity_embedding. Further, we can trace cells along the Markov chain to their origins and potential fates, thereby getting root cells and end points within a trajectory, obtained via scv.tl.terminal_states.

可视化结果

Finally, the velocities are projected onto any embedding, specified by basis, and visualized in one of these ways: - on cellular level with scv.pl.velocity_embedding, - as gridlines with scv.pl.velocity_embedding_grid, - or as streamlines with scv.pl.velocity_embedding_stream.

Note, that the data has an already pre-computed UMAP embedding, and annotated clusters. When applying to your own data, these can be obtained with scv.tl.umap and scv.tl.louvain. For more details, see the scanpy tutorial. Further, all plotting functions are defaulted to using basis='umap' and color='clusters', which you can set accordingly.

scv.pl.velocity_embedding_stream(adata, basis='umap',save=f"./stream.pdf")

scv.pl.velocity_embedding_stream(adata, basis='umap',save=f"./stream.pdf")

scv.pl.velocity_embedding(adata, arrow_length=3, arrow_size=2, dpi=120,save=f"./embedding.pdf")

scv.pl.velocity_embedding(adata, arrow_length=3, arrow_size=2, dpi=120,save=f"./embedding.pdf")

解释RNA速率

This is perhaps the most important part as we advise the user not to limit biological conclusions to the projected velocities, but to examine individual gene dynamics via phase portraits to understand how inferred directions are supported by particular genes.

See the gif here to get an idea of how to interpret a spliced vs. unspliced phase portrait. Gene activity is orchestrated by transcriptional regulation. Transcriptional induction for a particular gene results in an increase of (newly transcribed) precursor unspliced mRNAs while, conversely, repression or absence of transcription results in a decrease of unspliced mRNAs. Spliced mRNAs is produced from unspliced mRNA and follows the same trend with a time lag. Time is a hidden/latent variable. Thus, the dynamics needs to be inferred from what is actually measured: spliced and unspliced mRNAs as displayed in the phase portrait.

Now, let us examine the phase portraits of some marker genes, visualized with scv.pl.velocity(adata, gene_names) or scv.pl.scatter(adata, gene_names).

scv.pl.velocity(adata, ['Cpe',  'Gnao1', 'Ins2', 'Adk'], ncols=2,save=f"./Interprete_the_velocities.pdf")

scv.pl.velocity(adata, ['Cpe', 'Gnao1', 'Ins2', 'Adk'], ncols=2,save=f"./Interprete_the_velocities.pdf")

黑色虚线代表稳态，黑线上表明基因上调，黑线下表明基因下调
For instance Cpe explains the directionality in the up-regulated Ngn3 (yellow) to Pre-endocrine (orange) to β-cells (green), while Adk explains the directionality in the down-regulated Ductal (dark green) to Ngn3 (yellow) to the remaining endocrine cells.

scv.pl.scatter(adata, 'Cpe', color=['clusters', 'velocity'],
               add_outline='Ngn3 high EP, Pre-endocrine, Beta',save=f"./Cpe.pdf")
scv.pl.scatter(adata, 'Adk', color=['clusters', 'velocity'],save=f"./Adk.pdf")

Cpe

Adk

确定一些重要的驱动基因

We need a systematic way to identify genes that may help explain the resulting vector field and inferred lineages. To do so, we can test which genes have cluster-specific differential velocity expression, being siginificantly higher/lower compared to the remaining population. The module scv.tl.rank_velocity_genes runs a differential velocity t-test and outpus a gene ranking for each cluster. Thresholds can be set (e.g. min_corr) to restrict the test on a selection of gene candidates.

scv.tl.rank_velocity_genes(adata, groupby='clusters', min_corr=.3)

df = scv.DataFrame(adata.uns['rank_velocity_genes']['names'])
df.head()

scv.tl.rank_velocity_genes(adata, groupby='clusters', min_corr=.3)

kwargs = dict(frameon=False, size=10, linewidth=1.5,
              add_outline='Ngn3 high EP, Pre-endocrine, Beta')

scv.pl.scatter(adata, df['Ngn3 high EP'][:5], ylabel='Ngn3 high EP', **kwargs,save=f"./Ngn3_high_EP.pdf")
scv.pl.scatter(adata, df['Pre-endocrine'][:5], ylabel='Pre-endocrine', **kwargs,save=f"./Pre-endocrine.pdf")

scv.pl.scatter(adata, df['Ngn3 high EP'][:5], ylabel='Ngn3 high EP', **kwargs,save=f"./Ngn3_high_EP.pdf")

scv.pl.scatter(adata, df['Pre-endocrine'][:5], ylabel='Pre-endocrine', **kwargs,save=f"./Pre-endocrine.pdf")

The genes Ptprs, Pclo, Pam, Abcc8, Gnas, for instance, support the directionality from Ngn3 high EP (yellow) to Pre-endocrine (orange) to Beta (green).
基因Ptprs, Pclo, Pam, Abcc8支持黄->橙->绿的方向

分析细胞周期中的RNA速率

The cell cycle detected by RNA velocity, is biologically affirmed by cell cycle scores (standardized scores of mean expression levels of phase marker genes).

scv.tl.score_genes_cell_cycle(adata)
scv.pl.scatter(adata, color_gradients=['S_score', 'G2M_score'], smooth=True, perc=[5, 95],save=f"./cell_cycle.pdf")

scv.tl.score_genes_cell_cycle(adata)

scv.pl.scatter(adata, color_gradients=['S_score', 'G2M_score'], smooth=True, perc=[5, 95],save=f"./cell_cycle.pdf")

For the cycling Ductal cells, we may screen through S and G2M phase markers. The previous module also computed a spearmans correlation score, which we can use to rank/sort the phase marker genes to then display their phase portraits.

s_genes, g2m_genes = scv.utils.get_phase_marker_genes(adata)
s_genes = scv.get_df(adata[:, s_genes], 'spearmans_score', sort_values=True).index
g2m_genes = scv.get_df(adata[:, g2m_genes], 'spearmans_score', sort_values=True).index

kwargs = dict(frameon=False, ylabel='cell cycle genes')
scv.pl.scatter(adata, list(s_genes[:2]) + list(g2m_genes[:3]), **kwargs,save=f"./g2m_genes.pdf")

scv.pl.scatter(adata, list(s_genes[:2]) + list(g2m_genes[:3]), **kwargs,save=f"./g2m_genes.pdf")

Particularly Hells and Top2a are well-suited to explain the vector field in the cycling progenitors. Top2a gets assigned a high velocity shortly before it actually peaks in the G2M phase. There, the negative velocity then perfectly matches the immediately following down-regulation.

scv.pl.velocity(adata, ['Hells', 'Top2a'], ncols=2, add_outline=True,save=f"./Hells-Top2a.pdf")

scv.pl.velocity(adata, ['Hells', 'Top2a'], ncols=2, add_outline=True,save=f"./Hells-Top2a.pdf")

分析速度和一致性

Two more useful stats: - The speed or rate of differentiation is given by the length of the velocity vector. - The coherence of the vector field (i.e., how a velocity vector correlates with its neighboring velocities) provides a measure of confidence.

scv.tl.velocity_confidence(adata)
keys = 'velocity_length', 'velocity_confidence'
scv.pl.scatter(adata, c=keys, cmap='coolwarm', perc=[5, 95],save=f"./coherence.pdf")

scv.tl.velocity_confidence(adata)

scv.pl.scatter(adata, c=keys, cmap='coolwarm', perc=[5, 95],save=f"./coherence.pdf")

These provide insights where cells differentiate at a slower/faster pace, and where the direction is un-/determined.

On cluster-level, we find that differentiation substantially speeds up after cell cycle exit (Ngn3 low EP), keeping the pace during Beta cell production while slowing down during Alpha cell production.

速度图和伪时间

We can visualize the velocity graph to portray all velocity-inferred cell-to-cell connections/transitions. It can be confined to high-probability transitions by setting a threshold. The graph, for instance, indicates two phases of Epsilon cell production, coming from early and late Pre-endocrine cells.

scv.pl.velocity_graph(adata, threshold=.1,save=f"./velocity_graph.pdf")

scv.pl.velocity_graph(adata, threshold=.1,save=f"./velocity_graph.pdf")

Further, the graph can be used to draw descendents/anscestors coming from a specified cell. Here, a pre-endocrine cell is traced to its potential fate.

x, y = scv.utils.get_cell_transitions(adata, basis='umap', starting_cell=70)
ax = scv.pl.velocity_graph(adata, c='lightgrey', edge_width=.05, show=False)
ax = scv.pl.scatter(adata, x=x, y=y, s=120, c='ascending', cmap='gnuplot', ax=ax,save=f"./one_cell_velocity_graph.pdf")

ax = scv.pl.scatter(adata, x=x, y=y, s=120, c='ascending', cmap='gnuplot', ax=ax,save=f"./one_cell_velocity_graph.pdf")

Finally, based on the velocity graph, a velocity pseudotime can be computed. After inferring a distribution over root cells from the graph, it measures the average number of steps it takes to reach a cell after walking along the graph starting from the root cells.

Contrarily to diffusion pseudotime, it implicitly infers the root cells and is based on the directed velocity graph instead of the similarity-based diffusion kernel.

scv.tl.velocity_pseudotime(adata)
scv.pl.scatter(adata, color='velocity_pseudotime', cmap='gnuplot',save=f"./pseudotime.pdf")

scv.tl.velocity_pseudotime(adata)

scv.pl.scatter(adata, color='velocity_pseudotime', cmap='gnuplot',save=f"./pseudotime.pdf")

PAGA速度图

PAGA graph abstraction has benchmarked as top-performing method for trajectory inference. It provides a graph-like map of the data topology with weighted edges corresponding to the connectivity between two clusters. Here, PAGA is extended by velocity-inferred directionality.

#需要python-igraph包，安装命令为：conda install python-igraph

# this is needed due to a current bug - bugfix is coming soon.
adata.uns['neighbors']['distances'] = adata.obsp['distances']
adata.uns['neighbors']['connectivities'] = adata.obsp['connectivities']

scv.tl.paga(adata, groups='clusters')

scv.tl.paga(adata, groups='clusters')

scv.pl.paga(adata, basis='umap', size=50, alpha=.1,
            min_edge_width=2, node_size_scale=1.5,save=f"./PAGA.pdf")

scv.pl.paga

scVelo计算RNA速率——动态模型（最新）

We run the dynamical model to learn the full transcriptional dynamics of splicing kinetics.

It is solved in a likelihood-based expectation-maximization framework, by iteratively estimating the parameters of reaction rates and latent cell-specific variables, i.e. transcriptional state and cell-internal latent time. It thereby aims to learn the unspliced/spliced phase trajectory for each gene.

scv.tl.recover_dynamics(adata)
scv.tl.velocity(adata, mode='dynamical')
scv.tl.velocity_graph(adata)

scv.tl.recover_dynamics(adata)

scv.tl.velocity

Running the dynamical model can take a while. Hence, you may want to store the results for re-use, with adata.write('data/pancreas.h5ad'), which can later be read with data = scv.read('data/pancreas.h5ad').

#adata.write('data/pancreas.h5ad', compression='gzip')
#adata = scv.read('data/pancreas.h5ad')

scv.pl.velocity_embedding_stream(adata, basis='umap',save=f"./Umap.pdf")

scv.pl.velocity_embedding_stream(adata, basis='umap',save=f"./Umap.pdf")

计算动力学参数

df = adata.var
df = df[(df['fit_likelihood'] > .1) & df['velocity_genes'] == True]

kwargs = dict(xscale='log', fontsize=16)
with scv.GridSpec(ncols=3) as pl:
    pl.hist(df['fit_alpha'], xlabel='transcription rate', **kwargs)
    pl.hist(df['fit_beta'] * df['fit_scaling'], xlabel='splicing rate', xticks=[.1, .4, 1], **kwargs)
    pl.hist(df['fit_gamma'], xlabel='degradation rate', xticks=[.1, .4, 1], **kwargs)
plt.savefig("Kinetic_rate_paramters.pdf",bbox_inches = 'tight')
scv.get_df(adata, 'fit*', dropna=True).head()

with scv.GridSpec()

Kinetic_rate_paramters.pdf

scv.get_df(adata, 'fit*', dropna=True).head()

The estimated gene-specific parameters comprise rates of transription (fit_alpha), splicing (fit_beta), degradation (fit_gamma),
switching time point (fit_t_),
a scaling parameter to adjust for under-represented unspliced reads (fit_scaling),
standard deviation of unspliced and spliced reads (fit_std_u, fit_std_s),
the gene likelihood (fit_likelihood),
inferred steady-state levels (fit_steady_u, fit_steady_s) with their corresponding p-values (fit_pval_steady_u, fit_pval_steady_s),
the overall model variance (fit_variance),
and a scaling factor to align the gene-wise latent times to a universal, gene-shared latent time (fit_alignment_scaling).

计算潜伏时间

The dynamical model recovers the latent time of the underlying cellular processes. This latent time represents the cell’s internal clock and approximates the real time experienced by cells as they differentiate, based only on its transcriptional dynamics.

scv.tl.latent_time(adata)
scv.pl.scatter(adata, color='latent_time', color_map='gnuplot', size=80,save=f"./latent_time.pdf")

scv.tl.latent_time(adata)

scv.pl.scatter(adata, color='latent_time', color_map='gnuplot', size=80,save=f"./latent_time.pdf)

top_genes = adata.var['fit_likelihood'].sort_values(ascending=False).index[:300]
scv.pl.heatmap(adata, var_names=top_genes, sortby='latent_time', col_color='clusters', n_convolve=100,save=f"./heatmap.pdf")

scv.pl.heatmap(adata, var_names=top_genes, sortby='latent_time', col_color='clusters', n_convolve=100,save=f"./heatmap.pdf")

找Top-likelihood 基因

Driver genes display pronounced dynamic behavior and are systematically detected via their characterization by high likelihoods in the dynamic model.

top_genes = adata.var['fit_likelihood'].sort_values(ascending=False).index
scv.pl.scatter(adata, basis=top_genes[:15], ncols=5, frameon=False,save=f"./top_genes.pdf")

var_names = ['Actn4', 'Ppp3ca', 'Cpe', 'Nnat']
scv.pl.scatter(adata, var_names, frameon=False,save=f"./top_genes2.pdf")
scv.pl.scatter(adata, x='latent_time', y=var_names, frameon=False,save=f"./top_genes3.pdf")

scv.pl.scatter(adata, basis=top_genes[:15], ncols=5, frameon=False,save=f"./top_genes.pdf")

scv.pl.scatter(adata, var_names, frameon=False,save=f"./top_genes2.pdf")

scv.pl.scatter(adata, x='latent_time', y=var_names, frameon=False,save=f"./top_genes3.pdf")

找cluster特异性的Top-likelihood 基因

Moreover, partial gene likelihoods can be computed for a each cluster of cells to enable cluster-specific identification of potential drivers.

scv.tl.rank_dynamical_genes(adata, groupby='clusters')
df = scv.get_df(adata, 'rank_dynamical_genes/names')
df.head(5)

df.head(5)

for cluster in ['Ductal', 'Ngn3 high EP', 'Pre-endocrine', 'Beta']:
    scv.pl.scatter(adata, df[cluster][:5], ylabel=cluster, frameon=False)
    plt.savefig(cluster+"_top_genes.pdf",bbox_inches = 'tight')