CoRegNet：共调控网络的重建和整合分析

最编程 2024-04-19 19:36:07

...

1. 介绍

CoRegNet包的目的是从转录组数据推断大型转录共调控网络，整合外部数据和基因调控信息去推断和分析转录组项目。这个包中独特的网络推断算法可以学习共调控网络。通过识别不同数据类型的基因共合作调节（co-operative regulators of genes）允许不同类型数据的整合到一个共表达分析中。

#安装
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("CoRegNet")

library(CoRegNet)
data(CIT_BLCA_EXP,HumanTF,CIT_BLCA_Subgroup)
dim(CIT_BLCA_EXP)
# [1] 1000  183
# 1000个基因x183个sample的基因表达矩阵
length(intersect(rownames(CIT_BLCA_EXP),HumanTF))
# [1] 67 ##1000个基因中有67个转录因子
head(intersect(rownames(CIT_BLCA_EXP),HumanTF))
# [1] "EEF1A1" "INSM1"  "HOXD1"  "IFI16"  "ID2"  "GRHL1"

2. Quick user guide（CoRegNet的主要功能）

1. 从基因表达数据中重建大规模调控网络

grn = hLICORN(CIT_BLCA_EXP, TFlist=HumanTF) 
#输入的是基因表达矩阵和2020个转录因子的list(包内置的)

2. 推断转录因子活性

influence = regulatorInfluence(grn,CIT_BLCA_EXP)

3. Retrieve inferred co-coregulators

coregs= coregulators(grn)

4. Analyze the network of cooperative regulators using an interactive display

display(grn,CIT_BLCA_EXP,influence,clinicalData=CIT_BLCA_Subgroup)

3详细功能和用法介绍

3.1 Construction of a large-scale co-regulatory network from gene expression data

CoRegNet包中的推理算法是LICORN 算法的hybrid版本。
网络的重构包含四步：
First, the gene expression data is discretized.
Second, all the potential sets of cooperative regulators are extracted using the apriori algorithm of frequent itemset mining.
Third, the best combinations of co-activators and co-inhibitors are identified for each gene.
Finally, a continuous model of regulation using a linear regression method with interaction terms is used to score the local gene regulatory networks of each gene.

使用CoRegNet做共表达至少需要输入两个dataset：

基因表达矩阵/数据框。要求行名为基因名，列名为样本名且列名要唯一。

TF的基因list。CoRegNet包中内置了人类转录因子list，并且有official gene symbol (HumanTF)和EntrezGene IDs (HumanTF_entrezgene)这两种格式。使用的时候直接用data(HumanTF)或data(HumanTF_entrezgene)调用即可。

# An example of how to infer a co-regulation network
grn =hLICORN(CIT_BLCA_EXP, TFlist=HumanTF)
print(grn) #grn是根据输入的表达矩阵和转录因子预测出来的转录因子，靶基因和调控互作
# [1] "63 Transcription Factors.  867 Target Genes.  8072 Regulatory interactions."
# [1] "No added evidences."
head(grn@GRN)
#   Target coact  corep         Coef.Acts          Coef.Reps Coef.coActs Coef.coReps         R2      RMSE
# 1 DIRAS2 NR1H4  IFI16 0.242753364850399  -0.39999852548476          NA          NA 0.24528282 0.9670125
# 2 DIRAS2 NR1H4    MYC 0.225607457464437 -0.273761935521453          NA          NA 0.11392664 1.0525775
# 3 DIRAS2 NR1H4  SNAI2 0.158558921637845 -0.349914647715346          NA          NA 0.23964161 0.9705609
# 4 DIRAS2 NR1H4   IRX3  0.25463526841265 -0.180678166467546          NA          NA 0.12653573 1.0419588
# 5 DIRAS2 NR1H4 SPOCD1  0.29515905804531 -0.212927437477536          NA          NA 0.14379615 1.0316241
# 6 DIRAS2 NR1H4   <NA> 0.308379330917705               <NA>          NA          NA 0.08489369 1.0651439

# 导出所有的Target
Target=unique(grn@GRN$Target) #可以拿出来做GO富集等

63 Transcription Factors x 867 Target Gene x 8072 Regulatory interactions

在上面hLICORN()这一步是自动对输入的矩阵进行了离散化的，也就是说，实际上用于构建调控网络的input矩阵是离散化后的矩阵。下面是对矩阵进行手动离散化的两种方法

#Default discretization. 
#Uses the standard deviation of the whole dataset to set a threshold.
disc1=discretizeExpressionData(CIT_BLCA_EXP)
table(disc1)
# disc1
#     -1      0      1 
#  23482 131019  28499 
boxplot(as.matrix(CIT_BLCA_EXP)~disc1)

#Discretization with a hard threshold
disc2=discretizeExpressionData(CIT_BLCA_EXP, threshold=1)
table(disc2)
# disc2
#    -1     0     1 
# 45956 93063 43981
boxplot(as.matrix(CIT_BLCA_EXP)~disc2)

# more examples here
help(discretizeExpressionData)

使用矩阵前200个基因以方便运算（可以开4线程，速度更快）

# running only on the 200 first gene in the matrix for fast analysis
# Choosing to divide in 4 threads whenever possible
options("mc.cores"=4)
grn =hLICORN(head(CIT_BLCA_EXP,200), TFlist=HumanTF)
print(grn)
# [1] "18 Transcription Factors.  161 Target Genes.  642 Regulatory interactions."
# [1] "No added evidences."
options("mc.cores"=2)
grn =hLICORN(head(CIT_BLCA_EXP,200), TFlist=HumanTF)
print(grn)
# [1] "18 Transcription Factors.  161 Target Genes.  642 Regulatory interactions."
# [1] "No added evidences."

3.2 Refining the inferred regulatory networks

第二步是使用external knowledge来丰富前面得到的inferred regulatory network。有两种类型的外部数据集可以使用：1. TF对基因的调控数据比如Transcription Factor Binding Sites (TFBS) 或ChIP数据；2. 共调控信息比如蛋白-蛋白互作，这类数据可以提供TFs的合作信息。
❗️对于前面两种类型的数据，分别使用addEvidences和addCooperativeEvidences函数来添加。

# 引入四类外部数据
# ChIP data from the CHEA database
data(CHEA_sub)
#ChIP data from the ENCODE project
data(ENCODE_sub)
# Protein protein interactions between TF from the HIPPIE database
data(HIPPIE_sub)
# Protein protein interactions between TF from the STRING database
data(STRING_sub)

enrichedGRN = addEvidences(grn,CHEA_sub,ENCODE_sub)
# CHEA_sub was integrated into the network.
# ENCODE_sub was integrated into the network.
enrichedGRN = addCooperativeEvidences(enrichedGRN,HIPPIE_sub,STRING_sub)
# [1] "HIPPIE_sub"
# [1] "No evidence from HIPPIE_sub were found in the inferred network."
# [1] "STRING_sub"
# [1] "STRING_sub was integrated into the network."

得到的enrichedGRN就是引入external knowledge的grn

print(enrichedGRN)
# [1] "63 Transcription Factors.  867 Target Genes.  8072 Regulatory interactions."
#            commonGene commonReg evidences commonEvidences       enrichment              p.value
# CHEA_sub          567        12      1050             218 1.13521053808365   0.0701008769458297
# ENCODE_sub        724        10      2749             382 1.18076947263415   0.0114797404982399
# STRING_sub       <NA>        39       137              36 3.16656003302599 2.24846204249256e-06

导入外部数据之后就可以对共调控网络进行refine，可以使用默认的refine方法。

# Default unsupervised refinement method
refinedGRN = refine(enrichedGRN)
print(refinedGRN)
# [1] "57 Transcription Factors.  867 Target Genes.  1854 Regulatory interactions."
#            commonGene commonReg evidences commonEvidences       enrichment              p.value
# CHEA_sub          567        12      1050             176 3.55976690089225  3.1172872878133e-32
# ENCODE_sub        724        10      2749             296 4.71677404968161 3.01483641042892e-48
# STRING_sub       <NA>        21        83              17              Inf 4.69277379589639e-08

可以看到refine之后，Transcription Factors和1854 Regulatory interactions的数目都发生了变化

也可以使用指定的数据集来进行refine

# Example of supervised refinement with the CHEA chip data
refinedGRN = refine(enrichedGRN, integration="supervised",referenceEvidence="CHEA_sub")
print(refinedGRN)

3.3 Identification of active transcriptional programs

CoRegNet包的目的是从给定样本或样本集中识别active cooperative TF的集合。Transcriptional activity的衡量方法是估计给定样本中特定转录调节因子的活化程度。这个衡量方法是比较transcriptional network中活化的和抑制的基因表达。它是基于一个样本中这两个基因sets间的divergence的测量(Welch’s t statistics)。从根本上来说，如果可被某个TF激活的基因出现高表达的同时可被抑制的基因出现低表达，这个TF就具有high influence。
使用一个coregnet object的共表达网络和一个基因表达矩阵，不管这个矩阵是用来推断共表达网络的矩阵还是别的矩阵。output的是一个列为样本（和基因表达矩阵一致），行为TF的矩阵。每一行的TF都在转录网络中拥有足够数量的targets(至少10 activated和10 repressed)。

CITinf =regulatorInfluence(grn,CIT_BLCA_EXP)
CITinf[1:6,1:6]
#             CIT100    CIT102     CIT103    CIT105    CIT106    CIT107
# EMX2    -0.6705802 -4.134082   3.433414  8.512250 -2.129228  2.170088
# EGR1     6.2721207 14.563272  -6.805588 -4.347788 11.962142 -3.975751
# TGFB1I1  7.0348603 19.819534  -4.918518 -5.555589 12.702996 -3.141942
# EGR2     4.5810891 14.504279  -8.664420 -7.408075 11.432331 -9.668110
# PLAGL1   5.7439075 18.780449 -10.075245 -7.001515 13.497227 -7.333840
# SOX9     6.1841193 10.627645  -4.749642 -4.485600 12.322927 -1.968706
str(CITinf)
#  num [1:33, 1:183] -0.671 6.272 7.035 4.581 5.744 ...
#  - attr(*, "dimnames")=List of 2
#   ..$ : chr [1:33] "EMX2" "EGR1" "TGFB1I1" "EGR2" ...
#   ..$ : chr [1:183] "CIT100" "CIT102" "CIT103" "CIT105" ...

这个新的transcriptional influence数据集可以被视为整个转录数据集的condensed版本。

# Coregulators of a hLICORN  inferred network
head(coregulators(grn))
#    Reg1    Reg2     Support   NA nGRN    fisherTest adjustedPvalue
# 1 PPARG   VGLL1 0.017881213 1247 1247  1.436958e-68   1.638132e-67
# 2 GATA3   PPARG 0.016891795 1178 1178  2.861003e-87   1.087181e-85
# 3 FOXF1 TGFB1I1 0.013536379  944  944 2.838172e-101   3.235516e-99
# 4  EGR1    EGR2 0.012733373  888  888  4.332173e-72   6.173346e-71
# 5  MSX2   PPARG 0.012288853  857  857  2.767895e-77   6.310800e-76
# 6 GATA3    MSX2 0.009922854  692  692  9.068728e-80   2.584587e-78

也可以引入分组数据和copy number数据等

data(CIT_BLCA_CNV)
data(CIT_BLCA_Subgroup)

display(grn,expressionData=CIT_BLCA_EXP,TFA=CITinf)

# Visualizing additional regulatory or co-regulatory evidences in the network
display(enrichedGRN,expressionData=CIT_BLCA_EXP,TFA=CITinf)

# Visualizing sample classification using a named factor
display(grn,expressionData=CIT_BLCA_EXP,TFA=CITinf,clinicalData=CIT_BLCA_Subgroup)

# Visualizing copy number alteration of regulators
data(CIT_BLCA_CNV)
display(grn,expressionData=CIT_BLCA_EXP,TFA=CITinf,clinicalData=CIT_BLCA_Subgroup,alterationData=CIT_BLCA_CNV)

参考：
http://www.bioconductor.org/packages/release/bioc/vignettes/CoRegNet/inst/doc/CoRegNet.html
差异共表达网络

上一篇：数据包络分析方法

下一篇： set_false_path 和 set_clock_groups 有什么区别？- 了解