欢迎您访问 最编程 本站为您分享编程语言代码,编程技术文章!
您现在的位置是: 首页

CoRegNet:共调控网络的重建和整合分析

最编程 2024-04-19 19:36:07
...

1. 介绍

CoRegNet包的目的是从转录组数据推断大型转录共调控网络,整合外部数据和基因调控信息去推断和分析转录组项目。这个包中独特的网络推断算法可以学习共调控网络。通过识别不同数据类型的基因共合作调节(co-operative regulators of genes)允许不同类型数据的整合到一个共表达分析中。

#安装
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("CoRegNet")
library(CoRegNet)
data(CIT_BLCA_EXP,HumanTF,CIT_BLCA_Subgroup)
dim(CIT_BLCA_EXP)
# [1] 1000  183
# 1000个基因x183个sample的基因表达矩阵
length(intersect(rownames(CIT_BLCA_EXP),HumanTF))
# [1] 67 ##1000个基因中有67个转录因子
head(intersect(rownames(CIT_BLCA_EXP),HumanTF))
# [1] "EEF1A1" "INSM1"  "HOXD1"  "IFI16"  "ID2"  "GRHL1" 

2. Quick user guide(CoRegNet的主要功能)

1. 从基因表达数据中重建大规模调控网络
grn = hLICORN(CIT_BLCA_EXP, TFlist=HumanTF) 
#输入的是基因表达矩阵和2020个转录因子的list(包内置的)
2. 推断转录因子活性
influence = regulatorInfluence(grn,CIT_BLCA_EXP)
3. Retrieve inferred co-coregulators
coregs= coregulators(grn)
4. Analyze the network of cooperative regulators using an interactive display
display(grn,CIT_BLCA_EXP,influence,clinicalData=CIT_BLCA_Subgroup)

3详细功能和用法介绍

3.1 Construction of a large-scale co-regulatory network from gene expression data

CoRegNet包中的推理算法是LICORN 算法的hybrid版本。
网络的重构包含四步:
First, the gene expression data is discretized.
Second, all the potential sets of cooperative regulators are extracted using the apriori algorithm of frequent itemset mining.
Third, the best combinations of co-activators and co-inhibitors are identified for each gene.
Finally, a continuous model of regulation using a linear regression method with interaction terms is used to score the local gene regulatory networks of each gene.

使用CoRegNet做共表达至少需要输入两个dataset:

  1. 基因表达矩阵/数据框。要求行名为基因名,列名为样本名且列名要唯一。
  2. TF的基因list。CoRegNet包中内置了人类转录因子list,并且有official gene symbol (HumanTF)和EntrezGene IDs (HumanTF_entrezgene)这两种格式。使用的时候直接用data(HumanTF)data(HumanTF_entrezgene)调用即可。
# An example of how to infer a co-regulation network
grn =hLICORN(CIT_BLCA_EXP, TFlist=HumanTF)
print(grn) #grn是根据输入的表达矩阵和转录因子预测出来的转录因子,靶基因和调控互作
# [1] "63 Transcription Factors.  867 Target Genes.  8072 Regulatory interactions."
# [1] "No added evidences."
head(grn@GRN)
#   Target coact  corep         Coef.Acts          Coef.Reps Coef.coActs Coef.coReps         R2      RMSE
# 1 DIRAS2 NR1H4  IFI16 0.242753364850399  -0.39999852548476          NA          NA 0.24528282 0.9670125
# 2 DIRAS2 NR1H4    MYC 0.225607457464437 -0.273761935521453          NA          NA 0.11392664 1.0525775
# 3 DIRAS2 NR1H4  SNAI2 0.158558921637845 -0.349914647715346          NA          NA 0.23964161 0.9705609
# 4 DIRAS2 NR1H4   IRX3  0.25463526841265 -0.180678166467546          NA          NA 0.12653573 1.0419588
# 5 DIRAS2 NR1H4 SPOCD1  0.29515905804531 -0.212927437477536          NA          NA 0.14379615 1.0316241
# 6 DIRAS2 NR1H4   <NA> 0.308379330917705               <NA>          NA          NA 0.08489369 1.0651439

# 导出所有的Target
Target=unique(grn@GRN$Target) #可以拿出来做GO富集等
63 Transcription Factors x 867 Target Gene x 8072 Regulatory interactions

在上面hLICORN()这一步是自动对输入的矩阵进行了离散化的,也就是说,实际上用于构建调控网络的input矩阵是离散化后的矩阵。下面是对矩阵进行手动离散化的两种方法

#Default discretization. 
#Uses the standard deviation of the whole dataset to set a threshold.
disc1=discretizeExpressionData(CIT_BLCA_EXP)
table(disc1)
# disc1
#     -1      0      1 
#  23482 131019  28499 
boxplot(as.matrix(CIT_BLCA_EXP)~disc1)
#Discretization with a hard threshold
disc2=discretizeExpressionData(CIT_BLCA_EXP, threshold=1)
table(disc2)
# disc2
#    -1     0     1 
# 45956 93063 43981
boxplot(as.matrix(CIT_BLCA_EXP)~disc2)
# more examples here
help(discretizeExpressionData)

使用矩阵前200个基因以方便运算(可以开4线程,速度更快)

# running only on the 200 first gene in the matrix for fast analysis
# Choosing to divide in 4 threads whenever possible
options("mc.cores"=4)
grn =hLICORN(head(CIT_BLCA_EXP,200), TFlist=HumanTF)
print(grn)
# [1] "18 Transcription Factors.  161 Target Genes.  642 Regulatory interactions."
# [1] "No added evidences."
options("mc.cores"=2)
grn =hLICORN(head(CIT_BLCA_EXP,200), TFlist=HumanTF)
print(grn)
# [1] "18 Transcription Factors.  161 Target Genes.  642 Regulatory interactions."
# [1] "No added evidences."
3.2 Refining the inferred regulatory networks

第二步是使用external knowledge来丰富前面得到的inferred regulatory network。有两种类型的外部数据集可以使用:1. TF对基因的调控数据比如Transcription Factor Binding Sites (TFBS)ChIP数据;2. 共调控信息比如蛋白-蛋白互作,这类数据可以提供TFs的合作信息。
❗️对于前面两种类型的数据,分别使用addEvidencesaddCooperativeEvidences函数来添加。

# 引入四类外部数据
# ChIP data from the CHEA database
data(CHEA_sub)
#ChIP data from the ENCODE project
data(ENCODE_sub)
# Protein protein interactions between TF from the HIPPIE database
data(HIPPIE_sub)
# Protein protein interactions between TF from the STRING database
data(STRING_sub)

enrichedGRN = addEvidences(grn,CHEA_sub,ENCODE_sub)
# CHEA_sub was integrated into the network.
# ENCODE_sub was integrated into the network.
enrichedGRN = addCooperativeEvidences(enrichedGRN,HIPPIE_sub,STRING_sub)
# [1] "HIPPIE_sub"
# [1] "No evidence from HIPPIE_sub were found in the inferred network."
# [1] "STRING_sub"
# [1] "STRING_sub was integrated into the network."

得到的enrichedGRN就是引入external knowledge的grn

print(enrichedGRN)
# [1] "63 Transcription Factors.  867 Target Genes.  8072 Regulatory interactions."
#            commonGene commonReg evidences commonEvidences       enrichment              p.value
# CHEA_sub          567        12      1050             218 1.13521053808365   0.0701008769458297
# ENCODE_sub        724        10      2749             382 1.18076947263415   0.0114797404982399
# STRING_sub       <NA>        39       137              36 3.16656003302599 2.24846204249256e-06

导入外部数据之后就可以对 共调控网络进行refine,可以使用默认的refine方法。

# Default unsupervised refinement method
refinedGRN = refine(enrichedGRN)
print(refinedGRN)
# [1] "57 Transcription Factors.  867 Target Genes.  1854 Regulatory interactions."
#            commonGene commonReg evidences commonEvidences       enrichment              p.value
# CHEA_sub          567        12      1050             176 3.55976690089225  3.1172872878133e-32
# ENCODE_sub        724        10      2749             296 4.71677404968161 3.01483641042892e-48
# STRING_sub       <NA>        21        83              17              Inf 4.69277379589639e-08

可以看到refine之后,Transcription Factors和1854 Regulatory interactions的数目都发生了变化

也可以使用指定的数据集来进行refine

# Example of supervised refinement with the CHEA chip data
refinedGRN = refine(enrichedGRN, integration="supervised",referenceEvidence="CHEA_sub")
print(refinedGRN)
3.3 Identification of active transcriptional programs

CoRegNet包的目的是从给定样本或样本集中识别active cooperative TF的集合。Transcriptional activity的衡量方法是估计给定样本中特定转录调节因子的活化程度。这个衡量方法是比较transcriptional network中活化的和抑制的基因表达。它是基于一个样本中这两个基因sets间的divergence的测量(Welch’s t statistics)。从根本上来说,如果可被某个TF激活的基因出现高表达的同时可被抑制的基因出现低表达,这个TF就具有high influence。
使用一个coregnet object的共表达网络和一个基因表达矩阵,不管这个矩阵是用来推断共表达网络的矩阵还是别的矩阵。output的是一个列为样本(和基因表达矩阵一致),行为TF的矩阵。每一行的TF都在转录网络中拥有足够数量的targets(至少10 activated和10 repressed)。

CITinf =regulatorInfluence(grn,CIT_BLCA_EXP)
CITinf[1:6,1:6]
#             CIT100    CIT102     CIT103    CIT105    CIT106    CIT107
# EMX2    -0.6705802 -4.134082   3.433414  8.512250 -2.129228  2.170088
# EGR1     6.2721207 14.563272  -6.805588 -4.347788 11.962142 -3.975751
# TGFB1I1  7.0348603 19.819534  -4.918518 -5.555589 12.702996 -3.141942
# EGR2     4.5810891 14.504279  -8.664420 -7.408075 11.432331 -9.668110
# PLAGL1   5.7439075 18.780449 -10.075245 -7.001515 13.497227 -7.333840
# SOX9     6.1841193 10.627645  -4.749642 -4.485600 12.322927 -1.968706
str(CITinf)
#  num [1:33, 1:183] -0.671 6.272 7.035 4.581 5.744 ...
#  - attr(*, "dimnames")=List of 2
#   ..$ : chr [1:33] "EMX2" "EGR1" "TGFB1I1" "EGR2" ...
#   ..$ : chr [1:183] "CIT100" "CIT102" "CIT103" "CIT105" ...

这个新的transcriptional influence数据集可以被视为整个转录数据集的condensed版本。

# Coregulators of a hLICORN  inferred network
head(coregulators(grn))
#    Reg1    Reg2     Support   NA nGRN    fisherTest adjustedPvalue
# 1 PPARG   VGLL1 0.017881213 1247 1247  1.436958e-68   1.638132e-67
# 2 GATA3   PPARG 0.016891795 1178 1178  2.861003e-87   1.087181e-85
# 3 FOXF1 TGFB1I1 0.013536379  944  944 2.838172e-101   3.235516e-99
# 4  EGR1    EGR2 0.012733373  888  888  4.332173e-72   6.173346e-71
# 5  MSX2   PPARG 0.012288853  857  857  2.767895e-77   6.310800e-76
# 6 GATA3    MSX2 0.009922854  692  692  9.068728e-80   2.584587e-78

也可以引入分组数据和copy number数据等

data(CIT_BLCA_CNV)
data(CIT_BLCA_Subgroup)
display(grn,expressionData=CIT_BLCA_EXP,TFA=CITinf)
# Visualizing additional regulatory or co-regulatory evidences in the network
display(enrichedGRN,expressionData=CIT_BLCA_EXP,TFA=CITinf)
# Visualizing sample classification using a named factor
display(grn,expressionData=CIT_BLCA_EXP,TFA=CITinf,clinicalData=CIT_BLCA_Subgroup)
# Visualizing copy number alteration of regulators
data(CIT_BLCA_CNV)
display(grn,expressionData=CIT_BLCA_EXP,TFA=CITinf,clinicalData=CIT_BLCA_Subgroup,alterationData=CIT_BLCA_CNV)

参考:
http://www.bioconductor.org/packages/release/bioc/vignettes/CoRegNet/inst/doc/CoRegNet.html
差异共表达网络