实例解析：使用Hisat2, StringTie与Ballgown进行拟南芥转录组数据的分析步骤

最编程 2024-02-18 21:05:43

...

原文链接

RNA-Seq for Model Plant (Arabidopsis thaliana)

参考文章

https://bi.biopapyrus.jp/rnaseq/mapping/hista/hisat2-paired-rnaseq.html

下载数据

直接利用参考文章里的shell脚本

SEQLIBS=(SRR8428909 SRR8428908 SRR8428907 SRR8428906 SRR8428905 SRR8428904)
for seqlib in ${SEQLIBS[@]}; do    wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR842/00${seqlib:9:10}/${seqlib}/${seqlib}_1.fastq.gz    wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR842/00${seqlib:9:10}/${seqlib}/${seqlib}_2.fastq.gzdone

执行

bash download_raw_data.sh

数据对应的文章

https://www.ncbi.nlm.nih.gov/pubmed?LinkName=sra_pubmed&from_uid=7072798

基本的实验设计：野生型对应转基因株系

解压重命名

bgzip -d SRR8428909_1.fastq.gzbgzip -d SRR8428909_2.fastq.gz
mv SRR8428909_1.fastq wt_Rep1_R1.fastqmv SRR8428909_2.fastq wt_Rep1_R2.fastq
bgzip -d SRR8428908_1.fastq.gzbgzip -d SRR8428908_2.fastq.gz
mv SRR8428908_1.fastq wt_Rep2_R1.fastqmv SRR8428908_2.fastq wt_Rep2_R2.fastq
bgzip -d SRR8428907_1.fastq.gzbgzip -d SRR8428907_2.fastq.gz
mv SRR8428907_1.fastq wt_Rep3_R1.fastqmv SRR8428907_2.fastq wt_Rep3_R2.fastq
bgzip -d SRR8428906_1.fastq.gzbgzip -d SRR8428906_2.fastq.gz
mv SRR8428906_1.fastq EE_Rep1_R1.fastqmv SRR8428906_2.fastq EE_Rep1_R2.fastq
bgzip -d SRR8428905_1.fastq.gzbgzip -d SRR8428905_2.fastq.gz
mv SRR8428905_1.fastq EE_Rep2_R1.fastqmv SRR8428905_2.fastq EE_Rep2_R2.fastq
bgzip -d SRR8428904_1.fastq.gzbgzip -d SRR8428904_2.fastq.gz
mv SRR8428904_1.fastq EE_Rep3_R1.fastqmv SRR8428904_2.fastq EE_Rep3_R2.fastq

在windows电脑上写到shell脚本在linux运行的时候遇到报错

2_unzip_raw_data.sh: line 3: $'\r': command not found

修改一下

sed -i 's/\r$//' 2_unzip_raw_data.sh

使用fastp对数据进行过滤

SEQLIBS=(EE_Rep1 EE_Rep2 EE_Rep3 wt_Rep1 wt_Rep2 wt_Rep3)
for seqlib in ${SEQLIBS[@]}; do    fastp -i ${seqlib}_R1.fastq -I ${seqlib}_R2.fastq -o ${seqlib}_clean_R1.fastq -O ${seqlib}_clean_R2.fastqdone

下载参考基因组和注释文件

wget ftp://ftp.ensemblgenomes.org/pub/plants/release-40/fasta/arabidopsis_thaliana/dna/Arabidopsis_thaliana.TAIR10.dna.toplevel.fa.gzbgzip -d Arabidopsis_thaliana.TAIR10.dna.toplevel.fa.gzwget ftp://ftp.ensemblgenomes.org/pub/plants/release-40/gff3/arabidopsis_thaliana/Arabidopsis_thaliana.TAIR10.40.gff3.gzbgzip -d Arabidopsis_thaliana.TAIR10.40.gff3.gz

hisat2构建索引

mkdir referencemv Arabidopsis_thaliana* referencecd referencehisat2-build Arabidopsis_thaliana.TAIR10.dna.toplevel.fa tair10

hisat2比对

SEQLIBS=(EE_Rep1 EE_Rep2 EE_Rep3 wt_Rep1 wt_Rep2 wt_Rep3)
for seqlib in ${SEQLIBS[@]}; do    hisat2 -x reference/tair10 -1 ${seqlib}_clean_R1.fastq -2 ${seqlib}_clean_R2.fastq -p 4 -S ${seqlib}.samdone

sam文件转换为bam文件并排序

SEQLIBS=(EE_Rep1 EE_Rep2 EE_Rep3 wt_Rep1 wt_Rep2 wt_Rep3)
for seqlib in ${SEQLIBS[@]}; do    samtools view -@ 8 -bhS ${seqlib}.sam -o ${seqlib}.bam	samtools sort -@ 8 ${seqlib}.bam -o ${seqlib}.sorted.bamdone

把之前下载的gff3文件转化为gtf格式

cd referencegffread Arabidopsis_thaliana.TAIR10.40.gff3 -o TAIR10_GFF3_genes.gtf

接下来的具体步骤我也暂时看不明白具体是做啥了，反正最后得到的就是ballgown的输入文件了

stringtie -p 8 -l wT1 -G reference/TAIR10_GFF3_genes.gtf -o athaliana_wt_Rep1/transcripts.gtf wt_Rep1.sorted.bamstringtie -p 8 -l wT2 -G reference/TAIR10_GFF3_genes.gtf -o athaliana_wt_Rep2/transcripts.gtf wt_Rep2.sorted.bamstringtie -p 8 -l wT3 -G reference/TAIR10_GFF3_genes.gtf -o athaliana_wt_Rep3/transcripts.gtf wt_Rep3.sorted.bam
stringtie -p 8 -l EE1 -G reference/TAIR10_GFF3_genes.gtf -o athaliana_EE_Rep1/transcripts.gtf EE_Rep1.sorted.bamstringtie -p 8 -l EE2 -G reference/TAIR10_GFF3_genes.gtf -o athaliana_EE_Rep2/transcripts.gtf EE_Rep2.sorted.bamstringtie -p 8 -l EE3 -G reference/TAIR10_GFF3_genes.gtf -o athaliana_EE_Rep3/transcripts.gtf EE_Rep3.sorted.bam
ls -l athaliana_*/*.gtf | awk '{print $9}' > sample_assembly_gtf_list.txt
stringtie --merge -p 8 -o stringtie_merged.gtf -G reference/TAIR10_GFF3_genes.gtf sample_assembly_gtf_list.txtgffcompare -r reference/TAIR10_GFF3_genes.gtf -o gffcompare stringtie_merged.gtfstringtie -e -B -p 4 wt_Rep1.sorted.bam -G stringtie_merged.gtf -o athaliana_wt_Rep1/athaliana_wt_Rep1stringtie -e -B -p 4 wt_Rep2.sorted.bam -G stringtie_merged.gtf -o athaliana_wt_Rep2/athaliana_wt_Rep2.count -A athaliana_wt_Rep2/wt_Rep2_gene_abun.outstringtie -e -B -p 4 wt_Rep3.sorted.bam -G stringtie_merged.gtf -o athaliana_wt_Rep3/athaliana_wt_Rep3.count -A athaliana_wt_Rep3/wt_Rep3_gene_abun.outstringtie -e -B -p 4 EE_Rep1.sorted.bam -G stringtie_merged.gtf -o athaliana_EE_Rep1/athaliana_EE_Rep1.count -A athaliana_EE_Rep1/EE_Rep1_gene_abun.outstringtie -e -B -p 4 EE_Rep2.sorted.bam -G stringtie_merged.gtf -o athaliana_EE_Rep2/athaliana_EE_Rep2.count -A athaliana_EE_Rep2/EE_Rep2_gene_abun.outstringtie -e -B -p 4 EE_Rep3.sorted.bam -G stringtie_merged.gtf -o athaliana_EE_Rep3/athaliana_EE_Rep3.count -A athaliana_EE_Rep3/EE_Rep3_gene_abun.out

将以下文件导出到本地，放到一个文件夹下，我放到了ballgown这个文件夹下

[1] "athaliana_EE_Rep1" "athaliana_EE_Rep2" "athaliana_EE_Rep3" [4] "athaliana_wt_Rep1" "athaliana_wt_Rep2" "athaliana_wt_Rep3"

接下来开始差异表达分析

list.files("ballgown/")sample<-c("athaliana_EE_Rep1","athaliana_EE_Rep2", "athaliana_EE_Rep3", "athaliana_wt_Rep1", "athaliana_wt_Rep2", "athaliana_wt_Rep3" )type<-c(rep("EE",3),rep("wt",3))pheno_df<-data.frame("sample"=sample,"type"=type)rownames(pheno_df)<-pheno_df[,1]pheno_dfBiocManager::install("ballgown")library(ballgown)BiocManager::install("alyssafrazee/RSkittleBrewer")library(RSkittleBrewer)library(genefilter)library(dplyr)library(ggplot2)library(gplots)library(devtools)install.packages("RFLPtools")library(RFLPtools)bg <- ballgown(dataDir = "./ballgown/", pData=pheno_df, samplePattern = "athaliana")gene_expression<-gexpr(bg)head(gene_expression)boxplot(log10(gene_expression+1),names=c("EE1","EE2","EE3","WT1","WT2","WT3"),        col=c("red", "red","red","blue", "blue","blue"))

samples<-c("EE1","EE2","EE3","WT1","WT2","WT3")colnames(gene_expression)<-samplesrv<-rowVars(gene_expression)select<-order(rv,decreasing = T)[seq_len(min(500,length(rv)))]pc<-prcomp(t(gene_expression[select,]),scale=T)pc$xscores<-data.frame(pc$x,samples)scorespcpcnt<-round(100*pc$sdev^2/sum(pc$sdev^2),1)names(pcpcnt)<-paste0("PC",1:6)pcpcntpoint_colors<-c(rep("red",3),rep("blue",3))plot(pc$x[,1],pc$x[,2], xlab="", ylab="", main="PCA plot for all libraries",xlim=c(min(pc$x[,1])-2,max(pc$x[,1])+2),ylim=c(min(pc$x[,2])-2,max(pc$x[,2])+2),col=point_colors)text(pc$x[,1],pc$x[,2],pos=2,rownames(pc$x), col=c("red", "red","red","blue", "blue", "blue"))

results_transcripts<-stattest(bg,feature="transcript",covariate = 'type',                              getFC=T,meas="FPKM")results_genes<-stattest(bg,feature="gene",covariate="type",getFC=T,meas="FPKM")results_genes["log2fc"]<-log2(results_genes$fc)head(results_genes)g_sign<-subset(results_genes,pval<0.1&abs(log2fc)>0.584)head(g_sign)dim(g_sign)write.csv(g_sign,"results_genes.csv",row.names=F,quote = F)DEG<-results_genesDEG$change<-as.factor(ifelse(DEG$pval<0.05&abs(DEG$log2fc)>0.5,                             ifelse(DEG$log2fc>0.5,"UP","DOWN"),                             "NOT"))DEG<-na.omit(DEG)ggplot(data=DEG,aes(x=log2fc,y=-log10(pval),color=change))+  geom_point(alpha=0.4,size=1.75)+  labs(x="log2 fold change")+ ylab("-log10 pvalue")+  theme_bw(base_size = 20)+  theme(plot.title = element_text(size=15,hjust=0.5))+  scale_color_manual(values=c('blue','black','red'))

接下来还有GO注释和网络分析的内容，另外找时间来做了

简单总结

能够运行完基本流程，但是stringtie做了啥，ballgown做了啥，还有对应的参数是什么意思暂时还不太清楚，还的话时间看这两个软件！

欢迎大家关注我的公众号

小明的数据分析笔记本

中国加油！武汉加油！

上一篇：一篇文章带你完全明白序列化与反序列化的过程

下一篇：如何轻松打造大模型优质提示词？大赛优胜者的实战经验深度分享（升级版）