如何使用 CNVkit - 肿瘤基因组测序数据分析专栏

最编程 2024-07-09 20:22:11

...

简介

在进行肿瘤基因组数据分析时，拷贝数变异分析是常用的分析要点之一。CNVkit 则是用于基因组测序 WGS 和 WES 进行 call CNV 的工具之一，基于python 编写，给出的 CNV 结果相对可靠，操作起来也比较简单。

下载安装

推荐使用 conda ，可以解决环境依赖问题：

conda activate wes
conda install cnvkit
cnvkit.py --help

使用方法

该工具的官网教程已经写得很详细，请看：https://cnvkit.readthedocs.io/en/stable/index.html github：https://github.com/etal/cnvkit 可以直接用下面代码，进行 call 肿瘤样本的 somatic CNV：

cnvkit.py batch 5.gatk/Tumor*_bqsr.bam    \
    --normal  5.gatk/Normal*_bqsr.bam     \
    --output-reference ./8.cnv/cnvkit/my_reference.cnn \
    --output-dir ./8.cnv/cnvkit \
    --targets  ${bed}           \
    --fasta ${GENOME}           \
    -p ${nthread}               \
    --method hybrid             \
    --annotate ${refFlat}       \
    --drop-low-coverage         \
    --scatter --diagram         \
    1>./8.cnv/cnvkit/cnvkit.log 2>&1

其中 --targets {bed} 指定 bed 文件，对于 WES 数据，即捕获区间。--fasta {GENOME} 指定参考基因组 fasta 文件，-p {nthread} 指定线程数。而 --annotate {refFlat} 参数是可选的，不加也不影响，其指定注释文件，可以用下面代码下载：

wget -c http://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/refFlat.txt.gz
wget -c http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/refFlat.txt.gz

得到 cns 结果可以提取seg 文件，这里提供两种方法，一种是 cnvkit 自带的，一种是自己写的 awk 命令。

# 方法1
cnvkit.py export seg *bqsr.cns -o gistic.seg
sed 's/_bqsr//' gistic.seg
# 方法2
awk '{print FILENAME"\t"$0}' *bqsr.cns  | grep -v chromosome |sed 's/_bqsr.cns//g' | awk '{print $1"\t"$2"\t"$3"\t"$4"\t"$8"\t"$6}' >final.seg

得到的 seg 文件可以载入到 IGV 中进行可视化，IGV 会对文件后缀名 seg 进行识别。

但是，需要注意的是，cnvkit 最后还有一步 call 。

CNVkit 的子命令 call

cnvkit 最后有一步 call，可以对拷贝数进行基于B等位基因频率、肿瘤倍性和肿瘤纯度、患者性别等指标进行矫正的（非必须），新生成的 cns 文件多了一列为 cn，即绝对拷贝数 copy number：https://cnvkit.readthedocs.io/en/stable/pipeline.html#call

基于B等位基因频率

如果是基于 B 等位基因频率，则需要提供 vcf 文件，肿瘤和正常配对的，在文档中作者给出的是 mutect 得到的 vcf 再进行一定处理：https://cnvkit.readthedocs.io/en/stable/fileformats.html#vcf，处理好了之后，用下面命令运行：

cnvkit.py call Sample.cns -y -v Sample.vcf -o Sample.call.cns

可是问题在于，mutect 的 vcf 是somatic突变，这不太符合常规认知，应该是 germline 突变才比较合理。在 github 上面已经有人提出这一点，但是作者给出的回复仍然没变：https://github.com/etal/cnvkit/issues/601

image.png

其实，在运行 mutect2 的时候，可以选择保留 germline 位点。只需要运行 mutect2 的时候加上这两个参数 --genotype-germline-sites true --genotype-pon-sites true 就行，示例代码：

$GATK --java-options "-Xmx20G -Djava.io.tmpdir=./tmp" Mutect2 \
    -R ${GENOME}                   \
    -I ./5.gatk/${tumor}_bqsr.bam  \
    -I ./5.gatk/${normal}_bqsr.bam \
    -normal ${normal}              \
    -L ${interval}                 \
    --panel-of-normals ${pon}      \
    --germline-resource ${gnomad}  \
    --genotype-germline-sites true \
    --genotype-pon-sites true      \
    --f1r2-tar-gz ./6.mutect/${id}_f1r2.tar.gz \
    -O ./6.mutect/${id}_mutect2.vcf.gz \
    1>./6.mutect/${id}_mutect.log 2>&1

$GATK  --java-options "-Xmx20G -Djava.io.tmpdir=./tmp" FilterMutectCalls \
    -R ${GENOME}                                               \
    -V ./6.mutect/${tumor}_mutect2.vcf.gz                         \
    -O ./6.mutect/${tumor}_somatic.vcf.gz                         \
    1>>./6.mutect/${tumor}_mutect.log 2>&1

对输出得到的 vcf 文件进行一定的处理。这里提供一句代码处理，其中，config 是正常和肿瘤配对：

$ cat  config
Normal1    Tumor1
Normal2    Tumor2
Normal3    Tumor3
Normal4    Tumor4
....

需要的是 Mutect2 做了 FilterMutectCalls 标记的 vcf 文件，注意，只是标记过滤位点，没有真正过滤，对这些 vcf 文件，需要进行一定的处理，以下处理方法是根据cmvkit文档给出的方法，加上个人理解总结出来的，可能不是很成熟，还请读者仔细阅读 cnvkit 的文档：

cat config  | while read line 
do 
  arr=(${line})
  normal=${arr[0]}
  tumor=${arr[1]}

  zcat 6.mutect/${id}_somatic.vcf.gz | \
  awk 'BEGIN{FS="\t";OFS="\t"} {if($0~"^#") {print $0} else {if($7!="PASS"){$7="REJECT"} print $0} }'  > 8.cnv/cnvkit/${id}_cnvkit.vcf
  sed -i '2 i ##FILTER=<ID=REJECT,Description="Not somatic due to normal call frequency or phred likelihoods: tumor: 35, normal 35.">' 8.cnv/cnvkit/${tumor}_cnvkit.vcf
  sed -i "2 i ##PEDIGREE=<Derived=${tumor},Original=${normal}>" 8.cnv/cnvkit/${tumor}_cnvkit.vcf

  cnvkit.py call 8.cnv/cnvkit/${tumor}_bqsr.cns \
  -v 8.cnv/cnvkit/${tumor}_cnvkit.vcf      \
  --drop-low-coverage                   \
  -o 8.cnv/cnvkit/${tumor}_call.cns 
done

基于肿瘤纯度和肿瘤倍性

可以采用临床病理评估出来的肿瘤纯度进行校正：

cnvkit.py call Sample.cns -y -m clonal --purity 0.65 --ploidy 2 -o Sample.call.cns

也可以用其他工具计算出来结果，如 PureCN 、 THetA2、 PyClone或 BubbleTree 等，文档中给出了 THetA2 的教程：https://cnvkit.readthedocs.io/en/stable/heterogeneity.html#tumor-heterogeneity 如果是使用 PureCN ，可以参考：https://bioconductor.org/packages/devel/bioc/vignettes/PureCN/inst/doc/Quick.html#52_Recommended_CNVkit_usage 当然，除了作者给出的这些工具，也可以用 Absolute 或 Sequenza 的结果。需要注意的是，上面代码中的 -m/--method 参数，是指定 call 的方法，有 3 种，threshold,clonal,none

threshold：基于阈值进行计算绝对拷贝数 cn 列的。为默认参数，且默认阈值是：-t=-1.1,-0.4,0.3,0.7

If log2 value is up to	Copy number
-1.1	0
-0.4	1
0.3	2
0.7	3
…	…

clonal：使用给定的肿瘤纯度和肿瘤倍性，将 log2 值转换为绝对拷贝数，然后简单地将绝对拷贝数值四舍五入到整数。该方法适用于生殖系样品、高纯度肿瘤样品（例如细胞系）
none：不计算 cn

除此之外，还可以基于性别对性染色体进行校正，但是由于该参数要实现批处理比较麻烦，所以这里不展示代码。最后校正好了，可以基于最终结果进行绘图可视化：

cat config  | while read line 
do 
  arr=(${line})
  normal=${arr[0]}
  tumor=${arr[1]}

  cnvkit.py scatter                 \
  8.cnv/cnvkit/${tumor}_bqsr.cnr       \
  -s 8.cnv/cnvkit/${tumor}_call.cns    \
  --title ${tumor} --segment-color red \
  --y-max 2  --y-min -2  -m 3       \
  -v 8.cnv/cnvkit/${tumor}_cnvkit.vcf  \
  -o 8.cnv/cnvkit/${tumor}_scatter.png \

  cnvkit.py diagram                 \
  8.cnv/cnvkit/${tumor}_bqsr.cnr       \
  -s 8.cnv/cnvkit/${tumor}_call.cns    \
  --title ${tumor}                     \
  -o 8.cnv/cnvkit/${tumor}_diagram.pdf \
done

上一篇： CNV 数据分析主题

下一篇： TCGA 拷贝数门户网站：肿瘤拷贝数变异数据中心

如何使用 CNVkit - 肿瘤基因组测序数据分析专栏

简介

下载安装

使用方法

CNVkit 的子命令 call

基于B等位基因频率

基于肿瘤纯度和肿瘤倍性

如何使用 CNVkit - 肿瘤基因组测序数据分析专栏

GATK 基因突变流程 - 肿瘤基因组测序数据分析专栏