ANNOVAR in Detail (4): Filter-Based Annotation

最编程 2024-08-04 19:33:41

...

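Filter-based annotation annotates variants by filtering them against databases: each variant is looked up in a chosen database, and the matching record (for example an allele frequency, a prediction score or an identifier) is reported.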
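As a rough illustration of the lookup idea, the sketch below matches each variant against a small in-memory table keyed by chromosome, start, end, reference allele and alternative allele. The function name `filter_annotate`, the toy database and its values are hypothetical; this is not ANNOVAR's own implementation, only a minimal model of exact-match filtering.

```python
from typing import Dict, List, Optional, Tuple

# A variant key: (chromosome, start, end, reference allele, alternative allele).
Variant = Tuple[str, int, int, str, str]


def filter_annotate(
    variants: List[Variant], database: Dict[Variant, float]
) -> List[Tuple[Variant, Optional[float]]]:
    """Pair every variant with its database value, or None if it is absent."""
    return [(variant, database.get(variant)) for variant in variants]


# Toy database with a single made-up record (hypothetical values).
toy_db = {("1", 1000, 1000, "A", "G"): 0.12}
queries = [("1", 1000, 1000, "A", "G"), ("1", 1000, 1000, "A", "T")]

for variant, value in filter_annotate(queries, toy_db):
    print(variant, value)
```

Only the first query is annotated, because the second differs in its alternative allele even though the position is the same.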

For frequency of variants in whole-genome data:

1000g2015aug: latest 1000 Genomes Project dataset with allele frequencies in six populations including ALL, AFR (African), AMR (Admixed American), EAS (East Asian), EUR (European), SAS (South Asian). These are whole-genome variants.

kaviar_20150923: latest Kaviar database with 170 million variants from 13K genomes and 64K exomes.

hrcr1: latest Haplotype Reference Consortium (HRC) database with 40 million variants from 32K samples.

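cg69: allele frequency in 69 human subjects sequenced by Complete Genomics. Complete Genomics provides whole-genome data for a relatively small group of healthy subjects. ANNOVAR currently offers cg46 and cg69, which hold allele frequency data from 46 unrelated subjects and from 69 related subjects (including the 46 unrelated subjects), respectively. Note that the 46 subjects carry only 92 autosomes.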

gnomad_genome: allele frequency in gnomAD database whole-genome sequence data on multiple populations.

For frequency of variants in whole-exome data:

exac03: latest Exome Aggregation Consortium dataset with allele frequencies in ALL, AFR (African), AMR (Admixed American), EAS (East Asian), FIN (Finnish), NFE (Non-Finnish European), OTH (other), SAS (South Asian).

esp6500siv2: latest NHLBI-ESP project with 6500 exomes. Three separate keywords are used for the three population groupings: esp6500siv2_all, esp6500siv2_ea, esp6500siv2_aa.

gnomad_exome: allele frequency in gnomAD database whole-exome sequence data on multiple populations.

For frequency of variants in isolated or underrepresented populations:

ajews: common alleles in Ashkenazi Jews

TMC-SNPDB: common alleles in Indian populations

gme: GME (Greater Middle East Variome) allele frequency, including ALL, NWA (Northwest Africa), NEA (Northeast Africa), AP (Arabian Peninsula), Israel, SD (Syrian Desert), TP (Turkish Peninsula) and CA (Central Asia).
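Population frequencies from the databases listed above are commonly used to keep only rare variants. The sketch below shows one way this could be done on a single annotated row; the column names (`ExAC_ALL`, `1000g2015aug_all`), the "." missing-value string and the 1% threshold are assumptions for illustration and should be checked against the actual output file.

```python
from typing import Dict, List

MISSING = {".", ""}  # assumed placeholders for "not found in the database"


def max_frequency(row: Dict[str, str], freq_columns: List[str]) -> float:
    """Return the highest observed frequency; missing values count as 0."""
    values = [
        float(row[col])
        for col in freq_columns
        if row.get(col, ".") not in MISSING
    ]
    return max(values) if values else 0.0


def is_rare(row: Dict[str, str], freq_columns: List[str], threshold: float = 0.01) -> bool:
    """A variant is kept when no listed population frequency reaches the threshold."""
    return max_frequency(row, freq_columns) < threshold


columns = ["ExAC_ALL", "1000g2015aug_all"]  # hypothetical column names
print(is_rare({"ExAC_ALL": "0.0005", "1000g2015aug_all": "."}, columns))
print(is_rare({"ExAC_ALL": "0.2", "1000g2015aug_all": "0.15"}, columns))
```

Treating a variant that is absent from every database as rare is a design choice: absence usually means the allele was never observed, but it can also mean the site was not covered.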

For functional prediction of variants in whole-genome data:

gerp++: functional prediction scores for 9 billion mutations based on selective constraints across the human genome. You can optionally use gerp++gt2 instead, which includes only sites with an RS score greater than 2; this provides high sensitivity while still strongly enriching for truly constrained sites.

cadd: Combined Annotation Dependent Depletion score for 9 billion mutations. It is basically constructed by a support vector machine trained to differentiate 14.7 million high-frequency human-derived alleles from 14.7 million simulated variants, using ~70 different features. For known indels, use caddindel.

cadd13: CADD version 1.3.

dann: functional prediction score generated by deep learning, trained on the same data as cadd but reported to perform much better than cadd.

fathmm: a hidden Markov model to predict the functional importance of both coding and non-coding variants (that is, two separate scores are provided) on 9 billion mutations.

eigen: a spectral approach integrating functional genomic annotations for coding and noncoding variants on 9 billion mutations, without labelled training data (that is, an unsupervised approach).

gwava: genome-wide annotation of variants that supports prioritization of noncoding variants by integrating various genomic and epigenomic annotations on 9 billion mutations.
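The gerp++gt2 idea described above, keeping only sites whose RS score exceeds 2, amounts to a simple threshold on a score column. A minimal sketch follows; the function name and the use of "." for a missing score are assumptions for illustration.

```python
def passes_gerp_cutoff(rs_score: str, cutoff: float = 2.0) -> bool:
    """Return True when a GERP++ RS score is present and strictly above the cutoff."""
    if rs_score in (".", ""):  # assumed missing-value placeholders
        return False
    return float(rs_score) > cutoff


for score in ["4.5", "2", "-1.3", "."]:
    print(score, passes_gerp_cutoff(score))
```

The same pattern applies to the other whole-genome scores, although a sensible cutoff differs from one score to another and is not given in this article.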

For functional prediction of variants in whole-exome data:

dbnsfp30a: this dataset already includes SIFT, PolyPhen2 HDIV, PolyPhen2 HVAR, LRT, MutationTaster, MutationAssessor, FATHMM, MetaSVM, MetaLR, VEST, CADD, GERP++, DANN, fitCons, PhyloP and SiPhy scores, but ONLY on coding variants

For functional prediction of splice variants:

dbscsnv11: dbscSNV version 1.1 for splice site prediction by AdaBoost and Random Forest, which score how likely the variant is to affect splicing.

spidex: deep learning based prediction of splice variants. Unlike dbscsnv11, these variants could be far away from canonical splice sites

For disease-specific variants:

clinvar_20160302: ClinVar database with separate columns (CLINSIG CLNDBN CLNACC CLNDSDB CLNDSDBID) for each variant (Please check the download page for the latest version, or read below for creating your own most updated version)

cosmic70: the latest COSMIC database with somatic mutations from cancer and the frequency of occurrence in each subtype of cancer. For more updated cosmic, see instructions below on how to make them.

icgc21: International Cancer Genome Consortium version 21 mutations.

nci60: NCI-60 human tumor cell line panel exome sequencing allele frequency data

For variant identifiers:

snp142: dbSNP version 142

snp138: dbSNP version 138

avsnp142: an abbreviated version of dbSNP 142 with left-normalization by ANNOVAR developers. The avSNP datasets are essentially reformatted dbSNP datasets.
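Several of the databases above are often applied in one pass. The sketch below assembles a `table_annovar.pl` command line in which every listed database is run as a filter operation (`f`). The helper name and the input file, database directory and output prefix are hypothetical, the chosen databases must already be downloaded for the given genome build, and the command is only built here, not executed.

```python
from typing import List


def build_table_annovar_cmd(
    avinput: str,
    humandb: str,
    buildver: str,
    out_prefix: str,
    filter_dbs: List[str],
) -> List[str]:
    """Build an argument list that runs each database as a filter-based step."""
    return [
        "table_annovar.pl",
        avinput,
        humandb,
        "-buildver", buildver,
        "-out", out_prefix,
        "-remove",                                   # drop intermediate files
        "-protocol", ",".join(filter_dbs),           # database names, in order
        "-operation", ",".join("f" for _ in filter_dbs),  # one 'f' per database
        "-nastring", ".",                            # string printed for missing values
    ]


cmd = build_table_annovar_cmd(
    "example.avinput", "humandb/", "hg19", "myanno",
    ["exac03", "avsnp142", "dbnsfp30a"],
)
print(" ".join(cmd))
```

Building the command as a list rather than a single string keeps it safe to hand to `subprocess.run` without shell quoting issues.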

LJB* (dbNSFP) non-synonymous variants annotation

As of 2017 the database is dbnsfp33a; earlier versions were ljb26, ljb23, ljb2 and ljb. The output includes the following columns: SIFT_score SIFT_pred Polyphen2_HDIV_score Polyphen2_HDIV_pred Polyphen2_HVAR_score Polyphen2_HVAR_pred LRT_score LRT_pred MutationTaster_score MutationTaster_pred MutationAssessor_score MutationAssessor_pred FATHMM_score FATHMM_pred PROVEAN_score PROVEAN_pred VEST3_score CADD_raw CADD_phred DANN_score fathmm-MKL_coding_score fathmm-MKL_coding_pred MetaSVM_score MetaSVM_pred MetaLR_score MetaLR_pred integrated_fitCons_score integrated_confidence_value GERP++_RS phyloP7way_vertebrate phyloP20way_mammalian phastCons7way_vertebrate phastCons20way_mammalian SiPhy_29way_logOdds
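One common way to use the `*_pred` columns listed above is to count how many predictors call a variant damaging. The sketch below does this for a single row. The mapping of prediction codes to "damaging" (for example `D` for SIFT, `A` or `D` for MutationTaster, `H` or `M` for MutationAssessor) follows the usual dbNSFP conventions as far as I know, but it is an assumption here and should be verified against the documentation of the database version in use.

```python
from typing import Dict, Tuple

# Assumed codes that indicate a damaging call for each predictor column.
DAMAGING_CODES = {
    "SIFT_pred": {"D"},
    "Polyphen2_HDIV_pred": {"D", "P"},
    "Polyphen2_HVAR_pred": {"D", "P"},
    "LRT_pred": {"D"},
    "MutationTaster_pred": {"A", "D"},
    "MutationAssessor_pred": {"H", "M"},
    "FATHMM_pred": {"D"},
    "PROVEAN_pred": {"D"},
    "MetaSVM_pred": {"D"},
    "MetaLR_pred": {"D"},
}


def count_damaging(row: Dict[str, str]) -> Tuple[int, int]:
    """Return (damaging calls, predictors with any call) for one annotated row."""
    damaging = 0
    called = 0
    for column, codes in DAMAGING_CODES.items():
        value = row.get(column, ".")
        if value in (".", ""):  # predictor has no call for this variant
            continue
        called += 1
        if value in codes:
            damaging += 1
    return damaging, called


example_row = {
    "SIFT_pred": "D",
    "Polyphen2_HDIV_pred": "B",
    "LRT_pred": ".",
    "MutationTaster_pred": "A",
}
print(count_damaging(example_row))
```

Reporting the number of predictors that made a call alongside the damaging count avoids mistaking missing predictions for benign ones.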

Reference: http://annovar.openbioinformatics.org/en/latest/user-guide/filter/#dbsnp-annotations
