欢迎您访问 最编程 本站为您分享编程语言代码,编程技术文章!
您现在的位置是: 首页

2021-03-17 在 linux 上将 vcf 文件格式化为 plink bed、bim、fam

最编程 2024-03-15 10:22:14
...

我现在的问题是这样的
vcf文件转plink的格式
方法一
vcftools

[lyc@200server ~]$ vcftools --vcf Rice.recode.vcf --plink --out output

出错这样


VCFtools - 0.1.17
(C) Adam Auton and Anthony Marcketta 2009

Parameters as interpreted:
    --vcf Rice.recode.vcf
    --out output
    --plink

Warning: Expected at least 2 parts in FORMAT entry: ID=RNC,Number=2,Type=Character,Description="Reason for No Call in GT: . = n/a, M = Missing data, P = Partial data, I = gVCF input site is non-called, D = insufficient Depth of coverage, - = unrepresentable overlapping deletion, L = Lost/unrepresentable allele (other than deletion), U = multiple Unphased variants present, O = multiple Overlapping variants present, 1 = site is Monoallelic, no assertion about presence of REF or ALT allele">
After filtering, kept 141 out of 141 Individuals
Writing PLINK PED and MAP files ... 

Unrecognized values used for CHROM: Chr1 - Replacing with 0.
ls

ls
clear

Unrecognized values used for CHROM: Chr2 - Replacing with 0.
Expected at least 2 parts in FORMAT entry: ID=RNC,Number=2,Type=Character,Description="Reason for No Call in GT: . = n/a, M = Missing data, P = Partial data, I = gVCF input site is non-called, D = insufficient Depth of coverage, - = unrepresentable overlapping deletion, L = Lost/unrepresentable allele (other than deletion), U = multiple Unphased variants present, O = multiple Overlapping variants present, 1 = site is Monoallelic, no assertion about presence of REF or ALT allele">
Unrecognized values used for CHROM: Chr3 - Replacing with 0.

Unrecognized values used for CHROM: Chr4 - Replacing with 0.

Unrecognized values used for CHROM: Chr5 - Replacing with 0.

Unrecognized values used for CHROM: Chr6 - Replacing with 0.

Unrecognized values used for CHROM: Chr7 - Replacing with 0.

Unrecognized values used for CHROM: Chr8 - Replacing with 0.

Unrecognized values used for CHROM: Chr9 - Replacing with 0.

Unrecognized values used for CHROM: Chr10 - Replacing with 0.

Unrecognized values used for CHROM: Chr11 - Replacing with 0.

Unrecognized values used for CHROM: Chr12 - Replacing with 0.

Unrecognized values used for CHROM: ChrUn - Replacing with 0.

Unrecognized values used for CHROM: ChrSy - Replacing with 0.

Unrecognized values used for CHROM: chrC - Replacing with 0.
Done.
After filtering, kept 7186300 out of a possible 7186300 Sites
Run Time = 1647.00 seconds
[lyc@200server ~]$ ls
file.log            output.ped                       prettify
file-temporary.bed  output-temporary.bed             Rice.recode.vcf
file-temporary.bim  output-temporary.bim             test
file-temporary.fam  output-temporary.fam             toy.map
LICENSE             plink                            toy.ped
output.log          plink_linux_x86_64_20201019.zip
output.map          ppp
[lyc@200server ~]$ 
[lyc@200server ~]$ ls
file.log            output.ped                       prettify
file-temporary.bed  output-temporary.bed             Rice.recode.vcf
file-temporary.bim  output-temporary.bim             test
file-temporary.fam  output-temporary.fam             toy.map
LICENSE             plink                            toy.ped
output.log          plink_linux_x86_64_20201019.zip
output.map          ppp
[lyc@200server ~]$ clear

[lyc@200server ~]$ Expected at least 2 parts in FORMAT entry: ID=RNC,Number=2,Type=Character,Description="Reason for No Call in GT: . = n/a, M = Missing data, P = Partial data, I = gVCF input site is non-called, D = insufficient Depth of coverage, - = unrepresentable overlapping deletion, L = Lost/unrepresentable allele (other than deletion), U = multiple Unphased variants present, O = multiple Overlapping variants present, 1 = site is Monoallelic, no assertion about presence of REF or ALT allele">
-bash: 未预期的符号 `newline' 附近有语法错误

方法二plink

plink --vcf Rice.recode.vcf --recode --out file
plink --vcf Rice.recode.vcf --recode --out output --double-id
plink --vcf Rice.recode.vcf --recode --out output --const-fid family_id

都是一样的结果

773821 MB RAM detected; reserving 386910 MB for main workspace.
Error: Line 38 of .vcf file has a GT half-call.
Use --vcf-half-call to specify how these should be processed.

就无语
之前我的文件有这些

file.log            LICENSE     plink                            Rice.recode.vcf
file-temporary.bed  output.log  plink_linux_x86_64_20201019.zip  test
file-temporary.bim  output.map  ppp                              toy.map
file-temporary.fam  output.ped  prettify                         toy.ped

现在是这些

file.log            output.ped                       prettify
file-temporary.bed  output-temporary.bed             Rice.recode.vcf
file-temporary.bim  output-temporary.bim             test
file-temporary.fam  output-temporary.fam             toy.map
LICENSE             plink                            toy.ped
output.log          plink_linux_x86_64_20201019.zip
output.map          ppp

然后看到了这些

image.png

https://www.cog-genomics.org/plink/1.9/input
然后我就重新安装了plink为plink2
http://www.cog-genomics.org/plink/2.0/
linux上安装Plink
https://blog.****.net/qq_40605470/article/details/108882992
但是又遇到了新的问题

[lyc@200server ~]$ plink2 --vcf Rice.recode.vcf --recode --out ccc
PLINK v2.00a3LM 64-bit Intel (2 Mar 2021)      www.cog-genomics.org/plink/2.0/
(C) 2005-2021 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to ccc.log.
Options in effect:
  --export ped
  --out ccc
  --vcf Rice.recode.vcf

Start time: Thu Mar 18 20:20:58 2021
773821 MiB RAM detected; reserving 386910 MiB for main workspace.
Using up to 80 threads (change this with --threads).
--vcf: 7146k variants scanned.
Error: Invalid chromosome code 'ChrUn' on line 7146662 of --vcf file.
(Use --allow-extra-chr to force it to be accepted.)
End time: Thu Mar 18 20:21:32 2021

然后发现文件又是临时文件

[lyc@200server ~]$ ls
ccc.log               plink2
ccc-temporary.psam    plink2_linux_x86_64_20210302.zip
file.log              plink2.log
file-temporary.bed    plink_linux_x86_64_20201019.zip
file-temporary.bim    ppp
file-temporary.fam    prettify
LICENSE               Rice.recode.vcf
output.log            test
output.map            toy.map
output.ped            toy.ped
output-temporary.bed  transform.log
output-temporary.bim  transform-temporary.bed
output-temporary.fam  transform-temporary.bim
plink                 transform-temporary.fam

说明又出错了
所以,现在或许又要解决那一行的无效染色体
然后我就使用了--allow-extra-chr

[lyc@200server ~]$ plink2 --vcf Rice.recode.vcf --recode --out ccc --allow-extra-chr
PLINK v2.00a3LM 64-bit Intel (2 Mar 2021)      www.cog-genomics.org/plink/2.0/
(C) 2005-2021 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to ccc.log.
Options in effect:
  --allow-extra-chr
  --export ped
  --out ccc
  --vcf Rice.recode.vcf

Start time: Fri Mar 19 16:31:21 2021
773821 MiB RAM detected; reserving 386910 MiB for main workspace.
Using up to 80 threads (change this with --threads).
--vcf: 7186300 variants scanned.

Error: Line 38 of --vcf file has a GT half-call.
Use --vcf-half-call to specify how these should be processed.
End time: Fri Mar 19 16:32:26 2021

哈,又回到了最初的起点


image.png

可是我看意思,plink2不应该有这个问题啊

[lyc@200server ~]$ plink2 --vcf Rice.recode.vcf --recode --out ccc --allow-extra-chr --vcf-half-call
PLINK v2.00a3LM 64-bit Intel (2 Mar 2021)      www.cog-genomics.org/plink/2.0/
(C) 2005-2021 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to ccc.log.
Options in effect:
  --allow-extra-chr
  --export ped
  --out ccc
  --vcf Rice.recode.vcf
  --vcf-half-call

Start time: Fri Mar 19 16:33:16 2021
Error: Missing --vcf-half-call argument.
For more info, try "plink2 --help <flag name>" or "plink2 --help | more".

应该是 --vcf-half-call的格式不对
然后换回plink1
重看官网
以及https://www.jianshu.com/p/38e603979334?utm_campaign=maleskine&utm_content=note&utm_medium=seo_notes&utm_source=recommendation
发现格式

[lyc@200server ~]$  plink --vcf Rice.recode.vcf --allow-extra-chr --recode --vcf-half-call'missing' --out eee
PLINK v1.90b6.21 64-bit (19 Oct 2020)          www.cog-genomics.org/plink/1.9/
(C) 2005-2020 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to eee.log.
Options in effect:
  --allow-extra-chr
  --out eee
  --recode
  --vcf Rice.recode.vcf
  --vcf-half-callmissing

Error: Unrecognized flag ('--vcf-half-callmissing').
For more information, try "plink --help <flag name>" or "plink --help | more".

发现原来是因为少了空格

[lyc@200server ~]$  plink --vcf Rice.recode.vcf --allow-extra-chr --recode --vcf-half-call 'haploid' --out eee
PLINK v1.90b6.21 64-bit (19 Oct 2020)          www.cog-genomics.org/plink/1.9/
(C) 2005-2020 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to eee.log.
Options in effect:
  --allow-extra-chr
  --out eee
  --recode
  --vcf Rice.recode.vcf
  --vcf-half-call haploid

773821 MB RAM detected; reserving 386910 MB for main workspace.
--vcf: eee-temporary.bed + eee-temporary.bim + eee-temporary.fam written.
7186300 variants loaded from .bim file.
141 people (0 males, 0 females, 141 ambiguous) loaded from .fam.
Ambiguous sex IDs written to eee.nosex .
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 141 founders and 0 nonfounders present.
Calculating allele frequencies... done.
Total genotyping rate is 0.878065.
7186300 variants and 141 people pass filters and QC.
Note: No phenotypes present.
--recode ped to eee.ped + eee.map ... done.

终于成功,哭了
https://www.jianshu.com/p/dc82fcbe3cda然后意识到自己搜索关键词有问题

[lyc@200server ~]$ plink --file eee --allow-extra-chr --make-bed --out rice
PLINK v1.90b6.21 64-bit (19 Oct 2020)          www.cog-genomics.org/plink/1.9/
(C) 2005-2020 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to rice.log.
Options in effect:
  --allow-extra-chr
  --file eee
  --make-bed
  --out rice

773821 MB RAM detected; reserving 386910 MB for main workspace.
.ped scan complete (for binary autoconversion).
Performing single-pass .bed write (7186300 variants, 141 people).
--file: rice-temporary.bed + rice-temporary.bim + rice-temporary.fam written.
7186300 variants loaded from .bim file.
141 people (0 males, 0 females, 141 ambiguous) loaded from .fam.
Ambiguous sex IDs written to rice.nosex .
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 141 founders and 0 nonfounders present.
Calculating allele frequencies... done.
Total genotyping rate is 0.878065.
7186300 variants and 141 people pass filters and QC.
Note: No phenotypes present.
--make-bed to rice.bed + rice.bim + rice.fam ... done.

这回终于对了


image.png

但问题是第一种转的,不知道应该是哪一种


image.png

我直接用了
我发现用plink2依然也还是这个问题,我看到一个答案貌似说是因为版本还不够新,今年初的那个改进版本或许可以修正这个问题
[lyc@200server ~]$ plink2 --vcf Rice.recode.vcf --allow-extra-chr --make-bed --out test
PLINK v1.90b6.21 64-bit (19 Oct 2020)          www.cog-genomics.org/plink/1.9/
(C) 2005-2020 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to test.log.
Options in effect:
  --allow-extra-chr
  --make-bed
  --out test
  --vcf Rice.recode.vcf

773821 MB RAM detected; reserving 386910 MB for main workspace.

Error: Line 38 of .vcf file has a GT half-call.
Use --vcf-half-call to specify how these should be processed.