Python library for large tab/comma-delimited text files

I have big genomic data files to analyze. They come in two forms. One is an individual dosage file like this:
id           snp1  snp2  snp3  snp4  snp5  snp6
rs1->1000001 dose 1.994 1.998 1.998 1.998 1.830 1.335
rs1->1000002 dose 1.291 1.998 1.998 1.998 1.830 1.335
rs1->100001  dose 1.992 1.998 1.998 1.998 1.830 1.335
rs1->100002  dose 1.394 1.998 1.998 1.998 1.830 1.335
rs1->10001   dose 1.994 1.998 1.998 1.998 1.830 1.335
rs1->1001001 dose 1.904 1.998 1.998 1.998 1.830 1.335
rs1->1002001 dose 1.094 1.998 1.998 1.998 1.830 1.335
rs1->1003001 dose 1.994 1.998 1.998 1.998 1.830 1.335
rs1->1004001 dose 1.994 1.998 1.998 1.998 1.830 1.335
rs1->1005002 dose 1.994 1.998 1.998 1.998 1.830 1.335
The other contains summary info:
snp         al1 al2 freq1   maf     quality rsq
22_16050607 g       0.99699 0.00301 0.99699 0.00000
22_16050650 c   t   0.99900 0.00100 0.99900 0.00000
22_16051065 g       0.99900 0.00100 0.99900 0.00000
22_16051134 g       0.99900 0.00100 0.99900 0.00000
rs62224609  t   c   0.91483 0.08517 0.91483 -0.00000
rs62224610  g   c   0.66733 0.33267 0.66733 0.00000
22_16051477 c       0.99399 0.00601 0.99399 -0.00000
22_16051493 g       0.99900 0.00100 0.99900 -0.00000
22_16051497 g       0.64529 0.35471 0.64529 0.00000
The snp column in the second file corresponds to snp1, snp2, ... in the first file. I need to use the summary info in the second file for quality checking and SNP selection, and then apply statistical analysis to the data in the first file accordingly.

The question is: is there a Python library suitable for this task? Performance is vital here, because these are huge files. Thanks!
For dealing with large files and high-performance, efficient data manipulation, there is no better module than pandas.

The following code will read a file into a DataFrame and allow easy manipulation:

import pandas as pd

data = 'my_data.csv'
df = pd.read_csv(data)
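Since the files are huge, it may also be worth reading them in chunks rather than all at once, which keeps memory use bounded. A minimal sketch (the inline CSV text here is a small stand-in for a real dosage file):

```python
import io
import pandas as pd

# Tiny example standing in for a huge file; with a real file you would
# pass a filename instead of the StringIO object.
csv_text = "id,snp1,snp2\nA,1.994,1.998\nB,1.291,1.998\nC,1.992,1.830\n"

# chunksize makes read_csv yield DataFrames of at most that many rows.
total_rows = 0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    total_rows += len(chunk)  # each chunk is a regular DataFrame

print(total_rows)  # 3
```

Each chunk supports the same operations as a full DataFrame, so per-SNP statistics can be accumulated incrementally.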
Now df is an efficient DataFrame containing your data. One caveat: read_csv assumes comma-separated data by default. For a tab- or whitespace-delimited file, pass sep='\t' or sep=r'\s+', or pass sep=None with engine='python' to have pandas "sniff" the delimiter for you.
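Putting it together for your two files, one possible workflow is: read the summary file, filter SNPs by a quality criterion, and then load only the surviving columns from the dosage file via usecols. This is a sketch with made-up small data and an arbitrary maf cutoff of 0.05, not your real filenames or thresholds:

```python
import io
import pandas as pd

# Summary file (whitespace-delimited); sep=r'\s+' handles runs of spaces/tabs.
summary_text = """snp al1 al2 freq1 maf quality rsq
snp1 g t 0.99699 0.00301 0.99699 0.00000
snp2 c t 0.91483 0.08517 0.91483 0.00000
snp3 g c 0.66733 0.33267 0.66733 0.00000
"""
summary = pd.read_csv(io.StringIO(summary_text), sep=r'\s+')

# Quality check / selection: keep SNPs whose minor-allele frequency
# exceeds the (example) threshold.
keep = summary.loc[summary['maf'] > 0.05, 'snp'].tolist()

# Dosage file: load only the id column plus the SNPs that passed QC.
# usecols avoids parsing the columns you are going to discard anyway.
dosage_text = """id snp1 snp2 snp3
s1 1.994 1.998 1.830
s2 1.291 1.998 1.335
"""
dosage = pd.read_csv(io.StringIO(dosage_text), sep=r'\s+',
                     usecols=['id'] + keep)

print(keep)                  # ['snp2', 'snp3']
print(list(dosage.columns))  # ['id', 'snp2', 'snp3']
```

From there the filtered dosage DataFrame is ready for whatever statistical analysis you need, and the same pattern scales to the full files.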