Python library for large tab/comma-delimited text files


I have big genomic data files to analyze. They come in two forms. One is an individual dosage file like this:

id                      snp1    snp2    snp3    snp4    snp5    snp6
rs1->1000001    dose    1.994   1.998   1.998   1.998   1.830   1.335
rs1->1000002    dose    1.291   1.998   1.998   1.998   1.830   1.335
rs1->100001     dose    1.992   1.998   1.998   1.998   1.830   1.335
rs1->100002     dose    1.394   1.998   1.998   1.998   1.830   1.335
rs1->10001      dose    1.994   1.998   1.998   1.998   1.830   1.335
rs1->1001001    dose    1.904   1.998   1.998   1.998   1.830   1.335
rs1->1002001    dose    1.094   1.998   1.998   1.998   1.830   1.335
rs1->1003001    dose    1.994   1.998   1.998   1.998   1.830   1.335
rs1->1004001    dose    1.994   1.998   1.998   1.998   1.830   1.335
rs1->1005002    dose    1.994   1.998   1.998   1.998   1.830   1.335

The other contains summary info:

snp         al1 al2 freq1   maf     quality rsq
22_16050607 g       0.99699 0.00301 0.99699 0.00000
22_16050650 c   t   0.99900 0.00100 0.99900 0.00000
22_16051065 g       0.99900 0.00100 0.99900 0.00000
22_16051134     g   0.99900 0.00100 0.99900 0.00000
rs62224609  t   c   0.91483 0.08517 0.91483 -0.00000
rs62224610  g   c   0.66733 0.33267 0.66733 0.00000
22_16051477 c       0.99399 0.00601 0.99399 -0.00000
22_16051493 g       0.99900 0.00100 0.99900 -0.00000
22_16051497     g   0.64529 0.35471 0.64529 0.00000

The snp column in the second file corresponds to snp1, snp2, ... in the first file. I need to use the summary info in the second file for quality checks and selection, and then apply statistical analysis to the data in the first file accordingly.

The question is: is there a Python library suitable for this task? Performance is vital here, because these are huge files. Thanks!

For handling large files with high performance and efficient data manipulation, pandas is hard to beat.

The following code reads a file into a DataFrame and allows easy manipulation:

import pandas as pd
data = 'my_data.csv'
df = pd.read_csv(data)

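Now df is a DataFrame containing your data. Note that read_csv assumes a comma delimiter by default. For a tab-delimited file, pass sep='\t'; alternatively, pass sep=None together with engine='python' and pandas will try to sniff the delimiter, at the cost of using the slower Python parser.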
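To tie this back to the two files in the question, here is a rough sketch of one way to do the selection step: filter the summary table first, then read only the passing SNP columns from the dosage file in chunks. Several things here are assumptions rather than anything stated in the question: both files are taken to be tab-delimited, the SNP columns are matched to the summary rows by position, the thresholds (min_rsq, min_maf) are placeholders, the helper names passing_snps and iter_dosages are made up, "kind" is a made-up name for the column holding the literal "dose" field, and the inline data is toy data for illustration only.

```python
import io
import pandas as pd


def passing_snps(summary, min_rsq=0.3, min_maf=0.01):
    """Return names of SNPs meeting the (placeholder) QC thresholds."""
    keep = (summary["rsq"] >= min_rsq) & (summary["maf"] >= min_maf)
    return summary.loc[keep, "snp"].tolist()


def iter_dosages(dosage_file, all_snps, keep_snps, sep="\t", chunksize=1000):
    """Yield DataFrames (individuals x selected SNPs), one chunk at a time."""
    # The file's own header is skipped; columns are named by position using
    # the SNP order from the summary file. "kind" is a hypothetical name for
    # the second field, which holds the literal string "dose".
    names = ["id", "kind"] + list(all_snps)
    reader = pd.read_csv(
        dosage_file,
        sep=sep,
        header=None,
        skiprows=1,
        names=names,
        usecols=["id"] + list(keep_snps),  # parse only the columns we need
        index_col="id",
        dtype={snp: "float32" for snp in keep_snps},  # halves memory vs float64
        chunksize=chunksize,
    )
    yield from reader


# Toy stand-ins for the two files (made-up values, not real data).
summary_text = (
    "snp\tal1\tal2\tfreq1\tmaf\tquality\trsq\n"
    "snpA\tG\tA\t0.90\t0.10\t0.95\t0.80\n"
    "snpB\tC\tT\t0.999\t0.001\t0.999\t0.00\n"
    "snpC\tT\tC\t0.67\t0.33\t0.70\t0.50\n"
)
dosage_text = (
    "id\tsnp1\tsnp2\tsnp3\n"
    "ind1\tdose\t1.9\t2.0\t0.5\n"
    "ind2\tdose\t1.2\t2.0\t1.5\n"
)

summary = pd.read_csv(io.StringIO(summary_text), sep="\t")
keep = passing_snps(summary)

# Small chunksize only to exercise the chunked path on toy data.
dosages = pd.concat(
    iter_dosages(io.StringIO(dosage_text), summary["snp"], keep, chunksize=1)
)
print(dosages)
```

With real files you would pass file paths instead of the StringIO objects, and you may prefer to run the statistics on each chunk as it arrives rather than concatenating everything, so that the full dosage matrix never has to sit in memory at once.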

