regex - Python - Extract and reformat field names from rows of data
I have a flat text file (infile) that I need to restructure. It has a few tab-delimited columns and looks like this:
person1	HEIGHT=60;WEIGHT=100;AGE=22
person2	HEIGHT=62;WEIGHT=101;AGE=25
person3	HEIGHT=64;WEIGHT=110;AGE=29
and I want this:
person	HEIGHT	WEIGHT	AGE
1	60	100	22
2	62	101	25
3	64	110	29
As you can see, the second column contains several semicolon-delimited header/value fields, and I want to restructure them into typical column headers with one row of values per line.
Right now I have:
for line in infile:
    line = line.split("\t")
    line_meta = line[1].split(";")
    print(line_meta)
I am thinking the best solution is to loop over the line_meta variable, use regular expressions to detect the header names (strings that start with multiple capital letters and end with "="), add each header as a dictionary key, and store the rest of the string as its value. Then, for the next row, if the same header is detected, append to the existing dictionary.
Can anyone offer code or provide feedback on how to proceed?
Thank you.
Edit: Thank you for the responses. I simplified my data example; here is what one of the actual meta columns looks like (still ;-delimited, but the value types are mixed):
P=0.9626;IPU=.$.+1t.+1t.+;IRF=ncrna;IUC=utr3;IGN=ncrna00115;IGI=ncrna00115,rp11-206l10.16-001;IET=0;IEO=0;IEN=.;IHT=0;IHVC=0;IHD=.;IHI=.;IHN=.;IDI=.;IDN=.;ITMAF=.;ITAMR=.;ITASN=.;ITAFR=.;ITEUR=.;ITNRB=+a;ISF=.;ISD=.;ISM=.;ISX=.;
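The dictionary-of-columns idea described in the question can also be sketched without a regular expression, by splitting each field on its first "=". This is only an illustration: the function name collect_columns is hypothetical, and it assumes every line has exactly two tab-delimited columns and that every row carries the same keys.

```python
from collections import defaultdict


def collect_columns(lines):
    """Build {header: [one value per row]} from 'name<TAB>K=V;K=V' lines."""
    columns = defaultdict(list)
    for line in lines:
        # Assumption: the first tab separates the person from the meta column.
        person, _, meta = line.rstrip("\n").partition("\t")
        columns["person"].append(person)
        for field in meta.split(";"):
            if not field:
                continue  # a trailing ';' leaves an empty field behind
            key, _, value = field.partition("=")
            columns[key].append(value)
    return dict(columns)


sample = [
    "person1\tHEIGHT=60;WEIGHT=100;AGE=22\n",
    "person2\tHEIGHT=62;WEIGHT=101;AGE=25\n",
    "person3\tHEIGHT=64;WEIGHT=110;AGE=29;\n",
]
columns = collect_columns(sample)
print(columns["HEIGHT"])  # ['60', '62', '64']
```

If some rows lack a key, the per-key lists no longer line up row by row, which is one reason to prefer building one dictionary per row.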
You can use one regular expression to split out the key=value pairs:
import re

key_value = re.compile(r'(?P<key>[A-Z]+)=(?P<value>[^\s=;]+)(?:(?=;)|$)')
This expression uses named groups; here it is without them, if you find that easier to read:
key_value = re.compile(r'([A-Z]+)=([^\s=;]+)(?:(?=;)|$)')
The (?:...) group is a non-capturing group; it is used here to demarcate what the | "or" symbol applies to. The pattern matches uppercase characters before the = symbol, followed by a value made of anything that is not whitespace, = or ;, provided there is a ; or the end of the string right after the value.
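The behaviour described above can be checked on a few small inputs. The strings below are made up for illustration; only the pattern itself comes from the answer.

```python
import re

key_value = re.compile(r'([A-Z]+)=([^\s=;]+)(?:(?=;)|$)')

# The ';' is only looked at by (?=;), not consumed, so the next pair still matches.
plain = key_value.findall('HEIGHT=60;WEIGHT=100;AGE=22')
print(plain)  # [('HEIGHT', '60'), ('WEIGHT', '100'), ('AGE', '22')]

# Without re.MULTILINE, '$' also matches just before a trailing newline.
with_newline = key_value.findall('AGE=22\n')
print(with_newline)  # [('AGE', '22')]

# The leading column contains no KEY= text, so it is skipped.
with_person = key_value.findall('person1\tHEIGHT=60')
print(with_person)  # [('HEIGHT', '60')]

# A value followed by whitespace instead of ';' or end of string is rejected.
rejected = key_value.findall('HEIGHT=60 cm;AGE=22')
print(rejected)  # [('AGE', '22')]
```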
The pattern splits out the keys and values for each line:
>>> key_value = re.compile(r'(?P<key>[A-Z]+)=(?P<value>[^\s=;]+)(?:(?=;)|$)')
>>> key_value.findall('person1\tHEIGHT=60;WEIGHT=100;AGE=22')
[('HEIGHT', '60'), ('WEIGHT', '100'), ('AGE', '22')]
This can then be turned into a dictionary:
>>> dict(key_value.findall('person1\tHEIGHT=60;WEIGHT=100;AGE=22'))
{'AGE': '22', 'WEIGHT': '100', 'HEIGHT': '60'}
You can write these out with, for example, csv.DictWriter():
import csv
import re

key_value = re.compile(r'(?P<key>[A-Z]+)=(?P<value>[^\s=;]+)(?:(?=;)|$)')

# 'wb' is the Python 2 mode for csv output; on Python 3 use
# open(outputfilename, 'w', newline='') instead.
with open(inputfilename) as infile, open(outputfilename, 'wb') as outfile:
    writer = csv.DictWriter(outfile, ('person', 'HEIGHT', 'WEIGHT', 'AGE'), delimiter='\t')
    writer.writeheader()
    for line in infile:
        # The first tab-delimited column holds the person identifier.
        person = line.split('\t', 1)[0]
        row = dict(key_value.findall(line))
        row['person'] = person
        writer.writerow(row)
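When the set of keys is not known up front, as with the real meta column in the question, one option is to collect the field names while reading and write everything at the end. The following is a minimal Python 3 sketch under that assumption: the function name restructure and the sample rows are hypothetical, and an in-memory buffer stands in for the output file.

```python
import csv
import io
import re

key_value = re.compile(r'(?P<key>[A-Z]+)=(?P<value>[^\s=;]+)(?:(?=;)|$)')


def restructure(lines, outfile):
    """Write one tab-delimited row per input line, one column per key seen."""
    rows = []
    fieldnames = ['person']
    for line in lines:
        row = dict(key_value.findall(line))
        row['person'] = line.split('\t', 1)[0]
        # Remember each new key in first-seen order for the header row.
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)
        rows.append(row)
    # restval fills the cell for any key a given row does not have.
    writer = csv.DictWriter(outfile, fieldnames, delimiter='\t',
                            restval='', lineterminator='\n')
    writer.writeheader()
    writer.writerows(rows)


sample = [
    'person1\tHEIGHT=60;WEIGHT=100;AGE=22\n',
    'person2\tHEIGHT=62;AGE=25\n',
]
buffer = io.StringIO()
restructure(sample, buffer)
output = buffer.getvalue()
print(output)
```

This keeps all rows in memory until the header is known. For a real file on Python 3, the buffer would be replaced by open(outputfilename, 'w', newline='').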
A demo based on your real data sample:
>>> dict(key_value.findall(' P=0.9626;IPU=.$.+1t.+1t.+;IRF=ncrna;IUC=utr3;IGN=ncrna00115;IGI=ncrna00115,rp11-206l10.16-001;IET=0;IEO=0;IEN=.;IHT=0;IHVC=0;IHD=.;IHI=.;IHN=.;IDI=.;IDN=.;ITMAF=.;ITAMR=.;ITASN=.;ITAFR=.;ITEUR=.;ITNRB=+a;ISF=.;ISD=.;ISM=.;ISX=.;\n'))
{'ISX': '.', 'ITAMR': '.', 'IDN': '.', 'ISM': '.', 'IDI': '.', 'ISF': '.', 'ISD': '.', 'ITMAF': '.', 'IUC': 'utr3', 'IGI': 'ncrna00115,rp11-206l10.16-001', 'ITNRB': '+a', 'IHVC': '0', 'IET': '0', 'ITASN': '.', 'ITEUR': '.', 'ITAFR': '.', 'IEO': '0', 'IEN': '.', 'IGN': 'ncrna00115', 'IRF': 'ncrna', 'P': '0.9626', 'IHT': '0', 'IHI': '.', 'IHN': '.', 'IPU': '.$.+1t.+1t.+', 'IHD': '.'}
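Every value captured by the pattern is a string, even though the real data mixes value types. If numeric values are wanted, a conversion step could be applied to each row afterwards. The helper below is a hypothetical sketch, not part of the answer, and it leaves anything that does not parse as a number (including the "." entries) untouched.

```python
def convert(value):
    """Return an int or float when the string parses as one, else the string."""
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            pass
    return value


# A few pairs taken from the sample row, all strings as findall() returns them.
row = {'P': '0.9626', 'IET': '0', 'IEN': '.', 'ITNRB': '+a'}
typed = {key: convert(value) for key, value in row.items()}
print(typed)  # {'P': 0.9626, 'IET': 0, 'IEN': '.', 'ITNRB': '+a'}
```

Note that float() also accepts strings such as "nan" and "inf", so a stricter check may be needed if those can appear as ordinary text values.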