regex - Python - Extract and reformat field names from rows of data

- August 15, 2014

I have a flat text file (infile) that I want to restructure. It has a few tab-delimited columns and looks like this:

    person1    height=60;weight=100;age=22
    person2    height=62;weight=101;age=25
    person3    height=64;weight=110;age=29

And I want it to look like this:

    person    height    weight    age
    1         60        100       22
    2         62        101       25
    3         64        110       29

As you can see, the second column contains several semicolon-delimited header/value fields, and I want to restructure them into typical column headers and rows.

Right now I have:

    for line in infile:
        line = line.split("\t")
        line_meta = line[1].split(";")
        print(line_meta)

I was thinking the best solution would be to loop over the line_meta variable, use regular expressions to detect the header names (detect strings that start with multiple capital letters and end with "="), add each header as a dictionary key, and store the rest of the string as its value. Then, for the next row, if the same header is detected, append to the existing dictionary.

Can anyone provide code or feedback on how to proceed?

Thank you.

Edit: Thank you for the responses. I simplified the data in my example; here is what one of the actual meta columns looks like (still ;-delimited, but the value types are mixed):

       p=0.9626;ipu=.$.+1t.+1t.+;irf=ncrna;iuc=utr3;ign=ncrna00115;igi=ncrna00115,rp11-206l10.16-001;iet=0;ieo=0;ien=.;iht=0;ihvc=0;ihd=.;ihi=.;ihn=.;idi=.;idn=.;itmaf=.;itamr=.;itasn=.;itafr=.;iteur=.;itnrb=+a;isf=.;isd=.;ism=.;isx=.;

You could use one regular expression to split out the key=value pairs:

    import re

    key_value = re.compile(r'(?P<key>[a-z]+)=(?P<value>[^\s=;]+)(?:(?=;)|$)')

This expression uses named groups; here it is without them, if you find that easier to read:

    key_value = re.compile(r'([a-z]+)=([^\s=;]+)(?:(?=;)|$)')

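The (?:...) group is a non-capturing group; it is used here to demarcate what the | (or) symbol applies to. The pattern matches one or more letters ([a-z]+, matching the case of the sample data shown here) before the = symbol, followed by one or more characters that are not whitespace, = or ;, provided there is a ; or the end of the string right after the value.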
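To make the individual pieces easier to see, the same pattern can be spelled out with re.VERBOSE, which allows a comment next to each part. This is only an illustrative sketch; the sample string is the simplified meta column from the question, and the variable names sample and pairs are made up for the example.

```python
import re

# Same pattern as above, written out in verbose mode so each part is labelled.
key_value = re.compile(r'''
    (?P<key>[a-z]+)        # key: one or more letters
    =                      # literal separator between key and value
    (?P<value>[^\s=;]+)    # value: one or more characters, none of them whitespace, "=" or ";"
    (?:(?=;)|$)            # then either a ";" (looked at, not consumed) or end of string
''', re.VERBOSE)

sample = 'height=60;weight=100;age=22'  # simplified data from the question
pairs = key_value.findall(sample)
print(pairs)
```

Because the ; is only checked by a lookahead and never consumed, it does not end up in any value.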

This splits out the keys and values for each line:

    >>> key_value = re.compile(r'(?P<key>[a-z]+)=(?P<value>[^\s=;]+)(?:(?=;)|$)')
    >>> key_value.findall('person1\theight=60;weight=100;age=22')
    [('height', '60'), ('weight', '100'), ('age', '22')]

This can be turned into a dictionary:

    >>> dict(key_value.findall('person1\theight=60;weight=100;age=22'))
    {'age': '22', 'weight': '100', 'height': '60'}

You can then write these out with, for example, csv.DictWriter():

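    import csv
    import re

    key_value = re.compile(r'(?P<key>[a-z]+)=(?P<value>[^\s=;]+)(?:(?=;)|$)')

    # inputfilename and outputfilename are placeholders for your own paths.
    # On Python 3 the csv module expects newline=''; Python 2 used mode 'wb' instead.
    with open(inputfilename) as infile, open(outputfilename, 'w', newline='') as outfile:
        writer = csv.DictWriter(outfile, ('person', 'height', 'weight', 'age'), delimiter='\t')
        writer.writeheader()

        for line in infile:
            person = line.split('\t', 1)[0]      # first tab-delimited column
            row = dict(key_value.findall(line))  # key=value pairs from the rest of the line
            row['person'] = person
            writer.writerow(row)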
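The fixed field list above suits the simplified example, but csv.DictWriter raises a ValueError by default when a row contains a key that is not among its field names, which the real meta column would trigger. One possible way around this, sketched here on the assumption that the whole file fits in memory, is to collect the keys while parsing and only then write the rows. The function name restructure and the io.StringIO stand-ins for real files are hypothetical and used only for illustration.

```python
import csv
import io
import re

key_value = re.compile(r'(?P<key>[a-z]+)=(?P<value>[^\s=;]+)(?:(?=;)|$)')

def restructure(lines, outfile):
    """Write one tab-delimited row per input line, one column per key seen."""
    rows = []
    fieldnames = ['person']
    for line in lines:
        # First column is the person label, the remainder holds key=value pairs.
        person, _, meta = line.rstrip('\n').partition('\t')
        row = {'person': person}
        row.update(key_value.findall(meta))
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)  # keep columns in first-seen order
        rows.append(row)
    # restval fills in keys that a given row does not have (empty string here).
    writer = csv.DictWriter(outfile, fieldnames, delimiter='\t', restval='')
    writer.writeheader()
    writer.writerows(rows)
    return fieldnames

# In-memory stand-ins for the input and output files.
sample = io.StringIO(
    'person1\theight=60;weight=100;age=22\n'
    'person2\theight=62;weight=101\n'
)
out = io.StringIO()
fields = restructure(sample, out)
print(out.getvalue())
```

For files too large to hold in memory, the same idea could be done in two passes over the input: one to gather the keys, one to write the rows.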

Demo based on the real data sample:

    >>> dict(key_value.findall('       p=0.9626;ipu=.$.+1t.+1t.+;irf=ncrna;iuc=utr3;ign=ncrna00115;igi=ncrna00115,rp11-206l10.16-001;iet=0;ieo=0;ien=.;iht=0;ihvc=0;ihd=.;ihi=.;ihn=.;idi=.;idn=.;itmaf=.;itamr=.;itasn=.;itafr=.;iteur=.;itnrb=+a;isf=.;isd=.;ism=.;isx=.;\n'))
    {'isx': '.', 'itamr': '.', 'idn': '.', 'ism': '.', 'idi': '.', 'isf': '.', 'isd': '.', 'itmaf': '.', 'iuc': 'utr3', 'igi': 'ncrna00115,rp11-206l10.16-001', 'itnrb': '+a', 'ihvc': '0', 'iet': '0', 'itasn': '.', 'iteur': '.', 'itafr': '.', 'ieo': '0', 'ien': '.', 'ign': 'ncrna00115', 'irf': 'ncrna', 'p': '0.9626', 'iht': '0', 'ihi': '.', 'ihn': '.', 'ipu': '.$.+1t.+1t.+', 'ihd': '.'}
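For comparison, the same key=value splitting can be sketched without regular expressions, using only str.split() and str.partition(). The helper name parse_meta is hypothetical, and the input is a shortened version of the real meta column shown above. Note that it is not an exact equivalent of the pattern: it accepts any key text, not just letters, and it keeps everything after the first = as the value.

```python
def parse_meta(meta):
    """Split a 'k1=v1;k2=v2;' string into a dict without regular expressions."""
    fields = {}
    for item in meta.strip().split(';'):
        if not item:
            continue  # skip the empty piece left by a trailing ';'
        key, sep, value = item.partition('=')
        if sep:  # ignore pieces that contain no '='
            fields[key] = value
    return fields

# Shortened version of the real meta column from the question.
parsed = parse_meta('   p=0.9626;irf=ncrna;igi=ncrna00115,rp11-206l10.16-001;ien=.;\n')
print(parsed)
```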
