Read a term-document matrix from CSV using Python
The reason the classic CSV reader doesn't work on term-document arrays is that the first column of the CSV file contains terms, not values. The file has the following syntax:

    "";"label1";"label2";"label3" ...
    "term1";1;0;8;...
    "term2";0;0;3;...
    .................................
I need to build a dictionary whose keys are the labels (label1, label2, label3, etc.) and whose values are the column vectors (here that would be: dict[label1] -> 1,0 , dict[label2] -> 0,0 , etc.), meaning the terms are useless to me.
I have implemented a custom solution that goes like this:
    ....
    keys = f.readline().split('";"')  # 1st line of csv
    keys = keys[1:]  # skipping ""
    zeros = [0] * len(keys)  # dict's initial values: 0
    d = OrderedDict(zip(keys, zeros))
    lines = f.readlines()
    for line in lines:
        ... splitting, stripping etc. to get a list of values (e.g.: 1,0,8 - see example above) ...
        for value in values:
            ....
However, reading 8 CSV files (total: 12 MB) takes 90 minutes on my laptop.
Does anyone know a more efficient way to deal with this?
You can use the csv module anyway: read the CSV file into memory, then transpose the rows using zip(*rows) (or itertools.izip(*rows) on Python 2):
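    import csv

    # Python 2: open in binary mode; on Python 3 use open(somecsv, newline='') instead
    with open(somecsv, 'rb') as infile:
        reader = csv.reader(infile, delimiter=';')
        headers = next(reader)   # first row: '' followed by the labels
        data = list(reader)      # remaining rows, one per term
    # transpose the rows into columns and pair each column with its header
    data = dict(zip(headers, zip(*data)))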
This creates a data dictionary with the headers as keys and the columns as values. You can delete the '' ('terms') column from the dictionary if needed.
For your input example, the data dictionary looks like this after executing the above code:
{'': ('term1', 'term2'), 'label1': ('1', '0'), 'label2': ('0', '0'), 'label3': ('8', '3')}
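Note that the values above are still strings and the '' key still holds the terms. If you want exactly the structure from the question (labels as keys, integer column vectors as values, in file order), something along these lines might work. It is only a sketch under a few assumptions: Python 3, every cell parses as an int, and every row has the same number of fields (zip stops at the shortest row). The function name read_term_document_columns and the in-memory sample are made up for illustration.

```python
import csv
import io
from collections import OrderedDict


def read_term_document_columns(infile, delimiter=";"):
    """Hypothetical helper: map each label to its column of ints."""
    reader = csv.reader(infile, delimiter=delimiter)
    headers = next(reader)      # ['', 'label1', 'label2', ...]
    columns = zip(*reader)      # transpose the remaining rows into columns
    next(columns, None)         # discard the first column (the terms)
    return OrderedDict(
        (label, [int(value) for value in column])
        for label, column in zip(headers[1:], columns)
    )


# Assumed sample input, mirroring the example from the question.
sample = io.StringIO(
    '"";"label1";"label2";"label3"\n'
    '"term1";1;0;8\n'
    '"term2";0;0;3\n'
)
data = read_term_document_columns(sample)
print(dict(data))
```

For a real file you would pass in something like open(somecsv, newline='') instead of the StringIO object. The work is done by the csv module and a single zip call rather than a Python-level loop over every value, which is where the speedup over the hand-written parser should come from.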