Read a term-document matrix from CSV using Python


The reason the classic csv reader doesn't work on term-document arrays is that the first column of the CSV file holds terms, not values. The file has the following syntax:

    "";"label1";"label2";"label3" ...
    "term1";1;0;8;...
    "term2";0;0;3;...
    .................................

I need to build a dictionary whose keys are label1, label2, label3, etc., and whose values are the column vectors (here that would be: dict[label1] -> 1,0, dict[label2] -> 0,0, etc.), meaning the terms are useless to me.

I have implemented a custom solution that goes like this:

    ....
    keys = f.readline().split('";"')    # 1st line of csv
    keys = keys[1:]                     # skipping ""
    zeros = [0] * len(keys)             # dict's initial values are 0
    d = OrderedDict(zip(keys, zeros))
    lines = f.readlines()
    for line in lines:
        ...
        # splitting, stripping etc. to get a list of values (e.g. 1,0,8 - see example above)
        ...
        for value in values:
            ....

However, reading 8 CSV files (total: 12 MB) takes 90 minutes on my laptop.

Does anyone know a more efficient way to deal with this?

You can use the csv module anyway to read the CSV files into memory, then transpose the rows using zip(*rows) or itertools.izip(*rows):

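    import csv

    # 'rb' is the Python 2 convention; on Python 3 use open(somecsv, newline='')
    with open(somecsv, 'rb') as infile:
        reader = csv.reader(infile, delimiter=';')
        headers = next(reader)                  # first row: "", label1, label2, ...
        data = list(reader)                     # remaining rows, one per term
        data = dict(zip(headers, zip(*data)))   # transpose rows into columns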

This creates a data dictionary with the headers as keys and the columns as values. You can delete the '' ('terms') column from the dictionary if needed.

For your input example, the data dictionary looks like this after executing the above code:

{'': ('term1', 'term2'), 'label1': ('1', '0'), 'label2': ('0', '0'), 'label3': ('8', '3')} 
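
If the terms column is not needed and the counts should be integers rather than strings, the same transpose idea could be extended roughly as follows. This is only a sketch for Python 3, not part of the original answer: the helper name read_columns and the SAMPLE string are made up for illustration, and it assumes every value cell parses as an integer.

```python
import csv
import io
from collections import OrderedDict

# Hypothetical sample mirroring the file layout described in the question.
SAMPLE = '"";"label1";"label2";"label3"\n"term1";1;0;8\n"term2";0;0;3\n'


def read_columns(infile):
    """Return an OrderedDict mapping each label to a list of int counts."""
    reader = csv.reader(infile, delimiter=';')
    headers = next(reader)          # ['', 'label1', 'label2', 'label3']
    # zip(*rows) transposes the row-oriented data into columns.
    columns = zip(*reader)
    next(columns, None)             # skip the first column, which holds the terms
    return OrderedDict(
        (label, [int(value) for value in column])
        for label, column in zip(headers[1:], columns)
    )


columns = read_columns(io.StringIO(SAMPLE))
```

For a real file, passing open(path, newline='') instead of the io.StringIO object should work the same way. Note that zip(*reader) still reads the whole file into memory before transposing, just as in the answer above.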
