python - Finding random pairs in a large directory -
i have ~5m csv files stored in ~100.000 folders. each folder contains same number of files , there's number of files in folder. need find paths these files , load them list in strange order statistical modeling project.
in particular, need following upheld:
- uniqueness: each file must in list once
- pairs: each file must next file same folder (it can next 2 if due randomness)
- randomness: probability of 2 files not "paired" being next each other should same (i.e. wouldn't work iterative on files)
i've created example below.
files
folder_1 - file_a - file_b - file_c - file_d folder_2 - file_e - file_f - file_g - file_h
good result (randomized, upholds rule of pairs)
paths = ['folder_1/file_a', 'folder_1/file_d', 'folder_2/file_g', 'folder_2/file_f', 'folder_2/file_e', 'folder_2/file_h', 'folder_1/file_c', 'folder_1/file_b']
a simple approach might "pick random folder, pick random file in folder , random pair in folder. save these picks in list avoid getting picked again. repeat.". however, take far long. can recommend strategy creating list? randomness requirement can relaxed bit if needed.
one way ensure everything's random use random.shuffle
, shuffles list inplace. way can pair each item neighbor, safe in knowledge pairing random. achieve result example can shuffle , flatten resulting list of pairs. here's example:
from random import shuffle # generate sample directory names ls = [[]] * 5 = 0 while < len(ls): ls[i] = [str(i) + chr(j) j in range(97,101)] += 1 # shuffle files within each directory pairs = [] l in ls: shuffle(l) pairs += list(zip(l[1::2], l[::2])) # shuffle , flatten list of pairs shuffle(pairs) flat = [item sublist in pairs item in sublist] print(flat)
Comments
Post a Comment