Dask dataframe: MemoryError with merge
I'm playing with GitHub user data, trying to create a graph of people in the same city. For this I need to use the merge operation in Dask. Unfortunately the GitHub user base is about 6M rows, and it seems the merge operation is causing the resulting dataframe to blow up. I used the following code:
    import dask.dataframe as dd

    # Read the same table twice so I can self-join users on their city.
    gh = dd.read_hdf('data/github.hd5', '/github', chunksize=5000,
                     columns=['id', 'city']).dropna()
    st = dd.read_hdf('data/github.hd5', '/github', chunksize=5000,
                     columns=['id', 'city']).dropna()

    # Pair up users that share a city, then keep only the two id columns.
    mrg = gh.merge(st, on='city').drop('city', axis=1)

    # Normalise each pair so (a, b) and (b, a) look the same.
    mrg['max'] = mrg.max(axis=1)
    mrg['min'] = mrg.min(axis=1)

    mrg.to_castra('github')
I can merge on other criteria (such as name/username) using this same code, but I get a MemoryError when I try to run the code above.
I have tried running it with the synchronous, multiprocessing and threaded schedulers.
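For reference, this is roughly how I'm selecting the scheduler (the scheduler= names follow the current Dask docs; older Dask versions exposed this through dask.set_options(get=...) instead):

    import dask

    # Pick one of the schedulers I tried; names as in the current Dask docs.
    dask.config.set(scheduler='synchronous')    # single-threaded, for debugging
    # dask.config.set(scheduler='threads')      # thread pool
    # dask.config.set(scheduler='processes')    # multiprocessing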
I'm trying this on a Dell laptop with a 4-core i7 and 8 GB of RAM. Shouldn't Dask perform the operation in a chunked manner, or am I getting this wrong? Is writing the code with pandas DataFrame iterators the way out?
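To be concrete, this is the kind of pandas-iterator fallback I have in mind (a minimal sketch: it assumes the HDF5 store is in table format so read_hdf can iterate, the chunk size is arbitrary, and 'github_pairs.h5' is just a hypothetical output path):

    import pandas as pd

    # Load the small (id, city) projection once; ~6M rows of two columns
    # should fit in memory even if the merged result does not.
    st = pd.read_hdf('data/github.hd5', '/github',
                     columns=['id', 'city']).dropna()

    for gh_chunk in pd.read_hdf('data/github.hd5', '/github',
                                columns=['id', 'city'], chunksize=5000):
        gh_chunk = gh_chunk.dropna()
        # Self-join each chunk against the full table on 'city'.
        mrg = gh_chunk.merge(st, on='city').drop('city', axis=1)
        mrg['max'] = mrg.max(axis=1)
        mrg['min'] = mrg.min(axis=1)
        # Append each chunk's result to disk instead of keeping it all
        # in memory; the per-chunk output can still be large for big cities.
        mrg.to_hdf('github_pairs.h5', 'pairs', format='table', append=True)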