python - Dask dataframe: Memory error with merge


I'm playing with GitHub user data, trying to create a graph of people in the same city. For that I need to use the merge operation in Dask. Unfortunately, the GitHub user base is about 6M users, and the merge operation seems to make the resulting dataframe blow up. I used the following code:

import dask.dataframe as dd

gh = dd.read_hdf('data/github.hd5', '/github', chunksize=5000, columns=['id', 'city']).dropna()
st = dd.read_hdf('data/github.hd5', '/github', chunksize=5000, columns=['id', 'city']).dropna()
mrg = gh.merge(st, on='city').drop('city', axis=1)
mrg['max'] = mrg.max(axis=1)
mrg['min'] = mrg.min(axis=1)
mrg.to_castra('github')
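For scale: since both sides come from the same table, merging on city is effectively a self-join, so a city with k users contributes k*k rows to the output. A rough check along these lines (my own estimate, not part of the failing script) shows how large the merged frame would get:

import dask.dataframe as dd

# Estimate the merged row count: a self-join on 'city' yields
# (group size)**2 rows per city, so sum the squared group sizes.
gh = dd.read_hdf('data/github.hd5', '/github', chunksize=5000, columns=['id', 'city']).dropna()
sizes = gh.groupby('city').size()
print((sizes ** 2).sum().compute())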

I can merge on other criteria (such as name/username) using this code, but I get a MemoryError when I try to run the code above.

I have tried running it with the synchronous, multiprocessing, and threaded schedulers.

I'm trying this on a Dell laptop with a 4-core i7 and 8 GB of RAM. Shouldn't Dask carry out the operation in a chunked manner, or am I getting something wrong? Is writing the code with Pandas dataframe iterators the way out?
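For reference, the Pandas-iterator fallback I have in mind would look roughly like this (an untested sketch; 'github_pairs.h5' and the 'pairs' key are just placeholders):

import pandas as pd

# Keep one side of the join in memory (id/city alone should fit in RAM)
# and stream the other side in chunks, appending each merged chunk to disk.
lookup = pd.read_hdf('data/github.hd5', '/github', columns=['id', 'city']).dropna()

with pd.HDFStore('github_pairs.h5') as out:
    for chunk in pd.read_hdf('data/github.hd5', '/github', chunksize=5000, columns=['id', 'city']):
        mrg = chunk.dropna().merge(lookup, on='city').drop('city', axis=1)
        mrg['max'] = mrg.max(axis=1)
        mrg['min'] = mrg.min(axis=1)
        out.append('pairs', mrg)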

