Dask dataframe: MemoryError with merge
I'm playing with GitHub user data, trying to create a graph of people in the same city. For this I need to use the merge operation in Dask. Unfortunately the GitHub user base is about 6M rows, and it seems the merge operation is causing the resulting dataframe to blow up. I used the following code:
    import dask.dataframe as dd

    # Read the same table twice so I can self-join users on their city.
    gh = dd.read_hdf('data/github.hd5', '/github', chunksize=5000,
                     columns=['id', 'city']).dropna()
    st = dd.read_hdf('data/github.hd5', '/github', chunksize=5000,
                     columns=['id', 'city']).dropna()

    # Pair up users that share a city, then keep only the two id columns.
    mrg = gh.merge(st, on='city').drop('city', axis=1)

    # Normalise each pair so (a, b) and (b, a) look the same.
    mrg['max'] = mrg.max(axis=1)
    mrg['min'] = mrg.min(axis=1)

    mrg.to_castra('github')
I can merge on other criteria (such as name/username) using this same code, but I get a MemoryError when I try to run the code above.
I have tried running it with the synchronous, multiprocessing and threaded schedulers.
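For reference, this is roughly how I'm selecting the scheduler (the scheduler= names follow the current Dask docs; older Dask versions exposed this through dask.set_options(get=...) instead):

    import dask

    # Pick one of the schedulers I tried; names as in the current Dask docs.
    dask.config.set(scheduler='synchronous')    # single-threaded, for debugging
    # dask.config.set(scheduler='threads')      # thread pool
    # dask.config.set(scheduler='processes')    # multiprocessing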
I'm trying this on a Dell laptop with a 4-core i7 and 8 GB of RAM. Shouldn't Dask perform the operation in a chunked manner, or am I getting this wrong? Is writing the code with pandas DataFrame iterators the way out?
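To be concrete, this is the kind of pandas-iterator fallback I have in mind (a minimal sketch: it assumes the HDF5 store is in table format so read_hdf can iterate, the chunk size is arbitrary, and 'github_pairs.h5' is just a hypothetical output path):

    import pandas as pd

    # Load the small (id, city) projection once; ~6M rows of two columns
    # should fit in memory even if the merged result does not.
    st = pd.read_hdf('data/github.hd5', '/github',
                     columns=['id', 'city']).dropna()

    for gh_chunk in pd.read_hdf('data/github.hd5', '/github',
                                columns=['id', 'city'], chunksize=5000):
        gh_chunk = gh_chunk.dropna()
        # Self-join each chunk against the full table on 'city'.
        mrg = gh_chunk.merge(st, on='city').drop('city', axis=1)
        mrg['max'] = mrg.max(axis=1)
        mrg['min'] = mrg.min(axis=1)
        # Append each chunk's result to disk instead of keeping it all
        # in memory; the per-chunk output can still be large for big cities.
        mrg.to_hdf('github_pairs.h5', 'pairs', format='table', append=True)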