python - How to load a Spark model for efficient predictions -


When I build a Spark model and call it, predictions take tens of milliseconds to return. However, when I save the same model and then load it, predictions take much longer. Is there some sort of cache I should be using?

Calling model.cache() after loading does not work, since the model is not an RDD.

This works great:

from pyspark.mllib.recommendation import ALS
from pyspark import SparkContext
import time

sc = SparkContext()

# Example data
r = [(1, 1, 1.0),
     (1, 2, 2.0),
     (2, 1, 2.0)]
ratings = sc.parallelize(r)
model = ALS.trainImplicit(ratings, 1, seed=10)

# Call the model and time it
now = time.time()
for t in range(10):
    model.predict(2, 2)

elapsed = (time.time() - now) * 1000 / (t + 1)

print "Average time for model call: {:.2f}ms".format(elapsed)

model.save(sc, 'my_spark_model')

Output: Average time for model call: 71.18ms

However, if I run the following, the predictions take much more time:

from pyspark.mllib.recommendation import MatrixFactorizationModel
from pyspark import SparkContext
import time

sc = SparkContext()

model_path = "my_spark_model"
model = MatrixFactorizationModel.load(sc, model_path)

# Call the model and time it
now = time.time()
for t in range(10):
    model.predict(2, 2)

elapsed = (time.time() - now) * 1000 / (t + 1)

print "Average time for loaded model call: {:.2f}ms".format(elapsed)

The output: Average time for loaded model call: 180.34ms

For big models, I'm seeing prediction times of over 10 seconds for a single call after loading a saved model.

In short: no, there doesn't seem to be a way to cache the whole model, since it's not an RDD.


You can try to use cache(), but you cannot cache the model itself, because it is not an RDD. Try this instead:

model.productFeatures().cache()
model.userFeatures().cache()

It is recommended to unpersist() them after you no longer need them, especially if you are handling big data, since you want to protect your job from out-of-memory errors.
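For example, here is a minimal sketch of that lifecycle (the variable names are my own; holding the factor RDDs in variables makes sure cache() and unpersist() hit the same RDD objects):

# Keep handles on the factor RDDs and cache them
product_features = model.productFeatures()
user_features = model.userFeatures()
product_features.cache()
user_features.cache()

# ... run the predictions here ...

# Release the executor memory once the predictions are done
product_features.unpersist()
user_features.unpersist()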

Of course, you can use persist() instead of cache(); you might want to read: What is the difference between cache and persist?
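As a hedged sketch of the persist() variant (MEMORY_AND_DISK is just one possible choice of storage level; cache() is persist() with the default MEMORY_ONLY level):

from pyspark import StorageLevel

# Spill factor partitions to disk under memory pressure instead of dropping them
model.productFeatures().persist(StorageLevel.MEMORY_AND_DISK)
model.userFeatures().persist(StorageLevel.MEMORY_AND_DISK)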


Remember that Spark performs transformations lazily, so when you load the model nothing actually happens. It takes an action to trigger the actual work (i.e. when you first use the model, Spark will attempt to load it, which is why you experience the latency, versus already having it in memory).

Also note that cache() itself is lazy, so you can use rdd.count() to explicitly load the RDD into memory.


My experiments' output:

Average time for model call: 1518.83ms
Average time for loaded model call: 2352.70ms
Average time for loaded model call with the suggestions: 8886.61ms

By the way, after loading the model, you should receive this kind of warning:

16/08/24 00:14:05 WARN MatrixFactorizationModel: User factor does not have a partitioner. Prediction on individual records could be slow.
16/08/24 00:14:05 WARN MatrixFactorizationModel: User factor is not cached. Prediction could be slow.
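Since the warnings single out prediction on individual records, one thing worth trying (not covered in the experiments below, but part of the standard MLlib API) is batching the queries through predictAll(), which takes an RDD of (user, product) pairs instead of hitting the factor RDDs one record at a time:

# Hedged sketch: batch the lookups instead of per-record predict() calls
pairs = sc.parallelize([(2, 2), (1, 1), (1, 2)])
predictions = model.predictAll(pairs).collect()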

But what if I use the count() trick? You won't gain anything at all; in fact, it is slower:

...
model.productFeatures().cache()
model.productFeatures().count()
model.userFeatures().cache()
model.userFeatures().count()
...

Output:

Average time for loaded model call: 13571.14ms

Without the cache() calls, but keeping the count()s, I got:

Average time for loaded model call: 9312.01ms

Important note: the timing was performed on a real-world cluster where the nodes are assigned to important jobs, so my toy example may have been preempted during the experiments. Moreover, the communication cost may dominate.

So, if I were you, I would run the experiments myself too.


In conclusion, there doesn't seem to be any other mechanism available in Spark for caching a model beyond that.

