Output sorted text file from Google Cloud Dataflow
I have a PCollection<String> in Google Cloud Dataflow, and I'm outputting it to text files via TextIO.Write.to:
PCollection<String> lines = ...;
lines.apply(TextIO.Write.to("gs://bucket/output.txt"));
Currently the lines of each shard of the output are in random order.
Is it possible to get Dataflow to output the lines in sorted order?
This is not directly supported by Dataflow.
For a bounded PCollection, if you shard your input finely enough, you can write sorted files with a Sink implementation that sorts each shard. You may want to refer to the TextSink implementation for a basic outline.
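To make the "sort each shard" idea concrete, here is a minimal plain-Java sketch of the buffering a per-shard writer would have to do. It is not the Dataflow Sink API: the class ShardLineSorter and its methods are hypothetical names used only for illustration, and it assumes each shard is small enough to hold in memory, which is why the input needs to be sharded finely.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/**
 * Hypothetical helper, not part of the Dataflow SDK.
 * Buffers the lines of a single shard and renders them in sorted order.
 */
class ShardLineSorter {
    private final List<String> buffer = new ArrayList<>();

    /** Buffer one line; nothing is emitted until render() is called. */
    void add(String line) {
        buffer.add(line);
    }

    /** Sort a copy of the buffered lines (natural String order) and join them, one per line. */
    String render() {
        List<String> sorted = new ArrayList<>(buffer);
        Collections.sort(sorted);
        StringBuilder out = new StringBuilder();
        for (String line : sorted) {
            out.append(line).append('\n');
        }
        return out.toString();
    }
}
```

In a real custom sink, the equivalent of add would be called for each element written to the shard, and the equivalent of render would run when the shard is closed. Note that this only sorts within each shard; the output files are not ordered relative to one another.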