One of the most remarkable things about Pachyderm v1.8 is the platform's ability to reach a new level of scale and speed. We’ve been hard at work to ensure that Pachyderm is the best data science platform out there and is capable of handling even the most demanding of enterprise workloads. Needless to say, we’ve exceeded even our own expectations with more than a 1000x speed improvement for certain workloads. Bryce McAnally, one of Pachyderm’s lead engineers, spearheaded this effort and provides a deep-dive on how we were able to achieve these massive gains.
Pachyderm has several major customers processing massive workloads and, during our collaboration with them, we identified several scaling bottlenecks in v1.7 that required a few significant changes to the core design of Pachyderm’s storage layer (PFS). These changes were:
Smarter batching of writes to object storage to reduce latency and the performance variance that comes with frequent and concurrent requests to object storage.
Redesigning the format of our metadata to support external sorting of hashtrees (our snapshot of the filesystem) that would scale to billions of files.
Multiple smaller scalability fixes that were revealed during large-scale testing.
For Map/Reduce style workloads in particular, where many large files (1M+ records each) are being processed and grouped by unique recordID, Pachyderm v1.8 has stellar performance gains compared to v1.7. In the following graph, we can see that v1.7 starts to struggle with workloads of more than 25k input files, whereas v1.8’s performance is nearly an order of magnitude better throughout and continues to scale seamlessly at a near-linear rate as the number of input files increases.
Pachyderm v1.8 also scales as the number of input records being processed and the number of output files increases. Here’s a chart showing multiple workloads spanning two orders of magnitude of records, total data size, and number of output files. The times shown include both ingesting the input files as well as processing the data and uploading the final results. Pachyderm v1.8 scales nearly linearly all the way up to hundreds of millions of files, billions of records, and many terabytes of data.
Frequent requests to object storage has both significant performance and cost implications as cloud object storage providers charge per request. By batching the data upload step of each job, Pachyderm is able to reduce the upload time and object storage request costs by nearly two orders of magnitude for large workloads.
As with most engineering decisions, it’s often a matter of tradeoffs. The previous implementation of the upload step would upload each file output per datum independently. While this implementation had the benefit of a much simpler notion of global data deduplication (reduced storage costs), object storage request costs would skyrocket for massive workloads (100M+ files) and would very quickly become the dominating cost factor.
By analyzing the workloads of our users, we realized that the vast majority (95+%) of potential data duplication happened at the local level (subsequent runs of a pipeline outputting the same files). Therefore, we designed our new model to upload all file content for a datum as a single object and lexicographically sort it by file names. In this model, Pachyderm still drastically reduces storage bloat by doing local deduplication – we only check against previous outputs for a datum rather than a global check across all outputs – while also providing incredible performance improvements and compute savings that far outweigh the minor storage tradeoff.
The top contributing factor to the performance improvements in Pachyderm v1.8 is the changes to the formatting and merging of our output hashtrees. Pachyderm uses hashtrees to represent a snapshot of the file system, both for individual datums and an entire job (which may consist of many datums). These hashtrees let us efficiently track and store metadata for all the versioned files in our system.
In previous versions of Pachyderm, hashtrees were backed by a map stored in a protocol buffer. Each datum would output one of these hashtrees that represented the state of the filesystem output by that datum. At the end of a job, all of the datum hashtrees would be merged into a final hashtree for the entire commit.
The merge of these trees would happen on the “master worker” that would then upload the final output hashtree. This hashtree model was not scalable because operations such as merging required the full hashtree to be kept in memory on a single worker. For sufficiently large hashtrees, that could easily exhaust the memory resources of the node.
In Pachyderm 1.8, we use a distributed hashtree model based on the hash of the file paths. We then store filesystem metadata sorted in our own serialization format (a simple byte length encoding) with indexes for quick accesses to subtrees (files under a directory, files that match a glob pattern, etc).
In order to have our metadata in this format at the end of a job, we needed to do an external sort on the datum hashtrees as well as merge paths that showed up in multiple datum hashtrees to create the final job hashtree. This sorting involves a few steps:
When uploading the output of a datum, we walk the local filesystem, which sorts our output datum hashtrees into lexicographical order.
Each worker claims responsibility for one of the hashtree shards.
Each worker that got a shard will check if a previous job output includes any subtrees of the output tree for this job (very useful for workloads that only add datums, which are extremely common).
Finally, merge all of the files belonging to that hastree shard in sorted order to create the final hashtree(s) for the job.
These changes allow Pachyderm's distributed computation and versioning model to scale to 100M+ files, billions of records, and crunch through them at massive speeds.
As a final note, there were several ancillary changes to our approach to improved scalability as well as a few bug fixes. One noteworthy change is that
put-file --split (uploading a large file that you want to split into many tiny files) now makes reasonable use of memory rather than increasing proportionally to the number of splits. Lastly, several sources of OOM kills have been addressed, particularly those where memory used was proportional to metadata size, and superfluous object storage read/writes were removed where possible.