One of the major features included in Pachyderm v1.8 (and being backported to 1.7.11) is improved support for large files of structured data. Specifically, it is aimed at users who want to use Pachyderm as their versioned data lake and dump large swaths of CSV and SQL data into Pachyderm repos to track how those files change over time.
Pachyderm v1.8 now has the ability to ingest structured data as a single file and automatically chunk it up to be run as a distributed workload across the cluster. This was one of the biggest requests from our community members trying to do more ETL and aggregation workloads in Pachyderm.
To ingest SQL data (into the data repo on the master branch) and have Pachyderm take care of all the splitting, you just need to run:
When you use pachctl put-file --split sql ..., your pg dump file is split into three parts: the header, the rows, and the footer. The header contains all the SQL statements in the pg dump that set up the schema and tables. The rows are split into individual files (or, if you specify --target-file-datums or --target-file-bytes, into files of multiple rows each). The footer contains the remaining SQL statements for setting up the tables.
The header and footer are stored on the directory containing the rows. This way, if you request a get-file on the directory, you’ll get just the header and footer. If you request an individual file, you’ll see the header plus the row(s) plus the footer. If you request all the files with a glob pattern, e.g. /directoryname/*, you’ll receive the header plus all the rows plus the footer, recreating the full pg dump. In this way, you can construct full or partial pg dump files that can be processed independently.
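As a rough mental model of these retrieval rules, the Python sketch below mimics a split directory in memory. It is not Pachyderm code: the class SplitDirectory, its methods, the zero-padded file names, and the sample dump strings are all hypothetical, and rows_per_file only stands in for the role --target-file-datums plays above.

```python
from fnmatch import fnmatch


class SplitDirectory:
    """Toy in-memory model of a directory created by a split ingest."""

    def __init__(self, header, rows, footer, rows_per_file=1):
        self.header = header
        self.footer = footer
        self.files = {}
        # Group rows into numbered files (the naming scheme is an assumption).
        for i in range(0, len(rows), rows_per_file):
            name = f"{i // rows_per_file:016x}"
            self.files[name] = "".join(rows[i:i + rows_per_file])

    def get_directory(self):
        # Requesting the directory itself yields only header + footer.
        return self.header + self.footer

    def get_file(self, name):
        # A single file is wrapped in the shared header and footer.
        return self.header + self.files[name] + self.footer

    def get_glob(self, pattern):
        # Header once, every matching row file in order, footer once.
        names = sorted(n for n in self.files if fnmatch(n, pattern))
        body = "".join(self.files[n] for n in names)
        return self.header + body + self.footer


# Made-up stand-ins for the three parts of a pg dump.
header = "CREATE TABLE users (id INT, name TEXT);\nCOPY users (id, name) FROM stdin;\n"
rows = ["1\talice\n", "2\tbob\n", "3\tcarol\n"]
footer = "\\.\nALTER TABLE users ADD PRIMARY KEY (id);\n"

directory = SplitDirectory(header, rows, footer)
full_dump = directory.get_glob("*")
first_name = sorted(directory.files)[0]
partial_dump = directory.get_file(first_name)
```

Here full_dump is byte-for-byte the original header, rows, and footer concatenated, while partial_dump is a smaller but still self-contained dump holding only the first row.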
Of course, SQL data is just one example. For CSV data, the behavior is the same, but the steps are slightly different, as you need to define the header manually. We’ll be making this smarter in a future release, but for now you can ingest a CSV file in two steps.
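Conceptually, the two steps amount to storing the data rows as separate files and then attaching a header line that is shared by all of them. The Python sketch below illustrates only that idea, under the assumption that the data contains no quoted fields with embedded newlines; the function names and sample data are hypothetical and this is not how Pachyderm implements it.

```python
import csv
import io


def split_rows(data, rows_per_file=1):
    """Step 1: cut the header-less CSV data into chunks of whole lines."""
    lines = data.splitlines(keepends=True)
    return [
        "".join(lines[i:i + rows_per_file])
        for i in range(0, len(lines), rows_per_file)
    ]


def with_header(header, chunk):
    """Step 2: present a chunk together with the manually defined header."""
    return header + chunk


data = "1,alice\n2,bob\n3,carol\n"   # made-up rows, no header line
header = "id,name\n"                 # header supplied separately

chunks = split_rows(data)            # one file per CSV line
views = [with_header(header, chunk) for chunk in chunks]

# Each view is a small, valid CSV that can be parsed on its own.
records = [next(csv.DictReader(io.StringIO(view))) for view in views]
```

Because every chunk carries the same header when it is read back, each one can be handed to a separate worker and parsed independently.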
First, add the data. In this case we’re creating one file per line of our CSV. Just as with SQL, you can easily change that to chunks of rows using --target-file-datums or --target-file-bytes.
Now we’ll add the header itself:
If you want to learn more details about working with structured data and headers/footers, check out our documentation.