The technique described in the article seems to use this key-value pair to store pointers to the additional metadata (in this case, a distinct-value index) embedded in the file. Note that we can embed arbitrary binary data in a Parquet file between data pages. This is perfectly valid, since all Parquet readers rely on the exact data-page offsets recorded in the footer.
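As a minimal sketch (using the Rust `parquet` crate, not the blog's actual code) of what recording such a pointer might look like: the key name `my_index_location` and the offset format are invented for illustration, and actually writing the index bytes between the data pages requires lower-level control of the underlying writer than shown here.

```rust
use std::fs::File;
use std::sync::Arc;

use arrow::array::{ArrayRef, Int64Array};
use arrow::record_batch::RecordBatch;
use parquet::arrow::ArrowWriter;
use parquet::file::properties::WriterProperties;
use parquet::format::KeyValue;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let ids: ArrayRef = Arc::new(Int64Array::from(vec![1, 2, 3]));
    let batch = RecordBatch::try_from_iter([("id", ids)])?;

    // Hypothetical pointer to index bytes embedded elsewhere in the file;
    // the key name and value format here are made up for illustration.
    let props = WriterProperties::builder()
        .set_key_value_metadata(Some(vec![KeyValue::new(
            "my_index_location".to_string(),
            "offset=1234,len=567".to_string(),
        )]))
        .build();

    let file = File::create("data.parquet")?;
    let mut writer = ArrowWriter::try_new(file, batch.schema(), Some(props))?;
    writer.write(&batch)?;
    writer.close()?; // footer (including the key-value pairs) is written here
    Ok(())
}
```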
This means that DataFusion does not need to specify how the metadata is interpreted; the key-value mechanism is already well specified as part of the Parquet file format itself. DataFusion is an independent project -- it is a query execution engine for OLAP / columnar data, which can take in SQL statements, build query plans, optimize them, and execute them. It is an embeddable runtime with numerous ways for the host program to extend it. Parquet is a file format supported by DataFusion because it is one of the most popular ways of storing columnar data in object stores like S3.
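For anyone unfamiliar with it, embedding DataFusion looks roughly like this (the file name `data.parquet` is just a placeholder):

```rust
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    // The host program embeds the engine, registers a Parquet file as a
    // table, then lets DataFusion plan, optimize, and execute a SQL query.
    let ctx = SessionContext::new();
    ctx.register_parquet("t", "data.parquet", ParquetReadOptions::default())
        .await?;
    ctx.sql("SELECT id, count(*) FROM t GROUP BY id")
        .await?
        .show()
        .await?;
    Ok(())
}
```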
Note that a Parquet reader needs to be aware of any extra metadata in order to exploit it. But if not, nothing changes: as long as we embed only supplementary information such as indexes or bloom filters, a reader can keep working with the columnar data in the Parquet file exactly as it used to; it just won't be able to take advantage of the additional metadata.
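Continuing the hypothetical sketch from above, an index-aware reader's lookup might look like the following; a reader that has never heard of the (made-up) `my_index_location` key simply never consults it and reads the data pages as usual.

```rust
use std::fs::File;
use parquet::file::reader::{FileReader, SerializedFileReader};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let file = File::open("data.parquet")?;
    let reader = SerializedFileReader::new(file)?;

    // Scan the footer's key-value metadata for the hypothetical pointer
    // left by the writer; unaware readers skip this step entirely.
    let footer = reader.metadata().file_metadata();
    if let Some(kvs) = footer.key_value_metadata() {
        for kv in kvs {
            if kv.key == "my_index_location" {
                println!("index pointer: {:?}", kv.value);
            }
        }
    }
    Ok(())
}
```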
> Note that a Parquet reader needs to be aware of any extra metadata in order to exploit it. But if not, nothing changes
The one downside of this approach, which is likely obvious but I haven't seen mentioned, is that the resulting Parquet files are larger than they would be otherwise, and the increased size only benefits engines that know how to interpret the new index.
There is no spec. Personally, I hope the existing indexes (bloom filters, zone maps) get redesigned to fit into a paradigm where Parquet itself has more first-class support for multiple levels of indexes embedded in the file, along with conventions for how those common index types are stored. That is, start with the Wild West and define specs as needed.
> That is, start with the Wild West and define specs as needed
Yes, this is my personal hope as well -- if new index types become widespread, they can be incorporated formally into the spec.
However, changing the spec is a non-trivial process and requires significant consensus-building and engineering.
Thus, the methods described in the blog can be used to add indexes ahead of any spec change, and potentially as a way to prototype and prove out new index types.