The D4N caching architecture is a caching middleware that sits between clients and Ceph storage. The Apache Arrow integration aims to make S3 Select query parsing more efficient on the client side, because Arrow's standardized data storage format offers a significant performance advantage when transferring large amounts of data between systems.
A query result is converted to Arrow format and written to an Arrow file as follows:
- Use the Arrow ArrayBuilder API to arrange the query results in columnar format.
- Use the Arrow Table API to write the columns to an Arrow table.
- Write the Arrow table to an Arrow file using the Arrow IO API.
- Create readers to visually analyse Arrow and Parquet files.
- Parse S3 Select projections based on their data types.
- Convert S3 Select query results to Arrow by integrating the Arrow format into the S3 Select library.
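
The conversion steps above can be sketched in a few lines of Python. This is only an illustration, not the project's implementation: it uses `pyarrow` in place of the Arrow C++ API, where `pa.array` plays the role of the ArrayBuilder API, and the helper names (`rows_to_columns`, `write_arrow_file`, `read_arrow_file`) and the sample rows are hypothetical.

```python
import os
import tempfile


def rows_to_columns(header, rows):
    """Pivot row-oriented query results into one list per projected column."""
    columns = {name: [] for name in header}
    for row in rows:
        if len(row) != len(header):
            raise ValueError("row width does not match the projection list")
        for name, value in zip(header, row):
            columns[name].append(value)
    return columns


def write_arrow_file(columns, path):
    """Build an Arrow table from the columns and write it as an Arrow IPC file."""
    # Imported here so rows_to_columns stays usable without pyarrow installed.
    import pyarrow as pa

    # pa.array infers an Arrow type for each column (int64, string, double, ...).
    arrays = [pa.array(values) for values in columns.values()]
    table = pa.Table.from_arrays(arrays, names=list(columns.keys()))

    # Write the table to disk in the Arrow IPC file format.
    with pa.OSFile(path, "wb") as sink:
        with pa.ipc.new_file(sink, table.schema) as writer:
            writer.write_table(table)
    return table


def read_arrow_file(path):
    """Read an Arrow IPC file back into a table, e.g. to inspect its contents."""
    import pyarrow as pa

    with pa.OSFile(path, "rb") as source:
        return pa.ipc.open_file(source).read_all()


if __name__ == "__main__":
    # Hypothetical result of a query projecting three columns.
    header = ["id", "name", "price"]
    rows = [(1, "apple", 0.5), (2, "pear", 0.75)]

    columns = rows_to_columns(header, rows)
    path = os.path.join(tempfile.mkdtemp(), "result.arrow")
    write_arrow_file(columns, path)
    print(read_arrow_file(path).schema)
```

Because each column is typed when its array is built, a reader on the client side can recover both the schema and the values from the file without re-parsing text output.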