S3 Select to Apache Arrow Format
Nov '15
The D4N Caching architecture is a caching middleware between the clients and Ceph storage. The Apache Arrow integration was aimed at making S3 Select query parsing more efficient on the client side, as Arrow's standardized data storage format provides a significant performance advantage when transferring large amounts of data between systems. A query result is converted to Arrow format and written to an Arrow file.

The problems we solved:
  • Create readers to visually analyse Arrow and Parquet files
  • Parse S3 Select projections based on data types
  • Convert S3 Select query results to Arrow by integrating the Arrow format into the S3 Select library
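
As a rough illustration of the type-based parsing step, the sketch below takes a CSV-style query result, infers a type for each projected column, and pivots the rows into a columnar layout, which is the shape Arrow stores. It uses only the Python standard library as a stand-in and is not the project's code: the helper names (`infer_type`, `rows_to_columns`), the sample data, and the int/float/string inference rule are all assumptions made for the example.

```python
import csv
import io


def infer_type(values):
    """Pick the narrowest type (int, then float, else str) that fits every non-empty value."""
    for caster in (int, float):
        try:
            for v in values:
                if v != "":
                    caster(v)
            return caster
        except ValueError:
            continue
    return str


def rows_to_columns(csv_text, names):
    """Turn row-oriented CSV text into a dict of typed columns (hypothetical helper)."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    columns = {}
    for i, name in enumerate(names):
        raw = [row[i] for row in rows]
        caster = infer_type(raw)
        # Empty fields become None, mirroring a null entry in a columnar format.
        columns[name] = [caster(v) if v != "" else None for v in raw]
    return columns


# Made-up query result with three projected columns.
result = "1,alice,3.5\n2,bob,\n3,carol,4.25\n"
columns = rows_to_columns(result, ["id", "name", "score"])
print(columns)
```

A dict of equally sized, single-typed columns like this is the kind of input an Arrow library can turn into a table and write to a file; the actual integration in the S3 Select library may well build Arrow arrays directly rather than going through an intermediate structure.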
GitHub Repo