D4N - S3 Select Testing
April '21
D4N is a multi-layer cooperative caching solution that aims to improve performance in distributed systems. Its caching algorithm caches data on the access side of each layer of a hierarchical network topology and adaptively adjusts the cache size of each layer based on observed workload patterns and network congestion. The goal of this project is to enhance D4N to directly support S3 Select, a new S3 feature that lets applications select, transform, and summarize data within S3 objects using SQL queries. With this support, clients can read and cache only a subset of an object stored in the Ceph cluster rather than retrieving the entire object, which reduces the amount of data sent over the network.
Here's a list of interesting problems I've worked on:
  • Design and implement one or more prototype S3 Select cache strategies within D4N, using S3 Select to read a subset of an object from Ceph.
  • Cache the selected data, update the global directory, and return the response formatted in Arrow, then evaluate the results of the S3 Select cache.
  • Add support for converting S3 Select query results to the Apache Arrow format.
  • Update the Spark jobs to read the response in Arrow format.
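
The caching idea in the list above can be illustrated with a minimal Python sketch. It is not the D4N implementation: the names `SelectCache`, `cache_key`, and the injected `backend` function are hypothetical, the in-memory dictionaries only stand in for the local cache and the global directory, and the Arrow conversion step is left out. It assumes a cached result can be identified by the bucket, the object key, and the SQL query text.

```python
import hashlib
from typing import Callable, Dict

# Hypothetical type: a function that runs an S3 Select query against the
# backing store (for example Ceph) and returns the raw result bytes.
SelectBackend = Callable[[str, str, str], bytes]


def cache_key(bucket: str, key: str, sql: str) -> str:
    """Derive a cache key from the object identity and the query text.

    Whitespace in the query is collapsed so trivially different spellings
    share one entry. This is a simplification: it would also collapse
    whitespace inside SQL string literals.
    """
    normalized_sql = " ".join(sql.split())
    raw = "\x00".join((bucket, key, normalized_sql))
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


class SelectCache:
    """Caches query results (object subsets) instead of whole objects."""

    def __init__(self, backend: SelectBackend) -> None:
        self._backend = backend
        self._store: Dict[str, bytes] = {}    # stands in for the local cache
        self.directory: Dict[str, str] = {}   # stands in for the global directory
        self.hits = 0
        self.misses = 0

    def select(self, bucket: str, key: str, sql: str) -> bytes:
        k = cache_key(bucket, key, sql)
        if k in self._store:
            # Hit: serve the subset without contacting the backend.
            self.hits += 1
            return self._store[k]
        # Miss: run the query on the backend, cache only the returned subset,
        # and record where the entry lives.
        self.misses += 1
        result = self._backend(bucket, key, sql)
        self._store[k] = result
        self.directory[k] = "local"
        return result


if __name__ == "__main__":
    def fake_backend(bucket: str, key: str, sql: str) -> bytes:
        # Placeholder for a real S3 Select request.
        return b"1,alice\n"

    cache = SelectCache(fake_backend)
    cache.select("demo-bucket", "people.csv", "SELECT * FROM s3object s")
    cache.select("demo-bucket", "people.csv", "SELECT * FROM s3object s")
    print(cache.hits, cache.misses)
```

Keying on the query as well as the object means two different queries over the same object occupy separate entries. That keeps the sketch simple, but it gives up any reuse between overlapping result sets.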
GitHub Link