Workshop: Creating the SkyhookDM Ecosystem for the Computational I/O Stack


Date
Sep 29, 2023 1:00 PM — 2:20 PM
Location
University of California, Santa Cruz
1156 High St, Santa Cruz, CA 95064
This event is part of the 2023 UC Open Source Symposium, September 27-29, 2023 (this workshop’s link)

Hardware acceleration for computational I/O, that is, the integration of specialized computational devices into the I/O path, is one of the most promising technologies for further improving the performance and energy efficiency of analyzing high-volume and high-velocity datasets and streams. Despite the general availability of a number of devices such as Data Processing Units (DPUs, also known as SmartNICs) and Samsung’s SmartSSDs, the open source data science ecosystem lacks an open and shared computational I/O software stack. This gap hampers composability and innovation, and increases design cost. To address this, the Center for Research in Open Source Software (CROSS) launched Skyhook Data Management (SkyhookDM) to create open source blueprints for a computational I/O stack that can be adopted by industry. With seed funding from industry component makers, SkyhookDM had a promising start: a blueprint using the unmodified Ceph open source distributed storage system was contributed to Apache Arrow in 2022 and has been included in every release since v7.0.0. It serves as a use case for the SNIA Computational Storage TWG and has attracted world-leading experts from industry and national labs.

This workshop invites participants to help put together a roadmap for an open and shared computational I/O software stack ecosystem at UC Santa Cruz, following best practices in open source software techniques, strategies, and governance. We will discuss technical and organizational opportunities that build on readily available technologies and institutions.

Carlos Maltzahn
Adjunct Professor, Sage Weil Presidential Chair for Open Source Software, Founder & Director of CROSS, OSPO

My research interests include programmable storage systems, big data storage & processing, scalable data management, distributed systems performance management, and practical reproducible research.