NG4S: A Next-generation Geo-distributed Scalable Stateful Stream Processing System
Project Summary
Our society increasingly relies on applications that process streaming data across geo-distributed sites, for example to make business decisions from marketing data, identify spam campaigns in social network streams, and analyze genome datasets held in different labs and countries to track the sources of potential epidemics. Existing systems are mainly designed for stateless stream processing in intra-datacenter settings and do not scale well for stream applications that maintain large distributed state. This project breaks the traditional abstractions of a centralized architecture and hashtable-based stateless operators, redefining them with a new decentralized architecture and new memory-efficient stateful operators. These new abstractions enable novel approaches to improving overall system performance and scalability.
This project includes three primary research directions. (1) At the architecture layer, a new decentralized ‘many masters/many workers’ architecture is proposed to give each master maximum independence. (2) At the operator layer, a new in-memory data structure is proposed to store application state with minimal memory overhead. (3) At the mechanism layer, a new shard-based parallel recovery mechanism is proposed to handle failures and stragglers. All three parts of the project will be prototyped and implemented on real-world stream processing systems.
Participants
Principal Investigators
Dr. Liting Hu, Assistant Professor, UC Santa Cruz and Virginia Tech