NG4S: A Next-generation Geo-distributed Scalable Stateful Stream Processing System

Project Summary

Our society increasingly relies on applications that process streaming data across geo-distributed sites, for tasks such as making business decisions from marketing data, identifying spam campaigns in social network streams, and analyzing genome datasets across labs and countries to track the sources of potential epidemics. Existing systems are designed mainly for stateless stream processing within a single datacenter and do not scale well for stream applications that maintain large distributed state. This project breaks the traditional abstractions of a centralized architecture and hashtable-based stateless operators, redefining them with a new decentralized architecture and new memory-efficient stateful operators, which together enable novel approaches to improving overall system performance and scalability.

This project includes three primary research directions. (1) At the architecture layer, a new decentralized ‘many masters/many workers’ architecture is proposed to provide each master with maximum independence. (2) At the operator layer, a new in-memory data structure is proposed that stores application state with minimal memory overhead. (3) At the mechanism layer, a new shard-based parallel recovery mechanism is proposed to handle failures and stragglers. All three parts of the project will be prototyped and implemented on real-world stream processing systems.
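
To make the third direction more concrete, the following is a minimal sketch of the general idea behind shard-based parallel recovery: operator state is split into shards, and after a failure the shards are fetched concurrently and merged. It is an illustration under stated assumptions, not the project's implementation. The function names (split_into_shards, recover_state), the hash-based partitioning, and the use of local threads in place of network fetches from remote nodes are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_shards(state, num_shards):
    """Partition a key-value state dict into num_shards smaller dicts by key hash."""
    shards = [{} for _ in range(num_shards)]
    for key, value in state.items():
        shards[hash(key) % num_shards][key] = value
    return shards


def recover_state(shard_stores):
    """Fetch every shard concurrently and merge them back into one state dict.

    shard_stores is a list of zero-argument callables. Each one stands in
    (hypothetically) for a remote fetch of one shard from the node holding it.
    """
    with ThreadPoolExecutor(max_workers=max(1, len(shard_stores))) as pool:
        fetched = list(pool.map(lambda fetch: fetch(), shard_stores))
    recovered = {}
    for shard in fetched:
        recovered.update(shard)
    return recovered


# Operator state before the simulated failure: per-word counts.
state = {"spam": 3, "ham": 7, "eggs": 1, "tea": 4}
shards = split_into_shards(state, num_shards=3)

# Each shard is "held" by a different node; a closure simulates the fetch.
shard_stores = [lambda s=s: dict(s) for s in shards]
restored = recover_state(shard_stores)
```

Because each shard is fetched independently, recovery time in such a scheme is bounded by the slowest shard transfer rather than by the size of the whole state, which is the intuition behind recovering in parallel.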

Participants

Principal Investigators

Members

  • Hailu Xu, Ph.D. student-graduated, now tenure-track Assistant Professor of Computer Science at California State University, Long Beach
  • Pinchao Liu, Ph.D. student-graduated, now Research Scientist at Facebook
  • Boyuan Guan, Ph.D. student, Florida International University
  • Susana Cruz-Diaz, B.S. student-graduated, now Software Engineer Associate at Lockheed Martin
  • Ulises Fernandez, B.S. student, supported by NSF REU program

Publications

Presentations

Pinchao Liu, SR3: Customizable Recovery for Stateful Stream Processing Systems
Middleware’20
[Slides]

Outreach

Curriculum Development Activities

  • Course 6913: Advanced Topics in Information Processing [Video Playlist]

Stream Processing Tutorials

  • Apache Storm Tutorial [Video]
  • Apache Flink Tutorial [Video]
  • Encryption of Datasets in Storm [Video]

Acknowledgement

This material is based upon work supported by the National Science Foundation under Grant NSF-SPX-2202859.