DICE: Fast and Approximate Analytics over Big Data

A Distributed System for Exploratory Analytics over Large-Scale Datasets

Interactive ad-hoc analytics over large datasets has become an increasingly popular use case. We detail the challenges encountered when building a distributed system that allows the interactive exploration of a data cube. We introduce DICE, a distributed system that uses a novel session-oriented model for data cube exploration, designed to provide the user with interactive sub-second latencies for specified accuracy levels. A novel framework is provided that combines three concepts: faceted exploration of data cubes, speculative execution of queries, and query execution over subsets of data. We discuss design considerations, implementation details, and optimizations of our system. Experiments demonstrate that DICE provides a sub-second interactive cube exploration experience at the billion-tuple scale that is at least 33% faster than current approaches.

Publications

  • Roee Ebenstein, Niranjan Kamat, Arnab Nandi: FluxQuery: An Execution Framework for Highly Interactive Query Workloads: SIGMOD 2016
  • Niranjan Kamat, Arnab Nandi: A Closer Look at Variance Implementations In Modern Database Systems: SIGMOD Record 2015 [arXiv preview]
  • Eugene Wu, Arnab Nandi: Towards Perception-aware Interactive Data Visualization Systems: Data Systems for Interactive Analysis (DSIA) Workshop 2015
  • Arnab Nandi, Ziqi Huang, Man Cao, Micha Elsner, Lilong Jiang, Srinivasan Parthasarathy, Ramiya Venkatachalam: Interactive Tweaking of Text Analytics Dashboards: DNIS 2015 / Springer LNCS [pdf]
  • Prasanth Jayachandran, Niranjan Kamat, Karthik Tunga, Arnab Nandi: Combining User Interaction, Speculative Query Execution and Sampling in the DICE System: VLDB 2014 (demo) [pdf] [video]
  • Niranjan Kamat, Prasanth Jayachandran, Karthik Tunga, Arnab Nandi: Distributed Interactive Cube Exploration: ICDE 2014 [pdf] [slides]

a distributed system with three key concepts

DICE is designed to provide sub-second query response times for large datasets over medium-sized clusters. It does this by combining distributed aggregation with three key concepts: speculative execution, faceted exploration, and sampling.
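As a rough illustration of distributed aggregation, the sketch below computes an average by having each shard produce a partial (sum, count) pair that a coordinator then merges. The shard layout and the names `partial_aggregate` and `combine` are assumptions made for this example; they are not taken from the DICE implementation.

```python
from typing import Iterable, List, Tuple

# A partial aggregate that can be merged across shards: (sum, count).
Partial = Tuple[float, int]


def partial_aggregate(shard: Iterable[float]) -> Partial:
    """Compute the local (sum, count) for one shard (hypothetical worker step)."""
    total, count = 0.0, 0
    for value in shard:
        total += value
        count += 1
    return total, count


def combine(partials: List[Partial]) -> float:
    """Merge per-shard partials into a global average (hypothetical coordinator step)."""
    total = sum(p[0] for p in partials)
    count = sum(p[1] for p in partials)
    if count == 0:
        raise ValueError("cannot average zero rows")
    return total / count


# Three in-memory lists stand in for data held on three cluster nodes.
shards = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]
average = combine([partial_aggregate(s) for s in shards])
```

Shipping (sum, count) pairs rather than raw rows keeps the data sent to the coordinator small regardless of shard size.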

speculative execution

Since queries are part of a session, DICE treats the time the user spends perusing the results of the current query as an opportunity to speculatively execute and cache the most likely followup queries. The result for the followup user query can then be returned quickly from the result cache.
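The idea can be sketched as a small result cache that is filled during the user's think time. This is a minimal single-process illustration, not the DICE implementation: the class `SpeculativeCache`, the `budget` parameter, and the toy `slow_query` function are all assumptions introduced for the example.

```python
from typing import Callable, Dict, Hashable, Iterable, List


class SpeculativeCache:
    """Caches results of likely followup queries (hypothetical helper)."""

    def __init__(self, run_query: Callable[[Hashable], object]):
        self.run_query = run_query
        self.cache: Dict[Hashable, object] = {}
        self.hits = 0
        self.misses = 0

    def speculate(self, candidates: Iterable[Hashable], budget: int) -> None:
        """Execute up to `budget` uncached candidates, in the order given.

        Callers are assumed to pass candidates ordered most likely first.
        """
        executed = 0
        for query in candidates:
            if executed >= budget:
                break
            if query not in self.cache:
                self.cache[query] = self.run_query(query)
                executed += 1

    def answer(self, query: Hashable) -> object:
        """Serve from the cache when possible; otherwise run the query now."""
        if query in self.cache:
            self.hits += 1
            return self.cache[query]
        self.misses += 1
        result = self.run_query(query)
        self.cache[query] = result
        return result


# Toy backend that records which queries were actually executed.
calls: List[str] = []


def slow_query(query: str) -> int:
    calls.append(query)
    return len(query)


cache = SpeculativeCache(slow_query)
# While the user reads the current result, run the two likeliest followups.
cache.speculate(["a", "bb", "ccc"], budget=2)
# The user's next query was predicted, so it is served without re-execution.
result = cache.answer("bb")
```

The budget stands in for the limited think time available between user queries; a real system would also need a policy for ranking candidates and evicting stale results.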

faceted exploration

When speculating, a key challenge is the large number of possible followup queries. To bound this number, faceted traversals restrict speculation and guide the user to the next query. With fewer options, we can better predict what the user will query next. Facets are both intuitive and expressive -- they allow for fast exploration, but also allow the user to explore the entire data cube.
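To make the bounding effect concrete, the sketch below enumerates the followups reachable from a query by a single traversal step. The three traversal kinds used here (dropping a filter, changing a filter value, adding a filter), the query representation, and the names `make_query` and `followups` are illustrative assumptions; the text above does not specify which traversals DICE supports.

```python
from typing import Dict, List, Tuple

# A query is modeled as a sorted tuple of (dimension, value) filters.
Query = Tuple[Tuple[str, str], ...]


def make_query(filters: Dict[str, str]) -> Query:
    """Build a hashable, order-independent query from a filter dict."""
    return tuple(sorted(filters.items()))


def followups(query: Query, domains: Dict[str, List[str]]) -> List[Query]:
    """List queries one traversal step away from `query` (assumed traversals)."""
    current = dict(query)
    out: List[Query] = []
    # Drop one filter (roll up).
    for dim in current:
        out.append(make_query({d: v for d, v in current.items() if d != dim}))
    # Change the value of one filter (move sideways).
    for dim, value in current.items():
        for other in domains[dim]:
            if other != value:
                out.append(make_query({**current, dim: other}))
    # Add a filter on a dimension that is not yet filtered (drill down).
    for dim, values in domains.items():
        if dim not in current:
            for value in values:
                out.append(make_query({**current, dim: value}))
    return out


# Hypothetical dimensions and values, invented for the example.
domains = {
    "region": ["east", "west"],
    "year": ["2013", "2014"],
    "product": ["a", "b", "c"],
}
start = make_query({"region": "east", "year": "2014"})
candidates = followups(start, domains)
```

For this toy cube the candidate set has 7 entries (2 roll-ups, 2 sideways moves, 3 drill-downs), which is the set a speculative executor would need to rank and prefetch.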

fast response times using sampling

For extremely large datasets, it is not possible to execute queries over the entire dataset within interactive response times. Hence, queries are run on a sample of the data, trading some accuracy for lower latency.
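A minimal sketch of sampling-based estimation follows, assuming a uniform random sample without replacement and a normal-approximation error margin. The text does not describe the sampling scheme DICE actually uses, so the function `approximate_mean` and its parameters `sample_size`, `z`, and `seed` are assumptions made for illustration.

```python
import math
import random
import statistics
from typing import Sequence, Tuple


def approximate_mean(
    values: Sequence[float], sample_size: int, z: float = 1.96, seed: int = 0
) -> Tuple[float, float]:
    """Estimate the mean from a uniform sample.

    Returns (estimate, margin), where margin is z times the standard error,
    i.e. roughly a 95% interval for z = 1.96 under a normal approximation.
    """
    rng = random.Random(seed)
    sample = rng.sample(values, sample_size)  # without replacement
    estimate = statistics.fmean(sample)
    std_error = statistics.stdev(sample) / math.sqrt(sample_size)
    return estimate, z * std_error


# A range stands in for a large column; only 1,000 of 100,000 values are read.
data = range(100_000)
estimate, margin = approximate_mean(data, sample_size=1_000)
true_mean = (len(data) - 1) / 2
```

Since the standard error shrinks with the square root of the sample size, quadrupling the sample roughly halves the margin; this is the accuracy versus latency trade-off a user-specified accuracy level would control.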