Horae: A Graph Stream Summarization Structure for Efficient Temporal Range Query

Horae is a graph stream summarization structure for efficient temporal range queries. It handles temporal queries over arbitrary and elastic ranges while guaranteeing one-sided and controllable errors. Horae provides a worst-case query time of O(log L), where L is the length of the query range. It leverages multi-layer storage and the Binary Range Decomposition (BRD) algorithm to decompose a temporal range query into a logarithmic number of time interval queries and executes these queries in the corresponding layers.

Introduction

A graph stream represents an evolving graph formed as a time-ordered sequence of elements (updated edges) arriving through a continuous stream. Each element in a graph stream is formally denoted as (s_i, d_i, w_i, t_i) (i ≥ 0), meaning that the directed edge s_i → d_i of a graph G = (V, E) (s_i ∈ V, d_i ∈ V, s_i → d_i ∈ E) is produced at time t_i with a weight value w_i. An edge can appear multiple times at different time instants with different weights. Such a general data form is widely used in big data applications, such as user behavior analysis in social networks, close contact tracking in epidemic prevention, and vehicle surveillance in smart cities.

Real-world big data applications can create tremendously large-scale graph stream data. The enormous data scale makes the management of graph streams extremely challenging, especially with respect to (1) storing the continuously produced, large-scale datasets, and (2) supporting queries relevant to both graph topology and temporal information. To address these issues, recent research has mainly focused on graph stream summarization techniques, which aim to achieve practical storage and support various queries relevant to graph topology at the cost of a slight loss in accuracy.

However, existing summarization structures are unable to store the temporal information in a graph stream and thus fail to support temporal queries. In this work, we propose Horae, a novel graph stream summarization structure that efficiently supports temporal range queries. By exploiting a time prefix embedded multi-layer summarization structure, Horae can effectively handle a temporal range query of an arbitrary range length L with a worst-case query processing time of O(log L). The basic idea of Horae's time prefix embedded multi-layer summarization structure is as follows.

An arbitrary temporal range of length L can be decomposed into at most 2 log L sub-ranges, where all the time points in each sub-range have the same binary code prefix. For example, [t₈, t₁₃] = [t₈, t₁₁] + [t₁₂, t₁₃], where all the time points between t₈ (i.e., 1000) and t₁₁ (i.e., 1011) have the common prefix '10', while all the time points between t₁₂ (i.e., 1100) and t₁₃ (i.e., 1101) have the common prefix '110'. Here, we define the prefix size as the number of binary digits in the common prefix (e.g., the prefix size of '10' is two while that of '110' is three).
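The prefix example above can be checked with a short Python sketch. The function name `common_prefix` and the fixed bit width are illustrative assumptions, not part of the Horae code base.

```python
def common_prefix(lo: int, hi: int, bits: int) -> str:
    """Longest common binary prefix shared by every time point in [lo, hi].

    Comparing only the two endpoints is enough: any integer between them
    shares the prefix that the endpoints share.
    """
    a = format(lo, f"0{bits}b")
    b = format(hi, f"0{bits}b")
    n = 0
    while n < bits and a[n] == b[n]:
        n += 1
    return a[:n]


print(common_prefix(8, 11, 4))   # 10   (prefix size 2)
print(common_prefix(12, 13, 4))  # 110  (prefix size 3)
```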

A Horae structure contains l = ⌈ log₂(t_u + 1) ⌉ + 1 layers, where t_u is the current time point of the graph stream. To cope with the unbounded time dimension, the number of layers in Horae increases dynamically as t_u grows. Horae arranges the layers according to different prefix sizes. Each layer uses a matrix to store the complete graph stream data aggregated by the sub-ranges with the same prefix size. For example, with t_u = t₇, Horae has four layers. The first layer contains eight sub-ranges {[t₀] ([0000]), [t₁] ([0001]), ..., [t₇] ([0111])} with a prefix size of four; the second layer contains four sub-ranges {[t₀, t₁] ([0000, 0001]), [t₂, t₃] ([0010, 0011]), ..., [t₆, t₇] ([0110, 0111])} with a prefix size of three, and so on. Formally, the p^th layer aggregates the graph stream data by the sub-ranges {[q ⋅ 2^{p - 1}, (q + 1) ⋅ 2^{p - 1} - 1] (q ≥ 0)} with prefix size l - p + 1. The sub-ranges of each layer have the same prefix size, i.e., the same range length. In a nutshell, each layer of the structure represents a summarization of the graph stream at a carefully selected granularity (corresponding to a prefix size). During the construction of each layer, Horae combines each edge with the time prefix of the corresponding size to insert the edge information into the corresponding matrix. Similarly, Horae combines an edge/node with the sub-range prefix to perform a sub-range query.
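As a rough illustration of the layer count and the per-layer time prefix, consider the following Python sketch. The helper names and the tuple used as a key are hypothetical; how the real implementation combines an edge with its prefix before hashing into the matrix is not specified here.

```python
def num_layers(t_u: int) -> int:
    """l = ceil(log2(t_u + 1)) + 1, computed with integer arithmetic."""
    return t_u.bit_length() + 1


def layer_prefix(t: int, p: int, l: int) -> str:
    """Binary time prefix (of size l - p + 1) that layer p uses for time point t."""
    q = t >> (p - 1)  # index of the sub-range of layer p that contains t
    return format(q, f"0{l - p + 1}b")


def layer_keys(s: str, d: str, t: int, l: int):
    """Hypothetical keys: one (layer, prefix, source, destination) tuple per layer."""
    return [(p, layer_prefix(t, p, l), s, d) for p in range(1, l + 1)]


l = num_layers(7)
print(l)  # 4
for key in layer_keys("a", "b", 6, l):
    print(key)
# (1, '0110', 'a', 'b')
# (2, '011', 'a', 'b')
# (3, '01', 'a', 'b')
# (4, '0', 'a', 'b')
```

An element at time t₆ (0110) therefore contributes to one sub-range in every layer, each identified by a shorter prefix of the same binary code.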

To efficiently evaluate temporal range queries on top of the Horae structure, we further design a novel Binary Range Decomposition (BRD) algorithm. The BRD algorithm decomposes a temporal range query of arbitrary length L into O(log L) sub-range queries against different layers of the structure. Therefore, Horae reduces the query processing time to a logarithmic scale. Experimental results show that Horae reduces the latency of temporal range queries by two to three orders of magnitude compared to existing designs.

Horae Structure

The Horae structure contains l = ⌈ log₂(T_u + 1) ⌉ + 1 layers. It starts with a single layer and creates one new layer whenever the current time slice of the graph stream reaches the next power of two (T_u = 2ⁱ, i ≥ 0). Horae arranges the layers based on different prefix sizes. Each layer aggregates the complete graph stream data by a corresponding prefix size. In a snapshot with T_u = T₃, the 1^st layer contains four sub-ranges {[T₀] (000), [T₁] (001), ..., [T₃] (011)}, all with a prefix size of three; the 2^nd layer contains two sub-ranges {[T₀, T₁] (000, 001), [T₂, T₃] (010, 011)}, both with a prefix size of two; the 3^rd layer contains one sub-range {[T₀, T₃] (000, 011)} with a prefix size of one. Formally, the p^th layer aggregates the graph stream data by the sub-ranges {[0, 2^{p - 1} - 1], [2^{p - 1}, 2 ⋅ 2^{p - 1} - 1], ...} with the same prefix size of l - p + 1. All the sub-ranges of the p^th layer have the same range length 2^{p - 1}. We define this common range length as the granularity of the p^th layer. In short, the p^th layer of Horae represents a graph stream summarization of granularity 2^{p - 1}.
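The layer layout described above can be enumerated with a small Python sketch. The function `layer_windows` is an illustrative helper, not a function from the repository; it lists the sub-ranges of a layer that have received data up to T_u (the last one may extend past T_u).

```python
def layer_windows(p: int, t_u: int):
    """Sub-ranges (start, end) of layer p that cover time slices 0..t_u."""
    size = 1 << (p - 1)  # granularity of layer p
    return [(q * size, (q + 1) * size - 1) for q in range(t_u // size + 1)]


t_u = 3
l = t_u.bit_length() + 1  # ceil(log2(t_u + 1)) + 1
for p in range(1, l + 1):
    print(p, layer_windows(p, t_u))
# 1 [(0, 0), (1, 1), (2, 2), (3, 3)]
# 2 [(0, 1), (2, 3)]
# 3 [(0, 3)]

# The layer count grows by one each time T_u reaches a power of two (1, 2, 4, 8).
print([t.bit_length() + 1 for t in range(9)])  # [1, 2, 3, 3, 4, 4, 4, 4, 5]
```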

Binary Range Decomposition

We design the BRD algorithm to quickly decompose an arbitrary time range query into multiple window queries on different layers. For any time range of length L, we can find an m that satisfies 2^m ≤ L < 2^{m + 1} (i.e., m = ⌊ log₂ L ⌋). The granularity of the (m + 1)^th layer is 2^m. BRD is based on the following observation: if the time range aligns with one side of a window in the (m + 1)^th layer, decomposing the range is equivalent to decomposing its length, i.e., converting L to binary form.
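The observation can be illustrated for the left-aligned case with a minimal Python sketch. It only covers ranges whose start is a multiple of 2^m; the function name is hypothetical, and the right-aligned case would mirror it by consuming the bits of L from the other end.

```python
def decompose_left_aligned(t_b: int, length: int):
    """Split [t_b, t_b + length - 1] by reading the binary form of `length`.

    Assumes t_b is a multiple of 2**m with m = floor(log2(length)), i.e. the
    range starts on the left edge of a window in layer m + 1.
    """
    m = length.bit_length() - 1
    assert t_b % (1 << m) == 0, "range must be left-aligned"
    pieces = []
    start = t_b
    for k in range(m, -1, -1):          # most significant bit first
        if length & (1 << k):
            pieces.append((start, start + (1 << k) - 1))
            start += 1 << k
    return pieces


# L = 6 = 110 in binary, so the range splits into pieces of length 4 and 2.
print(decompose_left_aligned(8, 6))  # [(8, 11), (12, 13)]
```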

For convenience, we call a sub-range of each layer a window. We use the notation W_p^q to denote the window with prefix q in the p^th layer, and more generally W_p^q = [q ⋅ 2^{p - 1}, (q + 1) ⋅ 2^{p - 1} - 1]. For example, we have W_2^2 = [T₄, T₅]. For a window [T_i, T_j] of length N, [T_i, T_j] = W_{log₂N + 1}^{⌊T_i/N⌋} = W_{log₂N + 1}^{⌊T_j/N⌋}. The windows of the p^th and (p + 1)^th layers have the following relationship: two adjacent windows can be merged into a larger window of the upper layer. Formally, W_p^{2i} + W_p^{2i + 1} = W_{p + 1}^{i} (i ≥ 0), where W_p^{2i} and W_p^{2i + 1} are called sibling windows.
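The window notation translates directly into code. The following Python sketch uses illustrative helper names (`window`, `window_of`) to check the example and the sibling relationship.

```python
def window(p: int, q: int):
    """Bounds of W_p^q = [q * 2^(p-1), (q+1) * 2^(p-1) - 1]."""
    size = 1 << (p - 1)
    return (q * size, (q + 1) * size - 1)


def window_of(t_i: int, t_j: int):
    """Layer p and prefix q of a window [t_i, t_j] whose length N is a power of two."""
    n = t_j - t_i + 1
    p = n.bit_length()  # equals log2(N) + 1 when N is a power of two
    return (p, t_i // n)


print(window(2, 2))     # (4, 5)
print(window_of(4, 5))  # (2, 2)

# Sibling windows W_p^{2i} and W_p^{2i+1} merge into W_{p+1}^{i}.
left, right, parent = window(2, 2), window(2, 3), window(3, 1)
print(left[0] == parent[0] and right[1] == parent[1] and left[1] + 1 == right[0])  # True
```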

Given an arbitrary query range [T_b, T_e] with the length L = T_e - T_b + 1, we formalize the decomposition result as follows: DE([T_b, T_e]) = W_{p_1}^{q_1} + W_{p_2}^{q_2} + ... + W_{p_f}^{q_f}, where W_{p_i}^{q_i} and W_{p_j}^{q_j} cannot be sibling windows for any two i, j ∈ [1, f] with i ≠ j. Here, we identify an important property of the window: a time range [T_i, T_j] of length N (N = 2^k, k ≥ 0) is a window if and only if (T_j + 1) % N = 0.
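The window property and the shape of a decomposition result can be sketched in Python as follows. Note that `brd` below is a simple greedy dyadic decomposition used as a stand-in: it yields a result with no sibling windows, but it is not necessarily the case-based procedure used by the authors. All names are illustrative.

```python
def is_window(t_i: int, t_j: int) -> bool:
    """A range of length N = 2^k is a window iff (t_j + 1) % N == 0."""
    n = t_j - t_i + 1
    return n > 0 and n & (n - 1) == 0 and (t_j + 1) % n == 0


def brd(t_b: int, t_e: int):
    """Greedy decomposition of [t_b, t_e] into windows, returned as (p, q) pairs."""
    result = []
    s = t_b
    while s <= t_e:
        size = 1
        # Grow the window while it stays aligned and inside the query range.
        while s % (size * 2) == 0 and s + size * 2 - 1 <= t_e:
            size *= 2
        result.append((size.bit_length(), s // size))  # layer p, prefix q
        s += size
    return result


print(is_window(4, 5), is_window(5, 6))  # True False
print(brd(0, 6))   # [(3, 0), (2, 2), (1, 6)]  i.e. [T0,T3] + [T4,T5] + [T6]
print(brd(8, 13))  # [(3, 2), (2, 6)]          i.e. [T8,T11] + [T12,T13]
```

Because the greedy step always takes the largest aligned window that fits, two siblings can never both appear in the output: their parent would have been chosen instead.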

For an arbitrary time range, it may align perfectly with a window, align with one side of a window, or not align with any window in the (m + 1)^th layer. Accordingly, we summarize the following three cases and apply different strategies. The figure above shows examples of the three cases (4 ≤ L < 8 and m = 2).

Optimization

To reduce the memory cost, we propose the compacted Horae. In Horae, the matrix of each layer stores the complete graph stream data and has the same size. In this way, Horae can achieve a logarithmic scale query processing time. To further improve the space efficiency, the compacted Horae stores only part of the data. The above figure shows the scheme. The compacted Horae still has l layers. The first layer stores all the window data as before, and the other layers only maintain the windows with odd prefixes, i.e., each layer except the first stores nearly half of the graph stream data. We set the size of M_p (p > 1) to half of that of M₁, which reduces the memory cost by about half. For convenience, we use Horae-cpt to represent the compacted Horae.
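The memory saving can be made explicit with a short calculation. It assumes, as stated above, that every layer of Horae uses a matrix of the same size |M_1| and that Horae-cpt halves the matrices of all layers except the first; the symbol Mem is introduced here only for illustration.

```latex
\begin{aligned}
\mathrm{Mem}_{\text{Horae}} &= l \cdot |M_1|, \\
\mathrm{Mem}_{\text{Horae-cpt}} &= |M_1| + (l - 1) \cdot \frac{|M_1|}{2} = \frac{l + 1}{2}\,|M_1|, \\
\frac{\mathrm{Mem}_{\text{Horae-cpt}}}{\mathrm{Mem}_{\text{Horae}}} &= \frac{l + 1}{2l} \;\longrightarrow\; \frac{1}{2} \quad (l \to \infty).
\end{aligned}
```

Under these assumptions the ratio is slightly above one half for a finite number of layers and approaches one half as l grows.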

Insertion in Horae-cpt works as follows. For each element e_i, if it belongs to a window with an even prefix in the p^th (p > 1) layer, Horae-cpt does not perform the insert operation in this layer. Otherwise, it inserts the element into the corresponding layer as Horae does. Horae-cpt performs the extend operation only when the first element of T_i (i = 2^l) appears. Moreover, Horae-cpt does not perform the data copy operation because the data of the upper layer is not complete.
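The insert rule can be sketched as a filter over the layers. This Python fragment only decides which layers receive an element with time slice t; the actual matrix update is omitted, and the function name is hypothetical.

```python
def cpt_insert_layers(t: int, l: int):
    """Layers of a compacted structure with l layers that store time slice t."""
    layers = []
    for p in range(1, l + 1):
        q = t >> (p - 1)           # prefix of the window of layer p containing t
        if p > 1 and q % 2 == 0:
            continue               # even-prefix windows are skipped above layer 1
        layers.append(p)
    return layers


print(cpt_insert_layers(6, 4))  # [1, 2, 3]
print(cpt_insert_layers(5, 4))  # [1, 3]
print(cpt_insert_layers(0, 4))  # [1]
```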

Horae-cpt also decomposes the time range into multiple windows of different layers and then executes the window queries in the corresponding layers. The decomposition process consists of two rounds. Given a time range, we first decompose it into multiple windows by performing BRD once. In the second round, we decompose the windows (length > 1) with even prefixes. Suppose [T_s, T_e] (T_e > T_s) is a window with an even prefix. We divide it into [T_s] and [T_{s + 1}, T_e] and then perform BRD on [T_{s + 1}, T_e]. Note that the resulting windows of DE([T_{s + 1}, T_e]) all have odd prefixes. The above figure shows an example in detail. In the first round, DE([T₀, T₆]) = [T₀, T₃] + [T₄, T₅] + [T₆]. In the second round, [T₀, T₃] is decomposed into [T₀], [T₁], and [T₂, T₃], while [T₄, T₅] is decomposed into [T₄] and [T₅]. Every window produced by the first round may need binary range decomposition again. Hence, Horae-cpt achieves O(log² L) query latency in the worst case. Horae-cpt achieves a good trade-off between memory cost and query performance.
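The two-round process can be sketched in Python as follows. The first-round `brd` here is a simple greedy dyadic decomposition standing in for the authors' BRD, and both function names are illustrative.

```python
def brd(t_b: int, t_e: int):
    """Greedy decomposition of [t_b, t_e] into windows given as (start, end)."""
    result = []
    s = t_b
    while s <= t_e:
        size = 1
        while s % (size * 2) == 0 and s + size * 2 - 1 <= t_e:
            size *= 2
        result.append((s, s + size - 1))
        s += size
    return result


def decompose_cpt(t_b: int, t_e: int):
    """Two-round decomposition: split even-prefix windows of length > 1 again."""
    result = []
    for s, e in brd(t_b, t_e):                # round 1
        size = e - s + 1
        if size > 1 and (s // size) % 2 == 0:
            result.append((s, s))             # [T_s] is answered by layer 1
            result.extend(brd(s + 1, e))      # round 2 on [T_{s+1}, T_e]
        else:
            result.append((s, e))
    return result


print(decompose_cpt(0, 6))
# [(0, 0), (1, 1), (2, 3), (4, 4), (5, 5), (6, 6)]
```

The output reproduces the example above: every window longer than one time slice ends up with an odd prefix, and all single-slice windows are served by the first layer.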

Evaluation Result

Here, we show some evaluation results on the dataset caida. This dataset contains partial anonymized passive traffic traces from CAIDA's equinox-Chicago monitor in 2015. A node represents an IP address and an edge denotes a communication. The weight of each edge represents the size of the transmitted data. The collected eleven-minute trace contains 2,121,486 nodes and 403,436,907 edges. We set gl to 50 ms and obtain 13,200 time slices.

Temporal edge weight queries. Figure (1) shows the average query time of temporal edge weight queries for the dataset caida. Horae and Horae-cpt both reduce the average latency by over two orders of magnitude compared with GSS and TCM.

Temporal edge existence queries. We examine the non-existence queries where the edges do not appear within the corresponding time ranges. Figure (2) shows the FPR of the non-existence queries for the dataset caida. The FPR increases significantly for GSS and TCM as L increases. In contrast, it remains stable in Horae.

Temporal node queries. We use the temporal node-out (resp. node-in) query to represent the temporal source (resp. destination) node query. Figure (3) plots the average time of temporal node-in and node-out queries with different lengths for the dataset caida. Horae reduces the latency of TCM by three to four orders of magnitude when the range length is large, and it reduces that of GSS even more. The latency of Horae-cpt is slightly larger than that of Horae. Figure (4) illustrates the ARE of the temporal node queries with different lengths. Horae reduces the ARE of GSS by almost two orders of magnitude for both the temporal node-out queries and the temporal node-in queries. The ARE of the temporal node queries of TCM is high because it lacks a collision avoidance mechanism. The ARE of the node queries in Horae-cpt is close to that in Horae.

Memory cost. In the experiment, we configure Horae, GSS, and TCM with the same amount of memory. The result in Figure (5) shows that Horae-cpt reduces the memory cost of Horae by 50%.

Insert throughput. Figure (6) shows the insert throughput of GSS, TCM, and Horae for the four datasets. Horae-para and Horae-seq denote the parallel insert operation and the sequential insert operation, respectively. The parallel insert operation greatly improves the throughput compared to the sequential one. The throughput of Horae-cpt-para is close to those of GSS and TCM.

Source Code

The source code of Horae is in the folder "Horae". The "readme.md" file in that folder shows how to build and run the programs horae and horae-compacted. We also provide the baseline code: tcm+timeslice (in folder TCM+Timeslice), gss+timeslice (in folder GSS+Timeslice), and a segment tree version (in folder DynamicSegmentTree). Each folder contains a "readme.md" file that explains how to build and run the code.

Publications

For more details, please refer to the following paper:

Ming Chen, Renxiang Zhou, Hanhua Chen, Jiang Xiao, Hai Jin, Bo Li. Horae: A Graph Stream Summarization Structure for Efficient Temporal Range Query. In Proceedings of the 38th IEEE International Conference on Data Engineering (ICDE 2022), Virtual Event, Kuala Lumpur, Malaysia, May 9-12, 2022.

Authors and Copyright

Horae is developed at the National Engineering Research Center for Big Data Technology and System, Cluster and Grid Computing Lab, Services Computing Technology and System Lab, School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, China, by Ming Chen, Renxiang Zhou, Hanhua Chen, Jiang Xiao, and Hai Jin.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

  http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
