Graph Compression — Part A: Introduction

Martin Wafula
6 min read · Sep 11, 2023


Over the last two decades, there has been a lot of interest in analysing large graphs that capture relationships and interactions between different data entities [1, 2]. These graphs have been used to model problems in machine learning, biological networks, social and information networks [3], transport networks, etc.

Compression of large graphs has caught the attention of both academia and industry as a way to reduce the number of input-output operations, thus speeding up graph analysis and reducing the amount of data communicated over the network [4]. Besta and Hoefler, in their survey on lossless compression of graphs, note that this has become one of the most timely problems facing the big data research community [2, 5].

In communication networks, inter-connected data is expected to grow exponentially with the data from the distributed wireless sensor networks (WSNs) and ad hoc wireless networks that form the Internet of Things (IoT) [6]. The large size of graphs, their dynamic nature and the properties associated with vertices and edges have made it challenging to design graph algorithms. The challenge is even greater for graph databases, where there is massive demand for low-latency, high-throughput graph queries [7]. Graphs of this size require substantial hardware resources to store, and running algorithms on them is very slow.

While there exists an enormous amount of literature on graph compression for different kinds of graphs, the literature on information-theoretic frameworks for graph compression is far less developed. Most of the available work on graph compression is heuristic. Besta and Hoefler summarise many of these techniques in their recent survey [5].

A note on Graph Compression

A simple graph model G = (V, E) consists of a set V of vertices (nodes) and a set E ⊆ V × V of edges. In this report, we let |V| = n be the number of nodes and |E| = m be the number of edges.
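As a concrete illustration, a simple graph can be stored as an adjacency set per node, with n = |V| and m = |E|. The following is a minimal Python sketch; the example vertices and edges are arbitrary, chosen only for illustration:

```python
# A simple undirected graph G = (V, E) stored as adjacency sets.
# The example graph is arbitrary, for illustration only.
V = {0, 1, 2, 3}
E = {(0, 1), (0, 2), (1, 2), (2, 3)}

adj = {v: set() for v in V}
for u, w in E:
    adj[u].add(w)
    adj[w].add(u)

n = len(V)  # number of nodes, |V|
m = len(E)  # number of edges, |E|
print(n, m)  # 4 4
```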

A few examples:

Information theory has been used in different areas of graph theory. In compression, entropy is an important measure: it gives a lower bound on the expected description length achievable by any lossless code.

In unlabelled graph compression, the graph is reconstructed without regard to the vertex labels. Choi and Szpankowski [8], while looking at the information conveyed by the graph structure alone, introduced the term structural entropy. The authors study the structures generated by the Erdős–Rényi model G(n, p) and prove that the structural entropy is given by:

H_S = (n choose 2) h(p) − n log n + O(n),

where n is the number of vertices and h(p) = −p log p − (1 − p) log(1 − p) is the binary entropy of the edge probability p; the n log n term is a reduction of entropy since we do not need to differentiate between isomorphic graphs [9].
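To make these quantities concrete, here is a small Python sketch (the parameter values are chosen arbitrarily) that evaluates the binary entropy h(p) and the leading terms (n choose 2) h(p) − n log n of the structural entropy, working in bits:

```python
import math

def binary_entropy(p):
    """h(p) = -p log2 p - (1 - p) log2(1 - p), in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def structural_entropy_leading(n, p):
    """Leading terms C(n, 2) h(p) - n log2 n of the structural entropy
    of an Erdos-Renyi graph G(n, p); the O(n) term is dropped."""
    return math.comb(n, 2) * binary_entropy(p) - n * math.log2(n)

print(binary_entropy(0.5))                   # 1.0
print(structural_entropy_leading(100, 0.5))  # C(100,2)*h(0.5) - 100*log2(100)
```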

The concept of graph entropy, which has been utilised, for instance, in the study of the topological structure, connectivity and complexity of networks, allows a quantitative characterisation of the information content of systems modelled by graphs [10]. Badiu and Coon [11] show that the graph entropy of a random geometric graph (RGG) is related to the entropy of graphs with fewer nodes: when normalised by the number of potential edges, it is non-increasing in the number of nodes, i.e.,

H(G_n) / (n choose 2) ≤ H(G_{n−1}) / (n−1 choose 2),

where n is the number of nodes of the graph. This result was used to give a series of upper bounds on the entropy of the RGG, that is,

H(G_n) ≤ [(n choose 2) / (k choose 2)] H(G_k) for any k ≤ n.
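As a small numerical illustration of a bound of this form, H(G_n) ≤ [(n choose 2) / (k choose 2)] H(G_k): given the entropy of a small k-node graph, one can extrapolate an upper bound for larger n. This is a Python sketch only, and the entropy value for the small graph is a made-up placeholder:

```python
import math

def rgg_entropy_upper_bound(n, k, H_k):
    """Upper bound H(G_n) <= [C(n,2) / C(k,2)] * H(G_k), which follows
    from the normalised entropy H(G_n)/C(n,2) being non-increasing in n."""
    return math.comb(n, 2) / math.comb(k, 2) * H_k

H_3 = 2.5  # hypothetical entropy (bits) of a 3-node graph, for illustration
print(rgg_entropy_upper_bound(10, 3, H_3))  # C(10,2)/C(3,2) * 2.5 = 37.5
```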

In [12], Lynn and Bassett answer two questions: how compressible is a network, and what features make one network more compressible than another? They find that compressibility increases with two common properties, transitivity and degree heterogeneity, and that hierarchical structure facilitates compression in complex networks. The network is modelled as an information source (a random walk on the graph) and then compressed using rate-distortion theory, with the distortion defined over network scales, making the compression lossy.
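To ground the "network as an information source" idea, here is a minimal Python sketch computing the entropy rate of a stationary random walk on an undirected graph — the baseline quantity such rate-based arguments start from. The example graph is arbitrary:

```python
import math

def random_walk_entropy_rate(adj):
    """Entropy rate (bits per step) of a stationary random walk on an
    undirected graph: H = -sum_i pi_i sum_j P_ij log2 P_ij,
    with stationary distribution pi_i = deg(i)/(2m) and P_ij = 1/deg(i)."""
    degrees = {v: len(nb) for v, nb in adj.items()}
    two_m = sum(degrees.values())  # sum of degrees = 2m
    H = 0.0
    for v, nb in adj.items():
        pi_v = degrees[v] / two_m
        p = 1.0 / degrees[v]  # uniform step to each neighbour
        for _ in nb:
            H -= pi_v * p * math.log2(p)
    return H

# Triangle graph: every step is a uniform choice between 2 neighbours.
triangle = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}
print(random_walk_entropy_rate(triangle))  # 1.0 bit per step
```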

Another example is the lossless compression of the stochastic block model (SBM), recently studied by E. Abbe, who computed the entropy of the SBM [9]. Abbe presents fundamental limits of lossless compression for different regimes; for example, for the regime where the prior p and the connectivity matrix W are constant (they do not scale with n), the entropy of the SBM was presented as:

H(SBM) = (1 + o(1)) (n choose 2) pᵀ H(W) p,

where n is the number of nodes, p is the prior on the k communities, W is a k × k matrix with entries in [0, 1] for the connectivity probabilities, and H(W) is the k × k matrix with (H(W))_{ij} = H(W_{ij}). The recent design by Bhatt et al. of a universal compressor that achieves the asymptotically optimal compression rate further contributes to the work on lossless compression of the SBM [13]. This universal graph compressor does not depend on the underlying edge distribution, the community labels or the number of communities of the SBM.
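As a numerical sketch of the quantity (n choose 2) pᵀ H(W) p, where (H(W))_{ij} is the binary entropy of W_{ij}: the Python below evaluates this leading term for a two-community example whose parameters are chosen arbitrarily, purely for illustration:

```python
import math

def h(p):
    """Binary entropy in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def sbm_entropy_leading(n, prior, W):
    """Leading term C(n,2) * p^T H(W) p of the SBM entropy, where
    H(W) applies the binary entropy h entrywise to W."""
    k = len(prior)
    return math.comb(n, 2) * sum(
        prior[i] * prior[j] * h(W[i][j]) for i in range(k) for j in range(k)
    )

# Two symmetric communities; connection probabilities are illustrative.
prior = [0.5, 0.5]
W = [[0.5, 0.1],
     [0.1, 0.5]]
print(sbm_entropy_leading(100, prior, W))
```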

[To be continued …]

References

[1] A. Eisenman, L. Cherkasova, G. Magalhaes, Q. Cai, and S. Katti, “Parallel graph processing on modern multi-core servers: New findings and remaining challenges,” in 2016 IEEE 24th International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS), pp. 49–58, 2016.

[2] O. Batarfi, R. E. Shawi, A. G. Fayoumi, R. Nouri, S.-M.-R. Beheshti, A. Barnawi, and S. Sakr, “Large scale graph processing systems: Survey and an experimental evaluation,” Cluster Computing, vol. 18, p. 1189–1213, Sept. 2015.

[3] L. V. S. Lakshmanan, “Social network analytics: Beyond the obvious,” in Biomedical Data Management and Graph Online Querying (F. Wang, G. Luo, C. Weng, A. Khan, P. Mitra, and C. Yu, eds.), (Cham), pp. 149–154, Springer International Publishing, 2016.

[4] M. Besta, S. Weber, L. Gianinazzi, R. Gerstenberger, A. Ivanov, Y. Oltchik, and T. Hoefler, “Slim graph: practical lossy graph compression for approximate graph processing, storage, and analytics,” in Proceedings of the International Conference for High-Performance Computing, Networking, Storage and Analysis, SC ’19, pp. 1–25, ACM, 2019.

[5] M. Besta and T. Hoefler, “Survey and taxonomy of lossless graph compression and space-efficient graph representations,” 2018.

[6] S. Heidari, Y. Simmhan, R. N. Calheiros, and R. Buyya, “Scalable graph processing frameworks: A taxonomy and open challenges,” ACM Comput. Surv., vol. 51, June 2018.

[7] M. Besta, E. Peter, R. Gerstenberger, M. Fischer, M. Podstawski, C. Barthels, G. Alonso, and T. Hoefler, “Demystifying graph databases: Analysis and taxonomy of data organization, system designs, and graph queries,” 2019.

[8] Y. Choi and W. Szpankowski, “Compression of graphical structures: Fundamental limits, algorithms, and experiments,” IEEE Transactions on Information Theory, vol. 58, no. 2, pp. 620–638, 2012.

[9] E. Abbe, “Graph compression: The effect of clusters,” in 2016 54th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pp. 1–8, 2016.

[10] J. Coon, C. Dettmann, and O. Georgiou, “Entropy of spatial network ensembles,” Physical Review E, vol. 97, 2018.

[11] M.-A. Badiu and J. Coon, “On the distribution of random geometric graphs,” pp. 2137–2141, Institute of Electrical and Electronics Engineers Inc., 2018.

[12] C. W. Lynn and D. S. Bassett, “Compressibility of complex networks,” 2020.

[13] A. Bhatt, C. Wang, L. Wang, and Z. Wang, “Universal graph compression: Stochastic block models,” 2020.


Martin Wafula

DPhil candidate in Engineering Science at the University of Oxford. My interests are in information theory, graph compression & network topology inference.