Multiset Bloom Filter: Hadoop and Spark implementations
In this project, we propose a solution to the multiple-set matching problem by defining a multi-set Bloom Filter. A multi-set Bloom Filter is a space-efficient probabilistic data structure for checking the membership of an element in multiple sets. The data structure presented here is associated with multiple sets of data and supports building the filters for all of the sets through an efficient construction operation. The IMDb dataset of movie ratings is used as a reference for building the Bloom Filter and assessing its performance. An implementation based on the MapReduce paradigm is presented, using the Hadoop and Spark frameworks.
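To make the idea concrete, the following is a minimal single-machine sketch in plain Python, not the project's Hadoop or Spark code. It assumes one Bloom filter per set, with each set keyed by a rounded rating; that keying, the class and function names, the SHA-256 double-hashing scheme, and the sample IDs are all illustrative assumptions. The sizing uses the standard formulas m = -n ln(p) / (ln 2)^2 and k = (m / n) ln 2.

```python
import hashlib
import math


class BloomFilter:
    """Plain Bloom filter sized for n expected items and false-positive rate p."""

    def __init__(self, expected_items, false_positive_rate):
        n = max(1, expected_items)
        # Standard sizing: m bits and k hash functions.
        self.m = max(1, math.ceil(-n * math.log(false_positive_rate) / math.log(2) ** 2))
        self.k = max(1, round(self.m / n * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, item):
        # Assumption: derive k positions from one SHA-256 digest (double hashing).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


class MultiSetBloomFilter:
    """Hypothetical wrapper: one Bloom filter per set key."""

    def __init__(self, set_sizes, false_positive_rate=0.01):
        self.filters = {key: BloomFilter(n, false_positive_rate) for key, n in set_sizes.items()}

    def add(self, key, item):
        self.filters[key].add(item)

    def matching_sets(self, item):
        # Keys of every set whose filter reports the item as possibly present.
        return [key for key, bf in self.filters.items() if item in bf]


def build(records, false_positive_rate=0.01):
    """Two passes over (item, key) pairs: count set sizes, then fill the filters."""
    sizes = {}
    for _, key in records:
        sizes[key] = sizes.get(key, 0) + 1
    msbf = MultiSetBloomFilter(sizes, false_positive_rate)
    for item, key in records:
        msbf.add(key, item)
    return msbf


# Made-up (movie_id, rounded_rating) pairs, for illustration only.
records = [("tt0000001", 6), ("tt0000002", 6), ("tt0000003", 7)]
msbf = build(records)
```

Calling `msbf.matching_sets("tt0000001")` returns a list that always includes `6`, since a Bloom filter never produces false negatives; it may occasionally include other keys as false positives. The two passes in `build` loosely correspond to what a MapReduce version would split into a counting job and a filter-construction job.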