used in the economic-financial, political, social and even institutional
fields. Often the data is stored in huge distributed clouds and is
sometimes generated according to a continuous flow, so large as to make
complete storage unfeasible. In many cases the data pertains to entities
in close relationship with each other and gives rise to massive networks
of connections. Familiar examples for such networks are biological and
social networks, distribution networks, and the Web graph. Furthermore,
the fact that the data is stored in systems managed by third parties
poses integrity problems, which have not been considered in the
classical IT literature in terms of both their type and scale.
This scenario poses unprecedented algorithmic challenges, which are
being considered by a vast audience of researchers. In the last decade,
this effort has produced many innovations on both the methodological and
technological level. This course aims at transferring to the students
some of the most important methodological tools originated from the
research on Big Data algorithms. These methodological tools are
presented within challenging application contexts.
Curriculum
Programma
Data processing in streaming mode: approximate counting, sampling and reservoir sampling, bloom filters, frequent itemsets, number of distinct elementsMultidimensional queries: parallelization, spatial/non-spatial partitioning, orthogonal range searching/counting, closest pair, kD-trees, range trees, layered trees, fractional cascading.
Community detection in social graphs: (bi)connected components, maximal clique, k-cores, k-plexes; algorithms for their calculation in a distributed environment.
DB-Trees: efficient aggregate range queries on any DBMS
Scalability and Big Data: P2P systems, Distributed Hash Tables and Chord, chord-like design in NoSQL DBMSes. Big data integrity: threat model for the cloud, non-scalability of traditional authenticated data structures, scaling with a pipelining approach. Blockchain and Big Data: the scalability trilemma.
Testi Adottati
Slides plus:Mining of Massive Datasets
Jure Leskovec, Anand Rajaraman, Jeff Ullman
Cambridge University Press
http://www.mmds.org/
Bibliografia Di Riferimento
Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Cambridge University Press http://www.mmds.org/Modalità Erogazione
Lectures in class.Modalità Frequenza
Lessons in classModalità Valutazione
Written test, project evaluation. The written test can be replaced by partial in-itinere tests.Programma
1) Algorithms for data streams- Approximate counting
- Majority problems
- Sampling and reservoir sampling
- Bloom filters
- Frequent itemsets
2) Sub-linear algorithms
- Diameter approximation
- Property testing
3) Clustering
4) Algorithms and data structures for quantitative features analysis
- 1d-,2d-,3d-range queries
- Skyline (pareto frontier), near-neighbor search, Voronoi diagrams
5) Dimensionality reduction
6) Algorithms for the decomposition of complex networks
- Decomposition into k-connected components
- Decomposition into k-cores, maximal cliques, maximal k-plexes
7) Distributed Hash Tables, Consistent Hashing
8) Integrity of large data sets, consistency in distributed systems, CAP/PACELC theorems and their impact on NoSQL databases
Testi Adottati
Mining of Massive DatasetsJure Leskovec, Anand Rajaraman, Jeff Ullman
Cambridge University Press
http://www.mmds.org/
Bibliografia Di Riferimento
The bibliography is commented in each lesson by the teachers.Modalità Erogazione
Traditional lectures and exercises.Modalità Frequenza
Class lectures are strongly interactive. All the teachers assist to all the lectures. Discussion with the students is the main target of the course.Modalità Valutazione
The exam consists of a project that will be assigned during the course and of an oral examination.Programma
1) Algorithms for data streams- Approximate counting
- Majority problems
- Sampling and reservoir sampling
- Bloom filters
- Frequent itemsets
- Number of distinct elements
2) Dimensionality reduction
- Johnson–Lindenstrauss lemma
- Embedding metric spaces with low distortion
3) Algorithms and data structures for quantitative features analysis
- orthogonal range searching (kd-trees and range trees)
- nearest neighbour search, k-nearest neighbour search
- fractional cascading and simplex range search
4) Algorithms for the decomposition of complex networks
- Decomposition into k-connected components
- Decomposition into k-cores, maximal cliques, maximal k-plexes
5) NoSQL internals: Distributed Hash Tables, chord, consistent hashing
6) Scalable security: integrity of big data sets in the cloud, consistency and scalability issues with authenticated data structures, pipelining, blockchain scalability trilemma.
Testi Adottati
Mining of Massive DatasetsJure Leskovec, Anand Rajaraman, Jeff Ullman
Cambridge University Press
http://www.mmds.org/
Bibliografia Di Riferimento
Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Cambridge University Press http://www.mmds.org/Modalità Erogazione
TraditionalModalità Frequenza
Non-mandatoryModalità Valutazione
Written test, project evaluation. The written test can be replaced by partial in itinere tests.Programma
1) Algorithms for data streams- Approximate counting
- Majority problems
- Sampling and reservoir sampling
- Bloom filters
- Frequent itemsets
- Number of distinct elements
2) Dimensionality reduction
-Johnson–Lindenstrauss lemma
Embedding metric spaces with low distortion
3) Algorithms and data structures for quantitative features analysis
- orthogonal range searching (kd-trees and range trees)
- nearest neighbour search, k-nearest neighbour search
- fractional cascading and simplex range search
4) Algorithms for the decomposition of complex networks
- Decomposition into k-connected components
- Decomposition into k-cores, maximal cliques, maximal k-plexes
5) NoSQL internals: Distributed Hash Tables, chord, consistent hashing
6) Scalable security: integrity of big data sets in the cloud, consistency and scalability issues with authenticated data structures, pipelining, blockchain scalability trilemma.
Testi Adottati
Mining of Massive DatasetsJure Leskovec, Anand Rajaraman, Jeff Ullman
Cambridge University Press
http://www.mmds.org/
Bibliografia Di Riferimento
Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Cambridge University Press http://www.mmds.org/Modalità Erogazione
TraditionalModalità Frequenza
Non-mandatoryModalità Valutazione
Written test, project evaluation. The written test can be replaced by partial in-itinere tests.Programma
Data processing in streaming mode: approximate counting, sampling and reservoir sampling, bloom filters, frequent itemsets, number of distinct elementsMultidimensional queries: parallelization, spatial/non-spatial partitioning, orthogonal range searching/counting, closest pair, kD-trees, range trees, layered trees, fractional cascading.
Community detection in social graphs: (bi)connected components, maximal clique, k-cores, k-plexes; algorithms for their calculation in a distributed environment.
DB-Trees: efficient aggregate range queries on any DBMS
Scalability and Big Data: P2P systems, Distributed Hash Tables and Chord, chord-like design in NoSQL DBMSes. Big data integrity: threat model for the cloud, non-scalability of traditional authenticated data structures, scaling with a pipelining approach. Blockchain and Big Data: the scalability trilemma.
Testi Adottati
Slides plus:Mining of Massive Datasets
Jure Leskovec, Anand Rajaraman, Jeff Ullman
Cambridge University Press
http://www.mmds.org/
Bibliografia Di Riferimento
Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Cambridge University Press http://www.mmds.org/Modalità Erogazione
Lectures in class.Modalità Frequenza
Lessons in classModalità Valutazione
Written test, project evaluation. The written test can be replaced by partial in-itinere tests.Programma
1) Algorithms for data streams- Approximate counting
- Majority problems
- Sampling and reservoir sampling
- Bloom filters
- Frequent itemsets
2) Sub-linear algorithms
- Diameter approximation
- Property testing
3) Clustering
4) Algorithms and data structures for quantitative features analysis
- 1d-,2d-,3d-range queries
- Skyline (pareto frontier), near-neighbor search, Voronoi diagrams
5) Dimensionality reduction
6) Algorithms for the decomposition of complex networks
- Decomposition into k-connected components
- Decomposition into k-cores, maximal cliques, maximal k-plexes
7) Distributed Hash Tables, Consistent Hashing
8) Integrity of large data sets, consistency in distributed systems, CAP/PACELC theorems and their impact on NoSQL databases
Testi Adottati
Mining of Massive DatasetsJure Leskovec, Anand Rajaraman, Jeff Ullman
Cambridge University Press
http://www.mmds.org/
Bibliografia Di Riferimento
The bibliography is commented in each lesson by the teachers.Modalità Erogazione
Traditional lectures and exercises.Modalità Frequenza
Class lectures are strongly interactive. All the teachers assist to all the lectures. Discussion with the students is the main target of the course.Modalità Valutazione
The exam consists of a project that will be assigned during the course and of an oral examination.Programma
1) Algorithms for data streams- Approximate counting
- Majority problems
- Sampling and reservoir sampling
- Bloom filters
- Frequent itemsets
- Number of distinct elements
2) Dimensionality reduction
- Johnson–Lindenstrauss lemma
- Embedding metric spaces with low distortion
3) Algorithms and data structures for quantitative features analysis
- orthogonal range searching (kd-trees and range trees)
- nearest neighbour search, k-nearest neighbour search
- fractional cascading and simplex range search
4) Algorithms for the decomposition of complex networks
- Decomposition into k-connected components
- Decomposition into k-cores, maximal cliques, maximal k-plexes
5) NoSQL internals: Distributed Hash Tables, chord, consistent hashing
6) Scalable security: integrity of big data sets in the cloud, consistency and scalability issues with authenticated data structures, pipelining, blockchain scalability trilemma.
Testi Adottati
Mining of Massive DatasetsJure Leskovec, Anand Rajaraman, Jeff Ullman
Cambridge University Press
http://www.mmds.org/
Bibliografia Di Riferimento
Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Cambridge University Press http://www.mmds.org/Modalità Erogazione
TraditionalModalità Frequenza
Non-mandatoryModalità Valutazione
Written test, project evaluation. The written test can be replaced by partial in itinere tests.Programma
1) Algorithms for data streams- Approximate counting
- Majority problems
- Sampling and reservoir sampling
- Bloom filters
- Frequent itemsets
- Number of distinct elements
2) Dimensionality reduction
-Johnson–Lindenstrauss lemma
Embedding metric spaces with low distortion
3) Algorithms and data structures for quantitative features analysis
- orthogonal range searching (kd-trees and range trees)
- nearest neighbour search, k-nearest neighbour search
- fractional cascading and simplex range search
4) Algorithms for the decomposition of complex networks
- Decomposition into k-connected components
- Decomposition into k-cores, maximal cliques, maximal k-plexes
5) NoSQL internals: Distributed Hash Tables, chord, consistent hashing
6) Scalable security: integrity of big data sets in the cloud, consistency and scalability issues with authenticated data structures, pipelining, blockchain scalability trilemma.
Testi Adottati
Mining of Massive DatasetsJure Leskovec, Anand Rajaraman, Jeff Ullman
Cambridge University Press
http://www.mmds.org/
Bibliografia Di Riferimento
Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Cambridge University Press http://www.mmds.org/Modalità Erogazione
TraditionalModalità Frequenza
Non-mandatoryModalità Valutazione
Written test, project evaluation. The written test can be replaced by partial in-itinere tests.Programma
Data processing in streaming mode: approximate counting, sampling and reservoir sampling, bloom filters, frequent itemsets, number of distinct elementsMultidimensional queries: parallelization, spatial/non-spatial partitioning, orthogonal range searching/counting, closest pair, kD-trees, range trees, layered trees, fractional cascading.
Community detection in social graphs: (bi)connected components, maximal clique, k-cores, k-plexes; algorithms for their calculation in a distributed environment.
DB-Trees: efficient aggregate range queries on any DBMS
Scalability and Big Data: P2P systems, Distributed Hash Tables and Chord, chord-like design in NoSQL DBMSes. Big data integrity: threat model for the cloud, non-scalability of traditional authenticated data structures, scaling with a pipelining approach. Blockchain and Big Data: the scalability trilemma.
Testi Adottati
Slides plus:Mining of Massive Datasets
Jure Leskovec, Anand Rajaraman, Jeff Ullman
Cambridge University Press
http://www.mmds.org/
Bibliografia Di Riferimento
Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Cambridge University Press http://www.mmds.org/Modalità Erogazione
Lectures in class.Modalità Frequenza
Lessons in classModalità Valutazione
Written test, project evaluation. The written test can be replaced by partial in-itinere tests.Programma
1) Algorithms for data streams- Approximate counting
- Majority problems
- Sampling and reservoir sampling
- Bloom filters
- Frequent itemsets
2) Sub-linear algorithms
- Diameter approximation
- Property testing
3) Clustering
4) Algorithms and data structures for quantitative features analysis
- 1d-,2d-,3d-range queries
- Skyline (pareto frontier), near-neighbor search, Voronoi diagrams
5) Dimensionality reduction
6) Algorithms for the decomposition of complex networks
- Decomposition into k-connected components
- Decomposition into k-cores, maximal cliques, maximal k-plexes
7) Distributed Hash Tables, Consistent Hashing
8) Integrity of large data sets, consistency in distributed systems, CAP/PACELC theorems and their impact on NoSQL databases
Testi Adottati
Mining of Massive DatasetsJure Leskovec, Anand Rajaraman, Jeff Ullman
Cambridge University Press
http://www.mmds.org/
Bibliografia Di Riferimento
The bibliography is commented in each lesson by the teachers.Modalità Erogazione
Traditional lectures and exercises.Modalità Frequenza
Class lectures are strongly interactive. All the teachers assist to all the lectures. Discussion with the students is the main target of the course.Modalità Valutazione
The exam consists of a project that will be assigned during the course and of an oral examination.Programma
1) Algorithms for data streams- Approximate counting
- Majority problems
- Sampling and reservoir sampling
- Bloom filters
- Frequent itemsets
- Number of distinct elements
2) Dimensionality reduction
- Johnson–Lindenstrauss lemma
- Embedding metric spaces with low distortion
3) Algorithms and data structures for quantitative features analysis
- orthogonal range searching (kd-trees and range trees)
- nearest neighbour search, k-nearest neighbour search
- fractional cascading and simplex range search
4) Algorithms for the decomposition of complex networks
- Decomposition into k-connected components
- Decomposition into k-cores, maximal cliques, maximal k-plexes
5) NoSQL internals: Distributed Hash Tables, chord, consistent hashing
6) Scalable security: integrity of big data sets in the cloud, consistency and scalability issues with authenticated data structures, pipelining, blockchain scalability trilemma.
Testi Adottati
Mining of Massive DatasetsJure Leskovec, Anand Rajaraman, Jeff Ullman
Cambridge University Press
http://www.mmds.org/
Bibliografia Di Riferimento
Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Cambridge University Press http://www.mmds.org/Modalità Erogazione
TraditionalModalità Frequenza
Non-mandatoryModalità Valutazione
Written test, project evaluation. The written test can be replaced by partial in itinere tests.Programma
1) Algorithms for data streams- Approximate counting
- Majority problems
- Sampling and reservoir sampling
- Bloom filters
- Frequent itemsets
- Number of distinct elements
2) Dimensionality reduction
-Johnson–Lindenstrauss lemma
Embedding metric spaces with low distortion
3) Algorithms and data structures for quantitative features analysis
- orthogonal range searching (kd-trees and range trees)
- nearest neighbour search, k-nearest neighbour search
- fractional cascading and simplex range search
4) Algorithms for the decomposition of complex networks
- Decomposition into k-connected components
- Decomposition into k-cores, maximal cliques, maximal k-plexes
5) NoSQL internals: Distributed Hash Tables, chord, consistent hashing
6) Scalable security: integrity of big data sets in the cloud, consistency and scalability issues with authenticated data structures, pipelining, blockchain scalability trilemma.
Testi Adottati
Mining of Massive DatasetsJure Leskovec, Anand Rajaraman, Jeff Ullman
Cambridge University Press
http://www.mmds.org/
Bibliografia Di Riferimento
Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Cambridge University Press http://www.mmds.org/Modalità Erogazione
TraditionalModalità Frequenza
Non-mandatoryModalità Valutazione
Written test, project evaluation. The written test can be replaced by partial in-itinere tests.