20810211 - Algorithms for big data

In many application contexts huge volumes of data are produced which are
used in the economic-financial, political, social and even institutional
fields. Often the data is stored in huge distributed clouds and is
sometimes generated according to a continuous flow, so large as to make
complete storage unfeasible. In many cases the data pertains to entities
in close relationship with each other and gives rise to massive networks
of connections. Familiar examples for such networks are biological and
social networks, distribution networks, and the Web graph. Furthermore,
the fact that the data is stored in systems managed by third parties
poses integrity problems, which have not been considered in the
classical IT literature in terms of both their type and scale.

This scenario poses unprecedented algorithmic challenges, which are
being considered by a vast audience of researchers. In the last decade,
this effort has produced many innovations on both the methodological and
technological level. This course aims at transferring to the students
some of the most important methodological tools originated from the
research on Big Data algorithms. These methodological tools are
presented within challenging application contexts.

Curriculum

scheda docente | materiale didattico

Programma

1) Algorithms for data streams
- Approximate counting
- Majority problems
- Sampling and reservoir sampling
- Bloom filters
- Frequent itemsets
- Number of distinct elements
2) Algorithms and data structures for quantitative features analysis
- orthogonal range searching (kd-trees, range trees, and layered range trees)
- median finding
- multidimensional divide and conquer, closest pair
- fractional cascading
3) Algorithms for the decomposition of complex networks
- Decomposition into k-connected components
- Decomposition into k-cores, maximal cliques, maximal k-plexes
4) Locality sensitive hashing for finding similar items
- Min-Hashing
- Nearest neighbour search, k-nearest neighbour search


Testi Adottati

Slides plus:
Mining of Massive Datasets
Jure Leskovec, Anand Rajaraman, Jeff Ullman
Cambridge University Press
http://www.mmds.org/

Bibliografia Di Riferimento

Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Cambridge University Press http://www.mmds.org/

Modalità Erogazione

Lectures in class.

Modalità Frequenza

Lessons in class

Modalità Valutazione

Written test, project evaluation. The written test can be replaced by partial in-itinere tests.

scheda docente | materiale didattico

Programma

1) Algorithms for data streams
- Approximate counting
- Majority problems
- Sampling and reservoir sampling
- Bloom filters
- Frequent itemsets
- Number of distinct elements
2) Algorithms and data structures for quantitative features analysis
- orthogonal range searching (kd-trees, range trees, and layered range trees)
- median finding
- multidimensional divide and conquer, closest pair
- fractional cascading
3) Locality sensitive hashing for finding similar items
- Min-Hashing
- Nearest neighbour search, k-nearest neighbour search
4) NoSQL internals & Distributed Hash Tables
- Chord
- consistent hashing
- Kademlia
5) Scalable security:
- integrity of big data sets in the cloud,
- consistency and scalability issues with authenticated data structures

Testi Adottati

Mining of Massive Datasets
Jure Leskovec, Anand Rajaraman, Jeff Ullman
Cambridge University Press
http://www.mmds.org/

Bibliografia Di Riferimento

Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Cambridge University Press http://www.mmds.org/

Modalità Erogazione

Traditional

Modalità Frequenza

Non-mandatory

Modalità Valutazione

Written test, project evaluation. The written test can be replaced by partial in itinere tests.

scheda docente | materiale didattico

Programma

1) Algorithms for data streams
- Approximate counting
- Majority problems
- Sampling and reservoir sampling
- Bloom filters
- Frequent itemsets
- Number of distinct elements
2) Algorithms and data structures for quantitative features analysis
- orthogonal range searching (kd-trees, range trees, and layered range trees)
- median finding
- multidimensional divide and conquer, closest pair
- fractional cascading
3) Locality sensitive hashing for finding similar items
- Min-Hashing
- Nearest neighbour search, k-nearest neighbour search
4) NoSQL internals & Distributed Hash Tables
- Chord
- consistent hashing
- Kademlia
5) Scalable security:
- integrity of big data sets in the cloud,
- consistency and scalability issues with authenticated data structures

Testi Adottati

Mining of Massive Datasets
Jure Leskovec, Anand Rajaraman, Jeff Ullman
Cambridge University Press
http://www.mmds.org/


Bibliografia Di Riferimento

Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Cambridge University Press http://www.mmds.org/

Modalità Erogazione

Traditional

Modalità Frequenza

Non-mandatory

Modalità Valutazione

Written test, project evaluation. The written test can be replaced by partial in-itinere tests.

scheda docente | materiale didattico

Programma

1) Algorithms for data streams
- Approximate counting
- Majority problems
- Sampling and reservoir sampling
- Bloom filters
- Frequent itemsets
- Number of distinct elements
2) Algorithms and data structures for quantitative features analysis
- orthogonal range searching (kd-trees, range trees, and layered range trees)
- median finding
- multidimensional divide and conquer, closest pair
- fractional cascading
3) Locality sensitive hashing for finding similar items
- Min-Hashing
- Nearest neighbour search, k-nearest neighbour search
4) NoSQL internals & Distributed Hash Tables
- Chord
- consistent hashing
- Kademlia
5) Scalable security:
- integrity of big data sets in the cloud,
- consistency and scalability issues with authenticated data structures


Testi Adottati

Mining of Massive Datasets
Jure Leskovec, Anand Rajaraman, Jeff Ullman
Cambridge University Press
http://www.mmds.org/


Bibliografia Di Riferimento

Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Cambridge University Press http://www.mmds.org/

Modalità Frequenza

Attendance is not compulsory but highly recommended

Modalità Valutazione

Written test, project evaluation. The written test can be replaced by partial in itinere tests.

scheda docente | materiale didattico

Programma

1) Algorithms for data streams
- Approximate counting
- Majority problems
- Sampling and reservoir sampling
- Bloom filters
- Frequent itemsets
- Number of distinct elements
2) Algorithms and data structures for quantitative features analysis
- orthogonal range searching (kd-trees, range trees, and layered range trees)
- median finding
- multidimensional divide and conquer, closest pair
- fractional cascading
3) Algorithms for the decomposition of complex networks
- Decomposition into k-connected components
- Decomposition into k-cores, maximal cliques, maximal k-plexes
4) Locality sensitive hashing for finding similar items
- Min-Hashing
- Nearest neighbour search, k-nearest neighbour search


Testi Adottati

Slides plus:
Mining of Massive Datasets
Jure Leskovec, Anand Rajaraman, Jeff Ullman
Cambridge University Press
http://www.mmds.org/

Bibliografia Di Riferimento

Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Cambridge University Press http://www.mmds.org/

Modalità Erogazione

Lectures in class.

Modalità Frequenza

Lessons in class

Modalità Valutazione

Written test, project evaluation. The written test can be replaced by partial in-itinere tests.

scheda docente | materiale didattico

Programma

1) Algorithms for data streams
- Approximate counting
- Majority problems
- Sampling and reservoir sampling
- Bloom filters
- Frequent itemsets
- Number of distinct elements
2) Algorithms and data structures for quantitative features analysis
- orthogonal range searching (kd-trees, range trees, and layered range trees)
- median finding
- multidimensional divide and conquer, closest pair
- fractional cascading
3) Locality sensitive hashing for finding similar items
- Min-Hashing
- Nearest neighbour search, k-nearest neighbour search
4) NoSQL internals & Distributed Hash Tables
- Chord
- consistent hashing
- Kademlia
5) Scalable security:
- integrity of big data sets in the cloud,
- consistency and scalability issues with authenticated data structures

Testi Adottati

Mining of Massive Datasets
Jure Leskovec, Anand Rajaraman, Jeff Ullman
Cambridge University Press
http://www.mmds.org/

Bibliografia Di Riferimento

Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Cambridge University Press http://www.mmds.org/

Modalità Erogazione

Traditional

Modalità Frequenza

Non-mandatory

Modalità Valutazione

Written test, project evaluation. The written test can be replaced by partial in itinere tests.

scheda docente | materiale didattico

Programma

1) Algorithms for data streams
- Approximate counting
- Majority problems
- Sampling and reservoir sampling
- Bloom filters
- Frequent itemsets
- Number of distinct elements
2) Algorithms and data structures for quantitative features analysis
- orthogonal range searching (kd-trees, range trees, and layered range trees)
- median finding
- multidimensional divide and conquer, closest pair
- fractional cascading
3) Locality sensitive hashing for finding similar items
- Min-Hashing
- Nearest neighbour search, k-nearest neighbour search
4) NoSQL internals & Distributed Hash Tables
- Chord
- consistent hashing
- Kademlia
5) Scalable security:
- integrity of big data sets in the cloud,
- consistency and scalability issues with authenticated data structures

Testi Adottati

Mining of Massive Datasets
Jure Leskovec, Anand Rajaraman, Jeff Ullman
Cambridge University Press
http://www.mmds.org/


Bibliografia Di Riferimento

Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Cambridge University Press http://www.mmds.org/

Modalità Erogazione

Traditional

Modalità Frequenza

Non-mandatory

Modalità Valutazione

Written test, project evaluation. The written test can be replaced by partial in-itinere tests.

scheda docente | materiale didattico

Programma

1) Algorithms for data streams
- Approximate counting
- Majority problems
- Sampling and reservoir sampling
- Bloom filters
- Frequent itemsets
- Number of distinct elements
2) Algorithms and data structures for quantitative features analysis
- orthogonal range searching (kd-trees, range trees, and layered range trees)
- median finding
- multidimensional divide and conquer, closest pair
- fractional cascading
3) Locality sensitive hashing for finding similar items
- Min-Hashing
- Nearest neighbour search, k-nearest neighbour search
4) NoSQL internals & Distributed Hash Tables
- Chord
- consistent hashing
- Kademlia
5) Scalable security:
- integrity of big data sets in the cloud,
- consistency and scalability issues with authenticated data structures


Testi Adottati

Mining of Massive Datasets
Jure Leskovec, Anand Rajaraman, Jeff Ullman
Cambridge University Press
http://www.mmds.org/


Bibliografia Di Riferimento

Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Cambridge University Press http://www.mmds.org/

Modalità Frequenza

Attendance is not compulsory but highly recommended

Modalità Valutazione

Written test, project evaluation. The written test can be replaced by partial in itinere tests.

scheda docente | materiale didattico

Programma

1) Algorithms for data streams
- Approximate counting
- Majority problems
- Sampling and reservoir sampling
- Bloom filters
- Frequent itemsets
- Number of distinct elements
2) Algorithms and data structures for quantitative features analysis
- orthogonal range searching (kd-trees, range trees, and layered range trees)
- median finding
- multidimensional divide and conquer, closest pair
- fractional cascading
3) Algorithms for the decomposition of complex networks
- Decomposition into k-connected components
- Decomposition into k-cores, maximal cliques, maximal k-plexes
4) Locality sensitive hashing for finding similar items
- Min-Hashing
- Nearest neighbour search, k-nearest neighbour search


Testi Adottati

Slides plus:
Mining of Massive Datasets
Jure Leskovec, Anand Rajaraman, Jeff Ullman
Cambridge University Press
http://www.mmds.org/

Bibliografia Di Riferimento

Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Cambridge University Press http://www.mmds.org/

Modalità Erogazione

Lectures in class.

Modalità Frequenza

Lessons in class

Modalità Valutazione

Written test, project evaluation. The written test can be replaced by partial in-itinere tests.

scheda docente | materiale didattico

Programma

1) Algorithms for data streams
- Approximate counting
- Majority problems
- Sampling and reservoir sampling
- Bloom filters
- Frequent itemsets
- Number of distinct elements
2) Algorithms and data structures for quantitative features analysis
- orthogonal range searching (kd-trees, range trees, and layered range trees)
- median finding
- multidimensional divide and conquer, closest pair
- fractional cascading
3) Locality sensitive hashing for finding similar items
- Min-Hashing
- Nearest neighbour search, k-nearest neighbour search
4) NoSQL internals & Distributed Hash Tables
- Chord
- consistent hashing
- Kademlia
5) Scalable security:
- integrity of big data sets in the cloud,
- consistency and scalability issues with authenticated data structures

Testi Adottati

Mining of Massive Datasets
Jure Leskovec, Anand Rajaraman, Jeff Ullman
Cambridge University Press
http://www.mmds.org/

Bibliografia Di Riferimento

Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Cambridge University Press http://www.mmds.org/

Modalità Erogazione

Traditional

Modalità Frequenza

Non-mandatory

Modalità Valutazione

Written test, project evaluation. The written test can be replaced by partial in itinere tests.

scheda docente | materiale didattico

Programma

1) Algorithms for data streams
- Approximate counting
- Majority problems
- Sampling and reservoir sampling
- Bloom filters
- Frequent itemsets
- Number of distinct elements
2) Algorithms and data structures for quantitative features analysis
- orthogonal range searching (kd-trees, range trees, and layered range trees)
- median finding
- multidimensional divide and conquer, closest pair
- fractional cascading
3) Locality sensitive hashing for finding similar items
- Min-Hashing
- Nearest neighbour search, k-nearest neighbour search
4) NoSQL internals & Distributed Hash Tables
- Chord
- consistent hashing
- Kademlia
5) Scalable security:
- integrity of big data sets in the cloud,
- consistency and scalability issues with authenticated data structures

Testi Adottati

Mining of Massive Datasets
Jure Leskovec, Anand Rajaraman, Jeff Ullman
Cambridge University Press
http://www.mmds.org/


Bibliografia Di Riferimento

Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Cambridge University Press http://www.mmds.org/

Modalità Erogazione

Traditional

Modalità Frequenza

Non-mandatory

Modalità Valutazione

Written test, project evaluation. The written test can be replaced by partial in-itinere tests.

scheda docente | materiale didattico

Programma

1) Algorithms for data streams
- Approximate counting
- Majority problems
- Sampling and reservoir sampling
- Bloom filters
- Frequent itemsets
- Number of distinct elements
2) Algorithms and data structures for quantitative features analysis
- orthogonal range searching (kd-trees, range trees, and layered range trees)
- median finding
- multidimensional divide and conquer, closest pair
- fractional cascading
3) Locality sensitive hashing for finding similar items
- Min-Hashing
- Nearest neighbour search, k-nearest neighbour search
4) NoSQL internals & Distributed Hash Tables
- Chord
- consistent hashing
- Kademlia
5) Scalable security:
- integrity of big data sets in the cloud,
- consistency and scalability issues with authenticated data structures


Testi Adottati

Mining of Massive Datasets
Jure Leskovec, Anand Rajaraman, Jeff Ullman
Cambridge University Press
http://www.mmds.org/


Bibliografia Di Riferimento

Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Cambridge University Press http://www.mmds.org/

Modalità Frequenza

Attendance is not compulsory but highly recommended

Modalità Valutazione

Written test, project evaluation. The written test can be replaced by partial in itinere tests.