Hello, I am an Engineering Manager at Facebook with 13+ years in Ad Technology, Natural Language Processing and Data mining. (Learn More)
by Pravin Paratey

Set data structures & Implementation

Sets are an important mathematical concept. A set can be defined as a collection of distinct elements. In mathematical literature, curly braces are used to denote a set. So, a set S containing the elements 1, 2 and 3 is denoted by S = {1, 2, 3}.

Set illustration

A set supports the following basic operations,

  1. Union - The union of two sets is a set that contains elements from both sets. For instance, given,
    • S1 = {1, 2, 3} and
    • S2 = {3, 4, 5}
    • S1 ∩ 2 = {1, 2, 3, 4, 5}
  2. Intersection - The intersection of two sets is a set that contains elements common to both sets. For instance, given,
    • S1 = {1, 2, 3} and
    • S2 = {3, 4, 5}
    • S1 ∪ 2 = {3}
  3. Difference - The difference of two sets is a set that contains elements from the first set with the elements common to both sets removed. Given,
    • S1 = {1, 2, 3} and
    • S2 = {3, 4, 5}
    • S1 − 2 = {1, 2}

We also have the following additional operations,

  1. IsMember - Tells you if an element is a member of the set. It is denoted by the ∈ symbol. So,
    • 1 ∈ {1, 2, 3} but
    • 4 ∉ {1, 2, 3}

Implementing sets

Let’s look at the different ways we can implement the Set data structure and analyse each implementation.

Bit-vectors

Bit vectors are an array of bits (x1, x2, … xn) such that xi = 1 if i is a member of the set. For example,

S1 = {1, 2, 3} will be represented as 11100 while
S2 = {3, 4, 5} will be represented as 00111.

Advantages

  1. IsMember(), Add() and Delete() are O(1) operations.
  2. Union(), Intersection() and Difference() are at most O(N) where N is the size of the universe.
  3. For non-sparse sets, bit-vectors can represent the set quite tightly.
  4. Bit operations are typically faster.

Disadvantages

  1. Takes too much memory for large universes.
  2. Needs a mapping function for non-integer members.

Linked-list

A linked list implementation keeps a sorted list of elements for every set. This means,

S1 = {1, 2, 3} will be stored as 1 -> 2 -> 3 while
S2 = {3, 4, 5} will be represented as 3 -> 4 -> 5.

Properties

  1. IsMember() is at most O(n) operation where n is the size of the set.
  2. Union(), Intersection(), Difference(), Add() and Delete() are take at most O(mn) where m is the size of the first set and n is the size of the second set.

Advantages

  • For very large universes, this outperforms a Bit-vector implementation.

Disadvantages

  • Typically the slowest implementation compared to the rest.

Hashtables

A set can also be implemented as a hashmap. The time complexity of the various functions is dependent on the particular implementation of a hashtable.

Properties

  1. IsMember() is typically a O(1) operation.
  2. Add() and Difference() are typically O(n) where n is the size of the second set.
  3. Union() and Intersection() are typically O(mn) where m is the size of the first set and n is the size of the second one.

Ordered trees

Sets can also be implemented as ordered trees. Priority Queues are an example of an ordered tree. A Binary search tree implementation of a set would give you the following properties,

Properties

  1. IsMember() is an O(log n) operation.
  2. Add(), Difference(), Union() and Intersection() are O(log n).

Applications

  1. Information retrieval - Inverted index.
  2. AND & OR queries.