What is Adjusted Rand Index and How it works!

8 minute read

Published: April 14, 2020

Adjusted Rand Index(ARI) is one of the widely used metrics for validating clustering performance

Rand Index(RI) and Adjusted Rand index(ARI) is different.

ARI is easy to implement and needs ground truth to execute

Let’s Talk about ARI in details….

What can we learn from this article?

What is ARI?
Where to use ARI?
How to code ARI?
Why to use ARI?
Math Behind ARI.

What is ARI?

Before talking about the ARI, we need to go through the Rand Index(RI) first.

Rand Index:

Rand Index was motivated by the idea of comparing one classifcation result to an another classification result. The naive theory for measuring the performance of a classification would be calculating the correctly classified items to all items. For RI, the above mention performance measre theory was extened to the comparing two cluster. Here, instead of counting single elemsnts, RI counts the correctly classied pairs of items. Hence, we can have the definition of Rand Index from the mathematical point of view as:

$R = a+d/a+b+c+d$

Here, R ranges from 0(no pair classified in the same way under both clusterings) to 1(similar clustering). The term a and b can be seen as agreements and b,c as disagreements.

ARI:

A major problem with the RI is that the expected value of Rand Index of two random cluster or partition does not take a constant value. To solve the problem , Adjusted Rand Index was introduced where the generalized hypergeometric distribution considered as the model of randomenss.

Let U and V are two random partitions with multiple cluster inside. Let $n_{ij}$ be the numer of objects that are in both cluster $u_i$ and $v_j$. Let ni adn $n_j$ be the number of objects or elements in cluster $u_i$ and cluster $v_i$ respectively. The notations are illustrated in the following table.

$ARI CTable$

This table is called Contingency Table. You have to first create this contingency table out of two paritions. Then this table will help to calculate the ARI coeffiecients.

The simple term of an index with a constant expected value is

which is boudnded by 1, and takes the value 0 when the index equals to its expected value. With the consideration fo generalized hypergeometric model, it can be shown as:

ARI Concept

The term $(n/2)$ is called binomial coefficients and it was previously $a+d$. Finally, the adjusted rand index can be written as:

ARi Equation

Lets consider a real exaple to break the calcualtion down. We will take some synthetic data for the calculation:

>>> x = [1 2 3 3 2 1 1 3 3 1 2 2]

>>> y = [3 2 3 2 2 1 1 2 3 1 3 1]

From the above data we can see there are three different clusters in each partitions. Each value represents each cluster and the index is really important to calculate ARI. We just need to count the co-occurrences to build the contingency table. $n_{11}$ would be the number of times an element occurs in cluster 1 of X and cluster 1 of Y; this occurs three times: the sixth, seventh, and tenth elements.

Lets find out the Contengincy Table from the above clustered data.

ARi Equation

The ARI formula is basically consists of three parts:

ARi Equation

Here, $\sum$ means the sum and $i$,$j$ refers to th row number and column number respectively. $a$ refers to the sum of row value and $b$ refers to the sum of column value. Lets calculate the three parts of ARI from the above data.

ARi ctable calcualtion

Finally, we can get the final ARI value of the above two partions data.

ARi Result

Now, lets go trought the code and example with much simpler way.

How to use?

Python(use scikit-learn)
R (package)
It is very easy to use ARI with Python. Thanks to the Sci-kit learn , which has almost all of the machine learning algorithm. Let’s look at the simple example about how to use ARI module from scikit-learn using python.

Python

Perfectly matching labelings have a score of 1 even

>>> from sklearn.metrics.cluster import adjusted_rand_score
>>> adjusted_rand_score([0, 0, 1, 1], [0, 0, 1, 1])
1.0
>>> adjusted_rand_score([0, 0, 1, 1], [1, 1, 0, 0])
1.0

Labelings that assign all classes members to the same clusters are complete be not always pure, hence penalized:

>>> adjusted_rand_score([0, 0, 1, 2], [0, 0, 1, 1])
0.57...

ARI is symmetric, so labelings that have pure clusters with members coming from the same classes but unnecessary splits are penalized:

>>> adjusted_rand_score([0, 0, 1, 1], [0, 0, 1, 2])
0.57...

If classes members are completely split across different clusters, the assignment is totally incomplete, hence the ARI is very low:

>>> adjusted_rand_score([0, 0, 0, 0], [0, 1, 2, 3])
0.0

The above explanation and exale with python is so easy to use that sometimes you may forget how to code.

But for understanding the implementation from OOP coding point of view, i am going to implement this with GoLang. With GoLang code we can easily understand how the ARI should be implemented in details with any OOP language.

We need to have three class for the implementation.

One for the interface
One for the Contingency Table creation
One for the calculating ARI coefficients

I always use interface for such kind of things like any distance metrics or validation metrics. As there are multiple metrics available for this kind of task, having interface is helpful to implement multiple metrics.

Interface:

package graph

/**
 * An abstract interface to measure the clustering performance
 * M.K Hasan
 * hasan.alive@gmail.com
 */

var ValidationFunctionRegistry = map[string]func() ValidationComputation{}

type ValidationComputation interface {
	Measure(y1, y2 []int) int
	GetName() string
}

type ValidationFunction struct {
	Computation ValidationComputation
}

func (vf *ValidationFunction) Measure(y1, y2 []int) int {
	return vf.Computation.Measure(y1, y2)
}

func (vf *ValidationFunction) GetName() string {
	return vf.Computation.GetName()
}

Contingency Table

package validation

import (
	"fmt"
)

/**
 * The contigency table of the two cluster
 * M.k Hasan
 * hasan.alive@gmail.com
 */
type CTableAttribute struct {
	n                                  int /* total number of observation */
	NumberOfCluster1, NumberOfCluster2 int /** Total number of cluster in first and second cluster **/
	/** two dimensional contingency table */
}

/** creating the contingency table **/
func (e *CTableAttribute) ContingencyTable(y1, y2 []int) ([][]int, []int, []int) {

	var table [][]int

	if len(y1) != len(y2) {
		fmt.Errorf("length of the two cluster elements should be equal")
	}

	for i := 0; i < e.NumberOfCluster1; i++ {
		for j := 0; j < e.NumberOfCluster2; j++ {
			nij := 0
			for k := 0; k < e.n; k++ {
				if i == y1[k] && j == y2[k] {
					nij = nij + 1
				}
			}
			table[i][j] = nij
		}

	}
	var a []int //rowsum
	var b []int //columnsum

	for i := 0; i < e.NumberOfCluster1; i++ {
		for j := 0; j < e.NumberOfCluster2; j++ {
			a[i] = a[i] + table[i][j]
			b[i] = b[i] + table[i][j]
		}
	}
	return table, a, b
}

Calculate functions

package validation

import (
	"gonum.org/v1/gonum/stat/combin"
	g "ki/graph"
)

func init() {
	g.ValidationFunctionRegistry["ARI"] = func() g.ValidationComputation {
		return &ARI{}
	}
}

type ARI struct {
	CTable CTableAttribute
}

func (e *ARI) Measure(y1, y2 []int) int {
	return e.CalculateRand1(y1, y2)
}

/** Calculate the adjusted Rand Index **/

func (e *ARI) CalculateRand1(y1, y2 []int) int {
	var rand1 int
	count, a, b := e.CTable.ContingencyTable(y1, y2)
	for i := 0; i < e.CTable.NumberOfCluster1; i++ {
		for j := 0; j < e.CTable.NumberOfCluster2; j++ {
			if count[i][j] >= 2 {
				rand1 += combin.Binomial(count[i][j], 2)
			}
		}
	}
	rand2a := 0
	for i := 0; i < e.CTable.NumberOfCluster1; i++ {
		if a[i] >= 2 {
			rand2a += combin.Binomial(a[i], 2)
		}
	}

	rand2b := 0
	for j := 0; j < e.CTable.NumberOfCluster2; j++ {
		if b[j] >= 2 {
			rand2b += combin.Binomial(b[j], 2)
		}
	}

	rand3 := rand2a * rand2b
	rand3 = rand3 / combin.Binomial(e.CTable.n, 2)
	randN := rand1 - rand3

	// D
	rand4 := (rand2a + rand2b) / 2
	randD := rand4 - rand3

	rand := randN / randD
	return rand

}

func (e *ARI) GetName() string {
	return "ARI"
}

Share on

Twitter Facebook LinkedIn

M.k Hasan

What is Adjusted Rand Index and How it works!

Let’s Talk about ARI in details….

What can we learn from this article?

What is ARI?

Rand Index:

ARI:

How to use?

Python

Interface:

Contingency Table

Calculate functions

Share on

You May Also Enjoy

Anatomy of Semi-Structured(JSON) data with PYSPARK

Outlines

Anomaly Detection with Docker, Filebeat, Kafka, ELK Stack and Machine Learning (Part -1)

What is ELK stands for?

What is the future for Machine learning/Deep learning with GoLang !

Deploying YOLO Object Detection Model With DeepLearning 4 Java & Apache Flink in Kubernets Cluster Using Raspberry PI!