Machine learning for non-metric proximity data: Benchmarks

Similarity benchmark

This page hosts a benchmark of 13 non-standard similarity data sets. The matrices are in general symmetric but not positive semidefinite (non-psd; also called indefinite, non-positive or negative kernels) and come from various application fields (audio analysis, sign recognition, shape recognition, bioinformatics).

The data are given as full proximity matrices (kernels) and are hence fairly large. Some of the matrices are stored as sparse matrices; a reconstruction function is included in the mat file. All dissimilarity matrices have been converted to similarity matrices.
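The page does not say how the dissimilarities were converted to similarities. One common choice is double centering of the squared dissimilarities; the sketch below illustrates that technique under this assumption and is not necessarily the procedure used for this benchmark. The function name `double_centering` is hypothetical.

```python
import numpy as np

def double_centering(D):
    """Turn a symmetric dissimilarity matrix D into a similarity matrix.

    Assumption: S = -1/2 * J * D^2 * J with J = I - (1/n) * 1 1^T,
    where D^2 is the element-wise square of D.
    """
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    return -0.5 * J @ (D ** 2) @ J
```

For Euclidean distances this recovers the Gram matrix of the centered points; for non-metric dissimilarities the result is in general indefinite.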

These data sets were collected in part by the group of Maya Gupta at the University of Washington and the group around P.W. Duin at Delft University. The Washington data are no longer available and the Delft data are a bit hard to find. The raw data, e.g. protein sequences, come from other sources mentioned in the descriptions.

We summarize and back up these sources here. A few of the data sets (swissprot, bacteria, sonatas) were created within my former group at the University of Bielefeld.

The dataset comes as a 1GB Matlab file available at:

http://www.techfak.uni-bielefeld.de/~fschleif/data/similarity_data_benchmark.mat

If you use this benchmark data set please refer to the following paper:
Data Analysis of (Non-)Metric Proximities at Linear Costs

The following datasets are included:

Aural Sonar:

The Aural Sonar data set is taken from \cite{4098942}, which investigates the human ability to distinguish different types of sonar signals by ear. The signals were returns from a broadband active sonar system, with 50 target-of-interest signals and 50 clutter signals. Every pair of signals was assigned a similarity score from 1 to 5 by two randomly chosen human subjects unaware of the true labels, and these scores were added to produce a 100 × 100 similarity matrix with integer values from 2 to 10 \cite{Chen2009747}. The signature is (62,38,0).
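The signature quoted for each data set counts the positive, negative and (numerically) zero eigenvalues of the similarity matrix. A minimal sketch of how such a triple could be computed is given below; the function name `signature` and the relative tolerance `tol` are assumptions, and the threshold actually used for the benchmark is not stated on this page.

```python
import numpy as np

def signature(S, tol=1e-8):
    """Return (#positive, #negative, #zero) eigenvalues of a symmetric matrix."""
    S = np.asarray(S, dtype=float)
    S = 0.5 * (S + S.T)                  # remove tiny asymmetries
    eigvals = np.linalg.eigvalsh(S)      # real eigenvalues, ascending
    scale = max(1.0, float(np.abs(eigvals).max()))
    pos = int(np.sum(eigvals > tol * scale))
    neg = int(np.sum(eigvals < -tol * scale))
    zero = int(eigvals.size - pos - neg)
    return pos, neg, zero
```

A psd matrix has no negative eigenvalues, so its signature has a zero in the second position.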

Bacteria:

The Bacteria data set (Bacteria), with a signature (2007,0,0), consists of 2007 samples of bacteria mass spec fingerprints in 30 classes, taken as a subset from a commercial database provided by \cite{Maier2006a} (the database is not public but part of the commercial product the article refers to; here we use the version with 3034 bacteria groups; details can be obtained by contacting the authors at Bruker). The selected bacteria classes are the most prominent ones, each consisting of 22 to 203 entries. The underlying similarity measure and data generation are discussed in \cite{Maier2006a}. Basically, the similarities measure the alignment of two different spectra, and the spectra encode a peptide snapshot of the considered bacterium.

Caltech:

The Caltech-101 data set (Fei-Fei et al., 2004) is an object recognition benchmark data set consisting of 8677 images from 101 object categories. Similarities between images were computed using the pyramid match kernel (Grauman and Darrell, 2007) on SIFT features (Lowe, 2004). Here, the similarity is PSD.

Copenhagen:

The Copenhagen Chromosomes data constitute a benchmark from cytogenetics. 4,200 human chromosomes from 21 classes are represented by grey-valued images. These are converted to strings that encode the thickness of their silhouettes, i.e. the thickness of the grey levels of the image. These strings can be compared directly using the edit distance, with substitution costs based on the differences of the numbers and insertion/deletion costs of 4.5 \citep{neuhaus}. The obtained proximity matrix has a signature of (2258,1899,43). The classification problem is to label the data according to the chromosome type.
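The edit distance described above can be sketched as a standard dynamic program. The code below assumes that each chromosome is a sequence of numbers, that substituting one number for another costs their absolute difference, and that each insertion or deletion costs 4.5; the function name `chromosome_edit_distance` is hypothetical and the original implementation may differ in detail.

```python
def chromosome_edit_distance(a, b, indel_cost=4.5):
    """Edit distance between two numeric strings (sequences of numbers)."""
    n, m = len(a), len(b)
    # prev[j] = cost of aligning the first i-1 symbols of a with the first j of b
    prev = [j * indel_cost for j in range(m + 1)]
    for i in range(1, n + 1):
        curr = [i * indel_cost] + [0.0] * m
        for j in range(1, m + 1):
            substitute = prev[j - 1] + abs(a[i - 1] - b[j - 1])
            delete = prev[j] + indel_cost
            insert = curr[j - 1] + indel_cost
            curr[j] = min(substitute, delete, insert)
        prev = curr
    return prev[m]
```

With these costs a substitution is only chosen when the two numbers differ by less than 9, since one deletion plus one insertion costs 9.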

Delft gestures:

The Delft gestures (DS5, 1500 points, 20 classes, balanced, signature: (963,536,1)) taken from \cite{PrTools:2012:Online} is a set of dissimilarities generated from a sign-language interpretation problem. It consists of 1500 points with 20 classes and 75 points per class. The gestures are measured by two video cameras observing the positions of the two hands in 75 repetitions of creating 20 different signs. The dissimilarities are computed using a dynamic time warping procedure on the sequence of positions \cite{Lichtenauer20082040}.
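As an illustration of dynamic time warping on position sequences, the following is a minimal textbook sketch. It assumes Euclidean local costs and no warping window; the procedure in the cited work is more elaborate, and the function name `dtw_distance` is hypothetical.

```python
import numpy as np

def dtw_distance(x, y):
    """Classic DTW cost between two sequences of shape (T, d) or (T,)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    if y.ndim == 1:
        y = y[:, None]
    n, m = len(x), len(y)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = np.linalg.norm(x[i - 1] - y[j - 1])  # Euclidean local cost
            cost[i, j] = local + min(cost[i - 1, j],      # step in x only
                                     cost[i, j - 1],      # step in y only
                                     cost[i - 1, j - 1])  # step in both
    return float(cost[n, m])
```

DTW does not satisfy the triangle inequality in general, which is one reason the resulting proximity matrices are non-metric.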

Face Rec:

The Face Rec data set consists of 945 sample faces of 139 people from the NIST Face Recognition Grand Challenge data set. There are 139 classes, one for each person. Similarities for pairs of the original three-dimensional face data were computed as the cosine similarity between integral invariant signatures based on surface curves of the face \cite{4301238}. The signature is (794,150,1).

Patrol:

The Patrol data set was collected in \cite{Driskell2008a}. Members of seven patrol units were asked to name five members of their unit; in some cases the respondents inaccurately named people who were not in their unit, including people who did not belong to any unit. Of the original 385 respondents and named people, only the ones that were named at least once were kept, reducing the data set to 241 points. The similarity between any two people $a$ and $b$ is $(N(a, b) + N(b, a))/2$, where $N(a, b)$ is the number of times person $a$ names person $b$. Thus, this similarity has a range {0, 0.5, 1}. The classification problem is to estimate to which of the seven patrol units a person belongs, or to correctly place them in an eighth class that corresponds to “not in any of the units.” The signature is (233,8,0).
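The similarity defined above can be computed directly from a list of naming events. The sketch below assumes that people are indexed by integers and that each event is a pair (a, b) meaning that person a named person b; the function name `naming_similarity` and the input format are hypothetical.

```python
import numpy as np

def naming_similarity(names, n_people):
    """Build S[a, b] = (N(a, b) + N(b, a)) / 2 from (namer, named) pairs."""
    N = np.zeros((n_people, n_people))
    for a, b in names:
        N[a, b] += 1          # N(a, b): number of times a names b
    return (N + N.T) / 2      # symmetrize
```

If every person names another person at most once, the entries are restricted to 0, 0.5 and 1, as stated in the description.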

ProDom:

The ProDom data set, with signature (1502,680,422), consists of 2604 protein sequences with 53 labels. It contains a comprehensive set of protein families and appeared first in the work of \cite{DBLP:conf/nips/RothLBM02}, where the pairwise structural alignments were also computed. Each sequence belongs to a group labeled by experts; here we use the data as provided in \cite{PrTools:2012:Online}.

Protein:

The Protein data set has sequence-alignment similarities for 213 proteins from 4 classes, where classes one through four contain 72, 72, 39, and 30 points, respectively \cite{DBLP:journals/pami/HofmannB97}. The signature is (170,40,3).

Sonatas:

The Sonatas data set contains complex symbolic data with a signature (1063,4,1), taken from \cite{DBLP:conf/gbrpr/MokbelHH09}. It comprises pairwise dissimilarities between 1,068 sonatas from the classical period (by Beethoven, Mozart and Haydn) and the baroque era (by Scarlatti and Bach). The musical pieces were given in the MIDI file format, taken from the online MIDI collection \emph{Kunst der Fuge}. Their mutual dissimilarities were measured with the normalized compression distance (NCD), see~\citep{ncd}. The musical pieces are classified according to their composer.
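The normalized compression distance is defined as NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C(.) is the compressed length and xy is the concatenation of x and y. The sketch below uses zlib as the compressor, which is an assumption; the compressor and the preprocessing of the MIDI files used for the benchmark are not stated here. The helper name `compressed_size` is hypothetical.

```python
import zlib

def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (maximum compression level)."""
    return len(zlib.compress(data, 9))

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)
```

Because real compressors are imperfect, the NCD of an object with itself is usually slightly above zero and the measure is not exactly symmetric.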

SwissProt:

The SwissProt data set (SWISS), with a signature (8487,2500,1), consists of 5,791 points of protein sequences in 10 classes, taken as a subset from the popular SwissProt database of protein sequences \cite{swissprot}. The considered subset of the SwissProt database refers to release 37. A typical protein sequence consists of a string of amino acids, and the length of the full sequences varies from 30 to more than 1000 amino acids, depending on the sequence.

The 10 most common classes, such as Globin, Cytochrome b, Protein kinase st, etc., provided by the Prosite labeling \cite{prosit} were taken, leading to 5,791 sequences. Due to this choice, an associated classification problem maps the sequences to their corresponding Prosite labels. These sequences are compared using the Smith-Waterman algorithm, which computes a local alignment of sequences \cite{gusfield}. The SwissProt database is the standard source for identifying and analyzing protein sequences, so an automated classification and processing technique would be very desirable.
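To illustrate the kind of local alignment score that Smith-Waterman computes, here is a minimal sketch with a linear gap penalty. The scoring values (match 2, mismatch -1, gap -2) are placeholders chosen for illustration; protein comparisons normally use a substitution matrix and often affine gap costs, and the settings used for this data set are not given on this page.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of sequences a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            up = prev[j] + gap          # gap in b
            left = curr[j - 1] + gap    # gap in a
            score = max(0, diag, up, left)  # 0 restarts the local alignment
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best
```

The score grows with the length of the best matching region, so it acts as a similarity rather than a distance.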

Voting:

The Voting data set comes from the UCI Repository. It is a two-class classification problem with 435 points, where each sample is a categorical feature vector with 16 components and three possibilities for each component. We compute the value difference metric \cite{Stanfill:1986:TMR:7902.7906} from the categorical data, which is a dissimilarity that uses the training class labels to weight different components differently so as to achieve maximum probability of class separation. The signature is (178,163,94).
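The idea behind the value difference metric can be sketched as follows. Note that this is the simplified, unweighted variant, in which two values of a feature are close when they induce similar class distributions; it omits the additional weighting term of the original formulation, and the exponent `q` as well as the function names `fit_vdm` and `vdm_distance` are assumptions.

```python
from collections import Counter, defaultdict

def fit_vdm(X, y):
    """Estimate P(class | feature value) for each categorical feature."""
    classes = sorted(set(y))
    tables = []
    for f in range(len(X[0])):
        counts = defaultdict(Counter)
        for row, label in zip(X, y):
            counts[row[f]][label] += 1
        table = {}
        for value, ctr in counts.items():
            total = sum(ctr.values())
            table[value] = [ctr[c] / total for c in classes]
        tables.append(table)
    return tables

def vdm_distance(tables, a, b, q=2):
    """Sum over features and classes of |P(c | a_f) - P(c | b_f)|^q."""
    total = 0.0
    for table, va, vb in zip(tables, a, b):
        total += sum(abs(p - r) ** q for p, r in zip(table[va], table[vb]))
    return total
```

Features whose values say nothing about the class contribute zero to the dissimilarity, which is how the class labels reweight the components.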

Zongker:

The Zongker digit dissimilarity data (2000 points in 10 classes) from \citep{PrTools:2012:Online} is based on deformable template matching. The dissimilarity measure was computed between 2000 handwritten NIST digits in 10 classes, with 200 entries each, as a result of an iterative optimization of the non-linear deformation of the grid \citep{zongker}. The signature is (1039,961,0).

References can be found in

http://www.techfak.uni-bielefeld.de/~fschleif/data/references_benchmark_data.bib

Further datasets with indefinite proximities are available at

http://lmb.informatik.uni-freiburg.de/people/haasdonk/datasets/distances.en.html
