
Machine Learning and Data Mining - Datasets

	Version	Size / MD5	Description
Download	v1.0 Matlab V6 (04/06/2006)	4.3Mb (8c68b1c84edb8d28ec80a6824c6b06f1)	This is the Yale Face Database B in a Matlab-friendly format. Please check with the original authors about which papers to cite before using this data. It contains all the images scaled down to 30x40 pixels (we used this for clustering). You might need Rar to unpack it. Also included are the indices of the images that were used in the random 90/10 splits. To try out the data and display image 999, type into Matlab: load yale_facedatabase_B.mat; image(reshape(bigMatrix(999,:),30,40)); colormap(gray);
Download	v1.0 Matlab V6 (03/05/2007)	2.5Mb (21d58a4f7e63564b3e7c52ae2974458a)	This archive contains all the datasets we used for our ICML 2005 paper "Clustering through ranking on Manifolds", ready for use in Matlab. Please make sure you cite the sources of the data (e.g. UCI control, USPS, 20 Newsgroups, the face database). Note that the uncompressed data is larger than 250 MB.
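
The MD5 values listed in the table can be used to check that a download arrived intact. The following is a minimal Python sketch of that check, not something the page itself provides. The function names are made up for illustration, and the archive file name in the usage comment is hypothetical, since the page does not state it.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large archives need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_md5(path, listed_md5):
    """Compare a file's MD5 with the value listed next to the download."""
    return md5_of_file(path) == listed_md5.strip().lower()


# Usage (hypothetical file name, MD5 taken from the first table row):
# matches_listed_md5("yale_facedatabase_B.rar",
#                    "8c68b1c84edb8d28ec80a6824c6b06f1")
```

If the check returns False, the file is likely truncated or corrupted and should be downloaded again.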

All data is provided for your personal research use only and "AS IS". All other rights reserved. No warranties of any kind.

Links

Gunnar Raetsch's Benchmark Datasets - Various benchmark datasets prepared for Matlab (V6 and V7). Includes BreastCancer, Cards, chess, Circle, credit, Heart1, hepatitis, HouseVotes84, Ionosphere, liver, monks3, musk, PimaIndiansDiabetes, promotergene, ringnorm, Sonar, Spirals, threenorm, tictactoe, titanic and twonorm. These are the benchmark data sets used in [RaeOnoMue01] and [MikRaeWesSchMue99]. Very good for classification tasks. [RaeOnoMue01 Mirror] [MikRaeWesSchMue99 Mirror]
Data from "Benchmarking Support Vector Machines" [MeyerLeischHornik02]. Very good for comparing your classifier or regression algorithm against other algorithms (SVM, KNN, Neural Nets, Bagging, Boosting, Random Forests and others). Includes many data sets such as liver, hepatitis, credit, monks3, HouseVotes84, Sonar, tictactoe, ringnorm, musk, Spirals, threenorm, Ionosphere, BreastCancer, Circle, titanic, Heart1, chess, PimaIndiansDiabetes, promotergene, twonorm, Cards. The data is stored as R images (.RData files). To extract it, you can use the following R command, which writes the training set of each of the 100 files to a text file: for(i in (1:100)){load(sprintf("%i.RData",i)); write.table(train,file=sprintf("%itrain.txt",i));}
UCI Machine Learning Repository - Many useful datasets
DMOZ - Data sets for machine learning
A dataset for path-finding in images (Field Robotics)
LETOR - package of benchmark data sets for LEarning TO Rank
Delve Datasets
KIN40K regression data set
Clustering Data Sets (Mammals, Birth/Death Rates, New Haven Schools, Nutrients)
UCI and UCI KDD data sets for classification and regression in Weka ARFF format. More ARFF datasets, such as protein and biomedical data, drug design, Reuters-21578 in the ModApte split, and various agricultural data sets, can be found here.
Clustering data sets
Fundamental Clustering Problem Suite (FCPS). Includes clustering problems such as Hepta, Lsun, Tetra, Chainlink, Atom, EngyTime, Target, TwoDiamonds, Wingnut and Golfball.
RCV1 Text Categorization Test Collection
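
The R loop in the "Benchmarking Support Vector Machines" entry above writes each training set with write.table and its default settings. With those defaults the fields are separated by spaces, strings are quoted, and every data row starts with a row name that has no matching entry in the header line. The following is a minimal Python sketch for reading such a file outside R. The function name and the column names in the sample (x1, x2, y) are hypothetical and not taken from the actual data sets.

```python
import csv
import io


def read_r_table(text):
    """Parse text written by R's write.table() with default settings."""
    rows = list(csv.reader(io.StringIO(text), delimiter=" ", quotechar='"'))
    header, body = rows[0], rows[1:]
    records = []
    for row in body:
        # By default each data row starts with the row name, so it has
        # one more field than the header; drop that leading field.
        if len(row) == len(header) + 1:
            row = row[1:]
        records.append(dict(zip(header, row)))
    return header, records


# Hypothetical sample in the layout that write.table produces by default.
sample = '"x1" "x2" "y"\n"1" 0.5 1.2 "pos"\n"2" 0.1 3.4 "neg"\n'
header, records = read_r_table(sample)
print(header)
print(records[0])
```

All values come back as strings, so numeric columns still need converting with float() before use.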

Blog

Consider checking out my blog

Email me