Machine Learning and Data Mining - Datasets
| | Version | Size / MD5 | Description |
|---|---|---|---|
| Download | v1.0 ASCII (10/11/2003) | 1.6 MB (d78b138786b158fb20e64af6d5d2b46b) | This is the USPS (ZIP code) dataset in an ASCII format that Matlab can read. I got the data from the original source and disliked the odd file format; after a day or so I finally got it converted into something useful. Each row holds the digit label in the first column, followed by the 256 features. The data is NOT in sparse format. You might need RAR to unpack it. To load the data in Matlab, use `load -ascii`. |
| Download | v1.0 Matlab V6 (04/06/2006) | 4.3 MB (8c68b1c84edb8d28ec80a6824c6b06f1) | This is the Yale Face Database B in a Matlab-friendly format. Please check with the original authors about which papers to cite before using this data. It contains all the images scaled down to 30x40 pixels (we used this for clustering). Also included are the indices of the images that were used in the random 90/10 splits. You might need RAR to unpack it. To try out the data and display image 999, type into Matlab: `load yale_facedatabase_B.mat; image(reshape(bigMatrix(999,:),30,40)); colormap(gray);` |
| Download | v1.0 Matlab V6 (03/05/2007) | 2.5 MB (21d58a4f7e63564b3e7c52ae2974458a) | This archive contains all the datasets we used for our ICML 2005 paper "Clustering through ranking on Manifolds", ready for use in Matlab. Please make sure you cite the sources of the data (e.g. UCI control, USPS, 20 Newsgroups, the face database). Note that the uncompressed data is larger than 250 MB. |
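
For readers working outside Matlab, the following is a minimal Python sketch of three steps the table above implies: checking a download against its listed MD5 sum, parsing the ASCII layout described for the USPS file (label first, then 256 features), and reproducing Matlab's column-major `reshape` for a flattened image row. It assumes the ASCII file is whitespace-separated with one sample per line, which is what Matlab's ASCII format normally produces but is not confirmed by this page. The function names are illustrative, not part of any distributed code.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def load_labeled_ascii(path, n_features=256):
    """Parse rows of the form 'label f1 ... fN' (assumed whitespace-separated)."""
    labels, features = [], []
    with open(path) as handle:
        for line_no, line in enumerate(handle, start=1):
            values = line.split()
            if not values:
                continue  # skip blank lines
            if len(values) != n_features + 1:
                raise ValueError(
                    f"line {line_no}: expected {n_features + 1} values, "
                    f"got {len(values)}"
                )
            # Labels may be written in float notation (e.g. 3.0000000e+00).
            labels.append(int(float(values[0])))
            features.append([float(v) for v in values[1:]])
    return labels, features


def reshape_column_major(flat, n_rows, n_cols):
    """Mimic Matlab's reshape(flat, n_rows, n_cols), which fills columns first."""
    if len(flat) != n_rows * n_cols:
        raise ValueError("length of flat does not match n_rows * n_cols")
    return [[flat[c * n_rows + r] for c in range(n_cols)] for r in range(n_rows)]
```

To check a download, compare `md5_of_file(...)` on the archive with the hash listed in the table before unpacking. Note that Python indexes from zero, so Matlab's `bigMatrix(999,:)` corresponds to row index 998 once the matrix is loaded into a Python list.
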
Links
- Gunnar Raetsch's Benchmark Datasets: various benchmark datasets prepared for Matlab (V6 and V7). Includes BreastCancer, Cards, chess, Circle, credit, Heart1, hepatitis, HouseVotes84, Ionosphere, liver, monks3, musk, PimaIndiansDiabetes, promotergene, ringnorm, Sonar, Spirals, threenorm, tictactoe, titanic and twonorm. These are the benchmark data sets used in [RaeOnoMue01] and [MikRaeWesSchMue99]. Very good for classification tasks. [RaeOnoMue01 Mirror] [MikRaeWesSchMue99 Mirror]
- Data from "Benchmarking Support Vector Machines" [MeyerLeischHornik02]. Very good for comparing your classifier or regression algorithm against other algorithms (SVM, KNN, Neural Nets, Bagging, Boosting, Random Forests and others). Includes many data sets such as liver, hepatitis, credit, monks3, HouseVotes84, Sonar, tictactoe, ringnorm, musk, Spirals, threenorm, Ionosphere, BreastCancer, Circle, titanic, Heart1, chess, PimaIndiansDiabetes, promotergene, twonorm and Cards. The data is stored as R images (.RData files). To extract the training sets as text files, you can use the following R command: `for(i in (1:100)){load(sprintf("%i.RData",i)); write.table(train,file=sprintf("%itrain.txt",i));}`
- UCI Machine Learning Repository - Many useful datasets
- DMOZ - Data sets for machine learning
- Clustering data sets
- A dataset for path-finding in images (Field Robotics)
- LETOR - package of benchmark data sets for LEarning TO Rank
- Delve Datasets
- KIN40K regression data set
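
The R command in the "Benchmarking Support Vector Machines" entry above calls `write.table` with default arguments. With those defaults, R writes space-separated fields, quotes strings, and emits a header line that has no entry for the row-name column, so each data row has one more field than the header. A minimal Python sketch for reading such a file might look like the following; the function name is illustrative, and the actual column names and types in the benchmark files are not stated on this page, so all values are returned as strings.

```python
import csv


def read_r_write_table(path):
    """Read a file written by R's write.table() with default arguments.

    Returns (header, row_names, rows); every value is kept as a string.
    """
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter=" ", quotechar='"')
        rows = [row for row in reader if row]
    if not rows:
        return [], [], []
    header, body = rows[0], rows[1:]
    # Default write.table output omits the row-name column from the header,
    # so data rows carry one extra leading field.
    if body and len(body[0]) == len(header) + 1:
        row_names = [row[0] for row in body]
        body = [row[1:] for row in body]
    else:
        row_names = []
    return header, row_names, body
```

Numeric columns can then be converted with `float(...)` once you know which columns of a given data set are numeric.
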