===== Presentation =====

These subsets have been extracted from the 20 Newsgroups dataset, which can be found at
http://people.csail.mit.edu/jrennie/20Newsgroups/

author: Clément Grimal
http://membres-lig.imag.fr/grimal/
Questions, suggestions or comments are appreciated!

date: April, 2012

===== Description =====

5 subsets have been built from the following 10 newsgroups:
comp.graphics, misc.forsale, rec.autos, rec.sport.baseball, rec.sport.hockey, sci.crypt, sci.med, sci.space, soc.religion.christian, talk.politics.mideast

The 5 subsets correspond to different numbers of documents per class:
 - c10d40: 40 documents per class (400 total)
 - c10d80: 80 documents per class (800 total)
 - c10d160: 160 documents per class (1600 total)
 - c10d320: 320 documents per class (3200 total)
 - c10d640: 640 documents per class (6400 total)

Every subset contains 10 samples. The documents have been selected randomly, and the words have been selected with the Partitioning Around Medoids (PAM) algorithm, for different numbers of words:
 - PAM500: 500 words selected
 - PAM1000: 1000 words selected
 - PAM1500: 1500 words selected
 - PAM2000: 2000 words selected
 - PAM2500: 2500 words selected
 - PAM3000: 3000 words selected
 - PAM3500: 3500 words selected
 - PAM4000: 4000 words selected

===== Files =====

All the files are encoded in UTF-8.

_.mtx -- the document-word matrix, containing the number of co-occurrences, in the Matrix Market coordinate format (sparse).
__.mtx -- the document-word matrix, after word selection, containing the number of co-occurrences, in the Matrix Market coordinate format (sparse).
_.mapcol.txt -- the mapping between the columns of the matrix and the words, with lines of the form:
  word_id word
  ...
__.mapcol.txt -- the mapping between the columns of the matrix and the words, after word selection, with lines of the form:
  word_id word
  ...
_.maprow.txt -- the mapping between the rows of the matrix and the path to the document within the 20 Newsgroups dataset (e.g. sci.electronics/53847), with lines of the form:
  document_id document_path
  ...
_act.txt -- the list of the assignments of the documents to a topic.
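
As an illustration of how the files above could be read, here is a minimal sketch in Python using only the standard library. It assumes the standard Matrix Market coordinate layout (a "%%MatrixMarket" banner, optional "%" comment lines, a size line "rows cols entries", then one "row col value" triple per line with 1-based indices) and assumes that the mapping files hold one whitespace-separated "id name" pair per line. The function names read_mtx_coordinate and read_mapping, and the demo data, are made up for this example and are not part of the dataset.

```python
from typing import Dict, Iterable, Tuple


def read_mtx_coordinate(
    lines: Iterable[str],
) -> Tuple[int, int, Dict[Tuple[int, int], float]]:
    """Parse Matrix Market coordinate (sparse) data.

    Returns (n_rows, n_cols, entries), where entries maps 0-based
    (row, col) pairs to the stored value.
    """
    entries: Dict[Tuple[int, int], float] = {}
    shape = None
    for raw in lines:
        line = raw.strip()
        # Skip blank lines, the "%%MatrixMarket" banner and "%" comments.
        if not line or line.startswith("%"):
            continue
        fields = line.split()
        if shape is None:
            # First data line: number of rows, columns and stored entries.
            shape = (int(fields[0]), int(fields[1]))
            continue
        # Matrix Market indices are 1-based; convert them to 0-based.
        row, col = int(fields[0]) - 1, int(fields[1]) - 1
        entries[(row, col)] = float(fields[2])
    if shape is None:
        raise ValueError("no size line found")
    return shape[0], shape[1], entries


def read_mapping(lines: Iterable[str]) -> Dict[int, str]:
    """Parse 'id name' lines into a dict, keeping ids as given in the file."""
    mapping: Dict[int, str] = {}
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        key, name = line.split(None, 1)
        mapping[int(key)] = name
    return mapping


# Made-up demo data: 2 documents, 3 words, 2 non-zero counts.
demo_mtx = [
    "%%MatrixMarket matrix coordinate integer general",
    "2 3 2",
    "1 1 4",
    "2 3 1",
]
n_rows, n_cols, counts = read_mtx_coordinate(demo_mtx)
demo_words = read_mapping(["1 space", "2 hockey", "3 car"])
```

With real files, the same functions can be fed an open file handle, for instance open(path, encoding="utf-8"), where path points to one of the .mtx or .mapcol.txt files. Whether the ids in the mapping files start at 0 or 1 is not stated above, so the sketch keeps them exactly as they appear in the file.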