===== Presentation =====

These subsets were extracted from the 20 Newsgroups dataset, which can be found at
http://people.csail.mit.edu/jrennie/20Newsgroups/

authors: Clément Grimal
         http://membres-lig.imag.fr/grimal/
         Questions, suggestions or comments are appreciated!

See: An Improved Co-Similarity Measure for Document Clustering, Syed Fawad Hussain, Clément Grimal, Gilles Bisson, ICMLA'2010.

date: October, 2010

===== Description =====

The archive contains 6 subsets:
  * M2: talk.politics.mideast, talk.politics.misc (500 documents)
  * M5: comp.graphics, rec.motorcycles, rec.sport.baseball, sci.space, talk.politics.mideast (500 documents)
  * M10: alt.atheism, comp.sys.mac.hardware, misc.forsale, rec.autos, rec.sport.hockey, sci.crypt, sci.electronics, sci.med, sci.space, talk.politics.guns (500 documents)
  * NG1: rec.sport.baseball, rec.sport.hockey (400 documents)
  * NG2: comp.os.ms-windows.misc, comp.windows.x, rec.motorcycles, sci.crypt, sci.space (1000 documents)
  * NG3: comp.os.ms-windows.misc, comp.windows.x, misc.forsale, rec.motorcycles, sci.crypt, sci.space, talk.politics.mideast, talk.religion.misc (1600 documents)

Every subset contains 10 samples. The documents were selected randomly, and the words were selected with the Partitioning Around Medoids (PAM) algorithm.

===== Files =====

All the files are encoded in UTF-8.

_.txt -- the documents-words matrix, containing the number of co-occurrences.

_.colMapping.txt -- word_id||word ...
The mapping between the columns of the matrix and the words.

_.rowMapping.txt -- document_id||document_path ...
The mapping between the rows of the matrix and the path to the document within the NG20 dataset (e.g. sci.electronics/53847).

_act.txt -- contains the list of the assignments of the documents to a topic.
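
The two mapping files use the "id||value" layout described above, so they can be read with a few lines of Python. The following is only a minimal sketch: it assumes one entry per line, keeps the ids as strings, and uses made-up sample content instead of a real file from the archive. The helper names parse_mapping and topic_of are hypothetical and are not part of the dataset.

```python
import io


def parse_mapping(lines, sep="||"):
    """Parse 'id||value' lines into a dict {id: value} (ids kept as strings)."""
    mapping = {}
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        key, _, value = line.partition(sep)
        mapping[key] = value
    return mapping


def topic_of(document_path):
    """Return the newsgroup part of a path such as 'sci.electronics/53847'."""
    return document_path.split("/", 1)[0]


# Made-up sample content in the documented 'document_id||document_path' layout.
sample = io.StringIO("0||sci.electronics/53847\n1||sci.space/60000\n")
rows = parse_mapping(sample)
topics = {doc_id: topic_of(path) for doc_id, path in rows.items()}
print(topics)
```

For a real file, pass an open file handle instead of the sample, for example open(path, encoding="utf-8"), where path points to one of the rowMapping or colMapping files. The topic derived from the path can then be compared with the assignments listed in the corresponding _act.txt file.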