1: Download address: ak/lucene/mahout/
I downloaded mahout-0.3 (the listing entry dated 17-Mar-2010 02:12, 47M).
2: Extract the archive
tar -xvf mahout-0.3.tar.gz


3: Configure the environment
export HADOOP_HOME=/home/hadoopuser/hadoop-0.19.2
export HADOOP_CONF_DIR=/home/hadoopuser/hadoop-0.19.2/conf


4: Try it out first
bin/mahout --help
This lists the many algorithms that are available.


5: Try k-means clustering


bin/mahout kmeans --input /user/hive/warehouse/tmp_data/complex.seq --clusters 5 --output /home/


The parameters that kmeans requires can be viewed with the following command:
bin/mahout kmeans --help
Files processed by Mahout must be in SequenceFile format, so plain text files first have to be converted into SequenceFiles.


SequenceFile is a Hadoop class that allows us to write binary key/value pairs to a file. For a detailed introduction, see


the post written by eyjian: www.hadoopor/viewthread.php?tid=144&highlight=sequencefile
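
To give a rough picture of what a SequenceFile holds, here is a minimal Java sketch that writes a few Text key/value records with Hadoop's SequenceFile.Writer. The class name SequenceFileWriteDemo and the output path demo.seq are made up for illustration only:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileWriteDemo {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("demo.seq");  // hypothetical output path

    // The key and value classes are fixed when the file is created.
    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, path, Text.class, Text.class);
    try {
      // Each append() writes one binary key/value record.
      writer.append(new Text("doc-1"), new Text("first document body"));
      writer.append(new Text("doc-2"), new Text("second document body"));
    } finally {
      writer.close();
    }
  }
}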
Mahout provides a way to convert the files under a specified directory into SequenceFiles.
(You may find Tika (/tika) helpful in converting binary documents to text.)
Usage is as follows:
$MAHOUT_HOME/bin/mahout seqdirectory \
--input <PARENT DIR WHERE DOCS ARE LOCATED> --output <OUTPUT DIRECTORY> \
<-c <CHARSET NAME OF THE INPUT DOCUMENTS> {UTF-8|}> \
<-chunk <MAX SIZE OF EACH CHUNK in Megabytes> 64> \
<-prefix <PREFIX TO ADD TO THE DOCUMENT ID>>

For example:
bin/mahout seqdirectory --input /hive/hadoopuser/ --output /mahout/seq/ --charset UTF-8
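
To check what seqdirectory produced, a small Java sketch like the one below can read the resulting SequenceFile back; keys are document ids and values are the document text. The file name chunk-0 under /mahout/seq/ is an assumption here, so list the output directory to find the actual names:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SeqDirectoryDump {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("/mahout/seq/chunk-0");  // assumed output file name

    SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
    try {
      Text key = new Text();    // document id
      Text value = new Text();  // document content
      while (reader.next(key, value)) {
        String text = value.toString();
        // Print the id and the first 80 characters of each document.
        System.out.println(key + " => " + text.substring(0, Math.min(80, text.length())));
      }
    } finally {
      reader.close();
    }
  }
}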
A simple example of running k-means:

1: Put the sample data set into a directory on HDFS; it should go under the testdata directory
$HADOOP_HOME/bin/hadoop fs -put <PATH TO DATA> testdata
For example:
bin/hadoop fs -put /home/hadoopuser/mahout-0.3/test/synthetic_control.data /user/hadoopuser/testdata/
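
The same upload can also be done from Java if that is more convenient. Here is a minimal sketch using Hadoop's FileSystem API; the class name PutTestData is made up, and the paths are the ones from the example above:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PutTestData {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path local = new Path("/home/hadoopuser/mahout-0.3/test/synthetic_control.data");
    Path remote = new Path("/user/hadoopuser/testdata/synthetic_control.data");

    // Equivalent to: bin/hadoop fs -put <local file> <HDFS directory>
    fs.copyFromLocalFile(local, remote);
  }
}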

2: Run the kmeans algorithm
$HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-$MAHOUT_VERSION.job org.apache.mahout.clustering.syntheticcontrol.kmeans.Job
For example:
bin/hadoop jar /home/hadoopuser/mahout-0.3/mahout-examples-0.1.job org.apache.mahout.clustering.syntheticcontrol.kmeans.Job


3: Run the canopy algorithm
$HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-$MAHOUT_VERSION.job org.apache.mahout.clustering.syntheticcontrol.canopy.Job
For example:
bin/hadoop jar /home/hadoopuser/mahout-0.3/mahout-examples-0.1.job org.apache.mahout.clustering.syntheticcontrol.canopy.Job

4: Run the dirichlet algorithm
$HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-$MAHOUT_VERSION.job org.apache.mahout.clustering.syntheticcontrol.dirichlet.Job

5: Run the meanshift algorithm
$HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-$MAHOUT_VERSION.job org.apache.mahout.clustering.syntheticcontrol.meanshift.Job

6: Take a look at the results
bin/mahout vectordump --seqFile /user/hadoopuser/output/data/part-00000
This prints the results directly to the console.

Get the data out of HDFS and have a look:
All of the example jobs use testdata as input and write to the directory output.
Use bin/hadoop fs -lsr output to view all outputs.
Output:
KMeans results are placed into output/points
Canopy and MeanShift results are placed into output/clustered-points

English reference link:
/MAHOUT/syntheticcontroldata.html
TriJUG: Intro to Mahout Slides and Demo examples
First off, big thank you to TriJUG and all the attendees for allowing me to present Apache Mahout last night.  Also a big thank you to Red Hat for providing a most excellent meeting space.  Finally, to Manning Publications for providing vouchers for Taming Text and Mahout In Action for the end of the night raffle.  Overall, I think it went well, but that’s not for me to judge.  There were a lot of good questions and a good sized audience.
The slides for the Monday, Feb. 15 TriJUG talk are at: Intro to Mahout Slides (Intro Mahout (PDF)).
For the “ugly demos”, below is a history of the commands I ran for setup, etc.  Keep in mind that you can almost always run bin/mahout <COMMAND> --help to get syntax help for any given command.
Here’s the preliminary setup stuff I did:
1. Get and preprocess the Reuters content per www.lucenebootcamp/lucene-boot-camp-preclass-training/
2. Create the sequence files: bin/mahout seqdirectory --input <PATH>/content/reuters/reuters-out --output <PATH>/content/reuters/seqfiles --charset UTF-8
3. Convert the Sequence Files to Sparse Vectors, using the Euclidean norm and the TF weight (for LDA): bin/mahout seq2sparse --input <PATH>/content/reuters/seqfiles --output <PATH>/content/reuters/seqfiles-TF --norm 2 --weight TF
4. Convert the Sequence Files to Sparse Vectors, using the Euclidean norm and the TF-IDF weight (for Clustering): bin/mahout seq2sparse --input <PATH>/content/reuters/seqfiles --output <PATH>/content/reuters/seqfiles-TF-IDF --norm 2 --weight TFIDF (a rough sketch of what this weighting and normalization mean follows this list)
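
As promised above, here is a small illustrative Java sketch of textbook TF-IDF weighting and L2 (Euclidean) normalization. It is only meant to convey the idea behind --weight TFIDF and --norm 2; Mahout's actual seq2sparse implementation may use different smoothing, and the class and method names here are made up:

public class TfIdfSketch {

  // Textbook tf-idf weight for one term in one document:
  // term frequency times log(total documents / documents containing the term).
  static double tfIdf(int termFreq, int docFreq, int numDocs) {
    return termFreq * Math.log((double) numDocs / docFreq);
  }

  // L2 (Euclidean) normalization: divide every weight by the vector's length.
  static double[] l2Normalize(double[] weights) {
    double sumOfSquares = 0.0;
    for (double w : weights) {
      sumOfSquares += w * w;
    }
    double norm = Math.sqrt(sumOfSquares);
    double[] normalized = new double[weights.length];
    for (int i = 0; i < weights.length; i++) {
      normalized[i] = (norm == 0.0) ? 0.0 : weights[i] / norm;
    }
    return normalized;
  }
}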
For Latent Dirichlet Allocation I then ran:
1. ./mahout lda --input <PATH>/content/reuters/seqfiles-TF/vectors/ --output <PATH>/content/reuters/seqfiles-TF/lda-output --numWords 34000 --numTopics 20
2. ./mahout org.apache.mahout.clustering.lda.LDAPrintTopics --input <PATH>/content/reuters/seqfiles-TF/lda-output/state-19 --dict <PATH>/content/reuters/seqfiles-TF/dictionary.file-0 --words 10 --output <PATH>/content/reuters/seqfiles-TF/lda-output/topics --dictionaryType sequencefile
For K-Means Clustering I ran:
1. ./mahout kmeans --input <PATH>/content/reuters/seqfiles-TFIDF/vectors/part-00000 --k 15 --output <PATH>/content/reuters/seqfiles-TFIDF/output-kmeans --clusters <PATH>/content/reuters/seqfiles-TFIDF/output-kmeans/clusters
2. Print out the clusters: ./mahout clusterdump --seqFileDir /Volumes/Content/grantingersoll/content/reuters/seqfiles-TFIDF/output-kmeans/clusters-15/ --pointsDir /Volumes/Content/grantingersoll/content/reuters/seqfiles-TFIDF/output-kmeans/points/ --dictionary /Volumes/Content/grantingersoll/content/reuters/seqfiles-TFIDF/dictionary.file-0 --dictionaryType sequencefile --substring 20
