1: Download address: ak/lucene/mahout/
I downloaded mahout-0.3 (the listing entry dated 17-Mar-2010 02:12, 47M).
2: Extract the archive
tar -xvf mahout-0.3.tar.gz


3: Configure the environment
export HADOOP_HOME=/home/hadoopuser/hadoop-0.19.2
export HADOOP_CONF_DIR=/home/hadoopuser/hadoop-0.19.2/conf


4: Try it out first
bin/mahout --help
This lists the many algorithms that are available.


5: Try k-means clustering


bin/mahout kmeans --input /user/hive/warehouse/tmp_data/complex.seq --clusters 5 --output /home/


The parameters that kmeans requires can be viewed with the following command:
bin/mahout kmeans --help
Files processed by Mahout must be in SequenceFile format, so plain text files first have to be converted into SequenceFiles.


SequenceFile is a Hadoop class that allows us to write binary key/value pairs to a file. For a detailed introduction, see


the post written by eyjian: www.hadoopor/viewthread.php?tid=144&highlight=sequencefile
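
To give a rough picture of what a SequenceFile holds, here is a minimal Java sketch that writes a few Text key/value records with Hadoop's SequenceFile.Writer. The class name SequenceFileWriteDemo and the output path demo.seq are made up for illustration only:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileWriteDemo {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("demo.seq");  // hypothetical output path

    // The key and value classes are fixed when the file is created.
    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, path, Text.class, Text.class);
    try {
      // Each append() writes one binary key/value record.
      writer.append(new Text("doc-1"), new Text("first document body"));
      writer.append(new Text("doc-2"), new Text("second document body"));
    } finally {
      writer.close();
    }
  }
}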
Mahout provides a way to convert the files under a specified directory into SequenceFiles.
(You may find Tika (/tika) helpful in converting binary documents to text.)
Usage is as follows:
$MAHOUT_HOME/bin/mahout seqdirectory \
--input <PARENT DIR WHERE DOCS ARE LOCATED> --output <OUTPUT DIRECTORY> \
<-c <CHARSET NAME OF THE INPUT DOCUMENTS> {UTF-8|}> \
<-chunk <MAX SIZE OF EACH CHUNK in Megabytes> 64> \
<-prefix <PREFIX TO ADD TO THE DOCUMENT ID>>

For example:
bin/mahout seqdirectory --input /hive/hadoopuser/ --output /mahout/seq/ --charset UTF-8
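
To check what seqdirectory produced, a small Java sketch like the one below can read the resulting SequenceFile back; keys are document ids and values are the document text. The file name chunk-0 under /mahout/seq/ is an assumption here, so list the output directory to find the actual names:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SeqDirectoryDump {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("/mahout/seq/chunk-0");  // assumed output file name

    SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
    try {
      Text key = new Text();    // document id
      Text value = new Text();  // document content
      while (reader.next(key, value)) {
        String text = value.toString();
        // Print the id and the first 80 characters of each document.
        System.out.println(key + " => " + text.substring(0, Math.min(80, text.length())));
      }
    } finally {
      reader.close();
    }
  }
}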
A simple example of running k-means:

1: Put the sample data set into a directory on HDFS; it should go under the testdata directory
$HADOOP_HOME/bin/hadoop fs -put <PATH TO DATA> testdata
For example:
bin/hadoop fs -put /home/hadoopuser/mahout-0.3/test/synthetic_control.data /user/hadoopuser/testdata/
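
The same upload can also be done from Java if that is more convenient. Here is a minimal sketch using Hadoop's FileSystem API; the class name PutTestData is made up, and the paths are the ones from the example above:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PutTestData {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path local = new Path("/home/hadoopuser/mahout-0.3/test/synthetic_control.data");
    Path remote = new Path("/user/hadoopuser/testdata/synthetic_control.data");

    // Equivalent to: bin/hadoop fs -put <local file> <HDFS directory>
    fs.copyFromLocalFile(local, remote);
  }
}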

2: Run the kmeans algorithm
$HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-$MAHOUT_VERSION.job org.apache.mahout.clustering.syntheticcontrol.kmeans.Job
For example:
bin/hadoop jar /home/hadoopuser/mahout-0.3/mahout-examples-0.1.job org.apache.mahout.clustering.syntheticcontrol.kmeans.Job


3: Run the canopy algorithm
$HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-$MAHOUT_VERSION.job org.apache.mahout.clustering.syntheticcontrol.canopy.Job
For example:
bin/hadoop jar /home/hadoopuser/mahout-0.3/mahout-examples-0.1.job org.apache.mahout.clustering.syntheticcontrol.canopy.Job

4: Run the dirichlet algorithm
$HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-$MAHOUT_VERSION.job org.apache.mahout.clustering.syntheticcontrol.dirichlet.Job

5: Run the meanshift algorithm
$HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-$MAHOUT_VERSION.job org.apache.mahout.clustering.syntheticcontrol.meanshift.Job

6: Take a look at the results
bin/mahout vectordump --seqFile /user/hadoopuser/output/data/part-00000
This prints the results directly to the console.

Get the data out of HDFS and have a look:
All of the example jobs use testdata as input and write to the directory output.
Use bin/hadoop fs -lsr output to view all outputs.
Output:
KMeans results are placed into output/points
Canopy and MeanShift results are placed into output/clustered-points

English reference link:
/MAHOUT/syntheticcontroldata.html
TriJUG: Intro to Mahout Slides and Demo examples
First off, big thank you to TriJUG and all the attendees for allowing me to present Apache Mahout last night.  Also a big thank you to Red Hat for providing a most excellent meeting space.  Finally, to Manning Publications for providing vouchers for Taming Text and Mahout In Action for the end of the night raffle.  Overall, I think it went well, but that’s not for me to judge.  There were a lot of good questions and a good sized audience.
The slides for the Monday, Feb. 15 TriJUG talk are at: Intro to Mahout Slides (Intro Mahout (PDF)).
For the “ugly demos”, below is a history of the commands I ran for setup, etc.  Keep in mind that you can almost always run bin/mahout <COMMAND> --help to get syntax help for any given command.
Here’s the preliminary setup stuff I did:
1. Get and preprocess the Reuters content per www.lucenebootcamp/lucene-boot-camp-preclass-training/
2. Create the sequence files: bin/mahout seqdirectory --input <PATH>/content/reuters/reuters-out --output <PATH>/content/reuters/seqfiles --charset UTF-8
3. Convert the Sequence Files to Sparse Vectors, using the Euclidean norm and the TF weight (for LDA): bin/mahout seq2sparse --input <PATH>/content/reuters/seqfiles --output <PATH>/content/reuters/seqfiles-TF --norm 2 --weight TF
4. Convert the Sequence Files to Sparse Vectors, using the Euclidean norm and the TF-IDF weight (for Clustering): bin/mahout seq2sparse --input <PATH>/content/reuters/seqfiles --output <PATH>/content/reuters/seqfiles-TF-IDF --norm 2 --weight TFIDF (a rough sketch of what this weighting and normalization mean follows this list)
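
As promised above, here is a small illustrative Java sketch of textbook TF-IDF weighting and L2 (Euclidean) normalization. It is only meant to convey the idea behind --weight TFIDF and --norm 2; Mahout's actual seq2sparse implementation may use different smoothing, and the class and method names here are made up:

public class TfIdfSketch {

  // Textbook tf-idf weight for one term in one document:
  // term frequency times log(total documents / documents containing the term).
  static double tfIdf(int termFreq, int docFreq, int numDocs) {
    return termFreq * Math.log((double) numDocs / docFreq);
  }

  // L2 (Euclidean) normalization: divide every weight by the vector's length.
  static double[] l2Normalize(double[] weights) {
    double sumOfSquares = 0.0;
    for (double w : weights) {
      sumOfSquares += w * w;
    }
    double norm = Math.sqrt(sumOfSquares);
    double[] normalized = new double[weights.length];
    for (int i = 0; i < weights.length; i++) {
      normalized[i] = (norm == 0.0) ? 0.0 : weights[i] / norm;
    }
    return normalized;
  }
}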
For Latent Dirichlet Allocation I then ran:
1. ./mahout lda --input <PATH>/content/reuters/seqfiles-TF/vectors/ --output <PATH>/content/reuters/seqfiles-TF/lda-output --numWords 34000 --numTopics 20
2. ./mahout org.apache.mahout.clustering.lda.LDAPrintTopics --input <PATH>/content/reuters/seqfiles-TF/lda-output/state-19 --dict <PATH>/content/reuters/seqfiles-TF/dictionary.file-0 --words 10 --output <PATH>/content/reuters/seqfiles-TF/lda-output/topics --dictionaryType sequencefile
For K-Means Clustering I ran:
1. ./mahout kmeans --input <PATH>/content/reuters/seqfiles-TFIDF/vectors/part-00000 --k 15 --output <PATH>/content/reuters/seqfiles-TFIDF/output-kmeans --clusters <PATH>/content/reuters/seqfiles-TFIDF/output-kmeans/clusters
2. Print out the clusters: ./mahout clusterdump --seqFileDir /Volumes/Content/grantingersoll/content/reuters/seqfiles-TFIDF/output-kmeans/clusters-15/ --pointsDir /Volumes/Content/grantingersoll/content/reuters/seqfiles-TFIDF/output-kmeans/points/ --dictionary /Volumes/Content/grantingersoll/content/reuters/seqfiles-TFIDF/dictionary.file-0 --dictionaryType sequencefile --substring 20
