Hadoop: Inverted Index with MapReduce (Using a Combiner and a Partitioner)
Preface
This case study has some prerequisites: basic Java, introductory Hadoop knowledge, Maven project management and pom configuration files, Maven packaging, working in a Linux virtual machine, and a Hadoop cluster. If anything feels difficult while reading, please fill in those gaps first. And of course, if you have questions, feel free to leave a comment or send me a private message.
I. Case Requirements
1) Implement an inverted index: count the number of times each word appears in each file; see the case description below.
2) Input: create a few files yourself, e.g. a.txt, b.txt, c.txt.
Each file contains several lines of words, with the words separated by spaces.
Upload these files to the /reversed directory in HDFS. For example, the contents of a.txt:
hadoop google scau
map hadoop reduce
hive hello hbase
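The sample output below also refers to a b.txt; as an illustration only, a hypothetical b.txt consistent with those counts might look like:
hadoop map
spark reduce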
3) Write a program that implements the inverted index.
4) Partitioning requirement: words beginning with the letters A-M (including lowercase) go to partition 0; words beginning with N-Z go to partition 1; words beginning with any other character go to partition 2.
5) Output format for each word: hadoop  a.txt->2,b.txt->1, where hadoop is the word (and also the output key), and "a.txt->2,b.txt->1" is the output value, meaning that the word hadoop appears 2 times in a.txt and 1 time in b.txt.
Case description:
The first MapReduce pass counts how many times each word appears in each document.
Example output (K,V) pairs (the format can be customized; fields are separated by \t by default):
hadoop->a.txt  2
hadoop->b.txt  1
map->a.txt  1
map->b.txt  1
The second MapReduce pass takes the result above (its output path) as input and, after processing, outputs the inverted index.
The output (K,V) form is:
hadoop  a.txt->2,b.txt->1
map  a.txt->1,b.txt->1
Hint: obtaining the file name from the context:
FileSplit inputSplit = (FileSplit) context.getInputSplit();
Path path = inputSplit.getPath();
String filename = path.getName();
II. Implementation
1. Create a Maven project in IntelliJ IDEA
The project hierarchy is shown in the original figure (not reproduced here).
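As a sketch, the layout is a standard Maven structure, with the module name taken from the artifactId in the pom below:

MapReduceExp3
├── pom.xml
└── src
    └── main
        └── java
            └── reversedindex
                ├── ReversedCombiner.java
                ├── ReversedIndex.java
                ├── ReversedMapper.java
                ├── ReversedPartitioner.java
                └── ReversedReducer.java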
2. Complete code
ReversedMapper.java
package reversedindex;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

import java.io.IOException;

public class ReversedMapper extends Mapper<LongWritable, Text, Text, Text> {

    private Text outKey = new Text();
    private Text outValue = new Text("1");

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // Get the name of the file the current split belongs to
        FileSplit inputSplit = (FileSplit) context.getInputSplit();
        String fileName = inputSplit.getPath().getName();
        // Emit (word->fileName, 1) for every word on this line
        String[] words = value.toString().split(" ");
        for (String word : words) {
            outKey.set(word + "->" + fileName);
            context.write(outKey, outValue);
        }
    }
}
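For example, for the first line of a.txt ("hadoop google scau"), the mapper emits:

hadoop->a.txt  1
google->a.txt  1
scau->a.txt  1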
ReversedCombiner.java
package reversedindex;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class ReversedCombiner extends Reducer<Text, Text, Text, Text> {

    private Text outKey = new Text();
    private Text outValue = new Text();

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        // Sum the counts for this word->fileName key
        int count = 0;
        for (Text value : values) {
            count += Integer.parseInt(value.toString());
        }
        // Re-key on the word alone; the value becomes fileName->count
        String[] words = key.toString().split("->");
        outKey.set(words[0]);
        outValue.set(words[1] + "->" + count);
        context.write(outKey, outValue);
    }
}
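The combiner performs the first pass from the case description on the map side, which is why a single job suffices here. For the a.txt mapper output above, the two hadoop->a.txt records collapse into one record keyed on the word alone:

hadoop  a.txt->2

One caveat worth knowing: Hadoop treats the combiner as an optimization and does not guarantee how many times (if at all) it runs, so a combiner that changes the key like this one relies on each map task's output actually being combined before it reaches the reducers.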
ReversedPartitioner.java
package reversedindex;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class ReversedPartitioner extends Partitioner<Text, Text> {

    @Override
    public int getPartition(Text text, Text text2, int i) {
        // Lowercase the first character so A-M and a-m land in the same partition
        char head = Character.toLowerCase(text.toString().charAt(0));
        if (head >= 'a' && head <= 'm')
            return 0;
        else if (head > 'm' && head <= 'z')
            return 1;
        else
            return 2;
    }
}
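Note that getPartition must return a value in [0, numReduceTasks), which is why the driver below sets the number of reduce tasks to 3. Also, the partition is assigned to each record as the mapper writes it, so the key the partitioner sees is word->fileName rather than the bare word; since its first character is still the word's first letter, the A-M / N-Z / other rule holds either way.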
ReversedReducer.java
package reversedindex;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class ReversedReducer extends Reducer<Text, Text, Text, Text> {

    private Text outValue = new Text();

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        // Concatenate every fileName->count entry for this word, comma-separated
        StringBuilder stringBuilder = new StringBuilder();
        for (Text value : values) {
            stringBuilder.append(value.toString()).append(",");
        }
        // Drop the trailing comma
        String outStr = stringBuilder.substring(0, stringBuilder.length() - 1);
        outValue.set(outStr);
        context.write(key, outValue);
    }
}
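With three reduce tasks, the job writes three output files, part-r-00000 through part-r-00002, one per partition. With a.txt and the hypothetical b.txt above, part-r-00000 (the A-M partition) would contain lines such as:

hadoop  a.txt->2,b.txt->1
map  a.txt->1,b.txt->1

The order of the fileName->count entries within a value is not guaranteed; it depends on the order in which the values reach the reducer.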
ReversedIndex.java
package reversedindex;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ReversedIndex {

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration());
        job.setJarByClass(ReversedIndex.class);

        job.setMapperClass(ReversedMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);

        job.setReducerClass(ReversedReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        job.setCombinerClass(ReversedCombiner.class);
        job.setPartitionerClass(ReversedPartitioner.class);
        // Three partitions -> three reduce tasks -> three output files
        job.setNumReduceTasks(3);

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        boolean result = job.waitForCompletion(true);
        System.exit(result ? 0 : 1);
    }
}
pom.xml
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>org.example</groupId>
<artifactId>MapReduceExp3</artifactId>
<version>1.0-SNAPSHOT</version>
<properties>
<hadoop.version>3.1.3</hadoop.version>
<maven.compiler.source>8</maven.compiler.source>
<maven.compiler.target>8</maven.compiler.target>
</properties>
<dependencies>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>${hadoop.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-common</artifactId>
<version>${hadoop.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-hdfs</artifactId>
<version>${hadoop.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-mapreduce-client-core</artifactId>
<version>${hadoop.version}</version>
</dependency>
<dependency>
<groupId>org.testng</groupId>
<artifactId>testng</artifactId>
<version>RELEASE</version>
</dependency>
<dependency>
<groupId>log4j</groupId>
<artifactId>log4j</artifactId>
<version>1.2.17</version>
</dependency>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.12</version>
<scope>test</scope>
</dependency>
</dependencies>
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.1</version>
<configuration>
<source>1.8</source>
<target>1.8</target>
<encoding>UTF-8</encoding>
</configuration>
</plugin>
</plugins>
</build>
</project>
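3. Package the local jar with Maven and run it on the cluster
The following commands are a sketch: the jar name follows from the pom above, the input directory comes from the requirements, and the output directory name is my own choice:

mvn clean package
hdfs dfs -mkdir -p /reversed
hdfs dfs -put a.txt b.txt c.txt /reversed
hadoop jar target/MapReduceExp3-1.0-SNAPSHOT.jar reversedindex.ReversedIndex /reversed /reversed_output
hdfs dfs -cat /reversed_output/part-r-00000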