Hadoop: Inverted Index with MapReduce (Using a Combiner and a Partitioner)
Preface
This case study has some prerequisites: basic Java, introductory Hadoop knowledge, Maven project management and pom configuration files, Maven packaging, working in a Linux virtual machine, and a Hadoop cluster. If anything feels difficult while reading, please fill in those gaps first. And of course, if you have questions, feel free to leave a comment or send me a private message.
I. Case Requirements
1) Implement an inverted index: count the number of times each word appears in each file; see the case description below.
2) Input: create a few files yourself, e.g. a.txt, b.txt, c.txt.
Each file contains several lines of words, with the words separated by spaces.
Upload these files to the /reversed directory in HDFS. For example, the contents of a.txt:
hadoop google scau
map hadoop reduce
hive hello hbase
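The sample output below also refers to a b.txt; as an illustration only, a hypothetical b.txt consistent with those counts might look like:
hadoop map
spark reduce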
3) Write a program that implements the inverted index.
4) Partitioning requirement: words beginning with the letters A-M (including lowercase) go to partition 0; words beginning with N-Z go to partition 1; words beginning with any other character go to partition 2.
5) Output format for each word: hadoop  a.txt->2,b.txt->1, where hadoop is the word (and also the output key), and "a.txt->2,b.txt->1" is the output value, meaning that the word hadoop appears 2 times in a.txt and 1 time in b.txt.
Case description:
The first MapReduce pass counts how many times each word appears in each document.
Example output (K,V) pairs (the format can be customized; fields are separated by \t by default):
hadoop->a.txt  2
hadoop->b.txt  1
map->a.txt  1
map->b.txt  1
The second MapReduce pass takes the result above (its output path) as input and, after processing, outputs the inverted index.
The output (K,V) form is:
hadoop  a.txt->2,b.txt->1
map  a.txt->1,b.txt->1
Hint: obtaining the file name from the context:
FileSplit inputSplit = (FileSplit) context.getInputSplit();
Path path = inputSplit.getPath();
String filename = path.getName();
II. Implementation
1. Create a Maven project in IntelliJ IDEA
The project hierarchy is shown in the original figure (not reproduced here).
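As a sketch, the layout is a standard Maven structure, with the module name taken from the artifactId in the pom below:

MapReduceExp3
├── pom.xml
└── src
    └── main
        └── java
            └── reversedindex
                ├── ReversedCombiner.java
                ├── ReversedIndex.java
                ├── ReversedMapper.java
                ├── ReversedPartitioner.java
                └── ReversedReducer.java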
2. Complete code
ReversedMapper.java
package reversedindex;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

import java.io.IOException;

public class ReversedMapper extends Mapper<LongWritable, Text, Text, Text> {

    private Text outKey = new Text();
    private Text outValue = new Text("1");

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // Get the name of the file the current split belongs to
        FileSplit inputSplit = (FileSplit) context.getInputSplit();
        String fileName = inputSplit.getPath().getName();
        // Emit (word->fileName, 1) for every word on this line
        String[] words = value.toString().split(" ");
        for (String word : words) {
            outKey.set(word + "->" + fileName);
            context.write(outKey, outValue);
        }
    }
}
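For example, for the first line of a.txt ("hadoop google scau"), the mapper emits:

hadoop->a.txt  1
google->a.txt  1
scau->a.txt  1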
ReversedCombiner.java
package reversedindex;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class ReversedCombiner extends Reducer<Text, Text, Text, Text> {

    private Text outKey = new Text();
    private Text outValue = new Text();

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        // Sum the counts for this word->fileName key
        int count = 0;
        for (Text value : values) {
            count += Integer.parseInt(value.toString());
        }
        // Re-key on the word alone; the value becomes fileName->count
        String[] words = key.toString().split("->");
        outKey.set(words[0]);
        outValue.set(words[1] + "->" + count);
        context.write(outKey, outValue);
    }
}
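The combiner performs the first pass from the case description on the map side, which is why a single job suffices here. For the a.txt mapper output above, the two hadoop->a.txt records collapse into one record keyed on the word alone:

hadoop  a.txt->2

One caveat worth knowing: Hadoop treats the combiner as an optimization and does not guarantee how many times (if at all) it runs, so a combiner that changes the key like this one relies on each map task's output actually being combined before it reaches the reducers.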
ReversedPartitioner.java
package reversedindex;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class ReversedPartitioner extends Partitioner<Text, Text> {

    @Override
    public int getPartition(Text text, Text text2, int i) {
        // Lowercase the first character so A-M and a-m land in the same partition
        char head = Character.toLowerCase(text.toString().charAt(0));
        if (head >= 'a' && head <= 'm')
            return 0;
        else if (head > 'm' && head <= 'z')
            return 1;
        else
            return 2;
    }
}
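Note that getPartition must return a value in [0, numReduceTasks), which is why the driver below sets the number of reduce tasks to 3. Also, the partition is assigned to each record as the mapper writes it, so the key the partitioner sees is word->fileName rather than the bare word; since its first character is still the word's first letter, the A-M / N-Z / other rule holds either way.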
ReversedReducer.java
package reversedindex;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class ReversedReducer extends Reducer<Text, Text, Text, Text> {

    private Text outValue = new Text();

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        // Concatenate every fileName->count entry for this word, comma-separated
        StringBuilder stringBuilder = new StringBuilder();
        for (Text value : values) {
            stringBuilder.append(value.toString()).append(",");
        }
        // Drop the trailing comma
        String outStr = stringBuilder.substring(0, stringBuilder.length() - 1);
        outValue.set(outStr);
        context.write(key, outValue);
    }
}
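With three reduce tasks, the job writes three output files, part-r-00000 through part-r-00002, one per partition. With a.txt and the hypothetical b.txt above, part-r-00000 (the A-M partition) would contain lines such as:

hadoop  a.txt->2,b.txt->1
map  a.txt->1,b.txt->1

The order of the fileName->count entries within a value is not guaranteed; it depends on the order in which the values reach the reducer.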
ReversedIndex.java
package reversedindex;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ReversedIndex {

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration());
        job.setJarByClass(ReversedIndex.class);

        job.setMapperClass(ReversedMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);

        job.setReducerClass(ReversedReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        job.setCombinerClass(ReversedCombiner.class);
        job.setPartitionerClass(ReversedPartitioner.class);
        // Three partitions -> three reduce tasks -> three output files
        job.setNumReduceTasks(3);

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        boolean result = job.waitForCompletion(true);
        System.exit(result ? 0 : 1);
    }
}
pom.xml
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>org.example</groupId>
<artifactId>MapReduceExp3</artifactId>
<version>1.0-SNAPSHOT</version>
<properties>
<hadoop.version>3.1.3</hadoop.version>
<maven.compiler.source>8</maven.compiler.source>
<maven.compiler.target>8</maven.compiler.target>
</properties>
<dependencies>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>${hadoop.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-common</artifactId>
<version>${hadoop.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-hdfs</artifactId>
<version>${hadoop.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-mapreduce-client-core</artifactId>
<version>${hadoop.version}</version>
</dependency>
<dependency>
<groupId>org.testng</groupId>
<artifactId>testng</artifactId>
<version>RELEASE</version>
</dependency>
<dependency>
<groupId>log4j</groupId>
<artifactId>log4j</artifactId>
<version>1.2.17</version>
</dependency>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.12</version>
<scope>test</scope>
</dependency>
</dependencies>
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.1</version>
<configuration>
<source>1.8</source>
<target>1.8</target>
<encoding>UTF-8</encoding>
</configuration>
</plugin>
</plugins>
</build>
</project>
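3. Package the local jar with Maven and run it on the cluster
The following commands are a sketch: the jar name follows from the pom above, the input directory comes from the requirements, and the output directory name is my own choice:

mvn clean package
hdfs dfs -mkdir -p /reversed
hdfs dfs -put a.txt b.txt c.txt /reversed
hadoop jar target/MapReduceExp3-1.0-SNAPSHOT.jar reversedindex.ReversedIndex /reversed /reversed_output
hdfs dfs -cat /reversed_output/part-r-00000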