Introduction to the cloud computing


Homework 3 (due: November 19)
Building an HBase Index with MapReduce
Background


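An HBase table is partitioned into multiple Regions by row key, and
each Region typically holds a large amount of data. A query that
filters on a column value therefore has to scan from the first row
downward, which is clearly inefficient. If we instead build a second
table in which the frequently queried column serves as the row key and
the original row key is stored as a column, we can quickly locate the
row that holds a given column value. Because HBase keeps row keys in
sorted, indexed order (organized like a B-Tree), the matching row key
can be found quickly even when there are many distinct column values.
This second table is the index.
The example below shows an original table and its index table.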
Index table example (original table)

          Column family: info
Row key   name     email              power
1         peter    [email protected]  Absorb abilities
2         hiro     [email protected]  Bend time and space
3         sylar    [email protected]  Know how things work
4         claire   [email protected]  Heal
5         noah     [email protected]  Catch the people with abilities
Index table example (index table)

Index table                          Original table
Row key (name)   Original row key    Row key   Column
claire           4                   1         info:name=peter
hiro             2                   2         info:name=hiro
noah             5                   3         info:name=sylar
peter            1                   4         info:name=claire
sylar            3                   5         info:name=noah
                                               ...
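The lookup pattern in the two tables above can be tried out without a cluster. The following is a minimal sketch in plain Java, not HBase code. It assumes two in-memory TreeMaps stand in for the original table and the index table; the class name IndexLookupDemo and its methods are hypothetical. A TreeMap keeps its keys sorted, which is the property the index relies on.

```java
import java.util.Map;
import java.util.TreeMap;

public class IndexLookupDemo {
    // Original table: row key -> value of info:name (other columns omitted).
    static final TreeMap<String, String> ORIGINAL = new TreeMap<>();
    // Index table: value of info:name -> row key in the original table.
    static final TreeMap<String, String> INDEX = new TreeMap<>();

    static {
        String[] names = {"peter", "hiro", "sylar", "claire", "noah"};
        for (int i = 0; i < names.length; i++) {
            String rowKey = String.valueOf(i + 1);
            ORIGINAL.put(rowKey, names[i]);
            INDEX.put(names[i], rowKey);
        }
    }

    // Without an index: scan from the first row until the value matches.
    // Returns the number of rows examined, or -1 if no row matches.
    public static int rowsScannedFor(String name) {
        int examined = 0;
        for (Map.Entry<String, String> row : ORIGINAL.entrySet()) {
            examined++;
            if (row.getValue().equals(name)) {
                return examined;
            }
        }
        return -1;
    }

    // With the index: one sorted-key lookup yields the original row key.
    public static String rowKeyOf(String name) {
        return INDEX.get(name);
    }

    public static void main(String[] args) {
        System.out.println("scan examined " + rowsScannedFor("noah") + " rows");
        System.out.println("index maps noah to row " + rowKeyOf("noah"));
    }
}
```

With five rows the difference is trivial, but the scan cost grows with the size of the table while the keyed lookup does not depend on where the row sits.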
Implementation



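InputFormat class. HBase provides the TableInputFormatBase class,
which implements most of the operations on table data; its subclass
TableInputFormat supplies the complete implementation.
TableInputFormat divides the data into splits by Region, so there are
as many splits as there are Regions. Each Region is then broken down
by row key into <key, value> pairs, where the key is the row key and
the value is the data contained in that row.
Mapper class. HBase provides the TableMapper and TableReducer classes.
TableMapper adds no functionality of its own; it only fixes the types
of the input <key, value> pair to ImmutableBytesWritable and Result.
OutputFormat class. HBase's TableOutputFormat writes the output
<key, value> pairs to a specified HBase table.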
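The one-split-per-Region rule can be illustrated with a small simulation. This is a sketch in plain Java, not the TableInputFormat implementation: the class RegionSplitDemo, the method splitByRegion, and the Region start keys used in main are all hypothetical, and a real job obtains the Region boundaries from the cluster.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class RegionSplitDemo {
    // Groups sorted rows into one split per Region.
    // regionStartKeys must be sorted; the first Region starts at the empty key.
    public static List<List<String>> splitByRegion(TreeMap<String, String> rows,
                                                   List<String> regionStartKeys) {
        List<List<String>> splits = new ArrayList<>();
        for (int i = 0; i < regionStartKeys.size(); i++) {
            String start = regionStartKeys.get(i);
            // A Region covers [start, nextStart); the last Region is open-ended.
            Map<String, String> regionRows = (i + 1 < regionStartKeys.size())
                    ? rows.subMap(start, true, regionStartKeys.get(i + 1), false)
                    : rows.tailMap(start, true);
            List<String> records = new ArrayList<>();
            for (Map.Entry<String, String> row : regionRows.entrySet()) {
                // Each record is a <key, value> pair: row key and row data.
                records.add("<" + row.getKey() + ", " + row.getValue() + ">");
            }
            splits.add(records);
        }
        return splits;
    }

    public static void main(String[] args) {
        TreeMap<String, String> rows = new TreeMap<>();
        String[] names = {"peter", "hiro", "sylar", "claire", "noah"};
        for (int i = 0; i < names.length; i++) {
            rows.put(String.valueOf(i + 1), "info:name=" + names[i]);
        }
        // Assumed layout: two Regions, the second one starting at row key "3".
        List<List<String>> splits = splitByRegion(rows, Arrays.asList("", "3"));
        for (int i = 0; i < splits.size(); i++) {
            System.out.println("split " + i + ": " + splits.get(i));
        }
    }
}
```

Each inner list corresponds to the records one map task would receive, so the number of map tasks follows the number of Regions rather than the number of rows.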
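Use the TableInputFormat class to extract the data from the HBase table.
In Map, for each row, extract the value of every column to be indexed
and emit it as a row of the index table.
This exercise does not use a Reduce function.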
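Outline of the Map function
// Imports needed: java.io.IOException, org.apache.hadoop.hbase.client.Put,
// org.apache.hadoop.hbase.client.Result,
// org.apache.hadoop.hbase.io.ImmutableBytesWritable.
// indexedColumns, indexFamily, indexQualifier and tableName are fields
// of the mapper, assumed to be initialized in setup().
@Override
protected void map(ImmutableBytesWritable rowKey, Result result, Context context)
        throws IOException, InterruptedException {
    // For each column to index; a column is written as family:qualifier.
    for (byte[][] column : indexedColumns) { // each entry is {family, qualifier}
        byte[] value = result.getValue(column[0], column[1]);
        if (value == null) {
            continue; // this row has no cell in that column
        }
        // Insert into the index table, with the column value as the row key.
        Put put = new Put(value); // adds one row
        // Store rowKey in the designated column of the index table
        // (put.add(...) in HBase releases before 1.0).
        put.addColumn(indexFamily, indexQualifier, rowKey.get());
        // Finally emit the Put; tableName holds the index table name.
        context.write(tableName, put);
    }
}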
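
The outline above needs an HBase cluster to run. To check the logic on its own, the following plain Java sketch mimics the same map-only flow with in-memory maps. Everything in it is a stand-in chosen for illustration: the class IndexMapDemo, the IndexPut type that plays the role of Put, and the buildIndex method that plays the role of the job writing to the index table.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class IndexMapDemo {
    // Stand-in for an HBase Put: index row key plus the original row key.
    public static final class IndexPut {
        public final String indexRowKey;
        public final String originalRowKey;

        IndexPut(String indexRowKey, String originalRowKey) {
            this.indexRowKey = indexRowKey;
            this.originalRowKey = originalRowKey;
        }
    }

    // Mirrors the map outline: one output per indexed column that has a value.
    public static List<IndexPut> map(String rowKey, Map<String, String> row,
                                     List<String> indexedColumns) {
        List<IndexPut> out = new ArrayList<>();
        for (String column : indexedColumns) {
            String value = row.get(column);
            if (value == null) {
                continue; // this row has no cell in that column
            }
            out.add(new IndexPut(value, rowKey));
        }
        return out;
    }

    // Map-only job: every emitted put goes straight into the index table.
    public static TreeMap<String, String> buildIndex(
            Map<String, Map<String, String>> table, List<String> indexedColumns) {
        TreeMap<String, String> index = new TreeMap<>();
        for (Map.Entry<String, Map<String, String>> row : table.entrySet()) {
            for (IndexPut put : map(row.getKey(), row.getValue(), indexedColumns)) {
                index.put(put.indexRowKey, put.originalRowKey);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, Map<String, String>> table = new TreeMap<>();
        String[] names = {"peter", "hiro", "sylar", "claire", "noah"};
        for (int i = 0; i < names.length; i++) {
            Map<String, String> row = new HashMap<>();
            row.put("info:name", names[i]);
            table.put(String.valueOf(i + 1), row);
        }
        System.out.println(buildIndex(table, Arrays.asList("info:name")));
    }
}
```

In this sketch, two rows that share the same column value map to the same index key, so the later write replaces the earlier one; a real index on a non-unique column has to account for that.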