Netinfo Security (信息网络安全), 2016, Vol. 16, Issue (9): 267-271. doi: 10.3969/j.issn.1671-1122.2016.09.051

Design of Storage Structure in HBase for Microblog Information Analysis (针对微博信息分析的HBase存储结构设计)

Xilin CHEN (陈希林), Ding MA (马丁)

  1. People's Public Security University of China (中国人民公安大学), Beijing 102623, China
  • Received: 2016-07-25 Online: 2016-09-20 Published: 2020-05-13
  • About the authors: Xilin CHEN (born 1992), female, from Henan, master's student; main research interests: network security and big data processing. Ding MA (born 1966), female, from Liaoning, professor, Ph.D.; main research interests: network security and electronic evidence.

Abstract:

As the Internet has developed, microblogs have come to influence people's lives ever more deeply. The surge in microblog users has produced a very large volume of data that keeps growing rapidly. Traditional databases can no longer process data at this scale efficiently enough, which led to the emergence of NoSQL databases. HBase, the database used in this paper, is one of the more popular open-source NoSQL databases. Built on the Hadoop Distributed File System (HDFS), HBase stores structured data efficiently and processes it efficiently through MapReduce; it can also store unstructured data, which gives relatively flexible storage and management for massive data. Most importantly, an HBase cluster is easy to scale: adding slave nodes is enough, which is much simpler than scaling a traditional database through read/write splitting or table sharding. This paper studies HBase row key design for microblog information, examining it from the perspectives of depth information and breadth information, and uses a secondary index to improve HBase query efficiency. Without modifying the HBase source code, the paper addresses the problem that queries are largely constrained by the row key design, and it also considers storage schemes suited to microblog pictures, links, and similar content, so that microblog information can be managed efficiently.

Key words: microblog, Hadoop, NoSQL, HBase, secondary index
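
The abstract states that HBase queries are largely constrained by the row key and that a secondary index relaxes this constraint. The following is a minimal sketch of that general idea, not the authors' actual design: it uses an in-memory sorted map as a stand-in for an HBase table, and the row key layout (`post_row_key`), the index layout (`index_row_key`), the column names, and the choice of "topic" as the indexed attribute are all assumptions made for illustration.

```python
import bisect

MAX_TS = 10**13  # assumed upper bound for millisecond timestamps


class SortedTable:
    """In-memory stand-in for an HBase table: rows kept sorted by row key."""

    def __init__(self):
        self._keys = []   # sorted row keys
        self._rows = {}   # row key -> {column: value}

    def put(self, key, columns):
        if key not in self._rows:
            bisect.insort(self._keys, key)
            self._rows[key] = {}
        self._rows[key].update(columns)

    def get(self, key):
        return self._rows.get(key)

    def scan_prefix(self, prefix):
        """Return (key, row) pairs whose key starts with prefix, in key order."""
        start = bisect.bisect_left(self._keys, prefix)
        out = []
        for key in self._keys[start:]:
            if not key.startswith(prefix):
                break
            out.append((key, self._rows[key]))
        return out


def post_row_key(user_id, ts_ms):
    # Hypothetical layout: zero-padded user id + reversed timestamp,
    # so one user's posts are contiguous and the newest sorts first.
    return f"{user_id:010d}_{MAX_TS - ts_ms:013d}"


def index_row_key(topic, main_key):
    # Hypothetical index layout: indexed value + main-table row key.
    return f"{topic}|{main_key}"


posts = SortedTable()        # main table, keyed by user and time
topic_index = SortedTable()  # secondary index table, keyed by topic


def save_post(user_id, ts_ms, text, topic):
    """Write the post to the main table and a pointer row to the index table."""
    key = post_row_key(user_id, ts_ms)
    posts.put(key, {"info:text": text, "info:topic": topic})
    topic_index.put(index_row_key(topic, key), {"ref:key": key})
    return key


def posts_by_topic(topic):
    """Secondary-index lookup: scan the index table, then fetch main rows."""
    hits = topic_index.scan_prefix(topic + "|")
    return [posts.get(row["ref:key"])["info:text"] for _, row in hits]
```

With this layout, `posts.scan_prefix("0000000001_")` lists one user's posts without a full table scan, while `posts_by_topic("hbase")` answers a query that the main row key alone cannot serve efficiently. The cost of the design is a second write per post and the need to keep both tables consistent.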

CLC Number: