ADistributed In-Memory Database Solution for Mass Data Applications

2010-06-05 08:15:16DongHaoLuoShengmeiZhangHengsheng

ZTE Communications 2010年4期

Dong Hao,Luo Shengmei,Zhang Hengsheng

(Communication Service R&D Institute,ZTE Corporation,Nanjing 210012,P.R.China)

Abstract:In this paper,a Distributed In-Memory Database(DIMDB)system is proposed to improve processing efficiency in mass data applications.The system uses an enhanced language similar to Structured Query Language(SQL)with a key-value storage schema.The design goals of the DIMDB system is described and its system architecture is discussed.Operation flow and the enhanced SQL-like language are also discussed,and experimental results are used to test the validity of the system.

Key Words:distributed in-memory system;enhanced key-value schema;mass data application

A nalyzing and processing mass data is a common function of telecommunications and Internet service applications.As the number of service users increases,traditional on-disk database systems struggle to satisfy demand for mass data processing.Typical service applications include fast querying of large-scale user databases for social networking services,data processing of mass logs,and data analysis and mining of mass databases.In these tasks,response time plays a critical role.

An in-memory database that relies on main memory for computer data storage has been widely used in recent years.In contrast to database management systems,which employ a disk-optimized storage mechanism,main memory databases are faster.Internaloptimization algorithms of main memory databases are simpler and execute fewer CPU instructions.Accessing data in the main memory is faster and more predictable than accessing data on disk[1].Storage services require high availability,good performance,and strong consistency[2].In-Memory Databases(IMDBs)[3]satisfy these requirements and have emerged as a way of improving the performance of short transactions[4].Since IMDBs can be accessed directly from the memory,response time is quicker,and transaction throughput is improved when compared to a Disk-Resident Database(DRDB).This is especially important for real-time applications where transactions need to be completed within a specified timeframe[5].

The number service application users is many times higher than in the past.But memory capacity and CPU processing limitations on a single computer means the IMDB system often cannot deal with the mass data in these applications.For example,the main memory in a computer is normally 4 GB to 8 GB,while 100 GB may be needed for a database of 100 million users where every user record needs 1 Kb.In many cases,the IMDB system on a single computer cannot store all the data of certain types of applications.For some applications,logic processing is complicated and processing time is lengthy.So assigning all these processing tasks to one computer is not ideal.

To improve data processing efficiency,a Distributed In-Memory Database(DIMDB)system is proposed.The DIMDBsystem in this paper supports expanded Structured Query Language(SQL)grammar with a key-value storage schema.

1 Design Goals of DIMDB System

The design of DIMDB has three goals:

(1)High Performance

A DIMDB system should be capable of high-performance data access and processing.Many current telecom and Internet service applications produce mass data every day,and this data should be easily accessible and efficiently processed.Mass data might include detailed call records of telecommunications systems,user subscription information,web access logs of Internet Service Providers(ISPs),and monitoring data derived from sensor networks.These kinds of mass data could be stored in a DIMDB system for ease of access and high-performance processing.

(2)High Scalability DIMDB is a distributed system with multiple data nodes to accommodate mass data applications.As the amount of data and processing increases,new data nodes can be added online without interrupting the running service.If the DIMDBsystem has more data nodes than necessary,redundant nodes can be removed.

(3)High Reliability

▲Figure 1.The basic architecture of a DIMDBsystem.

▲Figure 2.The DIMDB-master functionalstructure.

Data stored in one node always has one or more duplicate copies in other nodes.If a data node fails,an application can use duplicate data on other nodes.The DIMDB management node is an active-standby system for Home Agent(HA).

2 Architecture of a DIMDB System

Figure 1 shows the basic architecture of a DIMDBsystem consisting of three elements:DIMDB-Master,DIMDB-Client,and DIMDB-Data Node(DIMDB-DN).The DIMDB-Client generally receives data processing instructions and access requests from upper-level applications.After analyzing these instructions,it sends the requests to DIMDB-Master for data distribution information,and returns the information according to the data distribution policy.The DIMDB-Client then sends commands to the specific DIMDB-DN for processing and data access.

2.1 DIMDB-Master

The DIMDB-Master:

·Allows applications to configure data distribution policies;

·Allows DIMDB-Client to query configured policies;

·Activates or deactivates DIMDB-DNs and queries the status of DIMDB-DNs.

The functionalstructure of DIMDB-Master is shown in Figure 2.

There are five modules in DIMDB-Master:main control module,http communication module,configure module,node management module,and policy storage.

The main control module is the key functional module of DIMDB-Master.It interacts with other modules to complete essentialfunctions of DIMDB-Master.

The policy storage is used to save data distribution policies.

The http communication module provides M1interface for interaction with DIMDB-Client.M1interface is used to receive query requests from DIMDB-Client and to send data distribution information to DIMDB-Client.

The configuration module interacts with the external operation and maintenance platform to configure data distribution policies on each DIMDB-DN through the M2interface.

The node management module connects to each DIMDB-DN using M3interface.It monitors the status of DIMDB-DNs,and activates or deactivates DIMDB-DNs.Status information includes CPUefficiency,the efficiency and capacity of memory used,and whether the DIMDB-DN is operational or non-operational.

2.2 DIMDB-Client

The functionalmodules of DIMDB-Client are shown in Figure 3.

The APImodule in DIMDB-Client is used by upper-level applications to call the data operation functions of DIMDB-Client.

The statement parser is used to transform received SQL-like statements into two parts:database operation statement and input condition(used for requesting data distribution information).

The execution engine is a dynamic link library connected to DIMDB-ND.It is provided by the database management system on a data node,and is used to bring the statement into operation.

The policy acquisition module is used to obtain the data distribution policy from DIMDB-Master according to the input condition.

2.3 DIMDB-DN

DIMDB-DN is a conventional in-memory database used for data storage.It is a stable and efficient database management system supporting Open Database Connectivity(ODBC)and Java Database Connectivity(JDBC)interfaces.

3 Enhanced SQL-Like Language and Operation Flows

3.1 Enhanced SQL-Like Language

▲Figure 3.DIMDB-client functional structure.

Data operation and query in the proposed system is carried out using an enhanced SQL-like language.This language mimics SQLsyntax for creating tables,loading data into tables,and querying tables.Enhanced SQL-like language also allows data distribution information to be embedded into statements.When the DIMDB-Client receives an SQL-like statement,the statement parser transforms it into a normal SQL statement and input condition.The input condition is a kind of key-value range pair.The policy acquisition module then sends the condition to DIMDB-Master,and receives data distribution DIMDB-DNinformation through C2interface.This information includes node IPaddresses and ports oriented to each key-value sub-range pair.According to the information received,DIMDB-Client rewrites the SQLstatement and divides it into sub-statements.DIMDB-Client connects to DIMDB-DNs,executes the statements through the execution engine,and collects the data operation results.

Enhanced SQL-like language used in the proposed system includes the condition for data distribution information.For example,the following statement creates a TABLEt1 with additional conditions:

CREATETABLEt1(Index int not null,Name char(50)not null,Age int,Height

The following statement inserts a record into TABLEt1:

INSERTINTO t1(Index,Name,Age,

When receiving statements(1)and(2),DIMDB-Client abstracts and sends the key-value pair to DIMDB-Master,which returns the data node information according to the key-value pair.Then DIMDB-Client performs the data operations on DIMDB-DNs according to the received data node information.

If the application needs to query data in TABLEt1,a statement is sent to DIMDB-Client as follows:

SELECT·FROM t1 WHEREAge=25

After receiving query statement(3),DIMDB-Client abstracts the key-value pair Key=Index,Minvalue=1001,Maxvalue=1100 and sends it to DIMDB-Master.DIMDB-Master then returns the data node information DIMDB-ND_1 and DIMDB-ND_2,and corresponding key-value sub-range pairs Key=Index,Minvalue=1001,Maxvalue=1050;and Key=Index,Minvalue=1051,Maxvalue=1100.Then DIMDB-Client divides and rewrites the query statement according to the received data node information.The first sent to DIMDB-ND_1 is:

SELECT·FROM t1 WHEREAge=25 and(Index＞=1001

And the second sent to DIMDB_ND_2 is:

SELECT·FROM t1 WHERE Age=25 and(Index＞=1050

After obtaining query results from the two DIMDB-DNs,DIMDB-Client combines the results and returns them to the application.

3.2 Operation Flows

The operation flow of the DIMDB system is shown in Figure 4.

Step 1:The application calls the API interface in DIMDB-Client with the parameter of an enhanced SQL-like statement.

Step 2:The statement is sent to the statement parser module.

Step 3:After being analyzed,the normal SQLstatement is saved,and the key-value range pair is sent to the policy acquisition module.

Steps 4 and 5:The policy acquisition module interacts with DIMDB-Master to obtain the data distribution information.The input is the key-value range pair,and the output is the IPaddresses and ports of DIMDB-DN for each divided key-value sub-range pair.

Step 6:The data distribution information is sent to API.Step 7:The APIcomposes multiple SQL statements and calls the function of the execution engine with the parameters of new statements and data node information.

▲Figure 4.The DIMDBsystem operation flow.

Step 8.1-8.n:The execution engine connects to multiple DIMDB-DNs and executes the SQL statements.

Step 9.1-9.n:The execution engine obtains the data results.

Step 10:The execution engine sends the data results to API.

Step 11:APIcombines the all data results and returns them to the application.

With these 11 steps,the operation is finished.

4 Experiments

In this section,some experiments are provided to evaluate the performance and scalability of the DIMDB system.

4.1 Experiment Environment

The configuration of the experiment environment is listed in Table I.

4.2 Experimental Results

The bar chart in Figure 5 shows the query time for different total data scales,being 100 k,1 M and 10 M records in the databases of all DIMDB-DNs.Different colored columns denote different scales.The query efficiency for large-scale data in the DIMDB system is notably better than in traditional databases.

As shown in Figure 6,query time for different DIMDB-DNs in the system is different.The query time for DIMDB-DN 3 is notably shorter than that for DIMDB-DNs 1 and 2.When the scale of data is increased,more DIMDB-DNs should be added to the system.

▼Table I.Configuration of experiment environment

?Figure 5.Query time for different totaldata scales.

?Figure 6.Comparison of query time for different DIMDB-DNs in the system.

The above illustrates the effectiveness of the prototype in this paper.Efficiency of the DIMDBsystem is expected to improve with optimizations such as table indexing,and more reasonable data distribution policies.When expanded to hundreds of data nodes,the DIMDB system could be used in telecommunications and Internet applications requiring mass data processing.

5 Conclusion

In this paper,limitations of current in-memory database systems are analyzed,and an a DIMDB system is proposed to improve data processing ability in mass data applications.An enhanced language similar to SQLis used,which has the advantage of a key-value storage schema.DIMDB could be widely used in applications that require mass data processing.Future research will need to be conducted into optimizing data distribution policies and table dividing schemas,and improving overall stability of the system.

Acknowledgment:Thank you to Tang Jue for his support and patient guidance,and to Wang Zhiping,Zhou Yang,Ye Xiaoweiand Lin Xiangdong for their contributions to this work.

ZTE Communications2010年4期

ZTE Communications的其它文章: Cloud Computing(4); ZTE Showcases World’s First VoLTE Call Based on IMSwith Existing Mobile Networksat Mobile Asia Congress2010; An Approach for Telecom Operators to Achieve Converged Telecom and Internet Services; ZTE Wins"Leone D’Oro per la Comunicazione"(Golden Lion for Communications)Award; ZTE Deploysthe World’s First Large-Scale Commercial GPON-Based Mobile Backhaul Network for TELKOM Indonesia; Cloud Storage Technology and Its Applications