UMStor Hadapter: Big Data and Object Storage

Introduction

Almost everyone born before the millennium carries a wuxia complex: a martial-arts world built from the novels of Jin Yong and Gu Long and piled high with Hong Kong and Taiwan martial-arts dramas. Film technology can now show audiences every kind of fantastic effect, yet somehow, on reflection, it still falls short of the jianghu that flows from Jin Yong's pen.

The charm of wuxia lies in the fact that its characters are not simply divided into the righteous and the wicked. Whether the smiling, proud wanderer or the lone swordsman seeking a worthy rival, everyone has an independent spirit, and through unremitting effort even a small figure can finally step onto the great stage of the jianghu, display his talents, lead the pack, and cross swords with the best, leaving readers calling for more.

And is the world of computer technology not a jianghu of its own? Concretely, there are rivalries at the system level such as Windows versus Linux; more abstractly, there are concepts such as private cloud and IOE. The confrontations may not be as direct as a swordsman's duel, but the undercurrents behind them still carry the smell of gunpowder.

What we want to discuss today, of course, is not the jianghu of rivers and lakes, but something related to it: the "data lake".

Two major factions under the data lake

The concept was probably first proposed in 2011 by CTO and author Dan Woods on the CITO Research website. Put simply, a data lake is an information system that satisfies the following two characteristics:

A parallel system that can store big data

Computations can be performed on that data without having to move it elsewhere

In my understanding, data lakes today generally come in the following forms:

Compute and storage as one family

Compute resources and storage resources are integrated, and a single cluster serves every business need. It is easy to imagine what happens as the company grows: different business lines place different computational demands on the data lake, and they end up fighting over compute resources; at the same time, compute and storage can only be scaled out together, which is not very convenient.

Compute and storage as one family, Pro edition

To cope with the resource contention of the previous scheme, the usual answer is to give each business line a data lake of its own. Cluster isolation gives every business line its own compute resources and guarantees good isolation between businesses. But the resulting problem is just as obvious: data silos. Imagine several business lines that need the same dataset to complete their respective analyses; because the storage clusters are also isolated from one another, that dataset has to be copied into every cluster. Data redundancy balloons, and so does the storage cost. Meanwhile, the problem of having to scale compute and storage together still remains.

Compute and storage, better apart

As the saying goes, distance produces beauty. In this mode, compute and storage are separated. Each business line can have its own compute cluster to satisfy its business needs, while all of them point, behind the scenes, at the same shared storage pool, which solves the data redundancy problem of the second scheme. And because compute and storage are separated, each can later be scaled out independently. The separation also fits the model of elastic computing, making on-demand allocation possible.

We can group scheme one and scheme two into the "converged compute and storage" camp. Its most representative member is Hadoop HDFS, the default storage backend of the big data world: highly fault tolerant, easy to scale, and well suited to deployment on cheap hardware. Scheme three stands apart as the "separated compute and storage" camp, whose most representative member is Amazon EMR. EMR pairs the cloud computing power unique to AWS with S3 object storage, making big data analysis simple and cheap.

In private cloud scenarios, we generally use virtualization to build the compute cluster that supports the big data applications above it, while a Ceph object storage service provides the shared storage backend of the data lake. The two are then connected by S3A, which lets Hadoop applications access the Ceph object storage service seamlessly.
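For readers who want to see what that connection looks like in practice, here is a minimal sketch of pointing the stock S3A connector (hadoop-aws) at a Ceph RGW endpoint; the endpoint address, credentials, and bucket name below are placeholders, not values from any real deployment.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class S3ACephExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Standard hadoop-aws (S3A) settings, pointed at Ceph RGW instead of AWS S3.
        conf.set("fs.s3a.endpoint", "http://rgw.example.local:7480"); // placeholder endpoint
        conf.set("fs.s3a.access.key", "CEPH_ACCESS_KEY");             // placeholder credentials
        conf.set("fs.s3a.secret.key", "CEPH_SECRET_KEY");
        conf.setBoolean("fs.s3a.path.style.access", true);            // RGW is usually addressed path-style

        // From here on, the bucket behaves like any other Hadoop filesystem.
        FileSystem fs = FileSystem.get(URI.create("s3a://datalake-bucket/"), conf);
        for (FileStatus st : fs.listStatus(new Path("s3a://datalake-bucket/"))) {
            System.out.println(st.getPath() + "  " + st.getLen() + " bytes");
        }
        fs.close();
    }
}
```

Nothing Ceph-specific is needed on the client beyond the endpoint override, which is exactly why S3A makes the pairing of Hadoop and Ceph feel seamless.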

In summary, under the single banner of the "data lake", two factions have in fact emerged: "converged compute and storage" and "separated compute and storage". Below, let us talk about the strengths and weaknesses of each.

Each has its strengths

In this section we put "converged compute and storage" and "separated compute and storage" on the table and discuss their respective advantages and disadvantages.

Converged compute and storage: HDFS

When an HDFS client writes a file to HDFS, the process breaks down into the following brief steps (a minimal client-side sketch follows the list):

1. The HDFS client sends a file-creation request to the NameNode.

2. The NameNode checks its namespace, verifies that the file does not already exist, and responds to the client that it may start writing.

3. The HDFS client splits the file to be uploaded according to the default block size. For example, with a default block size of 128 MB, a 300 MB file is split into 3 blocks.

4. The client requests to upload a block; the NameNode analyzes the cluster and returns the DataNodes that should receive it. Since the default HDFS redundancy policy is three replicas, 3 DataNode addresses are returned. The client then establishes a pipeline and streams the block data to those DataNodes.

5. Once a block has been written to all 3 DataNodes, the client starts sending the next block, and the process repeats until the file transfer is complete.
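From application code, all of the steps above are hidden behind a single output stream. A minimal client-side sketch, with an illustrative path and file size rather than values from the article's environment:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // fs.defaultFS would normally point at the NameNode, e.g. hdfs://namenode:8020
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/datalake/sample/300mb.bin");  // placeholder path
        // create() performs the NameNode round trip of steps 1-2; block splitting,
        // DataNode selection and the replication pipeline (steps 3-5) all happen
        // inside the returned stream as data is written.
        try (FSDataOutputStream out = fs.create(file, true)) {
            byte[] chunk = new byte[1 << 20];      // 1 MB buffer
            for (int i = 0; i < 300; i++) {        // ~300 MB -> 3 blocks at 128 MB each
                out.write(chunk);
            }
        }
        System.out.println("default block size: " + fs.getDefaultBlockSize(file));
        fs.close();
    }
}
```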

The HDFS read path will not be described here. For the HDFS write path, I think the important points are these:

Creating a file and uploading each block both require a round trip to the NameNode

The NameNode stores the metadata and block information for every file

The HDFS client exchanges the actual data directly with the DataNodes when uploading and reading

As the representative of "converged compute and storage", HDFS's central idea is Data Locality: when running Mapper tasks, Hadoop tries to schedule the computation onto the nodes that hold the corresponding data. This reduces the amount of data moved over the network and yields excellent read performance. Precisely because of Data Locality, blocks have to be large enough (128 MB by default); if they are too small, the benefit of Data Locality drops sharply. (A small sketch of how block locations are exposed to the scheduler appears after the list below.)

But large blocks also bring two drawbacks:

Data balancing across the cluster is poor

While a single block is being uploaded, only the 3 chosen DataNodes are doing work, so a single write stream never comes close to the aggregate bandwidth of the whole cluster
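To make Data Locality concrete, here is the small sketch promised above: it asks the NameNode which hosts hold each block of a file, which is precisely the information a scheduler uses to place Mapper tasks next to their data. The path is a placeholder.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/datalake/sample/300mb.bin"));

        // One BlockLocation per 128 MB block; each lists the DataNodes holding a
        // replica, which is what locality-aware task placement relies on.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation b : blocks) {
            System.out.println("offset " + b.getOffset() + ", length " + b.getLength()
                    + ", hosts " + String.join(",", b.getHosts()));
        }
        fs.close();
    }
}
```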

Separated compute and storage: S3A

As mentioned earlier, in private cloud deployments the separated data lake architecture is generally backed by Ceph object storage. The Ceph object storage service is provided by RGW, and RGW exposes an S3 interface, so big data applications can reach Ceph object storage through S3A. Because compute and storage are separated, a file's block information no longer needs to live on a NameNode; in fact, S3A needs no NameNode at all, so that performance bottleneck simply does not exist.

Ceph's object storage service also brings great convenience to data management. For example, the cloud sync module can synchronize data from Ceph object storage to other public clouds, and lifecycle management (LCM) makes hot/cold data analysis and migration possible. In addition, RGW supports erasure coding for data redundancy, and this is already a mature scheme; HDFS has only recently gained erasure coding support, its maturity still needs to be proven, and most HDFS users rarely use it, preferring multi-replica redundancy instead.
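For completeness, here is roughly what enabling erasure coding on HDFS looks like. The API shown exists only in Hadoop 3.x (the cluster benchmarked later in this article runs 2.7.3), and the directory path is a placeholder, so treat this purely as an illustration of the feature being discussed.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class HdfsErasureCodingExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        if (!(fs instanceof DistributedFileSystem)) {
            throw new IllegalStateException("expects an hdfs:// default filesystem");
        }
        DistributedFileSystem dfs = (DistributedFileSystem) fs;

        Path coldData = new Path("/datalake/cold");   // placeholder directory
        dfs.mkdirs(coldData);
        // RS-6-3-1024k is one of the built-in Reed-Solomon policies:
        // 6 data + 3 parity cells, i.e. 1.5x storage overhead instead of 3x replication.
        dfs.setErasureCodingPolicy(coldData, "RS-6-3-1024k");
        System.out.println("EC policy on " + coldData + ": "
                + dfs.getErasureCodingPolicy(coldData));
        dfs.close();
    }
}
```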

Looking at the figure, we can briefly analyze the steps of an S3A upload: when the HDFS client uploads data, S3A wraps the request into HTTP and sends it to RGW, which translates it into RADOS requests and forwards them to the Ceph cluster, thereby completing the upload.

Since all data must pass through RGW before being handed to the OSDs, RGW easily becomes a performance bottleneck. We can, of course, spread the load by deploying multiple RGW instances, but on the request IO path a request can never go straight from the client to the OSDs; the extra hop through RGW is always there.

In addition, because of the innate characteristics of object storage, List Objects and Rename are relatively expensive and noticeably slower than on HDFS. And the community version of RGW does not support append upload, which some big data scenarios rely on.
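The cost difference is easy to observe from client code, because the same FileSystem call maps to very different work underneath: on HDFS a rename is a NameNode metadata update, while on S3A it degenerates into a copy plus delete for every object under the path. A rough sketch, with placeholder paths and bucket name:

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RenameCostSketch {
    static void timeRename(FileSystem fs, Path src, Path dst) throws Exception {
        long start = System.nanoTime();
        boolean ok = fs.rename(src, dst);
        long ms = (System.nanoTime() - start) / 1_000_000L;
        System.out.println(fs.getScheme() + " rename ok=" + ok + ", took " + ms + " ms");
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Metadata-only operation on HDFS.
        FileSystem hdfs = FileSystem.get(URI.create("hdfs:///"), conf);
        timeRename(hdfs, new Path("/bench/in"), new Path("/bench/out"));

        // Copy-and-delete per object on S3A, so the cost grows with the data volume.
        FileSystem s3a = FileSystem.get(URI.create("s3a://datalake-bucket/"), conf);
        timeRename(s3a,
                new Path("s3a://datalake-bucket/bench/in"),
                new Path("s3a://datalake-bucket/bench/out"));
    }
}
```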

With that, we can summarize the advantages and disadvantages of HDFS and S3A:

HDFS

Advantages:
1. Data Locality makes data reads highly efficient.
2. The client interacts directly with the DataNodes when writing and reading data.

Disadvantages:
1. The NameNode stores massive amounts of file metadata and block information and may become a performance bottleneck.
2. Compute and storage are not separated, so later scaling is inconvenient and inflexible.
3. Because blocks are large, data balancing is poor and write bandwidth is limited.

S3A

Advantages:
1. Compute and storage are separated, which makes later scaling convenient.
2. RGW makes data management more convenient.
3. Mature erasure coding support raises storage utilization.

Disadvantages:
1. Every request must go through RGW before being forwarded to the OSDs.
2. The community version does not support append upload.
3. List Objects and Rename are expensive and relatively slow.

Clearly, S3A removes the constraint of having to scale compute and storage together and has a distinct advantage in storage management, but every request must pass through RGW before being handed to the OSDs; unlike HDFS, the client cannot exchange data directly with the DataNodes. So "converged compute and storage" has its unique merits, and so does "separated compute and storage". Is it possible, then, to combine the advantages of both? In other words, can we keep the excellent characteristics of object storage and yet no longer need RGW to access the Ceph object store?

A light at the end of the tunnel

Before we get to UMStor Hadapter, we need to talk about NFS-Ganesha, a user-space NFS server from which we drew our inspiration. Compared with the kernel's NFSD, NFS-Ganesha offers more flexible memory allocation, better portability, and more convenient access control management.

NFS-Ganesha supports many backend storage systems, including the Ceph object storage service.

The figure above shows NFS-Ganesha being used to export a bucket, bucket1, stored in Ceph object storage; as you can see, NFS-Ganesha relies on librgw to access Ceph object storage. librgw is a library provided by Ceph whose main purpose is to let clients access the Ceph object storage service directly through function calls. librgw translates a client's requests directly into librados requests and then talks to the OSDs over sockets; in other words, we no longer need to send HTTP requests to an RGW instance and have RGW talk to the OSDs on our behalf.

From the figure it is also clear that, compared with "app over RGW", "app over librgw" has one less hop on the IO call chain, so in theory librgw should deliver better read and write performance.

Isn't that exactly what we were looking for? If "converged compute and storage" and "separated compute and storage" are locked in a stalemate, then librgw is the key that opens the lock.

UMStor Hadapter

With librgw at its core, we built a new Hadoop storage plugin: Hadapter. libuds is the core library of Hadapter and wraps librgw. When a Hadoop client issues a request with the uds:// prefix, the Hadoop cluster hands the request to Hadapter, which calls into librgw through libuds; librgw in turn calls the librados library to reach the OSDs, and the request is served.

Hadapter itself is just a JAR package; deployment is as simple as dropping that JAR onto the relevant big data nodes. We have also done some secondary development on librgw, for example adding support for append upload, making up for S3A's shortcomings in that area.
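As a rough illustration of what "just drop in a JAR" means on the Hadoop side: a custom URI scheme is normally wired up through an fs.<scheme>.impl property and then used like any other path. The class name and configuration values below are hypothetical placeholders, not Hadapter's documented configuration.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class UdsSchemeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical wiring: Hadoop resolves the uds:// scheme to whatever
        // FileSystem implementation the Hadapter JAR ships (class name assumed).
        conf.set("fs.uds.impl", "com.umstor.hadapter.UdsFileSystem");

        // From here on, application code is identical to HDFS or S3A usage;
        // Hadapter, libuds and librgw handle the path down to the Ceph OSDs.
        FileSystem fs = FileSystem.get(URI.create("uds://datalake-bucket/"), conf);
        try (FSDataInputStream in = fs.open(new Path("uds://datalake-bucket/logs/part-0000"))) {
            byte[] buf = new byte[4096];
            int n = in.read(buf);
            System.out.println("read " + n + " bytes through Hadapter");
        }
        fs.close();
    }
}
```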

We ran extensive performance comparisons between HDFS, S3A, and Hadapter. Different test sets have their own IO characteristics, but in most tests we arrived at a similar ordering: HDFS > Hadapter > S3A. Here we use a fairly typical MapReduce test, WordCount on a 10 GB dataset, to look at how the three compare.
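Because all three backends hide behind the same FileSystem abstraction, the benchmark driver only needs different input and output URIs to switch between them. A minimal sketch of such a driver, reusing the stock WordCount mapper and reducer from the hadoop-mapreduce-examples JAR (bucket names and paths are placeholders):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.examples.WordCount;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountBench {
    public static void main(String[] args) throws Exception {
        // args[0]/args[1] select the backend purely by URI scheme, e.g.
        //   hdfs:///bench/10g-input      hdfs:///bench/wc-out
        //   s3a://datalake-bucket/10g    s3a://datalake-bucket/wc-out
        //   uds://datalake-bucket/10g    uds://datalake-bucket/wc-out
        Job job = Job.getInstance(new Configuration(), "wordcount-bench");
        job.setJarByClass(WordCountBench.class);
        // Reuse the stock WordCount mapper/reducer shipped with the Hadoop examples.
        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setCombinerClass(WordCount.IntSumReducer.class);
        job.setReducerClass(WordCount.IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Running the same driver three times with hdfs://, s3a://, and uds:// paths is all it takes to produce the comparison below.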

To control the variables, all nodes use the same configuration, and Ceph's redundancy policy matches that of HDFS, using three replicas. Ceph is version 12.2.3, Hadoop is version 2.7.3, and Hadapter is deployed on all compute nodes. Under these conditions, the results we obtained are:

Time cost:
HDFS: 3min 2.410s
S3A: 6min 10.698s
Hadapter: 3min 35.843s

As you can see, HDFS achieved the best result thanks to Data Locality; Hadapter was slower than HDFS, but not by much, trailing by a little over 30 seconds; S3A, on the other hand, was in a different league altogether, ultimately taking about twice as long as HDFS. We said earlier that, in theory, librgw should offer better read and write performance than going through RGW, and this test bears that out.

Customer case

Hadapter welcomed a heavyweight customer last year: a professional video company under a telecom operator. Together we built a storage backend solution combining big data, machine learning, streaming services, and a resource pool, with a cluster on the order of 35 PB.

On that big data platform, Hadapter provides the backend support for HBase, Hive, Spark, Flume, YARN, and other components, and it is currently in service.

Conclusion

Now let us put HDFS, S3A, and Hadapter side by side:

HDFS

Advantages:
1. Data Locality makes data reads highly efficient.
2. The client interacts directly with the DataNodes when writing and reading data.

Disadvantages:
1. The NameNode stores massive amounts of file metadata and block information and may become a performance bottleneck.
2. Compute and storage are not separated, so later scaling is inconvenient and inflexible.
3. Because blocks are large, data balancing is poor and write bandwidth is limited.

S3A

Advantages:
1. Compute and storage are separated, which makes later scaling convenient.
2. RGW makes data management more convenient.
3. Mature erasure coding support raises storage utilization.

Disadvantages:
1. Every request must go through RGW before being forwarded to the OSDs.
2. The community version does not support append upload.
3. List Objects and Rename are expensive and relatively slow.

Hadapter

Advantages:
1. Retains the data management advantages of RGW.
2. Supports append upload.
3. Allows the Hadoop client to communicate directly with the Ceph OSDs, bypassing RGW, and thus achieves better read and write performance.

Disadvantages:
1. List Objects and Rename are still expensive and relatively slow.

Although we listed HDFS's shortcomings above, it has to be acknowledged that HDFS remains the mainstay of the "converged compute and storage" camp; one could even say that, for most big data players, HDFS is the orthodoxy. Still, in Hadapter we also see a new future for "separated compute and storage". The UMStor team is now hard at work on Hadapter 2.0, hoping to bring better compatibility and stronger read and write performance.

This contest may have only just begun.