Optimizing HBase performance under highly concurrent reads and writes in PORA2

PORA, Taobao's offline and real-time analysis system for personalized search, has been upgraded to PORA2. PORA2 is built on iStream, a YARN-based stream-processing framework, and it uses HBase heavily to guarantee real-time processing of data and messages. It is a typical distributed application that reads and writes HBase with high concurrency.

The system ran into fairly serious performance problems when it was first released. Processing could not keep up with the real-time logs, and the whole Hadoop/HBase cluster was under heavy pressure, which affected other applications. Investigation showed that the problems lay mainly in how HBase was being used. Several typical HBase usage problems are summarized below, in the hope that they serve as a reference for similar applications.

HBase’s Periodic Flusher

The system's metrics showed that it was slow mainly when reading and writing HBase, and the HBase logs showed every Region Server flushing and compacting frequently. Analysis showed that the current HBase version has a Periodic Flusher mechanism: if the data in a MemStore has not been flushed for a certain period, HBase triggers a flush automatically. The default interval is 1 hour. We learned that this feature was introduced after HBase 0.94.8. Its purpose is to keep data from sitting unflushed in a MemStore for a long time, where it would be lost if the WAL is not enabled and the Region Server crashes.

Each of our Region Servers hosts nearly 100 regions, so at almost any moment some region was being flushed because its one-hour interval had expired. Most of the flushed files were small, and the large number of flushes in turn triggered compactions. These frequent flushes and compactions slowed down request processing on the Region Server. After we raised this setting to 10 hours, the effect on the HBase Flush Queue Size and FS Pread Latency can be seen in the following figure.

Note: In my opinion, the scenarios in which this new HBase feature helps are limited, and it should not be enabled by default. I recommend disabling it directly through configuration.
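
As a rough illustration of the adjustment described above, the fragment below raises the periodic flush interval to 10 hours in hbase-site.xml. The property name is not given in this article. It is assumed here to be hbase.regionserver.optionalcacheflushinterval, which takes a value in milliseconds, so check it against the hbase-default.xml of your own HBase version before relying on it.

```xml
<!-- hbase-site.xml fragment: a sketch, not the configuration actually used by PORA2 -->
<property>
  <!-- Assumed property name for the periodic MemStore flush interval -->
  <name>hbase.regionserver.optionalcacheflushinterval</name>
  <!-- 10 hours in milliseconds (the 1-hour default corresponds to 3600000) -->
  <value>36000000</value>
</property>
```

Later HBase documentation describes a value of 0 as disabling the periodic flush entirely. Whether that applies to a given 0.94.x release should be verified first.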

Continuous, frequent Scans put heavy pressure on the Region Server

PORA2 uses HQueue, a message queue implemented on top of HBase. Downstream consumers read this queue so that they get the latest messages as soon as they arrive and can process them.

Reading messages is equivalent to running a Scan. In the first version of PORA2 we did not limit how often HQueue was read, so some Workers reading HQueue kept launching new Scans. Even when all the data in HQueue had been read, a Worker immediately created another Scan, which quickly returned no data and was then created yet again. This constant creation of new Scanners put unnecessary pressure on the Region Server.

After discovering this problem, we modified the code so that, once a read finds no more data, the Worker sleeps for a few seconds before scanning again.
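
The idea of sleeping after an empty read can be sketched as follows. This is a minimal illustration and not the PORA2 code: ThrottledQueueReader and its scanOnce supplier are hypothetical names, with scanOnce standing in for "open one Scan on the queue table and return what it finds".

```java
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Supplier;

/** Hypothetical polling loop; not part of HBase or HQueue. */
final class ThrottledQueueReader {
    private final Supplier<List<String>> scanOnce; // stand-in for one Scan over the queue
    private final long idleSleepMillis;            // pause after a Scan that found nothing
    private int idleSleeps = 0;

    ThrottledQueueReader(Supplier<List<String>> scanOnce, long idleSleepMillis) {
        this.scanOnce = scanOnce;
        this.idleSleepMillis = idleSleepMillis;
    }

    /** Runs at most maxScans scans and returns the number of messages handled. */
    int poll(int maxScans, Consumer<String> handler) {
        int handled = 0;
        for (int i = 0; i < maxScans; i++) {
            List<String> batch = scanOnce.get();
            if (batch.isEmpty()) {
                // Nothing new: back off instead of creating another Scanner right away.
                idleSleeps++;
                try {
                    Thread.sleep(idleSleepMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return handled;
                }
                continue;
            }
            for (String message : batch) {
                handler.accept(message);
                handled++;
            }
        }
        return handled;
    }

    /** Number of times the reader backed off after an empty scan. */
    int idleSleeps() {
        return idleSleeps;
    }
}
```

The reader only sleeps after an empty scan, so a backlog is still drained at full speed. The pause applies only when the queue has nothing new.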

After we changed the HBase configuration and added frequency control to the Scans that read HQueue, PORA2's performance improved significantly, but it still could not work through the accumulated backlog fast enough.

Too many concurrent connections to HBase

The metrics showed that the system was still slow on HBase, and the logs of HBase accesses showed various TimeoutExceptions from time to time. Tracing these to the Region Server logs, we found that some Region Servers kept reporting the following exception:

These machines had a very large number of connections but a very low load, and as soon as PORA2 was stopped, the symptoms quickly disappeared.

From these symptoms we concluded that PORA2 was opening too many concurrent connections to HBase, so the Region Server did not have enough handlers. Requests timed out on the client before the server had processed them.

To address this, we greatly reduced the number of processes accessing HBase, which reduced the number of concurrent connections to HBase. To avoid losing processing capacity, we used more processing threads inside each process. Because the threads in a process share its HBase connection, adding threads does not increase the number of HBase connections.
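
The "fewer processes, more threads, one shared connection" pattern can be sketched with the standard library alone. FakeConnection below is a hypothetical stand-in for a thread-safe HBase client connection, not a real HBase class. The only point of the sketch is that the number of connections stays at one however many worker threads are used.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

final class SharedConnectionWorkers {

    /** Hypothetical stand-in for one heavyweight, thread-safe client connection. */
    static final class FakeConnection {
        static final AtomicInteger OPENED = new AtomicInteger(); // connections created so far
        private final AtomicInteger requests = new AtomicInteger();

        FakeConnection() {
            OPENED.incrementAndGet();
        }

        void request(String row) {
            requests.incrementAndGet(); // a real client would send an RPC here
        }

        int requests() {
            return requests.get();
        }
    }

    /** Processes all rows on a fixed pool of threads that share a single connection. */
    static int process(List<String> rows, int threads) {
        FakeConnection shared = new FakeConnection(); // created once per process
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<?>> pending = new ArrayList<>();
            for (String row : rows) {
                pending.add(pool.submit(() -> shared.request(row)));
            }
            for (Future<?> f : pending) {
                f.get(); // wait for every task to finish
            }
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
        return shared.requests();
    }
}
```

With this structure, throughput is tuned through the thread count, while the connection count seen by the Region Servers depends only on the number of processes.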

This adjustment largely relieved the pressure on the HBase Region Servers.

Avoid HBase access hotspots

After these optimizations, a few Workers were still slow. Tracing the logs of the slow Workers showed frequent timeouts when reading HBase. We located the Region Server that was timing out and saw in the HMaster UI that its read and write request counts were clearly several times those of the other servers. We first suspected data skew, with a hotspot falling on this machine. Checking the HBase tables used by PORA2 one by one in the HBase UI, we found a table whose first region received one to two orders of magnitude more requests than its other regions. By design, this table's rowkey carries a hash prefix, so in theory there should be no hotspot. Inspecting the code that generates the rowkey finally revealed the cause: the prefix was computed as key.hashCode() % regionNum, and hashCode() returns a negative number for many keys, so many prefixes were negative and all of those rows fell into the first region.

In HBase, once a region becomes a hotspot, the Region Server hosting it slows down. This in turn slows access to the regions of other tables on the same server, which affects the performance of the whole HBase cluster.
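
The negative-prefix bug is easy to reproduce. In Java, the % operator keeps the sign of the dividend, so a negative hashCode() yields a negative remainder. The sketch below contrasts that with Math.floorMod, which always returns a value in [0, regionNum). The class name, method names, and the two-digit prefix format are illustrative assumptions, not the actual PORA2 rowkey layout.

```java
/** Illustrative rowkey prefix helpers; names and key format are hypothetical. */
final class RowKeyPrefix {

    /** The buggy form: negative whenever hashCode() is negative and not a multiple of regionNum. */
    static int buggyPrefix(String key, int regionNum) {
        return key.hashCode() % regionNum;
    }

    /** A safe form: floorMod maps every int hash into the range [0, regionNum). */
    static int safePrefix(String key, int regionNum) {
        return Math.floorMod(key.hashCode(), regionNum);
    }

    /** Builds a salted rowkey such as "NN_key" from the safe prefix. */
    static String rowKey(String key, int regionNum) {
        return String.format("%02d_%s", safePrefix(key, regionNum), key);
    }
}
```

Math.abs(key.hashCode()) % regionNum is a tempting alternative, but Math.abs(Integer.MIN_VALUE) is still negative, so floorMod is the safer choice.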

Major compaction of Bulk Load data

During the investigation we also found a problem caused by Bulk Load data never going through a major compaction.

One of our tables is bulk loaded into HBase every day, and its data is read-only. Because nothing is written to the table, no compaction is triggered, yet new HFiles arrive every day. After a few days, the number of HFiles in each region had grown large, and this eventually dragged down the table's read performance.

After discovering the problem, we ran a major compaction on this table at the end of each Bulk Load, which solved the problem effectively.

Summary

Applications that read and write HBase with high concurrency must make sure they use HBase sensibly. Improper use can cause performance problems on a single Region Server or even across the whole HBase cluster, and HBase performance problems in turn degrade every application that depends on it. If an application responds at that point by further increasing its concurrent access to HBase, it can fall into a vicious circle in which performance keeps deteriorating.