Learning Image Service Architecture: zimg

zimg is an open-source image processing server designed and developed by a Chinese developer. It sets out to solve the following three problems in image services:

Heavy traffic: For small and medium-sized websites, traffic is a cost problem. Compared with text, images raise traffic by an order of magnitude, and every byte costs real money. Every Internet application that serves images should therefore plan carefully to reduce traffic and save on expenses.

High concurrency: High-concurrency problems almost never appear while the user base is small. But once the number of users climbs, or a hot event occurs (for example, the website publishes a picture of explosive news), a flood of browse requests pours in within a short period of time. If the architecture is not well designed and there is no contingency plan, the result is likely to be long waits, more page refreshes, and in turn even more requests. In short, the performance of the image service has to be good enough.

Mass storage: An article introducing Facebook's image storage mentions that Facebook users had uploaded 1.5 billion pictures with a total capacity exceeding 1.5 PB, an order of magnitude that is very hard to bear. Although it is difficult to build an application on Facebook's scale, a good expansion plan is still needed from the perspective of architecture design. A suitable, easy-to-operate mass image storage solution should be designed in advance to meet growing business needs.

These three problems constrain one another. For example, reducing traffic requires a lot of computation, which lengthens request processing time and lowers the processing capacity per unit of time. Likewise, storing more images inevitably consumes resources on lookups, which also reduces processing capacity. So although an image service looks simple, it is no small matter. The following sections cover architecture design, code logic, and performance testing.

General idea

To get the best performance when displaying images, the first step is to separate the image service from the rest of the business. Using a separate domain name and building a dedicated image server has many benefits, such as:

CDN offloading. If you pay attention, you will notice that popular websites serve their images from dedicated domain names, such as ww1.sinaimg.cn, or fmn.xnpic.com for Renren. A dedicated domain name can be handled separately at the CDN resolution level, which yields a very noticeable optimization effect.

Browser concurrent connection limit. In general, a browser establishes many connections when loading an HTML page and downloads resources in parallel. Browsers differ in how many concurrent connections they allow to the same host, for example 10 for IE8 and 30 for Firefox. If the image server is independent, image downloads do not use up the connection slots of the main site, which speeds up page loading to a certain extent.

Browser cache. Browsers now have a cache, but most of them do not cache requests that carry a cookie. As a result, a large number of image requests miss the cache and the images have to be downloaded again. An image server on an independent domain name can greatly alleviate this issue.

Once the image server is independent, there are two options. The mainstream scheme uses nginx at the front end, PHP or a self-developed module in the middle, and physical storage at the back end. The more unusual approach, taken by Facebook, handles requests and storage in a single system called Haystack. Its advantage is that Haystack only processes image-related requests and strips away the complicated functions of an ordinary HTTP server, which makes it more lightweight and reduces the difficulty of deployment, operation, and maintenance.

zimg takes an approach similar to Facebook's: it is self-contained, handles most things itself, and introduces third-party modules only when necessary.

Architecture design

For the best possible performance, zimg is developed entirely in C and is divided into three layers: the front-end HTTP processing layer, the intermediate image processing layer, and the back-end storage layer. The picture below shows the zimg architecture design.

The HTTP processing layer uses the libevent-based libevhtp library to handle basic HTTP requests.

The image processing layer uses the ImageMagick library.

The storage layer uses a memcached cache plus direct reads and writes to the hard disk; TFS and similar systems may be introduced later.

To avoid the performance bottleneck of a database, zimg does not introduce a structured database; image lookup is resolved by hashing. In fact, designing an image server is a balancing act between I/O and CPU computation. The best strategy is of course to keep splitting: deploy the CPU-sensitive HTTP and image processing layers on machines with strong computing power, the memory-sensitive cache layer on machines with more memory, and the I/O-sensitive physical storage layer on machines equipped with SSDs. Not everyone can afford such a luxurious configuration, though. zimg strikes a compromise between cost and business needs, and currently only needs to be deployed on a single server. Because server hardware differs and I/O and CPU speeds vary widely, it is hard to fix a single rule. zimg's idea is to minimize I/O and shift the pressure onto the CPU. In practice this idea has proved basically correct, and the effect is more obvious on machines with poor hard disk performance. Even when SSDs become widespread, CPU computing power will improve accordingly, so overall the zimg scheme will not become too unbalanced.

Code level

Although zimg is not split into separate binaries (for the reason already mentioned: at this stage it targets small and medium-sized services deployed on a single machine), the code itself is layered.

main.c is the entry point of the program. Its main job is to process the startup parameters, some of which are as follows:

-p [port] listening port number, default 4869

-t [thread_num] number of threads, default 4; adjust it to the number of CPU cores on the specific server

-k [max_keepalive_num] maximum number of keep-alive connections, default 1, which disables long connections; 0 enables them

-l enable logging; this causes a significant performance loss, so decide for yourself whether to turn it on

-M [memcached_ip] enable the cache and set the IP to connect to

-m [memcached_port] enable the cache and set the port to connect to

-b [backlog_num] maximum number of connections per thread, default 1024; adjust at your discretion
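As a rough illustration of how these options and their defaults fit together, here is a minimal Python sketch that mirrors the list above. zimg itself parses its options in C; the function name, the dest names, and the example values (port 8080, 127.0.0.1, 11211) are assumptions made only for this sketch.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical Python mirror of the startup options listed above."""
    parser = argparse.ArgumentParser(prog="zimg-sketch")
    parser.add_argument("-p", dest="port", type=int, default=4869,
                        help="listening port number")
    parser.add_argument("-t", dest="thread_num", type=int, default=4,
                        help="number of threads")
    parser.add_argument("-k", dest="max_keepalive_num", type=int, default=1,
                        help="maximum number of keep-alive connections")
    parser.add_argument("-l", dest="log", action="store_true",
                        help="enable logging")
    parser.add_argument("-M", dest="memcached_ip", default=None,
                        help="cache IP to connect to")
    parser.add_argument("-m", dest="memcached_port", type=int, default=None,
                        help="cache port to connect to")
    parser.add_argument("-b", dest="backlog_num", type=int, default=1024,
                        help="maximum number of connections per thread")
    return parser


# Example values are illustrative only.
args = build_parser().parse_args(["-p", "8080", "-M", "127.0.0.1", "-m", "11211"])
print(args.port, args.thread_num, args.memcached_ip, args.memcached_port)
```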

zhttpd.c handles HTTP requests and is divided into two parts, GET and POST. A GET request locates the image from the request URL parameters, hands it to the image processing layer, and finally returns the result to the user. A POST request receives an upload and stores the image at the computed path. To realize zimg's overall design vision, zhttpd takes on a large share of the work, and a few key points are worth describing here:

The unique key of an image in zimg is the MD5 of the image. This hides the storage path and relieves storage pressure on both the front end (the part in front of zimg, which may be your application server) and zimg itself. It is also the key to avoiding a structured storage component, so all GET requests are built by appending parameters to the MD5.

Suppose your site needs to show an image whose size is 1000 * 1000, but the place where you want to show it is only 300 * 300. What do you do? Usually CSS is used to control the displayed size, but that wastes a lot of traffic. For this reason zimg provides an image resizing function: all you need to do is append w=300&h=300 (width and height) to the image URL.
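The request scheme described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names are hypothetical, the base URL simply reuses the default port 4869, and the width and height are assumed to be passed as the query parameters w and h.

```python
import hashlib
from urllib.parse import urlencode


def image_key(data: bytes) -> str:
    """The key of an image is the MD5 of its bytes (32 hex digits)."""
    return hashlib.md5(data).hexdigest()


def image_url(base: str, md5: str, width=None, height=None) -> str:
    """Build a GET URL from the MD5 key plus optional size parameters.

    The parameter names w and h are an assumption for this sketch.
    """
    params = {}
    if width is not None:
        params["w"] = width
    if height is not None:
        params["h"] = height
    query = urlencode(params)
    return f"{base}/{md5}" + (f"?{query}" if query else "")


key = image_key(b"hello")
print(image_url("http://127.0.0.1:4869", key, 300, 300))
```

Because the key is derived from the content alone, the caller never needs to know where the file actually lives on disk.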

In the image upload part, if the image server uses nginx at the front end and implements the upload function in PHP, very little code needs to be written, but performance is very poor. PHP first receives the request from nginx, separates the binary file according to the HTTP protocol (RFC 1867), and stores it in a temporary directory. Only after our PHP code picks up the file through $_FILES["upfile"]["tmp_name"] is the MD5 computed and the file stored in the target directory. This process contains one redundant file write. The ideal approach is to take the binary data of the HTTP request (preferably while it is still in memory), compute the MD5 directly, and store it. So I read the PHP source code and implemented the POST file parsing myself, so that the HTTP layer connects directly to the storage layer, which improves upload performance. Beyond POST requests, many places in the zimg code reflect the principles of "reduce disk I/O, read and write in memory where possible" and "avoid memory copies", which together bring excellent performance.
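The idea of hashing the upload while it is still in memory and writing it to disk only once can be sketched as follows. This is a simplified illustration, not zimg's C implementation: it assumes the image bytes have already been extracted from the multipart body, uses a flat directory layout, and the file name "original" is hypothetical.

```python
import hashlib
import os
import tempfile


def store_upload(body: bytes, root: str) -> str:
    """Hash the in-memory image bytes and write them to disk exactly once.

    There is no temporary file: the MD5 is computed from memory and the
    data goes straight to its final location.
    """
    md5 = hashlib.md5(body).hexdigest()
    folder = os.path.join(root, md5)
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, "original")  # hypothetical file name
    if not os.path.exists(path):  # identical content is stored only once
        with open(path, "wb") as f:
            f.write(body)
    return md5


with tempfile.TemporaryDirectory() as root:
    key = store_upload(b"fake image bytes", root)
    print(key, os.listdir(os.path.join(root, key)))
```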

zimg.c is the part that calls ImageMagick to process images. At this stage zimg targets single-machine image servers with storage at the TB level, so the storage path uses two levels of subdirectories. On Linux it is best not to have more than about 2,000 subdirectories in one directory, and the MD5 value itself is a 32-digit hexadecimal number, so zimg takes a neat approach and hashes on the first six digits of the MD5. Digits 1-3 are read as a hexadecimal number and divided by 4, so the result falls exactly within 1024, and this number is used as the first-level subdirectory. Digits 4-6 are treated the same way to produce the second-level subdirectory. Inside the second-level subdirectory is a folder named after the MD5, and each MD5 folder stores the original image together with its other converted versions. Assuming that an image occupies 200 KB on average, the total capacity supported by one zimg server can be calculated as: 1024 * 1024 * 1024 * 200 KB = 200 TB
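The path calculation and the capacity estimate above can be checked with a short Python sketch. The function names and the /data root are hypothetical, and the sketch assumes that each group of three hexadecimal digits is divided by 4 with integer division.

```python
def level_index(hex3: str) -> int:
    """Map three hex digits (0x000..0xfff) into the range 0..1023."""
    return int(hex3, 16) // 4


def storage_dir(root: str, md5: str) -> str:
    """Two-level subdirectory path followed by a folder named after the MD5."""
    lvl1 = level_index(md5[0:3])  # digits 1-3
    lvl2 = level_index(md5[3:6])  # digits 4-6
    return f"{root}/{lvl1}/{lvl2}/{md5}"


# 0x5d4 = 1492 -> 373, 0x140 = 320 -> 80
print(storage_dir("/data", "5d41402abc4b2a76b9719d911017c592"))

# Capacity estimate from the text: 1024 * 1024 * 1024 images of 200 KB each.
capacity_kb = 1024 * 1024 * 1024 * 200
capacity_tb = capacity_kb // (1024 ** 3)  # KB -> TB
print(capacity_tb)
```

Dividing a three-digit hexadecimal value (at most 4095) by 4 keeps every level under the 1024-entry bound, comfortably below the roughly 2,000 subdirectories mentioned above.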

Besides path planning, another major zimg feature is image compression. From the user's perspective, the image that zimg returns only needs to look almost the same as the original, and if the original is really required, it can still be obtained by leaving all the parameters empty. On that basis, zimg.c compresses all converted images. After compression the difference is almost invisible to the naked eye, but the size is reduced by 67.05%. The specific processing steps are:

Resize the image with the Lanczos filter (LanczosFilter);

Compress at a compression quality of 75%;

Remove the EXIF information of the image;

Convert to JPEG format.

After this processing, traffic is reduced and the design goal is met.

zcache.c is the part that introduces the memcached cache, which is important, especially as the number of images grows. Caching is a very important feature in zimg, and almost every lookup in zimg.c first checks whether a cached copy exists. For example, suppose I want image A (A stands for an MD5) resized to 100 * 100 and converted to grayscale. The process first checks whether the cache entry A&w=100&h=100&g=1 exists. If it does not, it looks for the corresponding file (the file name for this request is A/100*100pg). If that file does not exist, it looks for a cached color image at this resolution. If that is missing too, it looks for the color image file (the file name is A/100*100p). If that is still missing, it queries the cache for the original image. Only if the original is not cached either does it open the original file, resize it, convert it to grayscale, return the result to the user, and store it in the cache.

It can be seen that a cache hit at any link in the process above reduces the amount of I/O or image processing accordingly. Given the huge gap between memory and hard disk read/write speeds, this design matters a great deal for hot images.
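The lookup order described above can be sketched in Python with plain dictionaries standing in for memcached and the hard disk. This is an illustration only: the function names, the "original" file name, the string-tagging stand-ins for real image processing, and the choice to write the final result back to both cache and disk are assumptions, not zimg's actual code.

```python
def resize(img: bytes, w: int, h: int) -> bytes:
    """Stand-in for real resizing: just tags the data."""
    return img + f"|{w}x{h}".encode()


def to_gray(img: bytes) -> bytes:
    """Stand-in for real grayscale conversion."""
    return img + b"|gray"


def find_color(md5, w, h, cache, disk):
    """Color version: cache, then file, then derive it from the original."""
    color_key = f"{md5}&w={w}&h={h}"
    color_file = f"{md5}/{w}*{h}p"
    if color_key in cache:
        return cache[color_key], "color cache"
    if color_file in disk:
        return disk[color_file], "color file"
    if md5 in cache:
        original, source = cache[md5], "original cache"
    else:
        original, source = disk[f"{md5}/original"], "original file"
    return resize(original, w, h), source


def fetch_gray(md5, w, h, cache, disk):
    """Return (image, where_it_was_found) for a resized grayscale request."""
    gray_key = f"{md5}&w={w}&h={h}&g=1"
    gray_file = f"{md5}/{w}*{h}pg"
    if gray_key in cache:
        return cache[gray_key], "gray cache"
    if gray_file in disk:
        img, source = disk[gray_file], "gray file"
    else:
        color, source = find_color(md5, w, h, cache, disk)
        img = to_gray(color)
        disk[gray_file] = img  # assumed write-back of the new version
    cache[gray_key] = img      # later requests hit the cache first
    return img, source


cache, disk = {}, {"A/original": b"raw"}
print(fetch_gray("A", 100, 100, cache, disk)[1])
print(fetch_gray("A", 100, 100, cache, disk)[1])
```

The first request falls all the way through to the original file, while the repeated request is answered from the cache without touching the disk or doing any image processing.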

In addition to the core code above, there is some supporting code, such as the logging part, the MD5 computation part, the util part, and so on.

Reference link:

http://zimg.buaa.us/arch_design.html

http://www.laronce.com/2009/09/26/1103.html