This article is taken from the book Hadoop Core Technology.
Hadoop is an open-source, efficient foundational platform for cloud computing. It is widely used not only in cloud computing itself but also as the underlying infrastructure for search engine services, and it is increasingly favored for massive data processing, data mining, machine learning, scientific computing, and similar fields. This article surveys the application status of Hadoop at home and abroad.
Application Status of Hadoop in Foreign Countries
1. Yahoo (US provider of Internet information retrieval services)
Yahoo is the biggest supporter of Hadoop. As of 2012, Yahoo had more than 42,000 Hadoop nodes in total, with over 100,000 CPU cores running Hadoop. Its largest single-master cluster has 4,500 nodes (each node a 2×4-core CPU box with 4×1TB disks and 16GB RAM). Total cluster storage exceeds 350PB, more than 10 million jobs are submitted each month, and over 60% of Hadoop jobs are written and submitted with Pig.
Yahoo's Hadoop application mainly includes the following aspects:
Support advertising system
User behavior analysis
Support network search
Anti-spam system
Member anti-abuse
Agile content
Personalized recommendation
At the same time, Yahoo researches and tests Hadoop systems that support very large-scale node clusters.
2. Facebook
Facebook uses Hadoop to store internal logs and multidimensional data, serving as a data source for reporting, analytics, and machine learning. Its Hadoop cluster has more than 1,400 machine nodes with a total of 11,200 CPU cores and over 15PB of raw storage capacity; each commodity node is equipped with an 8-core CPU and 12TB of storage. The main programming interfaces used are the Streaming API and the Java API. Facebook also built Hive, an advanced data warehouse framework on top of Hadoop that has since become a top-level Apache project, and it developed a FUSE implementation over HDFS.
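To make the Java API concrete, here is a minimal, self-contained sketch of a log-aggregation job of the general kind described above. The class names, tab-separated log layout, and paths are hypothetical illustrations, not Facebook's actual code.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical example: count log events by type with the MapReduce Java API.
public class LogEventCount {

    // Mapper: emit (event-type, 1) for each log line; the event type is
    // assumed to be the first tab-separated field.
    public static class EventMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text eventType = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split("\t", 2);
            if (fields.length > 0 && !fields[0].isEmpty()) {
                eventType.set(fields[0]);
                context.write(eventType, ONE);
            }
        }
    }

    // Reducer (also usable as a combiner): sum the counts per event type.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values,
                Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "log event count");
        job.setJarByClass(LogEventCount.class);
        job.setMapperClass(EventMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. /logs/raw
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. /logs/counts
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Such a job is packaged into a jar and submitted with the hadoop jar command; Streaming jobs follow the same map-shuffle-reduce model but replace the Java classes with external executables.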
3. A9.com
A9.com uses Hadoop to build Amazon's product search indices, working mainly with the Streaming API together with C++, Perl, and Python tools, and uses Java and the Streaming API to analyze and process millions of sessions daily. A9.com's indexing service for Amazon runs on a Hadoop cluster of about 100 nodes.
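The Streaming API that A9.com relies on imposes only a simple contract: the mapper and reducer are arbitrary executables that read records from stdin and write tab-separated key/value lines to stdout, which is why C++, Perl, and Python mix freely. Below is a minimal mapper honoring that contract, sketched in Java only to keep this article's examples in one language; the names and field layout are hypothetical.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// Hypothetical Hadoop Streaming mapper: reads raw log lines from stdin and
// emits "sessionId<TAB>1" so a downstream reducer can sum per-session counts.
public class SessionCountMapper {
    public static void main(String[] args) throws IOException {
        BufferedReader in = new BufferedReader(
                new InputStreamReader(System.in, StandardCharsets.UTF_8));
        String line;
        while ((line = in.readLine()) != null) {
            // Assume the first whitespace-separated field is the session id.
            String[] fields = line.split("\\s+");
            if (fields.length > 0 && !fields[0].isEmpty()) {
                System.out.println(fields[0] + "\t1"); // key TAB value
            }
        }
    }
}

The executable is then handed to the streaming jar along the lines of: hadoop jar hadoop-streaming.jar -input /sessions/raw -output /sessions/counts -mapper <mapper command> -reducer <reducer command> (paths illustrative). Any language that can read stdin works identically.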
4. Adobe
Adobe mainly uses Hadoop and HBase to support social services computing and structured data storage and processing, running a Hadoop-HBase production cluster of about 30 nodes. Adobe writes data directly and continuously into HBase, runs MapReduce jobs with HBase as the data source, and then saves the results back to HBase or to external systems. Adobe has been running Hadoop and HBase on production clusters since October 2008.
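The read-process-write loop described above maps directly onto HBase's stock MapReduce helpers. The following is a hedged sketch, not Adobe's code, of a job that scans one HBase table and writes aggregated results back to another; the table names, column family, and rollup logic are hypothetical, and the calls follow the modern HBase client API rather than the versions in use in 2008.

import java.io.IOException;
import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.Job;

// Hypothetical example: scan an "events" table and write per-prefix row
// counts back into an "event_rollup" table.
public class HBaseRollup {

    // Mapper: fed one HBase row at a time; emit (4-byte rowkey prefix, 1).
    static class PrefixMapper
            extends TableMapper<ImmutableBytesWritable, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);

        @Override
        protected void map(ImmutableBytesWritable row, Result value,
                Context context) throws IOException, InterruptedException {
            byte[] key = row.copyBytes();
            byte[] prefix = Arrays.copyOfRange(key, 0, Math.min(4, key.length));
            context.write(new ImmutableBytesWritable(prefix), ONE);
        }
    }

    // Reducer: sum the counts and write a Put straight back to HBase.
    static class RollupReducer extends TableReducer<ImmutableBytesWritable,
            LongWritable, ImmutableBytesWritable> {
        @Override
        protected void reduce(ImmutableBytesWritable key,
                Iterable<LongWritable> values, Context context)
                throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable v : values) {
                sum += v.get();
            }
            Put put = new Put(key.get());
            put.addColumn(Bytes.toBytes("stats"), Bytes.toBytes("count"),
                    Bytes.toBytes(sum));
            context.write(key, put);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "hbase rollup");
        job.setJarByClass(HBaseRollup.class);

        Scan scan = new Scan();
        scan.setCaching(500);        // fetch rows in bigger batches for MR
        scan.setCacheBlocks(false);  // full scans should not churn the cache

        TableMapReduceUtil.initTableMapperJob("events", scan,
                PrefixMapper.class, ImmutableBytesWritable.class,
                LongWritable.class, job);
        TableMapReduceUtil.initTableReducerJob("event_rollup",
                RollupReducer.class, job);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}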
5. CbIR
Since April 2008, the Japanese company CbIR (Content-based Image Retrieval) has used Hadoop on Amazon EC2 to build the image-processing environment for its image product recommendation system. It uses the Hadoop environment to generate the source database that its Web applications access quickly, and uses Hadoop to analyze the similarity of user behavior.
6. Datagraph
Datagraph mainly uses Hadoop to batch-process large RDF datasets, in particular to index RDF data. Datagraph also uses Hadoop to run long offline SPARQL queries for customers. It uses Amazon S3 and Cassandra to store the input and output files of RDF data, and has developed RDFgrid, a Ruby framework for processing RDF data with MapReduce.
Datagraph mainly uses Ruby, RDF.rb, and its own RDFgrid framework to process RDF data, working chiefly through the Hadoop Streaming interface.
7. eBay
eBay runs a single cluster of more than 532 nodes with 8-core CPUs per node and over 5.3PB of storage. It makes heavy use of the MapReduce Java interface, Pig, and Hive to process large-scale data, and also uses HBase for search optimization and research.
8. IBM
IBM's Blue Cloud also uses Hadoop to build its cloud infrastructure. The technologies behind Blue Cloud include Linux operating system images virtualized with Xen and PowerVM, together with Hadoop parallel workload scheduling; IBM has also released its own Hadoop distribution and big data solutions.
9. Last.fm
Last.fm mainly uses Hadoop for chart calculation, royalty reporting, log analysis, A/B testing, dataset merging, and so on. Hadoop is also used for large-scale audio feature analysis of more than a million tracks.
The cluster has more than 100 machines; nodes are configured with dual quad-core Xeon L5520 @ 2.27GHz or L5630 @ 2.13GHz CPUs, 24GB of memory, and 8TB (4×2TB) of storage.
10. LinkedIn
LinkedIn has Hadoop clusters with various hardware configurations. The main cluster configurations are as follows:
An 800-node cluster based on Westmere HP SL170x machines, with 2×4 cores, 24GB memory, and 6×2TB SATA storage.
A 1,900-node cluster based on Westmere SuperMicro HX8DTT machines, with 2×6 cores, 24GB memory, and 6×2TB SATA storage.
A 1,400-node cluster based on Sandy Bridge SuperMicro machines, with 2×6 cores, 32GB memory, and 6×2TB SATA storage.
The software used is as follows:
The operating system is RHEL 6.3.
The JDK is Sun JDK 1.6.0_32.
Apache Hadoop 0.20.2 with patches and Apache Hadoop 1.0.4 with patches.
Azkaban and Azkaban 2 are used for job scheduling.
Hive, Avro, Kafka, and other tools.
11. MobileAnalytic.TV
MobileAnalytic.TV mainly uses Hadoop in the field of parallelized algorithms; the MapReduce algorithms involved cover the following areas.
Information retrieval and analysis.
Machine-generated content: documents, text, audio, and video.
Natural language processing.
The project portfolio includes:
Mobile social network.
Web crawler.
Text-to-speech conversion.
Automatic generation of audio and video.
12. Openstat
Openstat mainly uses Hadoop for customizable web log analysis and report generation. Its production environment has a cluster of more than 50 nodes (dual quad-core Xeon processors, 16GB RAM, 4-6 hard disks), plus two smaller clusters for personalized analysis, handling about 5 million events per day and about $1.5 billion of transaction data per month. The cluster generates roughly 25GB of reports every day.
The main technologies used are CDH, Cascading, and Janino.
13. Quantcast
Quantcast has 3,000 CPU cores and 3,500TB of storage, and processes more than 1PB of data every day. It uses the Hadoop scheduler with a completely customized data path and sorter, and has made notable contributions to the KFS file system.
14. Rapleaf
Rapleaf runs a cluster of more than 80 nodes (each node with 2 dual-core CPUs, 2TB×8 storage, and 16GB RAM). It mainly uses Hadoop and Hive to process personal data on the Web, and has introduced Cascading to simplify the data flow between processing stages.
15. WorldLingo
The hardware comprises more than 44 servers (each with 2 dual-core CPUs, 2TB of storage, and 8GB of memory), and each server runs Xen. One virtual machine instance runs Hadoop/HBase and another runs a Web or application server, giving 88 usable virtual machines in all. WorldLingo runs two independent Hadoop/HBase clusters of 22 nodes each. Hadoop is mainly used to run HBase and MapReduce jobs that scan the HBase tables to perform specific tasks. HBase serves as a scalable, fast storage back end for millions of documents: about 12 million documents are stored at present, with a near-term goal of 450 million.
16. The TerrierTeam at the University of Glasgow
An experimental cluster of more than 30 nodes (each node with a quad-core Xeon at 2.4GHz, 4GB of memory, and 1TB of storage). Hadoop is used to support information retrieval research and experimentation, particularly for TREC and the Terrier IR platform. Terrier's open-source distribution includes large-scale distributed indexing based on Hadoop MapReduce.
17. Holland Computing Center, University of Nebraska
The center runs a medium-sized Hadoop cluster (1.6PB of total storage) to store and serve physics data in support of computation for the Compact Muon Solenoid (CMS) experiment. This demands a file system that can download data at several Gbps and process it at even higher speeds.
18. Visible Measures
Visible Measures uses Hadoop as a component of a scalable data pipeline that ultimately powers products such as VisibleSuite. It uses Hadoop to aggregate, store, and analyze data streams related to the viewing behavior of online video audiences. The current grid comprises more than 128 CPU cores and more than 100TB of storage, with substantial expansion planned.
Application Status of Hadoop in China
In China, Hadoop is used mainly by Internet companies. The following introduces the principal companies that use Hadoop at scale or conduct research on it.
1. Baidu
Baidu began paying attention to Hadoop in 2006, when it started investigating and using it. By 2012 it operated nearly ten clusters in total, with a single cluster exceeding 2,800 machine nodes and tens of thousands of Hadoop machines overall. Total storage capacity exceeds 100PB, of which more than 74PB is in use; thousands of jobs are submitted daily, with daily input data exceeding 7,500TB.
Baidu's Hadoop clusters provide unified computing and storage services for the company's data teams, web search team, community products team, advertising team, and LBS group. The main applications include:
Data mining and analysis.
Log analysis platform.
Data warehouse system.
Recommendation engine system.
User behavior analysis system.
At the same time, Baidu has developed its own Hadoop-based log analysis platform, data warehouse system, and unified C++ programming interface, and has deeply reworked Hadoop to create HCE (Hadoop C++ Extension).
2. Alibaba
As of 2012, Alibaba's Hadoop cluster had about 3,200 servers, roughly 30,000 physical CPU cores, 100TB of total memory, and over 60PB of total storage. It ran more than 150,000 jobs per day, including over 6,000 daily Hive queries, scanning about 7.5PB of data and roughly 400 million files per day; storage utilization was about 80%, and average CPU utilization was 65%, peaking at 80%. Alibaba's Hadoop cluster has 150 user groups and 4,500 cluster users, providing basic computing and storage services for Taobao, Tmall, Etao, Juhuasuan, CBU, and Alipay. Its main applications include:
Data platform system.
Search support.
Advertising system.
Data cube.
Quantum statistics.
Taodata.
Recommendation engine system.
Search leaderboards.
To facilitate development, Alibaba has also built the WebIDE integrated development environment; related systems in use include Hive, Pig, Mahout, HBase, and so on.
3. Tencent
Tencent is also one of the earliest Internet companies in China to use Hadoop. By the end of 2012, Tencent had more than 5,000 Hadoop cluster machines, with its largest single cluster at about 2,000 nodes. Tencent built its own data warehouse system, TDW, on Hadoop and Hive, and developed its own TDW-IDE basic development environment. Tencent's Hadoop provides basic cloud computing and cloud storage services for Tencent's product lines, supporting the following products:
Tencent social advertising platform.
SOSO.
Paipai.com.
Tencent Weibo.
Tencent Compass.
QQ member.
Tencent Games.
QQ space.
Pengyou.com.
Tencent open platform.
Tenpay.
Mobile QQ.
QQ music.
4. Qihoo 360
Qihoo 360 mainly uses Hadoop and HBase as the underlying web page storage architecture for its search engine so.com. The stored web pages reach hundreds of billions of records, with data volumes at the PB level. By the end of 2012, its HBase cluster exceeded 300 nodes and the number of regions exceeded 100,000. The platform versions used are as follows.
HBase version: facebook-0.89-fb.
HDFS version: facebook-hadoop-0.20.
Qihoo 360's work on Hadoop and HBase mainly focuses on optimizing and reducing the start/stop time of the HBase cluster and the recovery time after a RegionServer exits abnormally.
5. Huawei
Huawei is also one of the major contributors to Hadoop, ranking ahead of Google and Cisco. Huawei has done in-depth work on Hadoop's HA schemes and on HBase, and has brought its own Hadoop-based big data solutions to the industry.
6. China Mobile
China Mobile officially launched Big Cloud (BigCloud) 1.0 in May 2010, with cluster nodes reaching 1,024. China Mobile's Big Cloud implements distributed computing based on Hadoop MapReduce and distributed storage on HDFS, and on that basis has developed the Hadoop-based data warehouse system HugeTable, the parallel data mining tool set BC-PDM, the parallel data extraction and transformation tool BC-ETL, and the object storage system BC-oNest, and has open-sourced its own Hadoop version, BC-Hadoop.
China Mobile mainly applies Hadoop in the telecommunications field, and the planned application fields include:
Centralized KPI computation.
ETL/DM subsystem.
Settlement system.
Signaling system.
Cloud computing resource pool system.
Internet of things application system.
IDC services, etc.
7. Pangu Search
Pangu Search (since merged with Jike Search to form China Search) mainly uses Hadoop clusters as the infrastructure supporting its search engine. By early 2013, the cluster comprised more than 380 machines in total with 3.66PB of total storage, mainly supporting the following applications.
Web page storage.
Web page analysis.
Index.
PageRank calculation.
Log statistical analysis.
Recommendation engine, etc.
8. Jike Search (People Search)
Jike Search (since merged with Pangu Search into China Search) also uses Hadoop as the support system for its search engine. As of 2013, its Hadoop cluster totaled more than 500 nodes, each configured with dual 6-core CPUs, 48GB of memory, and 11×2TB storage, for a total cluster capacity of over 1,300TB at a utilization rate of 78%.
Jike stores the search engine's web pages in sstable format, with the sstable files kept directly on HDFS, and mainly uses the Hadoop Pipes programming interface for subsequent processing; the Streaming interface is also used to process data. The main applications include the following (a short HDFS client sketch follows this list).
Web page storage.
Web page analysis.
Index.
Recommendation engine.
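Storing sstable files directly on HDFS, as described above, needs nothing more exotic than the standard HDFS client API. Below is a minimal sketch of writing a file into HDFS and reading it back; the path and payload are hypothetical, and real sstable files would of course be produced by the indexing pipeline rather than by hand.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical example: put a file directly onto HDFS and read it back,
// the same way generated sstable files would be stored for later jobs.
public class HdfsFilePutGet {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS from core-site.xml on the classpath.
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/search/sstable/part-00000"); // hypothetical path

            // Write: create (and overwrite) the file on HDFS.
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.write("rowkey\tserialized-page\n"
                        .getBytes(StandardCharsets.UTF_8));
            }

            // Read it back; Pipes/Streaming jobs would instead consume these
            // files through their configured input format.
            try (FSDataInputStream in = fs.open(file);
                 BufferedReader reader = new BufferedReader(
                         new InputStreamReader(in, StandardCharsets.UTF_8))) {
                System.out.println(reader.readLine());
            }
        }
    }
}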