The first step is to figure out what big data is. It is not simply massive data; it is a data gold mine with the 4V characteristics, and it brings both opportunities and challenges to the enterprise.
The second step, starting from these characteristics, analyzes what capabilities an enterprise big data platform must have to meet the challenges of big data.
The third step, based on the requirements of the big data platform, puts forward an enterprise big data technical solution and explains how that solution solves the problems of big data.
Finally, we look at the existing problems of big data applications and how they may develop in the future.
What is big data?
From the data point of view, big data is not simply "big" or "many". It is defined by the 4 Vs: simply put, large in volume, varied in style, fast in speed and low in value density.
Large volume: a recent research report projects that by 2020, global data is expected to grow 44-fold, reaching 35.2 ZB. When we talk about big data, an enterprise's data volume generally has to reach the PB level before it counts as big data.
Great variety: besides sheer size, big data includes both structured and unstructured data. E-mail, Word documents, pictures, audio and video are types of data that traditional relational databases can no longer handle.
High velocity: this refers to the speed of data generation and collection. With the development of e-commerce, mobile office, wearable devices, the Internet of Things and smart communities, data is now produced at second-level intervals, and enterprises require real-time collection and real-time decision-making.
Low value density: the total value of the data keeps growing, but as data volume grows, the value density of the data drops accordingly; worthless data makes up most of it, so enterprises need to find value in massive business data.
From the developer's point of view, big data differs from earlier database and data warehouse technology: it represents a family of new technologies led by Hadoop and Spark.
The distinguishing characteristics of these technologies are distributed processing and in-memory computing.
Distributed: simply put, distributed processing splits a complex, time-consuming task into many small tasks and processes them in parallel. The tasks here include data collection, data storage and data processing.
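The split-and-merge idea above can be sketched with Python's standard multiprocessing module; the word-count task and the four-way split are illustrative assumptions, not part of any specific big data framework.

```python
# Minimal map-reduce sketch: split a word-count task into chunks,
# process the chunks in parallel, then merge the partial results.
from collections import Counter
from multiprocessing import Pool

def count_words(chunk):
    # "Map" step: each worker counts words in its own slice of the data.
    return Counter(chunk.split())

def parallel_word_count(text, workers=4):
    lines = text.splitlines()
    # Split the big task into one small task per worker.
    chunks = ["\n".join(lines[i::workers]) for i in range(workers)]
    with Pool(workers) as pool:
        partials = pool.map(count_words, chunks)
    # "Reduce" step: merge the partial counts into the final answer.
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total
```

The same split/parallel/merge shape applies whether the small tasks run on local processes, as here, or on the nodes of a cluster.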
In-memory computing: in essence, the CPU reads data directly from memory rather than from disk, and analyzes it there. In-memory computing is well suited to processing massive data and workloads that need real-time results. For example, almost all of an enterprise's financial, marketing and market data from the past ten years can be loaded into memory at once and analyzed on that basis.
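The "load once, analyze repeatedly" idea can be sketched as follows; the record layout (year, department, revenue) and the figures are made up for the illustration.

```python
# In-memory computing in miniature: hold the full record set in RAM once,
# then answer analytical queries from memory without returning to disk.
from collections import defaultdict

# Assumed record layout: (year, department, revenue).
records = [
    (2014, "finance", 120.0),
    (2014, "marketing", 80.0),
    (2015, "finance", 150.0),
    (2015, "marketing", 95.0),
]

def revenue_by_year(rows):
    # Pure in-memory aggregation: every query scans RAM, never the disk.
    totals = defaultdict(float)
    for year, _dept, revenue in rows:
        totals[year] += revenue
    return dict(totals)
```

Real in-memory engines add distribution and fault tolerance on top, but the performance argument is the same: the working set stays in RAM across queries.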
Data mining: the core of big data also includes data mining technology, which is closely related to statistics. It can be roughly divided into four categories: classification, clustering, prediction and association. It uses mathematical methods to extract potential laws or knowledge from large amounts of incomplete, fuzzy data.
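As a concrete taste of one of the four categories, here is a toy k-means clustering routine; it is a standard-library sketch, and the choice of two clusters and twenty iterations is an assumption for the example.

```python
# Toy k-means: group points so that each cluster gathers around a center.
# Clustering is one of the four mining categories named in the text.
import math
import random

def kmeans(points, k=2, iters=20, seed=0):
    random.seed(seed)
    centers = random.sample(points, k)
    for _ in range(iters):
        # Assign each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[idx].append(p)
        # Move each center to the mean of its cluster (keep it if empty).
        centers = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
    return centers, clusters
```

Run on two well-separated groups of points, the routine recovers the groups without being told where they are, which is exactly the "extract potential laws from fuzzy data" claim in miniature.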
Big data platform requirements
The capabilities of a big data platform fall into five areas: data collection, data storage, data computation (or processing), data mining and data presentation.
Data collection: the ability to collect massive and real-time data is needed; this is the first step of data utilization.
Data storage: matching the characteristics of big data, the platform needs large-capacity, highly fault-tolerant and highly efficient storage; this is the foundation of data utilization.
Data computation: strong, cheap and fast data processing and computing power is needed. "Strong" corresponds to the huge volume and variety of big data, "cheap" to its low value density, and "fast" to its velocity; this is the key to the development of big data.
Data mining: being able to analyze and mine data value from every angle and direction, and to use data mining well to turn data into value, is the core of data utilization.
Data presentation: multi-channel, intuitive and rich forms of presentation are the outward face of the data, the highlight of data applications and the window through which users recognize their value.
The above are the problems the big data platform needs to solve and the capabilities it must possess.
Technical solution
Following the data processing flow, the enterprise big data solution divides into a data acquisition layer, data storage layer, data computation layer, data mining layer and data presentation layer, and each layer solves a key problem of big data. (In the accompanying diagram, the yellow parts are traditional data processing technologies.)
Data acquisition layer:
Data acquisition is divided into real-time acquisition and scheduled acquisition. Real-time acquisition uses tools such as Oracle GoldenGate to capture incremental data in real time, ensuring timeliness. Scheduled acquisition uses SAP Data Services together with other tools to extract data on a schedule, mainly for large volumes of non-real-time data. Distributed ETL tools such as Kettle and Sqoop enrich and diversify the extraction services, and Kafka is added to integrate real-time data and handle large real-time volumes.
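The scheduled-acquisition path typically works by watermark: each run pulls only the rows added since the previous run. A minimal sketch using SQLite as a stand-in source follows; the `orders` table, its columns and the id-based watermark are assumptions of the example, not the API of any of the tools named above.

```python
# Incremental extraction by watermark, the pattern behind scheduled ETL:
# each run asks only for rows newer than the id seen last time.
import sqlite3

def extract_increment(conn, last_id):
    """Return rows added since the previous run, plus the new watermark."""
    cur = conn.execute(
        "SELECT id, payload FROM orders WHERE id > ? ORDER BY id",
        (last_id,),
    )
    rows = cur.fetchall()
    # Advance the watermark only if something new arrived.
    new_watermark = rows[-1][0] if rows else last_id
    return rows, new_watermark
```

Persisting the watermark between runs is what makes the job restartable: a failed run simply re-extracts from the old watermark on the next schedule.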
Data storage layer:
On top of the traditional Oracle database, the storage layer adds modules such as a distributed file system, a distributed column store, an in-memory file system, an in-memory database and full-text search. Among them, the distributed file system Ceph stores unstructured data, thanks to its balanced data distribution and high parallelism; the distributed file system HDFS stores other structured data, thanks to its scalability and compatibility; and the column store HBase mainly holds massive data with specific operational and query-service requirements.
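The division of labor described above amounts to a routing decision at write time. The dispatcher below makes that decision concrete; the routing rules (bytes mean unstructured, a `realtime` flag means hot data) are assumptions invented for the sketch, not rules from any of the named systems.

```python
# Illustrative write-time dispatcher for the storage layer: route each
# item to a store based on its shape. Store names mirror the text.
def choose_store(item):
    if isinstance(item, (bytes, bytearray)):
        return "ceph"       # unstructured blobs -> distributed file system
    if isinstance(item, dict) and item.get("realtime"):
        return "memory_db"  # hot, real-time records -> in-memory database
    if isinstance(item, dict):
        return "hbase"      # massive keyed records -> column store
    return "hdfs"           # other structured data -> HDFS
```

The point is not the specific rules but that the platform, not the application, owns the mapping from data characteristics to storage engine.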
Data computing layer:
The computing layer uses standard SQL queries, full-text search, interactive analysis with Spark, real-time stream processing, offline batch processing, GraphX and other technologies to compute over structured data, unstructured data, real-time data and massive data.
Advantages of the Spark in-memory computing engine, the core computing mode:
Lightweight, fast processing.
Simple and easy to use; Spark supports multiple languages.
Support for complex queries.
Real-time stream processing.
Integration with Hadoop and existing Hadoop data.
Compatibility with Hive.
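Spark expresses computations as chains of transformations over distributed collections. The fragment below imitates that shape in plain Python to show the model; it is not the PySpark API, and the three input lines are made up for the illustration.

```python
# The Spark computing model in miniature: a chain of map / reduce steps
# over a collection. Plain Python lists stand in for distributed RDDs.
from functools import reduce

lines = ["big data", "big value", "fast data"]

# flatMap + map: explode lines into (word, 1) pairs.
pairs = [(word, 1) for line in lines for word in line.split()]

# reduceByKey: merge the pairs into per-word totals.
counts = reduce(
    lambda acc, kv: {**acc, kv[0]: acc.get(kv[0], 0) + kv[1]},
    pairs,
    {},
)
```

In actual Spark the same chain runs partitioned across a cluster with the intermediate collections kept in memory, which is where the speed advantages listed above come from.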
Data mining layer: using analysis tools such as Spark MLlib, R and Mahout, a model and algorithm library is built on the model analysis engine. Models are trained against the algorithm library to generate model instances, and real-time and offline decisions are finally made according to those instances.
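The train-then-decide flow can be made concrete with the smallest possible model: a one-variable linear fit. This is a standard-library sketch of the pattern, not the API of MLlib, R or Mahout, and the training data is invented.

```python
# Minimal train-then-decide flow: fit a line ("train the model"), keep the
# coefficients as the "model instance", then use it for scoring decisions.
def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (
        sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        / sum((x - mx) ** 2 for x in xs)
    )
    intercept = my - slope * mx
    return slope, intercept          # the trained "model instance"

def predict(model, x):
    slope, intercept = model
    return slope * x + intercept     # real-time or offline decision step
```

In a real platform the "instance" would be serialized and served; the separation between the training step and the cheap, repeatable decision step is the point of the layering in the text.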
Data presentation layer: provides portals, data charts, e-mail, office-software integration and other presentation channels, supporting large screens, desktop computers, mobile terminals and so on.
Concluding remarks
With the continuous improvement of high-performance computing and mass data storage and management, the problems that technology can solve will eventually cease to be problems. Three links will truly restrict, or become bottlenecks for, the development and application of big data:
First, the legitimacy of data collection and extraction, and the trade-off between protecting data privacy and exploiting it.
When an enterprise or institution extracts private data from the public, users have the right to know, and when that private data is used for business activities, users' consent is required. At present, however, a series of management issues, such as how to protect user privacy, how to formulate business rules, how to punish those who violate user privacy and how to draft legal norms, lag behind the pace of big data development, in China and worldwide. Many big data businesses will therefore linger in a gray area early in their development; once commercial operations take shape and begin to affect large numbers of consumers and companies, the formulation of relevant laws, regulations and market norms will be forced to accelerate. It is foreseeable that although big data technology can in principle be applied almost without limit, restrictions on data collection mean that the data actually available for commercial applications and for serving people will be far less than what big data could theoretically collect and process. Limited data sources will greatly restrict the commercial application of big data.
Second, big data collaboration requires enterprises across the industrial chain to strike a balance between competition and cooperation.
Because of its ecosystem, big data demands more cooperation among enterprises. Without a macro grasp of the whole industrial chain, a single enterprise working only from its own data cannot understand the relationships among the data at each link of the chain, and its judgment of, and influence on, consumers is very limited. In industries with obvious information asymmetry, such as banking and insurance, the demand for data sharing between enterprises is even more urgent. Banks and insurers, for example, usually need an industry-wide database so that members can check individual users' credit records, eliminating the information asymmetry between guarantors and consumers and letting transactions proceed smoothly. In many cases, however, the enterprises that need to share information are simultaneously competitors and partners: before sharing data, each must weigh the pros and cons to avoid giving up its competitive advantage along with its data. Moreover, when many businesses cooperate it is easy to form a sellers' alliance, which harms consumers' interests and the fairness of competition. The most imaginative direction for big data is to integrate data from different industries into a comprehensive, three-dimensional picture and to understand and reshape user needs from a system perspective, but cross-industry sharing must balance the interests of many enterprises. Without a neutral third-party organization to coordinate the participating enterprises and set rules for the data, the use of big data will be restricted. The absence of authoritative, neutral third-party institutions will keep big data from reaching its full potential.
Third, the interpretation and application of big data conclusions.
Big data can reveal possible correlations between variables at the level of data analysis, but how are those correlations reflected in industry practice? How are big data conclusions turned into an executable plan? These questions require practitioners not only to interpret big data but also to understand the relationships among the elements of industry development. This link builds on big data technology, but it also involves management, execution and other factors, and here the human factor becomes the key to success. From the technical side, practitioners must understand big data technology and be able to interpret the conclusions of big data analysis; from the industry side, they must understand the processes of the industry's production links and the possible correlations among its factors, and map the conclusions of big data onto the industry's concrete implementation steps; from the management side, they must work out an executable solution to the problem and ensure that it neither conflicts with existing management processes nor creates new problems while solving old ones. Such a person must not only be proficient in the technology but also be an excellent manager with systems thinking, able to view the relationship between big data and the industry from the perspective of complex systems. The scarcity of such talent will restrict the development of big data.