Predicting the future trend of big data development from three directions
Technological progress means the world generates data continuously, every day. Since the concept of big data was first proposed, the technology has gradually grown into an industry and continues to attract attention. So where is the big data industry heading? Three directions point to the future development of big data technology:
(1) Social networks and Internet of Things technology expand the channels for data collection
Through industry informatization, fields such as healthcare, transportation, and finance have accumulated large volumes of internal data, forming the "stock" of big data resources. Meanwhile, the growth of the mobile Internet and the Internet of Things has greatly enriched the collection channels: data from external social networks, wearable devices, the Internet of Vehicles, the Internet of Things, and government public information platforms will become the main body of incremental big data resources. The deep penetration of the mobile Internet has already provided abundant data sources for big data applications.
In addition, the rapidly developing Internet of Things will become an increasingly important provider of big data resources. Compared with existing Internet data, which is messy and has a low value density, data collected in a targeted manner through terminals such as wearables and connected vehicles is far more valuable. For example, after several years of development, smart wearable devices such as bracelets, wristbands, and watches are maturing, and smart keychains, bicycles, chopsticks, and other devices keep appearing; Intel, Google, and Facebook abroad, and Baidu, JD.com, and Xiaomi at home, all have plans in this area.
Internal enterprise data is still the main source of big data, but demand for external data is growing fast. Currently, 32% of enterprises obtain data through external purchases, while only 18% use government open data. How to build up big data resources, improve data quality, and promote cross-domain integration and circulation is one of the key issues for the further development of big data applications.
Generally speaking, every industry is committed to actively expanding emerging channels for data collection and developing incremental resources while making good use of the resources it already has. Social media, the Internet of Things, and similar sources have greatly enriched the potential channels for data collection; in principle, acquiring data will only get easier.
(2) Distributed storage and computing technology solidifies the technical foundation of big data processing
Big data storage and computing technology is the foundation of the entire big data system.
In terms of storage, the Google File System (GFS), described by Google in the early 2000s, and the subsequent Hadoop Distributed File System (HDFS) laid the foundation for big data storage technology.
Compared with traditional systems, GFS/HDFS co-locate computation with storage nodes, avoiding the I/O throughput bottlenecks that data-intensive computing easily runs into. At the same time, these file systems adopt a distributed architecture and can support highly concurrent access.
In terms of computing, the MapReduce distributed parallel computing technology disclosed by Google in 2004 is representative of the new generation of distributed computing. A MapReduce system is built from inexpensive commodity servers, and its total processing capacity can be expanded linearly (scaled out) by adding server nodes, which gives it huge advantages in cost and scalability.
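To make the programming model concrete, here is a minimal single-process sketch of the MapReduce idea applied to word counting. It is an illustration only, not Google's implementation: the map/reduce functions and the in-memory shuffle stand in for what a real cluster framework distributes across many nodes.

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input document."""
    return [(word, 1) for word in document.split()]

def reduce_phase(word, counts):
    """Reduce: sum all partial counts for one word."""
    return word, sum(counts)

def run_job(documents):
    # Shuffle: group intermediate pairs by key. A real framework does this
    # across server nodes; here it is a simple in-memory dictionary.
    grouped = defaultdict(list)
    for key, value in chain.from_iterable(map_phase(d) for d in documents):
        grouped[key].append(value)
    return dict(reduce_phase(k, v) for k, v in grouped.items())

print(run_job(["big data needs big storage", "data beats opinions"]))
```

Because the map and reduce steps are independent per document and per key, adding more servers lets the same job scale out with the data volume.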
(3) Emerging technologies such as deep neural networks open up a new era of big data analysis technology
Big data analysis technology is generally divided into two major categories: online analytical processing (OLAP) and data mining.
OLAP technology generally starts from a series of user hypotheses and performs interactive queries, correlations, and other operations on multi-dimensional data sets (usually via SQL statements) to verify those hypotheses; it represents a deductive way of reasoning.
Data mining technology, by contrast, actively searches massive data for models and automatically uncovers patterns hidden in the data; it represents an inductive way of reasoning.
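As a small illustration of this hypothesis-driven style, the sketch below rolls up a toy multi-dimensional "sales cube" along two dimensions using pandas; the table, column names, and figures are invented for the example, and the commented SQL shows the equivalent query form mentioned above.

```python
import pandas as pd

# Toy multi-dimensional "sales cube": region x quarter x product (invented data).
sales = pd.DataFrame({
    "region":  ["East", "East", "West", "West", "East", "West"],
    "quarter": ["Q1",   "Q2",   "Q1",   "Q2",   "Q1",   "Q1"],
    "product": ["A",    "A",    "A",    "B",    "B",    "B"],
    "revenue": [100,    120,    90,     80,     60,     70],
})

# OLAP-style roll-up: aggregate revenue along the region and quarter dimensions.
# Roughly the same hypothesis-testing query as:
#   SELECT region, quarter, SUM(revenue) FROM sales GROUP BY region, quarter;
cube = sales.pivot_table(index="region", columns="quarter",
                         values="revenue", aggfunc="sum")
print(cube)
```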
Traditional data mining algorithms mainly include the following three families (a combined sketch appears after the list):
(1) Clustering, also known as cluster analysis, is a statistical method for studying classification problems (of samples or indicators). Based on the similarities and differences in the data, it divides a data set into categories such that data within the same category are highly similar while data in different categories are highly dissimilar and only weakly correlated. Enterprises can use clustering algorithms to segment customers: group customer data along different dimensions without first specifying the behavioral characteristics of each group, then extract and analyze the features of each group, and so capture customer characteristics and recommend the corresponding products and services.
(2) Classification is similar to clustering but serves a different purpose. Classification can start from a model previously generated by clustering, or use empirical data to find the common traits of a group of data objects and divide the data into different classes; the goal is to map data items onto given categories through a classification model. A representative algorithm is CART (Classification and Regression Tree). Enterprises can classify business data about users, products, and services, build a classification model, and then run predictive analysis on new data to assign it to an existing category. Classification algorithms are relatively mature and their accuracy is relatively high; they provide strong predictive power for precise customer targeting, marketing, and service, and help companies make decisions.
(3) Regression reflects the characteristics of attribute values in the data and uses a function to express the mapping between them, making the relationships among attribute values visible at a glance. It can be applied to prediction and correlation studies on data series. Enterprises can use regression models to analyze and forecast market sales and adjust their strategies in time; in risk prevention and anti-fraud, regression models can also serve as early-warning tools.
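The following compact sketch runs all three traditional techniques with scikit-learn on synthetic data. The data, parameter values, and the specific estimators (KMeans for clustering, a CART-style decision tree for classification, ordinary linear regression) are illustrative assumptions, not a prescription.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier   # CART-style classifier
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# (1) Clustering: group unlabeled "customer" points into two segments.
customers = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
segments = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(customers)

# (2) Classification: learn the segment labels with a CART-style decision tree,
#     then assign a new customer to one of the existing categories.
tree = DecisionTreeClassifier(random_state=0).fit(customers, segments)
print("predicted segment:", tree.predict([[4.5, 5.2]]))

# (3) Regression: fit a linear relation between (say) ad spend and sales,
#     which can then be used for forecasting or early warning.
ad_spend = rng.uniform(0, 10, (100, 1))
sales = 3.0 * ad_spend[:, 0] + rng.normal(0, 1, 100)
model = LinearRegression().fit(ad_spend, sales)
print("estimated effect of one unit of ad spend:", model.coef_[0])
```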
Traditional methods, whether OLAP or data mining, struggle with the challenges of big data. The first problem is low execution efficiency: traditional data mining technologies were developed on centralized underlying software architectures and are difficult to parallelize, so they handle terabyte-scale and larger data inefficiently. The second is that analysis accuracy is hard to improve as data volumes grow, especially for unstructured data.
Of all of humanity's digital data, only a very small portion of numerical data (roughly 1% of the total volume) has undergone in-depth analysis and mining (regression, classification, clustering); large Internet companies perform only shallow analysis (such as sorting) on semi-structured data like web page indexes and social data, while unstructured data such as voice, images, and video, which accounts for nearly 60% of the total, is difficult to analyze effectively.
Therefore, big data analysis technology needs breakthroughs in two areas. The first is efficient, in-depth analysis of large volumes of structured and semi-structured data to mine tacit knowledge, for example understanding and identifying semantics, sentiment, and intent in web pages written in natural language. The second is analysis of unstructured data: converting massive, complex, multi-source voice, image, and video data into machine-recognizable information with clear semantics and extracting useful knowledge from it.
At present, big data analysis technology represented by emerging approaches such as deep neural networks has achieved a certain degree of development.
A neural network is an advanced artificial intelligence technique characterized by self-adaptive processing, distributed storage, and high fault tolerance. It is well suited to handling non-linear problems and fuzzy, incomplete, or imprecise knowledge and data, which makes it a strong fit for big data mining.
Typical neural network models fall into three main categories: the first is feed-forward networks used for classification, prediction, and pattern recognition, represented by functional networks and perceptrons; the second is feedback networks used for associative memory and optimization, represented by Hopfield's discrete and continuous models; the third is self-organizing mapping methods used for clustering, represented by the ART model.
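As a minimal sketch of the first category, the code below trains a single-layer perceptron on a tiny linearly separable problem (the logical OR function); the data, learning rate, and number of passes are illustrative choices.

```python
import numpy as np

# Toy single-layer perceptron: the simplest feed-forward model,
# trained on a linearly separable problem (logical OR of two inputs).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 1])

weights = np.zeros(2)
bias = 0.0
learning_rate = 0.1                  # illustrative value

for _ in range(20):                  # a few passes over the data suffice here
    for xi, target in zip(X, y):
        prediction = int(weights @ xi + bias > 0)
        error = target - prediction
        weights += learning_rate * error * xi   # classic perceptron update rule
        bias += learning_rate * error

print([int(weights @ xi + bias > 0) for xi in X])   # matches y once converged
```

Deep neural networks stack many such layers with non-linear activations, which is what lets them model the fuzzy, non-linear relationships described above.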
However, although there are many neural network models and algorithms, there are no unified rules about which to use for data mining in a specific field, and the network's learning and decision-making process is hard for people to interpret.
With the deepening integration of the Internet and traditional industries, mining and analyzing web data has become an important part of demand analysis and market forecasting. Web data mining is a composite technique for discovering hidden input-to-output mappings from document structures and usage records.
Among these techniques, PageRank is currently the most widely studied and applied. PageRank is an important part of Google's ranking algorithm; it was granted a U.S. patent in September 2001 and is named after Larry Page, one of Google's founders. PageRank measures the value of a page based on the number and quality of the external and internal links pointing to it, an idea inspired by academic citation analysis: the more often a paper is cited, the higher its authority and quality are generally judged to be.
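A minimal power-iteration sketch of the PageRank idea is shown below, run on a tiny hand-made link graph. The graph is invented for illustration, and the damping factor of 0.85 is simply the commonly cited value; this is not Google's production algorithm.

```python
import numpy as np

# Tiny invented link graph: page i links to the pages listed in links[i].
links = {0: [1, 2], 1: [2], 2: [0], 3: [2]}
n = len(links)

# Column-stochastic transition matrix: M[j, i] is the probability of moving
# from page i to page j by following one of page i's outgoing links.
M = np.zeros((n, n))
for i, outgoing in links.items():
    for j in outgoing:
        M[j, i] = 1.0 / len(outgoing)

damping = 0.85                      # commonly cited damping factor
rank = np.full(n, 1.0 / n)          # start from a uniform distribution

# Power iteration: repeatedly apply the "random surfer" update until it settles.
for _ in range(100):
    rank = (1 - damping) / n + damping * M @ rank

print(rank / rank.sum())            # pages with more, better-ranked inbound links score higher
```

In this toy graph, page 2 is linked to by three other pages and therefore ends up with the highest score, mirroring the citation analogy described above.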
It should be pointed out that data mining and analysis are strongly shaped by industry and enterprise specifics. Beyond the most basic data analysis tools, there is still a lack of modeling and analysis tools that are both targeted and general-purpose, so industries and enterprises need to build data models specific to their own business. The ability to build such data analysis models has become the key for companies to win in the big data competition.