At present, data mining is not widely practiced in China; it is something of a dragon-slaying skill, impressive but rarely put to use.
Initial data preparation usually accounts for about 70% of the whole data mining project workload.
Data mining itself combines statistics, databases, and machine learning; it is not a new technology.
Data mining is better suited for business people to learn (this is more efficient than having technical people learn the business).
Data mining is suited to the areas that traditional business intelligence (reports, OLAP, etc.) cannot support.
Data mining projects usually involve a good deal of repetitive, low-skill work.
If you think the above content is acceptable, then continue reading.
Learning a technology should stay close to an industry; technology without an industry background is a castle in the air. Technology, especially in computing, develops broadly and quickly (ten years ago you could start a company just by building web pages), and most people do not have the energy or time to master every technical detail. Once technology is combined with an industry, however, it can stand on its own. On the one hand, this helps you grasp users' pain points and essential needs. On the other hand, you accumulate industry experience, and applying Internet-style cross-industry thinking makes it easier to succeed. Do not try to cover everything when learning technology, or you will lose your core competitiveness.
First, the jobs currently open to data miners in China fall roughly into three categories.
1) Data analyst: business consulting, business intelligence, and analysis reporting in data-rich industries such as e-commerce, finance, telecommunications, and consulting.
2) Data mining engineer: implementing and analyzing machine learning algorithms in big-data industries such as multimedia, e-commerce, search, and social networking.
3) Research direction: studying efficiency improvements to new algorithms and their future applications in universities, scientific research units, and enterprise research institutes.
Second, the skills that need to be mastered for each kind of work.
(1). Data analyst
You need a solid foundation in mathematical statistics; programming ability is not required.
You need to be proficient with mainstream data mining (or statistical analysis) tools, such as SAS (business analytics and business intelligence software), SPSS, and Excel.
You need a deep understanding of all the core data of your industry, and a trained sensitivity to data.
Recommended classic books: Probability and Mathematical Statistics, Statistics (the David Freedman edition is recommended), Business Modeling and Data Mining, Introduction to Data Mining, SAS Programming and Data Mining Business Cases, Clementine Data Mining Methods and Applications, Excel 2007 VBA Reference, IBM SPSS Statistics 19 Statistical Procedures Companion.
(2). Data Mining Engineer
You need to understand the principles and applications of mainstream machine learning algorithms.
You need to be familiar with at least one programming language, such as Python, C, C++, Java, or Delphi.
Ideally you understand database principles and can work fluently with at least one database (MySQL, SQL, DB2, Oracle, etc.), and you understand the principle of MapReduce and can use the Hadoop family of tools fluently.
Recommended classic books: Data Mining: Concepts and Techniques, Machine Learning in Action, Artificial Intelligence and Its Applications, Introduction to Database Systems, Introduction to Algorithms, Web Data Mining, Python Standard Library, Thinking in Java, Thinking in C++, Data Structures, etc.
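To make the MapReduce principle mentioned above more concrete, here is a minimal sketch in plain Python. It does not use Hadoop at all; it only imitates the map, shuffle, and reduce phases inside one process, and the function names (map_phase, shuffle_phase, reduce_phase, word_count) and the sample records are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(record: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in record.lower().split():
        yield word, 1


def shuffle_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by their key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse the values collected for one key into one result."""
    return key, sum(values)


def word_count(records: Iterable[str]) -> Dict[str, int]:
    """Run the three phases in sequence over an in-memory list of records."""
    pairs = (pair for record in records for pair in map_phase(record))
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical sample input, not real data.
    print(word_count(["churn model churn", "segment model"]))
```

On a real cluster the map and reduce functions run on many machines and the framework performs the shuffle over the network; the division of work into these three phases is the part worth understanding first.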
(3). Research direction
You need to study the theoretical foundations of data mining in depth, including association rule mining (Apriori and FP-Tree), classification algorithms (C4.5, KNN, logistic regression, SVM, etc.), and clustering algorithms (K-means, spectral clustering). A good first goal is to understand in depth the usage, strengths, and weaknesses of the top 10 algorithms in data mining.
Compared with SAS and SPSS, the R language (The R Project for Statistical Computing) is better suited to research work, because R is completely free and its open community provides a wide variety of add-on packages, which makes it a good fit for statistical analysis and research. Although it is not yet well known in China, it is highly recommended.
You can try to improve some mainstream algorithms to make them faster and more efficient, for example by implementing an SVM cloud-algorithm invocation platform on Hadoop, in which a web project calls a Hadoop cluster.
You need to read widely and deeply in the papers of world-famous conferences and keep track of hot technologies, for example KDD, ICML, IJCAI, AAAI (Association for the Advancement of Artificial Intelligence), and ICDM. There are also journals related to data mining: ACM Transactions on Knowledge Discovery from Data, IEEE Transactions on Knowledge and Data Engineering, Journal of Machine Learning Research, IEEE Transactions on Pattern Analysis and Machine Intelligence, etc.
You can try to take part in data mining competitions to develop your all-round ability to solve practical problems, for example those of SIGKDD and Kaggle.
You can try to contribute code to open source projects, such as Apache Mahout (scalable machine learning and data mining) and Myrrix (you can find more interesting projects on SourceForge or GitHub).
Recommended classic books: Machine Learning, Pattern Classification, The Nature of Statistical Learning Theory, Statistical Learning Methods, Data Mining: Practical Machine Learning Tools and Techniques, R Language Practice. English proficiency is essential for researchers, so also read: Machine Learning: A Probabilistic Perspective, Scaling Up Machine Learning: Parallel and Distributed Approaches, Data Mining Using SAS Enterprise Miner: A Case Study Approach, Python for Data Analysis, etc.
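As a starting point for the association rule mining named in this section, the following is a small sketch of the frequent-itemset part of Apriori in plain Python. It is a teaching version under simple assumptions (transactions fit in memory, support is an absolute count), and the service names in the sample baskets are made up for illustration.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Iterable, List, Set


def apriori(transactions: List[Set[str]], min_support: int) -> Dict[FrozenSet[str], int]:
    """Return every itemset found in at least `min_support` transactions."""

    def count(candidates: Iterable[FrozenSet[str]]) -> Dict[FrozenSet[str], int]:
        # Count how many transactions contain each candidate, keep the frequent ones.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        return {c: n for c, n in counts.items() if n >= min_support}

    items = {item for t in transactions for item in t}
    frequent = count({frozenset([item]) for item in items})
    result = dict(frequent)
    k = 2
    while frequent:
        prev = set(frequent)
        # Join step: union pairs of frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        frequent = count(candidates)
        result.update(frequent)
        k += 1
    return result


if __name__ == "__main__":
    # Hypothetical baskets of subscribed services, not real data.
    baskets = [
        {"voice", "sms", "data"},
        {"voice", "data"},
        {"voice", "sms"},
        {"data", "roaming"},
    ]
    for itemset, support in apriori(baskets, min_support=2).items():
        print(sorted(itemset), support)
```

The prune step is what distinguishes Apriori from brute-force counting: a candidate is discarded without scanning the data as soon as one of its subsets is known to be infrequent. FP-Tree reaches the same itemsets without generating candidates, which is one of the trade-offs worth studying.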
Third, here are some reflections from a data mining engineer working in the telecommunications industry.
From the standpoint of data mining project practice, communication skills and a genuine interest in mining matter most. Only with interest will you be willing to dig in and learn. Only with good communication skills can you understand business problems correctly, turn them correctly into mining problems, express your intentions and ideas clearly to people from the relevant specialties, and win their understanding and support. So I think communication skills and interest are a person's core competitiveness in data mining, and they are the hardest to learn; other professional knowledge can be learned by anyone and is not what sets a person apart.
At this point, many data warehouse experts, programmers, statisticians, and others may want to throw bricks at me. Sorry, I mean no offense. Your specialties are all very important to data mining, and together they form a whole; but for any one individual, with limited energy and time, it is impossible to master all of these fields. That being so, I think the most important core is data mining skill together with the relevant business ability. Consider an extreme example: a mini mining project carried out by one person who knows mining and marketing. Although he does not understand data warehouses, simple Excel is enough to process 60,000 samples. Although he knows no professional presentation techniques, no presentation is needed as long as he can read the results himself. Statistical skills, as mentioned above, he should have, and they are very important for a one-person mini-project. Although he cannot program, professional mining tools and mining skills are enough for him to do the work. So in a mini-project, one person with mining skills and marketing business ability can complete the job successfully, and can even keep digging different project ideas out of a single data source as business needs change. By contrast, a pure data warehouse expert, a pure programmer, a pure presentation technician, or even a pure mining technician could not handle this mini-project alone. This also shows, from another angle, why communication skills matter: if these completely different specialties are to be integrated effectively and organically in a data mining project, how could that work without good communication?
Data mining ability can only be improved and refined in the melting pot of project practice, so learning mining through projects is the most effective shortcut. People who study mining abroad always begin by following their supervisor on projects. It does not matter if they understand little at first: the less they understand, the more clearly they know what to learn, and the faster and more effectively they learn it. I do not know how data mining students in China study, but judging from some online forums, many of them are armchair strategists who waste time and learn inefficiently.
In addition, the concept of data mining in China is currently very muddled. Much BI work is limited to report presentation and simple statistical analysis, yet it is also called data mining. Meanwhile, only a handful of industries in China (banking, insurance, mobile communications) really implement data mining on a large scale; applications in other industries can only be considered small-scale. Many universities, for example, have mining-related research topics and projects, but these are scattered and still exploratory. Even so, I believe data mining has a very promising future in China, because it is an inevitable historical development.
As for real cases in mobile communications: if you work at a mobile carrier, you probably know of a domestic company called Huayuan Analysis. (I have no connection with this company, but I have looked at most of the so-called data mining service companies in China from a data miner's perspective, and I think Huayuan is not bad; it is more practical than many companies that are big in name only.) Its business now covers the analysis and mining projects of most provincial mobile companies in China, and you should be able to find detailed information by searching online. What impressed me most about Huayuan Analysis is that the company started from scratch in 2002. It did not matter that its people knew little at first; they taught themselves while building up a customer base, and now the company has blossomed all across China's mobile communications market. I really admire that. At the beginning they processed data in Excel and compared different models by eye; you can imagine how hard that was.
As for specific applications of data mining in mobile communications, there are a great many: designing different tariff plans, customer churn models, cross-selling models for different services, elasticity analysis of different customers' preferences, customer segmentation models, life cycle models for different customers, channel selection models, early-warning models for malicious fraud, and so on. Remember to start from customers' needs and from practical problems, and mobile communications will offer more mining projects than you can count. Finally, I will tell you a secret: once your data mining ability reaches a certain level, you will find that the applications of data mining are much the same in every industry, and the work will feel much easier.
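To give one of the applications listed above a concrete shape, here is a minimal sketch of a customer segmentation model using K-means in plain Python. Everything specific in it is an assumption for illustration: the two features (monthly call minutes and monthly data usage in GB), the sample customers, and the choice to start the centroids at the first k points so that the result is reproducible.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def squared_distance(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two 2-D points."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def kmeans(points: Sequence[Point], k: int, max_iter: int = 100) -> Tuple[List[Point], List[int]]:
    """Plain Lloyd-style K-means.

    Assumption: centroids start at the first k points so the run is
    deterministic; practical work usually uses random or k-means++ starts.
    """
    centroids = list(points[:k])
    labels: List[int] = []
    for _ in range(max_iter):
        # Assignment step: attach each customer to the nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centroids[j] = (
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                )
    return centroids, labels


if __name__ == "__main__":
    # Hypothetical customers: (monthly call minutes, monthly data usage in GB).
    customers = [(100, 1), (900, 20), (120, 2), (950, 22), (110, 1.5)]
    centers, segments = kmeans(customers, k=2)
    print(centers)
    print(segments)
```

In this toy data the two groups are obvious by eye, which is exactly why it is useful for checking the code. On real customer data the features should be put on comparable scales first, since otherwise the feature with the largest numeric range dominates the distance, and the business meaning of each segment still has to be worked out with the marketing side.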