1. Describe two analytics or computer science projects you have worked on. How did you measure the results?
2. How would you make a web crawler faster, extract better information, summarize data better, and produce a clean database?
3. What are lift, key performance indicators (KPIs), robustness, model fitting, design of experiments, and the 80/20 rule?
4. What are collaborative filtering, n-grams, MapReduce, and cosine distance?
5. Should clickstream data be processed in real time? Why? Which parts should be processed in real time?
6. How would you design a solution for detecting plagiarism?
7. How would you verify whether a personal paid account is being used by multiple people?
8. What is probabilistic merging (also known as fuzzy merging)? Is it easier to handle with SQL or with other languages? Which language would you choose for processing semi-structured data?
9. Which do you think is better, good data or good models? And how do you define "good"? Is there a universal model for all situations? Do you know of any models that are not so well defined?
10. What is your favorite programming language? Why?
11. How do you deal with missing data? What techniques do you recommend for handling it?
12. What is the curse of big data?
13. Tell me three things you like best about your favorite statistical software.
14. What are the differences between SAS, R, Python, and Perl?
15. What features do you like about the TD database?
16. Have you ever participated in designing databases and data models?
17. Have you ever participated in dashboard design and the selection of indicators? What do you think of business intelligence and reporting tools?
18. Please explain how MapReduce works. In what application scenarios does it work well? What are the security problems of the cloud?
19. How would you go about sending 1 million marketing emails? How would you optimize delivery? How would you optimize the response? Can the two be optimized separately?
20. Suppose several clients' queries against an Oracle database are running inefficiently. Why might that be? What can be done to speed them up by more than a factor of 10 and to handle large outputs better?
21. How do you transform unstructured data into structured data? Is such a transformation really necessary? Is it better to store data as flat text files than in a relational database?
22. What is a hash table collision attack? How can it be avoided? How often does it happen?
23. How do you judge whether a MapReduce process is well load-balanced? What is load balancing?
24. Have you worked with whitelists? What are the main rules? (In the context of fraud or crawler detection.)
25. Do you think 100 small hash tables are better than one big hash table, in terms of memory or running speed? What is your assessment of database analytics?
26. Why is naive Bayes so bad? How would you improve a crawler detection algorithm that uses naive Bayes?
27. What are the shortcomings of the ordinary linear regression model? Do you know any other regression models?
28. What is a star schema? What is a lookup table?
29. Can you build a logistic regression model in Excel? How would you explain the process?
30. When programming in SQL, Perl, C++, Python, and so on, have you optimized code or algorithms for speed? How, and by how much?
31. Is it better to spend 5 days on a solution with 90% accuracy or 10 days on one with 100% accuracy? Which would you choose?
32. Define QA (quality assurance), Six Sigma, and design of experiments. Can you give examples of good and bad experimental design?
33. What is sensitivity analysis? Is it better to have lower sensitivity (that is, better robustness) and lower predictive power, or the opposite? How do you perform cross-validation? What do you think of injecting noise into a data set to test a model's sensitivity?
34. Do you think a decision tree with fewer than 50 leaves is better than a big one? Why?
35. Is actuarial science a branch of statistics? If not, why not?
36. Give an example of data that follows neither a Gaussian nor a lognormal distribution. Give a numerical example of data with a very chaotic distribution.
37. How would you construct a nonparametric confidence interval?
38. How do you prove that an algorithm improvement you made is really effective compared with making no change? Are you familiar with A/B testing?
39. Why is mean squared error not a good measure of model performance? Which measure would you suggest using instead?
40. Consider logistic regression, decision trees, and neural networks. What major improvements have been made to these techniques in the past 15 years?
41. Do you use dimensionality reduction techniques other than principal component analysis? What do you think of stepwise regression? Which stepwise regression techniques are you familiar with? When is the full data set better than reduced data or a sample?
42. How would you create a keyword taxonomy?
43. Are you familiar with extreme value theory, Monte Carlo simulation, or other mathematical statistics methods for correctly estimating the probability of rare events?
44. What is attribution analysis? How do you distinguish attribution (causation) from correlation? Give an example.
45. How would you define and measure the predictive power of an indicator?
46. How do you find the best rule set for a fraud detection scoring technology? How do you deal with rule redundancy, rule discovery, and the underlying nature of the problem? Is an approximate solution to the rule set problem feasible? How would you find a good approximate solution? How do you decide that a solution is good enough to stop looking for a better one?
47. What is a proof of concept?
48. What is a botnet? How can one be detected?
49. Have you ever used APIs? What kind? Google's, Amazon's, or software as a service (SaaS)?
50. When is it better to write your own code than to use a data science software package?
51. What tools do you use for visualization? What do you think of Tableau for plotting? R? SAS? How can you effectively display five dimensions in one chart?
52. Which is better, a false positive or a false negative?
53. What types of clients have you mainly worked with: internal, external, sales, finance, marketing, or IT? Do you have any consulting experience? Have you dealt with vendors, including vendor selection and testing?
54. Are you familiar with the software life cycle? And with the life cycle of IT projects, from requirements gathering to maintenance?
55. What is a cron job?
56. Are you a lone coder? A developer? Or a designer?
57. What makes a chart misleading or difficult to read or interpret? What are the features of a useful chart?
58. Are you familiar with price optimization, price elasticity, inventory management, and competitive intelligence? Give an example of each.
59. How does Zillow's algorithm work?
60. How would you detect fake reviews or fake Facebook accounts used for malicious purposes?
61. How would you create a new anonymous digital account?
62. Have you ever thought about starting your own business? What kind of idea did you have?
63. Do you think the username and password login box will disappear? What will replace it?
64. Have you ever used time series models? Cross-correlations with time lags? Correlograms? Spectral analysis? Signal processing and filtering techniques? In what context?
65. Which data scientists do you admire most? Where would you start?
66. How did you get interested in data science?
67. What do you think are the five best forecasting methods for the next 20 years?
68. What is a recommendation engine? How does it work?
69. What is an exact test? How and when can simulation help when an exact test cannot be used?
70. What do you think makes an excellent data scientist?
71. Do you think data scientists are artists or scientists?
72. How can you tell right away that the statistics published in an article (such as in a newspaper) are wrong, or are being used to support the author's argument rather than simply to present information? For example, what do you think of the unemployment statistics that the government regularly releases through the media each month? How could these figures be made more accurate?
73. Give a few examples of best practices in data science.
74. What is an efficiency curve? What are its shortcomings, and how do you overcome them?
75. What is the largest data set you have processed? How did you process it? What were the results?
76. What is the computational complexity of a good, fast clustering algorithm? What makes a clustering algorithm good? How do you determine the number of clusters?
77. Do you know any "rules of thumb" used in statistics or computer science? Or in business analytics?
Candidates interviewing for data analyst positions are very likely to encounter the questions above, and some of them involve specialized topics, so be sure to prepare thoroughly before the interview!
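
As a preparation aid for question 4, here is a minimal sketch of cosine distance in plain Python. It assumes the common definition of cosine distance as 1 minus the cosine similarity of two equal-length numeric vectors; the function name `cosine_distance` and the sample term-count vectors are illustrative choices, not taken from the list above.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cos(angle between u and v) for two equal-length numeric vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # The angle is undefined when either vector has zero length.
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


if __name__ == "__main__":
    # Hypothetical term-count vectors for two short documents.
    doc_a = [2, 1, 0, 3]
    doc_b = [1, 0, 1, 3]
    print(cosine_distance(doc_a, doc_b))
```

Under this definition the distance depends only on the angle between the vectors, not on their lengths, which is why it is often used to compare documents of different sizes.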