Introduction to Data Science: Python or R

For beginners who want to learn data science, choosing between Python and R is a hard decision. This article compares the two languages in the hope of helping you choose.

I am a lead data scientist at Deloitte. I have used Python and R for many years and have worked closely with the Python community for 15 years. This article reflects my personal opinion of the two languages.

Third option

To resolve this dilemma, Hadley Wickham, chief data scientist at RStudio, argues that it is better to make the two languages work together than to pick one of them. That is the third option I refer to, and I discuss it later in this article.

How to compare R and Python

For these two languages, the following points are worth comparing:

History:

R and Python developed along clearly different paths, although those paths overlap in places.

User groups:

This involves many complicated sociological and anthropological factors.

Performance:

A detailed comparison, and why the comparison is difficult.

Third-party support:

Modules, code bases, visualization, repositories, organizations and development environments.

Use cases:

The right choice depends on the specific task and type of work.

Can I use both at the same time:

Using R from Python, and Python from R.

Predictions:

An informal test.

Company and personal preferences:

The final answer is revealed.

History

Brief history:

ABC -> Python (created by Guido van Rossum in 1989) -> Python 2 (2000) -> Python 3 (2008)

Fortran -> S (Bell Labs) -> R (created in 1991 by Ross Ihaka and Robert Gentleman) -> R 1.0.0 (2000) -> R 3.0.2 (2013)

User groups

When comparing Python users with R users, keep the following in mind:

Only 50% of Python users also use R.

Assume that programmers who use R use it for "scientific and numeric" work. That assumption holds across the distribution, regardless of the programmer's skill level.

Returning to the second point: which user groups exist? The scientific and numeric community contains several subgroups, some of which overlap.

Subgroups that use Python or R:

Deep learning

Machine learning

Advanced analytics

Predictive analytics

Statistics

Exploratory data analysis

Academic research

Computation-heavy research fields

Although each field tends to serve its own community, R is more common in statistics and exploratory work. Until recently, exploring data took less time in R than in Python, partly because Python itself took time to install.

All of this has been changed by two disruptive technologies, Jupyter Notebook and Anaconda.

Jupyter Notebook: adds the ability to write Python and R code in the browser;

Anaconda: makes Python and R easy to install and manage.

You can now start up and run Python or R in a friendly environment that provides reports and analysis out of the box. These two technologies remove the barrier between getting a task done and using the language you prefer. Python can now be packaged in a platform-independent way and can deliver quick, simple analyses sooner.

Another factor that shapes a community's choice of language is open source: not only the open source libraries themselves, but also the effect a collaborative community has on open source. Ironically, open source software such as TensorFlow and the GNU Scientific Library (licensed under Apache and GPL, respectively) has bindings for both Python and R. R has many users, and so does Python. On the other hand, more enterprises use R, especially those with a background in statistics.

Finally, in terms of community and collaboration, GitHub shows stronger support for Python. Among the recently popular Python packages, projects such as TensorFlow have more than 35,000 stars. Among R's popular packages, Shiny and Stan each have fewer than 2,000.

Performance

This aspect is not easy to compare.

There are too many metrics and scenarios to test, and results are hard to generalize from any specific hardware. Some operations are optimized in one language but not in the other.

Loops

Before going further into how Python compares with R, ask yourself: do you really want to write a lot of loops in R? After all, the two languages were designed with different goals in mind.

The notebook below defines the same loop in both languages and times each version. It ran on a Python 3 kernel (Python 3.6.3) with the rpy2 IPython extension loaded.

Cell 1 (Python):

    import numpy as np
    %load_ext rpy2.ipython

Cell 2 (Python):

    def do_loop(u1):

        # Initialize `usq`
        usq = {}

        for i in range(100):
            # Square of the i-th element of `u1` goes to the i-th position of `usq`
            usq[i] = u1[i] * u1[i]

Cell 3 (R):

    %%R
    do_loop <- function(u1) {

        # Initialize `usq`
        usq <- 0

        for(i in 1:100) {
            # Square of the i-th element of `u1` goes to the i-th position of `usq`
            usq[i] <- u1[i]*u1[i]
        }

    }

Cell 4 (timing the R version):

    %%timeit -n 1000
    %%R
    u1 <- rnorm(100)
    do_loop(u1)

Output:

    1.58 ms ± 42.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Cell 5 (timing the Python version):

    %%timeit -n 1000
    u1 = np.random.randn(100)
    do_loop(u1)

Output:

    36.9 µs ± 5.99 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Python takes 0.000037 seconds per loop, and R takes 0.00158 seconds.

Including load time and running from the command line, R takes 0.238 seconds and Python takes 0.147 seconds. To be clear, this is not a scientifically rigorous test.

The test shows that Python is clearly faster here, although in practice the difference usually has little impact.
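The notebook timings come from IPython's timeit magic. Outside a notebook, a similar measurement can be sketched with the standard-library timeit module. The version below is only an illustration: do_loop mirrors the Python cell from the notebook (with a return added so the result can be inspected), the input is generated with random.gauss instead of NumPy, and the repeat counts simply copy the 7 runs of 1000 loops reported above. Absolute numbers will differ from machine to machine.

```python
import random
import statistics
import timeit


def do_loop(u1):
    """Square each of the first 100 elements of u1, one at a time."""
    usq = {}
    for i in range(100):
        usq[i] = u1[i] * u1[i]
    return usq  # added for this sketch; the notebook cell returns nothing


# 100 standard-normal samples, standing in for np.random.randn(100)
random.seed(0)
u1 = [random.gauss(0.0, 1.0) for _ in range(100)]

# 7 runs of 1000 calls each, mirroring `%%timeit -n 1000`
runs = timeit.repeat(lambda: do_loop(u1), repeat=7, number=1000)
per_loop = [t / 1000 for t in runs]  # seconds per single call

mean_us = statistics.mean(per_loop) * 1e6
std_us = statistics.pstdev(per_loop) * 1e6
print(f"{mean_us:.1f} us +/- {std_us:.1f} us per loop (7 runs, 1000 loops each)")
```

Dividing each run by the number of calls gives a per-call time, which is the figure that should be compared across languages.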

Apart from raw speed, what aspect of performance matters more to data scientists? Both languages are popular because they can be used as command languages. In Python, for example, we rely heavily on pandas most of the time. That brings us to the modules and libraries available in each language and how they are implemented.

Third-party support

Python has PyPI, R has CRAN, and both have Anaconda.

CRAN packages are installed with the built-in install.packages command. CRAN currently hosts about 12,000 packages, and more than half of them can be used for data science.

PyPI hosts roughly 10 times as many packages as CRAN, well over 100,000. About 3,700 of them are tagged as scientific or engineering packages. Others can also be used for science but are not labeled as such.

There is some overlap between the two. A search for "random forest" returns 170 results on PyPI, but these packages are not all the same.

Although Python has 10 times as many packages as R, the number of packages related to data science is roughly the same.

Running speed

It makes more sense to compare R data frames with pandas.

We ran an experiment comparing execution times for complex exploratory tasks, with the following result:

Python runs faster in most tasks.

http://nbviewer.jupyter.org/gist/brianray/4ce15234e6ac2975b335c8d90a4b6882

As you can see, Python with pandas is faster than native R data frames. Note that this does not mean that Python itself runs faster. pandas is built on top of NumPy and relies on compiled C code.
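To illustrate why the library can matter more than the language, the sketch below squares the same values twice: once with an explicit Python loop and once with a single NumPy expression that runs in compiled code. It assumes NumPy is installed. The array size and the function names square_loop and square_vectorized are arbitrary choices made for this illustration and are not taken from the linked notebook.

```python
import timeit

import numpy as np

rng = np.random.default_rng(0)
u1 = rng.standard_normal(100_000)


def square_loop(values):
    """Square element by element in interpreted Python."""
    out = np.empty_like(values)
    for i in range(len(values)):
        out[i] = values[i] * values[i]
    return out


def square_vectorized(values):
    """Square the whole array in one NumPy operation."""
    return values * values


# Both approaches produce the same numbers
assert np.allclose(square_loop(u1), square_vectorized(u1))

# Average over a few calls; exact timings depend on the machine
loop_s = timeit.timeit(lambda: square_loop(u1), number=3) / 3
vec_s = timeit.timeit(lambda: square_vectorized(u1), number=3) / 3
print(f"loop: {loop_s:.4f} s, vectorized: {vec_s:.6f} s per call")
```

On most machines the vectorized call should come out far ahead, which is the same effect that makes pandas operations fast even though they are called from Python.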

Visualization

Here, ggplot2 is compared with Matplotlib.

Matplotlib was written by John D. Hunter. He is one of the people I respect most in the Python community, and he is also the person who taught me Python.

Matplotlib is not easy to learn, but it can be customized and extended. ggplot2 is difficult to customize, and some people find it even harder to learn.

If you like attractive charts and do not need to customize them, R is a good choice. If you want to do more, Matplotlib, or even an interactive plotting tool, is a good option. Similarly, Shiny adds interactivity on the R side.

Can I use both at the same time?

You may ask: why not use Python and R together?

You can use both languages together when:

Your company or organization permits it;

Both can be set up and maintained easily in your programming environment;

Your code does not need to be integrated into another system;

It does not cause trouble for the people you collaborate with.

Ways to use both languages:

Python packages that provide access to R, such as rpy2, pyRserve and Rpython.

R has corresponding packages: rPython, PythonInR, reticulate, rJython, SnakeCharmR, XRPython.

Use Jupyter and mix the two. For example:

You can pass in a pandas data frame, which rpy2 converts automatically into an R data frame when you use the "-i df" option:

http://nbviewer.jupyter.org/gist/brianray/734bd54f468d9a6db9171b2cfc98405a

Predictions

Someone on Kaggle wrote a kernel about developers who use R or Python. From the data, he found the following interesting results:

If you plan to switch to Linux next year, you are more likely to be a Python user.

If you study statistics, you are more likely to use R; if you study computer science, you are more likely to use Python.

If you are young (18-24 years old), you are more likely to be a Python user.

If you take part in programming contests, you are more likely to be a Python user.

If you want to use Android next year, you are more likely to be a Python user.

If you want to learn SQL next year, you are more likely to be an R user.

If you use MS Office, you are more likely to be an R user.

If you want to use a Raspberry Pi next year, you are more likely to be a Python user.

If you are a full-time student, you are more likely to be a Python user.

If you use agile methodology, you are more likely to be a Python user.

If you are more worried than excited about artificial intelligence, you are more likely to be an R user.

Company and personal preferences

When I spoke with Alex Martelli, a Googler and Stack Overflow legend, he explained to me why Google officially supported only a few languages at first. Even in an environment as advanced as Google's, there are restrictions and preferences, and the same is true at other companies.

Beyond corporate preferences, the first person to use a language within a company also plays a decisive role. The first person to use R at Deloitte still works at the company and is now its chief data scientist. My advice is to pick the language you like, love the language you pick, take the lead, and love your career.

When you are working on something important, mistakes are inevitable. Still, every well-designed data science project leaves some room for data scientists to experiment and learn. What matters is keeping an open mind and embracing diversity.

Finally, I personally use mainly Python, and I look forward to learning more about R.