Subscribe to our Newsletter

Guest blog post by Alessandro Piva

The proliferation of data and the huge potentialities for companies to turn data into valuable insights are increasing more and more the demand of Data Scientists.

But what skills and educational background must a Data Scientist have? What is its role within the organization? What tools and programming languages does he/she mostly use? These are some of the questions that the Observatory for Big Data Analytics of Politecnico di Milano is investigating through an international survey submitted to Data Scientists: if you work with data in your company, please support us in our research and compile this totally anonymous survey.

Programming is one of the five main competence areas at the base of the skill set for a Data Scientist, even if is not the most relevant in term of expertise (see What is the right mix of competences for Data Scientists?). Considering the results of the survey, that involved more than 200 Data Scientist worldwide until today, there isn’t a prevailing choice among the programming languages used during the data science’s activities. However, the choice appears to be addressed mainly to a limited set of alternatives: almost 96% of respondents affirm to use at least one of R, SQL or Phyton.

In particular, at the top of the ranking in the current sample we find R used by 53% of Data Scientists, supported by the R Foundation for Statistical Computing. Initially widespread mainly among statisticians or in academic environments, the use of R has increased substantially in recent years in data science’s activities. Today it’s one of the most popular open source languages and it’s supported by a large and helpful community.

Even if it was developed in the early 1970s, SQL plays a key role still today (in second position of ranking with 49% of preferences). Although SQL is not designed for the task of handling unstructured datasets (typical of Big Data), there is still a strong need for analyse structured data in organizations, and SQL is a very popular choice for data crunching stage.

At the third position of the ranking there is Python (43%), that has become very popular in recent years because of its flexibility and relative easiness to learn. Like R, it also has a large community dedicated to improve the product and developing specific and focused packages.

The top 5’s ranking is completed by Unix Shell/AWK/Gawk (15%) and Java (8%).

If you are a Data Scientist and you want to receive more detailed results with the main and final findings of the research, compile the questionnaire leaving us your email in order to contact you to send the material.

Email me when people comment –

You need to be a member of Hadoop360 to add comments!

Join Hadoop360