Books, certificates and graduate degrees in data science are spreading like mushrooms after the rain.
Unfortunately, many are just a mirage: some old guys taking advantage of the new paradigm to quickly repackage very old material (statistics, R programming) under a new label, data science.
To add to the confusion, executives and decision makers building a new team of data scientists sometimes don't know exactly what they are looking for, and end up hiring pure tech geeks, computer scientists, or people lacking proper experience. The problem is compounded by HR staff who do not know better and produce job ads that always contain the same keywords: Java, Python, MapReduce, R, NoSQL. As if a data scientist were a mix of these skills.
Indeed, you can be a real data scientist and have none of these skills. NoSQL and MapReduce are not new concepts: many embraced them long before these keywords were created. But to be a data scientist, you also need:
- business acumen
- real big data expertise
- the ability to sense the data
- a healthy distrust of models
- knowledge of the curse of big data
- the ability to communicate and to understand which problems management is trying to solve
- the ability to correctly assess lift or ROI on the salary paid to you
- the ability to quickly identify a simple, robust, scalable solution to a problem
- the ability to convince and drive management in the right direction, sometimes against their will, for the benefit of the company, its users and shareholders
- a real passion for analytics
- real applied experience with success stories
- data architecture knowledge
- data gathering and cleaning skills
A data scientist is also a business analyst, statistician and computer scientist: a generalist in these three areas, with expertise in a few fields (e.g. robustness, design of experiments, algorithm complexity, dashboards and data visualization).
Fake Data Science Examples
Here are two examples of mislabeled data science products, and the reason why we are interested in creating a standard and best practices for data scientists. Not that these two products are bad; they have a lot of intrinsic value. But they are not data science.
1. eBook: An Introduction to Data Science
Most of the book is about old statistical theory. Throughout the book, R is used to illustrate the various concepts. The entire book is about small data, with the exception of the last few chapters, where you learn a bit of SQL (embedded in R code) and how to use an R package to extract tweets from Twitter and create what the author calls a word cloud (it has nothing to do with cloud computing).
Even the Twitter project is about small data, and there is no distributed architecture (e.g. MapReduce) in it. Indeed, the book never talks about data architecture. Its level is elementary. Each chapter starts with a very short introduction in simple English (suitable for middle school students) about big data / data science, but these little data science excursions are out of context and independent of the projects and technical presentations.
I guess the author (Jeffrey Stanton) added these short paragraphs so that he could rename his "Statistics with R" eBook as "Introduction to Data Science". But it's free, and it's a nice, well-written book to get high school students interested in statistics and programming. It's just that it has nothing to do with data science.
2. Data Science Certificate
Delivered by a respected public university (we won't mention the name). The advisory board consists mostly of senior technical people, most of whom hold academic positions. The data scientist is presented as "a new type of data analyst": I strongly disagree with this. Data scientists are not junior people.
This program has a strong data architecture and computer science flair, and this CS content is of great quality. That's a very important part of data science, but in my opinion, it covers only one third of data science. It has a bit of old statistics too, and some nice statistics lessons on robustness and related topics, but nothing about six sigma, approximate solutions, the Lorenz curve, the 80/20 rule and related concepts, cross-validation, design of experiments, modern pattern recognition, lift metrics, third party data, Monte Carlo simulations or the life cycle of data science projects, and nothing found in an MBA curriculum. It requires knowledge of Java and Python for admission. It is also very expensive: several thousand dollars.
To be admitted, you need to take a 90-minute multiple-choice test with questions that only fresh graduates would be able to answer. Ironically, this online test is the same for everyone (I double-checked), so technically, you could first take it using a fake name, save the questionnaire, pay someone to answer the questions, then take the test again under your real name, complete it in just 30 seconds and get all the answers correct! I guess they don't have a real data scientist on board to help them with fraud detection issues. In short, the admission process will eliminate most real data scientists (those with years of successful business experience) except the fraudsters.
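To make the fraud detection point concrete, here is a minimal Python sketch of the kind of check that would catch the replay trick described above: a near-perfect score submitted in a tiny fraction of the allotted 90 minutes. The `Attempt` record, the two thresholds and the sample data are all assumptions invented for this illustration; nothing here reflects how the actual admission platform works.

```python
from dataclasses import dataclass

# The 90-minute limit comes from the text; everything else is hypothetical.
TEST_DURATION_SECONDS = 90 * 60


@dataclass
class Attempt:
    candidate: str        # label for the test taker (made up)
    seconds_taken: float  # time between starting and submitting the test
    correct: int          # number of correct answers
    total: int            # number of questions


def is_suspicious(attempt, min_fraction_of_time=0.05, min_score=0.95):
    """Flag near-perfect scores completed in a tiny fraction of the time limit.

    Both thresholds are illustrative guesses, not calibrated values.
    """
    score = attempt.correct / attempt.total
    too_fast = attempt.seconds_taken < min_fraction_of_time * TEST_DURATION_SECONDS
    return too_fast and score >= min_score


# Made-up sample data: one plausible attempt and one replayed attempt.
attempts = [
    Attempt("honest candidate", seconds_taken=75 * 60, correct=41, total=50),
    Attempt("replayed answers", seconds_taken=30, correct=50, total=50),
]

flagged = [a.candidate for a in attempts if is_suspicious(a)]
print(flagged)
```

A rule this simple would not stop a patient fraudster who waits before submitting, so the more robust fix is not to serve the same questionnaire to everyone in the first place.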
Originally posted on Analytic Bridge