How to hire a Data Scientist
At SAP we help our customers to become Intelligent Enterprises. This is a transformation, not tactical purchasing of newer technologies. As part of our Innovation Management Consulting, we help customers to establish their own Intelligent Technology competence.
Starting from one of the widely underestimated challenges — hiring a Data Scientist
Artificial Intelligence, Machine Learning, and Data Science are catchy words these days. Look around: it seems every company is hiring Data Scientists. And chances are your organization doesn’t want to be left behind.
Some startups were created by Machine Learning experts, and their top leaders are their best Data Scientists, which means they are well qualified to identify competencies and spot shortcomings. However, even these experts sometimes find it difficult to hire good Data Scientists.
For most of the rest of us, hiring any “scientist” outside of an academic setting can be a daunting task. So, if you need to hire Data Scientists, whether because you want to or because you were told to, where do you start?
First, let’s take a look at what makes a Data Scientist. What are the differences between a Data Scientist and a “math person”, a “computer geek”, or an “academic”?
Data Scientist is a relatively new role that emerged after Machine Learning really took off early this decade. In short, it is a role that brings together expertise from several subjects to “extract” value out of data. Namely, a Data Scientist is an innovator who integrates several areas of knowledge and skills: data and processing (Computer Science), algorithms (Math), and meaning (domain expertise).
The most typical skills required of a Data Scientist include Math, especially Statistics. This should not be a surprise. The whole field of Artificial Intelligence is built on top of Math, Machine Learning is based on mathematical models that can be “trained” to recognize patterns, and modern Machine Learning relies heavily on statistical models applied to big data sets.
“How important is Statistics in Data Science?” “Very important” is not very convincing, so let me use an example to explain its importance.
Most companies use A/B testing these days, and it is widely suspected that A/B testing’s success and popularity are turning into an existential threat to it. As A/B testing became more and more popular, more and more tests were run by people without a statistics background, and these runs are marred by many statistical flaws. Up to 80% of the “findings” produced by people without a statistics background were “false positives”: a “benefit” that either does not exist or is even negative. This happens because the principle of randomness is broken by “common” actions such as terminating a test as soon as a “winner emerges” (or the p-value drops below 0.05), having too many variations in a single test, running “scaled-down” tests with a smaller audience, “adjusting” the As or Bs as test results come in, or, after the test is done, searching for the subgroup in which the test showed a bigger effect.
If your team tells you any one of the above looks “reasonable”, then you need someone with a statistics degree to run these tests. Otherwise, your A/B testing will mostly be telling you about random blips instead of actual effects.
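To make the early-stopping pitfall concrete, here is a minimal simulation sketch. It runs A/A tests (both arms share the same true conversion rate, so every “winner” is a false positive) and compares declaring a winner the first time any interim look shows p < 0.05 against testing only once at the planned end. The conversion rate, batch size, number of looks, and function names are illustrative assumptions, not taken from any real experiment.

```python
import math
import random


def two_sided_p_value(conv_a, n_a, conv_b, n_b):
    """Two-proportion z-test p-value, using the normal approximation."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (conv_a / n_a - conv_b / n_b) / se
    # For a standard normal Z, P(|Z| > |z|) equals erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


def run_aa_test(rng, rate=0.10, checkpoints=20, batch=100, alpha=0.05):
    """Simulate one A/A test with interim looks.

    Returns (winner_seen_at_any_look, significant_at_final_look).
    """
    conv_a = conv_b = n = 0
    winner_seen = False
    p = 1.0
    for _ in range(checkpoints):
        conv_a += sum(rng.random() < rate for _ in range(batch))
        conv_b += sum(rng.random() < rate for _ in range(batch))
        n += batch
        p = two_sided_p_value(conv_a, n, conv_b, n)
        if p < alpha:
            winner_seen = True  # a peeking team would have stopped here
    return winner_seen, p < alpha


def false_positive_rates(trials=500, seed=0):
    """Fraction of A/A tests declared significant under each stopping rule."""
    rng = random.Random(seed)
    results = [run_aa_test(rng) for _ in range(trials)]
    peeking = sum(seen for seen, _ in results) / trials
    fixed = sum(final for _, final in results) / trials
    return peeking, fixed


peeking_rate, fixed_rate = false_positive_rates()
print(f"stop at first 'winner': {peeking_rate:.1%}  single final test: {fixed_rate:.1%}")
```

The single final test should land near the nominal 5%, while the stop-at-first-winner rule reports a noticeably higher share of false positives even though nothing differs between the two arms.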
This is why many Data Scientists have an advanced degree in Applied Statistics (or similar), to the point that some faculty still argue Data Science should be a branch of Statistics.
I am personally on the fence about that argument, though, because the other skills Data Scientists require are not trivial. One of them is Computer Science: data structures, processing, computational performance, and programming skills are all crucial. A Data Scientist has to be a very competent programmer who is proficient in several programming languages and has deep enough knowledge of how to use computers efficiently. When working on a big set of data, knowing how computers work matters a great deal; it is quite easy to see a performance difference of dozens if not hundreds of times depending on how the algorithms are implemented and how the data is prepared. While algorithmic complexity is a math problem, implementation performance is very much a Computer Science issue.
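As a small illustration of how much the implementation alone can matter, the sketch below answers the same question (how many records match a list of known IDs) in two ways that differ only in the data structure used for lookups. The data is synthetic and the function names are made up for this example; the exact speedup depends on the machine and the data size.

```python
import random
import time


def count_matches_list(records, known_ids):
    """Membership test against a list: every lookup scans the list."""
    return sum(1 for r in records if r in known_ids)


def count_matches_set(records, known_ids):
    """Same logic, but lookups go through a hash set built once up front."""
    known = set(known_ids)
    return sum(1 for r in records if r in known)


def timed(fn, *args):
    """Return (result, elapsed_seconds) for a single call."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


rng = random.Random(0)
known_ids = [rng.randrange(1_000_000) for _ in range(3_000)]
records = [rng.randrange(1_000_000) for _ in range(3_000)]

slow_result, slow_seconds = timed(count_matches_list, records, known_ids)
fast_result, fast_seconds = timed(count_matches_set, records, known_ids)
print(f"list: {slow_seconds:.4f}s  set: {fast_seconds:.4f}s  "
      f"same answer: {slow_result == fast_result}")
```

Both functions return the same count; only the time differs, and the gap widens as the inputs grow because the list version does work proportional to the product of the two sizes.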
Some of the most important statistical advances of the past several decades were partially the consequence of computers being used in this field. R, Matlab, or similar tools are nearly inseparable from machine learning these days. And the Data Pipeline, including Data Engineering, is mostly a programming exercise: you want data to be “prepared” so that, from its raw form, it can be cleansed, structured, pre-processed, or engineered in a way that maximizes its intrinsic value and minimizes the “noise”. These processes should also be automated once the insight is ready to be moved into production. So knowing which processing is “computer achievable” vs. “out of reach of computers” is quite important.
Now, with a Math and Computer Science background, one can be a Machine Learning researcher, studying which algorithms perform best on certain theoretical challenges. Such researchers are a very important part of the modern Machine Learning boom.
However, most Data Scientists will need a third pillar skill: domain knowledge. They don’t need to be the Subject Matter Expert, but they need to comprehend the full cycle of the process chain and value chain, and be able to make an educated judgement on resource allocation, areas of focus, and the meaning of data points, tags, and (to some extent at least) the rationale behind a predictive model.
One of the most fundamental study methods is “testing of hypotheses”. Most hypotheses should have some reasonable explanation behind them; it is VERY dangerous to treat “whatever the Data shows” as the true laws of “how things work”. If you don’t know why your data shows a particular effect, then you don’t know how long that effect will last or what the limits of the phenomenon are, and it would be irresponsible to apply that “finding” to future events.
One very common mistake I see managers and customers make is the perception that “if we have Data and Data Scientists, they should yield results even human experts never thought of, so there is no need for domain expertise any more.” This is a half-truth in which only the false half gets applied. While it is true that Data Science *might* reveal insights human experts have not thought of, such a “finding” should be treated as an indication that more related work needs to follow, instead of declaring “case closed, Machine Learning found rules in raw Data”. The further study does not need to be performed by your Data Scientist, but some Subject Matter Experts in that domain should work with the team to carry on the analysis.
The reality is that in some cases Machine Learning really does yield results that are hard for human experts to explain. That does not mean we need to throw these findings away, or limit ourselves to our current knowledge and competency in the field. In this situation, your Data Scientist is key to determining the level of testing, validation, generalization, and inference required, based on their Data Science training. Using a Machine Learning model whose insights go beyond the domain experts’ ability to explain calls for guidance, caution, and safeguards, and Data Scientists are expected to assess these caveats properly and to recommend and implement those measures.
Of course, a Data Scientist does not have to be an expert in your particular field. That is a luxury that might yield benefits, but the other two pillar skills are much more important: you definitely want your Data Scientist to be a real math and computer genius with enough inside knowledge of the field you operate in. The key is that your Data Scientist should know your field well enough to understand what your data means, to think about what relationships different tags could have, and to form and assess intuitions about causality. It would be great if your Data Scientist could also recommend areas for further study based on the insights they generate from the data, although most of the time such follow-up studies are done by SMEs.
Second, where do we find Data Scientists? Well, this is a question I generally refrain from discussing publicly, because it can be controversial. I’ll share my personal experience and point of view, but don’t limit your search to the places where I had success.
Some managers try to hire Data Scientists from other organizations. While the tried and true “hire people away from their current employer” is appealing, it usually does not work very well in this field. There are mainly two reasons for this: a) Data Science is relatively new and much of the work is still in progress, so we do not have a long history of successes yet and “past success” is not a good yardstick for measuring people’s competency; b) many data scientists who grew up in-house might not carry sufficient transferable skills.
Data Science only became a “job” category in the past decade. In less than 10 years, there have not been many opportunities for people to get sufficient on-the-job training (I am not saying it’s impossible, but it is rare, and that on-the-job training is hard to evaluate). So having gone through proper training is somewhat more important in Data Science than in many more mature fields (think about the MBA in the 80s and 90s). And the amount of knowledge I outlined above does require some advanced, degree-level math, which unfortunately is not very common in most practical jobs.
I personally prefer rigorously trained Data Scientists who have a degree in a relevant field. Typically I would look for a Master’s in Data Science, Machine Learning, Applied Statistics, etc. It is even more preferable if someone has two or more Master’s and/or PhD degrees in multiple fields (going back to the convergence of skills described above), for example, someone with a Master’s in Data Science and an MBA (e.g. when hiring for financial, business, or managerial fields), or someone with a Master’s in Applied Statistics and a Master’s in Physics.
So I have been looking at reputable schools that offer graduate programs in this field.
OK, now I might have offended a large group of talented people who may not have had an opportunity to go through a degree program. There are actually reputable degree programs that train Data Scientists in a relatively short period of time (some in as little as 1 year) and can be completed while working.
I am also open to hiring someone talented in two out of the three areas mentioned above and train them while under my supervision, especially if they are open to signing up for a degree program. That being said, I’m pretty sure most organizations would not have the luxury of being able to train Data Scientists in house – I found training someone for computer science is much easier (at least for me) than teaching them Math or Business — so I personally lean towards relaxing the programming side of the requirement — of course that’s just leveraging my own capability of mentoring.
Third, what should you look for in a Data Scientist candidate? While relevant degrees are good yardsticks, individual talent can vary significantly even among people who went through the same program, and evaluating talented candidates without a degree is even more challenging.
One of the reasons I give additional weight to a relevant degree is that completing a degree program not only “force-feeds” knowledge to the candidate, but also gives them an opportunity to use their “ability” to solve the problems the school throws at them. I should warn hiring managers, though: it has been repeatedly shown that GPA is not a good predictor of future work performance. However, if you treat it as classification rather than regression (degree or no degree, instead of how high the grades were), having a degree does correlate with ability in intellectually demanding fields. In many cases you can ask candidates for some of their relevant school projects; that is usually pretty helpful (at least if you have someone who can review them meaningfully).
Beyond the degree, you should try an innovative approach to look for the 2 traits mentioned above in a candidate: knowledge of the relevant field, and (more importantly) the ability to solve problems with extended knowledge. Extended knowledge refers to the knowledge someone can put to use, both what is already in their head and what they can acquire reasonably quickly. That’s why I don’t mind candidates using the internet during “interviews” (of course, this also means more work in setting up the interview challenges and environment).
I have been a vocal critic of the old HR hiring approach that puts *excessive* emphasis on “previous experience and past success”. This is not only because Data Science as a field is relatively new and evolving fast, but also because, without knowing the actual details of past work settings, it can be quite meaningless to use them as a measure of “experience”. In Data Science terms, that’s too small a sample (how many Data Science jobs could your candidate have held?), with selection bias (they tell you what they’ve done rather than showing you the full spectrum of their work), and the hypothesis (that this applicant is a good fit) cannot be tested with sufficient confidence given the density of the information provided in the application process.
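The “too small a sample” point can be put in numbers with a rough sketch. Treating each past project as an independent success-or-failure trial (a strong simplifying assumption, and one that ignores the selection bias mentioned above), a Wilson score interval shows how little a short, spotless track record pins down. The function name and the example counts are hypothetical.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Approximate 95% Wilson score interval for a success proportion."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    denom = 1 + z ** 2 / trials
    center = (p_hat + z ** 2 / (2 * trials)) / denom
    half_width = z * math.sqrt(
        p_hat * (1 - p_hat) / trials + z ** 2 / (4 * trials ** 2)
    ) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)


# A candidate reporting 3 successful projects out of 3, versus 30 out of 30.
for successes, trials in [(3, 3), (30, 30)]:
    low, high = wilson_interval(successes, trials)
    print(f"{successes}/{trials} successes: plausible success rate {low:.2f} to {high:.2f}")
```

With only three reported projects, the interval is wide enough to include a success rate below one half, which is why a handful of past wins on a resume says very little by itself.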
In my more than two decades in the information industry, I have kept hearing “horror stories” where decades of experience turned out to be “decades of failure experience” after a candidate had been hired. Even if previous success was tangible, remember the good old disclaimer “Past Performance Is Not Indicative of Future Results”? If you wouldn’t bet your nest egg on past performance, why would you stake your organization’s future on it?
“Then what should I do?” you ask. Well, it’s actually no harder than hiring many other professionals. To assess someone’s ability to acquire new knowledge, you don’t have to make them demonstrate it in every field. Pick the field you are most familiar with and assess their ability to learn something new in order to solve a problem in that field. I expect you have some skill in evaluating their approach, way of thinking, ability to dissect problems and isolate cause and effect, and instinct for prioritization. Hopefully, your team has some recent problems it felt proud of solving (maybe not even fully solved yet) in one of the 3 areas we talked about above: computer science, math, or domain knowledge (it would be even better if your problem covers more than one of them). Discuss the case with the candidate, encourage them to share their analysis with you, and see whether she or he provides any meaningful insights your current team has not thought about (or has thought about already, but feels it needs more of that kind of ability).
For knowledge evaluation, you should use it as an indicator of the candidate’s ability to learn new things, their passion for the field, and how well they stay on the cutting edge of this ever-evolving field of innovation. In particular, I would recommend familiarizing yourself with the Machine Learning field and its progress, because this is one of the most important parts of a Data Scientist’s responsibility, and it covers both the Computer Science portion (mainly programming, computing efficiency, performance tuning, etc.) and the Math portion (mainly algorithms, model tuning, and result analysis). One key thing about Data Science is that it requires integrating lots of tools and skills to solve a problem. There is no single language that is the holy grail of Data Science. A key quality Data Scientists need is the ability to acquire new skills on the fly, so someone who is not intimidated by constantly learning new things, and who is passionate about staying on the cutting edge and integrating multiple sources of knowledge and skills, is a “gem” in the field of Data Science.
For applying knowledge to solve problems, I prefer to set up a virtual environment (typically simple VMs, Docker containers, etc.) and let the candidate solve a recently found problem. I don’t really care about the result; what’s more important to me is to observe their approach, working method, and ability to analyze (which is way more important than remembering certain terms). If you really need an example idea, download the MNIST dataset, remove all the “7”s from the training set, but keep them in the testing set. Don’t tell the candidate you removed the “7”s from the training set; let them troubleshoot what went wrong. Again, don’t pay too much attention to the result (this is too small a challenge for a Data Scientist), but pay more attention to how they approach the issue, how they analyze it, how they consume external knowledge sources, and how they try to learn new things (or review old knowledge).
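For the interviewer’s reference, one diagnostic a candidate might reasonably reach for in the MNIST exercise is a comparison of the label distributions of the two sets. The sketch below uses randomly generated digit labels as a stand-in for the real MNIST labels (loading the actual dataset is left out on purpose), and the helper name is made up for this example.

```python
import random
from collections import Counter


def missing_classes(train_labels, test_labels):
    """Return labels that occur in the test set but never in the training set."""
    train_counts = Counter(train_labels)
    test_counts = Counter(test_labels)
    # Counter returns 0 for labels it has never seen.
    return sorted(label for label in test_counts if train_counts[label] == 0)


# Synthetic stand-in for MNIST labels: digits 0-9, with every 7 removed
# from the training split but kept in the test split.
rng = random.Random(0)
all_labels = [rng.randrange(10) for _ in range(5_000)]
train_labels = [y for y in all_labels[:4_000] if y != 7]
test_labels = all_labels[4_000:]

print("classes missing from training:", missing_classes(train_labels, test_labels))
```

A candidate who checks per-class counts or per-class accuracy early, before tuning the model, is showing the kind of approach this exercise is meant to surface.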
For domain knowledge, you are supposed to know more than me. My advice is to remember that you are not hiring an SME (Subject Matter Expert) in this domain. A Data Scientist is required to comprehend and discuss inputs, outputs, independent variables, and dependent variables with SMEs, and maybe even point out more meaningful features, missing data, etc. I have seen too many cases where the Data Scientist and the domain expert can *not* communicate; it sounds as if one is from Jupiter and the other from Mercury. Data Scientists are not here to replace human domain expertise (yet?!), and skills in data alone are not where value is created. Despite news coverage of how Data Analysis found things domain experts didn’t know, most likely the findings were based on previous domain knowledge and were only achieved because that domain knowledge was successfully “translated” into data, data engineering, and initial models. Try to explain something new in your domain, and assess how the candidate digests this new knowledge and converts it into the data world. Often a good Data Scientist can actually “teach” your long-time domain expert a thing or two from the data. For example, your Data Scientist should be able to pick out anomalies in your data: if the RPM sensor reading that comes back from your engine is negative, is that possible? Were there problems before the data even got collected? If your data collection was problematic, there is likely not much your Data Scientist can do without running into problems.
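The negative RPM example boils down to a plausibility check that can run before any modeling starts. Here is a minimal sketch; the sensor name, the range limits, and the `Reading` type are all hypothetical, and in practice the limits should come from the domain expert rather than from the Data Scientist’s guess.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reading:
    sensor: str
    value: float


# Hypothetical plausible ranges; real limits belong to the domain expert.
PLAUSIBLE_RANGES = {"engine_rpm": (0.0, 10_000.0)}


def implausible_readings(readings, ranges=PLAUSIBLE_RANGES):
    """Return readings outside the plausible range for their sensor.

    Sensors without a configured range are not flagged. NaN values are
    flagged because they fail the range comparison.
    """
    flagged = []
    for reading in readings:
        low, high = ranges.get(reading.sensor, (float("-inf"), float("inf")))
        if not (low <= reading.value <= high):
            flagged.append(reading)
    return flagged


sample = [
    Reading("engine_rpm", 2_400.0),
    Reading("engine_rpm", -150.0),        # physically impossible for this sensor
    Reading("engine_rpm", float("nan")),  # sensor dropout
    Reading("oil_temp_c", 95.0),          # no range configured, left alone
]
for bad in implausible_readings(sample):
    print("check with the domain expert:", bad)
```

Flagged rows are questions for the SME, not values to delete silently: a negative reading may point to a wiring or collection problem that affects the rest of the data as well.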
Last but not least, why did I not provide a list of common interview questions? Because your candidates can search the web and reach this post as well, and if they are serious, they could prepare based on it. BTW, there are also tons of “machine learning interview questions” out there. If you want to search for and use them, I’d recommend the questions that ask which problems different algorithms are best suited to solve, how algorithms compare in certain situations (especially the shortcomings of different algorithms), and where the field is trending. The best use of interview questions in the field of Data Science is to identify “false positives”: someone who looks or sounds confident but actually knows nothing about this field at all. In many cases, such candidates might be genuinely ignorant of what a Data Scientist does, and mistakenly believe their previous experience dealing with data was “scientific” enough to call themselves a Data Scientist.
So by all means, go ahead and google “Machine Learning interview questions”. You might run into some candidates who are not able to answer any of the questions on the list, and you really should pass on them. But if someone can answer more than half of the questions you found online, try focusing on the recommendations in the previous sections.
After all, it’s a new field.