About two years ago I was tasked with building a data science team at Statnett, a government owned utility. Interesting data science tasks abounded, but most data science work was done by consultants or external parties as R&D projects. This approach has its merits, but there were two main reasons why I advocated building an in-house data science team:
- Domain specific knowledge as well as knowing the ins and outs of the company is important to achieve speed and efficiency. An in-house data science team can do a proof of concept in the time it takes to hire an external team. And an external team often needs a lot of time to get to know the data sources and domain before they start delivering. It is also a burden for the few people who teach new consultants the same stuff time and again, they can become a bottleneck.
- Data science is about making the company more data driven, it is an integral part of developing our core business. If we are to succeed, we must change how people work. I believe this is easier to achieve if you are intimately familiar with the business and are in it for the long haul.
But how to build a thriving data science department from scratch? Attracting top talent can be tough, especially so when you lack a strong professional environment where your new hires can learn and develop their skills.
Build on talent from the business
Prior to establishing the data science team, I lead a team of power system analysts. These analysts work in the department for power system planning, which has strong traditions for working data driven. Back in 2008 we adopted Python to automate our work with simple scripts; we didn’t use version control or write tests. Our use of Python grew steadily and eventually we hired a software development consultant to aid in the development and help us establish better software development practices.
This was long before I was aware that there was anything called data science and machine learning. However, the skillset we developed – a mix of coding skills, domain expertise and math – is just what Drew Conway described as the “primary colors” of data science in his now famous Venn diagram from 2010.
A background as an analyst with python skills proved to be a very good starting point for doing data science. The first two data scientists in Statnett came from the power system analysis department. I guess we were lucky to have started coding python back in 2008, as this coincided perfectly with the rise of python as the programming language for doing data science.
Still, most large companies have some sort of business analyst function, and this is a good place to find candidates for data science jobs. Even if they are unfamiliar with (advanced) coding, you get people that already know the business and should know basic math and statistics as well.
Hire for aptitude rather than skill
Having two competent analysts to kickstart the data science department also vastly improved our recruitment position. In my experience it is important for potential hires to know that they will have competent colleagues and a good work environment for learning. Therefore, Øystein, one of the “analyst-turned-data scientists”, attended all interviews and helped develop our recruitment strategy.
Basically, our strategy was to hire and develop fast learners, albeit with some background in at least two of the “bubbles” from the Venn diagram. Preferably the candidates should have complementary backgrounds. E.g. some strong coders, some more on the statistics side and some with a background in power systems.
We deliberately didn’t set any hard skill requirements for applicants. Requiring experience in machine learning or power systems would have narrowed the field down too much. Also, the technical side of data science is in rapid development. What is in vogue today, will be outdated in a year. So, it makes more sense to hire for aptitude rather than look for certifications and long experience with some particular technology. This emphasis on learning is also more attractive for the innovative talent that you need in a data science position.
Use interview-cases with scoring to avoid bias
During the first four months we conducted about 60 interviews with potential candidates. I had good experience using tests and cases in previous interview situations. First off, this forces you to be very specific about what you are looking for in a candidate and how to evaluate this. Secondly, it helps to avoid bias if you can develop an objective scoring system and stick to it. Thirdly, a good selection of tailored cases (not standard examples from the internet) serves to show off some interesting tasks that are present in the company. We used the ensuing discussions to talk about how we approached the cases in our company, thus advertising our own competence and work environment.
Such a process is not without flaws. Testing and cases can put some people off, and you are at risk of evaluating how candidates deal with a stressful test situation rather than how they will perform at work. Also, expect to spend a lot of time recruiting. I still think the benefits clearly outweigh the downsides. (Btw, maybe an aspiring data scientist should expect a data drive recruitment process).
We ended up with a set of tasks and cases covering:
- Basic statistics/logic
- Evaluate a data science project proposal
- How to communicate a message with a figure
- Statistical data visualization
- Coding skills test (in language of choice)
- Standardized aptitude tests
We did two rounds of interviews and spent about 4 hours in total per candidate for the full test set. In addition, some of the cases were prepared beforehand by the candidate. The cases separated well and in retrospect there was little need for adjustments. In any case, it would have been problematic to change anything once we started: How you explain the tasks, what hints you give etc. should be consistent from candidate to candidate.
Focus on a few tasks at a time
The first year of operation has necessarily been one of learning. Everyone has had to learn the ropes of our new infrastructure. We have recently started using a Hadoop stack for our streaming data and data at rest. And we have been working hard to get our CI/CD pipelines working and learning how to deploy containerized data science application on Open Shift.
Even though the team has spent a lot of time learning the technology and domain, this was as expected. I would say the major learning point from the first year is one of prioritization: We took on too many tasks at once and we underestimated the effort of getting applications into production.
As I stated initially, interesting data science tasks abound, and it is hard to say no to a difficult problem if you are a data scientist. When you couple this with a propensity to underestimate the effort needed to get applications into production (especially on new infrastructure for the first time), you have a recipe for unfinished work and inefficient task switching.
So, we have made adjustments along the way to focus on fewer tasks and taking the time needed to complete projects and get them in production. However, this is a constant battle; there are always new proposals/needs/problems from the business side that are hard to say no to. At times, having a team of competent data scientists feels like a double-edged sword because there are so many interesting problems left unsolved. Setting clear goals and priorities, as simple as that sounds, is probably where I can improve the most going forward.
Get strong business involvement
I would also stress that it is paramount to have strong presence from the business side to get your prioritizations right and to solve the right problem. This holds for all digital transformation, when in my experience, the goal often has to be adjusted along the way. The most rewarding form of work is when our deliveries are continuously asked for and the results critiqued and questioned.
On the other hand, one of the goals with the data science department was to challenge the business side to adopt new data driven practices. This might entail doing proof of concepts that haven’t been asked for by the business side to prove that another way of operating is viable.
It can be difficult to strike the right balance between the two forms of work. In retrospect we spent too much time initiating our own pilot projects and proof of concepts in the beginning. While they provide ample learning opportunities – which was important during our first year – it is hard to get such work implemented and it is a real risk that you solve the wrong problem. A rough estimate therefore is that about 80 % of our work should be initiated by and led by the business units, with the representative from the data science team being an active participator, always looking for better ways to solve the issues at hand.
Scaling data science competency
From the outset, we knew that there would be more work than a central data science department could take on. Also, to become a more data-driven organization, we needed to develop new skills and a new mindset throughout the company, not only in one central department.
Initially we started holding introductory Python classes and data science discussions. We had a lot of colleagues who were eager to learn, but after a while it became clear that this approach had its limits. The lectures and classes were inspiring, but for the majority it wasn’t enough to change they way they work.
That’s why we developed a data science academy: Ten employees leave their work behind for three months and spend all their time learning about data science. The lectures and cases are tailored to be relevant for Statnett and held mainly by data scientists. The hope is that the attendees learn enough to spread data science competence in their own department (with help from the us). In such a way we hope to spread data science competency across the company.
What have we achieved?
At times I have felt that progress was slow. Challenges with data quality have been ubiquitous. Learning unfamiliar technology by trial and error has brought development to a standstill. Very often, projects that we think are close to production ready have weeks left of work. This has coined the term “ten minutes left” meaning an unspecified amount of work, often weeks or months left.
But looking back, the accomplishments become clearer. There is a saying that “people overestimate what can be done in the short term, and underestimate what can be done in the long term” which applies here as well.
Some of the work that we have done builds on ideas from our time working on reliability analysis in power systems and is covered here on this blog:
- Predicting failure probability of overhead lines
- Realtime classification of failure causes
- Probabilistic analysis of security of supply
The largest part of our capacity during the last year has been spent on building forecasts for system operation:
We have also spent a considerable effort on communication, both to teach data science to new audiences at Statnett, exchange knowledge and to promote all the interesting work that happens at Statnett. This blog, scientific papers, our data science academy and talks at meetups and conferences are examples of this.