We spent four intense days together with around 2,000 other Data Science (DS) practitioners, enthusiasts and professionals at the Open Data Science Conference (ODSC) in London, 19-22 November 2019. It is one of the biggest conferences in the world niched in DS, with a real focus on the practitioners of DS.
These are our key takeaways from some action-packed days, condensed from way too much information to digest!
1. Why aren’t our Data Scientists producing any value?
We were lucky to see Cassie Kozyrkov (@kozyrkov), Chief Decision Scientist at Google, speak. Not only is she a great public speaker, but more importantly she has a great perspective on how to make a Data Science team effective and useful.
We want to be data-driven, so we hired a couple of Data Scientists. Why aren’t they producing any value?
Do you recognise the above statement? Unfortunately, it's very common that the output of the DS team doesn't live up to the expectations of the organisation. Cassie has a great analysis of the problems of Data Science (see this post); at the conference, however, she focused on two of them.
- Data Scientist is too broad and vague a position. It's really an umbrella term for numerous positions.
- We forget to educate Data Science Leaders.
Analyst, Statistician & ML Developer
Many employers believe that a Data Scientist will solve all things data-related. Cassie claims that “Data Scientist” really is split into numerous positions. These positions are all vital during different parts of the DS process, and the people in these roles work in very different ways.
The role of the Data Analyst is to be the explorer of the data. By exploring, visualising and correlating data, this person will identify questions and provide preliminary answers. In summary, the Data Analyst chooses speed over precision and goes broad but shallow. Cassie has a great post about the importance of the Data Analyst.
The role of the Statistician is to pick up where the Data Analyst left off and answer the identified questions. The Statistician is precise and will use proven statistical techniques to confirm the answers. In summary, this person goes narrow but deep.
The role of the ML Developer is to build models and put them into production. These models create predictions directly related to the questions that the Data Analyst initially identified and Statistician answered. In summary, the ML Developer also goes narrow, and even deeper.
Each of the three data science disciplines has its own excellence. Statisticians bring rigor, ML engineers bring performance, and analysts bring speed. – Cassie Kozyrkov
So when a company hires a single Data Scientist, do they expect that person to be able to do all of the above? That's a lot of tasks for one person…
Data Science Leader
Cassie also stressed that many DS organisations are missing Data Science Leaders. The role of the Data Science Leader isn't to be the best at any one of the aforementioned roles, but to be adequate in them all.
Instead, the DS Leader translates the needs and requirements of the organisation to the DS team. Similarly, this person helps ensure that the output of the DS team is received and interpreted as intended by the organisation. In summary, the DS Leader ensures that the DS team produces actual value for the organisation. Cassie has a post about this too.
2. Everyone wants to be the ML platform provider of choice
Only a small portion of all ML models ever make it to production. Once you've finished the tedious process of building a data pipeline of ETL and transform jobs, you have to identify the relevant features for your problem. This is followed by finding an ML model that works well enough for your use case, and finally comes another tedious part: optimising your model without overtraining it.
Then you put it into production. After a while you realise that the model is producing strange predictions. It is biased or has some sort of bug that needs fixing. Somehow you need to unravel the black box and understand what is causing your model to behave the way it does. You have to go back and repeat parts of the previous process…!
The entire ML workflow is long and difficult. You need an automated and transparent ML CI/CD platform. However, building this from scratch is a huge commitment.
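The steps above (transform, feature selection, model choice, and tuning without overtraining) can be sketched in a few lines of scikit-learn. This is a minimal illustration on synthetic data; the dataset, feature selector and model are placeholders for your own, not recommendations:

```python
# A minimal sketch of the workflow described above: build a pipeline
# (transform -> feature selection -> model), tune it with cross-validation
# to avoid overtraining, and evaluate on held-out data. The synthetic
# dataset stands in for the output of a real ETL pipeline.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipeline = Pipeline([
    ("scale", StandardScaler()),           # transform step
    ("select", SelectKBest(f_classif)),    # identify relevant features
    ("model", LogisticRegression(max_iter=1000)),
])

# Cross-validated grid search tunes the pipeline without overtraining it.
search = GridSearchCV(pipeline, {"select__k": [5, 10, 20],
                                 "model__C": [0.1, 1.0, 10.0]}, cv=5)
search.fit(X_train, y_train)
accuracy = search.score(X_test, y_test)
print(f"held-out accuracy: {accuracy:.2f}")
```

Even this toy version shows why the workflow is tedious: every box in the pipeline is a decision you may have to revisit once the model misbehaves in production.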
ML platform products
Only a few companies have the resources to build an ML platform from scratch, and unless they have very specific requirements, even fewer should.
The value in ML is in the actual output of the models. Unless it's your business, the ML model or ML platform itself isn't where the value lies. Instead, consider using one of the already available products that facilitate some part of the process for you.
At ODSC there were many products on display and demoed at sessions. There is a range of ML platforms, both open-source and proprietary, that can help in different categories: from data orchestration and data pipelines, to CI/CD and model hosting, to feature extraction and Auto-ML. There was an abundance of (especially proprietary) products being advertised at the conference. Some examples are listed below.
|Platform|Type|Notes|
|---|---|---|
| |High Level Auto-ML|Currently seems to have the largest market share within this type.|
|IBM Watson|High Level Auto-ML|Proprietary.|
|Paperspace|High Level Auto-ML|Proprietary.|
| |Data & ML Pipeline|Originally built by Netflix.|
| |Data & ML Pipeline|Originally built by Google.|
|Databricks|Data Analytics & Transformation|Proprietary.|
| |Data Orchestration|Originally built by Airbnb.|

|Type|Description|
|---|---|
|High Level Auto-ML|Picks the best model for your problem/data. Can host your model for use in production and handle version control. Generally offers limited custom control.|
|Data & ML Pipeline|Provisions pipelines/flows from data input to ML model. Built on connecting containerised jobs in a scalable cluster.|
|Data Analytics & Transformation|Specialised for easily accessing and analysing large amounts of data.|
|Data Orchestration|Triggers and manages data jobs and data-job dependencies.|
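To make the "Data & ML Pipeline" and "Data Orchestration" categories concrete: at their core, these platforms run jobs in dependency order. Here is a toy sketch using only the Python standard library; the job names are made up, and real platforms add scheduling, retries, and containerised distributed execution on top of this idea:

```python
# A toy illustration of what an orchestration platform does at its core:
# given jobs and their dependencies, run each job only after the jobs it
# depends on have finished. The job names here are invented examples.
from graphlib import TopologicalSorter

# job -> set of jobs it depends on
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "train_model": {"transform"},
    "report": {"transform"},
}

jobs_run = []

def run(job):
    # A real orchestrator would launch a containerised job here.
    jobs_run.append(job)

for job in TopologicalSorter(dependencies).static_order():
    run(job)

print(jobs_run)
```

Everything else these platforms offer (UIs, monitoring, backfills, cluster scaling) is built around this dependency-ordered execution.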
3. Visualising your data in a data-illiterate world is hard
The more you work with a dataset, the better you get at understanding what the data means, and at understanding data in general. This is true for data in its raw form, but even more so for visualised data. However, the average person is not good at interpreting data. A majority of people are data-illiterate.
We went to a great session by Alan Rutter, founder of Fire Plus Algebra. He stressed the importance of how you visualise data so that the message is delivered as intended. With the risk of grossly oversimplifying his message, here comes a summary!
Data Literacy & Motivation
Make sure to visualise your data so that the intended message comes across. Instead of choosing the latest and coolest diagram, choose the easiest-to-understand diagram. Every new type of diagram must be explained properly, and each one adds a risk of misunderstanding. (The page datavizcatalogue.com has collected many cool diagrams for visualising data.)
Another really important thing is that different people have different motivators in their life and business, and will interpret data accordingly. For example, C-level people might focus on profits and margins, whilst a development team focuses more on the actual features. Keep that in mind when visualising your data, so that the receiver reacts in the way you intend.
If the data is not presented in a relatable way, it will be more difficult for the user to take in the message. By presenting your data in a context that the receiver can actually grasp and relate to, the likelihood of it being well-received goes up.
An effective way to make the data more relatable is by introducing a “me-layer”, the possibility of injecting the receiver into the data. As soon as the receiver can visualise herself within the data, it will become easier for her to relate to it.
Adding imagery is another way of making your data easier to understand. An image can provide humour, a context, or simply just a pleasing design. However, make sure to only use an image that is really connected to what you’re trying to present. And that isn’t a stock image!
Alan Rutter gave the example of how The Guardian has changed the way that they use images in their articles. For example, they used to show pictures of polar bears on melting ice caps when talking about climate change. However, now The Guardian uses pictures of humans portrayed in environments that have been affected by climate change. Even if it’s sad that a polar bear is losing its ice, the majority of people will probably never see a polar bear. Pictures of humans are more relatable, and therefore more emotionally engaging.
This one is probably not new to you, but it's so important that it's worth bringing up. Decide what message you want to convey, and emphasise that. You're the expert on the data you're visualising; you have the power to extract the key information. Use a classic storytelling structure of bringing the receiver from point A to point B to get the key information across.
- Grab the attention of your receiver. Do this visually and/or with something relatable.
- Extract the key information. Simplify the data, and emphasise it. Details and follow-ups come later.
- Adapt the key information to your receiver. Depending on the motivation of the receiver, present the key information differently.
- Have separate views with details. Once the key information has been presented, the receiver can look at the details if she wishes. The detailed views can show how the data was collected, more in-depth breakdowns, or even the raw data.
Another good practice is to always have over-explicit titles. Catchy titles are often unclear and will leave room for interpretation.
Consider whether there is any uncertainty in the data or message. If there is, consider whether it's vital to convey to the receiver as well. However, this should not be done in the key information, but rather in the details section. If there is a way to visualise the uncertainty, do so! This will make it easier for the receiver to understand.
Conclusion & extra shout-out
In the majority of the sessions we dove deep into some of the latest techniques within ML. We went to workshops on Scikit-learn with its core developer Andreas Müller. In addition, we met the founders of Data Provenance, who told us about the importance of data lineage. Together with researchers in the field, we also worked our way through notebooks covering the latest within NLP.
However, the best part was to meet people from around the world (though primarily from Europe) that work within the world of DS. It’s a hot topic, and it is evolving and growing extremely fast. Let’s do our best to keep up.
And seeing The Book of Mormon & The Lion King was also pretty sweet…! See you at the next edition of ODSC!