It probably comes as no surprise that, as in many other careers, data scientists, mathematicians and statisticians face numerous day-to-day challenges: for instance, lack of data, difficulties with version control, poor data quality, and model overfitting. All of these can affect the quality of the results, and they are the issues most people associate with the term “data scientist” or “statistician”.
However, one could argue that there is a more pertinent challenge, one that we face almost every day and certainly on every project, and that even good data scientists may not appreciate: context is everything.
To explain, I am going to use a well-known example from World War II, taken from the book “How Not to Be Wrong” by Jordan Ellenberg, along with extracts from “Black Box Thinking” by Matthew Syed.
In 1943 the US military were losing too many aircraft to enemy fighters. The planes were often subjected to enemy fire, and the military knew that armour was the answer. The problem was that too much armour makes the plane heavier, and heavier planes are less manoeuvrable and use more fuel. Equally, armouring the planes too little leaves them vulnerable to enemy hits. Somewhere, there is an optimum.
A Hungarian mathematician by the name of Abraham Wald was tasked with determining the precise level and position of the armour required.
In order to help Abraham, the military provided him with data they thought might be useful. Often, when the planes returned, they were covered in bullet holes, so they provided data on the number of bullet holes per section of plane, indicating the areas that were most commonly hit by enemy fire.
Immediately, the military sought to strengthen the most commonly damaged parts of the planes to reduce the number that were shot down. This meant that the additional armour would mainly cover the fuselage.
Abraham instantly disagreed. “The armour shouldn’t go where the bullet holes are; it should go where the bullet holes aren’t: around the engines and cockpit.”
He had realised that the military were only considering data from the planes that actually returned. The reason planes were coming back with fewer hits to the engine was that the planes that got hit in the engine weren’t coming back! The bullet holes revealed the parts of the aircraft that demanded the least amount of armour.
Jordan Ellenberg goes further to explain that “if you were to visit the recovery room at a battlefield hospital, you’d see many more people with bullet holes in their limbs than people with bullet holes in their chests. That’s not because people don’t get shot in the chest; it’s because the people who get shot in the chest don’t typically make the recovery room”.
The absence of information is often not information of absence. The key is sometimes in the missing data. These are examples of what is called ‘survivorship bias’.
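The effect is easy to reproduce with a small simulation. The sketch below is purely illustrative: the section names, the per-hit loss probabilities and the number of hits per plane are assumptions made up for the example, not historical figures. Every section is equally likely to be hit, yet the planes that return show far fewer engine and cockpit hits, simply because hits there are more likely to bring the plane down.

```python
import random

# Hypothetical probability that a single hit to each section downs the plane.
# These values are invented for illustration; they are not historical data.
LOSS_PROB_PER_HIT = {"engine": 0.6, "cockpit": 0.5, "fuselage": 0.05, "wings": 0.05}
SECTIONS = list(LOSS_PROB_PER_HIT)


def simulate(n_planes=10_000, hits_per_plane=4, seed=0):
    """Return (hits on all planes, hits on returning planes) per section."""
    rng = random.Random(seed)
    all_hits = {s: 0 for s in SECTIONS}
    returned_hits = {s: 0 for s in SECTIONS}
    for _ in range(n_planes):
        # Enemy fire lands on every section with equal probability.
        hits = [rng.choice(SECTIONS) for _ in range(hits_per_plane)]
        # The plane returns only if it survives every individual hit.
        survived = all(rng.random() >= LOSS_PROB_PER_HIT[s] for s in hits)
        for s in hits:
            all_hits[s] += 1
            if survived:
                returned_hits[s] += 1
    return all_hits, returned_hits


all_hits, returned_hits = simulate()
for s in SECTIONS:
    print(f"{s:9s} hits on all planes: {all_hits[s]:6d}   "
          f"hits on returned planes: {returned_hits[s]:6d}")
```

Someone who only sees the second column would conclude that the engines are rarely hit. The first column, which nobody on the ground can observe, shows that they are hit just as often as everything else.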
A good data scientist will carefully handle data. They will look for relationships and correlations and apply machine learning models in their sleep. A great data scientist will question the data. They will separate fact from assumption as they know the importance that this has on the outcome. They will strive to understand all of the potential variables and use domain knowledge and expert opinion to apply logic and reasoning to the result. A great data scientist not only knows that the model is robust, but that it makes sense in the context of the original problem.
To illustrate, our data scientists spend a lot of time analysing water meter data, and it is not uncommon to have to look at distributions of flow to inform our analyses. It is typical to see no recorded flow below a certain threshold. A good data scientist will look at the distribution and spend time checking units, outliers and erroneous data before deciding whether the data is valid, and may conclude from this that the threshold is real. A great data scientist will spend extra time understanding why such a threshold is possible. Is it that households have a near-zero chance of using water at such low flow rates? It is more likely that there is a physical limitation to the meter, so that flow this low is not detected. Both scenarios explain the data, but only the latter makes sense in the context of the data.
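The same idea can be sketched in a few lines. Everything in the example below is an assumption chosen for illustration: the exponential shape of the true flows, the litres-per-hour units and the 10 l/h minimum detectable flow do not describe any particular meter or dataset. The point is only that a meter which cannot register low flows produces exactly the kind of empty band described above, even though the low flows exist.

```python
import random

# Hypothetical minimum flow the meter can register, in litres per hour.
START_FLOW_L_PER_H = 10.0


def true_flows(n, seed=1):
    """Simulated real flows: many small ones (drips, slow fills), few large."""
    rng = random.Random(seed)
    return [rng.expovariate(1 / 60.0) for _ in range(n)]  # mean of 60 l/h


def metered(flows, start_flow=START_FLOW_L_PER_H):
    """What the meter records: anything below the start flow reads as zero."""
    return [f if f >= start_flow else 0.0 for f in flows]


def count_in_band(flows, low, high):
    """Number of readings strictly between low and high."""
    return sum(1 for f in flows if low < f < high)


actual = true_flows(5000)
recorded = metered(actual)

print("true flows below threshold:    ",
      count_in_band(actual, 0.0, START_FLOW_L_PER_H))
print("recorded flows below threshold:",
      count_in_band(recorded, 0.0, START_FLOW_L_PER_H))
```

The recorded data alone cannot distinguish "nobody uses water this slowly" from "the meter cannot see it". Only knowledge of how the meter works settles the question.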
At Artesia, our data scientists are great data scientists because of the emphasis we place on the context of the data in delivering the right solution. We achieve this through collaboration with our industry experts, who are equally important to our success. By looking beyond the data, challenging our assumptions and separating fact from assumption, we are able to ensure that our solutions are not only robust, accurate and technically sound (all the things one expects from a data scientist) but also logical, and that they make use of every available source of information, including those that may not initially seem relevant. Sometimes it is the missing information that makes all the difference to understanding what is truly going on.