With data scientist regularly topping the charts as one of the most in-demand roles globally, many organizations are increasingly turning to non-traditional employees to help make sense of their most valuable asset: data.
These so-called citizen data scientists, typically self-taught specialists in any given field with a penchant for analysis, are likewise becoming champions for important projects with business-defining impact. They’re often leading the charge when it comes to the global adoption of machine learning (ML) and artificial intelligence (AI), for example, and can arm senior leaders with the intelligence needed to navigate business disruption.
Chances are you’ve seen several articles from industry luminaries and analysts talking about how important these roles are for the future. But seemingly every opinion piece overlooks the most crucial challenge facing citizen data scientists today: collecting better data.
The most pressing concern is not about tooling or whether to use R or Python but, instead, something more foundational. By neglecting to address data collection and preparation, many citizen data scientists do not have the most basic building blocks needed to accomplish their goals. And without better data, it becomes much more challenging to turn potentially great ideas into tangible business outcomes in a simple, repeatable, and cost-efficient way.
Quality Data is at the Heart of ML Deployment
When it comes to how machine learning models are operationalized (or not), otherwise known as the path to deployment, we see the same three patterns crop up repeatedly. Often, success is determined by the quality of the data collected and how difficult it is to set up and maintain these models.
The first category occurs in data-savvy companies where the business identifies a machine learning requirement. A team of engineers and data scientists is assembled to get started, and these teams spend extraordinary amounts of time building data pipelines, creating training data sets, moving and transforming data, building models, and eventually deploying the model into production. This process typically takes six to 12 months. It is expensive to operationalize, fragile to maintain, and difficult to evolve.
The second category is where a citizen data scientist creates a prototype ML model. This model is often the result of a moment of inspiration, insight, or even an intuitive hunch. The model shows some encouraging results, and it is proposed to the business. The problem is that getting this prototype model into production requires all the painful steps highlighted in the first category. Unless the model shows something extraordinary, it is put on a backlog and is rarely seen again.
The last, and perhaps most demoralizing, category consists of ideas that never even get explored because of roadblocks that make them difficult, if not impossible, to operationalize. This category has all sorts of nuances, some of which are not at all obvious. For example, consider the data scientist who wants features in their model that reflect certain behaviors of visitors on their website or mobile application. How do they get that data? The answer is often to raise a change request with the IT team to tag the applications to collect it.
But of course, IT has other priorities, so unless the citizen data scientist can persuade the IT department that their project should rise to the top of its list, it’s not uncommon for such projects to face months of delays, assuming IT is willing to make the change in the first place.
To consolidate data collection and lay the foundation for advanced machine learning and data science projects, many companies are adopting technologies that make customer data more actionable across their digital properties. In fact, a recent survey of retail and brand marketers revealed that investing in a customer data platform (CDP) is their top tech priority. In doing so, they’re automating the most complicated and time-consuming processes that all too often sabotage even the most advanced citizen data scientists.
Avoiding Deployment Traps
By definition, citizen data scientists are not as well versed in the most technical aspects of data science as their professional counterparts. But what they may lack in technical expertise, they make up for with their subject matter expertise. And that insider knowledge of critical business processes and industry dynamics is a tremendous advantage when creating predictive models that are successful, innovative, and potentially business-defining.
With that in mind, technology that lowers the bar for experimentation, increases accessibility (with appropriate guardrails), and ultimately democratizes data science is worth considering. And companies should do everything they can to remove roadblocks that prevent data scientists from creating data models in a time-efficient and scalable way, including adopting CDPs to streamline data collection and storage.
But it’s up to chief information officers and those tasked with implementing CDPs to ensure the technology meets expectations. Otherwise, data scientists (citizen or otherwise) may continue to lack the building blocks they need to be effective.
First and foremost, data collection needs to be automated and tagless, because understanding visitor behaviors via tagging is effectively coding in disguise. Citizen data scientist experimentation is severely hampered when IT has to get involved to code changes to data layers. And while IT can and should be involved from a governance perspective, the key is that citizen data scientists must have automated collection systems in place that are both flexible and scalable.
Second, identity is the glue with which data scientists can piece together disparate information streams so that organizations can find true value. Thankfully, organizations have a myriad of identifiers about their customers to reference, including email addresses, usernames, and account numbers. And identity graphs can help organizations create order from chaos so that it becomes possible to identify visitors in real time, making these features essential for analyzing user behavior across devices.
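To make the identity graph idea concrete, here is a minimal sketch in Python. It is not how any particular CDP works. It assumes a simple rule that two identifiers seen in the same event belong to the same person, and it uses a union-find structure to merge them. The class name, the identifier prefixes, and the sample values are all hypothetical.

```python
from collections import defaultdict


class IdentityGraph:
    """Toy identity graph (illustrative only): identifiers observed
    together are merged into one profile using union-find."""

    def __init__(self):
        self._parent = {}

    def _find(self, identifier):
        # Register unseen identifiers as their own root.
        self._parent.setdefault(identifier, identifier)
        root = identifier
        while self._parent[root] != root:
            root = self._parent[root]
        # Path compression: point every node on the path at the root.
        while self._parent[identifier] != root:
            self._parent[identifier], identifier = root, self._parent[identifier]
        return root

    def link(self, *identifiers):
        """Record that these identifiers appeared in the same event."""
        roots = [self._find(i) for i in identifiers]
        for other in roots[1:]:
            self._parent[other] = roots[0]

    def same_person(self, a, b):
        return self._find(a) == self._find(b)

    def profiles(self):
        """Return each resolved profile as a set of identifiers."""
        groups = defaultdict(set)
        for identifier in list(self._parent):
            groups[self._find(identifier)].add(identifier)
        return list(groups.values())


# Hypothetical events; every value below is made up for illustration.
graph = IdentityGraph()
graph.link("email:ana@example.com", "cookie:abc123")   # login on the website
graph.link("account:0042", "email:ana@example.com")    # account sign-up
graph.link("device:tablet-9", "account:0042")          # login in the mobile app
graph.link("cookie:zzz999", "email:bo@example.com")    # a different visitor

print(graph.same_person("cookie:abc123", "device:tablet-9"))
print(len(graph.profiles()))
```

In this sketch the web cookie and the tablet are never seen together, yet they resolve to the same profile because each is linked to the same account. That transitive linking is what lets behavior be analyzed across devices. Real systems typically add safeguards this sketch omits, such as confidence scores and rules for shared devices.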
These components, together, lower the bar for citizen data scientists to reach their full potential. Because ultimately, it’s not factors like whether citizen data scientists have advanced degrees or are fluent in R that will determine their success. Instead, their success will often come down to whether their organizations have prioritized investment in the tools and technology that resolve the more fundamental constraints that limit their ability to experiment and create sustainable models.