dbt + Machine Learning: What makes a great baton pass?
Special Thanks: Emilie Schario, Matt Winkler
dbt has done a great job of building an elegant, common interface between data engineers, analytics engineers, and any other data-y role by uniting our work around SQL. This unification of tools and workflows creates interoperability between what would normally be distinct teams within the data organization.
I like to call this interoperability a “baton pass.” Like in a relay race, there are clear handoff points & explicit ownership at all stages of the process. But there’s one baton pass that’s still relatively painful and undefined: the handoff between machine learning (ML) engineers and analytics engineers.
In my experience, the initial collaboration workflow between ML engineering & analytics engineering starts off strong but becomes muddy during the maintenance phase, which eventually leads to projects becoming unusable and forgotten.
In this article, we’ll explore a real-life baton pass between ML engineering and analytics engineering and highlight where things went wrong.
By doing so, we can hopefully solve this breakdown by answering questions like:
- Do we need a better Jupyter notebook?
- Should we increase the SQL surface area to build ML models?
- Should we leave that to non-SQL interfaces (Python/Scala/etc.)?
- Does this have to be an either/or future?
Whatever the interface evolves into, it must center people, create a low bar and high ceiling, and focus on outcomes and not the mystique of high learning curves.
TL;DR: There’s an ownership problem in the ML engineering & analytics engineering workflow. Luckily, the Modern Data Stack is making this baton pass smoother. This post will walk you through a recent project where I saw firsthand how these systems can work together to provide models built for long-term accuracy and maintainability.
What does a baton pass look like today?
As an analytics engineer, I was paired with an ML engineer to determine when a customer would churn and what actions we could take to prevent it. We labored over a solution, the mechanics kind of worked, we presented to the business stakeholders who wanted this, and 1 month later we were hopeful but skeptical. New data changes caused model drift, so the ML engineer logged a ticket for the data engineer/analytics engineer to fix…3 months later, no one remembered we had done this. Does this sound familiar?
This happens because the “normal” way of doing things lacks long-term & explicit ownership. But how do these breakdowns happen?
Here’s what happened
After some initial planning, I knew we had this raw data living somewhere in our data warehouse. It was easy to make sense of this starting point for our work together. I wrote dbt transformations to massage this raw data and joined a couple of tables together based on intuition about which variables mattered: daily active usage, number of users, amount paid, historical usage, etc.
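For concreteness, here’s a rough sketch of what that kind of feature-building dbt model could look like. Every table and column name below is a made-up stand-in, not the actual project:

```sql
-- models/customer_features.sql (hypothetical names throughout)
with usage as (
    select
        customer_id,
        avg(daily_active_users) as avg_daily_active_users,
        count(distinct user_id) as number_of_users
    from {{ ref('stg_product_usage') }}
    group by 1
),

billing as (
    select
        customer_id,
        sum(amount_paid) as total_amount_paid
    from {{ ref('stg_invoices') }}
    group by 1
)

select
    usage.customer_id,
    usage.avg_daily_active_users,
    usage.number_of_users,
    billing.total_amount_paid
from usage
left join billing
    on usage.customer_id = billing.customer_id
```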
The ML engineer stepped in from here. She was used to doing her statistics and preprocessing in Python with pandas and scikit-learn. Before she opened up her Jupyter notebook, we had a heart-to-heart conversation and realized the same work could be done through dbt. Preprocessing could be handled by an open source dbt package, and there were plenty of others like it in the package registry.
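As a rough sketch of what “preprocessing in SQL” means in practice, here’s hand-rolled one-hot encoding and min-max scaling inside a dbt model; a preprocessing package wraps these same patterns up as reusable macros. The column names (including `plan_tier`) are assumptions for illustration:

```sql
-- models/customer_features_preprocessed.sql (hypothetical names)
select
    customer_id,

    -- one-hot encode a categorical column
    case when plan_tier = 'free'       then 1 else 0 end as plan_tier_free,
    case when plan_tier = 'pro'        then 1 else 0 end as plan_tier_pro,
    case when plan_tier = 'enterprise' then 1 else 0 end as plan_tier_enterprise,

    -- min-max scale a numeric column into [0, 1]
    (total_amount_paid - min(total_amount_paid) over ())
        / nullif(max(total_amount_paid) over () - min(total_amount_paid) over (), 0)
        as total_amount_paid_scaled
from {{ ref('customer_features') }}
```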
The ML engineer got the preprocessing steps (think: one-hot encoding, feature scaling, imputation) finalized. She used SQL to read the dbt models (tables) into a Jupyter notebook to perform model training. After iterating on the machine learning models and tracking model fit (think: AUC, precision, and recall for classification), she ran the model over the dbt-created tables and output the predicted results as a table in the database. To keep documentation clean, she configured a source within the dbt project to reflect this predicted results table. It wasn’t intuitive, but it was better than leaving it out of dbt docs.
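Once that predictions table is declared as a source, downstream dbt models (and dbt docs) can at least see it. Here’s a hedged sketch of an accuracy-monitoring model built on top of that source, with made-up source, table, and column names:

```sql
-- models/churn_prediction_accuracy.sql (hypothetical names)
with predictions as (
    -- the ML engineer's output table, declared in a sources .yml file
    select customer_id, predicted_churn, prediction_date
    from {{ source('ml_outputs', 'churn_predictions') }}
),

actuals as (
    select customer_id, churned
    from {{ ref('fct_customer_churn') }}
)

select
    predictions.prediction_date,
    count(*) as customers_scored,
    avg(case when predictions.predicted_churn = actuals.churned then 1.0 else 0.0 end)
        as accuracy
from predictions
left join actuals
    on predictions.customer_id = actuals.customer_id
group by 1
```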
Finally, she created a dashboard on top of this table to publicize model accuracy over time to end users. To schedule all of this, we went to the data engineer to string the above together in Airflow, running every day at 8am, and called it done.
Core Behaviors and Results (Where things went wrong)
Let’s parse out the jobs to be done we’re seeing in our story.
| Jobs to be Done | Intended Results | Role |
|---|---|---|
| Extracting and loading raw data into the database | SQL-ready data | Data Engineer |
| Transforming data for business user consumption | Data is coherent enough to make decisions from | Analytics Engineer, ML Engineer |
| Working with structured, tabular data | Unified and practical workflow | Analytics Engineer, ML Engineer |
| Working in SQL | Unified and practical workflow | Analytics Engineer, ML Engineer |
| Equipping users with documentation to understand how data outputs are created and how to use them | Trust and context for data | Analytics Engineer, ML Engineer |
| Testing transformed data outputs | Trusted data | Analytics Engineer |
| Testing preprocessing data outputs | Trusted data | ML Engineer |
| Training and deploying machine learning models (primarily in Python) | Make predictive decisions | ML Engineer |
| Testing and maintaining machine learning table outputs over time | Prove predicted results match reality even when data inputs change over time | Data Engineer, Analytics Engineer, ML Engineer ??? |
Our story starts with unity, develops into a baton pass, and ends with something that looks more like a game of hot potato. The build phase was something my fellow ML engineer and I were proud of. The maintenance and validation of the machine learning output was something we weren’t. That happened because we were missing something critical: we never agreed on who owned what in the long term (think: source data changes cascading through transformation -> preprocessing -> training -> performance monitoring). No wonder business users shrugged their shoulders at our results after 1 month; each of us assumed the other role would hold that baton forever.
Let’s make this simple:
- “Who makes data AND machine learning pipelines maintainable over time when data changes?”
If we get this question right, it makes answering this question easier: “How do we get people to use our work?”
How are tools evolving to heal the gap between analytics engineering & machine learning?
Let’s focus this question even more: What’s being done about the “maintenance over time” problem in our workflow? What would your team have done differently?
Build a better notebook experience that makes dbt and Python more interoperable
Gluing together notebooks and dbt isn’t the most elegant experience today. It’d be nice to schedule notebooks in sync with my dbt jobs, all so I can better diagnose ML model drift.
Hex
- Hex: Notebook experience with quality-of-life improvements. SQL query results can be read in as dataframes after running. dbt integration to verify data quality during development. Build parameterized data apps and give audiences one less context switch to a BI dashboard.
Modelbit
- Modelbit: Bring your own notebook, with one huge quality-of-life improvement: dbt can call versioned ML models as external functions in SQL.
How would this change my story?
My ML engineer would know the quality of the input data created by dbt before starting machine learning development. I could schedule this notebook in sync with my dbt jobs and know instantly whether ML model drift is caused by data quality or model logic. Also, I could create a data app (in Hex) where users plug different input scenarios into the predictive model. Even better, I could track versions of my ML models deployed over time in Modelbit + Hex and deploy ML external functions as dbt macros (by the way: how is this not more normal?!).
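To make that last idea concrete, here’s a hedged sketch of what “an ML external function as a dbt macro” could look like. It assumes a function named `ext_ml.predict_churn` has already been deployed to the warehouse (for example, from a Modelbit-hosted model); every name here is illustrative:

```sql
-- macros/predict_churn.sql (hypothetical macro wrapping a warehouse external function)
{% macro predict_churn(feature_columns) %}
    ext_ml.predict_churn({{ feature_columns | join(', ') }})
{% endmacro %}

-- models/customer_churn_scores.sql (hypothetical usage)
select
    customer_id,
    {{ predict_churn(['avg_daily_active_users', 'number_of_users', 'total_amount_paid']) }}
        as churn_probability
from {{ ref('customer_features_preprocessed') }}
```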
What are the tradeoffs?
I’d still have to export my predictive results back to the database and configure them as sources for dbt docs (depending on whether Modelbit is involved). People wouldn’t know at a glance the data lineage powering this notebook. But my gut tells me the tradeoff would be worth it, because the ML engineer would know where to start problem solving even if the solution wasn’t readily available through SQL.
Bring machine learning to the SQL workflow
What if…SQL could do more than what we think it can or even should?
MindsDB
- MindsDB: Open source layer on top of a database using SQL syntax to create predictive models. Machine learning computation takes place in the MindsDB layer.
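Roughly, that looks like the sketch below. The exact syntax has shifted a bit across MindsDB versions, and the integration, table, and column names are placeholders:

```sql
-- Train a predictor on top of warehouse data (names are placeholders).
CREATE MODEL mindsdb.churn_model
FROM warehouse (SELECT * FROM customer_features_labeled)
PREDICT churned;

-- Batch predictions: join source rows against the trained model.
SELECT t.customer_id, m.churned
FROM warehouse.customer_features AS t
JOIN mindsdb.churn_model AS m;
```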
Continual
- Continual: Directly create a dbt model to predict your results. Evaluates ML models to use based on your dbt model configuration. Machine learning computation takes place in this layer and data warehouses where applicable (think: Snowpark API, Databricks).
BigQuery ML
- BigQuery ML: Use BigQuery-specific syntax to create machine learning models within the database and apply them as functions to your SQL to predict results. You get to import your own TensorFlow models!
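A minimal sketch of the BigQuery ML pattern, with hypothetical dataset, table, and column names:

```sql
-- Train a classification model directly in BigQuery.
CREATE OR REPLACE MODEL analytics.churn_model
OPTIONS (
  model_type = 'logistic_reg',
  input_label_cols = ['churned']
) AS
SELECT daily_active_usage, number_of_users, amount_paid, churned
FROM analytics.customer_features;

-- Score new rows with the trained model.
SELECT *
FROM ML.PREDICT(
  MODEL analytics.churn_model,
  (SELECT daily_active_usage, number_of_users, amount_paid
   FROM analytics.customer_features)
);
```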
Redshift ML
- Redshift ML: Use Redshift-specific syntax to create machine learning models within the database and apply them as functions to your SQL to predict results. You get to bring your own models too!
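And the equivalent Redshift ML sketch, again with hypothetical names (Redshift hands training off to SageMaker behind the scenes, hence the IAM role and S3 bucket):

```sql
-- Train a model; Redshift ML delegates training to SageMaker Autopilot.
CREATE MODEL churn_model
FROM (
    SELECT daily_active_usage, number_of_users, amount_paid, churned
    FROM customer_features
)
TARGET churned
FUNCTION predict_churn
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftMLRole'
SETTINGS (S3_BUCKET 'my-ml-artifacts-bucket');

-- Once trained, the model is exposed as an ordinary SQL function.
SELECT customer_id,
       predict_churn(daily_active_usage, number_of_users, amount_paid) AS churn_prediction
FROM customer_features;
```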