Following my last post on “What Does dbt Actually Do?”, I wanted to explain the difference between the two products offered by dbt Labs: dbt Core and dbt Cloud.
As a recap:
- dbt Core: a set of open source Python packages, run via the command line, that helps you transform, document, and test your data
- dbt Cloud: a web-based UI that allows you to both develop and deploy data pipelines (built on top of dbt Core)
So what does this actually mean?
Let’s start with a data pipeline built with just dbt Core:
- You develop your SQL, written in dbt syntax (lots of curly brackets)
- You version control your production code with something like GitHub
- You schedule, in something like Airflow, an hourly `dbt build`, which will run against your production code
- dbt Core (the Python package) compiles your SQL (turns curly brackets into “proper” SQL). I’ll go into this a bit more in a later article
- Your tables get built in your database (e.g. Snowflake), and you run tests & build documentation based on the `.yml` files that accompany your SQL files
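To make the “compiling” step above concrete, here’s a deliberately simplified sketch of the idea. dbt itself uses the full Jinja templating engine; this toy regex stand-in (the `compile_sql` function and the `analytics` schema name are both made up for illustration) just shows what it means for `{{ ref('orders') }}` to become plain SQL:

```python
import re

# Toy illustration of dbt's compile step: resolve Jinja expressions like
# {{ ref('orders') }} into real schema-qualified table names, so the
# warehouse only ever sees plain SQL. (dbt uses the real Jinja engine;
# this regex version is just a sketch.)
def compile_sql(model_sql: str, schema: str = "analytics") -> str:
    # Replace {{ ref('model_name') }} with schema.model_name
    return re.sub(r"\{\{\s*ref\('(\w+)'\)\s*\}\}", rf"{schema}.\1", model_sql)

raw = "select customer_id, sum(amount) from {{ ref('orders') }} group by 1"
print(compile_sql(raw))
# -> select customer_id, sum(amount) from analytics.orders group by 1
```

The curly brackets never reach Snowflake: by the time the query runs, it’s ordinary SQL.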
Here’s my attempt at simplifying all of this in one diagram:
dbt Core is doing all of the bits in red: taking your production code, compiling it, creating data tables, and running tests & creating documentation. (This diagram isn’t 100% accurate: dbt can also test your raw data!)
The above setup is common, but there are 3 main drawbacks:
- Local development can require a lot of setup: tools like VSCode are endlessly customisable, especially for dbt, but for every new developer you will also need a new local setup
- Scheduling isn’t easy: Airflow is popular, but the UI isn’t great (especially for beginners) — and scheduling happens in a separate tool from the one where you develop your code
- Documentation hosted separately: the documentation (generated from your `.yml` files) is produced as static files, which you have to host yourself if you want a production version of your documentation
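To show the scheduling drawback in its rawest form: outside dbt Cloud, even a basic hourly `dbt build` lives in a separate tool. A minimal sketch, using a plain crontab entry rather than a full Airflow DAG (the project path `~/dbt_project` and log path are made-up examples, and this assumes `dbt` is on the PATH):

```
# minute hour day month weekday  command
0 * * * *  cd ~/dbt_project && dbt build >> ~/dbt_build.log 2>&1
```

Even in this simplest form, the schedule is defined somewhere your analysts never look — which is exactly the gap dbt Cloud’s built-in scheduler closes.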
dbt Cloud is built on top of dbt Core, and includes a web development environment (like VSCode) and a scheduler (like Airflow).
Let’s take the above setup, and instead see how it would work on dbt Cloud:
There are a few things to take away from the above:
- dbt Core is still being used: dbt Cloud is just built on top of it
- Everyone develops in the web UI: no local setup needed for new users
- The web UI is used for scheduling: this is the main selling point for me — scheduling is far easier than in any other tool I’ve used, and is purpose-built for dbt pipelines
- Documentation is hosted: you don’t need to do anything other than fill out your `.yml` files and click “generate documents” when running via dbt Cloud — it’ll take care of hosting them for you so that you can share them across your company
In the interests of balance — dbt Cloud may not always be the right choice.
If you have a solid Data Engineer who can handle scheduling, hosting documentation, and making a local setup a well-oiled process, then most of the drawbacks of using just dbt Core can be mitigated.
dbt Cloud also isn’t free: at $1,200 per developer per year on their Pro plan (limited to 8 developers and only 2 concurrent jobs), a lot of companies will need to choose their Enterprise plan, which can rack up significant costs on top of your cloud data warehouse spend.
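As a back-of-envelope check on the pricing quoted above (using only the figures in this article — $1,200 per developer per year, capped at 8 developers on that plan):

```python
# Back-of-envelope: maximum annual spend on the capped plan,
# using the per-seat price and seat limit quoted in the article.
price_per_dev_per_year = 1_200
max_devs_on_plan = 8

max_annual_cost = price_per_dev_per_year * max_devs_on_plan
print(max_annual_cost)  # 9600
```

So $9,600/year is the ceiling before you hit the seat limit and the Enterprise conversation starts — and that’s before any warehouse spend.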
In summary:
New to dbt, or just want to learn the advanced concepts? My dbt course is now live on Udemy with a limited time promotion (link), and covers everything from basic to advanced dbt — including building a full project from scratch, using dbt Core and dbt Cloud, 7 hours of video, and 50 downloadable course materials!