Semantic layers are not new: BusinessObjects patented the concept back in 1991. It's a great idea, but it never really took hold in the enterprise. Now LLMs are generating renewed interest in the semantic layer, because it unlocks much of where AI-powered analytics is headed. If you want trustworthy data, you need a semantic layer. Let an LLM loose on a raw schema, then let it loose on a semantic layer over that same schema with some basic curation, and the latter will give you much better results.
In this blog post, I discuss trends and observations from conversations with business users, founders, and BI experts at cube.dev's first user conference, held in San Francisco.
Text-to-SQL startups, however hyped, are destined to fail without a semantic layer.
Data products are trending: customers increasingly expect your product to feed some data back to the people paying for it. The intersection of semantic models with data sharing and distribution is extremely powerful.
To unlock LLMs, we first need to write the literature of data modeling for data analytics. It will serve as the foundation of the dialogue we then have with LLMs.
The industry keeps talking about a universal source of truth for semantic layers, but it is not going to happen, because that is not how reality works.
We're entering a golden age of vertical-specific analytical software built on top of semantic layers.
Semantic Layers: The Treasure Map for LLMs
As we navigate the vast ocean of data in today's business landscape, it's becoming increasingly clear that semantic layers may indeed be the treasure map that Large Language Models (LLMs) need to uncover the true value hidden within our data. Just as ancient mariners relied on maps to guide them to buried riches, LLMs can use semantic layers to navigate the complex terrain of business data, avoiding the pitfalls of misinterpretation and inaccuracy.
For those new to semantic layers, imagine you have a large library with thousands of books. The semantic layer is like a friendly librarian who knows exactly where every piece of information is stored and can translate complex library jargon into terms you understand. In practice, this means that as a business user you can create reports and analyze data without needing to understand the underlying technical complexities and data structures. Tools building toward a usable semantic layer include Cube, Amazon Q, ThoughtSpot, and Snowflake.
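To make this concrete, here is a minimal sketch of what a semantic layer encodes: curated names for measures and dimensions that compile into unambiguous SQL. The class, table, and column names are hypothetical, and real tools such as Cube use far richer definitions; this only shows the shape of the idea.

```python
# A toy semantic layer: curated measures and dimensions over one table.
from dataclasses import dataclass, field

@dataclass
class SemanticModel:
    table: str
    measures: dict = field(default_factory=dict)    # name -> SQL aggregate
    dimensions: dict = field(default_factory=dict)  # name -> SQL expression

    def compile(self, measure: str, dimension: str) -> str:
        """Resolve business names into one unambiguous SQL query."""
        return (
            f"SELECT {self.dimensions[dimension]} AS {dimension}, "
            f"{self.measures[measure]} AS {measure} "
            f"FROM {self.table} GROUP BY 1"
        )

orders = SemanticModel(
    table="orders",
    measures={"revenue": "SUM(amount)", "order_count": "COUNT(*)"},
    dimensions={"month": "DATE_TRUNC('month', created_at)"},
)

print(orders.compile("revenue", "month"))
```

The payoff is that "revenue by month" resolves to exactly one query, whether a human or an LLM asks the question.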
Professor Christopher Manning gave a talk about meaning and intelligence at COLM this year. He argues that we confuse being knowledgeable (which most commercially available LLMs are) with being intelligent, which is what practical deployments actually require. Giving context to LLMs is important but currently limited. He suggests that true intelligence and understanding will require more advanced ways for models to adapt to new contexts and learn continuously, beyond just having a large knowledge base or taking in some static context at inference time. Semantic layers could aid with this. And if you do it right, you're essentially coding in the language of the business.
Text-to-SQL startups are DOA without a Semantic Layer
Josh Klahr shared at the event that while Snowflake was building its early Text-to-SQL prototype, customers reported that the product was great, but wrong half the time, especially when queries got complex or when the schema contained multiple tables whose joins had to be figured out. So the team realized they needed some way for analysts to tell the system a little more about the data and its semantics.
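As a hedged illustration of that kind of curation (the tables and join conditions below are invented, not Snowflake's design), even a small map of blessed join paths removes the guesswork that sinks text-to-SQL on multi-table schemas:

```python
# Curated join paths: the semantic layer, not the LLM, decides how
# tables connect.
JOINS = {
    ("orders", "customers"): "orders.customer_id = customers.id",
    ("orders", "order_items"): "order_items.order_id = orders.id",
}

def join_clause(left: str, right: str) -> str:
    """Look up the one blessed join path instead of letting a model
    guess among every pair of similarly named columns."""
    condition = JOINS.get((left, right)) or JOINS.get((right, left))
    if condition is None:
        raise ValueError(f"No curated join between {left} and {right}")
    return f"JOIN {right} ON {condition}"

print(f"SELECT ... FROM orders {join_clause('orders', 'customers')}")
```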
Snowflake does have a semantic model. Today it is used only by Cortex Analyst, but it is similar to what a number of other semantic modeling companies offer; so far there is no magic in identifying measures, dimensions, and hierarchies. Snowflake is still figuring out what rightfully belongs to the core database experience, what should be native, and what should be built by partners. Should a metric be a first-class concept in Snowflake? It's unknown, but they are thinking about it. For those interested in learning more, Julian Hyde gave a great talk proposing how to implement a metric as a first-class citizen in a database.
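To give a flavor of what a first-class metric could feel like (a toy illustration of mine, not Snowflake's implementation or Julian Hyde's actual proposal), imagine a layer that expands a registered metric name into its full, governed aggregate expression:

```python
# Expand METRIC(name) placeholders into curated definitions.
import re

METRICS = {"revenue": "SUM(order_items.price * order_items.quantity)"}

def expand_metrics(query: str) -> str:
    """Rewrite a query so every metric reference uses the one
    governed definition."""
    return re.sub(r"METRIC\((\w+)\)", lambda m: METRICS[m.group(1)], query)

print(expand_metrics(
    "SELECT region, METRIC(revenue) FROM order_items GROUP BY region"
))
```

Whether that expansion belongs inside the database engine or in a layer in front of it is exactly the native-versus-partner question Snowflake is weighing.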
Data Products are trending
Source: Google Trends, search interest for "data products".
“As the number of data sources, users, and use cases has grown exponentially, the toolset on top of centralized data has fragmented, as have the roles. Data consumers today are savvy and have high expectations. They want data to be responsive, high quality, reliable, and at predictable cost, and no longer want to be treated as beta testers by the data teams. Trust and the user experience of analyzing data are paramount.” SanjMo said it best. Data products are a way of refocusing on the end goal, delighting the business user, instead of obsessing over the newest lakehouse, warehouse, or whatever comes next.
A data product is the combination of underlying data with a domain (e.g. finance, healthcare, or manufacturing). The domain matters enormously: intersecting the data with the domain is what makes it valuable to the consumers of that data. The final component is access. Source: SanjMo.
Data products need a semantic layer. It hides complexity from your consumers, and that matters. If you want to build better apps, give your developers a semantic layer. If your data engineers instrument metrics as part of their pipelines, how do you expose those metrics so they are easily consumable? If your ML folks build feature stores and declare which features matter when building models, serving data, or doing real-time scoring, then having a model or metric store that says "these are the features, this one is curated, this is what it means", and having that be reusable and consumable, is just as important.
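Here is a minimal sketch of that reusable-definition idea, with invented field names: each metric or feature carries its computation, its meaning in business language, and an owner, so a BI tool, a pipeline, and a feature store can all resolve the same name to the same definition.

```python
# A tiny metric/feature registry: the definition travels with its meaning.
from dataclasses import dataclass

@dataclass(frozen=True)
class Definition:
    name: str
    expression: str   # how to compute it
    description: str  # what it means, in business language
    owner: str        # who to ask when it looks wrong

REGISTRY = {
    "weekly_active_users": Definition(
        name="weekly_active_users",
        expression="COUNT(DISTINCT user_id) over a trailing 7 days",
        description="Users with at least one session in the last 7 days.",
        owner="growth-team",
    ),
}

print(REGISTRY["weekly_active_users"].description)
```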
Part of Snowflake's success came from creating an architecture that is really good for data sharing. Many customers, especially in industries like financial services, come to Snowflake because their entire ecosystem has Snowflake accounts and they want to share data. The ability to create and share distributed data products has stimulated a huge amount of growth, and the intersection of the semantic model with data sharing and distribution is extremely powerful.
Data Modeling is more important than ever
The data industry suffers collective amnesia every couple of years, rediscovering domain-driven design and microservices. At the cube.dev event, Francesco Puppini shared that, to him, "data modeling" is an ambiguous term, and the data modeling we actually need was never born. He described four types of data modeling:
Type 1: data modeling for creating applications from scratch. This is what most data modeling books teach.
Type 2: data modeling for creating a data warehouse from the data that lives in those applications.
Type 4: data modeling for preparing data for machine learning.
Type 3, data modeling for analytics itself, was never born. Today we all use Kimball dimensional modeling, and that's the only option: either one big table or Kimball. But did Kimball ever define things rigorously? We don't have definitions; we have "this table looks like a fact table, but a bit dimensionish". Francesco thinks we should write a book that lays down the foundations of a coherent data modeling discipline, with all the details and proper definitions, the way we were taught to explain things at school. We don't have an exact definition of what a measure is. And we don't distinguish measures that work horizontally first and vertically after (a row-level expression that is then aggregated) from those that are born vertically (ones that exist only as an aggregation).
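Here is a toy worked example of that distinction, as I read it (the data is invented): revenue works horizontally first, as a row-level expression that is then aggregated, while a distinct customer count is born vertically and has no row-level value at all.

```python
rows = [
    {"customer": "a", "price": 10.0, "qty": 2},
    {"customer": "b", "price": 5.0, "qty": 1},
    {"customer": "a", "price": 3.0, "qty": 4},
]

# Horizontal first: a per-row expression...
line_totals = [r["price"] * r["qty"] for r in rows]
# ...then vertical: aggregate the rows. Partial sums of revenue can
# safely be re-aggregated across groups.
revenue = sum(line_totals)

# Born vertically: there is no per-row value, and partial distinct
# counts cannot simply be summed across groups.
distinct_customers = len({r["customer"] for r in rows})

print(revenue, distinct_customers)  # 37.0 2
```

That difference in re-aggregation behavior is exactly the kind of property a proper definition of "measure" would have to capture.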
We need the literature of data modeling for data analytics. Once we have it, we can lay the foundations of the dialogue we will then have with LLMs. Let's catch the LLM train: we never did this for humans, so let's do it for machines. Let's put together a framework of communication, with humans and with machines, that gathers all the metadata needed for proper data modeling for analytics. Reach out to Francesco if you'd like to contribute; he has drafted a list of ten sets of data that need to exist if we want to write a question to an LLM in English and get the correct answer.
The universal source of truth is a waste of time
"There are only two hard things in Computer Science: cache invalidation and naming things" (Phil Karlton). That's why the semantic part is difficult. A lot of research, especially research driven by AI, is backing into this semantic understanding because language needs it: language is how you construct entities, which is exactly what folks in data modeling, domain-driven design, and microservices have been trying to pin down for years. There is real overlap here: anyone who has done a data warehouse implementation knows the first thing you do is try to understand the business's definitions of things. This is not new.
What's controversial in this industry is the notion of a universal source of truth. The concept, while often discussed, is ultimately unattainable because it doesn't match how real businesses operate. In practice, a business is composed of multiple distinct domains, each with its own perspective and data requirements. Look at a company's capability map and what it buys: it buys an ERP because finance is a function it must perform, but not its core business. All businesses have these domains. Yet every implementation starts from scratch. What should we call the patient account? How should we define it? What does it look like? What does this person do? You go around in circles, while every company in that space does the same work over and over.
It's very odd that how those common entities get recognized is never discussed publicly. If it were, it could power many products, because you could build on a shared semantic understanding of your business rather than always having to inject it from scratch.
Moving beyond version one of AI BI
The investment in AI BI so far has been about using AI to do the BI thing: here's a demo of a chatbot trending sales over time in just three minutes. I can already do that in a chart, so why do I need AI? Far more interesting is using AI to capture the relationships between the series of questions a customer asks while looking at a BI tool.
So how do I build into the semantic model what consultants do? When consultants look at the data, they ask one question, then another, and then, based on the result, another still. In that sequence lies the semantic information we should be capturing: information that describes not a projection of the data but the flow through the analysis over time.
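A speculative sketch of what encoding that flow could look like (every name below is invented): each analysis step records the question it answers and which follow-up a consultant would typically ask, depending on what the result shows.

```python
from typing import Optional

# The semantic model describes the flow of the analysis, not just
# projections of the data.
ANALYSIS_FLOW = {
    "revenue_by_month": {
        "question": "How is revenue trending?",
        "next": {"declining": "revenue_by_region",
                 "growing": "revenue_by_product"},
    },
    "revenue_by_region": {
        "question": "Which regions drive the decline?",
        "next": {"concentrated": "churn_by_account"},
    },
}

def next_step(current: str, observation: str) -> Optional[str]:
    """Suggest the consultant's usual follow-up question, given what
    the analyst just observed."""
    return ANALYSIS_FLOW.get(current, {}).get("next", {}).get(observation)

print(next_step("revenue_by_month", "declining"))  # revenue_by_region
```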
Vertical-specific analytical software will be built on top of semantic layers
In the past there was a sea of data that was basically inaccessible. Industries like law, for example, were never really touched by BI, because all of their data was unstructured text and impossible to query. Suddenly, with LLMs, that unstructured data isn't unstructured anymore; it's semi-structured, if not structured.
If one builds metrics on top of that newly structured data, inside a semantic layer, one gets some really interesting capabilities. Mapping your business processes reveals the core business entities, the same ones you usually arrive at indirectly through data modeling exercises. Those entities may be stored differently across different engines, but a SQL query can be represented as a graph, and graphs can live in graph databases: the vertices are relations.
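A rough sketch of that reading (my interpretation: tables as vertices, with joins as the edges; all names are hypothetical):

```python
from collections import defaultdict

def query_graph(joins):
    """Build an adjacency map for a query: each table (vertex) points
    to the tables it joins to (edges)."""
    graph = defaultdict(set)
    for left, right in joins:
        graph[left].add(right)
        graph[right].add(left)
    return dict(graph)

# SELECT ... FROM orders JOIN customers ... JOIN order_items ...
print(query_graph([("orders", "customers"), ("orders", "order_items")]))
```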
One can imagine a central semantic model with different storage layers underneath, chosen for performance. To the end customer it shouldn't matter whether the data sits in a document store, a relational database, or MongoDB; they just want to find it, and that starts with a central business definition, much like business process mapping software (which many older data modeling tools resembled).
Charting the course
Throughout this exploration, we've seen how semantic layers provide the essential context and guidance that LLMs require to work effectively with structured business data. They act as the compass and sextant, helping these powerful AI tools to orient themselves and plot a reliable course through the sea of information.
X marks the spot
The treasure that semantic layers help uncover isn't just gold and jewels, but something far more valuable in the modern business world: trustworthy, actionable insights. By providing a well-defined framework for understanding data relationships and business logic, semantic layers enable LLMs to dig deeper and extract nuggets of wisdom that might otherwise remain buried.
The Quest continues
As we conclude our journey, it's clear that the adventure is far from over. The quest to fully harness the power of LLMs in business analytics is ongoing, with semantic layers playing a crucial role in mapping out the territory ahead. We're entering a golden age of vertical-specific analytical software built on top of semantic layers, promising new discoveries and innovations on the horizon.
Just as treasure hunters of old knew that the map was only the beginning of their quest, we understand that semantic layers are not the end goal, but rather the key to unlocking the true potential of AI in data analytics. As we continue to refine our maps and chart new territories, the treasures we uncover – in the form of insights, efficiencies, and innovations – will undoubtedly transform the landscape of business intelligence.
So, are semantic layers the treasure map for LLMs? As our exploration has shown, they very well might be the X that marks the spot for the next big breakthrough in AI-powered analytics.
Featured in this picture, left to right, at the cube.dev event:
Hamzah, the founder of Lightdash, self-serve BI to 10x your data team. Reach out to him if you want to turn your dbt project into a full-stack BI platform, instantly.
Galen, reach out to him if you want to monitor the health of your accounts and maximize renewals.
Me - reach out if you want to geek out on semantic layers, all things LLMs or if you’re hiring, post on https://www.shesready2.dev/
Katie - for all things data and product in BI. If your current solution does not satisfy you, reach out to Katie
Francesco - wrote The Unified Star Schema and is THE most passionate person I've ever met when it comes to semantic layers and giving business users access to data. Francesco is a legend.