It’s hard to ignore the elephant in the room: Generative artificial intelligence pretty much dominated the conversation in 2024, and that’s not likely to change anytime soon.
Given that generative AI burst onto the scene barely a couple of years ago, it’s not surprising that innovations aimed at making generative models more sustainable and useful have turned the technology into a fast-moving target. We’re especially watching innovations that enable models to reason, detect causality, be orchestrated and employ agentic approaches.
With all this attention on the models, it has been easy to forget that models are only as good as the data on which they are trained. But as enterprises venture beyond early experimentation with proofs of concept, our big take for the year ahead is that data will crawl back into the limelight. In this year-ahead post, we’ll share some quick thoughts about where we are with generative AI, which provides the context for our expectation that 2025 will be the year that data gets respect again.
Setting the stage
When it comes to generative AI, it seems there is no middle ground. Depending on whom you speak with, the AI future is infinite or the sky is falling. We’re seeing predictions ranging from “2025 will be known as the year of The Automation of Everything” (from a vendor email pitch) to “Wall Street’s hottest trend may come to a crashing halt.” Or that, according to Microsoft Corp. CEO Satya Nadella, AI agents will cause software-as-a-service applications to “collapse” as they soak up all the business logic. There’s no doubt that we’re in for some form of transformation, at some point.
We’ve been seeing posts in the past few weeks characterizing ChatGPT as an $8 trillion birthday gift to big tech or that the AI bubble will burst in 2025. Earlier in the year, we postulated that Nvidia Corp. might be becoming the de facto AI mainframe; more recently we’ve seen posts suggesting that Nvidia has peaked and that savvy investors should be shorting the stock. As for OpenAI, it is drawing plaudits for pioneering reasoning models, but appears to be hitting the wall with the bigger-is-better strategy behind GPT-5. That comes amidst broader concerns that the world may be running out of publicly available data for training these models.
Naturally, we couldn’t resist the temptation to add our two cents. We observed that gen AI in the short term was becoming a cash drain on the industry that would take three to four years to pay off. That’s because any new technology has upfront costs, and the more transformative the innovation, the longer it takes to catch on.
For gen AI, the huge costs stem from the need to build out the infrastructure and ecosystem and to train the models. And though there are short-term killer apps, such as conversational interfaces, coding copilots and document/content entity extraction and summarization, it will take a while for enterprises to understand and embrace use cases that will transform their businesses.
What’s next for AI?
First off, let’s set some perspective. Though it dominates the conversation, gen AI is just one form of AI. By contrast, “classical” machine learning is well-entrenched, enhancing software applications and tools with predictive trend analysis, prescriptive remediation, classification and clustering capabilities. Gen AI is the new kid on the block and that’s obviously where most of the action is.
In the coming year, venture firms that are already all in on frontier model investments will continue doubling down. But the longer tail of VC funding will pivot toward practical use cases. Third-quarter data from CB Insights provides hard evidence of funding starting to shift away from megadeals toward more targeted investments. Paradoxically, even Databricks’ recent $10 billion Series J round (the company outraised OpenAI last year) backs this point. Hold that thought.
For the record, we believe that when it comes to generative AI models, the meek will inherit the earth. Exhibit One? The growing promise of approaches such as Mixture of Experts in conjunction with smaller language models. In essence, we expect to see a value engineering approach to generative AI: How much can we shrink the compute footprint, training data set and model and still deliver results that are good enough?
Innovation in gen AI will continue at a frenetic pace in 2025. But we’ll defer to others to talk about the forms this will take, whether that be agentic AI, reasoning models, orchestrated execution or even that artificial general intelligence is already here.
What happened with data in 2024?
We acknowledged last year that AI practically sucked the air out of the room; data startups are so 2014. Last year, we made a number of predictions, some of which were overly ambitious; you can find them here. So how did they pan out?
Database platforms. We predicted enterprise customers would opt for a continuing flight to safety. There is scant appetite for new database startups in a landscape that still counts hundreds of engines but shows the top 10 most popular ones remaining largely stable. With the foundational architecture for cloud-native databases becoming pretty well-defined (e.g., options such as elasticity, serverless and multimodel support have become checkbox items), core engine enhancements have become incremental. In fact, some providers such as MongoDB Inc. are actually retrenching, pulling the plug on features such as mobile and SQL query support.
So, it shouldn’t be surprising that, of the Cambrian explosion of new databases in the 2010s, only a handful (Databricks Inc., Snowflake Inc. and MongoDB, distantly followed by Cloudera Inc.) have broken the $1 billion revenue mark. If you’re one of those 2010s database startups and haven’t reached that milestone (adjusted for inflation) by now, you probably never will. The overwhelming presence and diversity of the hyperscalers’ database portfolios, along with the emergence of multimodel databases, are likely to consign the likes of Aerospike, Yugabyte, Cockroach Labs, DataStax, Couchbase, Redis and EnterpriseDB (actually founded the previous decade) et al. to the longer tail.
Data types. Vector embeddings were the rookie data type of the year in 2024. They support an emerging adoption pattern for generative AI: pointing models at defined data sources to keep answers relevant, rather than fine-tuning the models themselves. Vector data types and indexing are the key building blocks for supporting retrieval-augmented generation, or RAG.
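To make the mechanics concrete, here is a minimal sketch of the retrieval step at the heart of RAG, assuming document chunks have already been converted to vectors by an embedding model. A production system would use a vector index (typically approximate nearest neighbor) instead of this brute-force scan, but the principle is the same:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(query_vec: np.ndarray, chunk_vecs: list[np.ndarray],
             chunks: list[str], k: int = 3) -> list[str]:
    """Return the k chunks most similar to the query embedding."""
    scores = [cosine_similarity(query_vec, v) for v in chunk_vecs]
    ranked = sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)
    return [chunks[i] for i in ranked[:k]]

# The retrieved chunks are then stuffed into the prompt, grounding the
# model's answer in the enterprise's own data rather than its training set.
```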
We expected database providers adding vector storage support to innovate with indexes tuned for use cases demanding precise responses versus those where “good enough” results suffice. A year later, vector indexes have become checkbox features for cloud databases, reflecting their growing popularity, but we have not yet seen much differentiation there outside what’s offered by pure-play vector database providers.
Nonetheless, we expect RAG adoption to continue driving innovation in working with data, and that’s gaining our attention in 2025.
Data discovery and AI. One of the fastest-emerging use cases for gen AI is enhancing data discovery and governance; last year this was a hotbed for innovation. Language models can introspect metadata, picking up where SQL and search queries leave off. There is a wide range of approaches, from natural language query converting text input to SQL, to generating data pipelines, automating the classification and transformation of unstructured data, enriching metadata for populating data catalogs, and so on. What we have yet to see is generative AI employed for modeling schema for database design, but never say never.
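By way of illustration, here is a hedged sketch of the natural language query pattern using OpenAI’s chat completions API. The schema, prompt and model name are all illustrative; any LLM endpoint would serve:

```python
from openai import OpenAI  # assumes OPENAI_API_KEY is set in the environment

client = OpenAI()

SCHEMA = """
orders(order_id INT, customer_id INT, order_date DATE, total DECIMAL)
customers(customer_id INT, name TEXT, region TEXT)
"""

def nl_to_sql(question: str) -> str:
    """Translate a natural-language question into SQL, grounding the
    model in table metadata supplied through the prompt."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative choice
        messages=[
            {"role": "system",
             "content": "Translate questions into SQL for this schema:\n"
                        f"{SCHEMA}\nReturn only the SQL statement."},
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content

print(nl_to_sql("Total order value by region for 2024"))
```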
Data lakehouses. This now seems like ancient history. We started being bullish on them way back in 2023. And though enterprise adoption has been more gradual than we initially expected, with Apache Iceberg becoming the de facto standard open table format, we expect in 2025 that take-up will accelerate.
The inflection point was Databricks’ surprise acquisition of Tabular, the company whose founder Ryan Blue created Iceberg while at Netflix Inc. Given Databricks’ backing of the rival Delta Lake table format, we initially wondered about the future direction of the Iceberg project. We were reassured during a meeting with Blue at AWS re:Invent that cross-community collaboration remains alive and well, with Databricks rivals such as Snowflake continuing to put skin in the project.
A potential fly in the ointment? Amazon Web Services Inc. introduced S3 Tables, a new type of S3 bucket that optimizes Iceberg performance. It’s open to all query engines, but requires a new library that AWS has open-sourced. Nonetheless, with providers such as AWS, Google Cloud and IBM Corp. placing their wagers on the lakehouse as the data pillar for AI, we have little doubt that Apache Iceberg will become the de facto data store for cloud AI and business intelligence services.
Data and AI governance. Last year we also jumped the gun in calling for the two to converge. Until now, both disciplines have been largely siloed: Concerns about data and model access, quality, security, bias and compliance have been handled by separate toolchains. As we’ll note in our outlook for the year ahead, these functions are parallel but often not identical. For instance, though bias in models has parallels in the choices of what data is sampled and the assumptions behind how models are built, the tasks differ.
The same applies to managing the quality of data and of AI models, and the list goes on. The linkage comes with lineage – correlating which iterations of the model performed training or inference workloads on which corpuses of data. Databricks, with Unity Catalog, is one of the few that has started taking baby steps to bridge the silos. As we’ll note in our outlook for the year ahead, we expect more to follow.
What’s happening with data in 2025?
This year will be the Renaissance of Data. It won’t be a repeat of the 2010s, which spawned furious startup activity; rather, enterprises will of necessity have to redirect attention to data as their gen AI projects move beyond the proof-of-concept stage.
For instance, data quality has been a persistent issue, especially with analytics, but the consequences magnify with AI. Though there is a need for some technology innovation to fill the gaps, many of the building blocks and practices already exist. The renewed focus will direct more attention to people and process.
The recent Databricks J round underscores this. Though the company intended the funding largely to make whole vested employees otherwise awaiting an initial public offering, the size of the raise reflects the market’s enthusiasm for a company that is not a pure-play AI player, but one that places data on an equal footing.
Simply stated, good models require good data, and AI governance depends on data governance. And, as we expect RAG to initially be the more popular approach for enterprises to meld generative models to their needs (versus fine-tuning), the spotlight will shine on data.
Though not all the technology building blocks are in place, many already are. Using AI to crawl and enrich metadata? Check. Automatically generate data pipelines? Check. Using regression analysis to flag data and model drift? Check. Using entity extraction to flag personally identifiable information or summarize the content of structured or unstructured data? Check. Applying machine learning to automate data quality resolution and data classification? Check. Applying knowledge graphs to RAG? You get the idea.
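Drift flagging, for one, can be as simple as testing whether a production feature still looks like its training-time baseline. The sketch below uses a two-sample Kolmogorov-Smirnov test, a lightweight statistical alternative to the regression approaches mentioned above; the threshold and window sizes are illustrative:

```python
import numpy as np
from scipy.stats import ks_2samp

def drifted(baseline: np.ndarray, current: np.ndarray,
            alpha: float = 0.01) -> bool:
    """Flag drift when the current window's distribution differs
    significantly from the training-era baseline."""
    _, p_value = ks_2samp(baseline, current)
    return p_value < alpha

rng = np.random.default_rng(42)
baseline = rng.normal(loc=0.0, scale=1.0, size=5_000)  # training-era feature
current = rng.normal(loc=0.4, scale=1.0, size=5_000)   # shifted production data
print(drifted(baseline, current))  # True: this feature has drifted
```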
There are a few technology gaps that we expect will be addressed in 2025, including automating the correlation between data and model lineage, assessing the utility and provenance of unstructured data, and simplifying the generation of vector embeddings. We expect that in the coming year, bridging data file and model lineage will become commonplace in AI governance tools and services. And we’ll likely look to emerging approaches such as data observability to transform data quality practices from reactive to proactive.
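What bridging those lineages might look like, in a minimal sketch: a record tying a model version to a fingerprint of the exact data snapshot it trained or inferred on. All names here are hypothetical:

```python
import hashlib
from dataclasses import dataclass, field
from datetime import datetime, timezone

def fingerprint(snapshot_manifest: bytes) -> str:
    """Stable hash identifying one immutable snapshot of a corpus."""
    return hashlib.sha256(snapshot_manifest).hexdigest()

@dataclass
class LineageRecord:
    """One edge in the combined data-and-model lineage graph."""
    model_name: str
    model_version: str
    dataset_uri: str
    dataset_fingerprint: str
    operation: str  # "training" or "inference"
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())

record = LineageRecord(
    model_name="support-summarizer",
    model_version="1.3.0",
    dataset_uri="s3://corpus/tickets/2024-12/",
    dataset_fingerprint=fingerprint(b"...snapshot manifest bytes..."),
    operation="training",
)
```

Emitting such records from both training pipelines and inference services would let a governance catalog answer, in either direction, which model versions touched which corpuses of data.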
Let’s start with governance. In the data world, this is hardly a new discipline. Though data governance over the years has drawn more lip service than practice, for structured data the underlying technologies for managing data quality, privacy, security and compliance are arguably more established than those for AI. And there is plenty of lineage data, as most tools in the data lifecycle collect that information – often the challenge is deciding which is the single source of truth.
In addition, though we’ve grown jaded about schemes for master data management, the reality is that AI can pick up where traditional, static, top-down schemes over limited corpuses of data have left off. Machine learning has provided useful capabilities for classifying and grouping structured data, and recent innovations let language models enrich that output with natural language, making it a de facto business glossary.
Unstructured data is the frontier. The goals – having data that is high-quality and validated – are much akin to those for structured data. The current state of practice is to track data at the file level – that is, track the lineage and assess the provenance of source data files, which in the short term will primarily be text-based (or voice auto-transcribed to text).
But we can do better. The challenge is that getting a finer-grained handle on unstructured data currently involves a highly complex series of orchestrated steps and multiple tools, as laid out in this AWS technical blog post. This scenario leverages machine learning to extract metadata, which subsequently populates a data catalog, but it takes a lot of orchestration to get there.
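To give a flavor of one step in such a chain, here is a hedged sketch of entity and key-phrase extraction with Amazon Comprehend, whose output could populate a data catalog. The confidence threshold is an assumption, and a real pipeline would batch documents and respect Comprehend’s per-request size limits:

```python
import boto3  # assumes AWS credentials are configured

comprehend = boto3.client("comprehend", region_name="us-east-1")

def extract_catalog_metadata(text: str, min_score: float = 0.9) -> dict:
    """Pull entities and key phrases from a document so they can be
    written to a data catalog as searchable metadata."""
    entities = comprehend.detect_entities(Text=text, LanguageCode="en")
    phrases = comprehend.detect_key_phrases(Text=text, LanguageCode="en")
    return {
        "entities": [(e["Text"], e["Type"])
                     for e in entities["Entities"] if e["Score"] >= min_score],
        "key_phrases": [p["Text"]
                        for p in phrases["KeyPhrases"] if p["Score"] >= min_score],
    }
```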
As described in the AWS post, much of the enabling technology already exists – such as extracting metadata from text, converting speech to text, and extracting text from images and video. It’s a matter of piecing everything together and burying the complexity under the hood. So how can we smooth the process?
Initially, we expect that many of these tools will get conversational front ends, where the end user points to specific data sources and types (or speaks) a natural-language description of what metadata to extract. We’ll likely see some basic workflows automated that will autopopulate data catalogs and perform analytics tasks such as sentiment analysis without, for instance, requiring the end user to write AWS Lambda functions and orchestrate them with AWS Step Functions.
That’s probably the best we can hope for this year; over the long run, we expect to see more automation of higher-level governance functions that will, for instance, trigger alerts regarding privacy, bias or questionable data content. However, detecting factors such as the currency, completeness or reliability of content will prove an ongoing challenge.
As for RAG, in the near term we expect that enterprises will embrace it over alternatives such as fine-tuning because it doesn’t require organizations to dive into the guts of modifying language models.
But RAG is hardly a cakewalk, since it requires skills for prompt engineering and generating vector embeddings. And that’s in addition to mastering the core precepts of data management such as data modeling, orchestration and pipeline building. The process of converting unstructured data into vector embeddings requires expertise in how source content (e.g., documents, images, video and the like) is parceled into chunks. If the chunks, or pieces of documents, are too small, they may lose context, whereas if they are too large, they may exceed the context windows that the model can handle (this is a moving target) or become too generalized to yield sufficient meaning.
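To make the chunking tradeoff concrete, here is a minimal fixed-size chunker with overlap; the defaults are illustrative, and production pipelines typically split on semantic boundaries such as sentences or paragraphs rather than raw character counts:

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping chunks. Too-small chunks lose context;
    too-large ones risk blowing the model's context window or diluting
    meaning, hence the tunable size and overlap."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```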
There are added challenges relating to ranking the relevance of information, not to mention speed bumps when it comes to handling large source documents. And of course, you need to choose the right language model to perform the embedding.
As happened with machine learning, we expect that a form of “AutoRAG” capabilities will emerge, providing guided experiences that could scan a source document and recommend settings for chunking and dimensions, while providing A-B test capabilities for comparing outputs from different embedding models. Pgai Vectorizer, developed by Timescale, is a new open-source Postgres tool that enables vector embedding creation from a SQL command. It provides a hint of what’s to come. Watch this space.
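For a flavor of what that looks like, here is a sketch paraphrased from Timescale’s published pgai Vectorizer examples; the table names are hypothetical, and exact function signatures may differ across versions:

```python
import psycopg2  # assumes a Postgres instance with the pgai extension installed

# Paraphrased from Timescale's published examples; treat as illustrative.
CREATE_VECTORIZER = """
SELECT ai.create_vectorizer(
    'public.documents'::regclass,
    destination => 'documents_embeddings',
    embedding   => ai.embedding_openai('text-embedding-3-small', 768),
    chunking    => ai.chunking_recursive_character_text_splitter('body')
);
"""

with psycopg2.connect("dbname=app user=app") as conn:
    with conn.cursor() as cur:
        # One SQL call; from here pgai keeps documents_embeddings in sync
        # as rows in public.documents are inserted or updated.
        cur.execute(CREATE_VECTORIZER)
```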
The other major enhancement will be the application of knowledge graphs to RAG. The GraphRAG pattern was introduced by Microsoft last year as a means to add more context based on the interrelationships between vector embeddings. Or, as stated in a Neo4j post, “It’s basically the same architecture as RAG with vectors but with a knowledge graph layered into the picture.”
It was inevitable that RAG would meet the knowledge graph, given how ubiquitous knowledge graphs have become in enterprise applications. Just a few examples: the Microsoft Graph underlying Microsoft 365 (formerly Office), which powers collaborative use cases; the Salesforce Data Graph, which acts as a materialized view of the relationships between customer contacts; and the SAP Datasphere Knowledge Graph, which enhances the semantic tier of SAP business objects with a graph representing the interrelationships between them. And the list goes on.
GraphRAG can make RAG far more efficient and useful. By factoring in explicit relationships, it adds context. On a practical level, it could also reduce the amount of compute and prompting needed to arrive at a meaningful answer.
Putting Graph + RAG together can produce a 1 + 1 = 3 scenario, where the end result is better than the sum of the parts. For instance, though “conventional” RAG searches ferret out entities that are similar, they may not paint the full picture when dealing with large data collections. Thus, a fraud query that includes a vector similarity search may or may not unearth the chatter most relevant to a specific occurrence of fraud. Conversely, a knowledge graph alone might not distinguish which seemingly connected instances of fraud actually fit the same pattern. Put the two together and, theoretically, one should get a more faithful picture of reality – and for gen AI, less chance of hallucination.
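A minimal sketch of the pattern, assuming the knowledge graph and entity embeddings are already built: vector similarity finds the seed entities, and the graph then expands the context with explicitly related neighbors that similarity alone would miss:

```python
import networkx as nx
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def graph_rag_retrieve(query_vec: np.ndarray, graph: nx.Graph,
                       embeddings: dict, k: int = 3) -> set:
    """Seed with the k entities most similar to the query, then pull in
    their one-hop neighbors from the knowledge graph."""
    seeds = sorted(embeddings,
                   key=lambda node: cosine(query_vec, embeddings[node]),
                   reverse=True)[:k]
    context = set(seeds)
    for node in seeds:
        context.update(graph.neighbors(node))  # explicit relationships
    return context
```

In the fraud example above, the vector search would surface the chatter most similar to the suspect transaction, while the neighbor expansion would drag in the accounts and counterparties explicitly linked to it.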
Over the coming year, we expect that GraphRAG will become common practice for RAG applications querying data, both structured and unstructured. Further down the pike, we also expect to see techniques such as semantic re-ranking borrowed from the e-commerce world get applied as an initial filtering of content for relevance.
Initially, knowledge graphs are likely to remain separate for structured and unstructured data, as with the semantic tier underlying BI applications such as ThoughtSpot Spotter (which maps the underlying schema of the database). But solutions such as the Writer Knowledge Graph are emerging that link nodes representing structured and unstructured data.
Reshaping analytics
The emergence of GraphRAG is part of a broader trend that is already redefining BI, which traditionally focused on query, reporting and analytics of structured data. RAG provides the opportunity to enhance BI query by correlating trends in tabular data with context found in unstructured data. GraphRAG in turn provides the opportunity to make that context more solid. Though routine BI query of structured data for standard reporting won’t disappear, it will become a subset of a broader definition of what business intelligence and analytics are all about.
Growing adoption of gen AI applications in production will reintroduce enterprises to the importance of data. Data will rule in 2025. The good news is that most of the data management, integration and governance building blocks already exist; we’re not looking at dramatic technology revolutions there (we’ll leave that to the AI folks).
Instead, the challenge will fall on solution providers to put the existing pieces together, and on enterprises to further erode the silos and step up collaboration among business analysts, data engineers and data scientists. To some extent, natural language query and low-code/no-code tools already offer baby steps down that path, but with the embrace of RAG and GraphRAG, the issue will be pushed to the forefront.
Tony Baer is principal at dbInsight LLC, which provides an independent view on the database and analytics technology ecosystem. Baer is an industry expert in extending data management practices, governance and advanced analytics to address the desire of enterprises to generate meaningful value from data-driven transformation. He wrote this article for SiliconANGLE.
Image: SiliconANGLE/Microsoft Designer