
Podcast: Storage and AI training, inference, and agentic AI


Dec 8, 2024




In this podcast, we look at storage and artificial intelligence (AI) with Jason Hardy, chief technology officer for AI with Hitachi Vantara.
He talks about the performance demands that AI processing places on storage, but also highlights the extreme context switching that can result as enterprises are forced to pivot between AI training and inferencing workloads.

Hardy also talks about a future that potentially includes agentic AI – AI that designs its own workflow and takes decisions for itself – that will likely result in an even greater increase in workload context switching.

Antony Adshead: What demands do AI workloads place on data storage?
Jason Hardy: It’s a two-dimensional problem. Obviously, AI needs speed, speed and more speed. To deliver that level of processing, especially when we’re talking about building LLMs and doing foundational model training, it needs extremely high-performance capabilities.
That is still the case and will always be the case, especially as we start doing a lot of this stuff in volume, as we start to trend into inferencing, and RAG [retrieval augmented generation], and all of these other paradigms that are starting to be introduced. But the other demand – the one I don’t want to say is overlooked, but is under-emphasised – is the data management side of it.
For example, how do I know what data I need to bring and introduce into my AI outcome without understanding what data I actually have? And one could say, that’s what the data lake is for, and really, the data lake’s just a big dumping ground in a lot of cases.
So, yes, we need extremely high performance, but we also need to know what data we have. I need to know what data is applicable to the use case I’m targeting, and then how I can appropriately use it, even from a compliance or regulatory requirement, or anything else along those lines.
It’s really this two-headed dragon, almost, of needing to be extremely performant, but also to know exactly what data I have out there, and then having proper data management practices and tools and the like all wrapped around that.
And a lot of that burden, especially as we look at the unstructured data side, is critical and is embedded into some of these technologies like object storage, where you have metadata functions and the like that give you a little bit more of that descriptive layer.
But when it comes to traditional NAS, that’s a lot more of a challenge – and it’s also where a lot more of the data is coming from. So, again, it’s this double-sided thing of, “I need to be extremely fast, but I also need to have proper data management tools wrapped around it.”
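To make that descriptive layer concrete, here is a minimal sketch, assuming an S3-compatible object store accessed through the boto3 client; the bucket name, object key and metadata tags are hypothetical.

```python
# Minimal sketch: attach descriptive metadata to an object so it can be
# discovered later without reading its contents. Assumes an S3-compatible
# store reachable via boto3; bucket, key and tag values are hypothetical.
import boto3

s3 = boto3.client("s3")

# Write the object together with user-defined metadata describing its
# provenance, sensitivity and whether it is cleared for AI use.
with open("readings.parquet", "rb") as body:
    s3.put_object(
        Bucket="training-data",                      # hypothetical bucket
        Key="sensors/2024/12/readings.parquet",
        Body=body,
        Metadata={
            "source-system": "plant-telemetry",
            "contains-pii": "false",
            "approved-for-training": "true",
        },
    )

# A data management job can later read the metadata back without
# downloading the object body.
head = s3.head_object(Bucket="training-data",
                      Key="sensors/2024/12/readings.parquet")
print(head["Metadata"])
```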

Features for AI use cases
Adshead: That leads me nicely to my next question, which is, what features do enterprise data storage arrays need for AI use cases?
Hardy: You’re absolutely right. One is leading into the other, where, just like we said, we need to be extremely performant, but what we also need to be is performant at scale.
If we talk about model training, for example, model training was always about, “I need a massive amount of volume and a huge amount of throughput so I can just crunch and learn from this data and go from there.”
Now what we’re seeing is [that] we’re starting to operationalise and bring a level of enterprise-ness into these AI outcomes that requires a lot more of the compliance side of it and the data visibility side of it, while also being very performant.
But the performance side is also changing a bit, too. It’s saying, yes, I need high throughput and I need to be able to constantly improve on or fine-tune these models … But then it’s also [that] I now have an indescribable workload that my end users or my applications or my business processes are starting to integrate into and creating this inferencing-level workload.
And the inferencing-level workload is a little bit more unpredictable, especially as we start to step into context switching. Like, “Hey, I always need to be fine-tuning and improving on my models by injecting the latest data, but I also need to introduce retrieval augmentation into this, and so I now have the RAG workload associated with it.”
So, I need to be able to do this high-throughput, high-IOPS context switching back and forth, and be able to support this at enterprise scale.
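As a rough illustration of the retrieval side of that RAG workload, the sketch below runs a brute-force similarity search over pre-computed embeddings; the embed() and generate() calls in the usage note are hypothetical stand-ins, not any particular product.

```python
# Minimal RAG-retrieval sketch: given a query embedding, find the most
# similar stored document chunks and hand them to a model as context.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(query_vec: np.ndarray,
             chunk_vecs: list[np.ndarray],
             chunks: list[str],
             top_k: int = 3) -> list[str]:
    # Score every stored chunk against the query and keep the best few.
    scored = sorted(
        zip(chunks, chunk_vecs),
        key=lambda pair: cosine_similarity(query_vec, pair[1]),
        reverse=True,
    )
    return [text for text, _ in scored[:top_k]]

# Usage, with hypothetical embed() and generate() functions:
# context = retrieve(embed(question), stored_vecs, stored_chunks)
# answer = generate(f"{context}\n\nQuestion: {question}")
```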
But also, as new data is introduced into the ecosystem – generated through applications and normal business processes – I need to understand, not necessarily in real time, but almost in real time, what new data is made available so I can incorporate that.
[That’s] as long as it’s the right data and it has the right wrapper and controls around it, depending again on the data type, so that I can embed it or improve on my RAG processes or whatever – and, more broadly, so that I can incorporate a lot of that data.
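One way to get that “almost real time” awareness of new data is a simple polling loop that notices files created since the last scan; this is a minimal sketch using only the standard library, and the ingest() hook is a hypothetical placeholder for whatever indexing or embedding step follows.

```python
# Minimal sketch: periodically scan a directory tree and hand any newly
# appeared files to an ingest() callback (e.g. tag, index or embed them).
import os
import time

def scan(root: str) -> dict[str, float]:
    """Return {path: modification time} for every file under root."""
    seen = {}
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            seen[path] = os.path.getmtime(path)
    return seen

def watch(root: str, ingest, interval_seconds: int = 30) -> None:
    known = scan(root)
    while True:
        time.sleep(interval_seconds)
        current = scan(root)
        for path in current.keys() - known.keys():
            ingest(path)            # hypothetical hook: add to the RAG index
        known = current
```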
And then, at the same time, there are the source systems we’re pulling this information from. Whether it’s an OLTP environment such as a SQL database or some other structured environment, or whether it’s an unstructured environment, those source systems also need to be equipped to support this additional workload.
I need to have this data awareness, but I also need performance beyond just what’s made available to the GPU directly from the high-performance file system supporting the GPU workload. So, one really leads into the other, and it’s not a mystery or some major epiphany. These are common data practices that we at Vantara have been practising and preaching for a long time: data has value.
You need to understand that data through proper indexing, proper tagging – again, all of those data processes – and proper data hygiene. But also now, how do you do that at scale and do it very performantly?
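As one illustration of indexing and tagging at scale, the sketch below keeps a small searchable catalogue of datasets in SQLite; the schema, paths and tag names are hypothetical, not a description of any Hitachi Vantara product.

```python
# Minimal sketch of a metadata catalogue: paths, owners and approval flags
# indexed in SQLite so datasets can be found by attribute rather than by
# walking the storage itself. Schema and values are hypothetical.
import sqlite3

conn = sqlite3.connect("catalogue.db")
conn.execute(
    """CREATE TABLE IF NOT EXISTS datasets (
           path TEXT PRIMARY KEY,
           owner TEXT,
           classification TEXT,
           approved_for_training INTEGER
       )"""
)
conn.execute(
    "INSERT OR REPLACE INTO datasets VALUES (?, ?, ?, ?)",
    ("/nas/projects/telemetry/2024.parquet", "data-eng", "internal", 1),
)
conn.commit()

# Find everything that has been cleared for model training.
for (path,) in conn.execute(
        "SELECT path FROM datasets WHERE approved_for_training = 1"):
    print(path)
```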

Training and inference needs
Adshead: How do the needs of training and inference in AI differ when it comes to storage?
Hardy: That’s a great question. And like I said, we – “we” being the market – have been focused so heavily on how to build models, and how to integrate and create these foundational models that can start to really revolutionise how we do business. That was all well and good: massive amounts of volume. Hitachi ourselves are creating these for a lot of the markets we work in, from the big Hitachi perspective.
But now what’s happening is we’re shifting – and we’re going to start to see this trend in 2025 and 2026 – from being exclusively about building models to how we integrate and do inferencing at scale.
Inferencing at scale, like I said, is very random because it’s driven by end users or applications or processes, not in a predictable fashion like, “Hey, I’m going to start a training process, evaluate it, and then do another training process,” where it’s very regimented and scheduled.
This is kind of at the whim of how the business operates and almost at the whim of, “I have a question that I want to ask the system” … and then it now spins up all these resources and processes to be able to support that workload.
So, this becomes a lot more random. Additionally, it’s not just one use case. We’re going to see many use cases where the infrastructure needs to support this all simultaneously.
It’s loading the proper model up, it’s tokenising, it’s getting the output from what’s being interfaced with, and then it’s portraying that back to the customer or the consumer, with all the back-and-forth nature of that. So, from our perspective, what you’re going to see is that inferencing will drive a huge level of random workload that is also more impactful to the source data side, not just the model.
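The sketch below illustrates that load-tokenise-generate loop with a small model cache, so a repeat request does not have to re-read weights from storage; load_model() and the stub model class are hypothetical stand-ins for a real serving stack.

```python
# Minimal inference-path sketch: load the requested model on demand, keep a
# few hot models cached, tokenise the prompt and return the generated text.
from functools import lru_cache

class _StubModel:
    """Stand-in for a real model object; simply echoes the prompt back."""
    def __init__(self, name: str):
        self.name = name
    def tokenize(self, text: str) -> list[str]:
        return text.split()
    def generate(self, tokens: list[str]) -> list[str]:
        return tokens
    def detokenize(self, tokens: list[str]) -> str:
        return " ".join(tokens)

def load_model(name: str) -> _StubModel:
    # In a real system this would pull multi-gigabyte weights from the
    # storage platform -- the expensive step the cache below avoids.
    return _StubModel(name)

@lru_cache(maxsize=4)                 # keep a handful of hot models resident
def get_model(name: str) -> _StubModel:
    return load_model(name)

def handle_request(model_name: str, prompt: str) -> str:
    model = get_model(model_name)     # cache hit avoids re-loading weights
    tokens = model.tokenize(prompt)   # tokenise the user input
    output = model.generate(tokens)   # run inference
    return model.detokenize(output)   # turn tokens back into text

print(handle_request("support-bot-v2", "What is our returns policy?"))
```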
So, again, like I mentioned earlier – retrieval augmentation, agentic AI, things like that – these spin up all sorts of different levels of consumption against the storage platform, driven specifically by inferencing.
Agentic AI, this new trend that’s starting to appear, is going to make this an even more exponential problem. Traditionally, if I’m going to interface with a system, I ask it a question, a model gets loaded, it does its tokenisation, I get the result back, and so on – that whole process.
Well, now that same level of communication with the system is turning into not just one model but many different models – many different queries, or the same query run against many different models – to try to get to the best outcome or the best answer for that specific question.
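A minimal sketch of that fan-out, assuming the same question is sent to several model endpoints concurrently; query_model() is a hypothetical placeholder for real model calls.

```python
# Minimal agentic fan-out sketch: one user question becomes N concurrent
# model invocations, and the answers are gathered for selection.
import asyncio

async def query_model(model_name: str, question: str) -> str:
    await asyncio.sleep(0.1)          # placeholder for a real model call
    return f"{model_name}: answer to '{question}'"

async def fan_out(question: str, models: list[str]) -> list[str]:
    tasks = [query_model(m, question) for m in models]
    return await asyncio.gather(*tasks)

answers = asyncio.run(
    fan_out("Summarise Q3 sales", ["model-a", "model-b", "model-c"])
)
print(answers)        # a downstream step would pick or merge the best answer
```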
Now what’s happening is this spins up that exponentially greater level of workload. And then, once that’s done, you need to spin it down and shift back over to your fine-tuning or your training or whatever other workload, because you don’t just have an idle set of resources sitting there waiting. Those resources are going to be constantly used for both sides now – the inferencing and the training workloads.
This context switching is going to put a big burden on the storage platform to support really high-speed checkpointing, so that I can stop my tuning or my model training and shift those resources to fulfilling the end-user or process demand as quickly as possible, because that is a real-time interface.
Then that gets spun down because the inferencing is done, and I spin back up and continue where I left off on the training and tuning side. So, you’re going to see this really weird, random level of workload that both of these types of demand are going to place on the storage systems.
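A minimal sketch of that checkpoint-and-resume pattern, assuming PyTorch; the model, optimiser and file path are illustrative, and how fast this state can be flushed to and read back from storage determines how quickly the GPUs can be handed over to inferencing and back.

```python
# Minimal checkpoint/resume sketch in PyTorch. Saving lets training pause so
# the GPUs can serve inference; loading resumes training where it left off.
import torch

model = torch.nn.Linear(128, 10)                      # illustrative model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

def save_checkpoint(step: int, path: str = "ckpt.pt") -> None:
    torch.save(
        {"step": step,
         "model": model.state_dict(),
         "optimizer": optimizer.state_dict()},
        path,
    )

def resume(path: str = "ckpt.pt") -> int:
    state = torch.load(path)
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["step"]              # continue training from this step

save_checkpoint(step=1000)            # pause training, free GPUs for inference
step = resume()                       # later: pick up where training left off
```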


