DSA member James Canterbury, founder of Zeroth Technology, shares insights into how decentralized storage, powered by blockchain technology, is revolutionizing data management.
Can you run through some of your background and its relation to business and innovation?
James Canterbury: Sure, I spent most of my career in the consulting world. I was a partner at Ernst & Young in the blockchain consulting practice up until this past February (2024), when I set off to do my own thing with supply chain, blockchain, and the overlap between the two.
This work led me toward some of the industry groups, particularly the International Society for Pharmaceutical Engineering, which writes guidance that a) helps our regulators understand new technologies and b) helps industry understand new regulations.
My work with blockchain let me see what the security and robustness of public networks bring to transactional data, and how that was missing from current computerized systems. I gravitated toward decentralized storage largely because of my data-centric background in pharmaceutical records, and I ended up diving in pretty deep with IPFS to better understand what it means to store records and files in this new way.
Let's talk about this. What are some of the problems that decentralized storage helps resolve?
James Canterbury: I’ll give you an example with a past client of mine. They had a general audit going on at a manufacturing site and were looking at maintenance records. They realized that something was a bit out of sync with one of the records. As they dove deeper, they found it was not an issue with the record itself but with the IT system that was managing it.
They realized the problem was an access control issue with the data store holding the facility's maintenance records. Files were occasionally being updated at random, though not necessarily maliciously. There was a process for updating the files, but it wasn’t being followed. All it took was one errant record to put the entire system in question and force them to ask, “How do we know all the rest of them are safe?”
A team of eight of us came in and spent about a month and a half comparing all of the records – I believe there were over 6,000 records in that particular batch. We had to verify the files against backups from previously trusted states, and even that was questionable because it was unclear how long the access issue had been in place.
That wasn't all that long ago, but these types of things still happen frequently even now. And it's not that people are doing things wrong on purpose. It's just that a few slip-ups in the way you're managing your files call everything else into question. Had they been storing things using IPFS, it would be a completely different ballgame, right?
Can you explain how IPFS and decentralized storage might have helped here?
James Canterbury: This was a traditional file storage system, so we relied on the file's location and filename to find it. But had we been using IPFS as the data storage layer, with the Filecoin sealing process to package each file and the content identifier (or CID) to store and reference it, the CID itself would have been all the evidence we needed to prove that these files were unchanged.
And not only that, it would've been all the evidence we needed of who last changed a file. There are fingerprints on those CAR files showing who prepared them and who put them in. We would have been able to look for anything prepared with a digital signature that wasn't authorized, and then investigate just those files.
One of the main points of using decentralized storage is that a small breach in the management of a single file doesn't affect all the other files in the system or call them into question.
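To make that concrete, here is a minimal sketch of the idea, not how IPFS or Filecoin actually compute identifiers: a plain SHA-256 digest stands in for a real CID, and the records directory and trusted index are hypothetical. The point is that an identifier derived purely from a file's bytes lets an auditor flag only the records that changed, instead of re-verifying thousands of files against old backups.

```python
import hashlib
from pathlib import Path

def content_id(path: Path) -> str:
    """Stand-in for a CID: a digest derived purely from the file's bytes.
    (Real IPFS CIDs wrap the hash in multihash/multicodec encoding.)"""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def audit(records_dir: Path, trusted_index: dict[str, str]) -> list[str]:
    """Return the filenames whose current content no longer matches the
    content ID captured when the record was originally approved."""
    suspect = []
    for name, expected_cid in trusted_index.items():
        if content_id(records_dir / name) != expected_cid:
            suspect.append(name)
    return suspect

# Usage (hypothetical paths): only the mismatching records need investigation,
# instead of re-verifying all 6,000+ files against old backups.
# suspects = audit(Path("maintenance_records"), trusted_index)
```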
Can you explain the benefit of using globally addressable Content IDs (or CIDs)?
James Canterbury: In records management – particularly in pharma, but I believe this holds for most knowledge management systems – we store things today based on directories and filenames. I titled the document this, and I put it in this folder.
If you have a large procedural document, say a 40- or 50-page standard operating procedure, only small sections of it might be referenced or needed for a particular question or query. But we have to locate the entire SOP, and we have to rely on the controls that were in place to update that SOP and on when all its different pieces were last signed off.
If we think about content addressability, we can create a CID that locates the entire document, but we can also assign CIDs to components within the document that lead us directly to specific knowledge elements.
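As an illustration of component-level addressing, here is a hedged sketch. The SOP text, the section boundaries, and the plain SHA-256 stand-in for a CID are all assumptions for the example; real IPFS CIDs use their own chunking and multihash/multicodec encoding.

```python
import hashlib

def cid(data: bytes) -> str:
    """Illustrative content ID: plain SHA-256, not a real IPFS CID."""
    return hashlib.sha256(data).hexdigest()

def address_sop(sections: list[str]) -> tuple[str, dict[str, str]]:
    """Give every section of an SOP its own content ID, then derive the
    document-level ID from the ordered section IDs (Merkle-style), so citing
    one knowledge element doesn't require pointing at the whole SOP."""
    section_cids = {cid(text.encode()): text for text in sections}
    doc_cid = cid("".join(section_cids.keys()).encode())
    return doc_cid, section_cids

# Hypothetical SOP content for the example
sop_sections = [
    "1. Scope: cleaning of filling line equipment...",
    "2. Acceptable residue limits: ...",
    "3. Revalidation frequency: ...",
]
doc_cid, section_index = address_sop(sop_sections)
# A query about residue limits can now cite that section's own CID instead of
# "SOP-123, somewhere in 50 pages".
```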
My team and I have been doing some work recently about the safety concerns of companies using Retrieval Augmented Generation models (or RAG models), which is a form of LLM, to let them chat with PDF documents. When these are applied to controlled procedural documentation – the kind that people reference to determine how sterile is sterile enough when cleaning equipment that makes an injectable drug substance – there is no tolerance for illusory responses. We need fully referenceable responses to queries and we need to know for certain that response is coming from the most recent approved procedure.
If it cherry-picks from multiple procedures, we need to know that too. Also important: if the LLM does not have sufficient sources to respond completely, it has to tell us that as well. While this might be a dramatic example – and I’m certain we will still rely on human expertise when making these types of decisions – it’s not too far-fetched.
We cannot have an LLM make up a response to a query that will misinform an operator and potentially harm a patient. We want the LLM to return a citation from a procedure, in context. Content addressability is going to go a long way toward helping us do this, because it lets us put a finer point on the knowledge elements within these procedural documents and lets us control the documents themselves.
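Here is a rough sketch of what those guardrails could look like in code. The Chunk structure, the approved_versions set, and the answer_with_citations helper are hypothetical names for illustration, not any particular RAG framework; the point is that every retrieved passage carries the CID of the approved procedure it came from, and the system refuses to answer rather than guess when it has no approved sources.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    section_cid: str      # content ID of the exact section being cited
    procedure_cid: str    # content ID of the approved procedure version

def answer_with_citations(question: str, retrieved: list[Chunk],
                          approved_versions: set[str]) -> dict:
    """Guardrail sketch: only answer from chunks belonging to a currently
    approved procedure version, surface every source CID, and refuse rather
    than guess when nothing relevant was retrieved."""
    usable = [c for c in retrieved if c.procedure_cid in approved_versions]
    if not usable:
        return {"answer": None,
                "note": "Insufficient approved sources to answer."}
    # A full system would now send `question` plus `usable` to the LLM;
    # the citations travel with the response so every claim is traceable.
    return {
        "context": [c.text for c in usable],
        "citations": [(c.procedure_cid, c.section_cid) for c in usable],
        "spans_multiple_procedures": len({c.procedure_cid for c in usable}) > 1,
    }
```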
Can you give an example of how content addressability and data sharing within Web3 might work?
James Canterbury: Yes. Here’s one. Big enterprises go through mergers and acquisitions all the time. They spin out an operating company or a business unit or they buy an operating company or business unit to augment their portfolio. When you go through a transaction, there's often a transaction service agreement (or TSA) that exists between the buying company and the selling company that requires the seller to provide access to the transactional data, usually financial data but it could be product quality data, manufacturing data, or other critical business data.
Sellers have to be able to provide that access for an extended period of time, because the buyer is now taking on liability for things the seller might have done in the past and needs to maintain access to all past records. If you're in a regulated environment, for example, it's almost always 10 years. The problem is that the buyer’s systems are generally not equipped to ingest all this historical data – who knows what format it's going to be in – so they just want the data to remain accessible.
Providing this historical access is really expensive for the selling company. I've seen companies keep mainframes running just to make this stuff available, counting down the days until the TSA runs out so they can shut those systems off. And when there is a request for access, it's a whole process to manage, because they didn't actually transfer the data over to the acquiring company.
If the records are in some type of decentralized storage solution with access controls and content addressability, then giving the buying company access to the data is as easy as transferring custody of the keys. Currently, people get really hung up on the price of a TSA because it turns out to be a cost that just doesn't go away. With decentralized storage, you can price the cost of the data storage into the transaction at the time of the deal, and it will probably be a small fraction of the current cost of a TSA.
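A toy sketch of that key-custody idea, assuming the Python cryptography package for the symmetric encryption: records are encrypted and indexed by a content ID, and "transferring custody" amounts to handing the buyer the key and the index. A production setup would wrap the data key to the buyer's public key or wallet rather than share a symmetric key directly; the record contents below are made up.

```python
import hashlib
from cryptography.fernet import Fernet

data_key = Fernet.generate_key()   # key the seller controls during the TSA
fernet = Fernet(data_key)

archive = {}                       # stands in for a decentralized store
for record in [b"batch 4711 maintenance log", b"batch 4712 deviation report"]:
    ciphertext = fernet.encrypt(record)
    record_cid = hashlib.sha256(record).hexdigest()  # content ID of the plaintext
    archive[record_cid] = ciphertext

# "Transferring custody of the keys": at closing, the buyer receives data_key
# and the CID index, and can read every record straight from the store, with
# no seller-side mainframe kept alive for the life of the TSA.
buyer_fernet = Fernet(data_key)
assert buyer_fernet.decrypt(archive[next(iter(archive))]).startswith(b"batch")
```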
You’ve been at the forefront of evangelizing blockchain and decentralized storage within enterprises. Take us through the different phases from “No, that's crazy” to acceptance of these solutions.
James Canterbury: If you rewind the clock back to 2018 or 2019, enterprises were excited about emerging technologies like blockchain, decentralized storage, and content addressability. They got it. It wasn't hard to sell them on how it works, where the cryptography comes in, and what all the components are. It was relatively easy to get proofs of concept up and running.
Where it became a challenge – and still is a challenge – is moving past the POC phase and putting something into production. Most enterprises are willing to modify existing IT systems for the sake of quality, but very rarely do they switch to a whole new technology for the sake of quality. Instead, they're going to find ways to layer in additional controls. The cost of switching to a new platform, much less a new technological approach, is very high.
That said, I remember when the pharma industry said, “There's no way we would ever use the cloud for storing our files. Everything has to be on-prem.” It took a long time for industries to get comfortable with cloud solutions. But they did begin to think that maybe cloud storage is even more secure than their own data centers. And they began to realize that they are not actually in the business of running data centers – they're in the business of treating patients and producing medicines – so they were okay outsourcing that piece of it. But this was probably 10 years after cloud solutions were ready for prime time, and cloud migrations are still happening today.
The issue I often encountered when trying to move organizations to decentralized storage at enterprise scale was that there wasn't a rock-solid case for why a particular company had to do it, other than wanting to be innovative and out in front of the curve. They weren't being asked to do it by regulators. They weren't being asked to do it by customers. Nobody was really pushing them to innovate here.
Where do you think there is immediate or near-term opportunity for decentralized storage?
James Canterbury: The areas where these benefits will resonate are where we start doing something totally new, totally different, such as looking at new ways to share data across an entire ecosystem. My passion is supply chain and I work in it all the time. Connecting a pharma manufacturer to a patient, or a farmer to an end customer, is very difficult to do. And if you're trying to pass information between the parties, it's almost impossible to manage. From a manufacturer's perspective, they have thousands, maybe millions, of customers. And from a customer's or business's perspective, they might be using hundreds of products. These are very difficult mappings to do. But if we're able to use data networks instead of data stores, we have a better way of connecting the dots between the two.
And so I think adoption will wind up being just a matter of time, because once two or three of these kinds of applications become generally accepted, everyone else will have to justify why they're not doing it this way. We're seeing some parallels in TradFi versus DeFi, with 401(k) custodians now having to justify why they are not including crypto as an asset class. We'll likely see the same shift in the decentralized storage space.
How might this network approach work and how is it different from the current siloed multi-system integration nightmare?
James Canterbury: This is a bit of a leap from what might be widely available or in use today, but when you marry content addressability with access control – which is what the DSA is working on and which is a very important piece – you get very fine-grained control over specific data elements for a specific action. If I store all my information on AWS or put files in Dropbox or Box, what I generally do when I need to share that data is give everybody access to that folder or those files in that one location. You need very fine-grained control over who has access within that environment, but it's difficult to manage.
You could certainly do it with a smaller ecosystem, but with a large ecosystem – especially one where you don't necessarily know who's going to need access – whitelisting is not an option. It's almost impossible. So when you couple content addressability with access control – in Web3 it can be done with something like token gating – you have a credential in your wallet that gives you access to the data you’ve been given explicit permission to access.
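A toy sketch of token gating, not tied to any specific product: the store contents, CID string, and wallet address below are made up, and a real deployment would verify a signed message from the wallet and check the credential on-chain before serving the bytes behind a CID.

```python
from dataclasses import dataclass, field

@dataclass
class TokenGate:
    """Toy token gate: a CID is served only to wallets holding a credential
    that explicitly grants access to that CID."""
    grants: dict[str, set[str]] = field(default_factory=dict)  # cid -> wallets

    def grant(self, cid: str, wallet: str) -> None:
        self.grants.setdefault(cid, set()).add(wallet)

    def fetch(self, cid: str, wallet: str, store: dict[str, bytes]) -> bytes:
        if wallet not in self.grants.get(cid, set()):
            raise PermissionError(f"{wallet} holds no credential for {cid}")
        return store[cid]

# Hypothetical data: one supplier document, one authorized buyer wallet.
store = {"cid-supplier-spec-v7": b"supplier quality spec v7"}
gate = TokenGate()
gate.grant("cid-supplier-spec-v7", "0xBuyerWallet")
print(gate.fetch("cid-supplier-spec-v7", "0xBuyerWallet", store))
```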
The aha moment on this stuff is that it's much easier to sell to enterprises when it's an ecosystem play – when you're trying to connect many dots that you don't have control over.
What are you working on and what is your outlook for the space?
James Canterbury: I believe in this space enough to have started Zeroth Technology to go down this path with my own company. It’s only a matter of time before people realize that these approaches just make more sense. Industries are changing to be more interconnected across their entire ecosystems. Silos are breaking down. I don't see any way for this to be effective in the future without blockchain networks, not just for transactional purposes but also for data storage and sharing.
We need these granular controls, we need identity, and we need a lot more ways to interconnect with these networks. Once we get them, it will undermine traditional ERP approaches and the way these monolithic systems have been built up, because we can build these systems in different ways – more secure, faster, shareable, and portable. ERPs are not portable.
But if we're using just modules on top of a network, then this is very portable. The foundations for all of this are already being built, and I firmly believe this fundamental shift is just around the corner.
Contact James or follow his work at:
Website: https://zeroth.technology
LinkedIn: https://www.linkedin.com/in/jameycanterbury/
Filecoin Slack: @jamey