Imagine this: a single hacked AI agent brought down an entire 50-agent system in minutes. It’s a stark reminder that autonomous AI needs something like a “DNS for trust” before its independence leads to chaos.

It happened in minutes: one rogue agent completely crippled our 50-agent machine learning system. That’s when it hit me – we were building these smart, independent AI agents without the fundamental trust setup that the internet sorted out decades ago with things like DNS.
So, as a PhD researcher and an IEEE Senior Member, I decided to tackle this head-on. For the past year, I’ve been creating what I’ve dubbed “DNS for AI agents” – essentially, a crucial trust layer designed to give autonomous AI the robust security it absolutely requires. What began as an academic project to fix authentication headaches in complex ML setups has blossomed into a real-world system, revolutionizing how companies are rolling out AI agents on a large scale.
Moving from standard machine learning to agentic AI is a huge leap for businesses. Think about it: old-school ML needs humans watching over every single step – checking data, training models, deploying, and monitoring. But today’s agentic AI systems can pretty much run themselves, managing tricky tasks with lots of specialized agents working together. The big question, though, is how do we actually trust these autonomous systems?
The cascading failure that changed everything
Let me tell you about a real-life incident in our production setup that drove this point home for me. We were running a multi-tenant machine learning operations system powered by 50 agents, handling everything from spotting data drift to automatically retraining models. Each agent had a specific job, its own credentials, and hardcoded addresses for every agent it needed to talk to.
Then, one Tuesday morning, a simple configuration mistake led to a single agent getting hacked. In just six minutes, our whole system imploded. The reason? The agents couldn’t tell who was who. The compromised agent pretended to be our model deployment service, tricking other agents into rolling out faulty models. And our monitoring agent? It just kept reporting that everything was fine, completely unable to spot the difference between legitimate and harmful activity.
This wasn’t just a tech glitch; it was a total breakdown of trust. We had created an autonomous system, but we completely missed the basic tools agents needed to find, confirm, and verify each other. It was honestly like trying to build the internet without DNS – where every single connection just relies on manually entered addresses and hoping for the best.
That whole mess really highlighted four major issues with how we set up AI agents right now.
- First, there’s no standard way for agents to find each other; they depend on someone manually setting things up or using fixed addresses.
- Second, cryptographic verification between agents is practically non-existent.
- Third, agents can’t prove what they’re capable of without revealing sensitive details about how they actually work.
- And finally, rules for how agents should behave are either missing entirely or just impossible to consistently apply.
Building trust from the ground up
The answer we came up with, which we call Agent Name Service (ANS), draws a lot from how the internet tackled a similar challenge a long time ago. DNS completely changed the internet by linking easy-to-remember names to complex IP addresses. ANS does something along those lines for AI agents, but with an important extra step: it connects an agent’s name to its unique cryptographic identity, what it can do, and how much we can trust it.
So, here’s how it actually works. Instead of agents talking to fixed addresses like “http://10.0.1.45:8080,” they use self-describing names like “a2a://concept-drift-detector.drift-detection.research-lab.v2.prod.” This naming style instantly tells you the communication protocol (agent-to-agent), the capability (drift detection), the provider (research-lab), the version (v2), and the environment (production).
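To make the naming scheme concrete, here is a small sketch of how such a name could be parsed into its components. The field order is inferred from the single example above, so treat it as an illustration rather than the canonical ANS grammar.

```python
from dataclasses import dataclass
from urllib.parse import urlparse

@dataclass
class ANSName:
    protocol: str      # communication protocol, e.g. "a2a"
    agent: str         # agent identifier, e.g. "concept-drift-detector"
    capability: str    # what the agent does, e.g. "drift-detection"
    provider: str      # who operates it, e.g. "research-lab"
    version: str       # e.g. "v2"
    environment: str   # e.g. "prod"

def parse_ans_name(name: str) -> ANSName:
    """Split an ANS-style name into its dot-separated components."""
    parsed = urlparse(name)
    parts = parsed.netloc.split(".")
    if len(parts) != 5:
        raise ValueError(f"expected 5 dot-separated fields, got {len(parts)}")
    agent, capability, provider, version, environment = parts
    return ANSName(parsed.scheme, agent, capability, provider, version, environment)

name = parse_ans_name("a2a://concept-drift-detector.drift-detection.research-lab.v2.prod")
print(name.capability, name.environment)  # drift-detection prod
```

A registry keyed on these structured fields is what lets agents look each other up by capability and environment instead of by IP address.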
But the truly clever part is what’s happening underneath these names. We built ANS using three core technologies that collaborate to create a rock-solid foundation of trust.
- Decentralized Identifiers (DIDs) give every agent its own special, verifiable identity, using W3C standards originally meant for managing human identities.
- Zero-knowledge proofs let agents prove they hold certain capabilities — say, access to a database or permission to train a model — without revealing the underlying credentials.
- And we use Open Policy Agent for ‘policy-as-code’ enforcement, meaning our security rules and compliance needs are clearly defined, tracked like software versions, and automatically put into action.
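To give a feel for the “prove it without revealing it” idea, here is a deliberately simplified challenge-response sketch. It is not a zero-knowledge proof (the verifier shares the secret, which a real ZKP avoids), and every name in it is illustrative; it only shows the shape of the interaction: the credential itself never crosses the wire.

```python
import hashlib
import hmac
import secrets

def issue_challenge() -> bytes:
    # Verifier generates a fresh random nonce for each verification attempt.
    return secrets.token_bytes(32)

def prove_capability(credential: bytes, challenge: bytes) -> bytes:
    # Prover: MAC the challenge with the credential. The credential stays local;
    # only the MAC is sent back.
    return hmac.new(credential, challenge, hashlib.sha256).digest()

def verify_capability(credential: bytes, challenge: bytes, proof: bytes) -> bool:
    # Verifier recomputes the MAC and compares in constant time.
    expected = hmac.new(credential, challenge, hashlib.sha256).digest()
    return hmac.compare_digest(expected, proof)

cred = b"model-retrain-permission-key"   # hypothetical capability credential
challenge = issue_challenge()
proof = prove_capability(cred, challenge)
print(verify_capability(cred, challenge, proof))  # True
```

A production system would replace this shared-secret scheme with an actual ZKP construction, so the verifier needs no copy of the secret at all — but the protocol flow (challenge, proof, verify) is the same.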
We specifically built ANS to be native to Kubernetes, which was super important for companies to actually adopt it. It slots right in with Kubernetes Custom Resource Definitions, admission controllers, and service mesh tech. What that means is it plays nicely with the cloud-native tools businesses are already using, instead of forcing them to totally rework their entire setup.
Technically speaking, we’ve built this using a ‘zero-trust’ approach. That means every time agents talk to each other, they have to mutually authenticate using mTLS with unique certificates for each agent. Now, regular service mesh mTLS just proves who a service is. But with ANS mTLS, we’ve gone a step further by adding ‘capability attestation’ right into the certificate. So, an agent isn’t just saying “I am agent X” — it’s saying “I’m agent X, and I’ve got verified permission to retrain models.”
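The difference between plain identity and capability attestation can be sketched as a signed claim that carries both pieces of information. In the real system this claim rides inside the mTLS certificate; here a shared signing key stands in for the certificate authority, and all names are hypothetical.

```python
import hashlib
import hmac
import json
import time

CA_KEY = b"registry-signing-key"  # stand-in for the certificate authority's key

def attest(agent: str, capabilities: list[str], ttl: int = 3600) -> dict:
    # Issue a claim covering both identity and verified capabilities.
    claim = {"agent": agent,
             "capabilities": sorted(capabilities),
             "expires": int(time.time()) + ttl}
    payload = json.dumps(claim, sort_keys=True).encode()
    claim["signature"] = hmac.new(CA_KEY, payload, hashlib.sha256).hexdigest()
    return claim

def verify(claim: dict, required_capability: str) -> bool:
    # Check the signature, expiry, and that the needed capability was attested.
    sig = claim.get("signature")
    if not isinstance(sig, str):
        return False
    body = {k: v for k, v in claim.items() if k != "signature"}
    payload = json.dumps(body, sort_keys=True).encode()
    expected = hmac.new(CA_KEY, payload, hashlib.sha256).hexdigest()
    return (hmac.compare_digest(expected, sig)
            and claim["expires"] > time.time()
            and required_capability in claim["capabilities"])

token = attest("model-retrainer", ["model-retraining", "registry-read"])
print(verify(token, "model-retraining"))   # True
print(verify(token, "model-deployment"))   # False
```

The point is the shape of the check: a peer is accepted not because it merely proved its name, but because a signed, unexpired claim says it is allowed to do the specific thing it is attempting.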
From research to production reality
The moment of truth came when we put ANS into a live production environment. The results exceeded even our most hopeful predictions. Agent deployment, which used to take a painful two to three days, now happens in less than 30 minutes — a reduction of more than 90%. All the steps that once needed manual configuration, security reviews, certificate provisioning, and network setup are now handled automatically by a GitOps pipeline.
Deployment reliability improved just as much. Our old process succeeded only about 65% of the time, meaning someone had to jump in and manually fix configuration issues for the other 35%. With ANS, we hit a 100% success rate, backed by automatic rollback when something goes wrong: every deployment either completes cleanly or reverts cleanly — no half-finished states, no configuration drift, and no manual cleanup.
The performance numbers are just as impressive. On average, our service responds in under 10 milliseconds – that’s plenty fast for orchestrating agents in real-time, all while keeping everything cryptographically secure. We’ve even successfully put the system through its paces with more than 10,000 agents running at once, proving it can scale way beyond what most businesses typically need.
ANS in action
Let me walk you through a clear example of how this all plays out. We have a workflow for detecting ‘concept-drift’ that perfectly shows off the strength of trusted agent communication. So, when our drift detector agent spots that a production model’s performance has dropped by 15%, it taps into ANS to find the right model retrainer agent, based on what that agent can do, not just a fixed address. The drift detector then uses a zero-knowledge proof to confirm it has the authority to kick off a retraining. An OPA policy checks this request against our internal governance rules. Then, the retrainer does its job, updates the model, and a separate notification agent pings the team on Slack to let them know.
This whole process – finding the agent, checking identities, getting authorization, running the task, and sending alerts – all happens in less than 30 seconds. It’s completely secure, every step is recorded for audit, and it all runs without anyone needing to get involved. Best of all, every agent involved can absolutely confirm who the others are and what they’re allowed to do.
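The five stages of that workflow can be sketched end to end. Every function below is a stand-in with an illustrative name: a real deployment would call the ANS registry, a ZKP verifier, an OPA policy endpoint, and Slack rather than these stubs.

```python
DRIFT_THRESHOLD = 0.15  # trigger retraining when performance drops by 15%

# Stand-in for the ANS registry: lookup is by capability, not fixed address.
REGISTRY = {
    "model-retraining": "a2a://model-retrainer.model-retraining.research-lab.v1.prod",
    "notification": "a2a://notifier.notification.research-lab.v1.prod",
}

def ans_lookup(capability: str) -> str:
    return REGISTRY[capability]

def prove_authority(agent: str, action: str) -> bool:
    return True  # stand-in for a zero-knowledge capability proof

def policy_allows(agent: str, action: str) -> bool:
    return action == "retrain"  # stand-in for an OPA policy decision

def handle_drift(agent: str, accuracy_drop: float) -> list[str]:
    steps = []
    if accuracy_drop < DRIFT_THRESHOLD:
        return steps                                     # no action needed
    retrainer = ans_lookup("model-retraining")           # 1. discovery by capability
    steps.append(f"found {retrainer}")
    if not prove_authority(agent, "retrain"):            # 2. capability proof
        raise PermissionError("capability proof failed")
    if not policy_allows(agent, "retrain"):              # 3. governance check
        raise PermissionError("denied by policy")
    steps.append("retraining triggered")                 # 4. execution (stubbed)
    notifier = ans_lookup("notification")                # 5. notify the team
    steps.append(f"notified via {notifier}")
    return steps

print(handle_drift("concept-drift-detector", 0.15))
```

The structure is what matters: discovery, proof, and policy all gate execution, so a compromised agent that fails any one check never reaches the retraining step.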
Lessons learned and the path forward
Working on ANS taught me a lot about rolling out autonomous AI systems. First off, security just cannot be an afterthought. You can’t just slap trust onto an agent system later — it has to be built in from the ground up. Second, standards are super important. By making sure ANS supports various agent communication protocols (Google’s A2A, Anthropic’s MCP and IBM’s ACP), we made sure it could work smoothly across the often messy world of different agent systems. And third, automation is absolutely essential. Trying to handle thousands of agents manually simply won’t cut it for large companies.
The bigger picture here goes way beyond just managing machine learning. As businesses increasingly rely on autonomous AI agents for everything – from customer support to managing their infrastructure – the issue of trust becomes a deal-breaker. An independent system that lacks solid trust mechanisms isn’t a help; it’s a huge risk.
We’ve seen this pattern play out repeatedly as technology has evolved. In the early internet era, we learned that security through obscurity doesn’t work. With cloud computing, we learned that guarding the perimeter alone isn’t enough. Now, with agentic AI, we’re learning that autonomous systems need comprehensive trust frameworks.
The good news is, we’ve put out an open-source version that has everything you need to get ANS running in a real environment: the main library, Kubernetes setup files, example agents, OPA policies, and monitoring configs. Plus, we’ve shared the full technical talk from MLOps World 2025 where I showed the system in action.
What this means for enterprise AI strategy
If your company is rolling out AI agents — and let’s be real, most businesses are, according to recent surveys — then you really need to sit down and ask yourself some tough questions.
- For instance: How do your agents confirm who they are to each other?
- Can they prove what they’re allowed to do without revealing sensitive login details?
- Do you have automatic rules enforcing policies?
- And can you actually check what your agents are doing?
If you can’t confidently answer those questions, then you’re basically relying on hope instead of solid, cryptographic assurances. And as our system’s dramatic failure clearly showed, those ‘hopes’ will eventually let you down.
The great news is, this problem is fixable! We don’t have to just wait around for big companies or standards organizations to sort it out. The tech we need is already here: DIDs for confirming identity, zero-knowledge proofs for proving what an agent can do, OPA for setting rules, and Kubernetes for managing everything. What we were lacking was a single, cohesive system that pulled all these pieces together specifically for AI agents.
Look, moving towards autonomous AI is going to happen, no doubt about it. The real choice we face is whether we build these systems with solid trust mechanisms right from the get-go, or if we wait for some massive disaster to make us realize we should have. From what I’ve learned, I’d definitely push for the first option.
The future of AI is agentic. And for agentic AI to truly thrive, it has to be secure. ANS is the trust layer that makes autonomy and security possible together.
You can find the full Agent Name Service implementation, including all the source code, deployment instructions, and documentation, over at github.com/akshaymittal143/ans-live-demo. There’s also a technical demonstration of the system from MLOps World 2025 that you can check out.
This piece is brought to you as part of the Foundry Expert Contributor Network.
