Arize AI in Asia Pacific: LLM Evaluation, Observability & Scale with Patrick Kelly

Arize AI in Asia Pacific: LLM Evaluation, Observability & Scale with Patrick Kelly
Patrick Kelly explains how Arize AI helps organizations across Asia Pacific observe, evaluate, and ship reliable AI agents in production and improves observability across enterprises.

Fresh out of the studio, Patrick Kelly, Vice President for Asia Pacific at Arize AI, joins us to explore the critical world of AI observability, evaluation, and infrastructure and how Arize AI will start their go to market across the region. Beginning with his transition from Databricks to Arize AI, Patrick explained how the company's mission centers on making AI work for people by helping teams observe, evaluate, and continuously improve their AI agents in production. Emphasizing that evaluations are the most important requirement for AI systems in 2025-2026, he revealed a striking insight: approximately 50% of AI agents fail silently in production because organizations don't know what's happening. Through compelling case studies from Booking.com, Flipkart, and AT&T, Patrick explained how Arize AI enables real-time observability and online evaluations, achieving results like 40% accuracy improvements and 84% cost reductions. He highlighted the company's open-source Phoenix platform and Open Inference standard as entry points for developers, while discussing emerging trends like "cursor for X" and the rise of AI product managers. Patrick concluded by sharing his vision for success across Asia Pacific's diverse markets - from regulatory frameworks in Korea and Singapore to language localization challenges in Vietnam - emphasizing the three pillars that remain constant: helping customers make money, control costs, and manage risk in an era where AI governance has become paramount. Last but not least, he shares what great would look like for Arize AI in the Asia Pacific


"The mission is to make AI work for the people. It’s about getting AI working for everybody—consumers, customers, and businesses at large. Evals are the most important things that we’ve seen through 2025 and will see more of into 2026; they are the most important thing for systems to work. When I'm working with a customer, I ask: How are we going to help them make money? How are we going to help them control costs? And how are we going to help them manage risk? A lot of AI now is about managing risk." - Patrick Kelly, Vice President, Asia Pacific, Arize AI

Profile: Patrick Kelly, Vice President, Asia Pacific, Arize AI (LinkedIn)

Here is the edited transcript of our conversation:

Bernard Leong: Welcome to Analyse podcast, the premier podcast dedicated to dissecting the pulse of technology, media, and business globally. I'm Bernard Leong. How do we know the large language models or any AI agents are doing the right thing? Today we're going to explore the evolving world of AI agents evaluation and infrastructure through the lens of Asia Pacific. With me today is Patrick Kelly, an old friend of the podcast, Vice President for Asia Pacific at Arize AI, which I also gladly take some lessons from your founder in deep learning.ai, one of the leading companies helping the world to build and ship reliable AI agents in production. So Patrick, welcome to the show.

Patrick Kelly: Awesome, thanks. Great to be here again, Bernard.

Bernard Leong: Congratulations on your new role with Arize AI. As you have been on the show before and shared your early career journey and your last role, what have you been up to and what's the thing that led you to the current role?

Patrick Kelly: Last time we spoke, I was at Databricks. That's right, quite a while ago. I spent two years at Databricks and had a great experience. We're building out the business, building great outcomes for customers across Southeast Asia, and I think you've seen that at Data World tour last year. That's right. You were running the podcast during that event with two ex-colleagues of mine and you were interviewing our customer: Hafnia, which was fantastic. Data rich, great company, great vision, Lakehouse, having that kind of common source of data and then just the opportunity to build something new again.

We are builders at heart, coming from Amazon days, and it really got me that Arize AI came along and said, we're looking to build the business in APAC and if you look at what AI is trying to do, it's all about how do you ship AI that works?

Bernard Leong: Making AI work for the people, right?

Patrick Kelly: That's right.

Bernard Leong: Arize is probably first very well known for large language model evaluations. Now we go into agent evaluations and even in terms of thinking about the infrastructure side as well. I also found out from you that it's not just on the cloud, but also on-prem, but we are going to dive deeper into this, which comes to the main subject of the day where I want to talk about Arize AI and what's the market opportunity for them in the Asia Pacific. Maybe to simplify this, first is what is Arize AI's core mission? How does the company actually enable teams to build with large language models and AI agents?

Patrick Kelly: Really the mission is to make AI work and make AI work for the people. It's all about getting AI working for everybody—consumers, customers, businesses at large. I really think that we're at an area of AI adoption where the last couple of years since GPT came out, we've had chatbots, we've had RAG, we've had LLMs and some of them have been very successful around performing outcomes for customers. A lot of what Arize does is really helping customers observe what's happening in your AI solution or AI application, then being able to understand if something's going wrong—how do I evaluate it? And then how do I get that feedback loop back into the system again? Everyone's talking about agentic or agents at the moment and everyone's moving towards that as well. We just think agentic is just another series of steps that a system will take to get an outcome. So there may be an LLM call to OpenAI and then there'll be an API call to another system and then there'd be a call to read or retrieve something from the system, be it a database or a PDF, and then you put all of them together and that will deliver the response back to the user of the agent. So it's just an amalgamation of a lot of different systems coming together, but as that becomes more complex—

Bernard Leong: How do you understand the whole session end to end?

Patrick Kelly: That complexity also exacerbates if you have something a model context protocol, what people are going to call MCP servers, because you're trying to take different information from different servers and then you want to compress that context and then get the AI agent to do the work on all the different data and then synthesize what they're supposed to do for you.

Bernard Leong: Just to baseline my audience a bit, can you help them understand what a platform that supports AI agents really looks like from the Arize point of view? How do things development, observability, and evaluation come into play when you are thinking about generative AI applications?

Patrick Kelly: The first step is observing your system—understanding your system, seeing what it's doing, and that will tell you if there's a problem. That's step one. Second, then you think about, I know there's a problem, and then you find out what are the symptoms. I think about diagnostics. There's a bit of medical analogy here. Think about doctors. You've got symptoms. You go to a doctor. I've got symptoms. He's going to evaluate you. He's going to look at you. You've got a runny nose, your eyes are watering a bit. I think it might be COVID or might be influenza. But then you need to evaluate against a certain playbook or data. Doctors will measure you against certain temperatures and everything else. Same as in any system to evaluate. We can do prompt engineering, we can then understand what data is causing this hallucination in the LLM or the chatbot. Then most importantly, when you're doing the development, redevelopment of the application, you want to push that improvement back into the system. We think about how do you develop agents or AI systems, and then how do you push them into production? When you're in production, you observe and then you evaluate and you create this continuous cycle so hopefully your quality and your performance and everything is improving for the whole application.

Bernard Leong: That is transfer learning on what the evaluation sends back to the system so they can iterate and self-learn. One question is probably just a side note to this—it also goes through different types of environments? I'm talking about on-premises cloud or even what is called an air gap type environment. Am I right to say that?

Patrick Kelly: Yes. We have many customers—we operate in the different cloud providers, so Google, Microsoft, and AWS. But we can deploy on any piece of compute infrastructure, which is super important.

Bernard Leong: That is the biggest surprise for me.

Patrick Kelly: There are some enterprises where they need very air gap type applications. Could be factories, manufacturing, where they don't want to have any other people to access it.

Bernard Leong: So the Arize platform is pushing a container into the on-premises system?

Patrick Kelly: We run on premises. If they're running DeepSeek or any small language model or open source [LLM] models on their own GPU stack, we can then collect all the telemetry data, which is OpenTelemetry (Arize-OTel). We're a vendor agnostic company from the platform or the infrastructure we run on to the technologies that gather information OTel and everything else. Then as we move further up the stack on how you build an agent—LangGraph, LlamaIndex and CrewAI—we support all those frameworks as well. You can see us as supporting the whole ecosystem and it's super important for us to have open source at the core, which is OpenTelemetry. Then there's Open Inference, which is created by Arize AI, which is further semantics that can help understand what the LLM is doing.

Bernard Leong: When I teach the large language model courses to engineers, I deal with a lot more students who are working in on-prem environments. Evaluations is one of the biggest challenges. The only ones I knew about was often through cloud environment. Thanks for giving me this tip. I can go back and tell them, because I use Arize as an example in my classes, that they also do on-prem environments as well. Many listeners will be very familiar with what is called traditional A/B testing or maybe machine learning metrics, but I think AI evaluations are really different because you can prompt anything and then the AI can come back with a lot of different things. Can you help us understand what evaluation means in the context of large language models, LLMs and agents, and why they matter?

Patrick Kelly: Super important. You nailed it. Evals [Evaluations], I think, are the most important things that we've seen through 2025 and I see more into 2026 is the most important thing for systems to work. Coming back to when we worked at AWS and machine learning, we were thinking about how do we get a forecasting model to a certain level of deterministic outcome. Then we keep training the model, keep training the model, and then we get to a score that we want, which is very important. In the large language model world, in the Generative AI world, it's not as easy because it's non-deterministic. The output can be very different. Ask the same question to GPT twice and you get two different answers. We really want to make sure how are we able to bring in elements of evals. I think it's very important that there's two types. One is the offline evaluation. Give you an example—let's say your public sector or government customer is building some kind of system. They're developing and then they want to test it. So that's offline. There's a set of data and they're testing prompts against it. And then they're measuring—is it hallucination, is it toxic? Is it disclosing information? That's one type of it. But as you get into production, you want to be able to do that online as well. We're seeing a lot of techniques around online evaluations in production. For example, this is an example for us from Booking.com. They've got a couple of agents for trip planning and different systems.

What they're able to see is how is that agent performing in production in real time? They can do AB tests online in production. Product managers are quite important now in this space. You have the AI engineer and developer, but the product manager in the company is deciding what does this AI system look like? How do I do that online and be able to switch on the fly for customers? Then obviously over time they'll bring it back into development environment and update things. But sometimes this is a customer facing agent—you want to be able to catch it in real time, so it's minimal downtime.

Bernard Leong: If I can intercept here for the customer service bot example, typically there would be a customer service chatbot to do with specific products and services and then most people usually test with one, two questions. One of the things I always tell my class, please do not do that. Take the first two questions and go through ChatGPT or Claude and say, can you tell me what are the 40 to 50 ways of asking the same questions? They will generate all the questions and then you use that whole list of questions as a form of evaluating the prompts and the answers. Does it help in terms of from the Arize viewpoint when we look at traces and observability? Is that how you can even deal with things chatbot answering that gives you very nonlinear, two different answers maybe to the same question?

Patrick Kelly: Absolutely. You'd collect all those responses and then be able to put them into a certain dataset and then you'd be able to take that dataset back into your development environment and then you'd be able to run that testing against it to see what are the outcomes. Then you're able to score them and see this prompt triggered this. Maybe I could change the model as an example. That's possible. Or I could instruct it, put some instruction, I could change some code, I could put some different steps in to change things. All of that is available, and that really comes into the whole improvement piece. Once you get to the evaluation, then you improve—I'm going to improve this and this, and then I'm going to test it again. So all that's available.

Bernard Leong: I find it interesting because in most cloud environments these days, the AI builder actually gets the choice of which model to use, whether it's Azure Foundry, whether it's Amazon Bedrock, or even the Google Cloud platforms environment. Essentially you can actually slip in the evaluations even as if, let's say we switch models and your system should be able to tell that.

Patrick Kelly: A hundred percent. There's a playground where you can start running these prompts and you can play it GPT-4 and then I'll compare it against GPT-5 or I want to change it to Gemini. All that's available because we're pulling all the traces from all these different models using OpenTelemetry.

Bernard Leong: Let's walk through, let's say how a typical customer works with Arize. Where do they usually start? Is it mainly starting from the observability, the evaluation, the tracing, or is it something else?

Patrick Kelly: I'll always think back to what Peter Drucker says in business principles—you can't manage what you can't measure. Back to our Amazon days, we were very focused on: What are we going to measure? What are the elements of success here when we develop any business or technology? It really does start around the observability or tracing. About 50% of customers start there because they built some app and they're testing it and they're running it. That really comes to our open source area, which I'll speak about in a moment. That's about 50% there. We do see some customers who are very early—they're building the application and they start doing some evaluation on that in the development environment. But most of it does start in the observability piece. But then even those ones that start in development, they quickly get it into some kind of production or testing environment because you will never know how this AI application or agent works until you get user interaction.

Bernard Leong: Getting into that observability piece very quickly is very important. I guess most developers know through Phoenix—is that the open source version that usually how people get into the Arize AI ecosystem?

Patrick Kelly: We have Phoenix and we have Arize. Phoenix is the open source version where part of the community really embraces OTel and where Open Inference really started. That's where any developer can go download Phoenix, run on their laptop, build some AI application, and just start testing it and evaluating it.

Bernard Leong: I forgot to mention Arize is actually responsible for Open Inference. You want to elaborate on that?

Patrick Kelly: I'll just say that we've seen with OTel and then with LLM and semantics, there's a lot of other elements that need to be understood. Open Inference is super important that we built that on top and we see that's now being adopted by lots of different players in the industry, as well as OTel and Open Inference are really the standard now around how we're going to build evaluations.

Bernard Leong: Patrick, I really appreciate that you come on so early after you step into this role.

Patrick Kelly: Three weeks already.

Bernard Leong: It wouldn't be very fair if I were to ask you a question on how many customers are there already in Asia Pacific. That would be the wrong question. But what I want to know is maybe I do see some of the customer use cases from the website. Can you tell me some more stories that stand out to you so far? I know you already mentioned Booking.com, but I'm pretty sure there are other companies or maybe they might be in the region that are particularly exciting for you starting out in this region now?

Patrick Kelly: I mentioned Booking.com. I think that's a great use case for us. The travel industry is pretty interesting for us—Skyscanner and a lot of pricing.

Bernard Leong: Because of travel planning in AI.

Patrick Kelly: Exactly. Being such a cost sensitive market with dynamic pricing and everything else. We really want to capture a lot of customers and make sure to get them in and be able to drive that outcome. When we think about enterprise scale, think about Flipkart in India—one of the biggest e-commerce retailers, 600 million customers.

Bernard Leong: Flipkart is now owned by Walmart now.

Patrick Kelly: Exactly. That really shows massive scale, and that's where we help them on their customer facing agent of how to understand what the customer's doing, how to deflect any customer queries, and just having that kind of automation that can help drive down cost on their side, but then also drive GMV. Gross merchandise value (GMV) is so important for retailers so they could get more customers into the system and get them their products organized and get more sales through the whole platform.

Bernard Leong: What you are saying is that the customers, other than just doing the evaluations, they do the optimization at the same time when they're in the Arize platform, orchestrated, doesn't matter which cloud you're on. Is that correct?

Patrick Kelly: Yes, it is auto-optimization.

Bernard Leong: As in trying to optimize the agents to be able to try to optimize less amount of tokens.

Patrick Kelly: Yes. One of the big things we can track is all the cost—how much tokens are being generated. Even this morning I was playing around with the system and I was comparing GPT-4 and GPT-5. GPT-5 generates a lot more tokens.

Bernard Leong: As I know, 4.5 also generates a lot more tokens. Having a particular system that can actually help you to even visualize this—I think there's already a lot of value in that. Can you share also examples where you have a team that struggled to ship a lot of agents out there, and then how does Arize as a company help them to turn that around?

Patrick Kelly: A great example is AT&T in the US, obviously one of the world's biggest telcos. This is a RAG [retrieval augmented generation] use case where they had for their customer agents, the customer logs in, asks questions about my account, retrieving information about them, and then generating the response. They were having real problems with accuracy of the data. In that case, we worked with NVIDIA. NVIDIA was fine tuning the model to make sure it got better. We were using LLM as a judge, which is another type of evaluation. LLM as a judge is getting another LLM to measure that LLM and be able to tune it as well. Human interaction is great on these systems, but scale can be a problem. Bringing SMEs in is always super important, but we've got to think about scaling as well around using LLM as a judge. The output of that was we got a 40% improvement in accuracy and then 84% reduction in cost, which is massive for quite a lot of the amount of agents. Between human agents and then the AI agents, when you put that together, it just really helps them drive a lot of scale in the business.

Bernard Leong: This is the second time you're on the podcast, so this question shouldn't be difficult. What's the one thing you know about building or shipping AI agents in Asia Pacific that very few people do, but they should?

Patrick Kelly: I like this question. I think last time the MIT report just came out—the 95% of AI pilots kind of fail. I think that's still there. I think a lot of systems have been launched—a lot of successful LLMs, RAG systems, AI apps have been launched. I think the more interesting one is that agents that have been launched, about 50% of them fail in production because we don't know what's happening. This is what we see in Arize. But this is also backed up by SMU, the study from Singapore Management University. It's about this silent failure—we've shipped out these agents, but we don't actually know what's happening and are they actually producing any output?

Bernard Leong: When I was on the Arize website doing research for this episode, one of the things I think Arize seems to emphasize is building agents is hard, shipping them is harder. If I were to ask you this question—what does that mean in practical terms for teams building in Asia Pacific, but it can be anywhere?

Patrick Kelly: A prelude to that just in Asia Pacific is a lot of governments and regulatory are taking steps for this. That's right. Korea just launched a risk management framework for AI for financial services, making sure that every company needs to adopt this framework. It brings it back to the early cloud days—cost controls, cloud controls, security controls.

Bernard Leong: But you also have the same AI framework in Singapore, MAS for example, the transparency and all that. Correct. So it's the same thing?

Patrick Kelly: Correct. With MAS [Monetary Authority of Singapore], they just released a call for consultation again around the AI risk framework. That's going to close end of January. Then they're going to release what are the guidelines going to be for companies around AI risk management, which is super interesting.

Bernard Leong: Do you see institutions or maybe even research groups that actually deal specifically with the evaluation side or accessibility of safety? Because I think Arize in terms of what it's really trying to do is to make sure that the AI agents or the large language models are doing the right thing and also using different techniques LLM as a judge to help them to work through that at the same time.

Patrick Kelly: We do a lot of work on the academic side, working with a lot of researchers from foundation models labs, really understanding the technology piece—what is the future of, what does this look like? So we can then bubble that out into the market as well and make sure academia and R&D is going in the right direction as well. That's very important for us. Then obviously working with highly regulated industries banking and telco, which will have that really high standard of what can be done and what cannot be done within certain use cases.

Bernard Leong: Healthcare is in there.

Patrick Kelly: Healthcare is super important. We as a platform have all the certifications ISO and HIPAA and everyone else. That is very important that we can deploy for these markets.

Bernard Leong: I saw your CEO, Jason Lopatacki, has made 10 predictions for AI in 2026. One of the questions—you mentioned something sessions matter or harness as a buzzword. Can you tell me from his 10 predictions, which are the ones that the trends you see are actually most actively now playing out in Asia Pacific?

Patrick Kelly: I think the top one is cursor for X. It's cursor for everything. Cursor for AI engineering, cursor for product development.

Bernard Leong: I thought it would be Claude Code for everything now.

Patrick Kelly: Probably Claude Code now as well. Cursor's just one word. It's that cursor experience, the software dev, that kind of vibe experience is really important. For ourselves, we've built our own AI engineering assistant called Alex, which is in the platform. When we come back to that evaluation piece again, let's say you're observing some issue or hallucination, you can ask Alex, hey, what's happening with this AI application? And it will diagnose what's happening, give you an understanding—here's where the problem is. Maybe generate some kind of dataset to test again, and then you can run some evaluation framework off the back of it. The reason this is important—big companies have lots of data and millions of traces. How can they humanly manage all this? This cursor for X, this AI engineering assistant, Alex is super important for that.

Bernard Leong: If I were to dive deeper into the same question—is it because of what Arize is helping you to do is also to help you to work through what the context is when the AI agent makes a mistake? Because when you have information, for example a chatbot pulls context from different documents and sometimes having too much information actually makes its performance do badly. Part of that observability and evaluation is also where your technology is actually helping them to make that correct evaluation?

Patrick Kelly: Most definitely. Context matters. State matters. What is happening and what are we pulling from different systems? The idea is to hone in on, because the session will tell you everything end to end, and then you can hone in on specific pieces. You can test against that and remove it, but then you want to do some session evaluation as well. That's the next one. Cursor for X, and then session matters.

Bernard Leong: Session matters. I think you're right because from the cursor for X, the way I interpreted that prediction is more of saying that actually context does matter because of how you select that piece of information, how you optimize that. I give you an example—say you have a log summary of a chatbot per day. How do the companies able to know what are the most important issues that have come up for the customer service side? What they do is they run something called compressed context technique where they will do a daily summary and then they try to see these are the five things that happen. Maybe tomorrow we should try to push that in. A lot of that is actually sitting in the evaluation side.

Patrick Kelly: Yes, absolutely. You nailed it. You're an expert, Bernard.

Bernard Leong: I'm trying to make sure and working with an expert with you to work out whether I'm understanding this correctly and then also try to bring the context in because even as an AI builder, I will need to explain to people what the day to day AI looks like. Coming back, how are now enterprises here adapting to ideas feedback loops, online evaluations, and session level agent testing?

Patrick Kelly: If we think about what we said at the start, a lot of customers have built some AI applications—be a simple RAG chatbot or there's still a lot of companies have not. Where we really want to help is we think about AI engineering as an overall discipline or process. Really starting with where customers are and maybe they have an open source model, and they're really just defined for a specific industry niche or even simple HR internal use case. That's still super important that they're understanding it's given the right information to employees. But then as we merge out into the agentic world, that's where the real sessions come into play. Also Jason, our CEO, he talked about the word harness.

Bernard Leong: Which will be the new buzzword for orchestrating agent architectures. That will come out as well. Not to confuse with the company Harness, that's the software delivery company.

Patrick Kelly: Correct. Kudos to Adam Crew, my ex-colleague, he just joined as VP for Asia. Maybe could be another candidate for the podcast.

Bernard Leong: I'm going to get back to the podcast, but I want to get your sense on this. Do you find that most of these AI builders now are coming more towards to think through how do they know if their AI agents are doing the right thing? I give you an example—say if you ask the AI agent to plan the action, a lot of AI builders conveniently give it a step one, step two, step three, step four, and then sometimes the AI decided to be smart and gets ahead of the problem, tries step four, step three, step one, and step two, totally bypassing in order to achieve a velocity in terms of trying to get a job done. Do you find that enterprises here are actually more conscious about when their AI agents actually start making certain decisions that they shouldn't have? This ties in things hallucination with the AI agents maybe losing context as well.

Patrick Kelly: Absolutely. It's really about what is the system designed or what is the flow of different steps this AI application is going to take? Some of it is an LLM call. Some of it is an API call. Some of it is a database call. All these different things need to be mapped in. To your point where the LLM will get constructive around how you might prompt engineer as well. That's where we really want to get into—how do we manage prompts and how do we define prompts for our agent? Because someone may try to hack the system with a prompt. When this prompt comes in, this is what the system should do and should say—direct over here, or I don't know the answer, or define that as well. That's all in the control of when we do the eval and then when we do the actual feedback loop in how we define that back through the system.

Bernard Leong: I usually use a trigger test. The trigger test is this—I say something along the lines if the question is not relevant to what the customer chatbot is supposed to do, do not answer the question and say something funny.

Patrick Kelly: I don't know the answer.

Bernard Leong: That's it. I don't know. But I still want to see the trigger because I want to see that thing. It's an easier way to show my students, hey, how do you know if this is not working? I can only do it that way. One thing is getting interesting because you see this in AWS, then you see this in Databricks, and now you see this in Arize—what I want to talk about is actually how is the market evolving? Do you see more companies moving from LLM experiments to truly scalable AI agent systems?

Patrick Kelly: I still think we're a bit early for agents. Some companies are in the forefront. They're making moves to have very end-to-end agent flow and different routers going this way. Let's say the task comes in and the router can take five different steps to go. They design it through a LangGraph or a different system that. I think we're seeing some companies doing it, but we're still seeing some companies getting great value out of just having RAG within their business or just having some kind of chatbot that can actually give some value back. We really want to start with where customers are around their AI journey. But I think with agentic, it's really going to kick up this year, especially with the whole cursor for X, Claude Code for X, as we see more and more development of systems.

Bernard Leong: Do you find that the complexity of these tasks also makes it challenging from the evaluation point of view? Do you have some kind of point of view—maybe there's certain set of problems look this, certain set of problems look this and this is how we can do the evaluation?

Patrick Kelly: A hundred percent. That's where we built Alex, which is our system off the back of that. Because we've seen all these common patterns and trends of different systems and then we're able to build our—Alex is an agent we built ourselves internally. That's all trained based on what we know already. Then we're able to revise that back into the system and then give that all customers can use that based on their own data, their own traces.

Bernard Leong: Does Alex give any recommendations? Let's say if the evaluations is not giving me a very clear answer, what's going on?

Patrick Kelly: Yes, absolutely. Based on what it knows and it's continuously learning over time, which is really great because it's great value for engineering teams, which is very important. But then I think the product managers or domain users, not the technical users will come in and see, let me look at that output and then be able to put my view on it. For the travel use case, for example, this product manager, actually that's not what this booking system should say. Maybe it's a technical answer, but that should be enriched with business human annotation data along with what Alex can do as well.

Bernard Leong: The ideal customer profile for someone looking for Arize probably would be in the product side or maybe the business owner side as well?

Patrick Kelly: It's both. As I said, we really start on developers with the Phoenix—developers, AI engineers really work on the platform and use it to understand the application that's been built. But the product manager and now we're seeing the rise of the AI product manager.

Bernard Leong: It's the classic one—AI's not going to take your job, but product managers who are using AI might take your job.

Patrick Kelly: That's right. And then any SMEs are super important—really experts within the domain if it's manufacturing or banking or telco. Within telco, customer experience, advisor, how to monetize, they really know around customers and churn and ARPU. What does that really mean for how does the agent understand when customers are asking certain questions?

Bernard Leong: Do you see where in terms of this, a lot more enterprises will think of putting more evaluations to tests, or do they actually have a fixed set of tests and then throw it to the wild and let the feedback get back? Or is it just more trying to prevent as many disasters as possible?

Patrick Kelly: Should prevent as many as possible. But as we know, there's probably some AI systems in the wild already which people have just put out. We think that you can only test as much as you want before you ship. In software development, you test, then you ship, something will go wrong.

Bernard Leong: But the unknown unknowns are bigger in AI. Don't you notice that?

Patrick Kelly: Totally true. But that's the reason why I think online evaluations is going to be super important—online in production eval. We're actually online. We're actually understanding what it is and we're making these changes.

Bernard Leong: I appreciate what Arize is doing. Sometimes I felt they have a much more difficult task because there's so much unknown unknowns. Even as an AI builder, some days I'm asking myself—I'm going to release this AI agent. What's the chance of this AI agent doing something that was not supposed to do?

Patrick Kelly: We try to test as much as possible in development. That means really detailed structured evaluation of different edge cases. There'll always be edge cases.

Bernard Leong: The other magic question—what is the one question that you wish more people would ask you about AI infrastructure, specifically on development, observability and evaluation about Arize, but they don't?

Patrick Kelly: I think the main one is today about how do I get a full system into production that is then going to deliver the business metric.

Bernard Leong: I'm going to ask you that question then.

Patrick Kelly: I think the main one is if any company—if I'm working with any company and I've worked in software companies for many years and I'm working with a customer—how are we going to help them make money? How are we going to help them control costs, and how are we going to help them manage risk? I think a lot of it in AI now is managing risk. How do you manage that risk around your AI system, application agent? That's something that Arize, we really wake up every day and try and help customers.

Bernard Leong: It's interesting you just mentioned something, what stays the same? How do you make money and how do they make sure that they control their costs and manage risk. Typically most of the time throughout our careers, looking at cloud computing, looking at data infrastructure and AI, we are always—it's always the first two questions. I think this third question is actually becoming much more important.

Patrick Kelly: Super important, especially in regulated industries. Managing risk and anyone, any customer or company that has touch with us as consumers with banking—they've got millions of customers. We've had some cases of chatbots gone rogue. We won't mention the names of very famous cases, and that has financial implications, it has regulatory implications.

Bernard Leong: Have you ever checked in the archives at Arize? Have they done anything forensic where AI agents go back?

Patrick Kelly: We've seen the cases in other companies. It has been the case where customers we've worked with before, and then we worked with them and we did the evals and observability, and then we fixed it, which is great.

Bernard Leong: I'm pretty sure. That's a fair answer. What does success look like for Arize AI in the Asia Pacific region you're going to be working for the next few years?

Patrick Kelly: I just love Asia Pacific as a region. It's so diverse. We're all the way from India to Singapore, China, which is where most of the AI money boom is going—a lot of those AI startups. Japan and Korea, unique markets. Japan is really focused on innovation now. There's a lot of companies building lots of different applications. Korea with the regulatory stuff, they're really pushing ahead. And Australia, as we know, is a really super diverse market from heavy industry mining, but then also a bit of financial powerhouse as well. I think success is just how are we solving these problems? How are we managing risks for customers? And just building a really great business and building a great team that can help service customers.

Bernard Leong: But also I think you probably understood the complexity in Asia Pacific of its diversity as well. Do you think that things maybe language cultural nuances would also have a big impact on how the Arize platform can actually do things evaluations because there's evaluations in different languages? There's evaluations based on maybe some nuance, maybe some AI agents is supposed to—some nuance processes due to cultural reasons. Do you see those things coming in and just giving you new challenges?

Patrick Kelly: Absolutely. Especially as it gets into, let's say, a market Vietnam, for example, where it's one similar language. It really depends on the model—how can the model handle that local language as well? I think we've seen that over the years and the great work by the AI Singapore team on Singapore LLM to really adapt models for those local languages is super important. We would just really hook into that system. We really work with all these model providers that have great language capability, vernacular capability, and then we are able to pull all those traces. But to your point, if it's a particular market, then we'd really rely on, if it's a customer in Korea, obviously they speak Korean. That's the human annotation on the output. They're able to see the output of this in Korean, actually that sounds a bit off, or it's not right, or the context is off. That's the subject matter experts, human annotation that we'd look for.

Bernard Leong: Patrick, thank you so much for coming on the show and educating me on what Arize is doing. Thank you so much. In closing, I still have two more questions. One, any recommendations which have inspired you recently?

Patrick Kelly: I think the State of AI 2025 from Air Street Capital. I'm sure you read it. You read them all. I've probably seen your link. You've probably posted, but that was awesome because it started all the way from what's happening in research in AI, all through to political industry area. I thought that was super well rounded. It's a 300 page deck or something, but it's a great read for anyone to look at as well. I think that's fantastic.

Bernard Leong: It does help when you have NotebookLM now.

Patrick Kelly: Yes, exactly. NotebookLM can summarize.

Bernard Leong: I was wondering how does Google do their eval for NotebookLM? Sometimes they don't pull everything out, so I have to check on that. Second question. How do my audience find you and Arize AI?

Patrick Kelly: Find me on LinkedIn: Patrick Kelly, and email address pkelly at arize dot com. Arize AI is A-R-I-Z-E dot com.

Bernard Leong: You can definitely find the podcast in any podcast channel and of course subscribe to us and of course drop us any feedback. Patrick, many thanks for coming on the show and we will talk again soon.

Patrick Kelly: Thanks very much.

Podcast Information: Bernard Leong (@bernardleongLinkedin) hosts and produces the show. Proper credits for the intro and end music: "Energetic Sports Drive" and the episode is mixed & edited in both video and audio format by G. Thomas Craig (@gthomascraigLinkedIn). Here are the links to watch or listen to our podcast.

Comments