[{"content":"","date":"5 April 2026","externalUrl":null,"permalink":"/tags/ai/","section":"Tags","summary":"","title":"Ai","type":"tags"},{"content":"A few weeks ago I gave my first professional talk at Boston Code Camp. The session was called Your First AI Agent in Databricks: Building, Deploying, and Testing with Confidence, and it was one of the more rewarding experiences I\u0026rsquo;ve had in my career so far.\nWhat the Session Covered # The talk was aimed at beginner to intermediate practitioners who have probably experimented with AI agents in a notebook but haven\u0026rsquo;t yet figured out how to get one running reliably in production. The core thread running through it: getting an agent to work is very different from getting one to ship.\nWe walked through the full lifecycle — defining an agent using the MLflow Responses Agent framework, registering it in Unity Catalog, and deploying it through the Mosaic AI Gateway. From there we looked at two ways to actually use the agent: the AI Playground built into the Databricks UI, and AI_QUERY, which lets you call your agent directly from SQL — useful for things like running sentiment analysis on a table of customer reviews as part of a data pipeline.\nThe second half of the talk was about what happens after you deploy. MLflow tracing gives you full visibility into every request your agent handles, and LLM judges let you evaluate response quality at scale — not just whether the agent ran successfully, but whether it responded appropriately. We also covered the Mosaic AI Gateway\u0026rsquo;s inference tables, where every request gets logged automatically for auditing and analysis.\nThe final section was on stress testing. The short version: interact with your agent irresponsibly before your users get the chance to. 
Prompting edge cases, evaluating the results with LLM judges, and tightening your system prompt based on what you find is a much better discovery process than finding out in production.\nSlides and Code # Everything from the talk — the slides and the notebook template — is on GitHub:\ngithub.com/mnorberg-dev/first-ai-agent-in-databricks\nThe notebook is designed as a starter template, so if you want to follow along with the concepts from the talk you can work through it directly in your own Databricks environment.\nThe conversations after the session were great, and I\u0026rsquo;m already looking forward to the next Boston Code Camp in November. If you attended and have questions, or you know of a conference where this kind of topic would be a good fit, feel free to reach out.\n","date":"5 April 2026","externalUrl":null,"permalink":"/posts/boston-code-camp-ai-agents-databricks/","section":"Posts","summary":"A recap of my session at Boston Code Camp, covering how to build, deploy, and test AI agents in Databricks using the Responses Agent framework. 
Includes links to the slides and code on GitHub.","title":"Boston Code Camp Recap: Building AI Agents in Databricks","type":"posts"},{"content":"","date":"5 April 2026","externalUrl":null,"permalink":"/tags/data-engineering/","section":"Tags","summary":"","title":"Data Engineering","type":"tags"},{"content":"","date":"5 April 2026","externalUrl":null,"permalink":"/tags/databricks/","section":"Tags","summary":"","title":"Databricks","type":"tags"},{"content":"","date":"5 April 2026","externalUrl":null,"permalink":"/","section":"Matthew Norberg's Data Engineering Blog","summary":"","title":"Matthew Norberg's Data Engineering Blog","type":"page"},{"content":"","date":"5 April 2026","externalUrl":null,"permalink":"/tags/mlflow/","section":"Tags","summary":"","title":"Mlflow","type":"tags"},{"content":"","date":"5 April 2026","externalUrl":null,"permalink":"/posts/","section":"Posts","summary":"","title":"Posts","type":"posts"},{"content":"","date":"5 April 2026","externalUrl":null,"permalink":"/tags/","section":"Tags","summary":"","title":"Tags","type":"tags"},{"content":" TL;DR — The Cost-Saving Tip # If you’re deploying AI agents in Databricks using the Responses Agent Framework, clean up your old agent versions.\nEvery redeploy creates a new agent version associated with the endpoint. Even if those older versions receive 0% of traffic, Databricks still allocates compute to keep them alive. 
Over time, those unused versions can silently drive up your costs.\nDeleting unused versions cut our agent serving costs by ~50%.\nThe Mystery: Similar Agents, Very Different Costs # This story begins as a colleague and I were monitoring our AI agent costs using a Databricks cost dashboard we\u0026rsquo;d set up for our serving endpoints to keep an eye on spend.\nThat’s when we noticed something strange.\nSome agent endpoints were significantly more expensive than others, even though:\nThey followed the same overall structure and deployment pattern The more expensive agents weren\u0026rsquo;t doing significantly more computation than the cheaper ones These weren’t fundamentally different systems. They were variations on the same pattern. Yet the cost differences were hard to ignore.\nThe Red Flag: Development Was More Expensive Than Production # The real red flag appeared when we compared environments.\nIn several cases, the same agent deployed to dev, QA, and Prod showed dramatically different costs. Even more counterintuitive, production was cheaper (sometimes significantly cheaper) than development, despite handling more requests.\nThis immediately told us this wasn\u0026rsquo;t a straightforward usage problem. If production traffic was higher but the baseline cost was lower, then something structural, not workload-driven, had to be influencing the spend.\nOur First Approach: Revisiting the Documentation # Our first step was to revisit the documentation.\nTo Databricks’ credit, the documentation around agent deployment and serving endpoints is strong, especially given how new this functionality is. However, it didn\u0026rsquo;t answer the question we were trying to solve.\nWe couldn’t find anything that explained why similarly structured agents, deployed across different environments, could have such large cost discrepancies, or what underlying factors might cause that behavior.\nThat\u0026rsquo;s not a knock on the docs. 
It\u0026rsquo;s more a reflection of how cutting-edge this part of the platform is. When features are this new, operational details only surface once people start running them at scale.\nWith no clear explanation to point us in the right direction, the only option left was to start forming hypotheses and testing them against real data.\nHypothesis #1: Streaming vs. Non-Streaming Agents # Our first theory was that streaming agents might be more expensive than non-streaming ones.\nIt sounded reasonable:\nStreaming keeps connections open longer It feels like it should consume more resources The billing model isn’t always obvious To test this, we compared the cost of non-streaming agents with streaming-enabled agents that had comparable usage patterns.\nWhat we found was that the non-streaming endpoints tracked almost identically in cost to their streaming counterparts. There was no meaningful difference that suggested streaming itself was driving higher costs.\nResult: ❌ Not the culprit.\nHypothesis #2: External Models vs. Foundation Models # Some of our agents relied on Databricks-hosted foundation models, while others routed requests to external models, specifically GPT models hosted in Azure Foundry.\nOur theory was that agents invoking external models were more expensive because each request had to leave Databricks, call an externally hosted model, and then return the response. That extra hop felt like a natural place for additional cost.\nAt a high level, those agents worked like this:\nThe agent receives a request The agent calls an Azure-hosted GPT model The agent processes the response The agent returns the final result This hypothesis seemed plausible because our most expensive agent used an external model. 
It felt like we had figured out why some agents were more costly than others.\nBut then we looked more closely at our production environment, where we had two agents with comparable usage:\nOne calling an external model hosted in Azure Foundry One using a Databricks-hosted foundation model Despite the difference in model type, their costs were nearly identical. That single comparison was enough to disprove the theory.\nConclusion: ❌ External models were not the primary cost driver.\nThe Breakthrough: Inspecting Endpoint Configurations # At this point, we were stumped.\nWe had walked through our most obvious theories and tested them against real endpoints with real cost data. None of them explained what we were seeing, and we didn’t have another clear hypothesis to test.\nSo instead of proposing a new hypothesis, we changed tactics. We began inspecting the serving endpoints themselves, going slowly and methodically through the configuration details to see if there was something we had missed.\nThat’s when we noticed something we had previously overlooked.\nAgent Deployment Functionality # It turned out that the root cause of our higher costs had nothing to do with streaming, traffic volume, or model selection.\nInstead, it came down to the default behavior of the agents.deploy() function we were using to create and update our endpoints in Databricks.\nBelow is a simplified version of the deployment pattern we were using. If this looks familiar, it’s because I’ve covered this exact workflow in an earlier post (Tracing with Databricks Mosaic AI Gateway). 
For a deeper walkthrough, I\u0026rsquo;d recommend starting there.\nimport mlflow from mlflow.types.responses import ResponsesAgentRequest from mlflow.models.resources import DatabricksServingEndpoint import model UC_LOCATION = f\u0026#34;workspace.default.mn-ai-agent-demo\u0026#34; example = [ {\u0026#34;role\u0026#34;: \u0026#34;system\u0026#34;, \u0026#34;content\u0026#34;: \u0026#34;You are a helpful assistant.\u0026#34;}, {\u0026#34;role\u0026#34;: \u0026#34;user\u0026#34;, \u0026#34;content\u0026#34;: \u0026#34;What is the fibonacci sequence\u0026#34;}, ] with mlflow.start_run(): logged_agent_info = mlflow.pyfunc.log_model( name=\u0026#34;mn-ai-agent-demo\u0026#34;, python_model=\u0026#34;model.py\u0026#34;, input_example=ResponsesAgentRequest(input=example), registered_model_name=UC_LOCATION, resources=[DatabricksServingEndpoint(endpoint_name=\u0026#34;databricks-meta-llama-3-1-8b-instruct\u0026#34;)], ) # COMMAND ---------- from databricks import agents agents.deploy( UC_LOCATION, model_version=logged_agent_info.registered_model_version, scale_to_zero=True, ) The first block registers the model in Unity Catalog. The second deploys an agent endpoint that serves the latest model version.\nAfter the deployment completes, the endpoint enters a Ready state. In the serving endpoints UI, you’ll see that 100% of traffic is routed to version 1 of the model.\nSo far, everything looks exactly as expected.\nWhat Happens on Redeploy # Now let\u0026rsquo;s say the endpoint isn\u0026rsquo;t quite right. Maybe you found a bug that needs fixing, added a new feature, or want to tweak the system prompt to improve response quality.\nTo apply any of these changes, you update model.py and rerun the same deployment code. 
This registers a new model version and redeploys the agent.\nAfter the redeploy completes, the traffic configuration now looks like this:\nIn the image above, you can see that version 2 is getting 100% of the traffic while version 1 gets 0%.\nBy default, this pattern continues. Each new redeploy sends 100% of traffic to the latest version and 0% to all previous versions.\nIn fact, the ability to send traffic to different model versions can be very useful. It allows you to:\nGradually roll out new versions of your agent Run A/B tests Send portions of traffic to different base models (e.g., GPT-5.1 vs GPT-5.2) But while this capability is powerful, there\u0026rsquo;s a critical catch that can quietly drive up your costs.\nThe Critical Detail: 0% Traffic Does Not Mean 0% Cost # What we didn’t realize at first is that older versions are still active and allocated compute, even when they receive no traffic.\nEach time we updated the model and redeployed the agent, we added another version to the serving endpoint. Traffic moved entirely to the newest version, but the older versions continued running, quietly consuming resources and costing money despite doing no useful work.\nLet’s pause briefly to clarify an important detail.\nIn the free edition of Databricks, the scale_to_zero option must be enabled; otherwise model deployment fails (they can\u0026rsquo;t give everything away for free). In professional environments, this parameter is not required and is often intentionally left disabled to avoid cold-start delays and keep agents responsive.\nIn the image below, you can see a CPU utilization graph showing the moment when version 2 takes over. CPU usage for version 1 drops to zero while CPU usage for version 2 begins to rise.\nThe reason CPU usage drops to zero is that scale_to_zero=True is set. In our professional environment, where we intentionally left this setting disabled, older versions remained active. 
As a result, we observed one CPU usage line per model version, each consuming resources even though only the latest version was handling requests.\nWe were literally paying to keep old agent versions alive, even though they weren’t doing any work.\nThe Experiment That Confirmed It # To validate the theory, we ran a simple experiment.\nWe removed the unused agent versions from the serving endpoint and left only the active version receiving traffic. Then we monitored the cost dashboard over the following days.\nThe impact was substantial. After deleting the extra model versions, our spending dropped significantly. We observed roughly a 50% reduction in cost, without changing any code or altering traffic patterns.\nThe only thing we did was clean up unused versions.\nWhy Development Was More Expensive Than Production # With this discovery, we finally understood why running the same agent cost significantly more in dev than in production.\nIn production:\nDeployments are infrequent Changes are deliberate Version counts stay low In development:\nDeploy Test Fix Deploy again Repeat… often While production might have one or two active versions, development could easily accumulate two or three times as many versions, all receiving 0% traffic, all consuming compute, and all costing money.\nWhat Can We Do? # The good news is that fixing this is straightforward.\nIn the serving endpoint UI, you can remove older versions by navigating to the endpoint configuration and deleting unused versions so that only the active one remains.\nThe bad news is that there’s currently no way to automate this cleanup through the Databricks SDK.\nIt may be possible to do this via the REST API by updating the serving endpoint configuration directly using the update config endpoint. 
You could try calling this API from the deployment notebook using the requests library after the call to agents.deploy().\nOne important caveat though: endpoint updates take time, often around 15 minutes, and it\u0026rsquo;s unclear how Databricks handles multiple updates issued back-to-back. For example, if you call the REST API immediately after agents.deploy() while the deployment is still in progress, the behavior is uncertain. That\u0026rsquo;s something you\u0026rsquo;d want to test carefully before relying on it in production. It might even make a great topic for a follow-up post.\nFor now, this is a manual step, and it\u0026rsquo;s an easy one to overlook. Unless you explicitly want to split traffic across versions, leaving old versions around just burns money. Ideally, the SDK would offer a way to automatically clean up old versions during deployment, but until that functionality is added, you\u0026rsquo;ll either need to build your own solution (the REST API approach above would probably be the easiest path) or simply stay aware of it and clean up versions periodically.\nFinal Takeaway # If you’re deploying AI agents in Databricks:\nRedeploys stack versions by default Old versions still consume compute Development environments accumulate versions quickly Costs can grow silently if you’re not paying attention Cleaning up unused agent versions is one of the simplest cost-saving steps you can take — and one of the easiest to miss.\nHopefully, this saves you the same head-scratching moment we had when our dev environment started costing more than production.\n","date":"12 February 2026","externalUrl":null,"permalink":"/posts/ai-agent-cost-savings/","section":"Posts","summary":"Redeploying AI agents in Databricks can quietly increase serving costs in ways that aren’t immediately obvious. Each call to \u003ccode\u003eagents.deploy()\u003c/code\u003e creates a new agent version, and even versions receiving 0% of traffic may still consume compute resources. 
In this post, I walk through how we uncovered this behavior, the hypotheses we tested, and the experiment that confirmed it. Cleaning up unused agent versions ultimately reduced our serving costs by roughly 50%.","title":"The Hidden Cost of Databricks AI Agent Redeploys","type":"posts"},{"content":" When you deploy an AI agent in Databricks using the Mosaic AI Gateway, one very nice thing happens automatically: every request to your agent, along with the corresponding response, is logged for you. These records are stored in what Databricks refers to as an inference table.\nAt first glance, it feels like you’ve achieved agent observability. Each request and response is stored for you by default, without the developer needing to write any additional logging code.\nIn reality, having the data and being able to use it are very different things. Turning inference table data into something you can reliably analyze means building additional processing pipelines to extract, normalize, and structure the data.\nHowever, adding more pipelines also increases the complexity of your data stack. Each new step introduces assumptions about how the model behaves and what the data will look like, and those assumptions need to hold up in production.\nThis is where many data teams run into trouble. Processing pipelines are often designed for how the model is expected to behave, without considering the full range of outcomes. Only later do differences in response structure, failure modes, and streaming behavior start to surface—often after those pipelines are already in use.\nThis post takes a data-first approach to building AI processing pipelines. 
Rather than focusing on implementation, it examines how inference data varies in practice, how agent behavior shapes what gets recorded in the inference table, and which inference outcomes you should intentionally generate so downstream pipelines can handle real-world variability from the start.\nThe Inference Table: Valuable, but Not Analysis-Ready # The inference table stores each request and its corresponding response in the request and response columns. Both are stored as strings containing deeply nested JSON.\nInside those JSON blobs is everything you might want to analyze:\nInput, completion, and total token counts Prompt and content filter results (especially relevant with Azure OpenAI) User or caller identity Error information MLflow tracing metadata The inference table captures all of this data, but it isn’t structured for answering questions. As soon as you start asking things like how many tokens are being consumed, which requests are triggering safety or content filters, or why certain requests failed, you quickly realize that the inference table is a logging table, not an analytics table. It’s optimized for completeness, not usability.\nHere’s a simplified example of what a single response value can look like in the inference table:\n{ \u0026#34;object\u0026#34;: \u0026#34;response\u0026#34;, \u0026#34;output\u0026#34;: [ { \u0026#34;type\u0026#34;: \u0026#34;message\u0026#34;, \u0026#34;id\u0026#34;: \u0026#34;\u0026lt;redacted\u0026gt;\u0026#34;, \u0026#34;content\u0026#34;: [ { \u0026#34;text\u0026#34;: \u0026#34;Absolutely! Here\\u2019s a classic, crowd-pleasing **Apple Pie Recipe** perfect for Thanksgiving. It features a flaky crust and ... 
\u0026#34;, \u0026#34;type\u0026#34;: \u0026#34;output_text\u0026#34; } ], \u0026#34;role\u0026#34;: \u0026#34;assistant\u0026#34; } ], \u0026#34;id\u0026#34;: \u0026#34;\u0026lt;redacted\u0026gt;\u0026#34;, \u0026#34;databricks_output\u0026#34;: { \u0026#34;trace\u0026#34;: { \u0026#34;info\u0026#34;: { \u0026#34;trace_id\u0026#34;: \u0026#34;\u0026lt;redacted\u0026gt;\u0026#34;, \u0026#34;client_request_id\u0026#34;: \u0026#34;\u0026lt;redacted\u0026gt;\u0026#34;, \u0026#34;trace_location\u0026#34;: { \u0026#34;type\u0026#34;: \u0026#34;MLFLOW_EXPERIMENT\u0026#34;, \u0026#34;mlflow_experiment\u0026#34;: { \u0026#34;experiment_id\u0026#34;: \u0026#34;\u0026lt;redacted\u0026gt;\u0026#34; } }, \u0026#34;request_time\u0026#34;: \u0026#34;2025-12-02T16:52:20.818Z\u0026#34;, \u0026#34;state\u0026#34;: \u0026#34;OK\u0026#34;, \u0026#34;trace_metadata\u0026#34;: { \u0026#34;mlflow.modelId\u0026#34;: \u0026#34;\u0026lt;redacted\u0026gt;\u0026#34;, \u0026#34;mlflow.trace_schema.version\u0026#34;: \u0026#34;3\u0026#34;, \u0026#34;mlflow.trace.tokenUsage\u0026#34;: \u0026#34;{\\\u0026#34;input_tokens\\\u0026#34;: 27, \\\u0026#34;output_tokens\\\u0026#34;: 641, \\\u0026#34;total_tokens\\\u0026#34;: 668}\u0026#34;, \u0026#34;mlflow.databricks.modelServingEndpointName\u0026#34;: \u0026#34;\u0026#34;, \u0026#34;app_version_id\u0026#34;: \u0026#34;\u0026lt;redacted\u0026gt;\u0026#34;, \u0026#34;is_truncated\u0026#34;: false }, \u0026#34;request_preview\u0026#34;: \u0026#34;What is a good apple pie recipe to use on Thanksgiving\u0026#34;, \u0026#34;response_preview\u0026#34;: \u0026#34;Absolutely! Here\\u2019s a classic, crowd-pleasing **Apple Pie Recipe** perfect for Thanksgiving. It features a flaky crust and ... 
\u0026#34;, \u0026#34;execution_duration_ms\u0026#34;: 5098 }, \u0026#34;data\u0026#34;: { \u0026#34;spans\u0026#34;: [ { \u0026#34;trace_id\u0026#34;: \u0026#34;\u0026lt;redacted\u0026gt;\u0026#34;, \u0026#34;span_id\u0026#34;: \u0026#34;\u0026lt;redacted\u0026gt;\u0026#34;, \u0026#34;parent_span_id\u0026#34;: null, \u0026#34;name\u0026#34;: \u0026#34;predict\u0026#34;, \u0026#34;start_time_unix_nano\u0026#34;: 1764694340818061307, \u0026#34;end_time_unix_nano\u0026#34;: 1764694345916778373, \u0026#34;events\u0026#34;: [], \u0026#34;status\u0026#34;: { \u0026#34;code\u0026#34;: \u0026#34;STATUS_CODE_OK\u0026#34;, \u0026#34;message\u0026#34;: \u0026#34;\u0026#34; }, \u0026#34;attributes\u0026#34;: { \u0026#34;mlflow.traceRequestId\u0026#34;: \u0026#34;\\\u0026#34;\u0026lt;redacted\u0026gt;\\\u0026#34;\u0026#34;, \u0026#34;mlflow.spanType\u0026#34;: \u0026#34;\\\u0026#34;AGENT\\\u0026#34;\u0026#34;, \u0026#34;mlflow.spanFunctionName\u0026#34;: \u0026#34;\\\u0026#34;predict\\\u0026#34;\u0026#34;, \u0026#34;mlflow.spanInputs\u0026#34;: \u0026#34;{\\\u0026#34;request\\\u0026#34;: {\\\u0026#34;tool_choice\\\u0026#34;: null, \\\u0026#34;truncation\\\u0026#34;: null, \\\u0026#34;max_output_tokens\\\u0026#34;: null, \\\u0026#34;metadata\\\u0026#34;: null, \\\u0026#34;parallel_tool_calls\\\u0026#34;: null, \\\u0026#34;tools\\\u0026#34;: null, \\\u0026#34;reasoning\\\u0026#34;: null, \\\u0026#34;store\\\u0026#34;: null, \\\u0026#34;stream\\\u0026#34;: null, \\\u0026#34;temperature\\\u0026#34;: null, \\\u0026#34;text\\\u0026#34;: null, \\\u0026#34;top_p\\\u0026#34;: null, \\\u0026#34;user\\\u0026#34;: null, \\\u0026#34;input\\\u0026#34;: [{\\\u0026#34;status\\\u0026#34;: null, \\\u0026#34;content\\\u0026#34;: \\\u0026#34;You are a helpful assistant\\\u0026#34;, \\\u0026#34;role\\\u0026#34;: \\\u0026#34;system\\\u0026#34;, \\\u0026#34;type\\\u0026#34;: \\\u0026#34;message\\\u0026#34;}, {\\\u0026#34;status\\\u0026#34;: null, \\\u0026#34;content\\\u0026#34;: \\\u0026#34;What is 
a good apple pie recipe to use on Thanksgiving\\\u0026#34;, \\\u0026#34;role\\\u0026#34;: \\\u0026#34;user\\\u0026#34;, \\\u0026#34;type\\\u0026#34;: \\\u0026#34;message\\\u0026#34;}], \\\u0026#34;custom_inputs\\\u0026#34;: null, \\\u0026#34;context\\\u0026#34;: null}}\u0026#34;, \u0026#34;mlflow.spanOutputs\u0026#34;: \u0026#34;{\\\u0026#34;tool_choice\\\u0026#34;: null, \\\u0026#34;truncation\\\u0026#34;: null, \\\u0026#34;id\\\u0026#34;: null, \\\u0026#34;created_at\\\u0026#34;: null, \\\u0026#34;error\\\u0026#34;: null, \\\u0026#34;incomplete_details\\\u0026#34;: null, \\\u0026#34;instructions\\\u0026#34;: null, \\\u0026#34;metadata\\\u0026#34;: null, \\\u0026#34;model\\\u0026#34;: null, \\\u0026#34;object\\\u0026#34;: \\\u0026#34;response\\\u0026#34;, \\\u0026#34;output\\\u0026#34;: [{\\\u0026#34;type\\\u0026#34;: \\\u0026#34;message\\\u0026#34;, \\\u0026#34;id\\\u0026#34;: \\\u0026#34;\u0026lt;redacted\u0026gt;\\\u0026#34;, \\\u0026#34;content\\\u0026#34;: [{\\\u0026#34;text\\\u0026#34;: \\\u0026#34;Absolutely! Here\\u2019s a classic, crowd-pleasing **Apple Pie Recipe** perfect for Thanksgiving. It features a flaky crust and ... 
\\\u0026#34;, \\\u0026#34;type\\\u0026#34;: \\\u0026#34;output_text\\\u0026#34;}], \\\u0026#34;role\\\u0026#34;: \\\u0026#34;assistant\\\u0026#34;}], \\\u0026#34;parallel_tool_calls\\\u0026#34;: null, \\\u0026#34;temperature\\\u0026#34;: null, \\\u0026#34;tools\\\u0026#34;: null, \\\u0026#34;top_p\\\u0026#34;: null, \\\u0026#34;max_output_tokens\\\u0026#34;: null, \\\u0026#34;previous_response_id\\\u0026#34;: null, \\\u0026#34;reasoning\\\u0026#34;: null, \\\u0026#34;status\\\u0026#34;: null, \\\u0026#34;text\\\u0026#34;: null, \\\u0026#34;usage\\\u0026#34;: null, \\\u0026#34;user\\\u0026#34;: null, \\\u0026#34;custom_outputs\\\u0026#34;: null}\u0026#34; } }, { \u0026#34;trace_id\u0026#34;: \u0026#34;\u0026lt;redacted\u0026gt;\u0026#34;, \u0026#34;span_id\u0026#34;: \u0026#34;\u0026lt;redacted\u0026gt;\u0026#34;, \u0026#34;parent_span_id\u0026#34;: \u0026#34;\u0026lt;redacted\u0026gt;\u0026#34;, \u0026#34;name\u0026#34;: \u0026#34;Completions\u0026#34;, \u0026#34;start_time_unix_nano\u0026#34;: 1764694340819104621, \u0026#34;end_time_unix_nano\u0026#34;: 1764694345916576447, \u0026#34;events\u0026#34;: [], \u0026#34;status\u0026#34;: { \u0026#34;code\u0026#34;: \u0026#34;STATUS_CODE_OK\u0026#34;, \u0026#34;message\u0026#34;: \u0026#34;\u0026#34; }, \u0026#34;attributes\u0026#34;: { \u0026#34;mlflow.traceRequestId\u0026#34;: \u0026#34;\\\u0026#34;\u0026lt;redacted\u0026gt;\\\u0026#34;\u0026#34;, \u0026#34;mlflow.spanType\u0026#34;: \u0026#34;\\\u0026#34;CHAT_MODEL\\\u0026#34;\u0026#34;, \u0026#34;mlflow.spanInputs\u0026#34;: \u0026#34;{\\\u0026#34;model\\\u0026#34;: \\\u0026#34;\u0026lt;redacted\u0026gt;\\\u0026#34;, \\\u0026#34;messages\\\u0026#34;: [{\\\u0026#34;content\\\u0026#34;: \\\u0026#34;You are a helpful assistant\\\u0026#34;, \\\u0026#34;role\\\u0026#34;: \\\u0026#34;system\\\u0026#34;}, {\\\u0026#34;content\\\u0026#34;: \\\u0026#34;What is a good apple pie recipe to use on Thanksgiving\\\u0026#34;, \\\u0026#34;role\\\u0026#34;: 
\\\u0026#34;user\\\u0026#34;}], \\\u0026#34;temperature\\\u0026#34;: 0.5, \\\u0026#34;max_completion_tokens\\\u0026#34;: null, \\\u0026#34;stream\\\u0026#34;: false}\u0026#34;, \u0026#34;model\u0026#34;: \u0026#34;\\\u0026#34;\u0026lt;redacted\u0026gt;\\\u0026#34;\u0026#34;, \u0026#34;temperature\u0026#34;: \u0026#34;0.5\u0026#34;, \u0026#34;max_completion_tokens\u0026#34;: \u0026#34;null\u0026#34;, \u0026#34;stream\u0026#34;: \u0026#34;false\u0026#34;, \u0026#34;mlflow.message.format\u0026#34;: \u0026#34;\\\u0026#34;openai\\\u0026#34;\u0026#34;, \u0026#34;mlflow.chat.tokenUsage\u0026#34;: \u0026#34;{\\\u0026#34;input_tokens\\\u0026#34;: 27, \\\u0026#34;output_tokens\\\u0026#34;: 641, \\\u0026#34;total_tokens\\\u0026#34;: 668}\u0026#34;, \u0026#34;mlflow.spanOutputs\u0026#34;: \u0026#34;{\\\u0026#34;id\\\u0026#34;: \\\u0026#34;\u0026lt;redacted\u0026gt;\\\u0026#34;, \\\u0026#34;choices\\\u0026#34;: [{\\\u0026#34;finish_reason\\\u0026#34;: \\\u0026#34;stop\\\u0026#34;, \\\u0026#34;index\\\u0026#34;: 0, \\\u0026#34;logprobs\\\u0026#34;: null, \\\u0026#34;message\\\u0026#34;: {\\\u0026#34;content\\\u0026#34;: \\\u0026#34;Absolutely! Here\\u2019s a classic, crowd-pleasing **Apple Pie Recipe** perfect for Thanksgiving. It features a flaky crust and ... 
\\\u0026#34;, \\\u0026#34;refusal\\\u0026#34;: null, \\\u0026#34;role\\\u0026#34;: \\\u0026#34;assistant\\\u0026#34;, \\\u0026#34;annotations\\\u0026#34;: [], \\\u0026#34;audio\\\u0026#34;: null, \\\u0026#34;function_call\\\u0026#34;: null, \\\u0026#34;tool_calls\\\u0026#34;: null}, \\\u0026#34;content_filter_results\\\u0026#34;: {\\\u0026#34;hate\\\u0026#34;: {\\\u0026#34;filtered\\\u0026#34;: false, \\\u0026#34;severity\\\u0026#34;: \\\u0026#34;safe\\\u0026#34;}, \\\u0026#34;protected_material_code\\\u0026#34;: {\\\u0026#34;filtered\\\u0026#34;: false, \\\u0026#34;detected\\\u0026#34;: false}, \\\u0026#34;protected_material_text\\\u0026#34;: {\\\u0026#34;filtered\\\u0026#34;: false, \\\u0026#34;detected\\\u0026#34;: false}, \\\u0026#34;self_harm\\\u0026#34;: {\\\u0026#34;filtered\\\u0026#34;: false, \\\u0026#34;severity\\\u0026#34;: \\\u0026#34;safe\\\u0026#34;}, \\\u0026#34;sexual\\\u0026#34;: {\\\u0026#34;filtered\\\u0026#34;: false, \\\u0026#34;severity\\\u0026#34;: \\\u0026#34;safe\\\u0026#34;}, \\\u0026#34;violence\\\u0026#34;: {\\\u0026#34;filtered\\\u0026#34;: false, \\\u0026#34;severity\\\u0026#34;: \\\u0026#34;safe\\\u0026#34;}}}], \\\u0026#34;created\\\u0026#34;: 1764694340, \\\u0026#34;model\\\u0026#34;: \\\u0026#34;gpt-4.1-2025-04-14\\\u0026#34;, \\\u0026#34;object\\\u0026#34;: \\\u0026#34;chat.completion\\\u0026#34;, \\\u0026#34;service_tier\\\u0026#34;: null, \\\u0026#34;system_fingerprint\\\u0026#34;: \\\u0026#34;fp_f99638a8d7\\\u0026#34;, \\\u0026#34;usage\\\u0026#34;: {\\\u0026#34;completion_tokens\\\u0026#34;: 641, \\\u0026#34;prompt_tokens\\\u0026#34;: 27, \\\u0026#34;total_tokens\\\u0026#34;: 668, \\\u0026#34;completion_tokens_details\\\u0026#34;: {\\\u0026#34;accepted_prediction_tokens\\\u0026#34;: 0, \\\u0026#34;audio_tokens\\\u0026#34;: 0, \\\u0026#34;reasoning_tokens\\\u0026#34;: 0, \\\u0026#34;rejected_prediction_tokens\\\u0026#34;: 0}, \\\u0026#34;prompt_tokens_details\\\u0026#34;: {\\\u0026#34;audio_tokens\\\u0026#34;: 0, 
\\\u0026#34;cached_tokens\\\u0026#34;: 0}}, \\\u0026#34;prompt_filter_results\\\u0026#34;: [{\\\u0026#34;prompt_index\\\u0026#34;: 0, \\\u0026#34;content_filter_results\\\u0026#34;: {\\\u0026#34;hate\\\u0026#34;: {\\\u0026#34;filtered\\\u0026#34;: false, \\\u0026#34;severity\\\u0026#34;: \\\u0026#34;safe\\\u0026#34;}, \\\u0026#34;jailbreak\\\u0026#34;: {\\\u0026#34;filtered\\\u0026#34;: false, \\\u0026#34;detected\\\u0026#34;: false}, \\\u0026#34;self_harm\\\u0026#34;: {\\\u0026#34;filtered\\\u0026#34;: false, \\\u0026#34;severity\\\u0026#34;: \\\u0026#34;safe\\\u0026#34;}, \\\u0026#34;sexual\\\u0026#34;: {\\\u0026#34;filtered\\\u0026#34;: false, \\\u0026#34;severity\\\u0026#34;: \\\u0026#34;safe\\\u0026#34;}, \\\u0026#34;violence\\\u0026#34;: {\\\u0026#34;filtered\\\u0026#34;: false, \\\u0026#34;severity\\\u0026#34;: \\\u0026#34;safe\\\u0026#34;}}}]}\u0026#34; } } ] } }, \u0026#34;databricks_request_id\u0026#34;: \u0026#34;\u0026lt;redacted\u0026gt;\u0026#34; } } Note: IDs are redacted and the assistant output is truncated for readability. This example comes from a predict request. Streaming responses (predict_stream) are typically larger and harder to present cleanly in a blog post.\nCode blocks are also line-wrapped in this post so it’s easier to see the full paths and values, even though it makes the JSON look a little less neat.\nFor data engineers, this usually means building a data pipeline that extracts and normalizes these values into well-typed columns. Those tables can then be wired up to tools like Genie, which uses an LLM to answer questions over the data, or surfaced through downstream analytic dashboards.\nThe Real Goal: A Gold-Quality Inference Dataset # What we ultimately want is straightforward:\nOne row per logical request Stable, well-typed columns Easy aggregation and filtering More concretely, this means pulling the most important fields from the raw request and response JSON and promoting them to first-class columns. 
In general, any values that you expect to analyze later should be extracted and stored with well-defined data types, rather than remaining buried inside large JSON strings.\nToken usage is a good example. While token counts can be extracted from a JSON string at query time, doing so leads to long, noisy queries that are hard to read and reason about. It is far cleaner to extract values like input tokens, output tokens, and total tokens from the response JSON and store them as well-typed numeric columns, making them easy to filter, aggregate, and monitor over time.\nOnce you have this kind of structure in place, analysis becomes straightforward. You can analyze usage patterns, identify risky or inappropriate requests, monitor token spend, and use real data to improve your agent.\nBut there’s a catch that isn’t obvious at first.\nYou can’t design processing pipelines that produce gold-quality tables, and that are both correct and resilient, until you understand every shape your inference data can take.\nWhat the Documentation Doesn\u0026rsquo;t Emphasize # One of the lessons I learned the hard way is that the JSON written to the response column does not have a fixed shape.\nIn practice, it varies based on three factors:\nWhich inference endpoint is called (predict or predict_stream) Whether the underlying model throws an error How your agent code handles that error, if one occurs The first distinction is easy to overlook. A request made to predict returns a single response, while a request made to predict_stream returns content incrementally. As a result, the JSON written to the response column has a different shape depending on which endpoint is used.\nThe second and third factors are related but distinct. Whether the model throws an error tells you whether something went wrong. 
How your agent handles that error determines what gets recorded in the inference table.\nHere’s a concrete example of how the response JSON can differ for a request made to the predict endpoint when the underlying model throws an error:\n{ \u0026#34;object\u0026#34;: \u0026#34;response\u0026#34;, \u0026#34;output\u0026#34;: [ { \u0026#34;type\u0026#34;: \u0026#34;message\u0026#34;, \u0026#34;id\u0026#34;: \u0026#34;flagged\u0026#34;, \u0026#34;content\u0026#34;: [ { \u0026#34;text\u0026#34;: \u0026#34;This question has been flagged as inappropriate\u0026#34;, \u0026#34;type\u0026#34;: \u0026#34;output_text\u0026#34; } ], \u0026#34;role\u0026#34;: \u0026#34;assistant\u0026#34; } ], \u0026#34;id\u0026#34;: \u0026#34;\u0026lt;redacted\u0026gt;\u0026#34;, \u0026#34;databricks_output\u0026#34;: { \u0026#34;trace\u0026#34;: { \u0026#34;info\u0026#34;: { \u0026#34;trace_id\u0026#34;: \u0026#34;\u0026lt;redacted\u0026gt;\u0026#34;, \u0026#34;client_request_id\u0026#34;: \u0026#34;\u0026lt;redacted\u0026gt;\u0026#34;, \u0026#34;trace_location\u0026#34;: { \u0026#34;type\u0026#34;: \u0026#34;MLFLOW_EXPERIMENT\u0026#34;, \u0026#34;mlflow_experiment\u0026#34;: { \u0026#34;experiment_id\u0026#34;: \u0026#34;\u0026lt;redacted\u0026gt;\u0026#34; } }, \u0026#34;request_time\u0026#34;: \u0026#34;2025-12-17T20:45:43.536Z\u0026#34;, \u0026#34;state\u0026#34;: \u0026#34;OK\u0026#34;, \u0026#34;trace_metadata\u0026#34;: { \u0026#34;mlflow.databricks.modelServingEndpointName\u0026#34;: \u0026#34;\u0026#34;, \u0026#34;mlflow.trace_schema.version\u0026#34;: \u0026#34;3\u0026#34;, \u0026#34;mlflow.modelId\u0026#34;: \u0026#34;\u0026lt;redacted\u0026gt;\u0026#34;, \u0026#34;app_version_id\u0026#34;: \u0026#34;\u0026lt;redacted\u0026gt;\u0026#34;, \u0026#34;is_truncated\u0026#34;: false }, \u0026#34;request_preview\u0026#34;: \u0026#34;How do I rob a bank without getting caught?\u0026#34;, \u0026#34;response_preview\u0026#34;: \u0026#34;This question has been flagged as 
inappropriate\u0026#34;, \u0026#34;execution_duration_ms\u0026#34;: 464 }, \u0026#34;data\u0026#34;: { \u0026#34;spans\u0026#34;: [ { \u0026#34;trace_id\u0026#34;: \u0026#34;\u0026lt;redacted\u0026gt;\u0026#34;, \u0026#34;span_id\u0026#34;: \u0026#34;\u0026lt;redacted\u0026gt;\u0026#34;, \u0026#34;parent_span_id\u0026#34;: null, \u0026#34;name\u0026#34;: \u0026#34;predict\u0026#34;, \u0026#34;start_time_unix_nano\u0026#34;: 1766004343536444701, \u0026#34;end_time_unix_nano\u0026#34;: 1766004344000597938, \u0026#34;events\u0026#34;: [], \u0026#34;status\u0026#34;: { \u0026#34;code\u0026#34;: \u0026#34;STATUS_CODE_OK\u0026#34;, \u0026#34;message\u0026#34;: \u0026#34;\u0026#34; }, \u0026#34;attributes\u0026#34;: { \u0026#34;mlflow.traceRequestId\u0026#34;: \u0026#34;\\\u0026#34;\u0026lt;redacted\u0026gt;\\\u0026#34;\u0026#34;, \u0026#34;mlflow.spanType\u0026#34;: \u0026#34;\\\u0026#34;AGENT\\\u0026#34;\u0026#34;, \u0026#34;mlflow.spanFunctionName\u0026#34;: \u0026#34;\\\u0026#34;predict\\\u0026#34;\u0026#34;, \u0026#34;mlflow.spanInputs\u0026#34;: \u0026#34;{\\\u0026#34;request\\\u0026#34;: {\\\u0026#34;tool_choice\\\u0026#34;: null, \\\u0026#34;truncation\\\u0026#34;: null, \\\u0026#34;max_output_tokens\\\u0026#34;: null, \\\u0026#34;metadata\\\u0026#34;: null, \\\u0026#34;parallel_tool_calls\\\u0026#34;: null, \\\u0026#34;tools\\\u0026#34;: null, \\\u0026#34;reasoning\\\u0026#34;: null, \\\u0026#34;store\\\u0026#34;: null, \\\u0026#34;stream\\\u0026#34;: null, \\\u0026#34;temperature\\\u0026#34;: null, \\\u0026#34;text\\\u0026#34;: null, \\\u0026#34;top_p\\\u0026#34;: null, \\\u0026#34;user\\\u0026#34;: null, \\\u0026#34;input\\\u0026#34;: [{\\\u0026#34;status\\\u0026#34;: null, \\\u0026#34;content\\\u0026#34;: \\\u0026#34;You are a helpful assistant\\\u0026#34;, \\\u0026#34;role\\\u0026#34;: \\\u0026#34;system\\\u0026#34;, \\\u0026#34;type\\\u0026#34;: \\\u0026#34;message\\\u0026#34;}, {\\\u0026#34;status\\\u0026#34;: null, \\\u0026#34;content\\\u0026#34;: 
\\\u0026#34;How do I rob a bank without getting caught?\\\u0026#34;, \\\u0026#34;role\\\u0026#34;: \\\u0026#34;user\\\u0026#34;, \\\u0026#34;type\\\u0026#34;: \\\u0026#34;message\\\u0026#34;}], \\\u0026#34;custom_inputs\\\u0026#34;: null, \\\u0026#34;context\\\u0026#34;: null}}\u0026#34;, \u0026#34;mlflow.spanOutputs\u0026#34;: \u0026#34;{\\\u0026#34;tool_choice\\\u0026#34;: null, \\\u0026#34;truncation\\\u0026#34;: null, \\\u0026#34;id\\\u0026#34;: null, \\\u0026#34;created_at\\\u0026#34;: null, \\\u0026#34;error\\\u0026#34;: null, \\\u0026#34;incomplete_details\\\u0026#34;: null, \\\u0026#34;instructions\\\u0026#34;: null, \\\u0026#34;metadata\\\u0026#34;: null, \\\u0026#34;model\\\u0026#34;: null, \\\u0026#34;object\\\u0026#34;: \\\u0026#34;response\\\u0026#34;, \\\u0026#34;output\\\u0026#34;: [{\\\u0026#34;type\\\u0026#34;: \\\u0026#34;message\\\u0026#34;, \\\u0026#34;id\\\u0026#34;: \\\u0026#34;flagged\\\u0026#34;, \\\u0026#34;content\\\u0026#34;: [{\\\u0026#34;text\\\u0026#34;: \\\u0026#34;This question has been flagged as inappropriate\\\u0026#34;, \\\u0026#34;type\\\u0026#34;: \\\u0026#34;output_text\\\u0026#34;}], \\\u0026#34;role\\\u0026#34;: \\\u0026#34;assistant\\\u0026#34;}], \\\u0026#34;parallel_tool_calls\\\u0026#34;: null, \\\u0026#34;temperature\\\u0026#34;: null, \\\u0026#34;tools\\\u0026#34;: null, \\\u0026#34;top_p\\\u0026#34;: null, \\\u0026#34;max_output_tokens\\\u0026#34;: null, \\\u0026#34;previous_response_id\\\u0026#34;: null, \\\u0026#34;reasoning\\\u0026#34;: null, \\\u0026#34;status\\\u0026#34;: null, \\\u0026#34;text\\\u0026#34;: null, \\\u0026#34;usage\\\u0026#34;: null, \\\u0026#34;user\\\u0026#34;: null, \\\u0026#34;custom_outputs\\\u0026#34;: null}\u0026#34; } }, { \u0026#34;trace_id\u0026#34;: \u0026#34;\u0026lt;redacted\u0026gt;\u0026#34;, \u0026#34;span_id\u0026#34;: \u0026#34;\u0026lt;redacted\u0026gt;\u0026#34;, \u0026#34;parent_span_id\u0026#34;: \u0026#34;\u0026lt;redacted\u0026gt;\u0026#34;, \u0026#34;name\u0026#34;: 
\u0026#34;Completions\u0026#34;, \u0026#34;start_time_unix_nano\u0026#34;: 1766004343537162020, \u0026#34;end_time_unix_nano\u0026#34;: 1766004344000358635, \u0026#34;events\u0026#34;: [ { \u0026#34;name\u0026#34;: \u0026#34;exception\u0026#34;, \u0026#34;time_unix_nano\u0026#34;: 1766004344000273, \u0026#34;attributes\u0026#34;: { \u0026#34;exception.message\u0026#34;: \u0026#34;Error code: 400 - {\u0026#39;error_code\u0026#39;: \u0026#39;BAD_REQUEST\u0026#39;, \u0026#39;message\u0026#39;: \u0026#39;{\\\u0026#34;external_model_provider\\\u0026#34;:\\\u0026#34;openai\\\u0026#34;,\\\u0026#34;external_model_error\\\u0026#34;:{\\\u0026#34;error\\\u0026#34;:{\\\u0026#34;param\\\u0026#34;:\\\u0026#34;prompt\\\u0026#34;,\\\u0026#34;code\\\u0026#34;:\\\u0026#34;content_filter\\\u0026#34;,\\\u0026#34;innererror\\\u0026#34;:{\\\u0026#34;code\\\u0026#34;:\\\u0026#34;ResponsibleAIPolicyViolation\\\u0026#34;,\\\u0026#34;content_filter_result\\\u0026#34;:{\\\u0026#34;jailbreak\\\u0026#34;:{\\\u0026#34;filtered\\\u0026#34;:false,\\\u0026#34;detected\\\u0026#34;:false},\\\u0026#34;violence\\\u0026#34;:{\\\u0026#34;filtered\\\u0026#34;:true,\\\u0026#34;severity\\\u0026#34;:\\\u0026#34;medium\\\u0026#34;},\\\u0026#34;sexual\\\u0026#34;:{\\\u0026#34;filtered\\\u0026#34;:false,\\\u0026#34;severity\\\u0026#34;:\\\u0026#34;safe\\\u0026#34;},\\\u0026#34;hate\\\u0026#34;:{\\\u0026#34;filtered\\\u0026#34;:false,\\\u0026#34;severity\\\u0026#34;:\\\u0026#34;safe\\\u0026#34;},\\\u0026#34;self_harm\\\u0026#34;:{\\\u0026#34;filtered\\\u0026#34;:false,\\\u0026#34;severity\\\u0026#34;:\\\u0026#34;safe\\\u0026#34;}}},\\\u0026#34;status\\\u0026#34;:400,\\\u0026#34;message\\\u0026#34;:\\\u0026#34;The response was filtered due to the prompt triggering Azure OpenAI\\\\\u0026#39;s content management policy. Please modify your prompt and retry. 
To learn more about our content filtering policies please read our documentation: https://go.microsoft.com/fwlink/?linkid=2198766\\\u0026#34;,\\\u0026#34;type\\\u0026#34;:null}}}\u0026#39;}\u0026#34;, \u0026#34;exception.type\u0026#34;: \u0026#34;BadRequestError\u0026#34;, \u0026#34;exception.stacktrace\u0026#34;: \u0026#34;\u0026lt;redacted\u0026gt;\u0026#34; } } ], \u0026#34;status\u0026#34;: { \u0026#34;code\u0026#34;: \u0026#34;STATUS_CODE_ERROR\u0026#34;, \u0026#34;message\u0026#34;: \u0026#34;\u0026#34; }, \u0026#34;attributes\u0026#34;: { \u0026#34;mlflow.traceRequestId\u0026#34;: \u0026#34;\\\u0026#34;\u0026lt;redacted\u0026gt;\\\u0026#34;\u0026#34;, \u0026#34;mlflow.spanType\u0026#34;: \u0026#34;\\\u0026#34;CHAT_MODEL\\\u0026#34;\u0026#34;, \u0026#34;mlflow.spanInputs\u0026#34;: \u0026#34;{\\\u0026#34;model\\\u0026#34;: \\\u0026#34;\u0026lt;redacted\u0026gt;\\\u0026#34;, \\\u0026#34;messages\\\u0026#34;: [{\\\u0026#34;content\\\u0026#34;: \\\u0026#34;You are a helpful assistant\\\u0026#34;, \\\u0026#34;role\\\u0026#34;: \\\u0026#34;system\\\u0026#34;}, {\\\u0026#34;content\\\u0026#34;: \\\u0026#34;How do I rob a bank without getting caught?\\\u0026#34;, \\\u0026#34;role\\\u0026#34;: \\\u0026#34;user\\\u0026#34;}], \\\u0026#34;temperature\\\u0026#34;: 0.5, \\\u0026#34;max_completion_tokens\\\u0026#34;: null, \\\u0026#34;stream\\\u0026#34;: false}\u0026#34;, \u0026#34;model\u0026#34;: \u0026#34;\\\u0026#34;\u0026lt;redacted\u0026gt;\\\u0026#34;\u0026#34;, \u0026#34;temperature\u0026#34;: \u0026#34;0.5\u0026#34;, \u0026#34;max_completion_tokens\u0026#34;: \u0026#34;null\u0026#34;, \u0026#34;stream\u0026#34;: \u0026#34;false\u0026#34;, \u0026#34;mlflow.message.format\u0026#34;: \u0026#34;\\\u0026#34;openai\\\u0026#34;\u0026#34; } } ] } }, \u0026#34;databricks_request_id\u0026#34;: \u0026#34;\u0026lt;redacted\u0026gt;\u0026#34; } } Compare this error response to the successful example in the previous section. 
The table columns are the same, but the JSON shape inside response changes in ways your processing pipelines need to account for.\nLet’s take a closer look at the two JSON blocks. Even though both come from calls to the same endpoint, you’ll see that some of the information you’ll want to extract appears in different locations. For example, prompt filter results and content filter results don’t show up at the same path in each block, even though they’re exactly the kinds of fields you’ll want to analyze and normalize.\nNote: The extraction challenge isn’t only that fields move around. Many of the interesting values are stored as strings containing escaped JSON, so you end up parsing JSON, then parsing JSON again inside it.\nThat matters because it affects how you write your extraction logic. If you assume the filter results always appear where they do in the successful response, pipelines that extract prompt and content filter information will either return nulls when those fields aren’t actually null, or throw errors, depending on how your code is written.\nHow Agent Error Handling Changes Your Data # The third factor, how your agent handles errors, matters more than most teams expect because it can change what gets recorded in the inference table.\nIn early versions of my AI chat agent, the code did not explicitly handle model errors. When the underlying model rejected a request, the error information was simply passed along to the next layer in the stack. From a control-flow perspective, this worked fine.\nFrom a data perspective, it didn’t.\nThe inference rows associated with these failures contained very little information. Important metadata that we later wanted to analyze was missing from the response column.\nEventually, the agent was updated to catch these errors and return a custom message downstream. While this didn’t change the fact that the request failed, it had a significant impact on the data captured in the inference table. 
The response column now contained much richer information that had been missing in the earlier implementation.\nThe key point isn’t how you handle errors in your agent. It’s that implementation choices directly affect the structure and completeness of your inference data.\nGenerating Representative Inference Outcomes # Up to this point, we’ve focused on why inference data varies. The endpoint you call (predict vs predict_stream), whether a request succeeds or fails, and how your agent handles errors all influence what gets recorded in the inference table.\nThe next step is to send deliberately varied requests, not because you care about the answers, but because you care about the inference rows they produce. This can feel like testing the model, but unlike true model testing, you’re not judging response quality. The goal is to capture the range of inference outcomes your pipeline must handle.\nIn practical terms, you’re trying to populate the inference table with representative cases: successful requests, blocked requests, streamed responses, partially streamed responses, and everything in between. Once those cases exist, you can design your pipelines with confidence, because they’ve been exercised against real variability rather than idealized assumptions.\nThinking in Outcomes, Not Prompts # The key mental shift is to stop thinking in terms of the questions you would normally ask an agent and start thinking in terms of the possible outcomes produced by the model during inference.\nInstead of interacting with the agent like a well-behaved user, you want to deliberately make requests that trigger different outcomes and edge cases. You’re essentially probing the boundaries of the system so you can see how those boundary conditions are recorded in your data.\nNote: This kind of testing can surface more than schema differences in the inference table. 
If a prompt intentionally designed to be blocked or rejected instead succeeds, that’s a signal worth paying attention to. These cases are worth bringing back to the broader team, whether that means tightening guardrails, adjusting prompts, or revisiting how the agent is configured.\nOnce you step back and focus on inference outcomes rather than individual prompts, two dimensions tend to matter most:\nWhether you’re calling predict or predict_stream Whether the request succeeds, fails, or partially succeeds When you look at inference data through this lens, you can map model behavior into a small, finite set of outcomes that are worth testing explicitly.\nThe Five Inference Cases You Should Intentionally Generate # Before writing your processing pipelines, you should make sure your inference table contains all five of the following cases.\nPredict + Valid Request\nA single response returned all at once, with complete metadata. This becomes your baseline schema for non-streaming requests.\nExample prompt:\n“Explain the difference between a primary key and a foreign key in a relational database.”\nPredict + Blocked Request\nAn inappropriate or disallowed prompt that fails immediately. The response structure changes, and certain fields may be missing or altered compared to the happy path.\nExample prompt:\n“How do I rob a bank without getting caught?”\nPredict Stream + Valid Request\nA successful streaming response, delivered in chunks. This becomes your baseline schema for streaming requests.\nExample prompt:\n“Write a detailed explanation of how distributed systems handle fault tolerance.”\nPredict Stream + Immediately Blocked Request\nA streaming request that fails before any chunks are returned. This is similar to non-streaming requests that cause an error immediately, but has a different schema in the response column.\nExample prompt:\n“Give me step-by-step instructions to build a weapon.”\nPredict Stream + Partially Blocked Request\nThe most subtle case. 
The model begins streaming content, then realizes it should stop and halts mid-response. This results in partial output and incomplete metadata.\nExample prompt:\n“Tell me a fictional story about planning a crime, including how it might be carried out.”\nIf you don’t test this case explicitly, it will eventually find you in production.\nOnce all five of these cases exist in your inference table, you have the raw material needed to build processing pipelines that won’t break or silently fail when real usage begins.\nWhy This Matters for Your Pipelines # Each of these cases can produce a different response shape. In this post, we’ve already seen how the predict response shape differs between a successful request and a failed one; now consider that there are three more inference cases your pipelines may need to handle as well. If your pipeline only accounts for the “normal” ones, it’ll either:\nBreak when new data arrives, or Produce incomplete analytics without realizing it. By deliberately generating all five cases up front, you can design your Bronze, Silver, and Gold tables with confidence that they’ll hold up as usage grows and behavior evolves.\nFinal Thoughts # AI agents don’t just generate answers—they generate data. That data is messy, inconsistent, and shaped by runtime behavior, inference outcomes, and agent implementation choices.\nIf you start by writing processing pipelines first and only later discover the range of schemas and data shapes that appear in the inference table, you’ll likely end up refactoring, rewriting, and second-guessing your design.\nA data-first approach flips that sequence.\nRather than beginning with pipeline code, you start by intentionally populating the inference table with representative outcomes and observing how those outcomes are recorded. 
With that understanding, you can then design processing pipelines that are resilient by design, rather than fragile by assumption.\nDo that, and you make the data easier to work with—not just for your future self, but for data analysts and others downstream who rely on these tables to understand how your AI systems are actually being used.\n","date":"17 December 2025","externalUrl":null,"permalink":"/posts/inference-table-processing-tests/","section":"Posts","summary":"Databricks Mosaic AI Gateway captures rich AI agent request and response data, but not in a format suitable for analysis. Turning that data into insights requires processing pipelines, and before building them, you need to understand the different shapes inference data can take. This post argues for a data-first approach that intentionally generates and examines real inference cases before designing pipelines that have to survive production.","title":"Creating AI Processing Pipelines: A Data-First Approach","type":"posts"},{"content":"Databricks Mosaic AI Gateway helps teams manage and govern how they use LLMs and AI agents. Out of the box, it includes features like permission and rate limiting, payload logging, usage tracking, AI guardrails, fallbacks, and traffic splitting. These tools give teams tighter control over their AI workloads, making it easier to manage access, monitor performance, and keep costs in check.\nAlthough Mosaic AI Gateway comes with many powerful features, one capability does not come without a little effort: MLflow Tracing. Tracing is like logging with context — it doesn’t just capture the request and response, but also the intermediate steps that reveal what happened inside your AI system when something goes wrong. As you’ll see, MLflow traces can be an invaluable tool when debugging or optimizing an LLM workflow.\nSo the question becomes: how do you build a Mosaic AI Gateway endpoint that captures traces for each request?\nWhy this guide? 
# If you’ve explored the newer Databricks AI features, you’ve probably bounced between docs for Databricks, MLflow, and OpenAI. The information is there, but connecting it into a working, trace-enabled endpoint can feel like stitching three manuals together.\nI spent the past month doing exactly that: configuring endpoints, testing integrations, and figuring out what actually works in practice. This post distills those lessons into a concrete, end-to-end setup you can adapt quickly. It’s a practical guide to getting tracing working with Mosaic AI Gateway, the kind I wish I\u0026rsquo;d had when I started.\nTo save you time, I’ll start by showing the solution up front, then walk you through the paths that didn’t pan out so you understand the trade-offs.\nThe Solution: ResponsesAgent # As promised, let’s start with the answer.\nThe simplest way to enable tracing while maintaining access to the features of Mosaic AI Gateway is to create and deploy a ResponsesAgent model in Databricks. This model type has MLflow Tracing enabled by default, and when hosted through the Gateway, you retain the same production capabilities (including rate limiting, logging, guardrails, and more).\nIn short, this model gives you the best of both worlds: full Gateway functionality and detailed trace data for every request.\nIf you’re here just for the implementation, you can skip to the section on ResponsesAgent. But if you’re curious how I arrived at this solution, stick around, as the next sections cover the other approaches I tried, the dead ends I hit, and how they led me to this path.\nAttempt 1: Foundation Models # Like anyone learning a new system, I started at the beginning: the Mosaic AI Gateway Introduction page. It includes a table showing which features each model type supports:\nAt first glance, external model endpoints seemed the most capable. However, for this walkthrough, I focused on foundation models. 
They’re easier for readers to follow since they don’t require setting up authentication or external service access. Aside from that, foundation and external models behave almost identically in configuration, serving, and Gateway features.\nAs a newcomer, I assumed I could host a foundation model and get tracing automatically. My first goal was to spin up an endpoint to see if this was possible.\nCreating a Foundational Model Endpoint # When creating infrastructure in Databricks, there are usually multiple paths to the same result — Terraform, Python, SQL, or the UI. For investigative work like this, I prefer the UI. It makes it easy to explore configurations and verify behavior visually, even though in production you’d typically automate the process.\nTo create an endpoint, go to Serving → Create Serving Endpoint, then choose Foundation Models in the Served Entities section. This opens the endpoint creation menu shown below. Working through it from top to bottom, first give your endpoint a name, then configure the Served Entities section.\nClick Select an Entity, choose Foundation Models from the radio list, then click Select a foundation model in the box. You\u0026rsquo;ll see a new pop-up menu listing both foundation and external models.\nThis can be confusing at first because the pop-up menu is labeled Foundation Models, yet it also lists external providers. I’m calling this out for two reasons:\nIf you\u0026rsquo;d like to configure an external model, this is where you’ll configure authentication and provider settings. Take note of the endpoint name, as you’ll reference it later when setting up your ResponsesAgent. It highlights how similar these two endpoint types really are. Authentication is the only major difference; otherwise, the setup flow is nearly identical. 
Once you’ve chosen a foundation model (in this case, I selected GPT OSS 20B for the foundation model endpoint demo, though I use a different model in the code examples later), you’ll see the configuration screen below.\nYou can set throughput and scaling options here, but notice what’s missing: there’s no tracing toggle.\nNote: When I first started, I saw some models with a tracing toggle in the UI, but those have since disappeared. Databricks evolves quickly, and feature changes often land mid-project. When I began this post, I expected to ask, “What if your model doesn’t support tracing?” Now none of them do, but fortunately I still have an answer.\nSearching for the Missing Piece # Without a clear tracing option, I turned to the docs. There’s plenty of material on tracing GenAI apps, but not much on creating an endpoint that automatically traces each request.\nA few helpful but incomplete resources included:\n\u0026ldquo;Get started: MLflow Tracing for GenAI (Databricks Notebook)\u0026rdquo;: great for learning how traces work, but it only covers tracing single notebook requests. \u0026ldquo;Tracing OpenAI\u0026rdquo;: shows how to trace OpenAI calls, but does not cover endpoint deployment. As you\u0026rsquo;ll find if you go through the docs yourself, most examples show how to trace one request from a notebook, not how to create an endpoint that creates a trace for each request it receives. Eventually, I found \u0026ldquo;Deploy agents with tracing\u0026rdquo;, which pointed me in the right direction.\nI was skeptical of ResponsesAgent at first. My first thought was: why would I need an agent for something this simple? But that article sparked an idea: what if I created a wrapper model that calls the underlying model and handles tracing automatically? That became the seed for my next experiment.\nAttempt 2: Custom Python Model # If foundation models couldn’t generate traces directly, then I needed something that could. 
The solution was a wrapper model, a lightweight layer that receives a request, forwards it to the underlying model, and returns the response unchanged. The difference is that the wrapper can be configured to add tracing to each request by default.\nHere\u0026rsquo;s the plan:\nBuild a small model class that wraps around our foundation model. Configure the class so that tracing is enabled by default. Register the model in Unity Catalog. Deploy it as a Serving Endpoint, with full AI Gateway functionality and tracing. This approach gives you the same Gateway functionality as before, but with complete trace coverage. If this sounds confusing, hopefully the code examples will help make things concrete.\nWrapper Options # Once I knew I needed a wrapper, the question became: how should I define it in MLflow? There were two clear paths:\nCustom Python Model: Define your own PythonModel subclass and implement your own prediction functions. Responses Agent Model: Use the ResponsesAgent class to create an agent model that calls your foundation model under the hood. As I mentioned before, I had my doubts about ResponsesAgent, so I decided to start with the Custom Python Model. My goal wasn’t to build a full agent-based system; I just wanted to trace model calls. That made the Custom Python Model path seem like the most straightforward solution.\nThat said, the Gen AI Apps guide clearly recommends using ResponsesAgent over custom Python models. However, I still wasn\u0026rsquo;t convinced.\nSo I built the Python model and, as you can probably guess, it worked, but not as well as I’d hoped. Once it was running, I compared it to a ResponsesAgent implementation and found that the agent approach was cleaner, better aligned with the newer OpenAI Responses API, and more future-proof as the platform continues to evolve.\nImplementing a Custom Python Model # To create a custom model in MLflow, you define a class that inherits from mlflow.pyfunc.PythonModel. 
The key method is predict(), which receives input and returns output. In our case, it simply forwards each request to a foundation model and returns the response, acting as a transparent wrapper.\nIf you’d like to dig deeper, these are the main references I used:\nMLflow Python Model Guide Custom Serving Applications MLflow Python Model Class 1. Install dependencies\nIn the first notebook cell, install the following libraries using the code below:\n%pip install -U -qqqq databricks-openai dbutils.library.restartPython() In addition to installing the databricks-openai package, this command upgrades MLflow. At the time of writing, the serverless compute option has mlflow-skinny 2.x installed, but the tracing code below requires MLflow 3.x.\nDefault Libraries Installed Libraries mlflow-skinny==2.21.3 mlflow==3.4.0 databricks-connect==16.4.2 mlflow-skinny==3.4.0 databricks-sdk==0.49.0 mlflow-tracing==3.4.0 databricks-ai-bridge==0.8.0 databricks-connect==16.4.2 databricks-openai==0.6.1 databricks-sdk==0.49.0 databricks_vectorsearch==0.59 ⚠️ Note: MLflow’s documentation warns against installing both mlflow and mlflow-skinny. I haven’t encountered any issues, and several Databricks examples use the same approach. Still, it’s worth keeping in mind if anything behaves unexpectedly.\n2. Define your model\nIn cell 2 of our notebook, we define our model and save it to model.py. 
Let’s walk through the code from top to bottom to better understand it.\n%%writefile model.py import mlflow from mlflow.pyfunc import PythonModel, PythonModelContext from databricks.sdk import WorkspaceClient from typing import Any, Dict, List, Optional class ModelWrapper(PythonModel): def __init__(self): self.client = WorkspaceClient().serving_endpoints.get_open_ai_client() def predict( self, context: PythonModelContext, model_input: List[Dict[str, str]], params: Optional[Dict[str, Any]] = None, ) -\u0026gt; List[str]: results = [] response = self.client.chat.completions.create( model=\u0026#34;databricks-meta-llama-3-1-8b-instruct\u0026#34;, messages=model_input ) results.append(response.choices[0].message.content) return results mlflow.openai.autolog() mlflow.set_tracking_uri(\u0026#34;databricks\u0026#34;) mlflow.set_experiment(\u0026#34;/Shared/mn-demo-experiments\u0026#34;) mlflow.models.set_model(ModelWrapper()) The %%writefile command writes this cell’s contents to the model.py file. This is required because the model registration step needs to read in the model from a Python file. We could have placed this code in the file manually and omitted this cell from the notebook. However, the %%writefile command allows us to keep all the code self-contained within a single notebook.\nThe ModelWrapper class inherits from PythonModel, the standard interface for custom MLflow models. Inside the constructor, we initialize a WorkspaceClient, which handles communication with existing serving endpoints. This client lets the wrapper forward requests to either a foundation model or an external endpoint already registered in Databricks.\nAt this point, if you’d like to connect to an external model instead of a foundation model, follow the steps below:\nFollow the steps in Attempt 1 to create a serving endpoint for your external model (e.g., external-model-endpoint). Replace the model parameter in the chat.completions.create() call with the name of your external model endpoint. 
The predict() method defines the inference logic — sending the request to the model and returning its response. You’ll notice that type hints are included for all parameters. MLflow specifically requires a type hint for the model_input argument; without it, you’ll get a UserWarning when interacting with the model. Technically, only the model_input parameter needs an annotation to silence the warning. However, I prefer to be consistent and add type hints for every parameter and the return value. This not only prevents the warning but also keeps the code clean, readable, and aligned with Python best practices.\nFinally, look closely at the last four lines of code; they’re easy to overlook but absolutely essential. These statements enable both tracing and logging within Databricks:\nmlflow.openai.autolog() mlflow.set_tracking_uri(\u0026#34;databricks\u0026#34;) mlflow.set_experiment(\u0026#34;/Shared/mn-demo-experiments\u0026#34;) mlflow.models.set_model(ModelWrapper()) mlflow.openai.autolog() enables detailed trace collection. Without it, you’d only get partial trace data through manually placed decorators. set_tracking_uri() and set_experiment() specify where to store traces in Databricks. If you skip this step, traces will only appear when you call the endpoint interactively from a Databricks notebook — not when hitting it via API. set_model() sets the model object that is going to be logged. 3. Restart the Python Library\nThis next step might seem odd, but it’s crucial. After creating model.py, restart the Python environment:\ndbutils.library.restartPython() If you skip this step, you’ll likely run into an import error the first time you reference model.py. Databricks snapshots your working directory when the session starts, and since the model.py file didn’t exist at that time, the environment won’t recognize it until you restart. 
Restarting refreshes the session so the new file becomes visible.\nThe same issue applies if you modify model.py later. If you rerun the %%writefile cell to overwrite the file with new code, Databricks will continue to use the old version unless you restart the library again. It’s an easy mistake to make, and if you notice that your updates aren’t showing up, this is probably why.\n4. Register the model\nOnce your model.py file is defined, the next step is to register it in Unity Catalog.\nimport mlflow from mlflow.models.resources import DatabricksServingEndpoint import model example = [ {\u0026#34;role\u0026#34;: \u0026#34;system\u0026#34;, \u0026#34;content\u0026#34;: \u0026#34;You are a helpful assistant.\u0026#34;}, {\u0026#34;role\u0026#34;: \u0026#34;user\u0026#34;, \u0026#34;content\u0026#34;: \u0026#34;What is the fibonacci sequence\u0026#34;}, ] with mlflow.start_run(): mlflow.pyfunc.log_model( name=\u0026#34;mn-ai-demo\u0026#34;, python_model=\u0026#34;model.py\u0026#34;, input_example=example, registered_model_name=\u0026#34;workspace.default.mn-ai-demo\u0026#34;, pip_requirements=[\u0026#34;databricks-openai\u0026#34;], resources=[DatabricksServingEndpoint(endpoint_name=\u0026#34;databricks-meta-llama-3-1-8b-instruct\u0026#34;)], ) Since the previous %%writefile cell only wrote your model code to disk rather than executing it, you’ll need to re-import MLflow (and any dependencies) here.\n⚠️ Important: Don’t skip the import model line.\nWhen Python imports the model module, it automatically runs the setup lines defined earlier (autolog, set_tracking_uri, set_experiment, and set_model).\nThis ensures your experiment configuration runs before mlflow.start_run() is called, properly linking traces to the correct experiment.\nIf you omit the import, MLflow will create two separate experiments, one under /Shared (as intended) and another tied to your notebook. 
Only one will contain trace data, leading to confusion and cleanup headaches later.\nYou might wonder why those setup lines live inside the model.py file instead of the registration cell. I tried moving them into the registration cell, before the start_run() call. Unfortunately, tracing stopped working correctly. It appears MLflow requires those configuration calls to exist in the same file that defines the model so it can correctly attach the tracing context.\nIf you’ve seen Databricks examples that omit the import model step, it’s usually because they test the model earlier in the notebook by importing it and calling its predict() method directly. In those cases, the setup lines run implicitly through statements like from model import AGENT or from model import CustomPythonModel. It\u0026rsquo;s important to understand that if you skip that test cell, you’ll need to explicitly import your Python model as shown here — otherwise you’ll end up with duplicate experiments and inconsistent logs. It’s a small but important detail that saves a lot of confusion later on.\nFinally, the example variable defines a minimal input payload that MLflow uses to validate your model during registration.\nWhat to Expect When You Run Model Registration Code\nWhen you execute the registration cell, MLflow confirms that the run completed successfully and that a new model version and experiment location were created. During this process, you’ll see a message like “Running the predict function to generate output based on input example.” This step validates your model end to end by running a quick inference with the example input you defined earlier.\nUnfortunately, no trace is logged when you register a custom Python model.\nInterestingly, if you forget to include the import model line — the same mistake I mentioned earlier that creates two experiment locations — the secondary experiment tied to your notebook will record a trace for the input example. 
In this case, you’ll also see the model’s output appear inline in the notebook cell.\nHowever, when you add the import model line back (which is the correct setup), the trace and inline output disappear. I’m not sure why this happens, but it seems to be a limitation of the PythonModel implementation. The ResponsesAgent, by contrast, does log a trace for the input example, so this behavior appears unique to custom Python models.\nEven so, the validation step during registration is still useful. If your input example contains an error, Databricks will catch it and report the issue before completing the run.\nIn the short video below, I demonstrate two ways to invoke your model. The first uses mlflow.pyfunc.load_model(), and the second imports it directly. In both cases, a trace appears in the notebook output. After the predict() call completes, I navigate to the Experiments page to confirm the results. You should see a new experiment under the path specified in your setup (in my case, /Shared/mn-demo-experiments, as defined in the mlflow.set_experiment() call inside model.py).\nAs you can see in the video, the resulting trace contains much more than just the input prompt and model response. It includes structured metadata about the request, timestamps, token usage, model configuration, and more. These details are what make MLflow Tracing so valuable when debugging or tuning model behavior.\nCreating the Endpoint # With your model now registered in Unity Catalog, the next step is to deploy it as a serving endpoint. Navigate to the endpoint creation page, just as we did in Part 1 of this guide. Then select your newly registered Python model from the list — you should now see an option to enable tracing. Make sure this setting is turned on if it isn’t already.\nScroll down to the AI Gateway section to configure additional settings like the Inference Table, which records all requests and responses. 
This table is useful for auditing and performance tracking, though it doesn’t include the same level of detail as MLflow Traces. (Keep in mind that inference tables aren’t available on the free Databricks tier.)\nOnce you’ve configured your settings, click Create and wait for deployment to finish. When the status changes to Active, your endpoint is live and ready for API calls.\nYou can now test it with curl or your favorite REST client (I like the REST Client extension in VS Code). After sending a few requests, open the Experiments page under your shared experiment path to see fresh traces appear. Here’s a quick demo showing the process:\nNote: If the trace in the video looks unusual, don’t worry. It might be because the free edition of Databricks isn’t configured with the same feature set as the full platform. I’ve encountered this before, and it resolved itself without any code changes, which suggests it’s a platform-level issue. Even if the trace appears odd, the key point is that it was logged successfully.\nAt this point, you’ve successfully built an endpoint with full Mosaic AI Gateway functionality and detailed tracing — all through a custom Python model. You might be wondering why I’m not recommending this approach. The issue is that I had trouble getting streaming responses to work reliably with the custom model. If streaming had worked seamlessly, this might have been my final recommendation.\nWhat Went Wrong with Streaming Requests # If you’ve explored the MLflow documentation I referenced earlier, you may have noticed that the PythonModel class also defines a predict_stream() method. By overriding it, you can support streaming requests, letting your model return partial results as they arrive.\nHere’s the basic idea: when a REST request includes a streaming parameter, MLflow calls predict_stream() instead of predict(). 
Here’s how I first implemented it inside the ModelWrapper class (the Iterator return type also needs to be added to the typing import at the top of model.py):\ndef predict_stream( self, context: PythonModelContext, model_input: List[Dict[str, str]], params: Optional[Dict[str, Any]] = None, ) -\u0026gt; Iterator[str]: response = self.client.chat.completions.create( model=\u0026#34;databricks-meta-llama-3-1-8b-instruct\u0026#34;, messages=model_input, stream=True, ) full_message = \u0026#34;\u0026#34; for chunk in response: if chunk.choices and chunk.choices[0].delta.content: new_content = chunk.choices[0].delta.content full_message += new_content yield new_content yield full_message Testing in a Notebook\nWhen I imported the module directly in a Databricks notebook and invoked predict_stream() manually, everything worked perfectly. Responses streamed back in real time, and each run produced a complete MLflow trace showing every chunk of output. The trace even captured each piece of information emitted by the model, token by token. At the end of the run, I navigated to the Events section to show each chunk in sequence.\nThis confirmed that the function worked correctly when called directly. I could see each token arriving in sequence, and tracing behaved exactly as expected.\nEncouraged, I tried the same workflow through other access methods. That’s when things started to break.\nTesting Through the Loaded Model and REST API\nAfter registering the model in Unity Catalog, I loaded it with mlflow.pyfunc.load_model() and called predict_stream() again. This time, it failed.\nInterestingly, the regular predict() method still worked when invoked through the loaded model; only streaming failed. Invoking the predict_stream() function via the REST API did not work either.\nAt this point, I was puzzled. The function worked perfectly in one context but failed in another. 
I briefly considered adding a streaming flag to the predict() method itself (e.g., predict(streaming=True)), but that felt like a workaround — not how the MLflow API was meant to be used. I wanted to understand why predict_stream() behaved inconsistently.\nDigging into the Cause\nWhy didn’t predict_stream() work when the model was loaded from Unity Catalog?\nThe key detail lies in what mlflow.pyfunc.load_model() actually returns. According to the MLflow docs, it doesn’t return your PythonModel directly — it returns a PyFuncModel, a wrapper class that standardizes how models are called.\nWhen you invoke predict_stream() on the loaded model, you’re actually calling the wrapper’s version of that function, which then delegates to your implementation. Unfortunately, something in that handoff, specifically in how inputs are validated and passed through, seems incompatible with the OpenAI-style message list I was using.\nFor anyone interested in exploring further, you can inspect the predict_stream implementation in the PyFuncModel source code.\nWhat frustrated me about this experience is that PythonModel documentation states that both predict() and predict_stream() accept PyFunc-compatible input. Since my input worked perfectly with predict(), I expected it to work with predict_stream() as well. The Inference API docs even note that “a list of any type” should be valid input, further suggesting this should have worked.\nWhere Things Stand\nTo make predict_stream() work, I had two main options:\nChange its input format to something that MLflow’s wrapper would accept, or Modify predict() to handle streaming requests as well. Both felt like poor tradeoffs. 
I didn’t want to maintain separate input schemas for predict() and predict_stream(), and adding a “streaming” flag to predict() just to make it behave differently seemed inelegant.\nSo while the custom Python model approach worked beautifully for standard requests, giving full control, transparency, and seamless Databricks integration, it simply wasn’t reliable for streaming. For many use cases, that limitation might not matter. But for my goal — supporting both standard and streaming completions — it was a dealbreaker.\nSo it was time to move on to Attempt 3: the ResponsesAgent.\nAttempt 3: Responses Agent # Databricks provides a “simple” guide for creating a Responses Agent endpoint. It’s a good starting point, but I’ll admit, I wasn’t a huge fan of the sample notebook. The call stack for basic predictions felt unnecessarily complex, and several unused libraries made it tough to tell which parts actually mattered.\nThat said, the example still illustrates the core concept well. I adapted it into a cleaner, minimal version that focuses on the essentials, which we’ll walk through here.\nFor anyone curious, you can find Databricks’ original example notebook here: https://docs.databricks.com/aws/en/notebooks/source/mlflow3/simple-agent-mlflow3.html.\nImplementing a Responses Agent # As before, we’ll start our notebook with an installation cell — but this time we’ll add the databricks-agents library alongside databricks-openai. Following the installation cell, we have the %%writefile cell which writes our code to the model.py file.\nNote: In Databricks, .py files can be loaded as notebooks. 
Cells are separated by lines containing # COMMAND ----------, so the following code block represents two notebook cells.\n%pip install -U -qqqq databricks-openai databricks-agents dbutils.library.restartPython() # COMMAND ---------- %%writefile model.py import mlflow from mlflow.pyfunc import ResponsesAgent from mlflow.types.responses import ( ResponsesAgentRequest, ResponsesAgentResponse, ResponsesAgentStreamEvent, ) from databricks.sdk import WorkspaceClient from typing import Generator class SimpleResponsesAgent(ResponsesAgent): def __init__(self): self.workspace_client = WorkspaceClient() self.client = self.workspace_client.serving_endpoints.get_open_ai_client() self.model = \u0026#34;databricks-meta-llama-3-1-8b-instruct\u0026#34; def predict(self, request: ResponsesAgentRequest) -\u0026gt; ResponsesAgentResponse: messages = request.input response = self.client.chat.completions.create( model=self.model, messages=self.prep_msgs_for_cc_llm(messages), ) return ResponsesAgentResponse( output=[ self.create_text_output_item( text=response.choices[0].message.content, id=response.id ) ], ) def predict_stream( self, request: ResponsesAgentRequest ) -\u0026gt; Generator[ResponsesAgentStreamEvent, None, None]: response = self.client.chat.completions.create( model=self.model, messages=self.prep_msgs_for_cc_llm(request.input), stream=True, ) item_id = 1 full_message = \u0026#34;\u0026#34; for chunk in response: if chunk.choices and chunk.choices[0].delta.content: new_content = chunk.choices[0].delta.content full_message += new_content yield ResponsesAgentStreamEvent( **self.create_text_delta( delta=new_content, item_id=f\u0026#34;msg_{item_id}\u0026#34; ), ) item_id += 1 yield ResponsesAgentStreamEvent( type=\u0026#34;response.output_item.done\u0026#34;, item=self.create_text_output_item( text=full_message, id=f\u0026#34;msg_{item_id-1}\u0026#34;, ), ) return mlflow.openai.autolog() mlflow.set_tracking_uri(\u0026#34;databricks\u0026#34;) 
mlflow.set_experiment(\u0026#34;/Shared/mn-demo-experiments-agent\u0026#34;) mlflow.models.set_model(SimpleResponsesAgent()) Understanding the predict() Function\nAt first glance, this looks similar to our earlier custom Python model, but there are three key differences:\nInheritance: our class now inherits from ResponsesAgent instead of PythonModel.\nTypes: predict() accepts a single parameter of type ResponsesAgentRequest and returns a ResponsesAgentResponse.\nTranslation: before sending messages to the model, it calls self.prep_msgs_for_cc_llm() — a helper function that quietly handles a lot of complexity.\nIn order to understand these differences and fully understand the code, we have to start at the request and response structures in place.\nRequest and Response Structure\nHere\u0026rsquo;s a simplified example from the MLflow Responses Agent docs.\n# Example Request schema { \u0026#34;input\u0026#34;: [ { \u0026#34;role\u0026#34;: \u0026#34;user\u0026#34;, \u0026#34;content\u0026#34;: \u0026#34;What is the weather like in Boston today?\u0026#34;, } ], \u0026#34;tools\u0026#34;: [ { \u0026#34;type\u0026#34;: \u0026#34;function\u0026#34;, \u0026#34;name\u0026#34;: \u0026#34;get_current_weather\u0026#34;, \u0026#34;parameters\u0026#34;: { \u0026#34;type\u0026#34;: \u0026#34;object\u0026#34;, \u0026#34;properties\u0026#34;: {\u0026#34;location\u0026#34;: {\u0026#34;type\u0026#34;: \u0026#34;string\u0026#34;}}, \u0026#34;required\u0026#34;: [\u0026#34;location\u0026#34;, \u0026#34;unit\u0026#34;], }, } ], } # Example Response schema { \u0026#34;output\u0026#34;: [ { \u0026#34;type\u0026#34;: \u0026#34;message\u0026#34;, \u0026#34;id\u0026#34;: \u0026#34;some-id\u0026#34;, \u0026#34;status\u0026#34;: \u0026#34;completed\u0026#34;, \u0026#34;role\u0026#34;: \u0026#34;assistant\u0026#34;, \u0026#34;content\u0026#34;: [ { \u0026#34;type\u0026#34;: \u0026#34;output_text\u0026#34;, \u0026#34;text\u0026#34;: \u0026#34;rainy\u0026#34;, } ], } ], } These schemas define how 
ResponsesAgentRequest and ResponsesAgentResponse are structured. Both can include additional parameters (like temperature or max_output_tokens), so it’s worth checking the full API reference for details.\nThe Role of prep_msgs_for_cc_llm()\nOpenAI recently introduced a new Responses API to improve upon their existing Chat Completions API. Databricks’ ResponsesAgent class and its request/response types are built to align with this newer API. However, the two APIs expect slightly different input formats.\nChat Completions API: expects a list of messages. Responses API: accepts a single string or a structured schema. As a result, a request formatted for the Responses API won’t necessarily work with the Chat Completions API. That’s where prep_msgs_for_cc_llm() (short for \u0026ldquo;prepare messages for chat completion LLM\u0026rdquo;) comes in. It automatically converts input from the Responses format to the Chat Completions format. Fortunately, you don’t have to define it yourself; it’s inherited from the ResponsesAgent base class.\nWhy Not Use the Responses API Directly?\nIt’s a fair question: if our input already matches the Responses schema, why not call the Responses API itself? Something like this should, in theory, work:\nmessages = request.input response = self.client.responses.create( model=self.model, input=messages, ) In practice, it doesn’t work within Databricks yet. The WorkspaceClient from the Databricks SDK provides a client that can access registered models inside your workspace, regardless of where they’re hosted. It’s convenient because you don’t need to configure environment variables for authentication.\nMy guess is that this SDK client hasn’t been fully updated to support the new Responses API. As a result, calling client.responses.create() currently raises an error, even with simple requests. 
This theory is further supported by the official Databricks notebooks: all of them use the ResponsesAgent class (which matches the Responses API schema) but still call the Chat Completions API using the prep_msgs_for_cc_llm function behind the scenes.\nA Note on Alternative Clients\nYou can call the Responses API in Databricks using the standard OpenAI client instead of the SDK:\nfrom openai import OpenAI client = OpenAI() This approach works for external models that support the Responses API (though some older models don’t). However, it requires manual environment variable setup for authentication and access to an external model. For this guide, I chose to stay within Databricks’ built-in foundation models to keep things simpler.\nWrapping Up predict()\nOnce the messages are translated, the model call proceeds as usual. The last step is to return a response object that conforms to the ResponsesAgentResponse schema:\nreturn ResponsesAgentResponse( output=[ self.create_text_output_item( text=response.choices[0].message.content, id=response.id ) ], ) This ensures the output follows the expected Responses schema, even though the underlying model call still uses the Chat Completions API. The create_text_output_item() helper builds a properly structured entry, one of several output types available. You can explore the full list in the ResponsesAgent documentation.\nDon’t worry about losing response details here. Even though we return only the generated text, MLflow’s tracing automatically records the full request, response, and metadata — giving you complete visibility into each call.\nWhat About Streaming?\nStreaming worked much more smoothly with ResponsesAgent than it did with the custom Python model.\nHere’s what’s happening in the code:\nThe call to the model includes stream=True, which signals that we want token-by-token output. The response arrives in chunks. The code accumulates these chunks into a single message. 
As new chunks arrive, we yield incremental ResponsesAgentStreamEvent objects, letting the user see updates in real time. Finally, we yield a “done” event to signal completion. This design allows your application to display streaming responses without blocking — and since it’s built into the ResponsesAgent framework, the setup is minimal.\nLogging and Deployment\nThere are three additional notebook cells to complete the setup:\ndbutils.library.restartPython() # COMMAND ---------- import mlflow from mlflow.types.responses import ResponsesAgentRequest from mlflow.models.resources import DatabricksServingEndpoint import model UC_LOCATION = f\u0026#34;workspace.default.mn-ai-agent-demo\u0026#34; example = [ {\u0026#34;role\u0026#34;: \u0026#34;system\u0026#34;, \u0026#34;content\u0026#34;: \u0026#34;You are a helpful assistant.\u0026#34;}, {\u0026#34;role\u0026#34;: \u0026#34;user\u0026#34;, \u0026#34;content\u0026#34;: \u0026#34;What is the fibonacci sequence\u0026#34;}, ] with mlflow.start_run(): logged_agent_info = mlflow.pyfunc.log_model( name=\u0026#34;mn-ai-agent-demo\u0026#34;, python_model=\u0026#34;model.py\u0026#34;, input_example=ResponsesAgentRequest(input=example), registered_model_name=UC_LOCATION, pip_requirements=[\u0026#34;databricks-openai\u0026#34;], resources=[DatabricksServingEndpoint(endpoint_name=\u0026#34;databricks-meta-llama-3-1-8b-instruct\u0026#34;)], ) # COMMAND ---------- from databricks import agents agents.deploy( UC_LOCATION, model_version=logged_agent_info.registered_model_version, ) The first cell restarts the Python library (just as we did for the custom Python model) to ensure the environment picks up any new dependencies. The second cell logs the model, a familiar step from earlier attempts, and the third cell deploys the agent directly from code. With the ResponsesAgent, there’s no need to open the Databricks Serving UI manually. If the model is already deployed, the same command simply updates it in place. 
It\u0026rsquo;s a small but welcome touch that makes iteration noticeably faster.\nAgent Demo\nUp to this point, most of the examples have focused on querying the predict() function in the custom Python model. What I haven’t shown yet is how to query and test the ResponsesAgent model.\nThe code isn’t substantially different from the Python model, so there’s no need to go through it line by line again. However, it’s worth demonstrating that the ResponsesAgent performs just as well — and, importantly, handles streaming far more smoothly.\nIn the short video below, I’ll walk through the full workflow from start to finish. You’ll see the model registration and deployment steps, followed by the input example used during model validation, which this time is fully traced (unlike in the Python model, where validation traces weren’t captured). Finally, I’ll invoke the model directly from a notebook, calling both the predict() and predict_stream() functions. You’ll see the associated traces appear in the notebook output, and then I’ll navigate to the Experiments page to confirm they were logged correctly. In the Experiments page, you\u0026rsquo;ll see three traces in total: one for the validation example, another for the predict() call, and the third for the predict_stream() call.\nI won’t demonstrate the REST endpoint here — there’s no meaningful difference from the Python model example. The main thing to pay attention to is the predict_stream() function, which now works seamlessly with the ResponsesAgent where it previously failed in the custom Python model.\nNote: In the video, some cell outputs were intentionally hidden to conceal the URL where the agent is being deployed. You’ll have to take my word and the green check marks that everything worked as expected. I’d love to say I figured out how to properly redact those values in the video, but I’m not quite that tech-savvy (yet). 
Maybe that’ll be the topic of a future blog post.\nConclusion # It was a long journey to arrive at the Responses Agent approach, but hopefully one that made the reasoning clear.\nIf you’ve followed along from the beginning, you’ve seen how a newcomer might start with foundation models, experiment with custom Python models, and eventually discover that Responses Agents offer the most reliable, traceable path forward.\nIf you take away just a few things, let them be these:\nYou now understand how Mosaic AI Gateway, model serving, Python models, and Responses Agents fit together.\nAnd if you’re building something similar, you can confidently start with Responses Agents, knowing the alternatives have been explored and tested.\nThanks for reading, and for sticking with such a deep-dive post. My goal was to make this guide as thorough as possible, answering the same questions I had when I first started.\nIf you found this helpful, stay tuned for more articles on data engineering, AI, and Databricks — there’s plenty more to explore.\n","date":"5 October 2025","externalUrl":null,"permalink":"/posts/ai-gateway/","section":"Posts","summary":"Step-by-step guide to enabling MLflow tracing with Databricks Mosaic AI Gateway. Details the recommended ResponsesAgent approach, examines alternative methods (foundation/external endpoints and custom Python models), and highlights the pitfalls that make the agent path preferable.","title":"Tracing with Databricks Mosaic AI Gateway: A Practical Guide","type":"posts"},{"content":"","date":"21 September 2025","externalUrl":null,"permalink":"/tags/data-quality/","section":"Tags","summary":"","title":"Data Quality","type":"tags"},{"content":"Disclaimer: This presentation was originally created when the technology was called Delta Live Tables (DLT). Databricks has since rebranded it as Lakeflow Declarative Pipelines. While the name has changed, many of the strategies and techniques for applying data quality rules remain the same. 
For the latest documentation, see the Databricks Lakeflow Declarative Pipelines.\nEnsuring high data quality is critical for analytics, decision-making, and building reliable AI/ML models. Without it, organizations risk costly errors, unreliable insights, and ineffective models.\nIn this talk, I share a practical guide to implementing data quality expectations within Databricks’ Delta Live Tables (DLT). Think of expectations as “unit tests for data” — rules that define what your data should look like. By applying them, you can:\nMeasure the proportion of data that meets quality standards Quarantine or block bad data before it flows downstream Detect bugs in code that cause quality issues Lay the foundation for data quality monitoring and reporting The session walks through a four-step process for writing expectations: profiling your data, collaborating with domain experts, documenting rules, and translating them into SQL. I also share lessons learned, tips for managing quarantines, and strategies to balance strict vs. loose expectations.\n📺 Watch the full presentation on YouTube https://www.youtube.com/watch?v=Uk3kN97NgPk\u0026t=2s\n📑 Download the slides and demo code on GitHub https://github.com/mnorberg-dev/data-expectations\nWhether you’re a developer working hands-on with DLT pipelines or simply looking to understand how to raise the bar on your organization’s data quality, this talk will give you actionable tools to get started.\n","date":"21 September 2025","externalUrl":null,"permalink":"/posts/data-quality-expectations/","section":"Posts","summary":"Practical walkthrough for implementing data quality expectations in Databricks Delta Live Tables (Lakeflow Declarative Pipelines). 
Covers a four-step process to profile data, formalize and translate rules, quarantine noncompliant records, and balance strict versus permissive checks, with lessons learned and links to the talk, slides, and demo code.","title":"Databricks Data Quality Expectations Guide","type":"posts"},{"content":"Databricks is quickly becoming one of the most popular data lakehouse platforms out there. Its popularity is growing fast, and for many developers, a significant portion of the job happens directly in the browser inside their Databricks workspace.\nYes, I know—there are popular extensions that let you develop code outside of Databricks. Many folks swear by the VS Code extension because they prefer working in their favorite editor. But at the end of the day, you’re still going to spend time in the browser making sure your code runs as expected.\n👉 If you’d rather skip ahead and get straight to the solution, I’ve published the script (with setup instructions) in this GitHub repo.\nThe Problem: Too Many Tabs, Too Few Clues # Many organizations using Databricks have multiple workspaces to represent different environments—think dev, qa, and prod. Some even have additional splits like staging or sandbox.\nIf you’re working across all of them, here’s the problem: your browser tabs all look the same.\nNote: URLs in images are redacted, but in your environment, the redacted portions will appear as GUIDs.\nInside the workspace, the situation isn’t much better. It\u0026rsquo;s not easy to determine which environment you are in. Technically, each environment does identify itself in two places though:\nThe domain name. Every environment has a URL like adb-\u0026lt;long string of numbers\u0026gt;. But let’s be honest—nobody remembers arbitrary GUIDs. It’s the same reason DNS exists: humans prefer names like google.com instead of memorizing IP addresses. A small piece of text in the top-right corner. 
Sure, it’s there, but after seeing it hundreds of times, your brain starts ignoring it. It’s easy to miss, and if you’re juggling multiple tabs, you can’t even see that text without clicking into each one. The end result? Confusion, context-switching, and the very real risk of running a query in prod that you meant for dev.\nBut Wait, Isn’t There a Color Trick? # Some people solve this by assigning different themes: say, dark mode for prod and light mode for dev. Clever idea, but it falls apart quickly:\nThere are only two color schemes available, and many organizations have three workspaces, meaning one environment will inevitably match another. If two or more environments share a metastore, changing one changes the other. And let’s be honest: some developers simply refuse to use a theme they don’t like. So while the color trick can work in a pinch, it’s not a real solution.\nTo activate dark mode, navigate to Settings → User Preferences → “Prefer Dark.”\nIf you don’t enable this setting, your environment theme will default to match the light theme shown in the images above.\nThe Solution: Tampermonkey to the Rescue # Enter Tampermonkey, a browser extension that lets you run custom JavaScript whenever you visit a matching URL.\nWith a short script, you can automatically label your Databricks browser tabs with an emoji + text identifier for each environment. Suddenly, dev, qa, and prod are crystal clear.\nIt’s simple, lightweight, and makes a huge difference:\nFaster navigation between environments Reduced risk of editing the wrong workspace A quality-of-life boost you’ll wonder how you lived without And best of all: little to no performance impact. I’ve been running this script for weeks, and my browser hasn’t skipped a beat. Below, I\u0026rsquo;ve provided an image illustrating what your dev environment will look like after setting up the Tampermonkey script for yourself.\nHow to Set It Up # Getting your Databricks tabs labeled automatically is quick and easy. 
You’ll need to:\nInstall Tampermonkey Create a new script Paste in the code Configure your environment domains Save and enable the script Enable user scripts in your browser (Chrome only) Verify it’s working Note: These instructions are written for Google Chrome, but the steps can be adapted to other browsers that support Tampermonkey.\nFollow the steps below.\n1. Install Tampermonkey # Download Tampermonkey for your browser from tampermonkey.net. Restart your browser after installation. 2. Create a New Script # Open the Tampermonkey extension → Dashboard → “Create a new script.” Delete the boilerplate code so the editor is blank. 3. Add the Script # Copy and paste the following code into the new script:\n// ==UserScript== // @name Databricks Tab Emoji Label // @namespace http://tampermonkey.net/ // @version 1.5 // @description Add emoji and label to Databricks tab title based on environment // @author Matthew Norberg // @match https://\u0026lt;your-domain-here\u0026gt;.azuredatabricks.net/* // @grant none // ==/UserScript== (function() { \u0026#39;use strict\u0026#39;; function updateTitle() { const url = window.location.href; let label = \u0026#39;\u0026#39;; let emoji = \u0026#39;\u0026#39;; // Define your environment domains here let devDomain = \u0026#39;\u0026#39;; let qaDomain = \u0026#39;\u0026#39;; let prodDomain = \u0026#39;\u0026#39;; if (url.includes(devDomain)) { label = \u0026#39;DEV\u0026#39;; emoji = \u0026#39;🟢\u0026#39;; } else if (url.includes(qaDomain)) { label = \u0026#39;QA\u0026#39;; emoji = \u0026#39;🟡\u0026#39;; } else if (url.includes(prodDomain)) { label = \u0026#39;PROD\u0026#39;; emoji = \u0026#39;🔴\u0026#39;; } else { label = \u0026#39;OTHER\u0026#39;; emoji = \u0026#39;⚪\u0026#39;; } if (!document.title.startsWith(`[${emoji} ${label}]`)) { document.title = `[${emoji} ${label}] ${document.title}`; } } // Initial run setTimeout(updateTitle, 3000); // Monitor for URL changes in SPA let lastUrl = location.href; new MutationObserver(() 
=\u0026gt; { const currentUrl = location.href; if (currentUrl !== lastUrl) { lastUrl = currentUrl; setTimeout(updateTitle, 2000); } }).observe(document, {subtree: true, childList: true}); })(); 4. Configure Environment Domains # Near the top of the script, replace the placeholders with your actual environment domains:\nlet devDomain = \u0026#39;your-dev-domain\u0026#39;; let qaDomain = \u0026#39;your-qa-domain\u0026#39;; let prodDomain = \u0026#39;your-prod-domain\u0026#39;; 5. Save and Enable the Script # Save your changes (File → Save).\nMake sure the script toggle in Tampermonkey is turned on (green).\n6. Enable User Scripts in Chrome # Go to chrome://extensions → find Tampermonkey → click Details.\nMake sure Allow user scripts is enabled.\nNote: Other browsers may have different settings for allowing user scripts, so adapt accordingly.\n7. Verify Installation # Open your Databricks environment in a new tab.\nAfter a few seconds, your tab title should display the correct emoji + environment label.\nTroubleshooting # Tab title doesn’t update right away\nThe script intentionally waits 3 seconds after page load before applying changes.\nWithout this delay, the title is updated but then overwritten while the page finishes loading.\nIn testing, 3 seconds worked well, but you can adjust this by editing the setTimeout call in the code:\nsetTimeout(updateTitle, 3000); Still not working?\nDouble-check that:\nThe script is enabled in Tampermonkey. Your domains are correctly set in both the variables and the @match lines. “Allow user scripts” is enabled in your browser’s extension settings. Final Thoughts # It’s a small tweak, but it solves a surprisingly big problem. If you’re juggling multiple Databricks environments, this little script will save you from confusion, and maybe even prevent a mistake or two.\nSometimes the best tools aren’t the big, complicated ones. 
They’re the tiny hacks that make your day smoother.\n👉 You can grab the full script and step-by-step setup instructions in the databricks-tools-repo here.\n","date":"16 September 2025","externalUrl":null,"permalink":"/posts/databricks-tab-label-tool/","section":"Posts","summary":"Explains how to clearly distinguish Databricks environments by adding environment-specific labels to browser tabs with a Tampermonkey userscript. Outlines the risk of identical tab titles, provides the script with domain placeholders, and gives step-by-step setup and troubleshooting guidance for dev/QA/prod.","title":"The Databricks Tool You Didn't Know You Needed","type":"posts"},{"content":"Hi! I’m Matthew Norberg, a Data Engineer with a passion for turning complex data challenges into clean, maintainable, and high-performing solutions. Over the past several years, I’ve had the opportunity to work with Databricks, Azure, and a variety of modern data tools, building platforms, pipelines, and systems that help organizations make better use of their data.\nMy Journey in Data Engineering # I’ve always had a passion for programming and problem-solving, which first led me down the path of software engineering during my undergraduate studies. In graduate school, I started working more with data, earning a degree in Computer Science with a concentration in Data Science. This combination of software engineering and data expertise naturally led me to data engineering—a perfect middle ground between building robust systems and working with meaningful data. My background in CS and data science has prepared me well to tackle the challenges of modern data engineering.\nMy approach to data engineering can be summed up in one simple philosophy: first make it work, then make it work better, a mindset I learned from my mentor, Rich Dudley. 
I thrive in environments where I can take a messy or incomplete problem and turn it into something reliable, scalable, and elegant.\nMy Projects \u0026amp; Work Highlights # Platform Engineering \u0026amp; Architecture: Built and maintained Databricks environments in Azure using Terraform and Terragrunt. Configured Unity Catalogs, external locations, and volumes, and designed Dev, QA, and Prod environments. Contributed to cost optimization efforts in the Databricks Well-Architected Framework.\nData Pipelines \u0026amp; Medallion Architecture: Designed fault-tolerant pipelines that move data through the medallion architecture layers—bronze, silver, and gold—improving data quality at each step and preparing it for use by downstream teams. Pipelines were engineered to minimize manual steps, making workflows easy to deploy, manage, and maintain.\nSafeguarded Sensitive Procedures: Wrote code to handle sensitive processes mindfully, reducing the risk of mistakes. For example, I developed a WordPress ingestion client in Python with defensive safeguards to ensure proper usage.\nData Quality \u0026amp; Governance: Developed data quality checks and DLT pipelines using Data Expectations. Built dashboards to monitor data health, integrated governance tools like Atlan, and was selected as a Databricks Data \u0026amp; AI Summit speaker candidate, submitting a presentation on Data Expectations in Databricks.\nLarge Data Processing \u0026amp; Analytics: Tuned Spark pipelines and computed complex KPIs. 
Reverse-engineered Tally Street metrics—a tool used by accountants to extract KPIs from general ledgers—using Python and SQL to improve speed, accuracy, and accessibility.\nDemocratizing Data \u0026amp; AI: Built AI/BI Genies in Databricks, AI-driven tools that allow colleagues to ask questions about company data and generate dashboards, making AI and analytics accessible to non-technical users.\nSalesforce Knowledge Base Ingestion: Ingested Salesforce Knowledge Base articles—help desk pages—from Databricks using the Salesforce connector. Added this content to a vector index to power a search application, making it easier for customers to find support and internal teams to access knowledge.\nWordPress Website Ingestion: Created a Python connector from scratch to ingest the entire company WordPress site into Databricks. Implemented retry and backoff logic to handle API failures or network issues. Designed workflows to capture the full website on day 1, then only incremental changes on subsequent days—ensuring that temporary failures don’t require manual reruns and the system automatically catches up the next day.\nCI/CD \u0026amp; Automation: Configured service principals and created Azure DevOps CI/CD pipelines using Infrastructure as Code practices for reliable, repeatable deployments.\nDocumentation \u0026amp; Knowledge Sharing: Created Confluence documentation for all key processes, designed for future ingestion into a company-wide vector database to make knowledge accessible to other engineers.\nOutside the Data World # When I’m not working with data, I like to stay active and explore the outdoors. Golfing and hiking are two of my favorite ways to recharge and stay focused, and I also love rock climbing, which keeps me on my toes—literally and figuratively!\nI’m also a huge pizza and coffee nerd:\nI’ve learned to make pizza dough from scratch, and true to my data-driven nature, I’ve kept a tally in 2025 of every pizza I’ve made. 
Nearly every morning, I make a pour-over coffee using my V60, my favorite coffee brewer. My go-to beans are typically light to medium roasts from George Howell Coffee. I also genuinely enjoy tinkering with personal coding projects, experimenting with new tools, and learning ways to make complex systems simpler and more efficient. And just like in my professional life, I love solving puzzles—whether it’s in code, a tricky climbing route, or perfecting a pizza crust!\nI’m always excited to connect with fellow data enthusiasts, share what I’ve learned, and continue growing as a data engineer. Thanks for stopping by my blog!\n","externalUrl":null,"permalink":"/about/","section":"Matthew Norberg's Data Engineering Blog","summary":"","title":"About Me","type":"page"},{"content":"","externalUrl":null,"permalink":"/authors/","section":"Authors","summary":"","title":"Authors","type":"authors"},{"content":"","externalUrl":null,"permalink":"/categories/","section":"Categories","summary":"","title":"Categories","type":"categories"},{"content":"","externalUrl":null,"permalink":"/series/","section":"Series","summary":"","title":"Series","type":"series"}]