across industries. Traditional engineering domains are no exception.
Over the past two years, I’ve been building LLM-powered tools with engineering domain experts: process engineers, reliability engineers, cybersecurity analysts, and others who spend most of their day in logs, specs, schematics, and reports, doing tasks such as troubleshooting, failure mode analysis, test planning, and compliance checks.
The promise is compelling: thanks to their extensive pre-trained knowledge, LLMs can, in theory, reason like domain experts, speed up the tedious, pattern-matching parts of engineering work, and free experts up for higher-order decisions.
The practice, however, is messier. “Just add a chatbox” rarely translates into useful engineering tools. There is still a big gap between an impressive demo and a system that engineers actually trust and use.
Closing that gap has everything to do with how you frame the problem, how you structure the workflow, and how you integrate the tool into the engineer’s real environment.
In this post, I’d like to share 10 lessons I learned from my past projects. They are my collection of “field notes” rather than a comprehensive checklist. But if you also plan to build, or are currently building, LLM applications for domain experts, I hope these lessons help you avoid a few painful dead ends.
I organize the lessons into three phases, which align with the stages of a typical LLM project:
- Before you start: frame the right problem and set the right expectations.
- During the project: design clear workflows and enforce structure everywhere.
- After you have built: integrate where engineers work and evaluate with real cases.
With that in mind, let’s start.
Phase 1: Before You Start
What you do before even writing a single line of code largely shapes whether an LLM project will succeed or fail.
That means if you are chasing the wrong problem or failing to set the right expectations upfront, your application will struggle to gain traction later, no matter how technically sound you make it.
In the following, I’d like to share some lessons on laying the right foundation.
Lesson 1: Not every problem can or should be addressed by LLMs
When I look at a new use case from engineers, I always try hard to challenge my “LLM-first” reflex and ask myself: can I solve the problem without using LLMs?
For the core reasoning logic, that is, the decision-making bottleneck you want to automate, there are usually at least three classes of methods to consider:
- Rule-based and analytical methods
- Data-driven ML models
- LLMs
Rule-based and analytical methods are cheap, transparent, and easy to test. However, they can be inflexible and have only limited power in messy, real-world conditions.
Classic ML models, even a simple regression or classification, can often give you fast, reliable, and easily scalable decisions. However, they require historical data (and very often, labels) to learn the patterns.
LLMs, on the other hand, shine when the core challenge is understanding, synthesizing, or generating language across messy artifacts. Think skimming through 50 incident reports to surface the likely relevant ones, or turning free-text logs into labeled, structured events. But LLMs are expensive, slow, and typically don’t behave as deterministically as you might want.
Before deciding to use an LLM for a given problem, ask yourself:
- Could 80% of the problem be solved with a rule engine, an analytical model, or a classic ML model? If yes, simply start there. You can always layer an LLM on top later if needed.
- Does this task require precise, reproducible numerical results? If so, keep the computation in analytical code or ML models, and use LLMs only for explanation or contextualization.
- Will there be no human in the loop to review and approve the output? If that’s the case, an LLM may not be a good choice, as it rarely provides strong guarantees.
- At the expected speed and volume, would LLM calls be too expensive or too slow? If you need to process thousands of log lines or alerts per minute, relying on an LLM alone will quickly hit a wall on both cost and latency.
If your answers are mostly “no”, you’ve probably found a good candidate to explore with LLMs.
Lesson 2: Set the right mindset from day one
Once I’m convinced that an LLM-based solution is appropriate for a particular use case, the next thing I do is align on the right mindset with the domain experts.
One thing I find extremely important is the positioning of the tool. A framing I often adopt that works very well in practice is this: the goal of our LLM tool is augmentation, not automation. The LLM only helps you (i.e., the domain experts) analyze faster, triage faster, and explore more, but you remain the decision-maker.
That difference matters a lot.
When you position the LLM tool as augmentation, engineers tend to engage with it enthusiastically, as they see it as something that could make their work faster and less tedious.
On the other hand, if they sense that the new tool might threaten their role or autonomy, they may distance themselves from the project and give you very limited support.
From a developer’s perspective (that’s you and me), setting this “amplify instead of replace” mindset also reduces anxiety. Why? Because it makes it much easier to discuss mistakes! When the LLM gets something wrong (and it will), the conversation won’t simply be “your AI failed,” but rather “the suggestion wasn’t quite right, but it’s still insightful and gives me some ideas.” That’s a very different dynamic.
Next time you’re building LLM apps for domain experts, try to emphasize:
- LLMs are, at best, junior assistants. They’re fast and work around the clock, but they’re not always right.
- Experts are the reviewers and ultimate decision-makers. You are experienced, cautious, and accountable.
Once this mindset is in place, you’ll see engineers start to evaluate your solution through the lens of “Does this help me?” rather than “Can this replace me?” That matters a lot for building trust and driving adoption.
Lesson 3: Co-design with experts and define what “better” means
Once we’ve agreed that LLMs are appropriate for the task at hand and that the goal is augmentation, not automation, the next critical thing I try to figure out is: what does “better” actually look like for this use case?
To get a really good understanding of that, you should bring the domain experts into the design loop as early as possible.
Concretely, you should spend time sitting down with the domain experts, walking through how they solve the problem today, and taking notes on which tools they use and which docs/specs they refer to. Remember to ask them to show where the pain point really is, and to understand what is OK to be “approximate” and what kinds of mistakes are annoying or unacceptable.
A concrete outcome of these conversations is a shared definition of “better” in the experts’ own language. These are the metrics you’re optimizing for, which could be the amount of triage time saved, the number of false leads reduced, or the number of manual steps skipped.
Once the metric(s) are defined, you automatically have a practical baseline (i.e., whatever the current manual process takes) to benchmark your solution against later.
Besides the technical effects, I’d say the psychological effects are just as important: by involving experts early, you’re showing them that you’re genuinely trying to learn how their world works. That alone goes a long way toward earning trust.
Phase 2: During The Project
After setting the stage, you’re now ready to build. Exciting stuff!
In my experience, there are a few important decisions you need to make to ensure your hard work actually earns trust and gets adopted. Let’s walk through those decision points.
Lesson 4: It’s Co-pilot, not Auto-pilot
A temptation I see a lot (also in myself) is the desire to build something “fully autonomous”. As a data scientist, who can really resist building an AI system that gives the user the final answer with a single button push?
Well, the reality is less flashy but far more effective. In practice, this “autopilot” mindset rarely works well with domain experts, because it goes against the fact that engineers are used to systems where they understand the logic and the failure modes.
If your LLM app simply does everything in the background and only presents a final result, two things often happen:
- Engineers don’t trust the results, because they can’t see how it got there.
- They can’t correct it, even if they see something obviously off.
Therefore, instead of defaulting to an “autopilot” mode, I prefer to intentionally design the system with multiple control points where experts can influence the LLM’s behavior. For example, instead of the LLM auto-classifying all 500 alarms and creating tickets, we can design the system to first group the alarms into, say, 5 candidate incident threads, pause, and show the expert the grouping rationale and key log lines for each thread. Experts can then merge or split groups. Only after they approve the grouping does the LLM proceed to generate the draft tickets.
Yes, from a UI perspective, this adds a bit of work, as you have to implement human-input mechanisms, expose intermediate reasoning traces and results clearly, and so on. But the payoff is real: your experts will actually trust and use your system because it gives them the sense that they’re in control.
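To make this concrete, here is a minimal sketch of such a control point in plain Python. It assumes the OpenAI SDK and a JSON-capable chat model; the prompts, model name, and the propose_groups / draft_tickets helpers are my own illustrative choices, not a prescribed design.

```python
# Minimal human-in-the-loop control point (sketch, not production code).
import json
from openai import OpenAI

client = OpenAI()

def llm_json(prompt: str) -> dict:
    """One LLM call that must return JSON: prompt in, dict out."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any JSON-capable chat model works here
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)

def propose_groups(alarms: list[str]) -> dict:
    return llm_json(
        "Group these alarms into candidate incident threads. Return JSON: "
        '{"groups": [{"title": str, "alarm_ids": [int], "rationale": str}]}\n'
        + "\n".join(f"[{i}] {a}" for i, a in enumerate(alarms))
    )

def draft_tickets(approved_groups: list[dict]) -> dict:
    return llm_json(
        "Draft one ticket per approved group. Return JSON: "
        '{"tickets": [{"title": str, "body": str}]}\n' + json.dumps(approved_groups)
    )

def triage_alarms(alarms: list[str]) -> dict:
    proposal = propose_groups(alarms)
    # Control point: show the rationale and wait for the expert before continuing.
    for group in proposal["groups"]:
        print(group["title"], "|", group["rationale"])
    if input("Approve this grouping? [y/n] ").strip().lower() != "y":
        return {"status": "needs_revision", "proposal": proposal}
    return draft_tickets(proposal["groups"])
```

In a real tool the approval step would live in the UI rather than an input() prompt, but the shape is the same: propose, pause, let the expert decide, then continue.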
Lesson 5: Focus on workflow, roles, and data flow before picking a framework
Once we get into the implementation phase, a common question many developers (including my past self) tend to ask first is: which framework should I build on?
This instinct is completely understandable. After all, there are so many shiny frameworks out there, and it does feel like choosing the “right” one is the first big decision. But for prototyping with engineering domain experts, I’d argue that this is usually not the right place to start.
In my own experience, for the first version, you can go a long way with the good old from openai import OpenAI or from google import genai (or any other LLM provider you prefer).
Why? Because at this stage, the most pressing question is not which framework to build upon, but whether your overall approach actually solves the experts’ problem. And you need to verify that as quickly as possible.
To do that, I like to focus on three pillars instead of frameworks:
- Pipeline design: How do we decompose the problem into clear steps?
- Role design: How should we instruct the LLM at each step?
- Data flow & context design: What goes in and out of each step?
If you treat each LLM call as a pure function, like this:
inputs → LLM reasoning → output
Then you can wire these “functions” together with just normal control flow, e.g., if/else conditions, for/while loops, retries, etc., which is already natural to you as a developer.
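Here is a minimal sketch of that idea with the plain OpenAI SDK; the model, prompts, and triage logic are illustrative only.

```python
# Each LLM call is just a function (inputs in, text out), and ordinary Python
# control flow wires the steps together. Prompts and model are assumptions.
from openai import OpenAI

client = OpenAI()

def ask_llm(role: str, task: str, data: str) -> str:
    """One pipeline step: role instruction + task + data in, text out."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": role},
            {"role": "user", "content": f"{task}\n\n{data}"},
        ],
    )
    return resp.choices[0].message.content

def handle_alarm(log_text: str) -> str:
    severity = ask_llm(
        role="You are a triage assistant for plant alarms.",
        task="Classify the severity. Answer with exactly one word: LOW, MEDIUM, or HIGH.",
        data=log_text,
    ).strip().upper()
    # Plain if/else decides the next step; no orchestration framework required.
    if severity == "HIGH":
        return ask_llm("You are a reliability engineer's assistant.",
                       "Draft an escalation note for the on-call engineer.", log_text)
    return ask_llm("You are a reliability engineer's assistant.",
                   "Write a one-line summary for the daily report.", log_text)
```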
This applies to tool calling, too. If the LLM decides it needs to call a tool, it can simply output the function name and the associated parameters, and your regular code can execute the actual function and feed the result back into the next LLM call.
You really don’t need a framework just to express the pipeline.
Of course, I’m not saying that you should avoid frameworks. They’re quite helpful in production, as they provide observability, concurrency, state management, etc., out of the box. But at the early stage, I believe it’s a good strategy to keep things simple, so that you can iterate faster with domain experts.
Once you have verified your key assumptions with your experts, it won’t be difficult to migrate your pipeline/role/data design to a more production-ready framework.
In my view, that is lean development in action.
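One (simplified, assumption-laden) way that loop can look without any framework: the LLM returns the tool name and arguments as JSON, and plain Python does the dispatch. The tool names and the JSON contract below are hypothetical stand-ins for your real systems.

```python
# Hand-rolled tool-calling loop (sketch). The tools and schema are hypothetical.
import json
from openai import OpenAI

client = OpenAI()

def llm_json(prompt: str) -> dict:
    """Single LLM call that returns parsed JSON (same helper shape as above)."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)

# Hypothetical tools; real code would query your actual log store or document system.
TOOLS = {
    "get_recent_logs": lambda equipment_id: f"<last 50 log lines for {equipment_id}>",
    "get_spec_section": lambda section_id: f"<text of spec section {section_id}>",
}

def answer_with_tools(question: str, max_turns: int = 3) -> str:
    context = question
    for _ in range(max_turns):
        decision = llm_json(
            "Either answer the question or request one tool call. Return JSON: "
            '{"action": "answer" | "tool", "tool": "...", "args": {}, "answer": "..."}\n'
            f"Available tools: {list(TOOLS)}\n\n{context}"
        )
        if decision.get("action") == "answer":
            return decision.get("answer", "")
        # Ordinary Python executes the requested function and feeds the result back.
        result = TOOLS[decision["tool"]](**decision["args"])
        context += f"\n\nResult of {decision['tool']}: {result}"
    return "Stopped: tool-call budget exhausted."
```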
Lesson 6: Try workflows before jumping to agents
Recently, there has been quite a lot of discussion around workflows vs. agents. Every major player in the field seems eager to emphasize that they’re “building agents,” instead of just “running predefined workflows.”
As developers, it’s very easy to feel the temptation: shouldn’t we be building agents, too?
No.
On paper, AI agents sound super attractive. But in practice, especially in engineering domains, I’d argue that a well-orchestrated workflow with domain-specific logic can already solve a large fraction of the real problems.
And here is the thing: it does so with far less randomness.
Typically, engineers already follow a certain workflow to solve a specific problem. Instead of letting LLM agents “rediscover” that workflow, it’s much better to translate that domain knowledge directly into a deterministic, staged workflow. This immediately gives you a couple of advantages:
- Workflows are far easier to debug. If your system starts to behave strangely, you can easily spot which step is causing the issue.
- Domain experts can easily understand what you’re building, because a workflow maps naturally to their mental model.
- Workflows naturally invite human feedback. They can easily be paused, accept new inputs, and then resume.
- You get much more consistent behavior. The same input leads to a similar path and result, and that matters a ton in engineering problem-solving.
Again, I’m not saying that AI agents are useless. There are certainly situations where more flexible, agentic behavior is justified. But I’d say always start with a clear, deterministic workflow that explicitly encodes domain knowledge, and validate with experts that it’s actually helpful. You can introduce more agentic behavior once you hit limitations that a simple workflow cannot solve.
Yes, it might sound boring. But your ultimate goal is to solve the problem in a predictable and explainable way that brings business value, not to build a fancy agentic demo. It’s good to always keep that in mind.
Lesson 7: Structure everything you can – inputs, outputs, and knowledge
A common perception of LLMs is that they’re good at handling free-form text. So the natural instinct is: let’s just feed reports and logs in and ask the model to reason, right?
No.
In my experience, especially in engineering domains, that’s leaving a lot of performance on the table. In fact, LLMs tend to behave significantly better when you give them structured input and ask them to produce structured output.
Engineering artifacts often come in semi-structured form already. Instead of dumping entire raw documents into the prompt, I find it very helpful to extract and structure the key information first. For example, we can parse free-text incident reports into the following JSON:
{
  "incident_id": "...",
  "equipment": "...",
  "symptoms": ["..."],
  "start_time": "...",
  "end_time": "...",
  "suspected_causes": ["..."],
  "mitigations": ["..."]
}
That structuring step can be done in various ways: we can resort to classic regular expressions and parsers, or develop small helper scripts. We can even employ a separate LLM whose only job is to normalize the free text into a consistent schema.
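As a tiny illustration of the helper-script route, here is a sketch that turns semi-structured log lines into dicts with a regular expression; the log format and field names are invented for the example.

```python
# Structure free-text log lines before any LLM sees them (sketch, made-up format).
import re

# Assumed log format: "2024-05-03T14:22:10 | PUMP-12 | ALARM | Seal pressure above threshold"
LOG_PATTERN = re.compile(
    r"^(?P<timestamp>\S+)\s*\|\s*(?P<equipment>[^|]+?)\s*\|\s*(?P<level>[^|]+?)\s*\|\s*(?P<message>.+)$"
)

def parse_log_line(line: str) -> dict | None:
    """Turn one free-text log line into a structured record (None if it doesn't match)."""
    match = LOG_PATTERN.match(line.strip())
    return match.groupdict() if match else None

lines = [
    "2024-05-03T14:22:10 | PUMP-12 | ALARM | Seal pressure above threshold",
    "2024-05-03T14:25:41 | PUMP-12 | INFO  | Operator acknowledged alarm",
]
events = [rec for rec in (parse_log_line(l) for l in lines) if rec]
# 'events' is now clean, structured input for the downstream LLM reasoning step.
```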
This way, you give the main reasoning LLM a clean view of what happened. As a bonus, with this structure in place, you can ask the LLM to cite specific facts when reaching its conclusions. That saves you quite some time in debugging.
If you’re doing RAG, this structured layer is also what you should retrieve over, instead of the raw PDFs or logs. You get better precision and more reliable citations when retrieving over clean, structured artifacts.
Now, on the output side, structure is basically mandatory if you want to plug the LLM into a larger workflow. Concretely, this means that instead of asking a free-text question like “what are the likely causes, and what should we do next?”, I prefer something like:
{
  "likely_causes": [
    {"cause": "...", "confidence": "medium"}
  ],
  "recommended_next_steps": [
    {"description": "...", "priority": 1}
  ],
  "summary": "short free-text summary for the human"
}
Often, this is defined as a Pydantic model, and you can leverage the structured output feature to explicitly instruct the LLM to produce output that conforms to it.
I used to see LLMs as “text in, text out”. Now I see them more as “structure in, structure out”, and this is especially true in engineering domains where we need precision and robustness.
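Here is a sketch of what that can look like with Pydantic plus the OpenAI SDK’s structured-output helper; the field names mirror the JSON example above, while the model name and prompts are illustrative assumptions.

```python
# Enforce the output schema with Pydantic + structured outputs (sketch).
from openai import OpenAI
from pydantic import BaseModel

class LikelyCause(BaseModel):
    cause: str
    confidence: str  # "low" | "medium" | "high"

class NextStep(BaseModel):
    description: str
    priority: int

class DiagnosisOutput(BaseModel):
    likely_causes: list[LikelyCause]
    recommended_next_steps: list[NextStep]
    summary: str

client = OpenAI()

# Hypothetical structured incident, e.g. produced by the parsing step above.
structured_incident = '{"incident_id": "INC-001", "equipment": "PUMP-12", "symptoms": ["seal pressure alarm"]}'

completion = client.beta.chat.completions.parse(
    model="gpt-4o-2024-08-06",  # assumption: any model that supports structured outputs
    messages=[
        {"role": "system", "content": "You are a diagnosis assistant for reliability engineers."},
        {"role": "user", "content": f"Analyze this incident and propose next steps:\n{structured_incident}"},
    ],
    response_format=DiagnosisOutput,  # output is validated against the schema
)
diagnosis = completion.choices[0].message.parsed  # a DiagnosisOutput instance
```

Downstream code can then read diagnosis.recommended_next_steps directly, instead of parsing free text.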
Lesson 8: Don’t ignore analytical AI
I know we’re building LLM-based solutions. But as we learned in Lesson 1, LLMs aren’t the only tool in your toolbox. We also have the “old school” analytical AI models.
In many engineering domains, there is a long track record of applying classic analytical AI/ML methods to address various aspects of the problems: anomaly detection, time-series forecasting, clustering, classification, you name it.
These methods are still incredibly valuable, and in many cases, they should be doing the heavy lifting instead of being thrown away.
To solve the problem at hand effectively, it is often worth considering a hybrid approach of analytical AI + GenAI: analytical ML handles the heavy lifting of pattern matching and detection, and LLMs operate on top to reason, explain, and recommend next steps.
For example, say you have thousands of incident events per week. You can start by using classical clustering algorithms to group similar events into patterns, and perhaps also compute some aggregate stats for each cluster. The workflow can then feed those cluster-level results into an LLM and ask it to label each pattern, describe what it means, and suggest what to check first. Afterward, engineers review and refine the labels.
So why does this matter? Because analytical methods give you speed, reliability, and precision on structured data. They’re deterministic, they scale to millions of data points, and they don’t hallucinate. LLMs, on the other hand, excel at synthesis, context, and communication. You should use each for what it’s best at.
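A minimal sketch of that hybrid pattern, with TF-IDF and k-means standing in for whatever clustering actually suits your events; the prompt and returned fields are illustrative, not a fixed recipe.

```python
# Hybrid analytical AI + GenAI (sketch): clustering does the heavy lifting,
# the LLM labels and explains the clusters.
import json
from openai import OpenAI
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def cluster_events(event_texts: list[str], n_clusters: int = 5) -> dict[int, list[str]]:
    """Analytical step: group similar incident events deterministically."""
    features = TfidfVectorizer(stop_words="english").fit_transform(event_texts)
    labels = KMeans(n_clusters=n_clusters, n_init="auto", random_state=0).fit_predict(features)
    clusters: dict[int, list[str]] = {}
    for text, label in zip(event_texts, labels):
        clusters.setdefault(int(label), []).append(text)
    return clusters

def describe_clusters(clusters: dict[int, list[str]]) -> dict:
    """GenAI step: label each pattern, explain it, and suggest what to check first."""
    stats = {cid: {"size": len(evts), "examples": evts[:3]} for cid, evts in clusters.items()}
    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": "For each cluster below, return JSON mapping cluster id to "
                       '{"label": str, "meaning": str, "check_first": str}.\n'
                       + json.dumps(stats),
        }],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)
```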
Phase 3: After You Have Built
You’ve built a system that works technically. Now comes the hardest part: getting it adopted. No matter how smart your implementation is, a tool that sits on a shelf is a tool that brings zero value.
In this section, I’d like to share two final lessons on integration and evaluation. You want to make sure your system lands in the real world and earns trust through evidence, right?
Lesson 9: Integrate where engineers actually work
A separate UI, such as a simple web app or a notebook, works perfectly fine for exploration and getting first-hand feedback. But for real adoption, you must think beyond what your app does and focus on where your app shows up.
Engineers already have a set of tools they rely on every day. If your LLM tool presents itself as “yet another web app with a login and a chat box”, you can already see that it will struggle to become part of the engineers’ routine. People will try it a couple of times, then when things get busy, they just fall back to whatever they’re used to.
So, how do we address this issue?
At this point, I’d ask myself: how can I bring the LLM’s capabilities to where the engineers already work, rather than asking them to come to yet another app?
In practice, what does this mean?
The most powerful integration is often UI-level embedding. That basically means you embed LLM capabilities directly into the tools engineers already use. For example, in a standard log viewer, besides the usual dashboard plots, you can add a side panel with buttons like “summarize the selected events” or “suggest next diagnostic steps”. This gives engineers LLM support without interrupting their usual workflow.
One caveat worth mentioning, though: UI-level embedding often requires buy-in from the team that owns that tool. If possible, start building those relationships early.
Then, instead of a generic chat window, I’d focus on buttons with concrete verbs that match how engineers think about their tasks, be it summarize, group, explain, or compare. A chat interface (or something similar) can still exist for follow-up questions, clarifications, or free-form feedback after the LLM produces its initial output. But the primary interaction should be task-specific actions, not open-ended conversation.
Also important: make the LLM’s context dynamic and adaptive. If the system already knows which incident or time window the experts are looking at, pass that context directly to the LLM calls. Don’t make them copy-paste IDs, logs, or descriptions into yet another UI.
If this integration is done well, the barrier to trying the tool (and ultimately adopting it) becomes much lower. And for you as a developer, it’s much easier to get richer and more honest feedback, because the tool is tested under real conditions.
Lesson 10: Evaluation, evaluation, evaluation
Once you have shipped the first version, you might think your work is done. Well, in practice, this is exactly the point where the real work starts.
It’s the start of the evaluation.
There are two things I want to discuss here:
- Make the system show its work in a way that engineers can inspect.
- Sit down with experts and walk through real cases together.
Let’s discuss them in turn.
First, make the system show its work. When I say “show its work”, I don’t just mean a final answer. I want the system to show, at a reasonable level of detail, three concrete things: what it looked at, what steps it took, and how confident the LLM is.
- What it looked at: this is essentially the evidence the LLM used. It is good practice to always instruct the LLM to cite specific evidence when it produces a conclusion or suggestion. That evidence can be the specific log lines, incident IDs, or spec sections that support the claim. Remember the structured input we discussed in Lesson 7? It is exactly what makes citation management and verification practical.
- What steps it took: this refers to the reasoning trace produced by the pipeline. Here, I’d expose the output of the key intermediate steps. If you’re adopting a multi-step workflow (Lessons 5 & 6), you already have these steps as separate LLM calls or functions. And if you’re enforcing structured output (Lesson 7), surfacing them in the UI becomes easy.
- How confident the LLM is: finally, I almost always ask the LLM to output a confidence level (low/medium/high), plus a short rationale for assigning that level. In practice, what you get is something like: “The LLM said A, based on B and C, with medium confidence because of assumptions D and E.” Engineers are far more comfortable with that kind of statement, and again, this is an important step toward building trust.
Now, let’s move to the second point: evaluate with experts using real cases.
My suggestion is that once the system can properly show its work, you should schedule dedicated evaluation sessions with domain experts.
It’s like doing user testing.
A typical session could look like this: you and the expert pick a set of real cases. These can be a mixture of typical ones, edge cases, and a few historical cases with known outcomes. You run them through the tool together. Throughout the process, ask the expert to think aloud: What do you expect the tool to do here? Is this summary accurate? Are these suggested next steps reasonable? Does the cited evidence actually support the conclusion? Meanwhile, remember to take detailed notes on things like where the tool clearly saves time, where it still fails, and what important context is currently missing.
After a few sessions with the experts, you can tie the outcomes back to the definition of “better” we established earlier (Lesson 3). This doesn’t have to be a formal quantitative evaluation, but trust me, even a handful of concrete before/after comparisons can be eye-opening, and they give you a solid foundation for continuing to iterate on your solution.
Conclusion
Now, looking back at those ten lessons, what recurring themes do you see?
Here’s what I see:
First, respect the domain expertise. Start from how domain engineers actually work, and genuinely learn their pain points and needs. Position your tool as something that helps them, not something that replaces them. Always let experts stay in control.
Second, engineer the system. Start with simple SDK calls, deterministic workflows, and structured inputs/outputs, and blend traditional analytics with the LLM where that makes sense. Remember, LLMs are only one component in a larger system, not the whole solution.
Third, treat deployment as the beginning, not the end. The moment you deliver the first working version is when you can finally start having meaningful conversations with experts: walking through real cases together, collecting their feedback, and continuing to iterate.
Of course, these lessons are only my current reflections on what seems to work when building LLM applications for engineers, and they are certainly not the only way to go. Still, they’ve served me well, and I hope they can spark some ideas for you, too.
Happy building!
