Why CrewAI’s Manager-Worker Architecture Fails — and How to Fix It


Multi-agent orchestration is one of the most promising applications of LLMs, and CrewAI has quickly become a popular framework for building agent teams. But one of its most important features—the hierarchical process—simply doesn’t work as documented. In real workflows, the manager doesn’t effectively coordinate agents; instead, CrewAI executes tasks sequentially, resulting in incorrect reasoning, unnecessary tool calls, and extremely high latency. This issue has been raised in several online forums with no clear resolution.

In this article, I demonstrate why CrewAI’s hierarchical process fails, show evidence from actual Langfuse traces, and provide a reproducible path to make the manager-worker pattern work reliably using custom prompting.

Multi-agent Orchestration

Before we get into the details, let us understand what orchestration means in an agentic context. In simple terms, orchestration is managing and coordinating multiple interdependent tasks in a workflow. But haven’t workflow management tools (e.g., RPA) been available forever to do just that? So what changed with LLMs?

The answer is the ability of LLMs to understand meaning and intent from natural language instructions, just as people in a team would. While earlier workflow tools were rule-based and rigid, with LLMs functioning as agents the expectation is that they will be able to understand the intent of the user’s query, use reasoning to create a multi-step plan, infer the tools to be used, derive their inputs in the correct formats, and synthesize all the intermediate results into a precise response to the user’s query. The orchestration frameworks, in turn, are supposed to guide the LLM with appropriate prompts for planning, tool calling, generating the response, etc.
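As a toy illustration of this loop, here is plain Python with stubbed tools and a canned plan standing in for the LLM; all the names below are hypothetical, just to make the plan-execute-synthesize shape concrete:

```python
def orchestrate(query, plan_fn, tools):
    # 1. "Plan": an LLM would infer intent and produce these steps.
    steps = plan_fn(query)
    # 2. Execute each planned tool with the query as its input.
    results = [tools[step](query) for step in steps]
    # 3. Synthesize the intermediate results into one response.
    return " ".join(results)

# Stubbed tools; a real system would call diagnostics or billing APIs.
tools = {
    "diagnose": lambda q: "Possible cause: blocked vents.",
    "refund_check": lambda q: "Duplicate charge found; refund issued.",
}

# Canned "planner" standing in for the LLM's reasoning step.
plan_fn = lambda q: ["diagnose"] if "overheat" in q else ["refund_check"]

print(orchestrate("Why is my laptop overheating?", plan_fn, tools))
```

The point of the sketch is only that planning, tool selection, and synthesis are distinct steps; an orchestration framework wraps each of them in prompts to the LLM instead of hard-coded rules.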

Among the orchestration frameworks, CrewAI, with its natural-language-based definition of tasks, agents, and crews, depends the most on the LLM’s ability to understand language and manage workflows. While not as deterministic as LangGraph (since LLM outputs can’t be fully deterministic), it abstracts away most of the complexity of routing, error handling, etc. into simple, user-friendly constructs with parameters, which the user can tune for appropriate behavior. So it’s a good framework for creating prototypes by product teams and even non-developers.

To illustrate, let’s take a use case to work with, and evaluate the response based on the following criteria:

  1. Quality of orchestration
  2. Quality of final response
  3. Explainability
  4. Latency and usage cost

Use Case

Take the case where a team of customer support agents resolves technical or billing tickets. When a ticket comes in, a triage agent categorizes it, then assigns it to the technical or billing support specialist for resolution. There is a Customer Support Manager who coordinates the team, delegates tasks, and validates the quality of the response.

Together they will be solving queries such as:

  1. Why is my laptop overheating?
  2. Why was I charged twice last month?
  3. My laptop is overheating, and also, I was charged twice last month?
  4. My invoice amount is wrong after a system glitch?

The first query is purely technical, so only the technical support agent should be invoked by the manager; the second is billing only; and the third and fourth require answers from both the technical and billing agents.
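The routing we expect from the manager is easy to state deterministically. As a reference point, here is the intended category-to-specialist mapping in plain Python (a sketch of the desired behavior, not CrewAI code):

```python
def route(category: str) -> list[str]:
    """Return the specialists that should handle a triaged ticket."""
    routes = {
        "Technical": ["Technical Support Specialist"],
        "Billing": ["Billing Support Specialist"],
        "Both": ["Technical Support Specialist", "Billing Support Specialist"],
    }
    return routes[category]

print(route("Both"))
```

The question for the rest of the article is whether CrewAI’s hierarchical process actually follows this routing, or something else.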

Let’s build this team of CrewAI agents and see how well it works.

Crew of Customer Support Agents

Hierarchical Process

According to the CrewAI documentation, the hierarchical process mimics an organizational hierarchy: a manager agent oversees task execution, including planning, delegation, and validation of outcomes. Also, the manager agent can be created in two ways: automatically by CrewAI, or explicitly set by the user. In the latter case, you have more control over the instructions given to the manager agent. We’ll try both ways for our use case.

CrewAI Code

Following is the code for the use case. I have used gpt-4o as the LLM and Langfuse for observability.
from crewai import Agent, Crew, Process, Task, LLM
from dotenv import load_dotenv
import os
from observe import * # Langfuse trace

load_dotenv()
verbose = False
max_iter = 4

API_VERSION = os.getenv('API_VERSION')
# Create your LLM
llm_a = LLM(
    model="gpt-4o",
    api_version=API_VERSION,
    temperature = 0.2,
    max_tokens = 8000,
)

# Define the manager agent
manager = Agent(
    role="Customer Support Manager",
    goal="Oversee the support team to ensure timely and effective resolution of customer inquiries. Use the tool to categorize the user query first, then decide the next steps. Synthesize responses from different agents if needed to provide a comprehensive answer to the customer.",
    backstory=( """
        You do not try to find an answer to the user ticket {ticket} yourself. 
        You delegate tasks to coworkers based on the following logic:
        Note the category of the ticket first by using the triage agent.
        If the ticket is categorized as 'Both', always assign it first to the Technical Support Specialist, then to the Billing Support Specialist, then print the final combined response. Ensure that the final response answers both the technical and billing issues raised in the ticket based on the responses from both the Technical and Billing Support Specialists.
        ELSE
        If the ticket is categorized as 'Technical', assign it to the Technical Support Specialist, else skip this step.
        Before proceeding further, analyse the ticket category. If it is 'Technical', print the final response. Terminate further actions.
        ELSE
        If the ticket is categorized as 'Billing', assign it to the Billing Support Specialist.
        Finally, compile and present the final response to the customer based on the outputs from the assigned agents.
        """
    ),
    llm = llm_a,
    allow_delegation=True,
    verbose=verbose,
)

# Define the triage agent
triage_agent = Agent(
    role="Query Triage Specialist",
    goal="Categorize the user query into technical or billing related issues. If a query requires both aspects, reply with 'Both'.",
    backstory=(
        "You are a seasoned expert in analysing the intent of a user query. You answer precisely with one word: 'Technical', 'Billing' or 'Both'."
    ),
    llm = llm_a,
    allow_delegation=False,
    verbose=verbose,
)

# Define the technical support agent
technical_support_agent = Agent(
    role="Technical Support Specialist",
    goal="Resolve technical issues reported by customers promptly and effectively",
    backstory=(
        "You are a highly skilled technical support specialist with a strong background in troubleshooting software and hardware issues. "
        "Your primary responsibility is to assist customers in resolving technical problems, ensuring their satisfaction and the smooth operation of their products."
    ),
    llm = llm_a,
    allow_delegation=False,
    verbose=verbose,
)

# Define the billing support agent
billing_support_agent = Agent(
    role="Billing Support Specialist",
    goal="Address customer inquiries related to billing, payments, and account management",
    backstory=(
        "You are an experienced billing support specialist with expertise in handling customer billing inquiries. "
        "Your main objective is to provide clear and accurate information regarding billing processes, resolve payment issues, and assist with account management to ensure customer satisfaction."
    ),
    llm = llm_a,
    allow_delegation=False,
    verbose=verbose,
)

# Define tasks
categorize_tickets = Task(
    description="Categorize the incoming customer support ticket: '{ticket}' based on its content to determine whether it is technical or billing-related. If a query requires both aspects, reply with 'Both'.",
    expected_output="A categorized ticket labeled as 'Technical', 'Billing' or 'Both'. Do not be verbose, just reply with one word.",
    agent=triage_agent,
)

resolve_technical_issues = Task(
    description="Resolve technical issues described in the ticket: '{ticket}'",
    expected_output="Detailed solutions provided to each technical issue.",
    agent=technical_support_agent,
)

resolve_billing_issues = Task(
    description="Resolve billing issues described in the ticket: '{ticket}'",
    expected_output="Comprehensive responses to each billing-related inquiry.",
    agent=billing_support_agent,
)

# Instantiate your crew with a custom manager and hierarchical process
crew_q = Crew(
    agents=[triage_agent, technical_support_agent, billing_support_agent],
    tasks=[categorize_tickets, resolve_technical_issues, resolve_billing_issues],
    # manager_llm = llm_a, # Uncomment for auto-created manager
    manager_agent=manager, # Comment for auto-created manager
    process=Process.hierarchical,
    verbose=verbose,
)

As is evident, the program mirrors the team of human agents. Not only is there a manager, a triage agent, and technical and billing support agents, but the CrewAI objects such as Agent, Task and Crew are self-evident in their meaning and easy to visualize. Another observation is that there is very little Python code; most of the reasoning, planning and behavior is natural-language based, and depends on the ability of the LLM to derive meaning and intent from language, then reason and plan for the goal.

A CrewAI codebase therefore scores high on ease of development. It’s a low-code way of creating a flow quickly, with most of the heavy lifting of the workflow done by the orchestration framework rather than the developer.

How well does it work?

As we’re testing the hierarchical process, the process parameter is set to Process.hierarchical in the Crew definition. We’ll try the following CrewAI options and measure performance:

  1. Manager agent auto-created by CrewAI
  2. Using our custom manager agent

1. Auto-created manager agent

Here is the Langfuse trace:

Why is my laptop overheating?

The key observations are as follows:

  1. First, look at the final output. For a query that was obviously a technical issue, this is a poor response.
  2. Why does this happen? The left panel shows that the execution first went to the triage specialist, then to technical support, and then, strangely, to the billing support specialist as well. The following graphic depicts this:
Langfuse trace graph

Looking closely, we find that the triage specialist correctly identified the ticket as “Technical” and the technical support agent gave a great reply, as follows:

Technical support agent response

But then, instead of stopping and replying with the above as the final response, the Crew Manager moved on to the Billing Support Specialist as well:

Billing support agent response

This resulted in the Billing agent’s response overwriting the Technical agent’s response, with the Crew Manager doing a sub-optimal job of validating the quality of the final response against the user’s query.

Why did it happen? Recall that with the auto-created manager, the Crew is defined as follows:

crew_q = Crew(
    agents=[triage_agent, technical_support_agent, billing_support_agent],
    tasks=[categorize_tickets, resolve_technical_issues, resolve_billing_issues],
    manager_llm = llm_a,
    process=Process.hierarchical,
    verbose=verbose,
)

If you now ask a billing-related query, it will appear to give a correct answer simply because the billing task is the last task in the sequence.
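The failure mode is easy to reproduce in miniature: if every task runs in order and each output simply replaces the previous one, the “final” answer is whatever ran last. A toy simulation of the observed behavior (not CrewAI internals):

```python
def run_all_sequentially(tasks):
    final_output = None
    for name, handler in tasks:
        final_output = handler()  # each task's output overwrites the previous
    return final_output

# Stub handlers standing in for the three agents' answers.
tasks = [
    ("categorize_tickets", lambda: "Technical"),
    ("resolve_technical_issues", lambda: "Clean the vents and check airflow."),
    ("resolve_billing_issues", lambda: "No billing issue found."),
]

# The correct technical answer is lost because the billing task ran last.
print(run_all_sequentially(tasks))
```

This is exactly why a billing query “works” and a technical query doesn’t: correctness is an accident of task ordering, not a result of delegation.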

What about a query that requires both technical and billing support, such as “My laptop is overheating, and also, I was charged twice last month?” In this case too, the triage agent correctly categorizes the ticket as “Both”, and the technical and billing agents give correct answers to their individual parts, but the manager is unable to combine the responses into a coherent reply to the user’s query. Instead, the final response only reflects the billing answer, since that is the last task to be called in the sequence.

Response to a combined query

Latency and Usage: As can be seen from the image above, the Crew execution took almost 38 secs and spent 15,759 tokens. The final output is only about 200 tokens. The rest of the tokens were spent on all the thinking, agent calling, and intermediate responses, all to produce an unsatisfactory response at the end. The performance can be categorised as “Poor”.
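To put that in perspective, the orchestration overhead can be computed directly from the trace numbers above (the 200-token figure for the final output is approximate):

```python
total_tokens = 15_759       # total tokens spent by the crew run
final_answer_tokens = 200   # approximate size of the final output

# Fraction of tokens that went to planning, delegation and intermediate output.
overhead = (total_tokens - final_answer_tokens) / total_tokens
print(f"{overhead:.1%} of tokens went to orchestration overhead")
```

Roughly 98.7% of the spend produced nothing the user ever sees, and the part they do see is wrong.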

Evaluation of this approach

  • Quality of orchestration: Poor (tasks ran sequentially regardless of the triage result)
  • Quality of final output: Poor (the last task’s answer overwrote the relevant one)
  • Explainability: Good (the Langfuse traces make the faulty flow easy to see)
  • Latency and Usage: Poor (15.7k tokens and ~38 secs for a wrong answer)

But perhaps the above result is due to the fact that we relied on CrewAI’s built-in manager, which didn’t have our custom instructions. Therefore, in our next approach we replace the automated CrewAI manager with our custom Manager agent, which has detailed instructions on what to do in the case of Technical, Billing or Both tickets.

2. Using Custom Manager Agent

Our Customer Support Manager is defined with the following instructions. Note that this requires some experimentation to get working; a generic manager prompt such as the one mentioned in the CrewAI documentation will give the same erroneous results as the built-in manager agent above.

    role="Customer Support Manager",
    goal="Oversee the support team to ensure timely and effective resolution of customer inquiries. Use the tool to categorize the user query first, then decide the next steps. Synthesize responses from different agents if needed to provide a comprehensive answer to the customer.",
    backstory=( """
        You do not try to find an answer to the user ticket {ticket} yourself. 
        You delegate tasks to coworkers based on the following logic:
        Note the category of the ticket first by using the triage agent.
        If the ticket is categorized as 'Both', always assign it first to the Technical Support Specialist, then to the Billing Support Specialist, then print the final combined response. Ensure that the final response answers both the technical and billing issues raised in the ticket based on the responses from both the Technical and Billing Support Specialists.
        ELSE
        If the ticket is categorized as 'Technical', assign it to the Technical Support Specialist, else skip this step.
        Before proceeding further, analyse the ticket category. If it is 'Technical', print the final response. Terminate further actions.
        ELSE
        If the ticket is categorized as 'Billing', assign it to the Billing Support Specialist.
        Finally, compile and present the final response to the customer based on the outputs from the assigned agents.
        """

And in the Crew definition, we use the custom manager instead of the built-in one:

crew_q = Crew(
    agents=[triage_agent, technical_support_agent, billing_support_agent],
    tasks=[categorize_tickets, resolve_technical_issues, resolve_billing_issues],
    # manager_llm = llm_a,
    manager_agent=manager,
    process=Process.hierarchical,
    verbose=verbose,
)

Let’s repeat the test cases.

For the first query, the trace is the following:

Why is my laptop overheating?
Graph of Why is my laptop overheating?

The most important observation is that for this technical query, the flow did not go to the Billing Support Specialist agent. The manager correctly followed the instructions, classified the query as technical, and stopped execution once the Technical Support Specialist had generated its response. From the response preview displayed, it is clear that it is a good response to the user query. Also, the latency is 24 secs and token usage is 10k.
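Compared with the auto-created manager on the same query (roughly 38 secs and 15,759 tokens for a poor answer), the custom manager’s ~10k tokens is a sizeable saving, which we can quantify from the reported figures (the 10k figure is approximate):

```python
auto_manager_tokens = 15_759    # auto-created manager, wrong answer
custom_manager_tokens = 10_000  # custom manager (approximate), good answer

saving = 1 - custom_manager_tokens / auto_manager_tokens
print(f"Token saving: {saving:.1%}")
```

Roughly a third of the token budget is saved, and, more importantly, the answer is now correct.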

For the second query, the trace is as follows:

Response to ‘Why was I charged twice last month?’
Graph of Why was I charged twice last month?

As can be seen, the manager correctly skipped the Technical Support Specialist, even though it came before the Billing agent in the Crew definition. Instead, the response generated is a good-quality one from the Billing Support Specialist only. Latency is 16 secs, with correspondingly lower token usage.

For the third, combined query, the trace shows the Manager executed both the Technical and Billing support agents and provided a combined response.

Response to multi-agent query
The response preview in the figure above doesn’t show the full response, which is reproduced below; it combines the responses from both support agents. Latency is 38 secs and token usage is 20k, which is commensurate with the multi-agent orchestration and the detailed response generated.
Dear Customer,

Thank you for reaching out to us regarding the issues you're experiencing. We sincerely apologize for any inconvenience caused. Below are the detailed solutions to address your concerns:

**1. Laptop Overheating Issue:**
   - **Check for Proper Ventilation**: Ensure your laptop is placed on a hard, flat surface to allow proper airflow. Avoid using it on soft surfaces like beds or couches that can block the vents. Consider using a laptop cooling pad or stand with built-in fans to improve airflow.
   - **Clean the Laptop's Vents and Fans**: Dust and debris can accumulate in the vents and fans, restricting airflow. Power off the laptop, unplug it, and use a can of compressed air to gently blow out dust from the vents. If you are comfortable, you can clean the internal fans and components more thoroughly, or take the laptop to a professional technician for internal cleaning.
   - **Monitor Running Applications and Processes**: Open the Task Manager (Windows: Ctrl + Shift + Esc, macOS: Activity Monitor) and check for processes consuming high CPU or GPU usage. Close unnecessary applications or processes to reduce the load on the system.
   - **Update Drivers and Software**: Update your operating system, drivers (especially graphics drivers), and any other critical software to the latest versions.
   - **Check for Malware or Viruses**: Run a full system scan using a reputable antivirus program to detect and remove any malware.
   - **Adjust Power Settings**: Adjust your power settings to "Balanced" or "Power Saver" mode (Windows: Control Panel > Power Options, macOS: System Preferences > Energy Saver).
   - **Inspect the Laptop's Hardware**: If the laptop is still overheating, there may be an issue with the hardware, such as a failing fan or thermal paste that needs replacement. Consult a professional technician to inspect and replace the thermal paste or faulty hardware components if needed.
   - **Environmental Factors**: Operate the laptop in a cool, well-ventilated environment. Avoid using the laptop in direct sunlight or near heat sources.
   - **Consider Upgrading Components**: If the laptop is older, consider upgrading components such as RAM or switching to an SSD to reduce the strain on the system and help with heat management.
   - **Monitor Temperature Levels**: Install a temperature monitoring tool (e.g., HWMonitor, Core Temp, or Macs Fan Control) to keep track of the CPU and GPU temperatures. This can help identify if the laptop is consistently running at high temperatures and provide insights into when the overheating occurs.

If the issue persists after trying all of the above solutions, please contact the laptop manufacturer's support team or an authorized repair technician for further assistance.

**2. Duplicate Billing Issue:**
   - **Verification of Account Details**: I reviewed your account information to ensure accuracy and confirmed that the account is active and matches the reported issue.  
   - **Billing Records Check**: Upon examining the billing records for last month, I identified two identical charges processed on the same day. This confirms the duplicate billing issue.  
   - **Payment Method Details**: I verified the payment method used for the transactions and confirmed that both charges were processed using the same payment method.  
   - **Resolution**:  
       - Since the duplicate charge was confirmed, I have initiated a refund for the second transaction. The refund will be processed back to the original payment method within 5-7 business days.  
       - Additionally, I have applied a credit to your account as a gesture of goodwill for the inconvenience caused.  
   - **Customer Communication**: You will receive an email or phone call from us to inform you of the resolution, including the refund timeline and the applied credit.  

We sincerely apologize for the inconvenience caused by these issues. Please let us know if you need further assistance or have any additional questions. We're here to help.

And our final, more nuanced query is the fourth one: “My invoice amount is wrong after a system glitch?”

The trace shows that the query was correctly categorised as ‘Both’, requiring responses from the Technical and Billing agents, and a combined response was provided. Latency is 24 secs and tokens used are 12k.

Response to multi-agent query
The full response is included here:
Thank you for bringing this issue to our attention. Based on the technical investigation, here are the steps we're taking to resolve the system glitch that caused the incorrect invoice amount:

1. We've verified the details of the incorrect invoice amount and identified the issue in the system.
2. Our technical team has analyzed the system logs and reviewed the invoice calculation logic to pinpoint the root cause of the error.
3. The necessary fixes have been implemented to address the glitch, and the system has been thoroughly tested to ensure the issue is resolved.
4. A corrected invoice has been generated, and we have verified its accuracy.

Additionally, here's the complete resolution for the billing issue related to the incorrect invoice amount caused by the system glitch:

1. **Corrected Invoice Verification**: The technical team has resolved the system glitch and generated a corrected invoice. The corrected invoice has been personally verified to ensure its accuracy, including cross-checking the amounts, dates, and any applicable charges or credits.

2. **Customer Communication**: We will promptly notify you about the resolution of the issue. This communication includes:
   - An apology for the inconvenience caused by the incorrect invoice.
   - Confirmation that the system glitch has been resolved.
   - Assurance that the corrected invoice has been thoroughly reviewed for accuracy.
   - A copy of the corrected invoice for your records.

3. **Additional Steps Taken**: To prevent similar issues in the future, the technical team has implemented measures to ensure system stability and accuracy in invoice generation.

4. **Account Adjustment (if applicable)**: If the incorrect invoice resulted in any overpayment or underpayment, the necessary adjustments will be made to your account. This includes issuing a refund for any overpayment or providing clear instructions for settling any outstanding balance.

5. **Follow-Up**: We're here to assist you with any further questions or concerns regarding your account or billing. Please don't hesitate to reach out to us, and we will be happy to help. For your convenience, we have provided direct contact information for further communication.

We sincerely apologize for any inconvenience this may have caused and assure you that we're taking steps to prevent similar issues in the future. Thank you for your understanding and patience.

Evaluation of this approach

  • Quality of orchestration: Good
  • Quality of final output: Good
  • Explainability: Good (we understand why it did what it did)
  • Latency and Usage: Fair (commensurate with the complexity of the output)

Takeaway

In summary, the hierarchical Manager-Worker pattern in CrewAI doesn’t function as documented. The core orchestration logic is weak; instead of allowing the manager to selectively delegate tasks, CrewAI executes all tasks sequentially, causing incorrect agent invocation, overwritten outputs, and inflated latency and token usage. Why it fails comes down to the framework’s internal routing: hierarchical mode doesn’t implement conditional branching or true delegation, so the final response is effectively determined by whichever task happens to run last. The fix is introducing a custom manager agent with explicit, step-wise instructions: it uses the triage result, conditionally calls only the required agents, synthesizes their outputs, and terminates execution at the right point, restoring correct routing, improving output quality, and significantly optimising token costs.

Conclusion

CrewAI, in the spirit of keeping the LLM at the center of orchestration, depends on it for most of the heavy lifting, using user prompts combined with detailed scaffolding prompts embedded in the framework. Unlike LangGraph and AutoGen, this approach sacrifices determinism for developer-friendliness, and it sometimes results in unexpected behavior for critical features such as the manager-worker pattern, crucial for many real-life use cases. This article demonstrates a path to achieving the desired orchestration for this pattern using careful prompting. In future articles, I intend to explore more features of CrewAI, LangGraph and others for their applicability in practical use cases.

You can use CrewAI to design an interactive conversational assistant over a document store and further make the responses truly multimodal. Refer to my articles on GraphRAG Design and Multimodal RAG.

All images in this article were drawn by me or generated using Copilot or Langfuse. The code shared was written by me.
