Beyond Prompting: The Power of Context Engineering

Context is everything an LLM can see before it generates a solution. This includes the prompt itself, instructions, examples, retrieved documents, tool outputs, and even the prior conversation history.

Context has a huge effect on answer quality. For instance, if you ask an LLM to write a SQL query without providing the information schema, the result will almost certainly be suboptimal. Worse, if the model has no access to the database at all, it might simply hallucinate a query that doesn’t work. Even when tools are available, the model still needs extra time and effort to infer the schema before it can produce an accurate answer.

Because context plays such a central role in LLM-based applications, context engineering has emerged as a discipline focused on systematically optimising what information goes into a model’s prompt. The goal is to build “self-improving” systems that learn from experience without relying on expensive fine-tuning (retraining models and updating millions of parameters).

Context engineering comes with several key benefits:

  • it’s cheaper and doesn’t require specialised fine-tuning expertise;
  • context and instructions remain transparent, interpretable, and easy for humans to modify;
  • iteration cycles are much faster, since updates can be made immediately without retraining or redeploying models;
  • it’s more agile, especially when information must be forgotten for privacy or legal reasons.

With all these benefits, it’s not surprising that context engineering is getting so much attention. What’s interesting, though, is how quickly the approaches themselves are evolving. In this article, I’ll walk through that evolution and then experiment with one of the newer frameworks for prompt optimisation: Agentic Context Engineering (ACE).

Evolution of context engineering approaches 

Context engineering didn’t appear overnight. It has evolved through several distinct stages.

The earliest stage was static prompting. Here, prompts were hand-crafted instructions that never changed. Most of the effort went into classic prompt engineering: carefully selecting wording, structure, and formatting to squeeze better performance out of the model.

The next major step was dynamic retrieval. Instead of relying on a fixed prompt, systems began pulling in relevant information (documents, examples, or facts) at inference time. Retrieval-Augmented Generation (RAG) became one of the most popular approaches in this category. By grounding responses in external data, RAG significantly improved accuracy and reduced hallucinations, especially for knowledge-heavy tasks.

More recently, the focus has shifted toward self-improving contexts. Rather than treating context as something that’s merely retrieved or injected, these approaches allow the system to update and refine its own context based on past performance. In other words, the prompt itself becomes adaptive, evolving through reflection and feedback.

A number of frameworks have emerged around this concept. Below are some of the most influential ones.

  • One of the earliest and most important works is “Reflexion: Language Agents with Verbal Reinforcement Learning” by Shinn et al. This research introduced the idea that language agents can learn from mistakes through natural language reflection rather than gradient-based updates. Reflexion agents analyse feedback from previous attempts, generate verbal reflections about what went wrong, and store these reflections in an episodic memory buffer. These stored reflections then guide better decision-making in subsequent trials.
  • Another important contribution is “TextGrad: Automatic Differentiation via Text” by Yuksekgonul et al. TextGrad borrows concepts from deep learning optimisation (such as gradients, backpropagation, and gradient descent) but replaces numerical derivatives with natural language feedback. In this framework, LLMs generate textual critiques describing how a variable should change to improve the outcome. These “textual gradients” are then propagated backwards through the system using prompting, effectively performing a natural-language version of backpropagation across a compound AI system.
  • The paper “GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning” by Agrawal et al. takes a different angle by combining evolutionary algorithms with language-based reflection. Prompts are treated like organisms: they mutate, compete, and evolve under selection pressure. Over time, better-performing prompts survive and propagate. This approach is implemented in DSPy, and Hugging Face provides a practical guide for applying it in real-world use cases.
  • Finally, “Dynamic Cheatsheet: Test-Time Learning with Adaptive Memory” by Suzgun et al. explores test-time learning through persistent memory. In this setup, a black-box LLM is given a notebook where it can write down useful strategies, patterns, and code snippets during inference. Instead of repeatedly rediscovering the same insights, the model accumulates and reuses knowledge across tasks. This adaptive memory significantly improves performance without requiring explicit labels or human feedback.

Agentic Context Engineering

Now that we’ve covered how context engineering has evolved, let’s take a closer look at Agentic Context Engineering (ACE), one of the newer approaches and the main focus of this article. ACE is introduced in the paper “Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models” by Zhang et al., published in 2025.

The paper starts by identifying two key problems with existing self-improving context methods.

  • Brevity bias is the tendency of systems to oversimplify important details and gradually collapse toward short, generic prompts. While compact prompts are attractive, they often lose the nuances that actually drive good performance.
  • Context collapse is what happens when systems repeatedly rewrite the entire prompt: they tend to forget useful knowledge gathered earlier. Over time, this leads to instability and regressions rather than steady improvement.

To address these issues, the authors propose Agentic Context Engineering (ACE), a framework designed for scalable and efficient context adaptation in both offline settings (such as system prompt optimisation) and online scenarios (like test-time memory adaptation). Instead of compressing knowledge into a single static prompt, ACE allows the model to continuously evolve its context by accumulating successful strategies, reflecting on failures, and organising knowledge in a structured way. Conceptually, it resembles an AI assistant that improves over time by keeping detailed notes and refining its own playbook.

At the core of ACE is an agentic learning loop that mirrors how humans learn through experimentation: try, reflect, and consolidate. The framework consists of three components:

  • Generator, which produces reasoning trajectories while solving tasks;
  • Reflector, which analyses successes and failures and distils actionable insights;
  • Curator, which integrates those insights into the shared context as small, incremental updates.

Rather than maintaining a single monolithic prompt, ACE organises context as a playbook made up of structured bullet points. Each bullet contains metadata (such as a unique identifier and counters tracking how often it has been helpful or harmful) as well as content representing a small, reusable unit of knowledge. This can be a general strategy, a domain-specific concept, or a common failure mode.
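
For example, a single bullet from the playbook we will build later in this article looks roughly like this (the identifier prefix encodes the section, and the counters are updated as the bullet proves helpful or harmful):

[dis-00011] helpful=11 harmful=0 :: pending_* vs failed_* vs declined_*: match the category to the transaction state described in the query.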

Figure from the paper Zhang et al., 2025 | source

The ACE workflow consists of several phases.

  1. Generation phase. The Generator tackles new problems using the current playbook, marking which bullets were helpful or misleading.
  2. Reflection phase. The Reflector analyses the full trajectory, extracting lessons from both successes and failures through iterative refinement.
  3. Curation phase. The Curator turns these insights into compact “delta” updates: new or modified bullets that are merged into the existing playbook using lightweight, non-LLM logic.
  4. Grow-and-refine phase. New bullets are appended, existing ones are updated in place, and periodic deduplication removes redundancy using semantic embeddings.

This design enables parallel processing of multiple updates and supports multi-epoch adaptation, where the same queries can be revisited to progressively strengthen the context over time.
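
To make the delta-update idea more concrete, here is a minimal Python sketch of how a playbook could be stored and how incremental deltas might be merged into it. The Bullet class, the delta format, and apply_delta are my own illustration of the concept, not the actual ACE implementation.

from dataclasses import dataclass

@dataclass
class Bullet:
  bullet_id: str
  content: str
  helpful: int = 0
  harmful: int = 0

def apply_delta(playbook, delta):
  """Merge one incremental update into the playbook without rewriting existing bullets."""
  section = playbook.setdefault(delta['section'], [])
  existing = {b.bullet_id: b for b in section}
  if delta['op'] == 'add':
    section.append(Bullet(delta['id'], delta['content']))
  elif delta['op'] == 'update' and delta['id'] in existing:
    existing[delta['id']].content = delta['content']
  elif delta['op'] == 'tag' and delta['id'] in existing:
    existing[delta['id']].helpful += delta.get('helpful', 0)
    existing[delta['id']].harmful += delta.get('harmful', 0)

# the playbook maps section names to lists of bullets
playbook = {'CLASSIFICATION PRINCIPLES': []}

# the Curator proposes a new bullet, then later marks it as helpful
apply_delta(playbook, {'op': 'add', 'section': 'CLASSIFICATION PRINCIPLES',
                       'id': 'cls-00001', 'content': 'Prefer the most specific matching category.'})
apply_delta(playbook, {'op': 'tag', 'section': 'CLASSIFICATION PRINCIPLES',
                       'id': 'cls-00001', 'helpful': 1})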

Empirically, ACE delivers strong results. On benchmark evaluations, it outperforms other self-improving context approaches, achieving a +10.6% improvement on AI agent tasks and a +8.6% gain in specialised domains such as finance.

Figure from the paper Zhang et al., 2025 | source

Beyond accuracy, ACE is also more cost-efficient thanks to its incremental update mechanism, showing 83.6% lower token costs compared to baseline methods.

Together, these results position ACE as a practical and scalable step forward in building self-improving LLM systems.

Using ACE for banking intent data

The ACE framework looks promising on paper, so the next step is to see how it performs in practice. Fortunately, the authors have shared an open-source implementation on GitHub, which gives us a solid starting point.

Loading the data

To keep the experiment focused, I decided to apply ACE to a classification task. I’m using a publicly available dataset of banking intents released by PolyAI. This dataset reflects a very common real-world problem: identifying customer intent when someone contacts customer support. Accurate intent classification is critical for routing requests to the right team, triggering semi-automated responses, or simply monitoring recurring issues.

In this dataset, each customer message must be mapped to a specific banking intent, such as declined_card_payment. In total, there are 77 distinct intent categories.

To keep the experiment manageable, I sampled 500 examples from the dataset and split them into training, test, and validation sets. Below is the code used to load the data and create the splits.

import random

import pandas as pd

full_df = pd.read_csv('./poly_ai_banking_data/train.csv')

# params
total_number_of_samples = 500
train_share = 0.5
test_share = 0.4
val_share = 0.1

sample_df = full_df.sample(n=total_number_of_samples, random_state=42) \
  .reset_index(drop=True)

random.seed(42)
sample_df['group'] = random.choices(['train', 'test', 'val'],
  weights=(train_share, test_share, val_share), k=total_number_of_samples)

train_df = sample_df[sample_df['group'] == 'train'].reset_index(drop=True)
test_df = sample_df[sample_df['group'] == 'test'].reset_index(drop=True)
val_df = sample_df[sample_df['group'] == 'val'].reset_index(drop=True)

Extending ACE to banking intent data

The next step is to extend the ACE framework so it can work with our banking intent dataset. Fortunately, the authors provide a detailed guide that makes this process relatively straightforward.

Preparing the data

The first thing we need to do is prepare the dataset in a format that ACE expects. I saved the training, validation, and test splits as CSV files under banking/data. Each example contains:

  • text: the customer support message,
  • category: the goal intent label we wish to predict,
  • group: an auxiliary field indicating whether the instance belongs to the train, test, or validation set.

The group field won’t be used later by the framework itself, but it’s convenient for dataset management and reproducibility.

Here’s what the data format looks like.

text,category,group
Is it possible for me to change my PIN number?,change_pin,test
What's the $1 transaction on my account?,extra_charge_on_statement,test
How much does top up fees cost?,top_up_by_card_charge,test
I live in the EU - can I get a card?,country_support,test

Next, we need to tell ACE where to find each split. This is done by specifying dataset paths in banking/data/task_config.json.

{
  "banking": {
    "train_data": "./banking/data/train.csv",
    "val_data": "./banking/data/val.csv",
    "test_data": "./banking/data/test.csv"
  }
}

Implementing the DataProcessor

To integrate a new task, the framework requires a custom DataProcessor module. According to the guide, this involves implementing three core methods: process_task_data, answer_is_correct, and evaluate_accuracy.

In addition, we need a helper function to load the raw data from disk. Let’s start with that.

Below is the implementation of the data-loading function. It reads a CSV file, validates its existence, and converts each row into a dictionary that the rest of the pipeline can work with.

import csv
import os
from typing import Any, Dict, List

def load_data(data_path: str) -> List[Dict[str, Any]]:
  """
  Load and process data from a CSV file.
  
  Expected CSV format: text,category,group (with header)
  
  Args:
    data_path: Path to the CSV file
      
  Returns:
    List of dictionaries containing the information
  """
  if not os.path.exists(data_path):
    raise FileNotFoundError(f"Data file not found: {data_path}")
  
  data = []
  with open(data_path, 'r', encoding='utf-8') as f:
    reader = csv.DictReader(f)
    for row in reader:
      data.append({
        'text': row['text'],
        'category': row['category'],
        'group': row.get('group', '')
      })
  
  print(f"Loaded {len(data)} samples from {data_path}")
  return data

With the data-loading function in place, we can move on to implementing the remaining DataProcessor methods.

The main purpose of process_task_data is to convert the raw dataset into ACE’s standardised input format.

ACE expects each example to contain three fields: context, query, and target. In our case, the mapping is fairly straightforward. We map the intent category directly to target, and we leave context empty since there’s no additional background information needed for classification.

The most important part here is the query. We add extra instructions to make it clear to the LLM that it should classify the message rather than answer it directly, while also providing the list of available topics to guide the LLM’s response.

def process_task_data(self, raw_data: List[Dict]) -> List[Dict]:
  """
  Convert raw CSV data into standardized format for ACE.

  Args:
    raw_data: Raw data loaded from CSV (list of dicts with 'text', 'category')

  Returns:
    List of dicts with keys: 'context', 'query', 'target'
  """
  processed_data = []

  # Gather the list of topics to include in the query
  topics_list = ", ".join(self.allowed_topics)

  for item in raw_data:
    customer_query = item.get('text', '')
    ground_truth_topic = item.get('category', '')

    # The query provides the classification task instruction
    query = (
      f"Classify the following banking customer support query into one of the predefined topics.\n\n"
      f"Customer Query: {customer_query}\n\n"
      f"Available Topics: {topics_list}\n\n"
      f"Respond with ONLY the topic name, nothing else."
    )

    processed_item = {
      "context": "",  # No additional context needed
      "query": query,
      "target": ground_truth_topic,
      "others": {
        "original_text": customer_query,
        "task": self.task_name,
      }
    }

    processed_data.append(processed_item)

  return processed_data

The next method, answer_is_correct, checks whether a model’s prediction matches the ground truth label. Since we explicitly instruct the LLM to respond with only the category name, a simple case-insensitive string comparison is sufficient here.

def answer_is_correct(self, predicted: str, ground_truth: str) -> bool:
  """
  Check if the predicted topic matches the ground truth.
  Uses simple case-insensitive comparison.
  
  Args:
    predicted: Model's predicted topic
    ground_truth: Ground truth topic
      
  Returns:
    bool: True if prediction is correct, False otherwise
  """
  return predicted.lower().strip() == ground_truth.lower().strip()

The final method we need to implement is evaluate_accuracy, which computes overall classification accuracy across multiple predictions. There’s nothing fancy going on here. We simply calculate the fraction of cases where answer_is_correct(prediction, ground_truth) returns True.

def evaluate_accuracy(self, predictions: List[str], ground_truths: List[str]) -> float:
  """
  Calculate classification accuracy across multiple predictions.
  
  Args:
    predictions: List of model predictions
    ground_truths: List of ground truth topics
      
  Returns:
    Accuracy as a float between 0 and 1
  """
  if len(predictions) != len(ground_truths):
    raise ValueError("Predictions and ground truths should have same length")
  
  if not predictions:
    return 0.0
  
  correct = sum(
    1 for pred, truth in zip(predictions, ground_truths)
    if self.answer_is_correct(pred, truth)
  )
  
  return correct / len(predictions)

Putting together the workflow script

With the DataProcessor in place, the next step is to assemble a complete run script for ACE. I created a run_ace_workflow script that accepts several key arguments:

  • api_provider selects the language model API to use ('anthropic', 'openai', 'together', or 'sambanova'), defaulting to 'anthropic'.
  • generator_model specifies the model for the Generator agent (default: 'claude-haiku-4-5').
  • reflector_model specifies the model for the Reflector agent (default: 'claude-sonnet-4-5').
  • curator_model specifies the model for the Curator agent (default: 'claude-sonnet-4-5').
  • max_train and max_test are optional limits on the train and test set sizes, useful for quick experiments or debugging.
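
For reference, here is a minimal sketch of how these arguments could be wired up with argparse. The defaults mirror the list above, but the exact flag handling in my script may differ slightly.

import argparse

parser = argparse.ArgumentParser(description='Run the ACE workflow on the banking intent task')
parser.add_argument('--api_provider', default='anthropic',
  choices=['anthropic', 'openai', 'together', 'sambanova'])
parser.add_argument('--generator_model', default='claude-haiku-4-5')
parser.add_argument('--reflector_model', default='claude-sonnet-4-5')
parser.add_argument('--curator_model', default='claude-sonnet-4-5')
parser.add_argument('--max_train', type=int, default=None, help='optional cap on train set size')
parser.add_argument('--max_test', type=int, default=None, help='optional cap on test set size')
args = parser.parse_args()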

Let’s discuss how this script actually works. The script begins by loading the banking intent data and initialising the DataProcessor. Here’s the helper function I wrote for this.

def load_banking_data(max_train=None, max_test=None):
  """Load and process banking dataset."""
  from banking.data_processor import DataProcessor, load_data
  
  base_path = os.path.dirname(__file__)
  data_path = os.path.join(base_path, "data")
  
  # Load raw data
  train_raw = load_data(os.path.join(data_path, "train.csv"))
  val_raw = load_data(os.path.join(data_path, "val.csv"))
  test_raw = load_data(os.path.join(data_path, "test.csv"))
  
  # Limit samples if specified
  if max_train:
    train_raw = train_raw[:max_train]
    val_raw = val_raw[:max(max_train // 4, 10)]
  if max_test:
    test_raw = test_raw[:max_test]
  
  # Process data
  processor = DataProcessor(task_name="banking")
  train_samples = processor.process_task_data(train_raw)
  val_samples = processor.process_task_data(val_raw)
  test_samples = processor.process_task_data(test_raw)
  
  return train_samples, val_samples, test_samples, processor

train_samples, val_samples, test_samples, processor = load_banking_data(
  max_train=args.max_train,
  max_test=args.max_test
)

The next step is to define a playbook template. This matters because the current ACE implementation can’t dynamically create new sections, so we predefine the structure to guide the model. Here’s the template I used for the banking domain.

BANKING_PLAYBOOK_TEMPLATE = """
## GENERAL
## CLASSIFICATION PRINCIPLES
## CATEGORY DISAMBIGUATION
## BANKING DOMAIN KNOWLEDGE
## COMMON PATTERNS
## HANDLING AMBIGUOUS QUERIES
## COMMON MISTAKES TO AVOID
## OTHERS
"""

With the data and template ready, we can initialise the ACE object with the main parameters.

ace_system = ACE(
  api_provider=args.api_provider,
  generator_model=args.generator_model,
  reflector_model=args.reflector_model,
  curator_model=args.curator_model,
  max_tokens=4096,
  initial_playbook=BANKING_PLAYBOOK_TEMPLATE,
  use_bulletpoint_analyzer=True, # enabling deduplication of bullet points in the playbook
  generator_temperature=0.1, # prioritising consistency for generator
  reflector_temperature=0.7, # prioritising creativity for reflector and curator
  curator_temperature=0.7,
)

Finally, we define a function to run the ACE training workflow, which includes initial evaluation, iterative reflection, curation, and final evaluation.

def run_ace_training(ace_system, train_samples, val_samples, test_samples, processor, results_dir):
  """Train ACE to enhance the playbook (includes initial and final evaluations)."""
  config = {
    'num_epochs': 1,
    'max_num_rounds': 3,  # max reflection rounds per sample
    'curator_frequency': 5,  # run curator every 5 steps
    'eval_steps': max(len(train_samples) // 10, 10),  # evaluate 10 times during training
    'save_steps': max(len(train_samples) // 10, 10),
    'playbook_token_budget': 80000,
    'task_name': 'banking_ace',
    'json_mode': False,
    'no_ground_truth': False,
    'save_dir': os.path.join(results_dir, "training"),
    'test_workers': 10,
  }
  
  results = ace_system.run(
    mode='offline',
    train_samples=train_samples,
    val_samples=val_samples,
    test_samples=test_samples,
    data_processor=processor,
    config=config
  )
  
  # Extract results
  initial_acc = results.get('initial_test_results', {}).get('accuracy', 0)
  final_acc = results.get('final_test_results', {}).get('accuracy', 0)
  training_results = results.get('training_results', {})
  
  return ace_system.best_playbook, results

best_playbook, training_results = run_ace_training(
  ace_system, train_samples, val_samples, test_samples, 
  processor, results_dir
)

And that’s it! That’s all the core logic we need to run ACE. I’ve added some logging on top of the workflow for convenience, but it’s not essential to the main functionality.

Results

Let’s take a look at the results and see how everything comes together. First, look at the best playbook, which you can find at results/banking_{dt}/best_playbook.txt. The playbook is organised into itemised bullets, grouped based on the categories we defined in our initial template. Each bullet contains detailed instructions and strategies, together with metadata showing how often it was marked helpful or harmful. This structure makes it easy to see which topics and strategies the system found most useful during training.

## GENERAL
## CLASSIFICATION PRINCIPLES
[cls-00001] helpful=1 harmful=0 :: Temporal indicators like 'was able to before', 'worked previously', or 'used to work' are strong signals that the issue is specific to the current transaction rather than a general system capability problem. These phrases suggest a change in status for a specific entity (beneficiary, card, account) rather than overall functionality.
[cls-00002] helpful=18 harmful=4 :: Apply specificity hierarchy: when multiple categories could apply, choose the most specific one that matches the contextual clues. For example, beneficiary_not_allowed (specific to recipient) is more specific than declined_transfer (general failure).
[cls-00009] helpful=0 harmful=3 :: Specificity hierarchy works bidirectionally: choose specific categories when contextual clues point to a particular transaction type, but use general categories (like 'extra_charge_on_statement') when the query lacks sufficient context to determine the exact nature of the transaction. Don't force specificity when the customer's query is inherently general.
[cls-00017] helpful=5 harmful=1 :: Process-oriented vs Status-tracking distinction: Differentiate between questions about HOW to obtain/acquire something (process-oriented) versus questions about WHEN something will arrive or WHETHER it has arrived (status-tracking). Process questions focus on the steps and components needed, while status questions focus on timing and delivery confirmation. Use this distinction to choose between acquisition categories and tracking/arrival categories.
## CATEGORY DISAMBIGUATION
[dis-00003] helpful=1 harmful=0 :: declined_transfer vs beneficiary_not_allowed: If the customer mentions they could transfer before but suddenly cannot, this strongly indicates beneficiary_not_allowed (recipient is blocked/restricted) rather than declined_transfer (general transfer failure due to funds, limits, or system errors).
[dis-00011] helpful=11 harmful=0 :: pending_* vs failed_* vs declined_*: Transaction state is critical for classification. 'Hasn't gone through yet' or 'taking too long' = pending state. 'Didn't work', 'was declined', or 'was rejected' = failed/declined state. 'Money came back' or 'was returned' = reverted state. Match the category to the actual transaction state described.
[dis-00012] helpful=0 harmful=1 :: country_support vs supported_cards_and_currencies: Queries about geographic availability ('which countries', 'where can I', 'what regions') should be classified as 'country_support'. In contrast, 'supported_cards_and_currencies' is for questions about card types (Visa, Mastercard) and currency options, not geographic availability.
[dis-00014] helpful=2 harmful=0 :: Cash withdrawal issues: Distinguish by transaction state and outcome: 'pending_cash_withdrawal' (not completed yet, still processing), 'declined_cash_withdrawal' (rejected, no cash received), 'cash_withdrawal_not_recognised' (customer doesn't recall the transaction), and 'wrong_amount_of_cash_received' (transaction completed but incorrect amount dispensed). If cash was received but the amount was wrong, use the most specific category: wrong_amount_of_cash_received.
[dis-00015] helpful=3 harmful=3 :: card_arrival vs get_physical_card: Distinguish between status-tracking questions (card_arrival) and process-acquisition questions (get_physical_card). 'card_arrival' is for tracking existing orders ('Has my card arrived?', 'Where is my card?'). 'get_physical_card' encompasses the entire process of obtaining a physical card including all components like PIN ('Where can I find my PIN?', 'How do I get my card and PIN?'). Questions about missing PINs with 'haven't gotten it yet' indicate the customer is in the acquisition process, not just tracking delivery.
[dis-00021] helpful=1 harmful=0 :: card_payment_not_recognised vs extra_charge_on_statement: When a customer mentions a 'payment' they don't recognize or didn't make ('payment I never submitted', 'payment I didn't authorize'), classify as 'card_payment_not_recognised' because 'payment' is a specific transaction type. Use 'extra_charge_on_statement' only when the customer describes unexpected amounts, fees, or charges WITHOUT specifying the transaction type (e.g., 'I see an extra $5 on my statement', 'there's an odd charge' without mentioning payment/transfer/withdrawal).
[dis-00024] helpful=0 harmful=1 :: Fee/charge category specificity: When customers ask about fees or charges, prioritize transaction-type-specific fee categories over 'extra_charge_on_statement'. If the query mentions a specific transaction type (transfer, payment, withdrawal, top-up), use the corresponding specific fee category: 'transfer_fee_charged' for transfer fees, 'card_payment_fee_charged' for payment fees, 'atm_fee_charged' for withdrawal fees, 'top_up_fee' for top-up fees. Reserve 'extra_charge_on_statement' only for fee queries where no specific transaction type is mentioned (e.g., 'Why is there an extra $5 charge?' without context).
[dis-00026] helpful=0 harmful=0 :: receiving_money vs transfer_into_account: Distinguish between passive receipt and active transfer. 'receiving_money' is for queries about receiving funds FROM another party (passive, initiated by sender). 'transfer_into_account' is for queries about the customer initiating a transfer TO add funds to their own account (active, self-initiated). Context clues: empty/low balance + asking about transfers = likely transfer_into_account. Questions about 'can I transfer funds' in the context of needing to add money = transfer_into_account, not receiving_money.
[dis-00029] helpful=0 harmful=0 :: beneficiary_not_allowed vs declined_transfer: When a query explicitly mentions 'beneficiary' or 'recipient' combined with restriction language ('not allowed', 'blocked', 'restricted', 'cannot add', 'unable to add'), classify as 'beneficiary_not_allowed' even without temporal indicators. The combination of the specific banking entity term (beneficiary/recipient) with restriction language is a strong direct signal for recipient-level restrictions rather than general transfer failures.
## BANKING DOMAIN KNOWLEDGE
[bank-00006] helpful=0 harmful=0 :: In banking, when a previously successful transfer suddenly fails, common causes include: beneficiary being flagged/blocked by fraud systems, beneficiary account restrictions, or beneficiary being removed from the allowed list. These are distinct from general transfer declines due to insufficient funds or system errors.
[bank-00008] helpful=0 harmful=6 :: Small unexpected amounts (like £1, £0.01) appearing on statements often indicate authorization holds, verification charges, or miscellaneous fees. When customers query these without additional context, they should be classified as 'extra_charge_on_statement' rather than more specific transaction types.
[bank-00018] helpful=0 harmful=0 :: 'card_swallowed' is the banking industry term for ATM card retention scenarios where the machine keeps/retains the customer's card. This applies when cards are stuck, won't come out, or are held by the ATM, regardless of the exact phrasing used by the customer.
[bank-00020] helpful=10 harmful=4 :: Banking terminology has a specificity hierarchy for transaction references. Specific transaction type keywords include: 'payment' (card payments), 'transfer' (money transfers), 'withdrawal' (cash withdrawals), 'top-up' (account funding), 'direct debit', 'standing order'. Generic terms include: 'charge', 'amount', 'transaction', 'fee'. When a customer uses a specific transaction type keyword, it provides sufficient context to classify into transaction-type-specific categories rather than general categories.
## COMMON PATTERNS
[pat-00004] helpful=0 harmful=0 :: Pattern: 'It worked before, now it doesn't' + transfer context = likely beneficiary-level restriction rather than system-level decline. The previous success indicates the account and transfer mechanism are functional, pointing to a specific restriction on the current recipient.
[pat-00007] helpful=3 harmful=6 :: Pattern: Customer describes transaction as 'strange', 'unexpected', 'unexplained', or asks 'what is this charge' on their statement without providing specific transaction type context (transfer, payment, withdrawal, etc.) = classify as 'extra_charge_on_statement'. This is the appropriate general category when the nature of the charge is unclear.
[pat-00010] helpful=8 harmful=1 :: Pattern: Phrases like 'hasn't gone through yet', 'still waiting', 'not completed', or 'still pending' indicate a transaction in PENDING state, not a FAILED state. Choose 'pending_*' categories over 'failed_*' or 'declined_*' categories when these language cues are present.
[pat-00013] helpful=0 harmful=2 :: Pattern: Questions with geographic scope indicators like 'which countries', 'where can I', 'what regions', or 'in what locations' are asking about service availability by geography = classify as 'country_support'. The core intent is understanding the geographic reach of services.
[pat-00016] helpful=2 harmful=9 :: Pattern: 'Where can I find' or 'How do I get' phrasing indicates process-oriented questions seeking information about obtaining or acquiring something, not status-tracking questions. These should typically map to acquisition/setup categories (like 'get_physical_card') rather than delivery/tracking categories (like 'card_arrival' or 'card_delivery_estimate').
[pat-00019] helpful=0 harmful=0 :: Pattern: Phrases indicating a card is physically retained by an ATM ('card stuck in ATM', 'card won't come out', 'ATM kept my card', 'get my card out of ATM', 'retrieve card from machine') should be classified as 'card_swallowed'. The key indicator is the card being physically held/retained by the machine rather than other card issues like damage, loss, or functionality problems.
[pat-00022] helpful=1 harmful=0 :: Pattern: Specific transaction type keyword + 'not recognized'/'didn't make'/'never submitted' = use transaction-type-specific 'not_recognised' category. Examples: 'payment I didn't make' → card_payment_not_recognised; 'transfer I don't recognize' → transfer_not_received_by_recipient or related transfer issue; 'withdrawal I never made' → cash_withdrawal_not_recognised. The presence of a specific transaction type keyword (payment, transfer, withdrawal) is sufficient context to avoid general categories.
[pat-00025] helpful=1 harmful=0 :: Pattern: Transaction type keyword + timing question ('how long', 'when will', 'how much time') + geographic mention = prioritize the transaction-specific timing category (e.g., 'transfer_timing', 'card_delivery_estimate'). Treat geographic mentions as contextual information about the transaction origin/destination unless the query explicitly asks about service availability ('which countries', 'where can I use', 'is it available in'). Example: 'transfer from China, how long?' → 'transfer_timing' (not 'country_support').
[pat-00027] helpful=0 harmful=0 :: Pattern: Account balance context + transfer inquiry = intent to add funds. When a customer mentions their account is empty/has no funds/needs money AND asks about transferring, they are asking about moving funds INTO their account (transfer_into_account), not about receiving money from others (receiving_money). The account state provides critical context for disambiguating transfer-related intents.
## HANDLING AMBIGUOUS QUERIES
## COMMON MISTAKES TO AVOID
[err-00005] helpful=2 harmful=0 :: Don't default to general categories (like declined_transfer) when temporal context ('was able to before') suggests a more specific issue. The temporal change is a key discriminator that often points to entity-specific restrictions (beneficiary, card, account) rather than general failures.
[err-00023] helpful=2 harmful=0 :: Don't default to 'extra_charge_on_statement' when the customer mentions a specific transaction type (payment, transfer, withdrawal, top-up) they don't recognize. 'extra_charge_on_statement' should be reserved for truly ambiguous cases where no transaction type is specified. When a customer says 'payment I never made', the word 'payment' provides sufficient context to use 'card_payment_not_recognised' instead of the generic 'extra_charge_on_statement'.
[err-00028] helpful=0 harmful=0 :: Don't apply pattern rules or domain knowledge that are irrelevant to the query. If a query has no geographic indicators, don't apply geographic patterns. If there's no mention of fees, don't apply fee-related rules. Focus on rules that directly match the semantic content and context of the customer's query rather than grasping for any applicable rule. Irrelevant rule application leads to misclassification.
## OTHERS

For a deeper look at how each agent operates, you can explore the detailed execution logs at results/banking_{dt}/training/ace_run_{dt}/detailed_llm_logs. I highly recommend browsing these logs. At the very least, skim through the prompts and see how the Generator, Reflector, and Curator interact. It’s a great way to understand how ACE evolves the context step by step.

Of course, the most interesting metric is accuracy. You can find the initial and final test results in results/banking_{datetime}/training/initial_test_results.json and results/banking_{datetime}/training/final_test_results.json.

# initial results 
{
  "test_results": {
    "accuracy": 0.7512437810945274,
    "correct": 151,
    "total": 201,
    "no_answer": 0
  },
  "error_log": {
    "accuracy": 0.7512437810945274,
    "errors": [
      {
        "index": 2,
        "prediction": "declined_card_payment",
        "ground_truth": "declined_transfer"
      },
      {
        "index": 9,
        "prediction": "top_up_limits",
        "ground_truth": "automatic_top_up"
      },
      {
        "index": 7,
        "prediction": "transfer_not_received_by_recipient",
        "ground_truth": "balance_not_updated_after_cheque_or_cash_deposit"
      },
      ...
    ]
  }
}

# final results 
{
  "test_results": {
    "accuracy": 0.736318407960199,
    "correct": 148,
    "total": 201,
    "no_answer": 0
  },
  "error_log": {
    "accuracy": 0.736318407960199,
    "errors": [
      {
        "index": 9,
        "prediction": "top_up_limits",
        "ground_truth": "automatic_top_up"
      },
      {
        "index": 2,
        "prediction": "declined_card_payment",
        "ground_truth": "declined_transfer"
      },
      {
        "index": 7,
        "prediction": "pending_transfer",
        "ground_truth": "balance_not_updated_after_cheque_or_cash_deposit"
      },
      ...
    ]
  }
}

The results, admittedly, are not very impressive. In fact, accuracy dropped slightly after optimisation, from 75.1% to 73.6%. But even negative results can teach us something useful.

There are a few likely reasons why ACE didn’t provide much benefit in this case:

  • Limited data per category. We only had 248 training examples, 201 test examples, and 51 validation examples. However, our task involved 77 different categories. With so few examples per class, the model simply may not have had enough data to learn meaningful distinctions.
  • Small and unrepresentative validation set. With only 51 examples, the validation set might not have captured the full diversity of customer queries, making it difficult for ACE to generate useful reflections and improvements.
  • Task complexity. Our use case is relatively straightforward. As the authors note, ACE tends to shine in scenarios with large amounts of highly specialised domain knowledge or more complex agentic workflows, where reflection and iterative context refinement can significantly improve performance.

Using ACE for code generation

Encouraged by the previous experiment, I decided to give ACE another try, this time on the Mostly Basic Python Problems dataset (available under the CC BY 4.0 license). Hopefully, the results will be more promising with a code generation task.

Data overview

Each example in the dataset contains three key components:

  • Query: a natural-language task description. For the example below, it asks for a function that reverses the order of words in a given string.
  • Ground truth implementation: a Python reference solution. For instance, for the query above:

def reverse_words(s):
  return ' '.join(reversed(s.split()))

  • Test cases: assertions used to validate the generated code, such as:

[
  'assert reverse_words("python program")==("program python")',
  'assert reverse_words("java language")==("language java")',
  'assert reverse_words("indian man")==("man indian")'
]

Adding a new task to the ACE framework

We can follow similar steps to extend the ACE framework to handle coding tasks. I won’t go into all the implementation details here, since you can find the full code on GitHub. However, it’s worth highlighting the key differences compared to the banking intent example.

Coding tasks are inherently more complex. In the banking intent case, the model outputs a single class out of 77, which is easy to compare directly with the ground truth. In code generation, however, the LLM can produce arbitrary code, so we cannot simply check for exact matches. Instead, we need to run tests to determine whether the generated solution is correct.

# banking

def answer_is_correct(self, predicted: str, ground_truth: str) -> bool:
  return predicted.lower() == ground_truth.lower()

# coding 
def answer_is_correct(self, predicted: str, ground_truth: str, 
                    test_list: List[str], idx: int, save_dir: str) -> bool:
  code = extract_code_from_response(predicted)
  result = execute_code_with_tests(code, test_list, timeout=5)
  return result['success']

Because of this added complexity, I had to implement several enhancements in the DataProcessor for code generation:

  • Code extraction. LLMs often include extra content around the code, such as Markdown formatting (```python ...```). We need to clean and extract the code to make sure it compiles correctly (see the sketch after this list).
  • Safe execution. Since we run the generated code to verify correctness, it’s important to implement basic safety measures, such as timeouts and isolated execution environments.
  • Providing full context. It’s crucial to include all the necessary information in the query. If we just ask the LLM to generate code, it’s unlikely to pass the tests because it won’t be clear what function name or signature is expected. That’s why we provide all the needed details in the query when standardising the data in the process_task_data function.
query = (
  f"Write a Python function to solve the following problem:\n\n"
  f"Problem: {problem_text}\n\n"
  f"Your code must pass the following test cases:\n"
  f"{test_cases_formatted}\n\n"
  f"Important: The test cases will be executed against your code. "
  f"Make sure your function name and signature match what the tests expect.\n\n"
  f"Respond with ONLY the Python code, no explanations."
)
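
For illustration, here is a minimal sketch of what the code-extraction helper mentioned above could look like. The actual extract_code_from_response in my repository may handle more edge cases; this version simply strips Markdown fences when they are present.

import re

def extract_code_from_response(response: str) -> str:
  """Pull Python code out of an LLM response, stripping Markdown fences if present."""
  # prefer fenced blocks such as ```python ... ```
  match = re.search(r"```(?:python)?\s*(.*?)```", response, re.DOTALL)
  if match:
    return match.group(1).strip()
  # otherwise assume the whole response is already plain code
  return response.strip()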

In the original ACE implementation, the Reflector compared generated code directly with the ground truth, which works for classification tasks. For coding, however, this approach doesn’t make sense: multiple correct solutions can exist, and optimising for code that “looks similar” to the reference doesn’t guarantee it will pass the tests.

To address this, I implemented a new method, get_test_feedback, which provides the Reflector with actual test execution results and error messages. The test output becomes the primary signal for correctness, giving far more informative feedback than simple code comparison.

def get_test_feedback(self, predicted: str, ground_truth: str, test_list: List[str] = None) -> str:
  """
  Get detailed test execution feedback for the reflector.
  
  This method provides the reflector with actual test results and error messages,
  which is more informative than just comparing generated code with the ground truth.
  The test output is the primary signal for correctness in code generation tasks.
  
  Args:
      predicted: Model's predicted code
      ground_truth: Ground truth code (reference only, not used for evaluation)
      test_list: List of test assertions to run
      
  Returns:
      str: Detailed feedback string with test execution results
  """
  if test_list is None:
      return "No test cases provided - cannot evaluate code."
  
  # Extract code from response if needed
  code = extract_code_from_response(predicted)
  
  # Execute code with tests
  result = execute_code_with_tests(code, test_list, timeout=self.timeout)
  
  # Construct detailed feedback
  feedback_parts = []
  
  if result['success']:
    feedback_parts.append(f"✓ All {result['total']} tests PASSED")
    feedback_parts.append("nTest cases executed successfully:")
    for i, test in enumerate(test_list, 1):
        feedback_parts.append(f"  {i}. {test} ✓")
  else:
    feedback_parts.append(f"✗ Tests FAILED: {result['passed']}/{result['total']} tests passed")
    
    if result['timeout']:
      feedback_parts.append("n⏱ TIMEOUT: Code execution exceeded closing date")
    
    if result['errors']:
      feedback_parts.append("n--- ERROR DETAILS ---")
      for error in result['errors']:
        feedback_parts.append(f"  • {error}")
    
    # Show which tests passed vs failed
    feedback_parts.append("n--- TEST RESULTS ---")
    for i, test in enumerate(test_list, 1):
      # Check if this specific test appears in errors
      test_failed = any(f"Test {i}" in err for err in result.get('errors', []))
      status = "✗ FAILED" if test_failed else "✓ passed"
      feedback_parts.append(f"  {i}. {test} - {status}")
  
  # Add extracted code for reference
  feedback_parts.append("n--- EXTRACTED CODE ---")
  feedback_parts.append(code)
  
  return "n".join(feedback_parts)

Alongside this new method, I created a dedicated Reflector prompt tailored for code generation. Its focus is on test results, not line-by-line code comparison.

You are an expert code reviewer and educator. Your job is to analyze why generated code passed or failed test cases, and identify patterns that lead to correct or incorrect solutions.

**IMPORTANT: Test execution results are the PRIMARY signal for correctness.**
- The code is correct if and only if ALL tests pass
- Do NOT compare implementations line-by-line with the reference - different implementations can be equally correct
- Focus on understanding WHY tests passed or failed based on the code's logic

**Instructions:**
- First, examine the Test Execution Results to determine if the code is correct
- If tests FAILED: Analyze what caused the failure (syntax errors, logic errors, edge cases, wrong algorithm)
- If tests PASSED: Identify what the model did well that led to success
- The "Possible Implementation" is just ONE way to solve the problem - the model's approach may be different but equally valid
- Provide actionable insights for improving code generation in the future
- Tag bulletpoints as helpful/harmful/neutral based on whether they contributed to passing tests

Your output should be a JSON object, which contains the following fields:
  - reasoning: analyze the test results and the code's logic, explain why tests passed/failed
  - error_identification: if tests failed, what specific issue caused the failure? If tests passed, state "No errors - all tests passed"
  - root_cause_analysis: what underlying concept or pattern led to success or failure?
  - correct_approach: what coding strategy or pattern should be used for similar problems?
  - key_insight: what principle should be remembered for future code generation tasks?
  - bullet_tags: a list of JSON objects with bullet_id and tag for each bulletpoint

**Query:**
{}

**Model's Reasoning Trace:**
{}

**Model's Generated Code:**
{}

**Possible Implementation (Reference Only - NOT the only correct solution):**
{}

**Test Execution Results (PRIMARY SIGNAL):**
{}

**Part of the Playbook that is used by the generator to answer the query:**
{}

**Answer in this exact JSON format:**
{{
  "reasoning": "[Analyze test results and code logic - why did tests pass or fail?]",
  "error_identification": "[What caused test failures? Or 'No errors - all tests passed']",
  "root_cause_analysis": "[What concept/pattern led to success or failure?]",
  "correct_approach": "[What coding strategy works for this type of problem?]",
  "key_insight": "[What principle should be remembered for future code generation?]",
  "bullet_tags": [
    {{"id": "code-00001", "tag": "helpful"}},
    {{"id": "code-00002", "tag": "harmful"}}
  ]
}}

This coding-specific Reflector is automatically used whenever the task name contains "coding".
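
A minimal sketch of how that routing could look is below; the constant names are placeholders rather than the exact ones used in the repository.

# assumed prompt templates: the default classification-oriented Reflector prompt
# and the test-focused coding prompt shown above
DEFAULT_REFLECTOR_PROMPT = "..."
CODING_REFLECTOR_PROMPT = "..."

def select_reflector_prompt(task_name: str) -> str:
  """Use the coding-specific Reflector prompt for coding tasks, otherwise the default one."""
  return CODING_REFLECTOR_PROMPT if "coding" in task_name.lower() else DEFAULT_REFLECTOR_PROMPT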

Results

Finally, I ran the prompt optimisation process on a dataset of 500 samples, split into train, test, and validation sets. This time, the results are much more promising: accuracy improved significantly, from 71.1% to 87.1%. In this case, ACE clearly helped optimise the prompts and guide the model toward correct solutions.

If you look at the best playbook, it’s quite extensive. Many of the most helpful patterns are general principles, such as:

  • Write the simplest correct, Pythonic solution first,
  • Treat test cases as the true specification,
  • Verify correctness before any further optimisation.

At the same time, the playbook also includes very specific guidance, for example, detailed instructions for tasks like GCD calculations.

Overall, this shows that ACE can effectively capture both high-level strategies and task-specific tips.

## GENERAL
## COMMON MISTAKES TO AVOID
[err-00003] helpful=5 harmful=0 :: Don't add unnecessary complexity to recursive algorithms. For example, in GCD implementations, explicit min/max logic or special cases for checking if a value equals 1 are redundant when using the standard Euclidean algorithm.
[err-00007] helpful=0 harmful=0 :: Don't assume problem constraints match your algorithm's mathematical prerequisites. For example, Fermat's Little Theorem for modular inverse requires a PRIME modulus - verify the problem guarantees this before using pow(a, p-2, p). If constraints aren't specified, choose more general algorithms.
## OTHERS
## CODE GENERATION PRINCIPLES
[cgp-00002] helpful=41 harmful=2 :: Prefer minimal, mathematically sound implementations over complex ones. Avoid adding unnecessary preprocessing logic (like min/max) or special case checks when the core algorithm naturally handles all scenarios.
[cgp-00012] helpful=91 harmful=2 :: Always ensure generated code is syntactically complete before finalizing output. Verify all opened brackets, braces, and parentheses are properly closed, and all statements are fully formed. Incomplete code generation (truncation mid-statement) causes syntax errors that prevent execution regardless of algorithmic correctness.
[cgp-00020] helpful=6 harmful=0 :: When a problem explicitly requires using lambda functions, integrate them naturally with Python's functional programming tools (map, filter, reduce, sorted with key parameter). Don't force lambda usage where it's awkward - these built-in functions are designed to work seamlessly with lambdas for operations like filtering, transformation, and counting.
[cgp-00024] helpful=140 harmful=2 :: Prioritize readable, Pythonic solutions using built-in functions over performance-optimized complex algorithms unless the problem explicitly requires optimization or involves large-scale data. A clear solution using bin(), str methods, or list comprehensions is often preferable to bit manipulation or manual loops. Optimize only when needed.
[cgp-00047] helpful=56 harmful=2 :: Follow a correctness-first development strategy: (1) implement the straightforward algorithm that correctly solves the problem, even if it is not optimally efficient, (2) verify correctness with test cases, (3) only then consider optimization if performance is insufficient or the problem explicitly requires it. A correct O(n) solution is infinitely better than a buggy O(log n) attempt. Premature optimization often introduces errors in logic, especially for mathematical or algorithmic problems.
[cgp-00050] helpful=0 harmful=0 :: When multiple algorithmically correct solutions exist, prefer the one with better time/space complexity. A correct O(1) formula-based solution is superior to a correct O(n) iterative solution. However, only optimize if you can maintain correctness - a working O(n) solution is infinitely better than a buggy O(1) attempt. Verify the more efficient approach passes all tests before committing to it.
[cgp-00053] helpful=0 harmful=0 :: When implementing mathematical optimizations (especially for pair/combination counting), verify the optimized approach against test cases through manual calculation BEFORE coding. For each test case: (1) apply your mathematical insight to predict the output, (2) verify it matches the expected output, (3) only then implement. This catches errors in mathematical reasoning early, preventing bugs that are harder to debug in code than in math.
[cgp-00057] helpful=0 harmful=0 :: Avoid shadowing Python built-in names (dict, list, str, int, set, tuple, etc.) when naming variables or parameters. Use descriptive alternatives instead: 'd' or 'data' instead of 'dict', 'lst' or 'items' instead of 'list', 's' or 'text' instead of 'str'. Shadowing built-ins makes them inaccessible in that scope and reduces code clarity, even though it's syntactically valid.
[cgp-00059] helpful=2 harmful=0 :: Include defensive programming practices (input validation, bounds checking, type checking) even when not explicitly tested by visible test cases. For string indexing, validate index bounds before access. For numeric conversions, verify the input is a valid digit. For list operations, check for empty collections. These safeguards increase code robustness and prevent runtime errors on edge cases that may exist in hidden tests, demonstrating production-quality coding practices.
[cgp-00074] helpful=0 harmful=0 :: For operations involving powers of two, prefer bitwise shift operators over arithmetic operations for clarity and efficiency: use left shift (1 << k) instead of 2**k or pow(2, k) for computing 2^k, use right shift (n >> k) instead of n // (2**k) for dividing by powers of two. Bitwise operators make the bit-level intent explicit and are the idiomatic approach in bit manipulation contexts. This is especially helpful when working with bit positions and their corresponding values.
[cgp-00081] helpful=0 harmful=0 :: Before using standard library mathematical constants (math.pi, math.e, etc.), validate that test cases expect full-precision values by calculating one test output and comparing to expected. If expected outputs suggest truncated/simplified constants (pi=3.14, pi=3.1415, e=2.718), use hardcoded values matching test precision instead of library constants. Pattern: (1) identify the mathematical constant needed, (2) calculate the test output with the standard constant, (3) if a mismatch exists, derive the constant value that produces exact expected outputs, (4) use the hardcoded value. Test case expectations override mathematical purity.
## COMMON PYTHON PATTERNS
[cpp-00010] helpful=23 harmful=0 :: For locating elements with maximum/minimum properties based on a criterion, use built-in max()/min() functions with the important thing parameter. Example: max(list_of_lists, key=len) finds the longest list. That is more Pythonic and readable than manual iteration with comparisons.
[cpp-00013] helpful=17 harmful=0 :: For counting or searching operations in Python collections (tuples, lists, strings), prioritize built-in methods: use .count() for occurrence counting, .index() for locating positions, .find() for strings. These are more reliable, efficient, and Pythonic than manual iteration with counters or loops.
[cpp-00014] helpful=3 harmful=0 :: When working with mixed-type data structures, use isinstance() for type checking to differentiate between different element types. Mix with len() checks to validate structure. Example: isinstance(item, list) and len(item) == 2 reliably identifies 2-element lists in mixed collections.
[cpp-00015] helpful=3 harmful=0 :: Use extend() as an alternative of append() when adding multiple elements from a sequence to a listing. extend() adds elements individually to the goal list, while append() would add your complete sequence as a single nested element. Example: result.extend([value] * count) vs result.append([value] * count).
[cpp-00016] helpful=2 harmful=0 :: Use list multiplication ([value] * count) to efficiently repeat elements. That is more Pythonic and readable than manual loops for creating repeated elements. Mix with extend() for adding repeated elements to existing lists.
[cpp-00019] helpful=2 harmful=0 :: For counting elements matching a condition with lambda functions, use sum(map(lambda x: 1 if condition else 0, iterable)) as a sublime alternative to len(list(filter(lambda x: condition, iterable))). The sum(map()) approach maps elements to 1/0 and sums them, often more readable and efficient than filtering then counting.
[cpp-00026] helpful=14 harmful=0 :: For converting sequences (tuples, lists) of characters/strings right into a single string, use str.join() method: ''.join(sequence) for character concatenation, or 'separator'.join(sequence) for joining with delimiters. That is the idiomatic Python approach - more readable and performant than manual loops with += or accumulation patterns.
[cpp-00030] helpful=1 harmful=0 :: For character classification with , use re.findall() with mutually exclusive character class patterns. For 'every thing else' categories (like special characters), prefer negation patterns [^...] over enumerating specific characters - e.g., [^A-Za-z0-9] captures all non-alphanumeric characters comprehensively, avoiding the brittleness of lists like [,.!?]. Ensure patterns don't overlap to forestall double-counting.
[cpp-00031] helpful=2 harmful=0 :: For finding the global maximum/minimum across nested iterables (list of tuples, list of lists, etc.), use nested generator expressions with built-in max()/min(): `max(element for container in containers for element in container)`. This pattern naturally flattens one level of nesting without creating intermediate lists, making it ideal for finding extremes across tuple records or sublists. More efficient and readable than manual iteration.
[cpp-00033] helpful=2 harmful=0 :: For index-based access to dictionary keys, use the pattern list(dict)[index] or list(dict.keys())[index]. This relies on the Python 3.7+ guarantee that dictionaries maintain insertion order. Converting the dictionary to a list extracts keys in order, allowing standard list indexing. This is the idiomatic Python solution for mapping numeric indices to dictionary keys.
[cpp-00036] helpful=27 harmful=2 :: For mathematical operations (GCD, LCM, factorial, prime checking, trigonometry), check Python's math module FIRST before implementing algorithms manually. Built-in functions like math.gcd(), math.factorial(), math.isqrt() are well-tested, optimized, and reduce implementation errors. Pattern: (1) Understand the mathematical definition, (2) Check if math module provides the operation, (3) Use it directly or wrap it with problem-specific logic (e.g., is_coprime = math.gcd(a,b) == 1).
[cpp-00038] helpful=0 harmful=0 :: For checking if a number is a perfect square, use math.isqrt() instead of math.sqrt() to avoid floating-point precision errors. Pattern: b = math.isqrt(n); is_perfect_square = (b * b == n). The isqrt() function returns the integer square root, and squaring it back allows exact integer comparison without floating-point rounding issues.
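
A minimal sketch of the isqrt() idiom from [cpp-00038]; the function name is mine:

```python
import math

def is_perfect_square(n: int) -> bool:
    # math.isqrt returns the integer square root, so the check stays in exact integer arithmetic
    if n < 0:
        return False
    r = math.isqrt(n)
    return r * r == n

assert is_perfect_square(49)
assert not is_perfect_square(50)
```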
[cpp-00043] helpful=0 harmful=0 :: For character filtering problems (removing/keeping characters based on membership criteria), use the set+comprehension+join pattern: (1) Convert the filter criteria into a set for O(1) lookup (char_set = set(filter_string)), (2) Use a list comprehension or generator expression to filter (char for char in source if char not in char_set), (3) Use ''.join() to reconstruct the string. This pattern is more Pythonic, readable, and maintainable than manual index manipulation or character counting approaches, while being equally correct and efficient.
[cpp-00049] helpful=0 harmful=0 :: When returning tuples or lists with mixed numeric types (integers and floats), use appropriate division operators for each component: integer division (//) for whole number results, regular division (/) for decimal results. Example: for sum and average, return (n*(n+1)//2, n*(n+1)/2/n) to ensure the sum is an int and the average is a float. This prevents type mismatches in test assertions.
[cpp-00054] helpful=0 harmful=0 :: For digit-by-digit comparison or manipulation problems (digit distance, digit sum differences, etc.): Use the string conversion pattern: (1) Convert integers to strings with str(), (2) Use zfill(max_length) to pad shorter numbers with leading zeros for equal length, (3) Use zip() to pair corresponding digit positions, (4) Apply operations on paired digits and aggregate results. Example: str(num1).zfill(length) and str(num2).zfill(length) then zip() for pairing. This handles different-length numbers elegantly and provides clean positional access to digits.
[cpp-00056] helpful=5 harmful=0 :: For checking if all/any elements in a collection satisfy a condition, use Python's built-in all() or any() functions with generator expressions. Pattern: all(condition for item in iterable) for universal quantification (all must satisfy), any(condition for item in iterable) for existential quantification (at least one satisfies). This is more Pythonic, readable, and efficient than manual loops with flags. Common use cases: all(v == target for v in dict.values()) for value uniformity, any(x > threshold for x in list) for threshold checking, all(isinstance(x, int) for x in collection) for type validation.
[cpp-00060] helpful=0 harmful=0 :: For whitespace normalization (collapsing multiple spaces/whitespace into single spaces), use the split-join pattern: ' '.join(s.split()). The key insight: str.split() without arguments has special behavior - it splits on ANY whitespace (spaces, tabs, newlines) AND automatically removes empty strings from the result, naturally collapsing consecutive whitespace. Combined with ' '.join(), this creates a clean solution without regex imports. This pattern is more Pythonic and maintainable than regex alternatives like re.sub(r' +', ' ', s) for simple whitespace normalization tasks.
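
The split-join pattern from [cpp-00060] in a minimal, self-contained form; the function name is my own:

```python
def normalize_whitespace(s: str) -> str:
    # str.split() with no argument splits on any whitespace run and drops empty strings,
    # so joining with a single space collapses tabs, newlines, and repeated spaces
    return ' '.join(s.split())

assert normalize_whitespace("  hello \t world\n\n ") == "hello world"
```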
[cpp-00062] helpful=0 harmful=0 :: For complex number operations (polar/rectangular conversion, phase calculation, magnitude), use Python's cmath module functions as the first choice: cmath.polar(z) for conversion to polar form (returns magnitude and angle), cmath.rect(r, phi) for polar to rectangular, cmath.phase(z) for angle extraction. These built-in functions handle edge cases correctly (e.g., treating real numbers as complex with imaginary part 0) and are more reliable than manual trigonometric calculations. Pattern: import cmath → use the appropriate function → handle the return type (often tuples).
[cpp-00064] helpful=0 harmful=0 :: For grouping elements by a key while preserving insertion order (critical for tie-breaking in subsequent sorting), use collections.OrderedDict with setdefault pattern: from collections import OrderedDict; grouped = OrderedDict(); for item in items: grouped.setdefault(key, []).append(value). While Python 3.7+ dicts maintain insertion order, OrderedDict makes the intent explicit and is safer when order matters for downstream operations like sorting by aggregated properties where equal values should maintain original encounter order.
[cpp-00065] helpful=0 harmful=0 :: For creating tuples with variable-length unpacked elements, use the * unpacking operator: (first, *middle_elements, last) unpacks a list/tuple into individual tuple positions. Example: (key, *values, count) where values is a list creates a tuple with key, all values unpacked as separate elements, and count at the end. This is essential when the output format requires flattening nested structures into single-level tuples with variable element counts.
[cpp-00069] helpful=0 harmful=0 :: For regex pattern matching problems requiring full string matches, choose between re.search(), re.match(), and re.fullmatch() based on matching scope: re.match() matches from the beginning, re.search() finds patterns anywhere, re.fullmatch() requires the entire string to match. When full string matching is required, either use re.fullmatch() with the pattern directly, or use re.search()/re.match() with explicit anchors (^ for start, $ for end). Example: re.fullmatch('a.*b', s) is equivalent to re.search('^a.*b$', s). Both approaches are valid - fullmatch() makes the intent explicit, while search() with anchors provides more flexibility. Always analyze test cases to determine whether partial or full string matching is required.
[cpp-00072] helpful=1 harmful=0 :: For counting elements in an iterable that match a condition, use the generator expression pattern with sum(): sum(1 for x in iterable if condition). This provides an optimal balance of readability, memory efficiency, and Pythonic style compared with alternatives like len([x for x in iterable if condition]), which creates an intermediate list. For character-level string operations, prefer built-in string methods (isdigit(), isalpha(), isalnum(), isupper(), islower()) over manual ASCII range comparisons - they handle edge cases correctly, improve clarity, and are more maintainable.
[cpp-00073] helpful=0 harmful=0 :: For bit manipulation problems (finding set bits, MSB/LSB positions, bit counting), check Python's integer bit methods FIRST before implementing manual algorithms: bit_length() returns the number of bits needed to represent the integer (useful for the MSB position), bit_count() counts set bits (Python 3.10+), as_integer_ratio() gives a rational representation. These built-in methods are optimized, handle edge cases (including 0), and often eliminate the need for manual bit-by-bit iteration. Pattern: understand what bit property you need, then check if a built-in method provides it directly.
[cpp-00076] helpful=0 harmful=0 :: For grouping consecutive identical elements in a sequence, use itertools.groupby() as the canonical Python solution. Pattern: [list(group) for key, group in itertools.groupby(sequence)]. The groupby function returns (key, group_iterator) tuples where key is the element value and group is an iterator of consecutive occurrences. Convert each group iterator to a list to materialize results. Critical distinction: groupby groups CONSECUTIVE identical elements only - non-consecutive duplicates form separate groups, making it ideal for run-length encoding and consecutive duplicate detection without manual index tracking.
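
A small sketch of the itertools.groupby() pattern from [cpp-00076], used here for run-length encoding; the function name is my own illustration:

```python
from itertools import groupby

def run_length_encode(s: str) -> list[tuple[str, int]]:
    # groupby only groups *consecutive* identical elements, which is exactly
    # what run-length encoding needs
    return [(char, len(list(group))) for char, group in groupby(s)]

assert run_length_encode("aaabccaa") == [('a', 3), ('b', 1), ('c', 2), ('a', 2)]
```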
## HANDLING EDGE CASES
[hec-00021] helpful=2 harmful=0 :: When using mathematical operations like modulo (%), division, or exponentiation, verify that the solution handles negative numbers correctly. For instance, the modulo operator works correctly for both positive and negative integers in Python (e.g., -18 % 2 == 0 for even number checking), but the behavior may differ from expectations in other languages.
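
A quick illustration of the Python-specific modulo behaviour mentioned in [hec-00021]; the function name is mine:

```python
def is_even(n: int) -> bool:
    # In Python, % returns a result with the sign of the divisor,
    # so n % 2 == 0 works for negative integers as well
    return n % 2 == 0

assert is_even(-18)
assert not is_even(-7)  # -7 % 2 == 1 in Python, whereas in C it would be -1
```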
## ALGORITHM DESIGN
[ad-00001] helpful=1 harmful=2 :: For recursive GCD problems, use the Euclidean algorithm: base case is b == 0 (return a), recursive case is gcd(b, a % b). This handles all edge cases naturally including argument ordering, equal numbers, and divisibility.
[ad-00006] helpful=0 harmful=0 :: For bidirectional character swap problems (A↔B) using regex: use re.sub() with a callback function in a single pass. Pattern: (1) Create a character class matching all swap targets (e.g., r'[ _]'), (2) Implement a callback that examines each match and returns its counterpart. This avoids ambiguity from sequential replacements, where newly inserted characters become indistinguishable from originals.
[ad-00008] helpful=0 harmful=0 :: For modular arithmetic problems (nCr mod p, etc.), check whether p is guaranteed to be prime. If p can be composite, avoid algorithms requiring a modular inverse (like Fermat's Little Theorem). Instead, use approaches that avoid division entirely, such as Pascal's triangle with DP: C[j] = (C[j] + C[j-1]) % p, which works for ANY modulus.
[ad-00009] helpful=0 harmful=0 :: When division is required in modular arithmetic: (1) If the modulus is guaranteed prime, use Fermat's Little Theorem: a/b mod p = a * b^(p-2) mod p. (2) If the modulus may be composite, use the Extended Euclidean Algorithm for the modular inverse, or better yet, redesign to avoid division (e.g., use recurrence relations like Pascal's triangle).
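
A minimal sketch of the division-free approach from [ad-00008]/[ad-00009]: a Pascal's-triangle DP that computes C(n, r) mod m for any modulus, prime or not. The function name is mine:

```python
def ncr_mod(n: int, r: int, m: int) -> int:
    # 1D Pascal's-triangle DP: c[j] holds C(row, j) mod m after each row update.
    # Updating j from high to low avoids overwriting values still needed,
    # and no division is used, so composite moduli are fine.
    if r < 0 or r > n:
        return 0
    c = [0] * (r + 1)
    c[0] = 1 % m
    for _ in range(n):
        for j in range(r, 0, -1):
            c[j] = (c[j] + c[j - 1]) % m
    return c[r]

assert ncr_mod(5, 2, 1000) == 10
assert ncr_mod(10, 3, 6) == 120 % 6   # composite modulus, no modular inverse needed
```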
[ad-00017] helpful=1 harmful=0 :: For decoding problems with mixed encoded/non-encoded elements: (1) use type checking to distinguish element types, (2) validate the encoded element structure, (3) handle each type appropriately in a single pass. Prioritize simple iterative approaches with explicit conditionals over complex comprehensions for better readability and maintainability.
[ad-00018] helpful=4 harmful=0 :: For maximum sum problems with non-adjacent element constraints: Use dynamic programming with the recurrence dp[i] = max(arr[i] + dp[i-2], dp[i-1]), representing the choice to include the current element (add it to the best from i-2) or exclude it (keep the best from i-1). Handle edge cases: an empty array returns 0, a single element returns that element; initialize dp[0] = arr[0] and dp[1] = max(arr[0], arr[1]). Time: O(n), Space: O(n) or O(1) with optimization.
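
A compact sketch of the non-adjacent maximum-sum recurrence from [ad-00018], using the O(1)-space variant; the function name is mine and the inputs are assumed non-negative:

```python
def max_non_adjacent_sum(arr: list[int]) -> int:
    # include = best sum where the current element is taken,
    # exclude = best sum where the current element is skipped
    include, exclude = 0, 0
    for x in arr:
        include, exclude = exclude + x, max(include, exclude)
    return max(include, exclude)

assert max_non_adjacent_sum([3, 2, 7, 10]) == 13            # 3 + 10
assert max_non_adjacent_sum([5, 5, 10, 100, 10, 5]) == 110  # 5 + 100 + 5
assert max_non_adjacent_sum([]) == 0
```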
[ad-00023] helpful=0 harmful=0 :: For bit counting and parity checking problems: Multiple valid approaches exist with different trade-offs. (1) Pythonic approach: bin(n).count('1') - most readable and maintainable, (2) Bit manipulation: repeatedly apply x & (x-1) to clear the lowest set bit - better performance for large inputs, (3) XOR reduction for parity. Choose the Pythonic approach by default unless performance profiling shows it is a bottleneck.
[ad-00028] helpful=1 harmful=1 :: For bit toggling problems: (1) Create a mask with 1s at the positions to be toggled, (2) Use the XOR operation (n ^ mask) to toggle those bits. For variable-length numbers, use bit_length() to determine how many bits to process. Example: to toggle bits at positions 1, 3, 5, ... up to bit_length, generate mask = sum(1 << i for i in range(1, n.bit_length(), 2)).
[ad-00037] helpful=0 harmful=0 :: For element rearrangement/partitioning problems (move zeros to the end, separate by condition, etc.): Use the filter+concatenate pattern: (1) filter elements into separate groups using list comprehensions [x for x in lst if condition], (2) count or collect each group separately, (3) concatenate the groups in the required order. This Pythonic approach using built-ins (list comprehension, count(), list multiplication) is often clearer and equally correct compared with in-place two-pointer algorithms, especially for small to medium datasets.
[ad-00039] helpful=0 harmful=0 :: For 'sum of two squares' problems (checking if n = a² + b²): Use a single-loop O(√n) optimization instead of nested O(n) loops. Iterate one variable from 0 to √n, calculate the remainder (n - a²), and check if the remainder is a perfect square using math.isqrt(). Return True immediately upon finding a valid pair. This pattern: (1) reduces time complexity, (2) handles edge cases naturally (a=0, a=√n), (3) avoids floating-point errors thanks to isqrt().
[ad-00041] helpful=4 harmful=1 :: For geometry and formula-based mathematical problems: Follow a structured approach: (1) Identify the correct mathematical formula from problem domain knowledge, (2) Implement the formula as a direct translation into code using math module functions, (3) Avoid reimplementing mathematical functions or constants that exist in standard libraries, (4) Verify the formula with at least one test case before coding. Direct formula translation results in cleaner, more maintainable code with better numerical precision.
[ad-00042] helpful=0 harmful=0 :: For problems selecting elements from both ends of a collection (k smallest AND k largest), use approaches that handle overlap: (1) Index-based selection: iterate over the sorted collection and include elements where idx < k OR idx >= len-k, ensuring each element is chosen once, or (2) Set union: combine the subsets with set(min_k + max_k), then sort to eliminate duplicates. Always consider edge cases where k*2 >= collection_size, as this guarantees overlap between the minimum and maximum selections. Avoid simple list concatenation, which creates duplicates when the ranges overlap.
[ad-00045] helpful=0 harmful=0 :: For 'find the n-th number with property X' problems: Use the iterative counting pattern: (1) implement a helper function to check whether a number satisfies the property, (2) iterate through candidate numbers starting from an appropriate initial value, (3) maintain a counter of numbers that satisfy the property, (4) return the candidate when the counter reaches n. This pattern works for prime numbers, perfect squares, numbers with specific factorization properties, etc. It's straightforward to implement correctly and optimize later if needed.
[ad-00046] helpful=3 harmful=0 :: For counting distinct prime factors: Use the standard factorization pattern: (1) iterate potential divisors from 2 to sqrt(n), (2) for each divisor that divides n, increment the distinct factor count, then divide n by that divisor repeatedly until it no longer divides (this ensures each prime is counted once regardless of its power), (3) after the loop, if n > 1, it is a remaining prime factor (count it), (4) optimize by checking divisor 2 separately, then only odd numbers. This correctly distinguishes between distinct primes and their multiplicities.
[ad-00048] helpful=1 harmful=0 :: For mathematical sequence problems (sum of the first n numbers, arithmetic/geometric series, factorial-related), check if a closed-form formula exists before implementing iterative solutions. Common formulas: sum(1..n) = n*(n+1)/2, sum of arithmetic series = n*(first+last)/2, sum of geometric series = a*(r^n - 1)/(r-1). Formula-based solutions provide O(1) time complexity vs O(n) for loops, are less error-prone, and demonstrate mathematical insight. Always verify formula correctness with test cases.
[ad-00051] helpful=1 harmful=0 :: For pair-counting problems (count pairs satisfying a condition), look for mathematical properties that eliminate the need for explicit enumeration. Pattern: (1) Identify what makes a pair valid, (2) Find mathematical properties characterizing valid pairs (e.g., for XOR being odd: one number must be even, the other odd), (3) Transform it into a counting problem (count elements in each category), (4) Use combinatorics to compute the result (e.g., odd_count × even_count). This reduces O(n²) pair enumeration to O(n) categorization + O(1) calculation.
[ad-00052] helpful=0 harmful=0 :: For problems involving XOR operations, leverage bit-level properties for optimization: (1) the XOR result is odd ⟺ the operands have different parities (one even, one odd), because parity depends on the least significant bit, (2) XOR is commutative and associative, allowing reordering, (3) x ^ x = 0 and x ^ 0 = x, useful for cancellation patterns. Analyze the specific XOR property relevant to your problem to find mathematical shortcuts that avoid brute-force computation.
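
A minimal sketch combining [ad-00051] and [ad-00052]: counting pairs with an odd XOR reduces to counting parities. The function name is mine:

```python
def count_odd_xor_pairs(nums: list[int]) -> int:
    # a ^ b is odd exactly when a and b have different parities,
    # so instead of enumerating O(n^2) pairs we count odds and evens once
    odd = sum(1 for x in nums if x % 2)
    even = len(nums) - odd
    return odd * even

assert count_odd_xor_pairs([1, 2, 3, 4]) == 4  # (1,2), (1,4), (3,2), (3,4)
```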
[ad-00061] helpful=0 harmful=0 :: For iterative mathematical sequence problems (sum/product of the first n terms with specific properties): Use a structured 3-step approach: (1) Identify the formula for generating the k-th element (e.g., 2k-1 for odd numbers, 2k for even numbers, k² for squares), (2) Determine the operation to apply to each element (exponentiation, multiplication, transformation), (3) Aggregate with the appropriate function (sum, product, max). Implement using generator expressions with built-ins: sum(operation(formula(i)) for i in range(start, n+1)). Ensure range bounds match the sequence indexing (1-indexed sequences need range(1, n+1)). This pattern provides clarity and correctness for problems where closed-form formulas don't exist or aren't obvious.
[ad-00066] helpful=0 harmful=0 :: For problems requiring grouping, counting, and sorting by aggregated properties: (1) Group elements using dict/OrderedDict with setdefault() or defaultdict, selecting OrderedDict when insertion order affects tie-breaking in sorting, (2) Sort groups using sorted() with key function based on aggregated metric (e.g., key=lambda x: len(x[1]) for count), (3) Transform output to match required format using appropriate unpacking/restructuring. This pattern handles 'group by X, sort by count of Y' problems systematically.
[ad-00068] helpful=0 harmful=0 :: For heap-based 'top k' problems, verify the OUTPUT ORDERING against test cases, not just which elements to return. Key distinctions: (1) heappop() from a min-heap produces ASCENDING order by the heap key, (2) heapq.nlargest(k, items, key=func) produces DESCENDING order by key, (3) heapq.nsmallest(k, items, key=func) produces ASCENDING order by key. When implementing heap solutions, trace through test cases to determine whether results should be ordered ascending or descending by frequency/priority. If the ordering is wrong, either reverse the final list, switch between nlargest/nsmallest, or use the heappop pattern. Test case output ordering is authoritative when the problem description doesn't explicitly specify it.
[ad-00070] helpful=0 harmful=0 :: For 2D grid problems with adjacency or selection constraints (cannot pick adjacent cells/rows/columns): Look for opportunities to reduce dimensionality before applying DP. If the constraints allow picking at most one element per column (or row), pre-compute the optimal selection for each column/row (e.g., the max of two rows in a column), transforming the problem into a 1D array. Then apply standard 1D DP patterns (like 'house robber' for non-adjacency). This dimensional reduction simplifies the state space and makes complex grid problems tractable using well-known DP templates.
[ad-00071] helpful=0 harmful=0 :: Recognize the 'house robber' DP pattern as a fundamental template applicable beyond linear arrays: any problem involving selecting non-adjacent elements to maximize/minimize a sum can use the recurrence dp[i] = max(value[i] + dp[i-2], dp[i-1]). This pattern appears in: linear arrays with spacing constraints, grid problems (after dimensional reduction), tree problems (with parent-child constraints), and sequence optimization. If you see 'maximize sum' + 'cannot pick adjacent', immediately consider this template.
[ad-00075] helpful=0 harmful=0 :: For finding the most significant bit (MSB) value or position: Use the bit_length() method, which returns the number of bits required to represent an integer. For the MSB value, use the pattern 1 << (n.bit_length() - 1), which leverages the fact that the MSB at position k (0-indexed from the right) has value 2^k. The bit_length() approach is cleaner than manual division loops or string conversion methods. Handle the edge case: bit_length() returns 0 for n=0, so verify the problem constraints or add explicit zero handling if needed.
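
A minimal sketch of the bit_length() idiom from [ad-00075], with explicit zero handling; the function name is mine:

```python
def msb_value(n: int) -> int:
    # Highest power of two <= n; bit_length() returns 0 for n == 0,
    # so non-positive inputs are handled explicitly
    if n <= 0:
        return 0
    return 1 << (n.bit_length() - 1)

assert msb_value(0) == 0
assert msb_value(1) == 1
assert msb_value(37) == 32   # 37 = 0b100101, so the MSB has value 2**5
```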
## TEST CASE INTERPRETATION
[tci-00004] helpful=0 harmful=0 :: Multiple correct implementations can exist for the same problem. Focus on algorithmic correctness verified by passing tests, not on matching a particular reference implementation's style or structure.
[tci-00011] helpful=123 harmful=2 :: Extract the expected OUTPUT FORMAT from test cases, not just the logic. Check whether the return should be a single value, tuple, list, or other structure, and make sure your solution matches this exact format.
[tci-00022] helpful=0 harmful=1 :: When analyzing test cases, check if ALL inputs map to the SAME output value or structure. If so, the solution may be trivial - simply return that constant output directly. Don't overcomplicate with unnecessary transformations (like list conversions) when a direct return statement satisfies all requirements. Example: if all test cases expect an empty tuple output, return () regardless of input complexity.
[tci-00025] helpful=5 harmful=0 :: Before selecting an implementation approach, deeply understand the CORE REQUIREMENT from the problem description and test cases. For instance, 'even parity' means 'an even count of 1-bits', not a specific algorithm. Don't lock into a particular technique (like bit manipulation) if simpler alternatives (like string counting) satisfy the requirement equally well.
[tci-00027] helpful=17 harmful=0 :: When problem descriptions use ambiguous terminology (especially in bit manipulation: 'even bits', 'odd positions', etc.), work backward from test cases to discover the actual pattern. Manually trace through examples in their relevant representation (binary for bit problems) to determine the ground-truth interpretation. Test cases are authoritative when terminology is unclear.
[tci-00032] helpful=0 harmful=0 :: When problems ask for the 'maximum/minimum of all records/groups', clarify whether it means: (1) the global extreme across all elements, or (2) per-group extremes returned as a collection. Test cases reveal the distinction: a single-value output indicates a global extreme, while a list/tuple output suggests per-group results. This interpretation affects whether you flatten the structure or preserve grouping.
[tci-00034] helpful=0 harmful=0 :: For dictionary-related problems, carefully distinguish from test cases whether the expected output is: (1) a key (string/int), (2) a value, (3) a key-value pair (tuple), or (4) a collection of any of these. The output type determines whether you need dict.keys(), dict.values(), dict.items(), or direct indexing into converted structures. Test case outputs reveal the precise format required.
[tci-00035] helpful=3 harmful=0 :: When function names or problem descriptions suggest specific behavior (e.g., 'parallelogram_perimeter' implying the geometric formula 2*(a+b)), but test cases produce outputs inconsistent with that expectation, trust the test cases as the authoritative specification. Reverse-engineer the actual formula by calculating what operation on the inputs produces the given outputs, then verify this derived pattern against ALL test cases before implementing. Test case expectations override semantic meanings and domain knowledge.
[tci-00040] helpful=0 harmful=0 :: Test results are the primary signal of correctness, not line-by-line comparison with reference implementations. If your solution passes all tests with better time complexity (e.g., O(√n) vs O(n)), it is not just correct but algorithmically superior. Different approaches can be equally or more valid - focus on correctness verification through tests, not on matching specific implementation styles.
[tci-00044] helpful=2 harmful=0 :: When encountering undefined or domain-specific mathematical terms (like 'smart number', 'lucky number', etc.), treat the test cases as the authoritative specification. Systematically analyze test case outputs to reverse-engineer the mathematical definition: (1) examine the numerical properties of the output values (factorization, divisors, digits, etc.), (2) look for patterns or common characteristics across all outputs, (3) formulate a hypothesis about the defining property, (4) verify the hypothesis against ALL test cases. The test cases encode the complete definition when the problem statement is ambiguous.
[tci-00055] helpful=2 harmful=0 :: When problem terminology is completely ambiguous or undefined (like 'digit distance', which could have multiple interpretations), systematically trace through EACH test case manually to identify the exact pattern: (1) Work through inputs and outputs step by step in the relevant representation, (2) Formulate a hypothesis about what operation produces those outputs, (3) Verify the hypothesis against ALL remaining test cases, (4) Implement the pattern that satisfies all tests. The test-derived pattern is the correct specification, regardless of what the terminology might suggest in other contexts.
[tci-00058] helpful=0 harmful=0 :: Multiple algorithmically different solutions can be equally valid if they satisfy all test cases. When deriving requirements from ambiguous specifications, use systematic hypothesis testing: (1) analyze each test case to understand the input-output relationships, (2) formulate a hypothesis about the underlying rule, (3) validate the hypothesis against ALL test cases, (4) implement the pattern that passes all tests. Your solution is correct by definition if it satisfies all test requirements, even if it differs structurally from reference implementations or uses a different interpretation of ambiguous terms.
[tci-00063] helpful=0 harmful=0 :: In Python, parentheses alone don't create tuples - distinguish between ('value'), which is just the string 'value' (parentheses are for grouping/precedence), and ('value',), which is a 1-element tuple (trailing comma required). When analyzing test assertions like assert func()==('Matched!'), recognize that this expects a plain string, not a tuple. Only ('Matched!',) with a trailing comma or (a, b) with multiple elements create tuples. This syntax nuance is critical for matching expected return types exactly.
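
The syntax nuance from [tci-00063] in two lines:

```python
# Parentheses alone do not make a tuple; the trailing comma does
assert type(('Matched!')) is str      # just the string 'Matched!'
assert type(('Matched!',)) is tuple   # a 1-element tuple
```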
[tci-00067] helpful=0 harmful=0 :: When test cases show complex output structures (tuples with variable-length unpacked elements, nested aggregations), analyze the EXACT structure before coding: (1) Count the elements in output tuples/lists, (2) Identify which elements are aggregated vs individual, (3) Determine if nested structures are flattened (unpacked) or preserved, (4) Check if ordering within groups matters. Use this structural analysis to choose appropriate Python constructs (* unpacking, list flattening, tuple construction patterns) that match the expected format precisely.
[tci-00077] helpful=0 harmful=0 :: For counting/aggregation problems involving nested structures (lists of lists, trees, nested dictionaries), when the problem asks to 'count elements' without specifying the level, use test cases to determine the counting scope: (1) Check if test outputs suggest counting only immediate/top-level children (e.g., len(outer_list)) vs recursive counting of all nested elements, (2) Trace through at least one test case with nested structures to see which interpretation produces the expected output, (3) The simplest interpretation (top-level counting) is usually correct unless test cases prove otherwise. Example: 'count lists in [[1,2], [3], [[4,5]]]' could mean 3 (top-level) or 4 (recursive) - test outputs reveal which is expected.
[tci-00078] helpful=0 harmful=0 :: For mathematical problems with infinitely many valid solutions (linear Diophantine equations, modular arithmetic, geometric constructions, etc.), recognize that tests expect ONE PARTICULAR solution, not just any mathematically correct answer. Work through test cases to identify the selection criteria (e.g., smallest non-negative values, specific ordering, canonical form). When selecting algorithms, prefer approaches that naturally produce the expected solution pattern (e.g., iterative search from x=0 upward for the smallest non-negative x) over sophisticated algorithms (e.g., the Extended Euclidean Algorithm) that require additional adjustment logic to match test expectations. The mathematically elegant solution is not always the right one for passing tests.
[tci-00079] helpful=0 harmful=0 :: For problems involving mathematical constants (pi, e, sqrt(2), etc.), verify that the expected test case outputs match calculations using standard library constants (math.pi, math.e). Calculate at least one test case output manually using the standard constant and compare it to the expected value. If there is a mismatch in precision (e.g., your 942.477 vs expected 942.45), the test cases likely expect a simplified/truncated constant value (like pi=3.14 or pi=3.1415) rather than full precision. Check reference implementations for hardcoded constant values and use those exact values to match test expectations, even if they're less mathematically accurate.
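
A small sketch of the precision check described in [cgp-00081], [tci-00079] and [ds-00080]; the function name and the expected value are illustrative, not taken from the article:

```python
import math

def circle_area(radius: float, pi: float = math.pi) -> float:
    return pi * radius ** 2

# Hypothetical test expectation that implies a truncated constant (pi = 3.14)
expected = 314.0

if abs(circle_area(10) - expected) > 1e-6:
    # The mismatch with math.pi (~314.159) suggests the tests were generated
    # with a hardcoded pi = 3.14, so match that value to satisfy the tests
    assert abs(circle_area(10, pi=3.14) - expected) < 1e-9
```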
## DEBUGGING STRATEGIES
[ds-00005] helpful=110 harmful=2 :: Before generating code, mentally trace through the logic against test cases to verify correctness. This helps catch logical errors early and builds confidence in the solution approach.
[ds-00029] helpful=0 harmful=0 :: For bit manipulation problems with unclear position indexing, test multiple interpretations systematically: (1) 0-indexed vs 1-indexed, (2) counting from right vs left, (3) 'even/odd' referring to position vs bit value. Work through all test cases manually in binary to validate each hypothesis before implementing. The interpretation that satisfies all test cases is correct.
[ds-00080] helpful=0 harmful=0 :: During the reasoning phase, manually calculate the expected output for at least one test case using your proposed approach and compare it against the actual expected output. For numerical problems, verify that precision matches exactly - discrepancies like 942.477 vs 942.45 indicate constant precision mismatches (e.g., using math.pi instead of a truncated value). This early validation catches precision issues, wrong formulas, and constant value problems before code generation.

These results show that ACE can significantly improve performance on complex tasks like code generation.

Summary

In this article, we’ve covered a lot of ground on context engineering and the ACE approach, so let’s briefly recap the key takeaways:

  • Context engineering has emerged as a critical field because it allows us to improve LLM performance without lengthy and expensive fine-tuning.
  • ACE (Agentic Context Engineering) is one of the most recent approaches to prompt optimisation, leveraging detailed playbooks with atomised bullet points that include both instructions and metadata.
  • As our examples showed, prompt optimisation isn’t a silver bullet. It doesn’t improve performance in every case. According to the authors, ACE is most effective for agentic workflows or highly specialised domains. In our experiments, it made a clear difference in code generation, but had limited impact on banking intent classification.

The main takeaway for me is that prompt optimisation won’t solve your task automatically. You still need a holistic understanding of what information the LLM and agents have during the optimisation process and how best to structure and refine it. Context matters, and thoughtful engineering of that context is what makes approaches like ACE effective.

Reference

This article is based on the paper by Zhang et al. (2025), “Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models”.
