The initial response from an LLM doesn't suit you? You rerun it, right? Now, if you were to automate that…
success = False
while not success:
    response = prompt.invoke()
    success = evaluate(response)
Alright, something like that. People have done it for code, and the same applies to non-code if the evaluation function is suitable. These days, you can use LLMs for both content generation and evaluation. However, a simple while loop that waits for the best random result is not always good enough. Sometimes, you need to modify the prompt. Experiment and mix things up, and keep track of what works and what doesn't. Follow different ideation paths to keep your options open…
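To make that concrete, a naive extension of the loop above might look like this (just a rough sketch; generate, evaluate, and mutate_prompt are hypothetical helpers, not a real library API):

best_score, best_response = 0.0, None
prompt = "Write a short poem about the sea."   # hypothetical starting prompt
history = []                                   # keep track of what works and what doesn't

for _ in range(20):
    response = generate(prompt)                # call your LLM of choice
    score = evaluate(response)                 # numeric and/or LLM-based judgement
    history.append((prompt, score))
    if score > best_score:
        best_score, best_response = score, response
    prompt = mutate_prompt(prompt, history)    # experiment and mix things up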
In this article, we'll discuss how OpenEvolve [1], an open-source implementation of Google's AlphaEvolve paper [2], can be used for content creation. In the background, it applies this "experiment and mix, follow different paths" approach to optimize the LLM prompts.
The AlphaEvolve paper applied an evolutionary system to code generation with LLMs. Read more about the exciting, brand-new results of this paper in my article, Google's AlphaEvolve: Getting Started with Evolutionary Coding Agents. In essence, in a survival-of-the-fittest scheme, programs are mixed and improved upon. The authors suggest that these evolutionary coding agents can achieve research breakthroughs and present several results.
Due to the sheer variety of things that content can be, I think there is potential for high-value content creation apart from code that utilizes such a long-running, continuous evolution process. In this article, we explore how to apply the same technology to a non-code use case where LLMs, rather than algorithms, judge the results of the LLM-generated solution. We also discuss how to examine the results.
Prerequisites
First, let's prepare a quick, basic setup.
LLM server
In order to use OpenEvolve, you need access to an LLM server with OpenAI-compatible API endpoints. You can register with Cerebras (they have a free tier), OpenAI, Google Gemini, or a similar service. Alternatively, if you have a capable GPU, you can set up your own server, for instance with ollama. You will need to pick at least two different LLM models, a weak one (e.g., 4bn parameters) and a strong one (e.g., 17bn parameters).
Python environment & git
I presume that you are running a Linux system with a prepared Python environment, in which you can create virtual environments and install packages from the Python Package Index.
OpenEvolve setup
Install OpenEvolve, then prepare your own project & prompt folders:
git clone https://github.com/codelion/openevolve.git
cd openevolve
python3 -m venv .venv
source .venv/bin/activate
pip install -e .
mkdir -p examples/my_project/prompts
A little warning: OpenEvolve is currently a research project. Its code base is still developing quickly. Therefore, it is a good idea to follow all updates closely.
Configuration
Create the file examples/my_project/config.yaml:
checkpoint_interval: 1

# LLM configuration
llm:
  models:
    - name: "llama3.1-8b"
      weight: 0.8
      temperature: 1.5
    - name: "llama-4-scout-17b-16e-instruct"
      weight: 0.2
      temperature: 0.9
  evaluator_models:
    - name: "llama-4-scout-17b-16e-instruct"
      weight: 1.0
      temperature: 0.9
  api_base: "https://api.cerebras.ai/v1/" # The base URL of your LLM server API

# Prompt configuration
prompt:
  template_dir: "examples/my_project/prompts"
  num_top_programs: 0
  num_diverse_programs: 0

# Database configuration
database:
  num_islands: 3

# Evaluator configuration
evaluator:
  timeout: 60
  cascade_evaluation: false
  use_llm_feedback: true
  llm_feedback_weight: 1.0 # (Non-LLM metrics are weighted with a factor of 1)

diff_based_evolution: true
allow_full_rewrites: false
To get a general idea of what you are configuring here, consider how new solutions are generated and evaluated in OpenEvolve. Solutions consist of their respective text content and are stored in a database alongside their evaluation metrics and "side channel" textual results (e.g., errors during execution or textual improvement suggestions). The database also stores a list of elite programs and programs that perform particularly well on different metrics (MAP-Elites) in order to provide inspirations for new solutions. An LLM generates these new, mutated solutions based on a single parent. Programmatic and/or LLM evaluators then judge the new solution before it is fed back into the database.
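As a conceptual sketch (with illustrative names only, not the actual OpenEvolve API), one iteration roughly does the following:

# Conceptual sketch of one OpenEvolve iteration; the names are illustrative,
# not the actual OpenEvolve API.
def run_iteration(database, generator_llm, evaluators):
    parent = database.sample_parent()               # pick a parent from the current island
    prompt = build_prompt(parent.content,           # current solution text
                          parent.metrics,           # its evaluation metrics
                          parent.artifacts)         # "side channel" textual feedback
    child_content = generator_llm.generate(prompt)  # mutated solution (diff or full rewrite)
    metrics, artifacts = run_evaluators(child_content, evaluators)  # programmatic and/or LLM judges
    database.store(child_content, metrics, artifacts)               # fed back into the evolution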
The configuration options include:
- llm: models, evaluator_models
For generation and evaluation, you can configure any number of models.
The idea behind using multiple models is to use a fast (weak) model that quickly explores many different options and a slower (stronger) model that adds quality. For generation, the weight parameter controls the probability that each model will be chosen in an iteration; only one model is used at a time, not several. For evaluation, all models are executed every time, and their output metrics are weighted with the specified parameter.
The temperature setting influences how random these models behave. A value of 1.5 is very high, and 0.9 is still a hot temperature value. For the creative use case, I think these are good. For business content or code, use lower values. The OpenEvolve default setting is 0.7.
- prompt: template_dir
The template_dir option specifies the directory that contains the prompt templates that are used to overwrite the defaults. See below for more information on the folder's contents.
- prompt: num_top_programs, num_diverse_programs
The prompts for generating new solutions can include inspirations from other programs in the database. With a value of 0, I turned this function off, because I found that the inspirations (which do not include the content itself, rather just metrics and a change summary) were not too useful for creative content evolution.
- database: num_islands
This option controls how many separate sub-populations are maintained in the database. The more islands you use, the more diverging solution paths will result, whereas within the same island you will observe fewer substantial variations. For creative use cases, if you have enough time and resources to run many iterations, it may be useful to increase the number of islands.
- evaluator: llm_feedback_weight
The combined metrics generated by the evaluation LLMs are multiplied with this parameter. Together with the algorithmically generated metrics, the numeric average is then used to find the best program. Say the algorithmic metric was 1.0 and the evaluation LLM produced two metrics of 0.5 and 0.7, with an llm_feedback_weight of 1.0; the overall score would then be (1.0 + 0.5*1.0 + 0.7*1.0)/3 (see the short sketch after this list).
- diff_based_evolution / allow_full_rewrites:
Two different prompt approaches for the generator LLM are supported. In the diff mode, the LLM uses a search-and-replace response format to replace specific elements in the current solution. In the full_rewrite mode, the LLM simply outputs a full rewrite. The latter mode is less demanding for less capable LLMs, but it is also less suitable for long content. Quality is also better with diff mode, based on my tests.
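To make the llm_feedback_weight arithmetic from the list above concrete, here is a minimal sketch (the metric names are illustrative, not OpenEvolve internals):

# Minimal sketch of the combined-score computation described above;
# metric names are illustrative, not OpenEvolve internals.
algorithmic_metrics = {"length_good": 1.0}
llm_metrics = {"beauty": 0.5, "emotion": 0.7}
llm_feedback_weight = 1.0

values = list(algorithmic_metrics.values()) + [
    v * llm_feedback_weight for v in llm_metrics.values()
]
overall = sum(values) / len(values)  # (1.0 + 0.5*1.0 + 0.7*1.0) / 3 = 0.73...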
For more options, refer to the default configuration file in the OpenEvolve repository.
Prompts
OpenEvolve's default prompts are written for code evolution. Therefore, its prompts are not suitable for non-code generation by default. Fortunately, we can overwrite them. The default prompts are defined in OpenEvolve's prompt template files.
Create the following files and adapt the prompts to match your use case. Let's try a simple example for creating poems.
Initial placeholder content:
No initial poem, invent your own.
The initial prompt represents the "first generation" parent. It affects its offspring, the second-generation solutions.
For the initial content, you can provide an existing version or an empty placeholder text. You may also provide specific instructions, such as "Make sure it mentions cats," to guide the initial generation in a desired direction. If you need more general context for all generations, include it in the system prompt.
The system prompt:
You are a Shakespeare level poem writer, turning content into beautiful poetry and improving it further and further.
The system prompt just sets the general context for your generator model so it knows what your use case is all about. In this example, we are not creating code, we are writing poems.
User prompt for content generation:
# Current Solution Information
- Current performance metrics: {metrics}
- Areas identified for improvement: {improvement_areas}
{artifacts}
# Evolution History
{evolution_history}
# Current Solution
```
{current_program}
```
# Task
Suggest improvements to the answer that will lead to better performance on the specified metrics.
You MUST use the exact SEARCH/REPLACE diff format shown below to indicate changes:
<<<<<<< SEARCH
# Original text to find and replace (must match exactly)
=======
# New replacement text
>>>>>>> REPLACE
Example of valid diff format:
<<<<<<< SEARCH
poem stub
=======
Tyger Tyger, burning bright, In the forests of the night; What immortal hand or eye
>>>>>>> REPLACE
You can suggest multiple changes. Each SEARCH section must exactly match text in the current solution. If the solution is a blank placeholder, make sure to respond with exactly one diff replacement -- searching for the current placeholder string, replacing it with your initial solution.
The content generation user prompt is quite general. It contains several placeholders that will be replaced with content from the solution database, including the evaluation results of the parent program. This prompt illustrates how the evolution process influences the generation of new solutions.
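As a rough illustration of what that substitution amounts to (OpenEvolve does this internally; the template file name and example values below are made up):

# Rough illustration of the placeholder substitution; not OpenEvolve's actual code.
with open("examples/my_project/prompts/diff_user.txt") as f:  # hypothetical template file name
    template = f.read()

user_prompt = template.format(
    metrics="beauty: 0.5, emotion: 0.7, length_good: 1.0",
    improvement_areas="Increase the number of lines.",
    artifacts="length_recommendation: Increase the number of lines.",
    evolution_history="(summary of previous attempts and top solutions)",
    current_program="No initial poem, invent your own.",
)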
User prompt for content generation without the diff method:
# Current Solution Information
- Current metrics: {metrics}
- Areas identified for improvement: {improvement_areas}
{artifacts}
# Evolution History
{evolution_history}
# Current Solution
```
{current_program}
```
# Task
Rewrite the answer to improve its performance on the specified metrics.
Provide the complete new answer. Do not add reasoning, changelog or comments after the answer!
# Your rewritten answer here
Prompt fragment for the evolution history (evolution_history.txt):
## Previous Attempts
{previous_attempts}
## Top Performing Solution
{top_programs}
Prompt fragment for the top programs (top_programs.txt):
### Solution {program_number} (Score: {score})
```
{program_snippet}
```
Key features: {key_features}
System prompt for the evaluator:
You are a Shakespeare level poem writer and are being asked to review someone else's work.
This system prompt for the evaluator models is essentially the same as the system prompt for the generator LLM.
User prompt for the evaluator:
Evaluate the following poem:
1. Beauty: Is it beautiful?
2. Inspiring: Is its message inspired and meaningful?
3. Emotion: Does the poem trigger an emotional response?
4. Creativity: Is it creative?
5. Syntax: Is its syntax good? Is it only a poem or does it also contain non-poem content (if yes, rate as 0)? Are its lines overly long (if yes, rate low)?
6. Overall score: Give an overall score. If the Poem, Syntax or Length evaluation was not okay, give a bad overall score.
For each metric, provide a score between 0.0 and 1.0, where 1.0 is best.
Answer to evaluate:
```
{current_program}
```
Return your evaluation as a JSON object with the following format:
{{
"beauty": score1,
"inspiring": score2,
"emotion": score3,
"creativity": score4,
"syntax": score5,
"overall_score": score6,
"improvement_suggestion": "..",
}}
Even for invalid input, return nothing but the JSON object.
This is where the magic happens. In this prompt, you need to think of metrics that represent what you are optimizing for. What determines whether the content is good or bad? Correctness? Humor? Writing skill? Decide what is important to you, and encode it properly. This may take some experimentation before you see the evolution converge the way you intended. Play around as you observe the evolution of your content (more on that below).
Be careful: every metric is weighted equally. The LLM metrics are multiplied by the llm_feedback_weight factor from your configuration. It is also a good idea to keep an overall_score metric that provides a summary of the big-picture evaluation. You can then sort the generated solutions by it.
The improvement_suggestion is a textual recommendation from the evaluator LLM. It will be stored together with the metrics in the database and provided to the generator LLM when this solution is used as a parent, as part of the {artifacts} placeholder you saw above. (Note: As of this writing, textual LLM feedback is still a pull request under review in the OpenEvolve codebase; be sure to use a version that supports it.)
The evaluator program
OpenEvolve was designed for code generation with algorithmic evaluators. Although it is difficult to write an algorithm that judges the beauty of a poem, we can design a useful algorithmic evaluation function for our content generation use case as well. For instance, we can define a metric that targets a particular number of lines or words.
Create a file examples/my_project/evaluator.py:
from openevolve.evaluation_result import EvaluationResult

def linear_feedback(actual, target):
    deviation = abs(actual - target) / target
    return 1 - min(1.0, deviation)

def evaluate_stage1(file_path):
    # Read in file_path
    with open(file_path, 'r') as file:
        content = file.read()

    # Count lines and words
    lines = content.splitlines()
    num_lines = len(lines)
    num_words = sum(len(line.split()) for line in lines)

    # Target length
    line_target = 5
    word_target = line_target * 7

    # Linear feedback between 0 (worst) and 1 (best)
    line_rating = linear_feedback(num_lines, line_target)
    word_rating = linear_feedback(num_words, word_target)
    combined_rating = (line_rating + word_rating) / 2

    # Create textual feedback
    length_comment_parts = []

    # Line count feedback
    line_ratio = num_lines / line_target
    if line_ratio > 1.2:
        length_comment_parts.append("Reduce the number of lines.")
    elif line_ratio < 0.8:
        length_comment_parts.append("Increase the number of lines.")
    else:
        length_comment_parts.append("Line count is just right.")

    # Words per line feedback
    words_per_line = num_words / num_lines if num_lines else 0
    target_words_per_line = word_target / line_target
    words_per_line_ratio = words_per_line / target_words_per_line
    if words_per_line_ratio > 1.2:
        length_comment_parts.append("Reduce the number of words per line.")
    elif words_per_line_ratio < 0.8:
        length_comment_parts.append("Increase the number of words per line.")

    length_comment = " ".join(length_comment_parts)

    return EvaluationResult(
        metrics={
            "length_good": combined_rating,
        },
        artifacts={
            "length_recommendation": length_comment,
        },
    )

def evaluate(file_path):
    return evaluate_stage1(file_path)
This code has two elements:
First, it creates a metric value that allows us to quantify the quality of the response length. If the response is too short or too long, the score is lower. If the response length is just right, the score reaches 1. For example, with 4 lines against a target of 5, the deviation is 0.2 and the line score is 0.8.
Second, this code prepares textual feedback that the LLM can intuitively understand, so it knows what to change without getting lured into a predetermined idea of what to do when the length is not right. For instance, it won't mistakenly think: "I need to write more.. and more..".
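Before running a full evolution, you can sanity-check the evaluator directly (a quick test sketch; the sample file and poem are made up, and it assumes evaluator.py from above is importable):

# Quick manual sanity check of the evaluator; the sample file and poem are made up.
from evaluator import evaluate  # assumes evaluator.py from above is on the path

sample = "sample_poem.txt"
with open(sample, "w") as f:
    f.write("Roses are red\nViolets are blue\nThis test poem\nis far too short\n")

result = evaluate(sample)
print(result.metrics)    # {'length_good': ...} -- well below 1.0, the lines are too short
print(result.artifacts)  # {'length_recommendation': 'Line count is just right. Increase the number of words per line.'}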
Data review: Evolution at play
Run the evolution process:
source .venv/bin/activate
export OPENAI_API_KEY="sk-.."
python3 openevolve-run.py \
  examples/my_project/initial_program.py \
  examples/my_project/evaluator.py \
  --config examples/my_project/config.yaml \
  --iterations 9
You should start with only a couple of iterations and analyze the results closely to ensure everything is functioning properly. To do so, start the visualization web server and observe in real time:
python3 scripts/visualizer.py
Or, if you have a specific past checkpoint that you wish to investigate, open it with:
python3 scripts/visualizer.py --path examples/content_writing/openevolve_output/checkpoints/checkpoint_2
When rerunning your tests after making improvements, be sure to move the existing checkpoint folders out of the way before starting over:
mkdir -p examples/my_project/archive
mv examples/my_project/openevolve_output/ examples/my_project/archive/

In the visualization front end, click the nodes to see the associated current solution text, as well as all of their metrics, prompts and LLM responses. You can also easily click through children in the sidebar. Use the yellow locator button if you get lost in the graph and can't see a node. By observing the prompts, you can trace how the evaluation response for a parent affects the generation user prompt of the child. (Note: As of this writing, prompt & response logging is still a pull request under review in the OpenEvolve codebase; be sure to use a version that supports it.)
If you are interested in comparing all solutions by a particular metric, select it from the top bar:

- The node colors represent the islands, in which evolution takes place largely separately (if you run it long enough!) and in different directions. Occasionally, depending on the migration parameters in the configuration, individuals from one island can be copied over into another.
- The size of each node indicates its performance on the currently selected metric.
- The edges in the visualization show which parent was modified to produce the child. This parent clearly has the strongest influence on the descendant.
In reality, the AlphaEvolve algorithm incorporates learnings from several previous programs in its prompting (configurable via num_top_programs). The generation prompt is augmented with a summary of previous changes and their influence on the resulting metrics. This "prompt crossover" is not visualized. Also not visualized are the relations of "clones": when a solution migrates to another island, it is copied with all of its data, including its ID. The copy shows up as an unlinked element in the graph.
In any case, the best solution will be saved in the openevolve_output/best folder. Here is one example result:
In silken moonlight, where night's veil is lifted,
A constellation of dreams is gently shifted,
The heart, a canvas, painted with vibrant hues,
A symphony of feelings, in tender Muse.
Can I…
- ..use my own start prompt?
Yes! Just put the solution you already have in your initial_content.txt.
- ..not create my own start prompt?
Yes! Just put a placeholder like "No initial poem, invent your own." in your initial content file.
- ..not write any code?
Yes! If you don't want an algorithmic evaluator, put a stub in your evaluator.py like this:
def evaluate_stage1(file_path):
    return {}

def evaluate(file_path):
    return evaluate_stage1(file_path)
- ..use a local or non-OpenAI LLM?
Yes, as long as it is compatible with the OpenAI API! In your config.yaml, change the api_base to a value like "http://localhost:11434/v1/" for a default ollama configuration. On the command line, set your API key before calling the Python program:
export OPENAI_API_KEY="ollama"
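For reference, the llm section of such a local setup might look roughly like this (the model name is a placeholder for whatever you have pulled into ollama):

llm:
  api_base: "http://localhost:11434/v1/"  # default ollama endpoint
  models:
    - name: "llama3.1:8b"                 # placeholder: any local generator model
      weight: 1.0
      temperature: 1.2
  evaluator_models:
    - name: "llama3.1:8b"                 # placeholder: any local evaluator model
      weight: 1.0
      temperature: 0.7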
Final thought
This article described an experiment with using LLM feedback in the context of evolutionary algorithms. I wanted to enable and explore this use case, because the AlphaEvolve paper itself hinted at it, and mentioned that the authors had not optimized for it yet. This is only the beginning. The right use cases, where this comparatively high effort for content generation is worth it, still have to be found, and more experiments need to follow. Hopefully, all of this will become easier to use in the future.
Real-life results: In practice, I find that improvements across all metrics are observable up to a certain point. However, it is difficult to obtain good numeric metrics from an LLM, because its ratings are not fine-grained and therefore quickly plateau. Better prompts, especially for the evaluator, could possibly improve upon this. Either way, the combination of algorithmic and LLM evaluation with a powerful evolutionary algorithm and many configuration options makes the overall approach very effective.
To generate more exciting LLM metrics that justify the long-running evolution, multi-stage LLM evaluator pipelines could be incorporated. These pipelines could summarize content and check for the presence of certain facts, among other things. By calling these pipelines from the evaluator.py file, this is possible right away within OpenEvolve.
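As a sketch of that idea (assuming the openai Python package and an OpenAI-compatible server; the model name and base URL are taken from the example configuration above and are easily swapped out):

# Sketch: an extra LLM-based check callable from evaluator.py.
# Assumes the `openai` package and an OpenAI-compatible server; the model name is the
# one from the example config and can be replaced with any model your server offers.
import os
from openai import OpenAI

client = OpenAI(base_url="https://api.cerebras.ai/v1/",
                api_key=os.environ["OPENAI_API_KEY"])

def fact_presence_metric(content: str) -> float:
    """Ask an LLM whether the poem mentions the sea; return 1.0 if yes, else 0.0."""
    response = client.chat.completions.create(
        model="llama-4-scout-17b-16e-instruct",
        messages=[{
            "role": "user",
            "content": f"Does the following poem mention the sea? Answer only yes or no.\n\n{content}",
        }],
    )
    answer = response.choices[0].message.content.strip().lower()
    return 1.0 if answer.startswith("yes") else 0.0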
With knowledge bases and tools, the capabilities of such evolutionary systems that incorporate LLM feedback can be extended further. An exciting future addition for OpenEvolve could be support for MCP servers, but again, in the evaluator.py file you can already make use of these to generate feedback.
This whole approach could also be applied with multi-modal LLMs or a separate backend LLM that generates the actual content in a different modality and is prompted by the evolutionary system. Existing MCP servers could generate images, audio and more. As long as we have an LLM suitable for evaluating the result, we can then refine the prompt to generate new, improved offspring.
In summary, there are many more experiments within this exciting framework waiting to be done. I look forward to your responses and am eager to see what comes out of this. Thanks for reading!
References
- Asankhaya Sharma, OpenEvolve: Open-source implementation of AlphaEvolve (2025), GitHub
- Novikov et al., AlphaEvolve: A Gemini-Powered Coding Agent for Designing Advanced Algorithms (2025), Google DeepMind