My name is Nikolay Nikitin, PhD. I'm the Research Lead at the AI Institute of ITMO University and an open-source enthusiast. I often see colleagues failing to find the time and energy to create open repositories for their research papers and to make sure those repositories are of proper quality. In this article, I'll discuss how we can help solve this problem using OSA, an AI tool developed by our team that helps a repository become a better version of itself. If you're maintaining or contributing to open source, this post will save you time and effort: you'll learn how OSA can automatically improve your repo by adding a proper README, generating documentation, setting up CI/CD scripts, and even summarizing the key strengths and weaknesses of the project.
There are many documentation improvement tools. However, they focus on individual components of repository documentation. For example, the Readme-AI tool generates a README file, but it doesn't account for extra context, which is important, for example, for repositories accompanying scientific articles. Another tool, RepoAgent, generates complete documentation for the repository code, but not a README or CI/CD scripts. In contrast, OSA considers the repository holistically, aiming to make it easier to understand and ready to run. The tool was originally made for our colleagues in research, including biologists and chemists, who often lack experience in software engineering and modern development practices. The main aim was to help them make a repository more readable and reproducible in a few clicks. But OSA can be used on any repository, not just scientific ones.
Why is it needed?
Scientific open source faces challenges with the reuse of research results. Even when code is shared with scientific papers, it is rarely usable or complete. This code is generally difficult to read; there is no documentation for it, and sometimes even a basic README is missing, because the developer intended to write it at the last moment but didn't have time. Libraries and frameworks often lack basic CI/CD settings such as linters, automated tests, and other quality checks. As a result, it is often impossible to reproduce the algorithm described in the article. And this is a big problem, because when someone publishes their research, they do it with a desire to share it with the community.
But this problem isn't limited to science. Experienced developers also often put off writing READMEs and documentation for long periods. And if a project has dozens of repositories, maintaining and using them can be complicated.
Ideally, each repository should be easy to run and user-friendly. Yet published code often lacks essential elements such as a clear README file or proper docstrings, which could be compiled into full documentation using standard tools like MkDocs.
Based on our experience and analysis of the problem, we tried to propose a solution and implemented it as the Open Source Advisor tool, or OSA.
What’s the OSA tool?
OSA is an open-source Python library that leverages LLM agents to enhance open-source repositories and make them easier to reuse.
The tool is a package that runs via a command-line interface (CLI). It can also be deployed locally using Docker. By specifying an API key for your chosen LLM, you can interact with the tool via the console. You can also try OSA via the public web GUI. Here is a short introduction to the main ideas of repository improvement with OSA:
How does OSA work?
The Open Source Advisor (OSA) is a multi-agent tool that helps improve the structure and usability of scientific repositories in an automated way. It addresses common issues in research projects by handling tasks such as generating documentation (README files, code docstrings), creating essential files (licenses and requirements), and suggesting practical improvements to the repository. Users simply provide a repository link and can either receive an automatically generated Pull Request (PR) with all recommended changes or review the suggestions locally before applying them.
OSA can be used in two ways: by cloning the repository and running it through a command-line interface (CLI), or via a web interface. It also offers three working modes: basic, automatic, and advanced, which are chosen at runtime to fit different needs. In basic mode, OSA applies a small set of standard improvements with no extra input: it generates a report, README, community documentation, and an About section, and adds common folders like "tests" and "examples" if they are missing. Advanced mode gives users full manual control over every step. In automatic mode, OSA uses an LLM to analyze the repository structure and the existing README, then proposes a list of improvements for users to approve or reject. An experimental multi-agent conversational mode is also being developed, allowing users to specify desired improvements in free-form natural language via the CLI. OSA interprets the request and applies the corresponding changes. This mode is currently under active development.
Another key strength of OSA is its flexibility with language models. It works with popular providers like OpenRouter and OpenAI, as well as local models such as Ollama and self-hosted LLMs served via FastAPI.
OSA also supports multiple repository platforms, including GitHub and GitLab (both GitLab.com and self-hosted instances). It can adjust CI/CD configuration files, set up documentation deployment workflows, and correctly configure paths for community documentation.
Under the hood, OSA is built around an experimental multi-agent system (MAS), currently under active development, that serves as the basis for its automatic and conversational modes. The system decomposes repository improvement into a sequence of reasoning and execution stages, each handled by a specialized agent. Agents communicate via a shared state and are coordinated through a directed state graph, enabling conditional transitions and iterative workflows.
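The concrete agents and graph are internal to OSA, but the coordination pattern itself is easy to illustrate. Below is a minimal, library-free sketch in which the agent names and the toy state are purely hypothetical: each agent reads and updates a shared state dictionary, and a directed graph with conditional edges decides which agent runs next, allowing iterative loops.

from typing import Callable

# Hypothetical agents: each one reads and updates the shared state dict.
def analyze(state: dict) -> dict:
    state["plan"] = ["readme", "docstrings"] if not state.get("has_readme") else ["docstrings"]
    return state

def generate(state: dict) -> dict:
    state["done"] = state.get("done", []) + [state["plan"].pop(0)]
    return state

def review(state: dict) -> dict:
    state["approved"] = len(state["plan"]) == 0  # loop until the plan is exhausted
    return state

# Directed state graph: nodes are agents, edges are (possibly conditional) transitions.
NODES: dict[str, Callable[[dict], dict]] = {"analyze": analyze, "generate": generate, "review": review}
EDGES = {
    "analyze": lambda s: "generate",
    "generate": lambda s: "review",
    "review": lambda s: "end" if s["approved"] else "generate",  # conditional, iterative edge
}

state, node = {"has_readme": False}, "analyze"
while node != "end":
    state = NODES[node](state)
    node = EDGES[node](state)
print(state["done"])  # ['readme', 'docstrings']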
README generation
OSA features a README generation tool that automatically creates clear and useful README files in two formats: a standard README and an article-style README. The tool decides which format to use on its own; for example, if the user provides a path or URL to a scientific paper through the CLI, OSA switches to the article format. To begin, it scans the repository to find the most important files, focusing on core logic and project descriptions, and takes into account the folder structure and any existing README.
For the standard README, OSA analyzes the key project files, repository structure, metadata, and the main sections of an existing README if one is present. It then generates a "Core Features" section that serves as the foundation for the rest of the document. Using this information, OSA writes a clear project overview and adds a "Getting Started" section when example scripts or demo files are available, helping users quickly understand how to use the project.
In article mode, the tool creates a summary of the associated scientific paper and extracts relevant information from the main code files. These pieces are combined into an Overview that explains the project goals, a Content section that describes the main components and how they work together, and an Algorithms section that explains how the implemented methods fit into the research. This approach keeps the documentation scientifically accurate while making it easier to read and understand.
Documentation generation
The documentation generation tool produces concise, context-aware documentation for functions, methods, classes, and code modules. The process works as follows:
(1) Reference parsing: First, a TreeSitter-driven parser collects the imported modules and resolves their paths for each source code file, forming an import map that is later used to identify calls to methods and functions from foreign modules. With this approach, it is relatively easy to trace the interconnections between different parts of the processed project and to resolve internal aliases. Along with the import maps, the parser also stores general information such as the file being processed, the list of classes it contains, and its standalone functions. Each class entry contains the class name, attribute list, decorators, docstring, and list of methods; each method is described with the same structure as a standalone function: name, docstring, return type, source code, and alias-resolved foreign method calls with the name of the imported module, class, and method, and the path to it.
(2) Initial docstring generation for functions, methods, and classes: Once the parsed structure is in place, the initial docstring generation stage begins. Only classes, methods, and functions that lack docstrings are processed at this stage. The result is a general description of "what" each method does. The context is mostly the method's source code, since at this point forming a general description of the functionality is what matters. The prompt also includes information about the method's arguments and decorators, and ends with the source code of the called foreign methods to provide additional context about how the method is used. A neat detail here is that class docstrings are generated only after all of their docstring-lacking methods have been processed; the class attributes, method names, and freshly generated method docstrings are then provided to the model.
(3) Generation of the "main idea" of the project using the component descriptions derived from the previous stage.
(4) Docstring update using the generated "main idea": Once all docstrings for the project are (at least provisionally) present, the main idea of the project can be generated. Essentially, the prompt for the main idea consists of the docstrings of all classes and functions, together with an importance score based on how often each component occurs in the import maps mentioned above, and their place in the project hierarchy determined by the source path. The model's response is returned in Markdown format, summarizing the project's components. Once the main idea is acquired, the second stage of docstring generation begins, in which all of the project's source code components are processed. The key focus at this point is on providing the model with the original (or initially generated) docstring together with the main idea, so it can elaborate on "why" the component is needed in the project. The methods' source code is also provided, since the expanded project narrative may prompt the model to correct some points in the original docstring.
(5) Hierarchical module description generation, from the bottom of the project hierarchy to the top.
(6) Using MkDocs and GitHub Pages for automated documentation building and publishing: The final stage of the docstring pipeline is a recursive traversal across the project's modules and submodules. The hierarchy is based on the source paths; at each leaf level, the previously parsed structure is used to describe how each submodule is used, in line with the main idea. As processing moves to higher levels of the hierarchy, the generated submodule summaries are also used to provide additional context. The model returns summaries in Markdown to ensure seamless integration with the MkDocs documentation generation pipeline. The complete schema of the approach is shown in the image below; a simplified code sketch of stages (1) and (4) follows it.

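To make stages (1) and (4) more concrete, here is a small self-contained sketch of the underlying idea. OSA's parser is built on TreeSitter and records far more detail (aliases, decorators, and resolved foreign calls); this sketch instead uses Python's built-in ast module and only shows how an import map and occurrence-based importance scores could be assembled. The "my_project" path is a placeholder.

import ast
from collections import Counter
from pathlib import Path

def parse_module(path: Path) -> dict:
    """Collect imports, top-level classes, and functions from one source file."""
    tree = ast.parse(path.read_text(encoding="utf-8"))
    imports = []
    for node in ast.walk(tree):  # imports may appear anywhere in the file
        if isinstance(node, ast.Import):
            imports.extend(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            imports.append(node.module)
    classes, functions = [], []
    for node in tree.body:  # top-level definitions only
        if isinstance(node, ast.ClassDef):
            classes.append({
                "name": node.name,
                "docstring": ast.get_docstring(node),
                "methods": [n.name for n in node.body
                            if isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef))],
            })
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            functions.append({"name": node.name, "docstring": ast.get_docstring(node)})
    return {"file": str(path), "imports": imports, "classes": classes, "functions": functions}

def build_import_map(project_root: str):
    """Parse every module and count how often each imported name occurs.

    The occurrence counts act as a rough importance score, mirroring the idea
    of ranking components by how widely they are reused across the project."""
    parsed = [parse_module(p) for p in Path(project_root).rglob("*.py")]
    importance = Counter(name for info in parsed for name in info["imports"])
    return parsed, importance

parsed, importance = build_import_map("my_project")  # placeholder project path
for module, count in importance.most_common(5):
    print(f"{module}: imported {count} times")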
CI/CD and structure organization
OSA offers an automated CI/CD setup that works across different repository hosting platforms. It generates configurable workflows that make it easier to run tests, check code quality, and deploy projects. The tool supports common utilities such as Black for code formatting, unit_test for running tests, PEP8 and autopep8 for style checks, fix_pep8 for automatic style fixes, pypi_publish for publishing packages, and slash_command_dispatch for handling commands. Depending on the platform, these workflows are placed in the appropriate locations, for example .github/workflows/ for GitHub or a .gitlab-ci.yml file in the repository root for GitLab.
Users can customize the generated workflows using options like --use-poetry to enable Poetry for dependency management, --branches to define which branches trigger the workflows (by default, main and master), and code coverage settings via --codecov-token and --include-codecov.
To ensure reliable testing, OSA also reorganizes the repository structure. It identifies test and example files and moves them into standardized tests and examples directories, allowing CI workflows to run tests consistently without additional configuration.
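As an illustration of what this reorganization amounts to (a simplified sketch, not OSA's actual logic), test files can be detected by naming convention and moved into a standardized tests directory:

import shutil
from pathlib import Path

def collect_test_files(root: str) -> list[Path]:
    """Find likely test files scattered around the project, by naming convention."""
    return [p for p in Path(root).rglob("*.py")
            if (p.name.startswith("test_") or p.name.endswith("_test.py"))
            and "tests" not in p.parts]

def move_into_tests_dir(root: str) -> None:
    """Move the detected test files into a standardized tests/ directory."""
    tests_dir = Path(root) / "tests"
    tests_dir.mkdir(parents=True, exist_ok=True)
    for src in collect_test_files(root):
        shutil.move(str(src), str(tests_dir / src.name))

move_into_tests_dir("my_project")  # placeholder project path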
Workflow files are created from templates that combine project-specific information with user-defined settings. This approach keeps workflows consistent across projects while still allowing flexibility when needed.
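A deliberately simplified sketch of such a template mechanism is shown below; it assumes a trimmed-down GitHub Actions skeleton, while OSA's real templates are more complete and cover the utilities listed above.

from pathlib import Path
from string import Template

# Simplified workflow skeleton; placeholders are filled from user settings.
WORKFLOW_TEMPLATE = Template("""\
name: Tests
on:
  push:
    branches: [$branches]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "$python_version"
      - run: $install_cmd
      - run: pytest
""")

def render_workflow(branches: list[str], python_version: str = "3.10", use_poetry: bool = False) -> str:
    install_cmd = "poetry install" if use_poetry else "pip install -e ."
    return WORKFLOW_TEMPLATE.substitute(branches=", ".join(branches),
                                        python_version=python_version,
                                        install_cmd=install_cmd)

# Write the rendered workflow where GitHub expects it.
out = Path(".github/workflows/tests.yml")
out.parent.mkdir(parents=True, exist_ok=True)
out.write_text(render_workflow(branches=["main", "master"]))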
OSA also automates documentation deployment using MkDocs. For GitHub repositories, it generates a YAML workflow in the .github/workflows directory; you need to enable read/write workflow permissions and select the gh-pages branch for deployment in the repository settings. For GitLab, OSA creates or updates the .gitlab-ci.yml file to include build and deployment jobs with Docker images, scripts, and artifact retention rules. Documentation is then automatically published when changes are merged into the main branch.
How to use OSA
To start using OSA, choose a repository with draft code that is incomplete or under-documented. Optionally, include a related scientific paper or another document describing the library or algorithm implemented in the chosen repo. The paper is uploaded as a separate file and used to generate the README. You can also specify the LLM provider (e.g., OpenAI) and the model name (such as GPT-4o).
OSA generates recommendations for improving the repository, including:
- A README file generated from code analysis, using standard templates and examples
- Docstrings for classes and methods that currently lack them, to enable automatic documentation generation with MkDocs
- Basic CI/CD scripts, including linters and automated tests
- A report with actionable recommendations for improving the repository
- Contribution guidelines and files (Code of Conduct, pull request and issue templates, etc.)
You can install OSA simply by running:
pip install osa_tool
After setting up the environment, you should choose an LLM provider (such as OpenAI or a local model). Next, add GIT_TOKEN (a GitHub token with standard repo permissions) and OPENAI_API_KEY (if you use an OpenAI-compatible API) as environment variables, or store them in a .env file. Finally, you can launch OSA directly from the command line. OSA is designed to work with an existing open-source repository specified by its URL. The basic launch command includes the repository address and optional parameters such as the operation mode, API endpoint, and model name:
osa_tool -r {repository} [--mode {mode}] [--api {api}] [--base-url {base_url}] [--model {model_name}]
OSA supports three operating modes:
- auto – analyzes the repository and creates a customized improvement plan using a specialized LLM agent.
- basic – applies a predefined set of improvements: generates a project report, README, community guidelines, and an "About" section, and creates standard directories for tests and examples (if they are missing).
- advanced – allows manual selection and configuration of actions before execution.
Additional CLI options are listed here. You can customize OSA by passing these options as arguments to the CLI, or by selecting the desired features in the interactive command-line mode.

Once launched, OSA performs an initial evaluation of the repository and displays key information: general project details, the current environment configuration, and tables with planned and inactive actions. The user is then prompted to either accept the suggested plan, cancel the operation, or enter an interactive editing mode.
In interactive mode, the plan can be modified: actions can be toggled on or off, parameters (strings and lists) adjusted, and additional options configured. The system guides the user through each action's description, possible values, and current settings. This process continues until the user confirms the final plan.
This CLI-based workflow ensures flexibility, from fully automated processing to precise manual control, making it suitable for both rapid initial assessments and detailed project refinement.
OSA also includes an experimental conversational interaction mode that lets users specify desired repository improvements in free-form natural language via the CLI. If a request is ambiguous or insufficiently related to repository processing, the system iteratively asks for clarification and allows the attached supplementary file to be updated. Once a valid instruction is obtained, OSA analyzes the repository, selects the appropriate internal modules, and executes the corresponding actions. This mode is currently under active development.
When OSA finishes, it creates a pull request (PR) in the repository. The PR includes all proposed changes, such as the README, docstrings, documentation page, CI/CD scripts, contribution guidelines, report, and more. The user can easily review the PR, make changes if needed, and merge it into the project's main branch.
Let's look at an example. GAN-MFS is a repository that provides a PyTorch implementation of Wasserstein GAN with Gradient Penalty (WGAN-GP). Here is an example command to launch OSA on this repo:
osa_tool -r github.com/Roman223/GAN_MFS --mode auto --api openai --base-url https://api.openai.com/v1 --model gpt-4.1-mini
OSA made several contributions to the repository, including a README file generated from the paper’s content.


OSA also added a License file to the pull request, as well as some basic CI/CD scripts.

OSA added docstrings to all classes and methods where documentation was missing. It also generated a structured, web-based documentation site using those docstrings.

The generated report includes an audit of the repository's key components: README, license, documentation, usage examples, tests, and a project summary. It also analyzes key sections of the repository, such as the structure, README, and documentation. Based on this evaluation, the system identifies key areas for improvement and provides targeted suggestions.

Finally, OSA interacts with the target repository via GitHub. The OSA bot creates a fork of the repository and opens a pull request that includes all proposed changes. The developer only needs to review the suggestions and adjust anything that seems incorrect. In my view, this is much easier than writing the same README from scratch. After review, the repository maintainer successfully merged the pull request. All changes proposed by OSA are available here.
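For readers curious what this step amounts to on the GitHub side, here is a rough sketch using the PyGithub library; OSA may use a different client internally, and the branch name and PR text below are made up for illustration.

import os
from github import Github  # pip install PyGithub

gh = Github(os.environ["GIT_TOKEN"])           # the token mentioned in the setup section
upstream = gh.get_repo("Roman223/GAN_MFS")
fork = upstream.create_fork()                  # fork the target repo under the bot's account
# ...the generated changes are committed to a branch of the fork, then a PR is opened:
pr = upstream.create_pull(
    title="OSA: repository improvements",                        # hypothetical title
    body="README, docstrings, CI/CD scripts and report generated by OSA",
    head=f"{fork.owner.login}:osa-improvements",                  # hypothetical branch name
    base=upstream.default_branch,
)
print(pr.html_url)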

Although the number of changes introduced by OSA is significant, it is difficult to assess the overall improvement in repository quality. To do this, we decided to examine the repository from a security perspective. The Scorecard tool allows us to evaluate a repository with an aggregated metric. Scorecard was created to help open-source maintainers improve their security best practices and to help open-source consumers judge whether their dependencies are safe. The aggregate score takes into account many repository parameters, including the presence of binary artifacts, CI/CD tests, the number of contributors, and a license. The aggregated score of the original repository was 2.2/10. After processing by OSA, it rose to 3.7/10, thanks to the addition of a license and CI/CD scripts. This score may seem low, but the repository in question isn't intended for integration into large projects. It's a small tool for generating synthetic data based on a scientific article, so its security requirements are lower.
What’s Next for OSA?
We plan to integrate a RAG system into OSA, based on best practices in open-source development. OSA will compare the target repository with reference examples to identify missing components. For example, if the repository already has a high-quality README, it won't be regenerated. Initially, we used OSA for Python repositories, but we plan to support additional programming languages in the future.
If you have an open repository that needs improvement, give OSA a try! We'd also appreciate ideas for new features, which you can submit as issues and PRs.
If you would like to use OSA in your work, it can be cited as:
Nikitin N. et al. An LLM-Powered Tool for Enhancing Scientific Open-Source Repositories // Championing Open-source DEvelopment in ML Workshop @ ICML 2025.
