Learn how to create an agent that understands your home’s context, learns your preferences, and interacts with you and your home to perform activities you find valuable.
This article describes the architecture and design of a Home Assistant (HA) integration called home-generative-agent. This project uses LangChain and LangGraph to create a generative AI agent that interacts with and automates tasks inside an HA smart home environment. The agent understands your home’s context, learns your preferences, and interacts with you and your home to perform activities you find valuable. Key features include creating automations, analyzing images, and managing home states using various LLMs (Large Language Models). The architecture involves both cloud-based and edge-based models for optimal performance and cost-effectiveness. Installation instructions, configuration details, and information about the project’s architecture and the different models used can be found in the home-generative-agent GitHub repository. The project is open-source and welcomes contributions.
These are some of the features currently supported:
- Create complex Home Assistant automations.
- Image scene analysis and understanding.
- Home state analysis of entities, devices, and areas.
- Full agent control of allowed entities in the house.
- Short- and long-term memory using semantic search.
- Automatic summarization of home state to manage LLM context length.
This is my personal project and an example of what I call learning-directed hacking. The project is not affiliated with my work at Amazon, nor am I associated with the organizations responsible for Home Assistant or LangChain/LangGraph in any way.
Creating an agent to monitor and control your home can lead to unexpected actions and potentially put your home and yourself at risk due to LLM hallucinations and privacy concerns, especially when exposing home states and user information to cloud-based LLMs. I have made reasonable architectural and design decisions to mitigate these risks, but they cannot be completely eliminated.
One key early decision was to rely on a hybrid cloud-edge approach. This allows the use of the most sophisticated reasoning and planning models available, which should help reduce hallucinations. Simpler, more task-focused edge models are employed to further minimize LLM errors.
Another critical decision was to leverage LangChain’s capabilities, which allow sensitive information to be hidden from LLM tools and provided only at runtime. For example, tool logic may require the ID of the user who made a request. However, such values should generally not be controlled by the LLM. Allowing the LLM to manipulate the user ID could pose security and privacy risks. To mitigate this, I used the InjectedToolArg annotation, as illustrated in the sketch below.
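Here is a minimal sketch (not taken from the project) of the pattern: an argument annotated with InjectedToolArg is hidden from the tool schema the model sees and must be supplied by the application at invocation time. The list_user_reminders tool and its arguments are hypothetical.

from typing import Annotated

from langchain_core.tools import InjectedToolArg, tool


@tool
def list_user_reminders(
    date: str,
    # Hidden from the model; the calling code injects it at runtime.
    user_id: Annotated[str, InjectedToolArg],
) -> str:
    """List reminders for a given date."""
    return f"Reminders for user {user_id} on {date}"


# The LLM only generates {"date": ...}; the application merges in user_id
# before invoking the tool.
result = list_user_reminders.invoke({"date": "2025-01-01", "user_id": "u-123"})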
Moreover, using large cloud-based LLMs incurs significant cloud costs, and the hardware required to run LLM edge models can be expensive. The combined operational and installation costs are likely prohibitive for the average user right now. An industry-wide effort to “make LLMs as cheap as CNNs” is needed to bring home agents to the mass market.
It is important to be aware of these risks and understand that, despite these mitigations, we are still in the early stages of this project and of home agents in general. Significant work remains to make these agents truly useful and trustworthy assistants.
Below is a high-level view of the home-generative-agent architecture.
The overall integration architecture follows the best practices described in Home Assistant Core and is compliant with Home Assistant Community Store (HACS) publishing requirements.
The agent is built using LangGraph and uses the HA conversation component to interact with the user. The agent uses the Home Assistant LLM API to fetch the state of the home and learn the HA native tools it has at its disposal. I implemented all other tools available to the agent using LangChain. The agent employs several LLMs: a large and very accurate primary model for high-level reasoning, and smaller, specialized helper models for camera image analysis, primary model context summarization, and embedding generation for long-term semantic search. The primary model is cloud-based, and the helper models are edge-based and run under the Ollama framework on a computer located in the home.
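As a rough illustration of this cloud-edge split, the snippet below shows how a cloud primary model and Ollama-hosted edge helper models might be instantiated with LangChain. The model names, server address, and parameters are illustrative assumptions, not the project’s actual configuration.

from langchain_openai import ChatOpenAI
from langchain_ollama import ChatOllama, OllamaEmbeddings

OLLAMA_URL = "http://192.168.1.50:11434"  # edge server on the home LAN (example address)

# Cloud-based primary model for high-level reasoning and planning.
primary_model = ChatOpenAI(model="gpt-4o")

# Edge-based helper models served by Ollama.
vlm_model = ChatOllama(model="llama3.2-vision", base_url=OLLAMA_URL)       # camera image analysis
summarization_model = ChatOllama(model="qwen2.5", base_url=OLLAMA_URL)     # context summarization
embedding_model = OllamaEmbeddings(model="mxbai-embed-large", base_url=OLLAMA_URL)  # semantic search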
The models currently being used are summarized below.
LangGraph-based Agent
LangGraph powers the conversation agent, enabling you to create stateful, multi-actor applications that use LLMs. It extends LangChain’s capabilities, introducing the ability to create and manage the cyclical graphs essential for developing complex agent runtimes. A graph models the agent workflow, as seen in the image below.
The agent workflow has five nodes, each a Python module that modifies the agent’s state, a shared data structure. The edges between the nodes represent the allowed transitions between them, with solid lines unconditional and dashed lines conditional. Nodes do the work, and edges tell what to do next.
The __start__ and __end__ nodes tell the graph where to begin and end. The agent node runs the primary LLM, and if it decides to use a tool, the action node runs the tool and then returns control to the agent. The summarize_and_trim node processes the LLM’s context to manage growth while maintaining accuracy; it runs when the agent has no tool to call and the number of messages meets the conditions described below.
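The snippet below is a simplified sketch of how such a workflow could be assembled with LangGraph; the node functions and routing helper shown here are assumed names, not the project’s exact implementation.

from langgraph.graph import END, START, StateGraph

workflow = StateGraph(State)

workflow.add_node("agent", _call_model)                       # runs the primary LLM
workflow.add_node("action", _call_tools)                      # executes requested tools
workflow.add_node("summarize_and_trim", _summarize_and_trim)  # manages context growth

workflow.add_edge(START, "agent")
# Conditional edges from the agent: call a tool, summarize the context, or finish.
workflow.add_conditional_edges(
    "agent",
    _should_continue,
    {"action": "action", "summarize_and_trim": "summarize_and_trim", "end": END},
)
workflow.add_edge("action", "agent")            # tool results return to the agent
workflow.add_edge("summarize_and_trim", END)

graph = workflow.compile()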
LLM Context Management
You need to carefully manage the context length of LLMs to balance cost, accuracy, and latency, and to avoid triggering rate limits such as OpenAI’s Tokens per Minute restriction. The system controls the context length of the primary model in two ways: it trims the messages in the context if they exceed a maximum parameter, and the context is summarized once the number of messages exceeds another parameter. These parameters are configurable in const.py; their descriptions are below, followed by a sketch of the routing logic that uses them.
- CONTEXT_MAX_MESSAGES | Messages to keep in context before deletion | Default = 100
- CONTEXT_SUMMARIZE_THRESHOLD | Messages in context before summary generation | Default = 20
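For illustration, a conditional edge along the lines of the sketch below could use these parameters to decide whether the graph should run a tool, summarize and trim the context, or end the turn. The function name and exact checks are assumptions, not the project’s actual code.

def _should_continue(state: State) -> str:
    """Route from the agent node to the action node, the summarizer, or the end."""
    last_message = state["messages"][-1]
    # If the primary model requested a tool, run it first.
    if getattr(last_message, "tool_calls", None):
        return "action"
    # Otherwise summarize (and then trim) once the context has grown too large.
    if len(state["messages"]) > CONTEXT_SUMMARIZE_THRESHOLD:
        return "summarize_and_trim"
    return "end"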
The summarize_and_trim node in the graph trims the messages only after content summarization. You can see the Python code related to this node in the snippet below.
async def _summarize_and_trim(
    state: State, config: RunnableConfig, *, store: BaseStore
) -> dict[str, list[AnyMessage]]:
    """Coroutine to summarize and trim message history."""
    summary = state.get("summary", "")

    if summary:
        summary_message = SUMMARY_PROMPT_TEMPLATE.format(summary=summary)
    else:
        summary_message = SUMMARY_INITIAL_PROMPT

    messages = (
        [SystemMessage(content=SUMMARY_SYSTEM_PROMPT)] +
        state["messages"] +
        [HumanMessage(content=summary_message)]
    )

    model = config["configurable"]["vlm_model"]
    options = config["configurable"]["options"]
    model_with_config = model.with_config(
        config={
            "model": options.get(
                CONF_VLM,
                RECOMMENDED_VLM,
            ),
            "temperature": options.get(
                CONF_SUMMARIZATION_MODEL_TEMPERATURE,
                RECOMMENDED_SUMMARIZATION_MODEL_TEMPERATURE,
            ),
            "top_p": options.get(
                CONF_SUMMARIZATION_MODEL_TOP_P,
                RECOMMENDED_SUMMARIZATION_MODEL_TOP_P,
            ),
            "num_predict": VLM_NUM_PREDICT,
        }
    )

    LOGGER.debug("Summary messages: %s", messages)
    response = await model_with_config.ainvoke(messages)

    # Trim message history to manage context window length.
    # Note: token_counter=len counts messages rather than tokens, so
    # max_tokens here acts as a maximum message count.
    trimmed_messages = trim_messages(
        messages=state["messages"],
        token_counter=len,
        max_tokens=CONTEXT_MAX_MESSAGES,
        strategy="last",
        start_on="human",
        include_system=True,
    )
    messages_to_remove = [m for m in state["messages"] if m not in trimmed_messages]
    LOGGER.debug("Messages to remove: %s", messages_to_remove)
    remove_messages = [RemoveMessage(id=m.id) for m in messages_to_remove]

    return {"summary": response.content, "messages": remove_messages}
Latency
The latency between user requests, or the agent taking timely action on the user’s behalf, is critical to consider in the design. I used several techniques to reduce latency, including using specialized, smaller helper LLMs running on the edge and facilitating primary model prompt caching by structuring the prompts to place static content, such as instructions and examples, up front and variable content, such as user-specific information, at the end. These techniques also reduce primary model usage costs considerably.
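The sketch below illustrates the idea of a cache-friendly prompt layout: long, unchanging instructions form a static prefix, and the user- and request-specific details come last. The content and the build_messages helper are illustrative only, not the project’s actual prompts.

from langchain_core.messages import HumanMessage, SystemMessage

# Long, unchanging instructions and examples: a static prefix that a provider's
# prompt caching can reuse across requests.
STATIC_INSTRUCTIONS = (
    "You are an agent for a smart home running Home Assistant...\n"
    "Tool usage examples:\n..."
)

def build_messages(user_name: str, home_state: str, request: str) -> list:
    """Place variable, user-specific content after the static prefix."""
    return [
        SystemMessage(content=STATIC_INSTRUCTIONS),
        HumanMessage(content=f"User: {user_name}\nHome state: {home_state}\nRequest: {request}"),
    ]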
You can see the typical latency performance below.
- HA intents (e.g., turn on a light) | < 1 second
- Analyze camera image (initial request) | < 3 seconds
- Add automation | < 1 second
- Memory operations | < 1 second
Tools
The agent can use HA tools as specified in the LLM API and other tools built in the LangChain framework as defined in tools.py. Moreover, you can extend the LLM API with tools of your own as well. The code gives the primary LLM the list of tools it can call, along with instructions on using them in its system message and in the docstring of the tool’s Python function definition. You can see an example of docstring instructions in the code snippet below for the get_and_analyze_camera_image tool.
@tool(parse_docstring=False)
async def get_and_analyze_camera_image( # noqa: D417
    camera_name: str,
    detection_keywords: list[str] | None = None,
    *,
    # Hide these arguments from the model.
    config: Annotated[RunnableConfig, InjectedToolArg()],
) -> str:
    """
    Get a camera image and perform scene analysis on it.

    Args:
        camera_name: Name of the camera for scene analysis.
        detection_keywords: Specific objects to look for in image, if any.
            For example, if the user says "check the front porch camera for
            boxes and dogs", detection_keywords would be ["boxes", "dogs"].

    """
    hass = config["configurable"]["hass"]
    vlm_model = config["configurable"]["vlm_model"]
    options = config["configurable"]["options"]
    image = await _get_camera_image(hass, camera_name)

    return await _analyze_image(vlm_model, options, image, detection_keywords)
If the agent decides to use a tool, the LangGraph action node is entered, and the node’s code runs the tool. The node uses a simple error recovery mechanism that asks the agent to try calling the tool again with corrected parameters in the event of a mistake. The code snippet below shows the Python code related to the action node.
async def _call_tools(
    state: State, config: RunnableConfig, *, store: BaseStore
) -> dict[str, list[ToolMessage]]:
    """Coroutine to call Home Assistant or langchain LLM tools."""
    # Tool calls will be the last message in state.
    tool_calls = state["messages"][-1].tool_calls

    langchain_tools = config["configurable"]["langchain_tools"]
    ha_llm_api = config["configurable"]["ha_llm_api"]

    tool_responses: list[ToolMessage] = []
    for tool_call in tool_calls:
        tool_name = tool_call["name"]
        tool_args = tool_call["args"]

        LOGGER.debug(
            "Tool call: %s(%s)", tool_name, tool_args
        )

        def _handle_tool_error(err: str, name: str, tid: str) -> ToolMessage:
            return ToolMessage(
                content=TOOL_CALL_ERROR_TEMPLATE.format(error=err),
                name=name,
                tool_call_id=tid,
                status="error",
            )

        # A langchain tool was called.
        if tool_name in langchain_tools:
            lc_tool = langchain_tools[tool_name.lower()]

            # Provide hidden args to tool at runtime.
            tool_call_copy = copy.deepcopy(tool_call)
            tool_call_copy["args"].update(
                {
                    "store": store,
                    "config": config,
                }
            )

            try:
                tool_response = await lc_tool.ainvoke(tool_call_copy)
            except (HomeAssistantError, ValidationError) as e:
                tool_response = _handle_tool_error(repr(e), tool_name, tool_call["id"])
        # A Home Assistant tool was called.
        else:
            tool_input = llm.ToolInput(
                tool_name=tool_name,
                tool_args=tool_args,
            )

            try:
                response = await ha_llm_api.async_call_tool(tool_input)

                tool_response = ToolMessage(
                    content=json.dumps(response),
                    tool_call_id=tool_call["id"],
                    name=tool_name,
                )
            except (HomeAssistantError, vol.Invalid) as e:
                tool_response = _handle_tool_error(repr(e), tool_name, tool_call["id"])

        LOGGER.debug("Tool response: %s", tool_response)

        tool_responses.append(tool_response)

    return {"messages": tool_responses}
The LLM API instructs the agent to always call tools using HA built-in intents when controlling Home Assistant and to use the intents `HassTurnOn` to lock and `HassTurnOff` to unlock a lock. An intent describes a user’s intention generated by user actions.
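For example, a request to lock a door might end up as a call like the hypothetical fragment below, mirroring the Home Assistant branch of the action node code above; the entity name is made up, and llm and ha_llm_api are the same objects used in that node.

# Hypothetical call for "lock the front door", using the built-in HassTurnOn intent.
tool_input = llm.ToolInput(
    tool_name="HassTurnOn",
    tool_args={"name": "Front Door Lock"},  # example entity name
)
response = await ha_llm_api.async_call_tool(tool_input)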
You can see the list of LangChain tools that the agent can use below; a hedged sketch of what one of these tools might look like follows the list.
- get_and_analyze_camera_image | run scene analysis on the image from a camera
- upsert_memory | add or update a memory
- add_automation | create and register a HA automation
- get_entity_history | query HA database for entity history
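As an illustration of how these tools plug into the same injected-argument pattern, below is a hedged sketch of what a memory tool such as upsert_memory could look like, with the store and config supplied at runtime by the action node. The signature and body are assumptions, not the project’s actual implementation.

import uuid
from typing import Annotated

from langchain_core.runnables import RunnableConfig
from langchain_core.tools import InjectedToolArg, tool
from langgraph.store.base import BaseStore


@tool
async def upsert_memory(
    content: str,
    *,
    # Hidden from the model and provided by the action node at runtime.
    store: Annotated[BaseStore, InjectedToolArg()],
    config: Annotated[RunnableConfig, InjectedToolArg()],
) -> str:
    """Add or update a memory about the user or the home.

    Args:
        content: The memory to store, e.g. "the user prefers the lights dimmed at night".
    """
    user_id = config["configurable"]["user_id"]
    await store.aput(("memories", user_id), str(uuid.uuid4()), {"content": content})
    return "Memory stored."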
Hardware
I built the HA installation on a Raspberry Pi 5 with SSD storage, Zigbee, and LAN connectivity. I deployed the edge models under Ollama on an Ubuntu-based server with an AMD 64-bit 3.4 GHz CPU, an Nvidia 3090 GPU, and 64 GB of system RAM. The server is on the same LAN as the Raspberry Pi.
I have been using this project at home for a few weeks and have found it useful, but frustrating in a few areas that I will be working to address. Below is a list of pros and cons from my experience with the agent.
Pros
- The camera image scene analysis is very useful and versatile since you can query for nearly anything and don’t need to worry about having the right classifier as you would for a traditional ML approach.
- Automations are very easy to set up and can be quite complex. It’s mind-blowing how good the primary LLM is at generating HA-compliant YAML.
- Latency is usually quite acceptable.
- It’s very easy to add additional LLM tools and graph states with LangChain and LangGraph.
Cons
- The camera image analysis seems less accurate than traditional ML approaches. For example, detecting packages that are partially obscured is very difficult for the model to handle.
- The primary model cloud costs are high. Running a single package detector once every 30 minutes costs about $2.50 per day.
- Using structured model outputs for the helper LLMs, which would make downstream LLM processing easier, considerably reduces accuracy.
- The agent needs to be more proactive. Adding a planning step to the agent graph will hopefully address this.
Below are a few examples of what you can do with the home-generative-agent (HGA) integration, as illustrated by screenshots of the Assist dialog taken during interactions with my HA installation.
- Create an automation that runs periodically.
The snippet below shows that the agent is fluent in YAML, based on what it generated and registered as an HA automation.
alias: Check Litter Box Waste Drawer
triggers:
  - minutes: /30
    trigger: time_pattern
conditions:
  - condition: numeric_state
    entity_id: sensor.litter_robot_4_waste_drawer
    above: 90
actions:
  - data:
      message: The Litter Box waste drawer is more than 90% full!
    action: notify.notify
- Check multiple cameras (video by the author).
https://github.com/user-attachments/assets/230baae5-8702-4375-a3f0-ffa981ee66a3
- Summarize the home state (video by the author).
https://github.com/user-attachments/assets/96f834a8-58cc-4bd9-a899-4604c1103a98
- Long-term memory with semantic search.