Three Mighty Alerts Supporting Hugging Face’s Production Infrastructure

By Jeremy Udit

The Infrastructure team at Hugging Face is happy to share a behind-the-scenes look at the inner workings of Hugging Face’s production infrastructure, which we’ve had the privilege of helping to build and maintain. Our team’s dedication to designing and implementing a robust monitoring and alerting system has been instrumental in ensuring the stability and scalability of our platforms. We’re constantly reminded of the impact our alerts have on our ability to detect and respond to potential issues before they become major incidents.

In this blog post, we’ll dive into the details of three mighty alerts, each playing a unique role in supporting our production infrastructure, and explore how they’ve helped us maintain the high level of performance and uptime that our community relies on.



High NAT Gateway Throughput

In cloud computing architectures, where data flows between private and public networks, implementing a NAT (Network Address Translation) gateway is a steadfast best practice. This gateway acts as a strategic gatekeeper, monitoring and facilitating all outbound traffic toward the public internet. By centralizing egress traffic, the NAT gateway offers a strategic vantage point for comprehensive visibility. Our team can easily query and analyze this traffic, making it a valuable asset when working through security, cost optimization, or various other investigative scenarios.

Cost optimization is a critical aspect of cloud infrastructure management, and understanding the pricing dynamics is vital. In data centers, pricing structures often differentiate between east-west traffic (typically communication within the same rack or building) and north-south traffic (communication between more distant private networks or the internet). By monitoring network traffic volume, Hugging Face gains valuable insights into these traffic patterns. This awareness allows us to make informed decisions about infrastructure configuration and architecture, ensuring we avoid incurring unnecessary costs.

One of our key alerts is designed to notify us when our network traffic volume surpasses a predefined threshold. This alert serves multiple purposes. Firstly, it acts as an early warning system, alerting us to unusual spikes in traffic that may indicate potential issues or unexpected behavior. Secondly, it prompts us to regularly review our traffic trends, ensuring we stay on top of our infrastructure’s growth and evolving needs. The alert is set at a static threshold, which we have fine-tuned over time to keep it relevant and effective. When triggered, it often coincides with periods of refactoring our infrastructure.
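As a sketch, such a static-threshold check could be expressed as a Prometheus alerting rule along these lines. The metric name assumes a CloudWatch exporter surfacing the NAT gateway’s BytesOutToDestination metric, and the threshold and durations are illustrative, not our production values:

```yaml
groups:
  - name: nat-gateway
    rules:
      - alert: HighNatGatewayThroughput
        # Fires on sustained egress above ~1 GB/s through the NAT gateway.
        # Metric name and threshold are illustrative assumptions.
        expr: sum(rate(aws_natgateway_bytes_out_to_destination_sum[5m])) > 1e9
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "NAT gateway egress exceeds the expected baseline"
```

The `for: 15m` clause is the usual way to avoid paging on short-lived bursts while still catching sustained shifts in traffic patterns.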

For instance, when integrating third-party security and autoscaling tools, we have observed increased telemetry data egress from our nodes, triggering the alert and prompting us to optimize our configurations.

On another occasion, changes to our infrastructure resulted in traffic mistakenly bypassing a private, low-cost path between product-specific infrastructure (e.g., traffic from a Space destined for the Hub to interact with repository data). To elaborate, the most impactful workloads we’ve found in terms of cost savings are those that access object storage. Fetching objects directly is cheaper than going through CDN-hosted assets for our LFS repository storage, and moreover doesn’t require the same security measures that our WAF provides for public requests arriving at our front door. Leveraging DNS overrides to switch traffic between private and public network paths has become a useful technique for us, driven by the CDKTF AWS provider.

// Rewrite resolution of a public hostname to a private target by
// "blocking" the query with a CNAME override in Route 53 Resolver DNS Firewall.
new Route53ResolverFirewallRule(
  stack,
  `dns-override-rule-${key}-${j}`,
  {
    provider: group?.provider!,
    name: `dns-override-${dnsOverride.name}-${rule[0]}`,
    action: 'BLOCK',
    blockOverrideDnsType: 'CNAME',
    blockOverrideDomain: `${rule[1]}.`,
    blockOverrideTtl: dnsOverride.ttl,
    blockResponse: 'OVERRIDE',
    firewallDomainListId: list.id,
    firewallRuleGroupId: group!.id,
    priority: 100 + j,
  },
);

As a final note, while we have configuration-as-code ensuring the desired state is always in effect, an additional layer of alerting helps in case mistakes are made when expressing that desired state through code.



Hub Request Logs Archival Success Rate

The logging infrastructure at Hugging Face is a sophisticated system designed to collect, process, and store vast amounts of log data generated by our applications and services. At the heart of this system is the Hub application logging pipeline, a well-architected solution that ensures Hub model usage data is efficiently captured, enriched, and stored for reporting and archival purposes. The pipeline begins with Filebeat, a lightweight log shipper that runs as a daemonset alongside our application pods in each Kubernetes cluster. Filebeat’s role is to collect logs from various sources, including application containers, and forward them to the next stage of the pipeline.
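A minimal Filebeat configuration for this stage might look like the following sketch; the paths and Logstash service endpoint are illustrative assumptions, not our production values:

```yaml
filebeat.inputs:
  - type: container
    # Container stdout/stderr logs as written by the kubelet on each node.
    paths:
      - /var/log/containers/*.log
    processors:
      # Enrich each event with pod/namespace metadata for downstream routing.
      - add_kubernetes_metadata:
          host: ${NODE_NAME}

output.logstash:
  # Forward to the next stage of the pipeline (illustrative service name).
  hosts: ["logstash.logging.svc:5044"]
```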

Once logs are collected by Filebeat, they’re sent to Logstash, a powerful log processing tool. Logstash acts as the data processing workhorse, applying a series of mutations and transformations to the incoming logs. This includes enriching logs with GeoIP data for geolocation insights, routing logs to specific Elasticsearch indexes based on predefined criteria, and manipulating log fields by adding, removing, or reformatting them to ensure consistency and ease of analysis. After Logstash has processed the logs, they’re forwarded to an Elasticsearch cluster.

Elasticsearch, a distributed search and analytics engine, forms the core of our log storage and analysis platform. It receives the logs from Logstash and applies its own set of processing rules through Elasticsearch pipelines. These pipelines perform minimal processing tasks, such as adding timestamp fields to indicate the time of processing, which is crucial for log analysis and correlation. Elasticsearch provides a scalable and flexible storage solution, allowing us to buffer logs for operational use and real-time analysis.
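For example, an ingest pipeline that stamps each document with its processing time can be as small as the following body (created via `PUT _ingest/pipeline/add-processing-timestamp`; the pipeline and field names here are illustrative, not our actual schema):

```json
{
  "description": "Record when Elasticsearch processed each log event",
  "processors": [
    { "set": { "field": "event.processed_at", "value": "{{_ingest.timestamp}}" } }
  ]
}
```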

To manage the lifecycle of logs within Elasticsearch, we employ a robust storage and lifecycle management strategy. This ensures that logs are retained in Elasticsearch for a defined period, providing quick access for operational and troubleshooting purposes. After this retention period, logs are offloaded to long-term archival storage. The archival process involves an automated tool that reads logs from Elasticsearch indexes, formats them as Parquet files (an efficient columnar storage format), and writes them to our object storage system.

The final stage of our logging pipeline leverages AWS data warehousing services. Here, AWS Glue crawlers discover and classify data in our object storage, automatically generating a Glue Data Catalog, which provides a unified metadata repository. The Glue table schema is periodically refreshed to ensure it stays up to date with the evolving structure of our log data. This integration with AWS Glue enables us to query the archived logs using Amazon Athena, a serverless interactive query service. Athena allows us to run SQL queries directly against the data in object storage, providing a cost-effective and scalable solution for log analysis and historical data exploration.
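Once the catalog is in place, historical questions reduce to SQL. A hedged example, with table, partition, and column names that are illustrative rather than our actual schema:

```sql
-- Daily request counts per repository over one week, scanning only the
-- relevant date partitions of the Parquet archive to keep costs down.
SELECT day, repo_id, count(*) AS requests
FROM hub_request_logs
WHERE day BETWEEN date '2024-06-01' AND date '2024-06-07'
GROUP BY day, repo_id
ORDER BY requests DESC
LIMIT 20;
```

Partition pruning on the date column is what keeps such queries cheap: Athena bills by bytes scanned, so restricting the `WHERE` clause to partitions matters more than the complexity of the aggregation.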


The logging pipeline, while meticulously designed, is not without its challenges and potential points of failure. One of the most critical vulnerabilities lies in the elasticity of the system, particularly in the Elasticsearch cluster. Elasticsearch, being a distributed system, can experience backpressure in various scenarios, such as high ingress traffic, intensive querying, or internal operations like shard relocation. When backpressure occurs, it can lead to a cascade of issues throughout the pipeline. For instance, if the Elasticsearch cluster becomes overwhelmed, it may start rejecting or delaying log ingestion, causing backlogs in Logstash and even Filebeat, which can lead to log loss or delayed processing.

Another point of fragility is the auto-schema detection mechanism in Elasticsearch. While it’s designed to adapt to changing log structures, it can fail when application logs undergo significant field type changes. If schema detection fails to recognize the new field types, it can result in failed writes from Logstash to Elasticsearch, causing log processing bottlenecks and potential data inconsistencies. This issue highlights the importance of proactive log schema management and the need for robust monitoring to detect and address such issues promptly.

Memory management is also a critical aspect of the pipeline’s stability. The log processing tier, including Logstash and Filebeat, operates with limited memory resources to manage costs. When backpressure occurs, these components can experience Out-of-Memory (OOM) issues, especially during system slowdowns. As logs accumulate and backpressure increases, the memory footprint of these processes grows, pushing them closer to their limits. If not addressed promptly, this can lead to process crashes or further exacerbation of the backpressure problem.

Archival jobs, responsible for transferring logs from Elasticsearch to object storage, have also encountered challenges. Occasionally, these jobs can be resource-intensive, with their performance becoming sensitive to node size and memory availability. In cases where junk data or unusually large log entries pass through the pipeline, they can strain the archival process, leading to failures due to memory exhaustion or node capacity limits. This underscores the importance of data validation and filtering earlier in the pipeline to prevent such issues from reaching the archival stage.

To mitigate these potential failures, we have implemented a robust alert system with a singular motivation: validating end-to-end log flow. The alert is designed to compare the number of requests received by our Application Load Balancer (ALB) with the number of logs successfully archived, providing a comprehensive view of log data flow across the entire pipeline. This approach allows us to quickly identify any discrepancies that may indicate potential log loss or processing issues.

The alert mechanism is based on a simple yet effective comparison: the number of requests hitting our ALB, which represents the total log volume entering the system, versus the number of logs successfully archived in our long-term storage. By monitoring this ratio, we can ensure that what goes in must come out, providing a robust validation of our logging infrastructure’s health. When the alert is triggered, it indicates a potential mismatch, prompting immediate investigation and remediation.
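The comparison boils down to a ratio with a small tolerance for in-flight or still-buffered logs. A TypeScript sketch of the idea, where the 2% tolerance is an illustrative assumption rather than our production value:

```typescript
// Sketch of the end-to-end validation: compare ALB request counts against
// archived log counts over the same time window.
interface LogFlowWindow {
  albRequestCount: number;   // requests seen by the Application Load Balancer
  archivedLogCount: number;  // log lines landed in object storage as Parquet
}

// Fraction of incoming requests that made it all the way to the archive.
function archivalSuccessRate(w: LogFlowWindow): number {
  if (w.albRequestCount === 0) return 1; // nothing to archive, nothing lost
  return w.archivedLogCount / w.albRequestCount;
}

// Fire when more than `tolerance` of the logs went missing end to end.
function shouldAlert(w: LogFlowWindow, tolerance = 0.02): boolean {
  return archivalSuccessRate(w) < 1 - tolerance;
}
```

With 990,000 of 1,000,000 requests archived, the ratio of 0.99 clears the 0.98 floor and nothing fires; a drop to 970,000 would trip the alert.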

In practice, this alert has proven to be an invaluable tool, especially during periods of infrastructure refactoring. For instance, when we migrated our ALB to a VPC origin, the alert was instrumental in identifying and addressing the resulting log flow discrepancies. However, it has also saved us in less obvious scenarios. For example, when archive jobs failed to run due to unexpected issues, the alert flagged the missing archived logs, allowing us to promptly investigate and resolve the problem before it impacted our log analysis and retention processes.

While this alert is a powerful tool, it is only one part of our comprehensive monitoring strategy. We continuously refine and adapt our logging infrastructure to handle the ever-increasing volume and complexity of log data. By combining proactive monitoring, efficient resource management, and a deep understanding of our system’s behavior, Hugging Face ensures that our logging pipeline remains resilient, reliable, and capable of supporting our platform’s growth and evolving needs. This alert is a testament to our commitment to maintaining a robust and transparent logging system, providing our teams with the insights they need to keep Hugging Face running smoothly.



Kubernetes API Request Errors and Rate Limiting

When operating cloud-native applications and Kubernetes-based infrastructures, even seemingly minor issues can escalate into significant downtime if left unchecked. This is especially true for the Kubernetes API, which serves as the central nervous system of a Kubernetes cluster, orchestrating the creation, management, and networking of containers. At Hugging Face, we have learned through experience that monitoring the Kubernetes API error rate and rate limiting metrics is an essential practice, one that can prevent potential disasters.

Hugging Face’s infrastructure is deeply integrated with Kubernetes, and the kube-rs library has been instrumental in building and managing this ecosystem efficiently. kube-rs offers a Rust-centric approach to Kubernetes application development, providing developers with a familiar and powerful toolkit. At its core, kube-rs introduces three key concepts: reflectors, controllers, and custom resource interfaces. Reflectors ensure real-time synchronization of Kubernetes resources, enabling applications to react swiftly to changes. Controllers, the decision-makers, continuously reconcile the desired and actual states of resources, making Kubernetes self-healing. Custom resource interfaces extend Kubernetes, allowing developers to define application-specific resources for better abstraction.

Additionally, kube-rs introduces watchers and finalizers. Watchers monitor specific resources for changes, triggering actions in response to events. Finalizers, on the other hand, ensure proper cleanup and resource termination by defining custom logic. By providing Rust-based abstractions for these Kubernetes concepts, kube-rs allows developers to build robust, efficient applications, leveraging the Kubernetes platform’s power and flexibility while maintaining a Rust-centric development approach. This integration streamlines the process of building and managing complex Kubernetes applications, making it an invaluable tool in Hugging Face’s infrastructure.

Hugging Face’s integration with Kubernetes is a cornerstone of our infrastructure, and the kube-rs library plays a pivotal role in managing this ecosystem. The kube::api:: module is instrumental in automating various tasks, such as managing HTTPS certificates for custom domains supporting our Spaces product. By programmatically handling certificate lifecycles, we ensure the security and accessibility of our services, providing users with a seamless experience. Additionally, we have used this module outside of user-facing features during routine maintenance to facilitate node draining and termination, preserving cluster stability during infrastructure updates.

The kube::runtime:: module has been equally crucial for us, enabling the development and deployment of custom controllers that enhance our infrastructure’s automation and resilience. For instance, we have implemented controllers for billing management in our managed services, where watchers and finalizers on customer pods ensure accurate resource tracking and billing. This level of customization allows us to adapt Kubernetes to our specific needs.

Through kube-rs, Hugging Face has achieved a high level of efficiency, reliability, and control over our cloud-native applications. The library’s Rust-centric design aligns with our engineering philosophy, allowing us to leverage Rust’s strengths in managing Kubernetes resources. By automating critical tasks and building custom controllers, we have created a scalable, self-healing infrastructure that meets the diverse and evolving needs of our users and enterprise customers. This integration demonstrates our commitment to harnessing the full potential of Kubernetes while maintaining a development approach tailored to our unique requirements.

While our infrastructure rarely encounters issues related to the Kubernetes API, we remain vigilant, especially during and after deployments. The Kubernetes API is a critical component in our use of kube::runtime:: for managing customer pods and cloud networking resources. Any disruptions or inefficiencies in API communication can have cascading effects on our services, potentially resulting in downtime or degraded performance.

The importance of monitoring these API metrics is underscored by the experiences of other Kubernetes users. OpenAI, for instance, shared a status update detailing how DNS availability issues resulted in significant downtime. While not directly related to the Kubernetes API, their experience highlights the interconnectedness of various infrastructure components and the potential for cascading failures. Just as DNS availability is vital for application accessibility, a healthy and responsive Kubernetes API is crucial for managing and orchestrating our containerized workloads.

As a best practice, we have integrated these API metrics into our monitoring and alerting systems, ensuring that any anomalies or trends are promptly brought to our attention. This enables us to take a proactive approach, investigating and addressing issues before they impact our customers. For instance, on one occasion a single cluster began rate limiting requests to the Kubernetes API. We were able to trace this back to one of our third-party tools hitting a bug and repeatedly requesting that a node be drained even though it had already been. In response, we were able to flush the malfunctioning job from the system before any noticeable degradation impacted our users. This is a great example that alerting scenarios don’t only occur immediately after deploying new versions of our custom controllers: bugs can take time to manifest as production issues.
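To sketch what such monitoring can look like in PromQL (recording the idea rather than our exact rules), both the server-side error rate and client-side throttling can be derived from standard Kubernetes metrics; the window sizes here are illustrative:

```promql
# Fraction of Kubernetes API requests failing server-side over 5 minutes.
sum(rate(apiserver_request_total{code=~"5.."}[5m]))
  / sum(rate(apiserver_request_total[5m]))

# Controllers being throttled by the API server (HTTP 429s seen by clients).
sum by (job) (rate(rest_client_requests_total{code="429"}[5m]))
```

The second query is what surfaces a misbehaving controller like the node-draining loop described above: the 429s appear on the client side well before users notice any degradation.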

In conclusion, while our infrastructure is robust and well-architected, we recognize that vigilance and proactive monitoring are essential to maintaining its health and stability. By keeping a close eye on the Kubernetes API error rate and rate limiting metrics, we safeguard our managed services, ensure smooth customer experiences, and uphold our commitment to reliability and performance. This is a testament to our belief that in the world of cloud-native technologies, every component, no matter how small, plays a significant role in the overall resilience and success of our platform.



Bonus Alert: New Cluster Sending Zero Metrics

And a final bonus alert as a reward for reading this far into the post!

At Hugging Face, our experiments are always in flux, often with purpose-fit clusters spinning up and down as we iterate on features and products. To add to the entropy, our growth is also a significant factor, with clusters expanding to their limits and triggering meiosis-like splits to maintain balance. To navigate this dynamic environment without resorting to hardcoding or introducing an additional cluster discovery layer, we have devised a clever alert that adapts to these changes.

(
  (sum by (cluster) (rate(container_network_transmit_packets_total{pod="prometheus"}[1h])) > 0)
or
  (-1 * (sum by (cluster) (rate(container_network_transmit_packets_total{pod="prometheus"}[1h] offset 48h)) > 0))
) < 0

The metric used in this query is container_network_transmit_packets_total, which represents the total number of packets transmitted by a container. The query filters for metrics from a cluster’s local Prometheus instance, which is tasked with metric collection as well as remote-writing to our central metric store, Grafana Mimir. Transmission of packets approximates healthy remote writes, which is what we want to ensure across all active clusters.

The first part of the query performs a current rate check. The second part performs a historical rate check, using the same calculation as the current rate check plus an offset 48h clause. The -1 * multiplication inverts the result, so that when the historical rate is greater than 0, the value produced is less than 0.

The or operator combines the two parts. Because PromQL comparisons act as filters, each part only returns a value for a cluster when its condition holds, and or takes the first part’s value where it exists, falling back to the second part otherwise. A cluster therefore produces one of two possible values:

  • A positive value, when its current rate of packet transmission is greater than 0.
  • A negative value, when its current rate is not greater than 0 but its historical rate (48 hours ago) is.

The outer < 0 condition keeps only the negative values, so the alert fires exactly for clusters that were transmitting packets 48 hours ago but are not transmitting now. Two properties fall out of this:

  1. New clusters enroll themselves: as soon as a new cluster’s Prometheus begins remote-writing, its metrics appear in the query, and after 48 hours of history the silent-cluster check covers it automatically, with no hardcoded cluster list.
  2. Silent clusters are caught: if a cluster that was remote-writing 48 hours ago goes quiet, whether its Prometheus crashed or its remote writes are failing, only the negated historical branch matches and the alert fires.

When a cluster is deliberately torn down, the alert fires once and then retires itself after the 48-hour window passes.
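To make the filter-and-fallback semantics concrete, here is a small TypeScript sketch of our own (not part of any Prometheus tooling) that models the evaluation for a single cluster, using `null` to stand in for an absent PromQL sample:

```typescript
// Model one cluster's evaluation of:
//   ((current > 0) or (-1 * (historical > 0))) < 0
// `null` represents an absent sample (no series in the queried window).
type Sample = number | null;

// PromQL `> 0` is a filter: it keeps the value if positive, drops it otherwise.
const filterGt0 = (v: Sample): Sample => (v !== null && v > 0 ? v : null);

function alertFires(currentRate: Sample, historicalRate: Sample): boolean {
  const first = filterGt0(currentRate);             // healthy clusters: positive
  const h = filterGt0(historicalRate);
  const second = h !== null ? -1 * h : null;        // past senders: negative
  const combined = first !== null ? first : second; // `or` prefers the left side
  return combined !== null && combined < 0;         // `< 0` keeps only negatives
}
```

Here `alertFires(120, 130)` is false (the cluster is healthy), `alertFires(0, 130)` and `alertFires(null, 130)` are true (the cluster went silent), and `alertFires(null, null)` is false (the cluster is unknown to the metric store, so nothing fires).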

This simple yet effective solution fires when our metrics infrastructure crashes and when clusters are torn down, providing us with timely insights into our infrastructure’s health. While it may not be the most critical alert in our arsenal, it holds a special place, as it was born out of collaboration. It’s a testament to the power of teamwork through rigorous code review, made possible by the expertise and willingness to help of fellow colleagues on the Hugging Face infrastructure team 🤗



Wrapping Up

In this post we shared some of our favorite alerts supporting infrastructure at Hugging Face. We’d love to hear your team’s favorites as well!

How are you monitoring your ML infrastructure? Which alerts keep your team coming back for fixes? What breaks often in your infrastructure, or conversely, what have you never monitored that just works?

Share with us in the comments below!


