CodexBloom - Programming Q&A Platform

How to handle high cardinality tags in Prometheus metrics for a FastAPI app?

👀 Views: 95 💬 Answers: 1 📅 Created: 2025-06-07
prometheus fastapi observability Python

I'm working on a FastAPI application and trying to implement observability using Prometheus for metrics collection. The problem arises when I need to track user activity with metrics that carry high-cardinality labels, specifically user IDs. Prometheus struggles to handle the dynamically generated user IDs as labels, which leads to performance degradation and even timeouts during scrapes.

I've already set up basic metrics collection using `prometheus_fastapi_instrumentator` version 0.10.0. Here's a snippet of how I'm currently defining my metrics:

```python
from fastapi import FastAPI
from prometheus_fastapi_instrumentator import Instrumentator

app = FastAPI()
instrumentator = Instrumentator()
instrumentator.instrument(app)


@app.get("/user/{user_id}")
def get_user(user_id: int):
    # Simulate some processing
    return {"user_id": user_id}
```

I also tried using a `gauge` metric for tracking user sessions, but the cardinality problem persists. My scrape config on the Prometheus server is as follows:

```yaml
scrape_configs:
  - job_name: 'fastapi'
    static_configs:
      - targets: ['localhost:8000']
    metrics_path: '/metrics'
```

I'm getting error messages in the Prometheus logs indicating that the scrape is timing out:

```
level=error ts=2023-10-01T12:00:00.000Z caller=scrape.go:1054 component=scraper scraping="fastapi" err="Get \"http://localhost:8000/metrics\": context deadline exceeded"
```

What are the best practices for handling high-cardinality metrics in Prometheus, particularly in a FastAPI application? Are there strategies or patterns I can adopt to limit the number of unique label values, or to aggregate user activity in a way that doesn't overwhelm Prometheus? Any insights or code examples would be greatly appreciated!

I'm on Ubuntu 22.04 using the latest version of Python.
This is happening in both development and production on Windows 10.
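
To make the goal concrete, here is a minimal sketch of the kind of label bounding being asked about. It uses only the standard library: a `collections.Counter` stands in for a labelled Prometheus counter, and the helper names (`normalize_path`, `user_bucket`, `record_request`) and the choice of 16 buckets are hypothetical, not part of `prometheus_client` or `prometheus_fastapi_instrumentator`. The idea is that the number of distinct label combinations stays fixed however many users there are.

```python
import re
from collections import Counter

# Hypothetical helpers for illustration only; none of these names come
# from prometheus_client or prometheus_fastapi_instrumentator.
_NUMERIC_SEGMENT = re.compile(r"/\d+(?=/|$)")


def normalize_path(path: str) -> str:
    """Collapse numeric path segments so /user/42 and /user/43 share one label value."""
    return _NUMERIC_SEGMENT.sub("/{id}", path)


def user_bucket(user_id: int, buckets: int = 16) -> str:
    """Map an unbounded user ID onto a fixed set of label values (assumed: 16)."""
    return f"bucket_{user_id % buckets:02d}"


# Stand-in for a labelled counter: each key is a (handler, bucket) label pair.
requests_total: Counter = Counter()


def record_request(path: str, user_id: int) -> None:
    """Count a request under bounded labels instead of the raw user ID."""
    requests_total[(normalize_path(path), user_bucket(user_id))] += 1


# Simulate traffic from 10,000 distinct users.
for uid in range(10_000):
    record_request(f"/user/{uid}", uid)

print(len(requests_total))  # distinct label combinations: 16, not 10,000
```

In a real app, the tuple key would presumably become the label values of a `prometheus_client` `Counter`, and per-user detail would go to logs or traces rather than metric labels.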