Loki indexes
How each system builds its index is a crucial distinction between Loki and Prometheus, despite both using labels.
Loki: Indexes on Labels, Not Message Data
No, Loki does NOT build a separate index for each unique combination of labels and message data.
Here's how Loki works:
- Labels define streams: Loki's fundamental concept is the "log stream." A log stream is defined by a unique combination of labels. For example, `app="webserver", env="production", host="host-1"` would define one log stream.
- Log content is not indexed: The actual content of the log messages themselves is not indexed by Loki. This is its core difference from systems like Elasticsearch.
- Index on labels only: Loki builds an index only on the labels. This index allows it to quickly identify which log chunks (files containing log lines) contain logs belonging to a specific set of labels.
- "Grep" at query time: When you execute a LogQL query, Loki first uses the label index to narrow down the relevant log streams (and their associated chunks). Once it has those chunks, it then performs a grep-like search (text filtering, regex matching, parsing) on the raw log data within those chunks to find the specific log lines that match your query.
Why this design?
- Cost-effective: Not indexing log content significantly reduces storage requirements and ingestion costs compared to full-text indexing solutions.
- Scalability: It's easier to scale a system that primarily indexes metadata (labels) rather than the entire data payload.
- Optimization for specific use cases: Loki is optimized for use cases where you know the source of your logs (via labels) and then want to "tail" or grep through them, similar to how you'd use `grep` on log files on a server.
High Cardinality is Bad for Loki:
Because each unique combination of labels creates a new log stream and new chunks, high cardinality in Loki is a major anti-pattern. If you use labels that change frequently or have many unique values (e.g., `user_id`, `request_id`, full URL paths), you will create an explosion of log streams (see the sketch after this list), leading to:
- A very large and inefficient label index.
- Thousands or millions of tiny chunks, which are inefficient for object storage and increase query latency (more files to open and read).
- Higher operational costs and performance degradation.
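As a quick illustration, compare these two stream selectors (a sketch reusing the labels mentioned above; the `request_id` value is hypothetical):

```logql
# Bounded cardinality: a handful of streams per app/environment.
{app="webserver", env="production"}

# Unbounded cardinality: every request mints a brand-new stream,
# each with its own tiny chunks.
{app="webserver", request_id="a1b2c3d4"}
```

The second pattern is what turns a few streams into millions.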
Prometheus: Indexes on Labels to Create Time Series
Yes, Prometheus (whether using its local TSDB or remote storage) effectively does build a "separate index" for each unique combination of metric name and labels.
Here's how Prometheus works:
- Dimensional Data Model: Prometheus stores all data as time series. Each time series is uniquely identified by its metric name and a unique set of key-value label pairs.
- Each unique combination is a distinct time series: `http_requests_total{method="GET", status="200"}` is a completely different time series from `http_requests_total{method="POST", status="404"}`. Even though they share the same metric name, the labels make them distinct.
- TSDB Indexing: The Prometheus TSDB (Time Series Database) is designed to efficiently store and query these time series. It builds an internal index that maps label sets to the actual time series data chunks.
- It uses an inverted index where each label value points to a list of series IDs that contain that label. This allows for very fast filtering and aggregation based on label matchers.
- Data is stored in "blocks," and each block contains its own index and chunks of compressed time series data.
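To see what the inverted index buys you, consider a PromQL query like the sketch below (the metric and labels come from the example above; `instance` is assumed as a standard target label): each matcher is first resolved to a posting list of series IDs, and only the intersection is read and decoded.

```promql
# method="GET" and status=~"5.." are each resolved against the inverted
# index; only series matching both are fetched, decoded, and aggregated.
sum by (instance) (rate(http_requests_total{method="GET", status=~"5.."}[5m]))
```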
Why this design?
- Efficient queries: Prometheus is built for numerical time-series data and its query language (PromQL) relies heavily on labels for powerful aggregation and filtering. Efficient indexing of label sets is critical for fast PromQL queries.
- Metrics vs. Logs: Metrics are typically lower cardinality than log lines. While you can have millions of metric time series, the expectation is that they are generally bounded, unlike log messages which can contain an infinite variety of content.
High Cardinality is Bad for Prometheus too (but for different reasons):
While Prometheus's indexing model tolerates high cardinality better than Loki's, extremely high cardinality in Prometheus still leads to significant issues:
- Increased memory usage: More unique time series mean more entries in the in-memory index on the Prometheus server.
- Increased disk usage: More time series mean more data to store.
- Slower query times: While the index helps, processing and aggregating over millions of time series can still be computationally intensive.
- Higher churn: If labels change frequently (e.g., dynamic pod names without proper relabeling), it leads to "churn": new time series are constantly created while old ones go stale, putting stress on the TSDB.
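Churn is observable on a running server; here is a sketch using two of Prometheus's own self-monitoring metrics:

```promql
# Rate at which new series are being created in the TSDB head block;
# a sustained high value is the signature of label churn.
rate(prometheus_tsdb_head_series_created_total[5m])

# Number of series currently held in memory, for scale.
prometheus_tsdb_head_series
```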
In summary:
- Loki: Indexes only on labels to define log streams. The log message content itself is not indexed. Searches within the log content are done via "grep" on the retrieved chunks. High label cardinality is very detrimental.
- Prometheus: Indexes on the combination of metric name and labels to define unique time series. Each unique combination gets its own dedicated storage and is efficiently indexed for fast querying and aggregation. High label cardinality is also detrimental, but for resource-based reasons (memory, disk, CPU) rather than the fundamental architectural issue of stream explosion seen in Loki.