Logs are generally unstructured text or structured events emitted by applications and written to files. Logs are essential to system visibility, especially when investigating unpredictable states of the system. Logs (ideally) give you data on exactly why something isn’t working. However, a logs-only approach becomes slow, expensive, and unsustainable as systems and the data they produce grow.
Metrics are data points such as counters and gauges that provide quantitative visibility into a system (error count, average CPU utilization percentage, throughput, and so on), each combined with a timestamp. Metrics are great for quick and reliable big-picture insight into system performance and can help teams quickly identify predictable system error states. But while out-of-the-norm metrics might indicate a problem within an application environment, it’s hard to pinpoint the root cause by looking at metrics alone.
Using a combination of logs and metrics provides many of the details needed to troubleshoot issues. So if we need both logs and metrics, what’s all this business about converting logs to metrics?
In many systems, including Splunk, metric data points are stored in a format that provides faster search performance and more efficient data storage. This means log data that would be better represented by high-level, quantitative metrics rather than detailed text can be converted to metrics to help set alerting thresholds, reduce storage costs, and decrease the time it takes to troubleshoot issues. Rather than searching through logs of HTTP status codes and trying to parse patterns or identify outliers from the log text, we can convert status codes to metrics and easily view anomalies or even alert on them.
So why not just start off reporting these metrics? Maybe you’re already logging system data but are unable to alter the code base to instrument metric data – or perhaps you’re using a commercial off-the-shelf (COTS) application that only emits logs. Maybe you need a low-barrier-to-entry way to gain deeper insight into your system with better search performance and lower data storage costs. Maybe certain teams need access to specific metric data but aren’t permitted to access the full details produced in logs. No matter the reason, pulling metric data from logs provides a higher-level system overview, a faster way to query data, and the ability to set alert thresholds on that metric data.
So how do we convert logs to metrics? Many observability platforms provide in-app ways to convert logs to metrics, so depending on which backend provider you use, the process can look different. In Splunk Enterprise, for example, ingest-time log-to-metrics conversion can be accomplished through Splunk Web or through configuration files.
For a quick and easy vendor-agnostic way to start reporting the metrics you need from the logs you already have, the OpenTelemetry Collector can be installed and configured to convert logs to useful metric counts. The Collector can be thought of as a pipeline that ingests logs from your application, processes them to extract metrics, and then sends those metrics to an observability backend. This approach lets us pull metrics from existing logs without touching application code, and it works regardless of which backend ultimately receives the data.
Let’s take a look at how to do this.
Since we’re working with Splunk, we’ve installed the Splunk Distribution of the OpenTelemetry Collector. If you want to follow along, you can work through the getting-started steps for the Splunk Distribution of the OpenTelemetry Collector in the Splunk documentation.
With the Collector installed, we can configure the otel-collector-config.yaml file. First, we’ll update the receivers block to ingest our application’s log data:
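A minimal sketch of that block is shown below; the splunk_hec receiver is available in the Splunk Distribution of the Collector, but the bind address and port are assumptions to match wherever your application already sends its HEC events:

```yaml
receivers:
  splunk_hec:
    # Accept HTTP Event Collector (HEC) traffic from the application.
    # 0.0.0.0:8088 is the conventional HEC port; adjust for your environment.
    endpoint: "0.0.0.0:8088"
```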
The application sends its log data to this receiver just as it would to an HTTP Event Collector, so no application code changes are required.
Next, we’ll need a connector. Connectors act as both an exporter and a receiver – they both consume and emit data. We’ll use a connector to consume our log data and then emit that data as a metric:
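A sketch of that connector block follows; the status_code attribute name and the service.name dimension are illustrative assumptions and should match whatever fields your log events actually carry:

```yaml
connectors:
  count/http-error-500:
    logs:
      http_500_errors:
        description: Count of log events with an HTTP 500 status code
        # OTTL condition; "status_code" is an assumed attribute name on the log record.
        conditions:
          - attributes["status_code"] == 500
        # Each attribute listed here becomes a dimension on the emitted metric.
        attributes:
          - key: service.name
            default_value: unknown
```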
The count connector creates an http_500_errors metric that counts any log event whose status code equals 500. Each attribute we define becomes a dimension on this count metric.
Note: we could also use the sum connector to sum numeric attribute values from our logs instead of counting events. The connector you choose depends on the type of metric you need and what you want it to represent. Jeremy Hicks wrote a great, in-depth blog post on this called Introduction to the OpenTelemetry Sum Connector. Here’s a quick example of what using the sum connector would look like:
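This is a sketch only; the metric name, source attribute, and dimension below are hypothetical and would depend on which numeric fields your logs carry:

```yaml
connectors:
  sum:
    logs:
      order.revenue.total:
        # Sum the numeric value of this (hypothetical) attribute across matching log records.
        source_attribute: order.total
        conditions:
          - attributes["order.total"] != nil
        attributes:
          - key: payment.method
            default_value: unknown
```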
We’re going to move forward with the count connector example. We’ll want to export this new count metric to Splunk Observability Cloud, so we’ll also include an exporter for that:
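A sketch of that exporter block is below; the endpoint and token values are placeholders, so substitute the realm-specific ingest URL and access token described in the Splunk documentation:

```yaml
exporters:
  otlp:
    # Placeholder values; set these to your Splunk Observability Cloud
    # realm's ingest endpoint and a valid access token.
    endpoint: "${SPLUNK_INGEST_URL}"
    headers:
      X-SF-Token: "${SPLUNK_ACCESS_TOKEN}"
```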
If we wanted to send our logs to an HTTP Event Collector, we could also add another exporter here. Down the line, we might even want to stop exporting the logs we’ve already converted to metrics by adding a filter processor to the configuration, as sketched below.
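For illustration, such a filter could look like the following sketch; note it would belong in the pipeline that exports logs onward, not the one feeding the connector, so the connector still sees every record it needs to count. The status_code attribute name is the same assumption as above:

```yaml
processors:
  filter/drop-500s:
    logs:
      log_record:
        # Log records matching any listed OTTL condition are dropped.
        - attributes["status_code"] == 500
```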
Finally, we need to add these to the service pipelines:
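Under the assumptions above, and with processors (such as batch) omitted for brevity, that might look like this:

```yaml
service:
  pipelines:
    logs:
      receivers: [splunk_hec]
      # The connector is listed as an exporter here: it consumes the log records...
      exporters: [count/http-error-500]
    metrics:
      # ...and as a receiver here, where it emits the count metric it produced.
      receivers: [count/http-error-500]
      exporters: [otlp]
```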
Here, we’ve configured our pipelines to receive logs from Splunk Enterprise, pass them to the count/http-error-500 connector, and export the resulting metrics through our otlp exporter to Splunk Observability Cloud.
We can then save this configuration, start/restart the OpenTelemetry Collector, and search for our new metrics from the Metric Finder in Splunk Observability Cloud:
Converting logs to metrics gives us a high-level view of our system and allows us to create dashboards, charts, and detectors to proactively detect and alert on anomalies. We can then use this information to dig into our logs, pinpoint the exact lines of code that are failing, and resolve issues fast.
If you’re ready to convert your logs to metrics and visualize your metric data in an observability platform with unified, full-system visibility, start with a Splunk Observability Cloud 14-day free trial.