How Metoro Uses eBPF for Zero-Instrumentation Observability
A technical deep-dive into how Metoro captures L7 protocol traffic and intercepts TLS-encrypted data using eBPF, enabling automatic observability without code changes
At Metoro, we're building an AI SRE for Kubernetes: an AI agent that can automatically detect issues, perform root cause analysis, and suggest fixes. But here's the thing: AI SREs are only as good as the data they can see.
If your observability has gaps (sampled traces, missing services, uninstrumented third-party code) then your AI is working with incomplete information. It's like asking a doctor to diagnose you while blindfolded.
That's why we built our observability layer on eBPF. We use eBPF to capture observability data directly from the Linux kernel, which means we can see every HTTP request, every database query, and every message queue interaction, without touching application code. No gaps. No sampling. Complete visibility.
In this post, we'll cover the two core techniques that make this possible:
- L7 Protocol Detection: How we identify HTTP, PostgreSQL, Redis, and other protocols from raw network bytes
- TLS Interception: How we capture encrypted traffic by intercepting TLS libraries before encryption and after decryption
The Foundation: How We Capture Network Data
Before we can detect protocols or parse requests, we need to actually capture the data flowing through applications. Every network read and write in Linux ultimately goes through syscalls, and eBPF lets us attach handlers to these syscalls.
When an application sends an HTTP request or executes a PostgreSQL query, it eventually calls one of these syscalls:
- read/write: Basic I/O operations on file descriptors
- sendto/recvfrom: Send and receive with address information
- sendmsg/recvmsg: Scatter-gather I/O for efficiency
We attach eBPF programs to these syscalls using tracepoints. When a syscall fires, our eBPF program captures the first bytes of the payload, along with metadata like the process ID and file descriptor.
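As a rough illustration, here's how a Go loader built on the cilium/ebpf library might attach one of these tracepoints. The object file and program names are illustrative, not our actual ones:

```go
package main

import (
	"log"

	"github.com/cilium/ebpf"
	"github.com/cilium/ebpf/link"
	"github.com/cilium/ebpf/rlimit"
)

func main() {
	// Allow the process to lock memory for eBPF maps (needed on older kernels).
	if err := rlimit.RemoveMemlock(); err != nil {
		log.Fatal(err)
	}

	// Load a pre-compiled eBPF object file containing the syscall handlers.
	spec, err := ebpf.LoadCollectionSpec("l7_capture.bpf.o")
	if err != nil {
		log.Fatal(err)
	}
	coll, err := ebpf.NewCollection(spec)
	if err != nil {
		log.Fatal(err)
	}
	defer coll.Close()

	// Attach the handler to the write syscall entry tracepoint. The same
	// pattern repeats for read, sendto, recvfrom, sendmsg and recvmsg.
	tp, err := link.Tracepoint("syscalls", "sys_enter_write",
		coll.Programs["handle_sys_enter_write"], nil)
	if err != nil {
		log.Fatal(err)
	}
	defer tp.Close()

	select {} // keep the process alive while events stream in
}
```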
But raw bytes and file descriptors aren't enough. We need to know which network connection a file descriptor corresponds to. So we also track TCP connection state using kernel tracepoints that fire on socket state transitions. This gives us a mapping from file descriptors to actual network connections with source/destination IPs and ports.
The result is that when we see a write syscall on a particular file descriptor, we can look up that it corresponds to a TCP connection to, say, 10.0.0.5:5432, and from there identify it as PostgreSQL traffic.
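Conceptually, the agent ends up maintaining a lookup table along these lines (a minimal sketch with illustrative types; the real version also handles IPv6, connection reuse, and cleanup):

```go
package capture

// ConnKey identifies a socket from the kernel's point of view: the process
// that owns it plus the file descriptor number within that process.
type ConnKey struct {
	PID uint32
	FD  uint32
}

// ConnInfo is what the TCP state-transition tracepoints let us fill in.
type ConnInfo struct {
	SrcIP, DstIP     [4]byte // IPv4 only for brevity
	SrcPort, DstPort uint16
}

// connections is populated from socket state events and consulted whenever a
// read/write event arrives carrying only a (pid, fd) pair.
var connections = map[ConnKey]ConnInfo{}

func lookupConnection(pid, fd uint32) (ConnInfo, bool) {
	info, ok := connections[ConnKey{PID: pid, FD: fd}]
	return info, ok
}
```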
flowchart LR
A[Application] -->|write/send| B[Syscall Layer]
B --> C[eBPF Tracepoint]
C --> D{Protocol Detection}
D -->|HTTP| E[Perf Buffer]
D -->|PostgreSQL| E
D -->|Redis| E
D -->|Other| E
L7 Protocol Detection: Identifying Protocols from Raw Bytes
Here's where it gets interesting. We have raw bytes flowing through syscalls, but we need to know if those bytes are HTTP, PostgreSQL, Redis, or something else entirely.
The technique is signature matching. Each protocol has distinctive byte patterns at the start of messages, and we can check for these patterns directly in our eBPF code running in kernel space.
HTTP Detection
HTTP is one of the easiest protocols to detect because requests start with a method name. Our detection logic checks if the first bytes match one of the HTTP methods:
GET, POST, PUT, DELETE, HEAD, OPTIONS, PATCH, CONNECT
For responses, we look for the HTTP/1. prefix followed by a status code. The detection is a simple byte-by-byte comparison:
If bytes start with "GET" → HTTP request
If bytes start with "POST" → HTTP request
If bytes start with "HTTP/1." followed by status code → HTTP response
This runs in kernel space, so by the time the data reaches our userspace agent, we already know it's HTTP and can parse it accordingly.
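The actual check lives in our eBPF program as fixed-length byte comparisons, but expressed in Go for readability the logic looks roughly like this:

```go
package detect

import "bytes"

// httpMethods are the request-line prefixes we look for. Checking for the
// trailing space cuts down on false positives from arbitrary text payloads.
var httpMethods = [][]byte{
	[]byte("GET "), []byte("POST "), []byte("PUT "), []byte("DELETE "),
	[]byte("HEAD "), []byte("OPTIONS "), []byte("PATCH "), []byte("CONNECT "),
}

func isHTTPRequest(payload []byte) bool {
	for _, m := range httpMethods {
		if bytes.HasPrefix(payload, m) {
			return true
		}
	}
	return false
}

func isHTTPResponse(payload []byte) bool {
	// "HTTP/1.x NNN ..." - version prefix followed by a three-digit status code.
	if !bytes.HasPrefix(payload, []byte("HTTP/1.")) || len(payload) < 12 {
		return false
	}
	for _, c := range payload[9:12] {
		if c < '0' || c > '9' {
			return false
		}
	}
	return true
}
```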
Database Protocol Detection
Database protocols are more complex because they use binary wire formats, but they still have distinctive signatures.
PostgreSQL uses a message-based protocol where each message has a type byte and a length. We look for the characteristic message structure and validate it to confirm it's PostgreSQL rather than random binary data.
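For illustration, a simplified Go version of that validation might look like the following; the real check in eBPF is stricter and also handles the startup message, which has no type byte:

```go
package detect

import "encoding/binary"

// Frontend message types we expect at the start of a PostgreSQL request:
// Query, Parse, Bind, Describe, Execute, Sync, Terminate.
var pgFrontendTypes = map[byte]bool{
	'Q': true, 'P': true, 'B': true, 'D': true, 'E': true, 'S': true, 'X': true,
}

// isPostgres checks for the "type byte + big-endian length" framing used by
// regular PostgreSQL messages. The length field includes itself, so it can
// never be smaller than 4; implausibly large values are rejected too.
func isPostgres(payload []byte) bool {
	if len(payload) < 5 {
		return false
	}
	if !pgFrontendTypes[payload[0]] {
		return false
	}
	msgLen := binary.BigEndian.Uint32(payload[1:5])
	return msgLen >= 4 && msgLen < 1<<24
}
```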
MySQL packets have a specific header format with length and sequence information, followed by command identifiers. We validate this structure to identify MySQL traffic.
Redis uses the RESP (REdis Serialization Protocol) format with distinctive prefix characters for different data types. These prefixes make Redis straightforward to detect.
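A simplified sketch of the RESP check, again written in Go for readability rather than the eBPF C we actually run:

```go
package detect

// RESP frames each value with a one-character type prefix followed by a line
// ending in \r\n. Client commands are arrays of bulk strings, so requests
// normally start with '*'; replies can start with any of the five prefixes.
func isRedis(payload []byte) bool {
	if len(payload) < 3 {
		return false
	}
	switch payload[0] {
	case '+', '-', ':', '$', '*':
	default:
		return false
	}
	// Require a \r\n terminator in the captured prefix to reduce false
	// positives from random binary data that happens to start with '*'.
	for i := 1; i < len(payload)-1; i++ {
		if payload[i] == '\r' && payload[i+1] == '\n' {
			return true
		}
	}
	return false
}
```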
MongoDB uses a binary wire protocol with specific message types and opcodes. We validate the message header structure to identify MongoDB traffic.
Message Queue and Other Protocols
We detect several other protocols using similar signature-based techniques, including Kafka, Cassandra, RabbitMQ, DNS, and gRPC/HTTP2. Each has distinctive header patterns that we can match in kernel space.
Why Kernel-Space Detection Matters
You might wonder why we do this detection in eBPF rather than just sending all data to userspace and parsing there. The answer is efficiency.
eBPF programs run in kernel space with extremely low overhead. By detecting protocols in the kernel, we can filter out traffic we don't care about before it ever reaches userspace. We're not copying megabytes (or sometimes gigabytes) of irrelevant data through perf buffers. And because we know the protocol, we can extract just the relevant parts of each message.
The actual parsing (extracting SQL queries from PostgreSQL messages or parsing HTTP headers) happens in our Go userspace agent. This is safer and more flexible than trying to do complex parsing in eBPF, where the verifier places strict limits on what code can run in the kernel.
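As a small example of that userspace parsing, here's a sketch of pulling the SQL text out of a PostgreSQL simple-query message; our real parser also handles the extended query protocol and truncated captures:

```go
package parse

import (
	"encoding/binary"
	"errors"
)

// extractSimpleQuery pulls the SQL text out of a PostgreSQL simple-query
// ('Q') message: one type byte, a big-endian int32 length that includes
// itself, then a NUL-terminated query string.
func extractSimpleQuery(msg []byte) (string, error) {
	if len(msg) < 6 || msg[0] != 'Q' {
		return "", errors.New("not a simple query message")
	}
	msgLen := int(binary.BigEndian.Uint32(msg[1:5]))
	if msgLen < 5 || 1+msgLen > len(msg) {
		return "", errors.New("truncated or malformed message")
	}
	// The query text sits between the length field and the trailing NUL byte.
	return string(msg[5 : 1+msgLen-1]), nil
}
```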
The Encryption Problem
Everything we've described so far works great for plaintext traffic. But in the real world, most traffic is encrypted with TLS.
When an application uses TLS, the data that reaches the syscall layer is encrypted. If we capture bytes from a write syscall on a TLS connection, we see something like:
\x17\x03\x03\x00\x5a\x8c\x2f\x9d\x7a...
That's a TLS application data record, and without the encryption keys, we can't see the HTTP request or PostgreSQL query inside.
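Those leading bytes are the TLS record header, and recognizing the header is easy; the problem is that everything after it is ciphertext. A minimal sketch of the header check:

```go
package detect

// TLS records start with a 5-byte header: a content type, a legacy protocol
// version, and a big-endian payload length. Content type 0x17 means
// "application data" - the encrypted bytes we cannot read from the syscall
// layer alone.
func isTLSApplicationData(payload []byte) bool {
	if len(payload) < 5 {
		return false
	}
	return payload[0] == 0x17 && payload[1] == 0x03
}
```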
Traditional solutions like mTLS termination or service mesh sidecars can decrypt traffic, but they add latency and operational complexity. They also don't help if you want to observe traffic that doesn't go through your mesh.
The key observation is that TLS libraries have specific points in their code where plaintext exists:
- Before encryption: When the application writes data, the plaintext is in the input buffer before the library encrypts it
- After decryption: When the application reads data, the plaintext is available after the library decrypts it
If we can intercept at these points, we get the plaintext data without needing encryption keys.
flowchart LR
subgraph Application
A[Plaintext Data]
end
subgraph TLS Library
B[Write Function]
C[Encrypt]
D[Decrypt]
E[Read Function]
end
subgraph Network
F[Encrypted Data]
end
A -->|"uprobe captures here"| B
B --> C
C --> F
F --> D
D --> E
E -->|"uprobe captures here"| G[Plaintext Data]
TLS Interception with Uprobes
While tracepoints attach to kernel events, uprobes let us attach to function entry and exit points in userspace binaries. We can intercept any function in any executable, including TLS library functions.
Go TLS Interception
Go doesn't use OpenSSL. It has its own TLS implementation in the crypto/tls package, which means we need a separate approach for Go services.
We attach uprobes to Go's TLS read and write functions. The challenge is that Go uses a register-based calling convention that differs between architectures, so we need to carefully extract function arguments from the right CPU registers.
We also need to track goroutine context to correctly correlate function entries with exits, since Go's concurrency model means multiple goroutines may be performing TLS operations simultaneously.
On writes, we capture the plaintext data before Go encrypts it. On reads, we capture the data on function exit after Go has decrypted it.
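A sketch of the attach step using cilium/ebpf, with an illustrative program name (and glossing over the return-side probes, which on Go binaries are placed on return instructions rather than uretprobes because goroutine stacks can move):

```go
package tlsprobe

import (
	"github.com/cilium/ebpf"
	"github.com/cilium/ebpf/link"
)

// attachGoTLSWrite attaches an entry uprobe to crypto/tls.(*Conn).Write in a
// target Go binary, where the plaintext buffer is still visible in the
// function's arguments. The read side needs data at function exit; those
// probes are attached at the function's return instructions (omitted here
// for brevity).
func attachGoTLSWrite(binaryPath string, prog *ebpf.Program) (link.Link, error) {
	ex, err := link.OpenExecutable(binaryPath)
	if err != nil {
		return nil, err
	}
	return ex.Uprobe("crypto/tls.(*Conn).Write", prog, nil)
}
```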
OpenSSL and BoringSSL Interception
Most other languages (Python, Node.js, Ruby, Java native code, C/C++) use OpenSSL or its Google fork BoringSSL. We intercept the standard SSL read and write functions to capture plaintext before encryption and after decryption.
OpenSSL has a well-documented C calling convention, making it relatively straightforward to extract function parameters. However, there are two significant challenges:
Version compatibility: OpenSSL's internal structures have changed significantly across major versions. We need to handle these differences to correctly extract connection information and correlate SSL sessions with network sockets.
Memory BIOs: Some runtimes don't connect SSL objects directly to sockets. Instead, they use memory buffers where encrypted data is staged before being sent over the network. We handle these cases to ensure we don't miss traffic from applications using this pattern.
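The attach step for OpenSSL looks similar; here's a simplified sketch with illustrative program names:

```go
package tlsprobe

import (
	"github.com/cilium/ebpf"
	"github.com/cilium/ebpf/link"
)

// attachOpenSSL attaches uprobes to SSL_write and SSL_read in a shared
// libssl. Entry probes record the buffer pointers; a return probe on
// SSL_read captures the plaintext once decryption has filled the buffer.
func attachOpenSSL(libsslPath string, progs map[string]*ebpf.Program) ([]link.Link, error) {
	ex, err := link.OpenExecutable(libsslPath)
	if err != nil {
		return nil, err
	}
	var links []link.Link
	attach := func(l link.Link, err error) error {
		if err != nil {
			return err
		}
		links = append(links, l)
		return nil
	}
	if err := attach(ex.Uprobe("SSL_write", progs["ssl_write_enter"], nil)); err != nil {
		return nil, err
	}
	if err := attach(ex.Uprobe("SSL_read", progs["ssl_read_enter"], nil)); err != nil {
		return nil, err
	}
	if err := attach(ex.Uretprobe("SSL_read", progs["ssl_read_exit"], nil)); err != nil {
		return nil, err
	}
	return links, nil
}
```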
From Kernel to Metoro: The Data Pipeline
We've covered how we capture data and detect protocols. Now let's look at how that data flows from the kernel to Metoro.
Perf Ring Buffers
eBPF programs communicate with userspace through perf ring buffers. These are circular buffers in kernel memory that eBPF programs write to and userspace programs read from. They're designed for high-throughput event streaming.
We use per-CPU buffers to avoid contention. Each CPU core has its own buffer, so eBPF programs on different cores don't block each other. Our Go agent reads from all buffers concurrently.
We tune buffer sizes based on event volume and importance. L7 events that include payload data need larger buffers than simple process lifecycle events.
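Reading those per-CPU rings from the Go agent with cilium/ebpf looks roughly like this; the buffer size and function names are illustrative:

```go
package agent

import (
	"errors"
	"log"
	"os"

	"github.com/cilium/ebpf"
	"github.com/cilium/ebpf/perf"
)

func readEvents(events *ebpf.Map) {
	// The reader multiplexes all per-CPU rings behind a single Read() call.
	// Payload-carrying L7 events get a generous per-CPU buffer; a smaller
	// size would do for low-volume event types.
	rd, err := perf.NewReader(events, 8*os.Getpagesize())
	if err != nil {
		log.Fatal(err)
	}
	defer rd.Close()

	for {
		record, err := rd.Read()
		if err != nil {
			if errors.Is(err, perf.ErrClosed) {
				return
			}
			log.Printf("perf read: %v", err)
			continue
		}
		if record.LostSamples > 0 {
			// The ring overflowed before we drained it; worth alerting on.
			log.Printf("dropped %d samples on CPU %d", record.LostSamples, record.CPU)
			continue
		}
		handleRawEvent(record.RawSample) // hand off to the parsing pipeline
	}
}

func handleRawEvent(raw []byte) {
	// Deserialization and protocol parsing happen here.
	_ = raw
}
```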
Userspace Processing
Our Go agent, built on the cilium/ebpf library, loads eBPF programs and reads from perf buffers. When events arrive, we:
- Deserialize the binary event data into Go structs
- Correlate events: match syscall data with connection info, process info, and pod metadata
- Parse protocols: extract HTTP paths, SQL queries, Redis commands, etc.
- Match requests with responses to calculate latency
- Enrich with Kubernetes metadata: pod name, namespace, service, deployment
The Kubernetes enrichment uses cgroup and namespace information from the kernel events. Since containers are built on cgroups, we can map from kernel-level cgroup IDs to Kubernetes pods.
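A simplified sketch of the first deserialization step and the cgroup-to-pod join: the struct layout and map contents below are illustrative, but the pattern (binary-decode a fixed-size kernel struct, then look up pod metadata by cgroup ID) is the important part:

```go
package agent

import (
	"bytes"
	"encoding/binary"
)

// rawL7Event mirrors the C struct our eBPF program emits (field names and
// sizes here are illustrative). Fixed-size fields let us decode it with
// encoding/binary straight from the perf record.
type rawL7Event struct {
	PID        uint32
	FD         uint32
	CgroupID   uint64
	Protocol   uint8
	_          [3]uint8 // padding to match the kernel struct's alignment
	PayloadLen uint32
	Payload    [1024]byte
}

func decodeL7Event(raw []byte) (*rawL7Event, error) {
	var ev rawL7Event
	if err := binary.Read(bytes.NewReader(raw), binary.LittleEndian, &ev); err != nil {
		return nil, err
	}
	return &ev, nil
}

// podInfo is the enrichment target: cgroup IDs reported by the kernel map to
// pod metadata kept up to date from the Kubernetes API.
type podInfo struct{ Name, Namespace, Service string }

var cgroupToPod = map[uint64]podInfo{} // maintained by a watcher elsewhere

func enrich(ev *rawL7Event) (podInfo, bool) {
	pod, ok := cgroupToPod[ev.CgroupID]
	return pod, ok
}
```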
Export
Finally, we export the processed data to Metoro's backend using OTLP (OpenTelemetry Protocol). This includes:
- Distributed traces showing request flow across services
- Metrics for request rates, latencies, and error rates
- Database query details including actual SQL statements
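Setting up an OTLP trace exporter with the OpenTelemetry Go SDK looks roughly like this. The endpoint is illustrative, and our pipeline builds span data from the correlated kernel events rather than live tracer calls, but the export path is the same protocol:

```go
package export

import (
	"context"

	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// newTracerProvider wires up an OTLP/gRPC exporter pointed at a collector
// endpoint. Spans constructed from eBPF events are then emitted through the
// returned provider.
func newTracerProvider(ctx context.Context) (*sdktrace.TracerProvider, error) {
	exp, err := otlptracegrpc.New(ctx,
		otlptracegrpc.WithEndpoint("ingest.example.com:4317"),
	)
	if err != nil {
		return nil, err
	}
	return sdktrace.NewTracerProvider(sdktrace.WithBatcher(exp)), nil
}
```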
flowchart TB
subgraph Kernel Space
A[Syscall Tracepoints] --> B[eBPF Programs]
C[TLS Uprobes] --> B
B --> D[Protocol Detection]
D --> E[Perf Ring Buffers]
end
subgraph Userspace
E --> F[Go Agent]
F --> G[Parse & Correlate]
G --> H[K8s Enrichment]
end
H -->|OTLP| I[Metoro Backend]
Putting It All Together: Tracing a Real Request
Let's walk through what happens when a request flows through a system instrumented with Metoro:
- HTTPS request arrives at a Go service. Our TLS uprobe captures the decrypted HTTP request including method, path, and headers.
- Protocol detection in kernel space identifies this as HTTP. The event flows through the perf buffer to our Go agent.
- The service queries PostgreSQL to fetch data. Our syscall tracepoint captures the write to the PostgreSQL connection. Protocol detection identifies it as PostgreSQL, and we extract the SQL query.
- PostgreSQL responds and our tracepoint captures the response. We match this with the original query to calculate database latency.
- The service responds to the original HTTP request. Our TLS uprobe captures the response as it's written.
- Metoro correlates all these events into a distributed trace showing: HTTP request → PostgreSQL query → HTTP response, with latencies at each step.
The engineer using Metoro sees this as a complete trace without having added any instrumentation code. They can see the exact PostgreSQL query, the response time, and how it fits into the broader request flow.
Why This Matters for AI-Powered Observability
eBPF fundamentally changes what's possible for observability, and by extension, what's possible for AI-powered operations.
Most AI observability tools fail because they're working with incomplete data. They analyze sampled traces, miss uninstrumented services, and can't see into encrypted traffic. When the AI tries to find the root cause of an issue, it's working with fragments.
With eBPF, we see everything:
- Every HTTP request between services, even through TLS
- Every database query, including the actual SQL
- Every message queue interaction
- All without any code changes or sampling
This complete visibility is what powers Metoro Guardian, our AI SRE. When Guardian investigates an issue, it has access to 100% of the traffic flowing through your cluster. It can trace a failing request through every service it touches, see the exact database queries that ran, and identify where things went wrong, even for issues you don't have monitors for.
The combination of kernel-level data capture and AI-powered analysis means you get observability that works from day one, finds issues automatically, and actually has enough context to identify root causes accurately.
If you want to see how this works in practice, check out Metoro.