Loki and Distributed Tracing
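This section covers Loki, a lightweight logging solution, and Jaeger, a distributed tracing system.
Loki - Lightweight Logging
Loki is a log aggregation system developed by Grafana Labs. It is lighter than an EFK stack (Elasticsearch, Fluentd, Kibana) because it indexes only labels rather than the full log text.
Loki vs Elasticsearch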
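| Feature | Loki | Elasticsearch |
|---|---|---|
| Indexing | Labels only | Full text |
| Storage cost | Low | High |
| Query speed | Fast (label queries) | Fast (full-text search) |
| Resource usage | Low | High |
| Best suited for | Structured logs | Full-text search |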
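To make "labels only" concrete, here is a toy in-memory model, not Loki's actual storage engine. The class name `LabelIndexedStore` and its methods are hypothetical. It shows the two-step query shape: streams are selected by label (the indexed part), then the selected lines are scanned for text.

```javascript
// Toy model of label-only indexing (hypothetical, for illustration only):
// streams are keyed by their label set; log lines inside a stream are not indexed.
class LabelIndexedStore {
  constructor() {
    this.streams = new Map(); // canonical label string -> { labels, lines }
  }

  // Canonical key: labels sorted by name, so key order does not matter.
  static key(labels) {
    return Object.keys(labels).sort().map((k) => `${k}=${labels[k]}`).join(',');
  }

  push(labels, line) {
    const key = LabelIndexedStore.key(labels);
    if (!this.streams.has(key)) {
      this.streams.set(key, { labels: { ...labels }, lines: [] });
    }
    this.streams.get(key).lines.push(line);
  }

  // Step 1: pick streams whose labels match the selector (index lookup).
  // Step 2: scan the lines of those streams for the text (brute-force filter).
  query(selector, contains = '') {
    const result = [];
    for (const { labels, lines } of this.streams.values()) {
      const matches = Object.entries(selector).every(([k, v]) => labels[k] === v);
      if (!matches) continue;
      for (const line of lines) {
        if (line.includes(contains)) result.push(line);
      }
    }
    return result;
  }
}

const store = new LabelIndexedStore();
store.push({ app: 'myapp', namespace: 'production' }, 'error: db timeout');
store.push({ app: 'myapp', namespace: 'production' }, 'request ok');
store.push({ app: 'other', namespace: 'production' }, 'error: disk full');
console.log(store.query({ app: 'myapp' }, 'error')); // [ 'error: db timeout' ]
```

The trade-off in the table follows from this shape: the index stays small because it only holds label sets, while text filters cost a scan over the selected streams.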
Architecture
┌─────────────┐
│ Application │
│   (logs)    │
└──────┬──────┘
       │
┌──────▼──────┐
│  Promtail   │ ◄── collects logs (lightweight)
│  DaemonSet  │
└──────┬──────┘
       │
┌──────▼──────┐
│    Loki     │ ◄── stores logs (indexes labels only)
└──────┬──────┘
       │
┌──────▼──────┐
│   Grafana   │ ◄── query and display
└─────────────┘
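The Promtail-to-Loki hop in the diagram is an HTTP push. As a rough sketch of what travels over it, the helper below builds a JSON body in the shape Loki's push endpoint (`/loki/api/v1/push`) accepts: one label set per stream, plus `[timestamp, line]` pairs with the timestamp as a string of nanoseconds since the Unix epoch. The function name `buildPushPayload` and the sample values are assumptions made for illustration.

```javascript
// Build a JSON body for Loki's push endpoint (POST /loki/api/v1/push).
// `buildPushPayload` is a hypothetical helper, not part of any Loki client.
function buildPushPayload(labels, entries) {
  return {
    streams: [
      {
        stream: labels, // the label set that identifies this stream
        values: entries.map(({ timeMs, line }) => [
          // Loki expects nanoseconds as a string; BigInt avoids precision loss.
          (BigInt(timeMs) * 1000000n).toString(),
          line,
        ]),
      },
    ],
  };
}

const payload = buildPushPayload(
  { app: 'myapp', namespace: 'production' },
  [{ timeMs: 1704715200000, line: 'Database connection failed' }],
);
console.log(JSON.stringify(payload, null, 2));
```

In a real cluster Promtail does this batching and pushing for you; the sketch is only meant to show that labels travel with each stream while log lines stay opaque text.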
Installing the Loki Stack
# Add the Grafana Helm repository
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
# Install the Loki Stack (Loki + Promtail + Grafana; Prometheus is also enabled below)
helm install loki grafana/loki-stack \
  --namespace logging \
  --create-namespace \
  --set grafana.enabled=true \
  --set prometheus.enabled=true \
  --set promtail.enabled=true \
  --set loki.persistence.enabled=true \
  --set loki.persistence.size=10Gi
Promtail Configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: promtail-config
  namespace: logging
data:
  promtail.yaml: |
    server:
      http_listen_port: 9080
      grpc_listen_port: 0
    positions:
      filename: /tmp/positions.yaml
    clients:
      - url: http://loki:3100/loki/api/v1/push
    scrape_configs:
      # Kubernetes Pod logs
      - job_name: kubernetes-pods
        kubernetes_sd_configs:
          - role: pod
        pipeline_stages:
          # Parses Docker JSON log lines; on containerd or CRI-O nodes, `cri: {}` is the usual choice
          - docker: {}
        relabel_configs:
          # Only scrape Pods annotated with prometheus.io/scrape: "true"
          - source_labels:
              - __meta_kubernetes_pod_annotation_prometheus_io_scrape
            action: keep
            regex: true
          # Add the namespace label
          - source_labels:
              - __meta_kubernetes_namespace
            target_label: namespace
          # Add the pod label
          - source_labels:
              - __meta_kubernetes_pod_name
            target_label: pod
          # Add the container label
          - source_labels:
              - __meta_kubernetes_pod_container_name
            target_label: container
          # Add the app label
          - source_labels:
              - __meta_kubernetes_pod_label_app
            target_label: app
          # Log file path
          - replacement: /var/log/pods/*$1/*.log
            separator: /
            source_labels:
              - __meta_kubernetes_pod_uid
              - __meta_kubernetes_pod_container_name
            target_label: __path__
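The last relabel rule is the least obvious one, so here is a simplified model of how a Prometheus-style `replace` rule produces `__path__`: the source label values are joined with the separator, matched against the regex (assumed here to be the default `(.*)`), and substituted into the replacement. The function `applyReplace` and the sample UID are hypothetical; this is a sketch of the semantics, not Promtail's implementation.

```javascript
// Simplified model of one relabel rule with the default action "replace".
// `applyReplace` is a hypothetical helper for illustration.
function applyReplace(labels, rule) {
  const {
    source_labels,
    separator = ';',
    regex = '(.*)',
    replacement = '$1',
    target_label,
  } = rule;
  // 1. Join the source label values with the separator.
  const joined = source_labels.map((name) => labels[name] ?? '').join(separator);
  // 2. Match the whole joined value against the (anchored) regex.
  const match = new RegExp(`^(?:${regex})$`).exec(joined);
  if (!match) return labels; // rule does not apply
  // 3. Expand $1, $2, ... in the replacement and write the target label.
  const value = replacement.replace(/\$(\d+)/g, (_, n) => match[Number(n)] ?? '');
  return { ...labels, [target_label]: value };
}

const relabeled = applyReplace(
  {
    __meta_kubernetes_pod_uid: 'abc-123', // made-up UID
    __meta_kubernetes_pod_container_name: 'api',
  },
  {
    replacement: '/var/log/pods/*$1/*.log',
    separator: '/',
    source_labels: ['__meta_kubernetes_pod_uid', '__meta_kubernetes_pod_container_name'],
    target_label: '__path__',
  },
);
console.log(relabeled.__path__); // /var/log/pods/*abc-123/api/*.log
```

The leading `*` before the UID is what lets the glob match the `<namespace>_<pod-name>_` prefix that the kubelet puts in front of the UID in the directory name.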
Promtail DaemonSet
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: promtail
  namespace: logging
spec:
  selector:
    matchLabels:
      app: promtail
  template:
    metadata:
      labels:
        app: promtail
    spec:
      serviceAccountName: promtail
      containers:
        - name: promtail
          image: grafana/promtail:2.9.0
          args:
            - -config.file=/etc/promtail/promtail.yaml
          volumeMounts:
            - name: config
              mountPath: /etc/promtail
            - name: varlog
              mountPath: /var/log
            - name: varlibdockercontainers
              mountPath: /var/lib/docker/containers
              readOnly: true
          env:
            - name: HOSTNAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 200m
              memory: 256Mi
      volumes:
        - name: config
          configMap:
            name: promtail-config
        - name: varlog
          hostPath:
            path: /var/log
        - name: varlibdockercontainers
          hostPath:
            path: /var/lib/docker/containers
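Querying Loki Logs in Grafana
# Add Loki as a data source in Grafana:
# Configuration → Data Sources → Add data source → Loki
# URL: http://loki:3100
# LogQL query examples:
# 1. Logs from a specific namespace
{namespace="production"}
# 2. Logs from a specific app
{app="myapp"}
# 3. Lines that contain a given string
{app="myapp"} |= "error"
# 4. Lines that do not contain a given string
{app="myapp"} != "debug"
# 5. Regular expression match
{app="myapp"} |~ "error|failed"
# 6. Parse JSON and filter on an extracted field
{app="myapp"} | json | level="error"
# 7. Per-second rate of lines containing "error", averaged over 5 minutes
rate({app="myapp"} |= "error" [5m])
# 8. Match on multiple labels
{namespace="production", app="myapp", container="api"}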
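The same LogQL queries can be sent to Loki outside Grafana through its HTTP API. The sketch below only builds the request URL for the `query_range` endpoint, which takes `query`, `start`, `end` (nanosecond Unix timestamps) and `limit` parameters. The helper name `buildQueryRangeUrl` and the time window are assumptions chosen for the example.

```javascript
// Build a URL for Loki's range-query endpoint (GET /loki/api/v1/query_range).
// `buildQueryRangeUrl` is a hypothetical helper for illustration.
function buildQueryRangeUrl(baseUrl, logql, { startMs, endMs, limit = 100 }) {
  const url = new URL('/loki/api/v1/query_range', baseUrl);
  url.searchParams.set('query', logql); // URL-encoding is handled by searchParams
  url.searchParams.set('start', (BigInt(startMs) * 1000000n).toString());
  url.searchParams.set('end', (BigInt(endMs) * 1000000n).toString());
  url.searchParams.set('limit', String(limit));
  return url.toString();
}

const queryUrl = buildQueryRangeUrl('http://loki:3100', '{app="myapp"} |= "error"', {
  startMs: 1704715200000, // 2024-01-08T12:00:00Z
  endMs: 1704718800000, // one hour later
});
console.log(queryUrl);
```

On Node.js 18 or later, the result could then be fetched with the built-in `fetch(queryUrl)`, assuming the Loki service is reachable from where the script runs.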
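Jaeger - Distributed Tracing
Jaeger is a CNCF distributed tracing system for following requests as they flow through a microservice architecture.
Distributed Tracing Concepts
User request → API Gateway → Service A → Service B → Database
               [Span 1]      [Span 2]    [Span 3]    [Span 4]
               └────────────── Trace ID: abc123 ──────────┘
- Trace: the complete path of one request
- Span: a single operation (such as an HTTP request or a database query)
- Trace ID: uniquely identifies a trace
- Span ID: uniquely identifies a span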
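A trace stays connected across services because the Trace ID is passed along with each request. One common carrier is the W3C Trace Context `traceparent` header, which OpenTelemetry SDKs use by default. The sketch below parses that header and builds the value a service would send downstream; the function names and the child span ID are hypothetical, and a real application would rely on the SDK's propagator instead.

```javascript
// Parse a W3C Trace Context `traceparent` header:
//   version "-" trace-id (32 hex) "-" parent span id (16 hex) "-" flags (2 hex)
// Hypothetical helpers for illustration; SDK propagators do this in practice.
function parseTraceparent(header) {
  const match = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header);
  if (!match) return null;
  const [, version, traceId, spanId, flags] = match;
  // All-zero IDs are defined as invalid by the specification.
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return null;
  return { version, traceId, spanId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

// Header for the next hop: same trace ID, but the caller's own new span ID.
function childTraceparent(parent, newSpanId) {
  return `${parent.version}-${parent.traceId}-${newSpanId}-${parent.sampled ? '01' : '00'}`;
}

const incoming = parseTraceparent('00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01');
console.log(incoming.traceId); // 4bf92f3577b34da6a3ce929d0e0e4736
console.log(childTraceparent(incoming, 'b7ad6b7169203331'));
```

Every span in the chain shares the 32-character trace ID, while each hop replaces the 16-character span ID with its own, which is how the backend rebuilds the parent/child tree.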
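Installing the Jaeger Operator
# Create the namespace
kubectl create namespace tracing
# Install the Jaeger Operator (recent operator releases expect cert-manager to be installed first).
# Note: the upstream manifest targets the "observability" namespace; if kubectl reports a
# namespace mismatch, edit the namespace in the manifest or install into "observability".
kubectl create -f https://github.com/jaegertracing/jaeger-operator/releases/download/v1.51.0/jaeger-operator.yaml -n tracing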
Deploying a Jaeger Instance
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: jaeger
  namespace: tracing
spec:
  strategy: production
  # Storage (Elasticsearch)
  storage:
    type: elasticsearch
    options:
      es:
        server-urls: http://elasticsearch.logging:9200
        index-prefix: jaeger
  # Collector
  collector:
    maxReplicas: 5
    resources:
      limits:
        cpu: 500m
        memory: 512Mi
  # Query UI
  query:
    replicas: 2
    resources:
      limits:
        cpu: 500m
        memory: 512Mi
  # Agent
  agent:
    strategy: DaemonSet
    resources:
      limits:
        cpu: 200m
        memory: 128Mi
Simplified Deployment (All-in-One)
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: jaeger-simple
  namespace: tracing
spec:
  strategy: allInOne
  allInOne:
    image: jaegertracing/all-in-one:latest
    options:
      memory:
        max-traces: 100000
  storage:
    type: memory
  ingress:
    enabled: true
Integrating Applications with OpenTelemetry
OpenTelemetry is a CNCF observability framework that unifies metrics, logs, and traces.
Node.js Integration
// Install dependencies
// npm install @opentelemetry/sdk-node @opentelemetry/auto-instrumentations-node @opentelemetry/exporter-jaeger
// Note: @opentelemetry/exporter-jaeger is deprecated upstream; recent Jaeger versions
// also accept OTLP, so an OTLP trace exporter is an alternative.
// tracing.js
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { JaegerExporter } = require('@opentelemetry/exporter-jaeger');
const { Resource } = require('@opentelemetry/resources');
const { SemanticResourceAttributes } = require('@opentelemetry/semantic-conventions');
const jaegerExporter = new JaegerExporter({
  endpoint: 'http://jaeger-collector:14268/api/traces',
});
const sdk = new NodeSDK({
  resource: new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: 'myapp',
    [SemanticResourceAttributes.SERVICE_VERSION]: '1.0.0',
  }),
  traceExporter: jaegerExporter,
  instrumentations: [getNodeAutoInstrumentations()],
});
sdk.start();
// Flush pending spans before the process exits
process.on('SIGTERM', () => {
  sdk.shutdown()
    .then(() => console.log('Tracing terminated'))
    .catch((error) => console.log('Error terminating tracing', error))
    .finally(() => process.exit(0));
});
// app.js
// Load tracing first so that auto-instrumentation can patch express
require('./tracing');
const express = require('express');
const app = express();
app.get('/api/users', async (req, res) => {
  // A span for this request is created automatically
  const users = await fetchUsers();
  res.json(users);
});
app.listen(3000);
Creating Spans Manually
const opentelemetry = require('@opentelemetry/api');
async function fetchUsers() {
  const tracer = opentelemetry.trace.getTracer('myapp');
  // Create a span
  const span = tracer.startSpan('fetchUsers');
  try {
    // Add attributes
    span.setAttribute('db.system', 'mongodb');
    span.setAttribute('db.name', 'users');
    // Run the business logic
    const users = await db.collection('users').find().toArray();
    // Add an event
    span.addEvent('users fetched', {
      count: users.length
    });
    return users;
  } catch (error) {
    // Record the error
    span.recordException(error);
    span.setStatus({ code: opentelemetry.SpanStatusCode.ERROR });
    throw error;
  } finally {
    // Always end the span
    span.end();
  }
}
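When several functions need the same try/catch/finally handling, the pattern can be pulled into a small wrapper. The sketch below is one possible shape: `withSpan` is a hypothetical helper that accepts any object exposing `startSpan(name)`, such as an OpenTelemetry tracer. To keep the example runnable without installing packages, it uses a made-up in-memory tracer, and the constant `2` mirrors the value of `SpanStatusCode.ERROR` in `@opentelemetry/api`.

```javascript
// Value of SpanStatusCode.ERROR in @opentelemetry/api (UNSET = 0, OK = 1, ERROR = 2).
const SPAN_STATUS_ERROR = 2;

// Hypothetical helper: run `fn` inside a span, record failures, always end the span.
async function withSpan(tracer, name, fn) {
  const span = tracer.startSpan(name);
  try {
    return await fn(span);
  } catch (error) {
    span.recordException(error);
    span.setStatus({ code: SPAN_STATUS_ERROR, message: error.message });
    throw error;
  } finally {
    span.end();
  }
}

// Made-up in-memory tracer, only here so the sketch runs without the SDK.
function createRecordingTracer() {
  const spans = [];
  return {
    spans,
    startSpan(name) {
      const span = {
        name,
        attributes: {},
        status: null,
        exceptions: [],
        ended: false,
        setAttribute(key, value) { this.attributes[key] = value; return this; },
        recordException(error) { this.exceptions.push(error); },
        setStatus(status) { this.status = status; return this; },
        end() { this.ended = true; },
      };
      spans.push(span);
      return span;
    },
  };
}

const demoTracer = createRecordingTracer();
withSpan(demoTracer, 'fetchUsers', async (span) => {
  span.setAttribute('db.system', 'mongodb');
  return ['alice', 'bob'];
}).then((users) => console.log(users.length, demoTracer.spans[0].ended));
```

With a real tracer, the call site would look the same, for example `withSpan(opentelemetry.trace.getTracer('myapp'), 'fetchUsers', ...)`.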
Go Integration
package main
import (
	"context"
	"log"
	"go.opentelemetry.io/otel"
	// Note: this Jaeger exporter module has been deprecated upstream in favor of OTLP exporters.
	"go.opentelemetry.io/otel/exporters/jaeger"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.4.0"
)
// initTracer installs a global tracer provider and returns a cleanup function.
func initTracer() func() {
	exporter, err := jaeger.New(jaeger.WithCollectorEndpoint(
		jaeger.WithEndpoint("http://jaeger-collector:14268/api/traces"),
	))
	if err != nil {
		log.Fatal(err)
	}
	tp := sdktrace.NewTracerProvider(
		sdktrace.WithBatcher(exporter),
		sdktrace.WithResource(resource.NewWithAttributes(
			semconv.SchemaURL,
			semconv.ServiceNameKey.String("myapp"),
			semconv.ServiceVersionKey.String("1.0.0"),
		)),
	)
	otel.SetTracerProvider(tp)
	return func() {
		// Flush buffered spans on shutdown
		if err := tp.Shutdown(context.Background()); err != nil {
			log.Fatal(err)
		}
	}
}
func main() {
	cleanup := initTracer()
	defer cleanup()
	// Application code
}
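Accessing the Jaeger UI
# Port-forward the query service
kubectl port-forward -n tracing svc/jaeger-query 16686:16686
# Then open http://localhost:16686
Jaeger UI Features
- Search traces: by service, operation, or tag
- Inspect details: view the span timeline and durations
- Dependency graph: view dependencies between services
- Compare traces: compare the performance of different requests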
Best Practices
1. Log Format
{
  "timestamp": "2024-01-08T12:00:00Z",
  "level": "error",
  "message": "Database connection failed",
  "service": "user-service",
  "trace_id": "abc123",
  "span_id": "def456",
  "user_id": "user123",
  "error": {
    "type": "ConnectionError",
    "message": "Connection timeout after 5s"
  }
}
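A log entry in this shape can be produced by a small formatter. The sketch below is one way to do it with no dependencies; `formatLogEntry` and its parameter names are hypothetical, and in practice a logging library's JSON formatter would usually fill this role.

```javascript
// Hypothetical formatter that emits one JSON object per log line,
// using the field names from the format above.
function formatLogEntry({ level, message, service, traceId, spanId, error, now = new Date(), ...extra }) {
  const entry = {
    timestamp: now.toISOString(),
    level,
    message,
    service,
    trace_id: traceId,
    span_id: spanId,
    ...extra, // additional context such as user_id
  };
  if (error) {
    entry.error = { type: error.name, message: error.message };
  }
  // Fields that are undefined are dropped by JSON.stringify.
  return JSON.stringify(entry);
}

class ConnectionError extends Error {
  constructor(message) {
    super(message);
    this.name = 'ConnectionError';
  }
}

const logLine = formatLogEntry({
  level: 'error',
  message: 'Database connection failed',
  service: 'user-service',
  traceId: 'abc123',
  spanId: 'def456',
  user_id: 'user123',
  error: new ConnectionError('Connection timeout after 5s'),
  now: new Date('2024-01-08T12:00:00Z'),
});
console.log(logLine);
```

Keeping one JSON object per line is what makes the `| json` stage in LogQL usable for filtering on fields such as `level` or `trace_id`.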
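2. Sampling Strategy
// Use sampling in production to reduce overhead
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { ParentBasedSampler, TraceIdRatioBasedSampler } = require('@opentelemetry/sdk-trace-base');
const sdk = new NodeSDK({
  // Follow the parent span's decision; for root spans, sample 10% of traces.
  // To sample 10% regardless of the parent, use instead:
  //   sampler: new TraceIdRatioBasedSampler(0.1),
  sampler: new ParentBasedSampler({
    root: new TraceIdRatioBasedSampler(0.1),
  }),
});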
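Ratio-based sampling works because the decision is derived from the trace ID itself, so every service that sees the same trace reaches the same answer without coordination. The function below is a deliberately simplified sketch of that idea, not the SDK's actual algorithm: it maps the first 8 hex characters of the trace ID to a number and compares it with the ratio. The name `shouldSample` is hypothetical.

```javascript
// Simplified, deterministic ratio sampling keyed on the trace ID.
// This illustrates the idea only; the OpenTelemetry SDK uses its own algorithm.
function shouldSample(traceId, ratio) {
  // First 8 hex characters -> integer in the range 0 .. 2^32 - 1
  const bucket = parseInt(traceId.slice(0, 8), 16);
  // Keep the trace if its bucket falls in the lowest `ratio` share of the range.
  return bucket < ratio * 2 ** 32;
}

const sampleTraceId = '4bf92f3577b34da6a3ce929d0e0e4736';
console.log(shouldSample(sampleTraceId, 0.1)); // false
console.log(shouldSample(sampleTraceId, 0.5)); // true
```

Because the result depends only on the trace ID and the ratio, calling the function again in another service yields the same decision, which avoids traces with missing spans.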
3. Correlating Logs and Traces
const logger = require('winston');
const opentelemetry = require('@opentelemetry/api');
function log(level, message) {
  // Attach the active span's IDs, if any, so logs can be joined with traces
  const span = opentelemetry.trace.getActiveSpan();
  const context = span ? {
    trace_id: span.spanContext().traceId,
    span_id: span.spanContext().spanId,
  } : {};
  logger.log(level, message, context);
}
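Once `trace_id` is written into every JSON log line, going from a trace in Jaeger back to its logs in Loki is a matter of building the right LogQL query. The helper below is a small sketch of that step; `logqlForTrace` is a hypothetical name, and it assumes logs are JSON with a `trace_id` field and streams carry an `app` label, as in the examples in this section.

```javascript
// Hypothetical helper: build a LogQL query that finds all log lines of one trace.
// Assumes JSON logs with a `trace_id` field and an `app` stream label.
function logqlForTrace(app, traceId) {
  // Escape backslashes and double quotes so the values stay inside the LogQL strings.
  const esc = (s) => s.replace(/\\/g, '\\\\').replace(/"/g, '\\"');
  return `{app="${esc(app)}"} | json | trace_id="${esc(traceId)}"`;
}

console.log(logqlForTrace('myapp', 'abc123'));
// {app="myapp"} | json | trace_id="abc123"
```

The resulting string can be pasted into Grafana's Explore view or sent to Loki's query API.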
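Summary
This section covered lightweight logging and tracing tools:
✅ Loki: lightweight log aggregation that indexes only labels
✅ Promtail: collects logs as a DaemonSet
✅ LogQL: querying Loki logs in Grafana
✅ Jaeger: distributed tracing system
✅ OpenTelemetry: unified observability framework
✅ Application integration: Node.js and Go examples
Next section: Monitoring Best Practices.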