The most popular metrics library at the moment is dropwizard/metrics by Coda Hale. It is widely used in well-known open-source projects such as Hadoop, Kafka, Spark and JStorm.
  
   
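It has several advantages:
- Integrations with Ehcache, Apache HttpClient, JDBI, Jersey, Jetty, Log4J, Logback, the JVM and more
- Several metric types: Gauges, Counters, Meters, Histograms and Timers
- Several Reporters for publishing metrics:
  - JMX, Console, CSV files and SLF4J loggers
  - Ganglia and Graphite, for graphical display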
 
MetricRegistry
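The MetricRegistry class is the core of Metrics: it is the container that holds every metric in the application, and it is the starting point for using the library. The Maven dependencies are listed at the end of this post.
```java
// One shared registry for the whole application.
static final MetricRegistry metrics = new MetricRegistry();
```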
Reporter
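Once metrics have been collected they need to be published somewhere, which is what a Reporter does.
Console
Prints the metrics directly to the console.
```java
// Print every metric in the registry to stdout once per second.
private static void startReportConsole() {
    ConsoleReporter reporter = ConsoleReporter.forRegistry(metrics)
            .convertRatesTo(TimeUnit.SECONDS)
            .convertDurationsTo(TimeUnit.MILLISECONDS)
            .build();
    reporter.start(1, TimeUnit.SECONDS);
}
```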
JMX
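Exposes the metrics over JMX. Other open-source tools can then forward them to Graphite or a similar system for graphing. The metrics show up under MBeans in JConsole.
```java
// Register every metric in the registry as a JMX MBean.
private static void startReportJmx() {
    JmxReporter reporterJmx = JmxReporter.forRegistry(metrics).build();
    reporterJmx.start();
}
```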
Graphite
Sends the metrics to Graphite, where they can be viewed in Graphite-web.
```java
// Push every metric to Graphite once per minute under the "test.metrics" prefix.
private static void startReportGraphite() {
    Graphite graphite = new Graphite(new InetSocketAddress("graphite.xxx.com", 2003));
    GraphiteReporter reporter = GraphiteReporter.forRegistry(metrics)
            .prefixedWith("test.metrics")
            .convertRatesTo(TimeUnit.SECONDS)
            .convertDurationsTo(TimeUnit.MILLISECONDS)
            .filter(MetricFilter.ALL)
            .build(graphite);
    reporter.start(1, TimeUnit.MINUTES);
}
```
Wrapping the Reporters
Call it as MetricCommon.getMetricAndStartReport();
```java
public class MetricCommon {
    private static final MetricRegistry metricRegistry = new MetricRegistry();

    // Starts all three reporters and returns the shared registry.
    public static MetricRegistry getMetricAndStartReport() {
        startReportConsole();
        startReportJmx();
        startReportGraphite();
        return metricRegistry;
    }

    // Bodies as in the Reporter snippets, operating on metricRegistry.
    private static void startReportConsole() {...}
    private static void startReportJmx() {...}
    private static void startReportGraphite() {...}
}
```
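One thing to watch with a helper like this: every call to getMetricAndStartReport() starts the reporters again. A minimal stdlib-only sketch of a "start once" guard follows. `StartOnce` and `ensureStarted` are hypothetical names invented for this illustration and are not part of the Metrics library.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical helper (not part of Metrics): runs a start-up action at most once.
final class StartOnce {
    private final AtomicBoolean started = new AtomicBoolean(false);
    private final Runnable starter;

    StartOnce(Runnable starter) {
        this.starter = starter;
    }

    // Runs the starter on the first call only; returns true if this call ran it.
    boolean ensureStarted() {
        if (started.compareAndSet(false, true)) {
            starter.run();
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        StartOnce reporters = new StartOnce(() -> System.out.println("reporters started"));
        reporters.ensureStarted(); // runs the starter
        reporters.ensureStarted(); // does nothing
    }
}
```

In MetricCommon, the three startReport calls could be wrapped in such a guard so that several test classes can share the helper safely.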
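Metric types
Metrics provides the following metric types:
- Gauges: record an instantaneous value, for example the length of a pending-job queue.
- Histograms: measure the distribution of a stream of values: minimum, maximum, mean, median and percentiles (75%, 90%, 95%, 98%, 99% and 99.9%).
- Meters: measure the rate of calls (TPS): total number of requests, mean requests per second, and the average TPS over the last 1, 5 and 15 minutes.
- Timers: for when both TPS and the distribution of durations are needed. A Timer is built on a Histogram and a Meter.
- Counters: a counter with inc() and dec() methods, starting at 0.
- Health Checks: check whether the application, its submodules or related modules are running normally.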
Gauges
The simplest metric type: it just returns a single value. For example, we may want to measure the number of tasks in a pending queue.
```java
import com.codahale.metrics.Gauge;
import com.codahale.metrics.MetricRegistry;
import org.junit.Test; // JUnit 4 assumed

import java.util.LinkedList;
import java.util.Map;
import java.util.Queue;
import java.util.Random;
import java.util.concurrent.ConcurrentHashMap;

public class GaugeTest {
    private static final MetricRegistry registry = MetricCommon.getMetricAndStartReport();
    private static final Random random = new Random();

    @Test
    public void testOneGauge() throws InterruptedException {
        Queue<String> queue = new LinkedList<>();
        // The gauge reads queue.size() each time a reporter asks for the value.
        registry.register(MetricRegistry.name(GaugeTest.class, "testGauges-queue-size", "size"),
                (Gauge<Integer>) () -> queue.size());
        while (true) {
            Thread.sleep(1000);
            queue.add("Job-xxx");
        }
    }

    @Test
    public void testMultiGauge() throws InterruptedException {
        Map<Integer, Integer> map = new ConcurrentHashMap<>();
        while (true) {
            int i = random.nextInt(100);
            int j = i % 10;
            if (!map.containsKey(j)) {
                map.put(j, i);
                // register() may only be called once per name, so register on first sight of j.
                registry.register(MetricRegistry.name(GaugeTest.class, "testGauges-number", String.valueOf(j)),
                        (Gauge<Integer>) () -> map.get(j));
            } else {
                map.put(j, i);
            }
            Thread.sleep(1000);
        }
    }
}
```
The first test case uses one gauge to record the length of the queue.
```
-- Gauges ----------------------------------------------------------------------
GaugeTest.testGauges-queue-size.size
             value = 4
```
The second test case generates a random number below 100 on each iteration and groups the numbers by their last digit. Each gauge reports the most recent number seen in its group.
```
-- Gauges ----------------------------------------------------------------------
GaugeTest.testGauges-number.0
             value = 60
GaugeTest.testGauges-number.1
             value = 1
GaugeTest.testGauges-number.2
             value = 82
GaugeTest.testGauges-number.3
             value = 23
GaugeTest.testGauges-number.4
             value = 74
GaugeTest.testGauges-number.5
             value = 25
GaugeTest.testGauges-number.7
             value = 17
GaugeTest.testGauges-number.8
             value = 78
GaugeTest.testGauges-number.9
             value = 69
```
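The key property of a gauge is that it stores nothing itself: it is a callback that is evaluated whenever someone reads it. The following stdlib-only sketch illustrates that idea with a plain `Supplier`. `GaugeSketch` and `sizeGauge` are hypothetical names for this illustration, not library API.

```java
import java.util.LinkedList;
import java.util.Queue;
import java.util.function.Supplier;

// Stdlib-only sketch: a "gauge" is just a callback evaluated at read time.
final class GaugeSketch {
    // Hypothetical helper: wraps a queue in a gauge-like supplier of its current size.
    static Supplier<Integer> sizeGauge(Queue<?> queue) {
        return queue::size;
    }

    public static void main(String[] args) {
        Queue<String> queue = new LinkedList<>();
        Supplier<Integer> gauge = sizeGauge(queue);
        System.out.println("before: " + gauge.get()); // read while the queue is empty
        queue.add("Job-1");
        queue.add("Job-2");
        System.out.println("after: " + gauge.get());  // same gauge, now reflects two jobs
    }
}
```

This is why the gauge in the first test case keeps reporting a growing value even though it was registered only once.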
Histogram
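A Histogram measures the distribution of a stream of values: the minimum, maximum and mean, as well as the median and the 75th, 90th, 95th, 98th, 99th and 99.9th percentiles.
```java
import com.codahale.metrics.ExponentiallyDecayingReservoir;
import com.codahale.metrics.Histogram;
import com.codahale.metrics.MetricRegistry;
import org.junit.Test; // JUnit 4 assumed

import java.util.Random;

public class HistogramTest {
    private static final MetricRegistry registry = MetricCommon.getMetricAndStartReport();
    public static Random random = new Random();

    @Test
    public void test() throws InterruptedException {
        Histogram histogram = new Histogram(new ExponentiallyDecayingReservoir());
        registry.register(MetricRegistry.name(HistogramTest.class, "request", "histogram"), histogram);
        // Feed one uniformly distributed value in [0, 100000) per second.
        while (true) {
            Thread.sleep(1000);
            histogram.update(random.nextInt(100000));
        }
    }
}
```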
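After running for a long time, the reported percentiles converge toward the quantiles of the underlying uniform distribution on [0, 100000): the 75th percentile approaches 75000 and the 99.9th percentile approaches 99900. They are estimates from a sample, so a given report can land slightly above or below those values, as the 75% line in this run shows.
```
-- Histograms ------------------------------------------------------------------
HistogramTest.request.histogram
             count = 1336
               min = 97
               max = 99930
              mean = 49816.49
            stddev = 29435.27
            median = 49368.00
              75% <= 75803.00
              95% <= 95340.00
              98% <= 98096.00
              99% <= 98724.00
            99.9% <= 99930.00
```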
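To make the "converges toward the theoretical value" claim concrete, here is a small stdlib-only sketch that compares theoretical quantiles of a uniform distribution with nearest-rank quantiles of a random sample. It does not use the library's reservoir. `UniformQuantiles`, `theoretical` and `empirical` are names made up for this illustration, and the theoretical formula treats the integer draws as approximately continuous.

```java
import java.util.Arrays;
import java.util.Random;

// Stdlib-only sketch: theoretical vs. sampled quantiles of uniform values in [0, bound).
final class UniformQuantiles {
    // Approximate q-quantile of a uniform distribution on [0, bound).
    static double theoretical(double q, int bound) {
        return q * bound;
    }

    // Nearest-rank quantile computed on a sorted copy of the sample.
    static int empirical(int[] sample, double q) {
        int[] sorted = sample.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(q * sorted.length); // 1-based rank
        return sorted[Math.max(rank, 1) - 1];
    }

    public static void main(String[] args) {
        Random random = new Random();
        int[] sample = new int[100_000];
        for (int i = 0; i < sample.length; i++) {
            sample[i] = random.nextInt(100_000);
        }
        for (double q : new double[] {0.75, 0.95, 0.99, 0.999}) {
            System.out.printf("%.1f%%: theoretical %.0f, sampled %d%n",
                    q * 100, theoretical(q, 100_000), empirical(sample, q));
        }
    }
}
```

The sampled values vary from run to run but stay close to the theoretical ones, which matches the behaviour of the histogram output above.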
Meters
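A Meter measures the rate at which a series of events occurs, for example TPS. It reports the rate over the last 1, 5 and 15 minutes, as well as the mean rate over the whole lifetime of the meter.
```java
import com.codahale.metrics.Meter;
import com.codahale.metrics.MetricRegistry;
import org.junit.Test; // JUnit 4 assumed

import java.util.Random;

public class MetersTest {
    // getMetricAndStartAllReport(host, prefix) is a variant of the MetricCommon helper
    // that is not shown in this post.
    MetricRegistry registry = MetricCommon.getMetricAndStartAllReport("nc110x.corp.youdao.com", "test.metrics");
    public static Random random = new Random();

    @Test
    public void testOne() throws InterruptedException {
        Meter meterTps = registry.meter(MetricRegistry.name(MetersTest.class, "request", "tps"));
        while (true) {
            meterTps.mark();
            Thread.sleep(random.nextInt(1000));
        }
    }

    @Test
    public void testMulti() throws InterruptedException {
        while (true) {
            int i = random.nextInt(100);
            int j = i % 10;
            // meter() returns the existing meter for this name, or creates it on first use.
            Meter meterTps = registry.meter(MetricRegistry.name(MetersTest.class, "request", "tps", String.valueOf(j)));
            meterTps.mark();
            Thread.sleep(10);
        }
    }
}
```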
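As background on the 1-, 5- and 15-minute rates: they are moving averages that weight recent events more heavily than old ones. The sketch below is a simplified, single-threaded exponentially weighted moving average written with the stdlib only. The 5-second tick and the formula `alpha = 1 - exp(-tick / window)` reflect my understanding of the usual approach and should be read as assumptions; `EwmaSketch` is a hypothetical class, not the library's implementation.

```java
// Simplified, single-threaded EWMA sketch (assumed constants; not the library's code).
final class EwmaSketch {
    private static final double TICK_SECONDS = 5.0; // assumed tick interval

    private final double alpha;  // smoothing factor derived from the window length
    private long uncounted;      // events marked since the last tick
    private double rate;         // smoothed rate in events per second
    private boolean initialized;

    EwmaSketch(double windowMinutes) {
        this.alpha = 1 - Math.exp(-TICK_SECONDS / 60.0 / windowMinutes);
    }

    // Records n events.
    void mark(long n) {
        uncounted += n;
    }

    // Folds the events of the last tick interval into the smoothed rate.
    void tick() {
        double instantRate = uncounted / TICK_SECONDS;
        uncounted = 0;
        if (initialized) {
            rate += alpha * (instantRate - rate);
        } else {
            rate = instantRate; // first tick: start from the observed rate
            initialized = true;
        }
    }

    double ratePerSecond() {
        return rate;
    }

    public static void main(String[] args) {
        EwmaSketch oneMinute = new EwmaSketch(1.0);
        // Simulate one minute at 2 events per second (10 events per 5-second tick).
        for (int tick = 0; tick < 12; tick++) {
            oneMinute.mark(10);
            oneMinute.tick();
        }
        System.out.println("1-minute rate: " + oneMinute.ratePerSecond() + " events/second");
    }
}
```

A longer window gives a smaller alpha, so the 15-minute rate reacts more slowly to a change in load than the 1-minute rate.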
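Note that creating multiple meters works differently from registering multiple gauges or histograms. register() throws an IllegalArgumentException if the name is already taken, whereas meter() is a get-or-add operation, so it can safely be called with the same name on every iteration:
```java
public Meter meter(String name) {
    return (Meter) this.getOrAdd(name, MetricRegistry.MetricBuilder.METERS);
}
```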
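The difference between the two styles can be sketched with a plain `ConcurrentHashMap`. This is a hypothetical miniature registry for illustration only (`TinyRegistry` and its use of `LongAdder` as a stand-in metric are my own choices), not the library's code.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical miniature registry contrasting register() with get-or-add.
final class TinyRegistry {
    private final ConcurrentMap<String, LongAdder> metrics = new ConcurrentHashMap<>();

    // register(): the caller supplies the metric; a duplicate name is an error.
    LongAdder register(String name, LongAdder metric) {
        if (metrics.putIfAbsent(name, metric) != null) {
            throw new IllegalArgumentException("A metric named " + name + " already exists");
        }
        return metric;
    }

    // getOrAdd(): returns the existing metric for the name, creating it on first use.
    LongAdder getOrAdd(String name) {
        return metrics.computeIfAbsent(name, key -> new LongAdder());
    }

    public static void main(String[] args) {
        TinyRegistry registry = new TinyRegistry();
        registry.getOrAdd("request.tps.7").increment();
        registry.getOrAdd("request.tps.7").increment(); // same instance, no error
        System.out.println(registry.getOrAdd("request.tps.7").sum());
    }
}
```

This is why the gauge test has to check `map.containsKey(j)` before registering, while the meter test can simply call `registry.meter(...)` inside the loop.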
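The result of the single-meter test case is shown below. As the count grows, every rate converges on 2 events per second, because the loop sleeps about 500 ms on average between marks.
```
MetersTest.request.tps
             count = 452
         mean rate = 1.99 events/second
     1-minute rate = 2.03 events/second
     5-minute rate = 2.00 events/second
    15-minute rate = 2.00 events/second
```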
For the multi-meter test case, the results for last digits 6, 7 and 8 are shown below. Each rate ends up close to 10: with a 10 ms sleep there are about 100 marks per second, and spread evenly over the ten possible last digits that gives about 10 per meter.
```
MetersTest.request.tps.6
             count = 905
         mean rate = 9.74 events/second
     1-minute rate = 9.76 events/second
     5-minute rate = 9.94 events/second
    15-minute rate = 9.98 events/second
MetersTest.request.tps.7
             count = 935
         mean rate = 10.07 events/second
     1-minute rate = 10.62 events/second
     5-minute rate = 11.82 events/second
    15-minute rate = 12.19 events/second
MetersTest.request.tps.8
             count = 937
         mean rate = 10.09 events/second
     1-minute rate = 10.09 events/second
     5-minute rate = 10.31 events/second
    15-minute rate = 10.37 events/second
```
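The expected rates for both meter tests can be worked out from the sleep times alone. The sketch below does that arithmetic. It ignores loop overhead and the fact that Thread.sleep may sleep slightly longer than requested, so real runs tend to come in a little under these figures. `ExpectedRates` and its methods are hypothetical names for this illustration.

```java
// Back-of-the-envelope arithmetic for the two meter tests (stdlib only).
final class ExpectedRates {
    // Mean of Random.nextInt(bound): a uniform draw from 0..bound-1.
    static double meanOfNextInt(int bound) {
        return (bound - 1) / 2.0;
    }

    // Events per second when each iteration marks once and then sleeps meanSleepMillis.
    static double eventsPerSecond(double meanSleepMillis) {
        return 1000.0 / meanSleepMillis;
    }

    public static void main(String[] args) {
        // Single meter: sleep is nextInt(1000) milliseconds.
        double single = eventsPerSecond(meanOfNextInt(1000));
        // Ten meters: a fixed 10 ms sleep, with marks spread evenly over ten last digits.
        double perGroup = eventsPerSecond(10) / 10;
        System.out.println("single meter: " + single + " events/second");
        System.out.println("per-digit meter: " + perGroup + " events/second");
    }
}
```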
Timer
A Timer is really a combination of a Histogram and a Meter: the histogram records how long a piece of code or a call takes, and the meter measures its TPS.
```java
import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.Timer;
import org.junit.Test; // JUnit 4 assumed

import java.util.Random;

// Named TestTimer so that the metric names match the output shown below.
public class TestTimer {
    public static Random random = new Random();
    // getMetricAndStartAllReport(host, prefix) is a variant of the MetricCommon helper
    // that is not shown in this post.
    private static final MetricRegistry registry = MetricCommon.getMetricAndStartAllReport("nc110x.corp.youdao.com", "test.metrics");

    @Test
    public void testOneTimer() throws InterruptedException {
        Timer timer = registry.timer(MetricRegistry.name(TestTimer.class, "get-latency"));
        Timer.Context ctx;
        while (true) {
            ctx = timer.time();                  // start timing
            Thread.sleep(random.nextInt(1000));  // the "work" being measured
            ctx.stop();                          // record the elapsed time
        }
    }

    @Test
    public void testMultiTimer() throws InterruptedException {
        while (true) {
            int i = random.nextInt(100);
            int j = i % 10;
            // timer() is get-or-add, like meter().
            Timer timer = registry.timer(MetricRegistry.name(TestTimer.class, "get-latency", String.valueOf(j)));
            Timer.Context ctx = timer.time();
            Thread.sleep(random.nextInt(1000));
            ctx.stop();
            Thread.sleep(1000);
        }
    }
}
```
Test case 1 uses a single timer; the result is shown below. The durations converge toward the expected statistics for a uniform 0 to 999 ms sleep, for example a mean of about 500 ms.
```
-- Timers ----------------------------------------------------------------------
com.testmetrics.TestTimer.get-latency
             count = 657
         mean rate = 2.05 calls/second
     1-minute rate = 1.98 calls/second
     5-minute rate = 2.02 calls/second
    15-minute rate = 2.01 calls/second
               min = 4.98 milliseconds
               max = 998.93 milliseconds
              mean = 496.79 milliseconds
            stddev = 297.46 milliseconds
            median = 501.02 milliseconds
              75% <= 765.09 milliseconds
              95% <= 952.03 milliseconds
              98% <= 974.12 milliseconds
              99% <= 989.02 milliseconds
            99.9% <= 998.93 milliseconds
```
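To show what "a Histogram plus a Meter" means in practice, here is a stdlib-only sketch that records both a call count (the meter side) and the individual durations (the histogram side) around a task. `TimerSketch` is a hypothetical, single-threaded illustration and not how the library implements Timer.

```java
import java.util.ArrayList;
import java.util.List;

// Stdlib-only sketch: a timer records how often something ran and how long each run took.
final class TimerSketch {
    private final List<Long> durationsNanos = new ArrayList<>(); // histogram side
    private long count;                                           // meter side

    // Runs the task and records its duration, even if the task throws.
    void time(Runnable task) {
        long start = System.nanoTime();
        try {
            task.run();
        } finally {
            durationsNanos.add(System.nanoTime() - start);
            count++;
        }
    }

    long count() {
        return count;
    }

    // Longest recorded duration in nanoseconds, or 0 if nothing was timed yet.
    long maxNanos() {
        return durationsNanos.stream().mapToLong(Long::longValue).max().orElse(0L);
    }

    public static void main(String[] args) {
        TimerSketch timer = new TimerSketch();
        for (int i = 0; i < 3; i++) {
            timer.time(() -> {
                try {
                    Thread.sleep(50);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        System.out.println("count = " + timer.count()
                + ", max = " + timer.maxNanos() / 1_000_000.0 + " ms");
    }
}
```

The try/finally plays the same role as calling ctx.stop() in the test above: the duration is recorded once the timed section ends.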
Counters
A Counter is a plain counter; conceptually it is just a gauge over an AtomicLong. Used as shown below, it gives a more efficient way of tracking the size of a queue.
```java
import com.codahale.metrics.Counter;
import com.codahale.metrics.MetricRegistry;
import org.junit.Test; // JUnit 4 assumed

import java.util.Queue;
import java.util.Random;
import java.util.concurrent.LinkedBlockingQueue;

public class CounterTest {
    private static final MetricRegistry registry = MetricCommon.getMetricAndStartReport();
    public static Queue<String> q = new LinkedBlockingQueue<String>();
    public static Counter pendingJobs;
    public static Random random = new Random();

    public static void addJob(String job) {
        pendingJobs.inc();
        q.offer(job);
    }

    public static String takeJob() {
        String job = q.poll();
        if (job != null) {
            pendingJobs.dec(); // only decrement when a job was actually removed
        }
        return job;
    }

    @Test
    public void test() throws InterruptedException {
        pendingJobs = registry.counter(MetricRegistry.name(Queue.class, "pending-jobs", "size"));
        int num = 1;
        while (true) {
            Thread.sleep(200);
            if (random.nextDouble() > 0.7) {   // take a job with probability 0.3
                String job = takeJob();
                System.out.println("take job : " + job);
            } else {                           // add a job with probability 0.7
                String job = "Job-" + num;
                addJob(job);
                System.out.println("add job : " + job);
            }
            num++;
        }
    }
}
```
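The number of pending jobs keeps growing. Each iteration adds one job with probability 0.7 or takes one job with probability 0.3, so the queue grows by about 0.4 jobs per iteration on average. (num is only used to name the job; each call to addJob adds a single job.)
```
-- Counters --------------------------------------------------------------------
java.util.Queue.pending-jobs.size
             count = 36
```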
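The growth rate is easy to check with a quick simulation. The sketch below replays the same add/take decision without the sleeps and without the Metrics library, using a seeded `Random` so that a run is repeatable. `PendingJobsSimulation` and `simulate` are hypothetical names for this illustration.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.Random;

// Stdlib-only simulation of the add/take loop from the counter test.
final class PendingJobsSimulation {
    // Runs the loop for the given number of iterations and returns the final queue size.
    static int simulate(int iterations, long seed) {
        Random random = new Random(seed);
        Queue<String> queue = new ArrayDeque<>();
        for (int num = 1; num <= iterations; num++) {
            if (random.nextDouble() > 0.7) {
                queue.poll();              // take one job (no-op when the queue is empty)
            } else {
                queue.offer("Job-" + num); // add exactly one job
            }
        }
        return queue.size();
    }

    public static void main(String[] args) {
        int iterations = 100_000;
        int pending = simulate(iterations, 42L);
        System.out.println("pending after " + iterations + " iterations: " + pending);
        System.out.println("expected roughly: " + Math.round(0.4 * iterations));
    }
}
```

The simulated size lands close to 0.4 times the number of iterations, in line with the 0.7 minus 0.3 reasoning above.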
Health Checks
Metrics provides a separate module, Health Checks, for checking whether the application, its submodules or related modules are running normally. The module is independent of metrics-core; to use it, add the metrics-healthchecks package.
```java
import com.codahale.metrics.health.HealthCheck;
import com.codahale.metrics.health.HealthCheckRegistry;
import org.junit.Test; // JUnit 4 assumed

import java.util.Map;
import java.util.Random;

public class HeathChecksTest extends HealthCheck {
    @Override
    protected Result check() throws Exception {
        Random random = new Random();
        // Healthy unless the random draw from 0..9 is exactly 9.
        if (random.nextInt(10) != 9) {
            return Result.healthy();
        } else {
            return Result.unhealthy("oh,unhealthy");
        }
    }

    @Test
    public void test() throws InterruptedException {
        HealthCheckRegistry registry = new HealthCheckRegistry();
        registry.register("check1", new HeathChecksTest());
        registry.register("check2", new HeathChecksTest());
        while (true) {
            // Run every registered check once per second and print the outcome.
            for (Map.Entry<String, Result> entry : registry.runHealthChecks().entrySet()) {
                if (entry.getValue().isHealthy()) {
                    System.out.println(entry.getKey() + ": OK, message:" + entry.getValue());
                } else {
                    System.err.println(entry.getKey() + ": FAIL, error message: " + entry.getValue());
                }
            }
            Thread.sleep(1000);
        }
    }
}
```
Two health checks are registered. Their check() method is overridden to draw a random number from 0 to 9 and report healthy unless the number is 9. The output looks like this:
```
check1: OK, message:Result{isHealthy=true}
check2: FAIL, error message: Result{isHealthy=false, message=oh,unhealthy}
check1: OK, message:Result{isHealthy=true}
check2: OK, message:Result{isHealthy=true}
check1: OK, message:Result{isHealthy=true}
check2: OK, message:Result{isHealthy=true}
check1: OK, message:Result{isHealthy=true}
check2: OK, message:Result{isHealthy=true}
check1: OK, message:Result{isHealthy=true}
```
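The pattern of "register named checks, run them all, collect a result per name" can be sketched with the stdlib alone. In this hypothetical illustration a check is a `Callable<Boolean>`, and the sketch chooses to report a check that throws as unhealthy rather than aborting the whole run. `HealthSketch` and `runAll` are invented names, not library API.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.Callable;

// Stdlib-only sketch of running a set of named health checks.
final class HealthSketch {
    // Runs every check; a check that throws is reported as unhealthy.
    static Map<String, Boolean> runAll(Map<String, Callable<Boolean>> checks) {
        Map<String, Boolean> results = new TreeMap<>(); // sorted by check name
        for (Map.Entry<String, Callable<Boolean>> entry : checks.entrySet()) {
            boolean healthy;
            try {
                healthy = entry.getValue().call();
            } catch (Exception e) {
                healthy = false;
            }
            results.put(entry.getKey(), healthy);
        }
        return results;
    }

    public static void main(String[] args) {
        Map<String, Callable<Boolean>> checks = new LinkedHashMap<>();
        checks.put("queue", () -> true);
        checks.put("database", () -> {
            throw new IllegalStateException("connection refused");
        });
        runAll(checks).forEach((name, healthy) ->
                System.out.println(name + ": " + (healthy ? "OK" : "FAIL")));
    }
}
```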
Maven dependencies
- metrics-core: required
- metrics-healthchecks: needed when using health checks
- metrics-graphite: needed when using Graphite
- org.slf4j: without it, errors logged by the metrics-graphite package are not visible
```xml
<properties>
    <metrics.version>3.1.0</metrics.version>
    <slf4j.version>1.7.22</slf4j.version>
</properties>

<dependencies>
    <dependency>
        <groupId>io.dropwizard.metrics</groupId>
        <artifactId>metrics-core</artifactId>
        <version>${metrics.version}</version>
    </dependency>
    <dependency>
        <groupId>io.dropwizard.metrics</groupId>
        <artifactId>metrics-healthchecks</artifactId>
        <version>${metrics.version}</version>
    </dependency>
    <dependency>
        <groupId>io.dropwizard.metrics</groupId>
        <artifactId>metrics-graphite</artifactId>
        <version>${metrics.version}</version>
    </dependency>
    <dependency>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-api</artifactId>
        <version>${slf4j.version}</version>
    </dependency>
    <dependency>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-simple</artifactId>
        <version>${slf4j.version}</version>
    </dependency>
</dependencies>
```
 
References
http://metrics.dropwizard.io/3.1.0/getting-started/
http://www.cnblogs.com/nexiyi/p/metrics_sample_1.html
http://wuchong.me/blog/2015/08/01/getting-started-with-metrics/