java - jHiccup: histogram data seems inconsistent -
i'm running application azul's jhiccup. application started via jsvc -javaagent option. jhiccup settings default (except output filename).
the application runs ~107 minutes. however, if add values resulting hgrm
file, total length of hiccups seems large.
i calculate sum of hiccup values this:
sum( (totalcountincludingthisvalue[n] - totalcountincludingthisvalue[n-1]) * value[n] )
the result 12.75 mln., or ~213 minutes (presuming values in milliseconds). twice duration of application run.
total count * mean
(from bottom of histogram values) give 12 mln., formula correct.
for comparison, added "application time" , "stopped time" gc log. results close expect: ~105 min. app time , ~2 min. stopped time.
i've accidentally overwritten original files, these example files different run:
https://dl.dropboxusercontent.com/u/644287/jhiccup/hiccup.dat
https://dl.dropboxusercontent.com/u/644287/jhiccup/hiccup.dat.hgrm
the run took ~79 minutes (17:50-19:09), mean*totalcount ~23 mln. (that is, 380 minutes).
update: [keeping original mistaken post below]
ok. both had math wrong (both in question , in initial answer). sanity check says (total count * mean( should equal runtime bogus math.
this demonstrated following thought exercise:
the following simple scenario: measure application runs (idle, intervals showing 0 hiccup) 100 seconds. ^z system 100 seconds. let run in foreground again , ^c it. total run time 200 seconds.
here proper, "how human describe in numbers" statements such system:
- the 99.99%'ile close 100,000 msec.
- the 10%'ile, 20%'ile, ... 50%'ile close 0 msec.
- during 1st 100 seconds (and first 100,000 sample periods), mean 0 msec.
- during 2nd 100 seconds (and second 100,000 sample periods) mean 50,000 msec.
- the overall mean 25,000 milliseconds (over total run time of 200 second period).
but there (properly) 200,000 results reported in totalcount histogram (one each 1msec of elapsed time).
so (mean * total count) = 200,000 * 25,000 == 5,000,000,000
that's 25,000x as wall clock time. , there nothing wrong that. there ~n^2 thing going on multiplication (actually n * (n-1) / 2), , that's how behave...
the key here each percentile, histogram [correctly] reports percentage of overall random samples (performed without coordination on time) encounter hiccup levels below magnitude. when single long freeze occurs (e.g. seen in 2,654 msec max in sample output), that's not 1 point in time affected. there many sample points in time affected there milliseconds in freeze. measured , represented, single freeze generate 1 count per msec, values varying linearly between max value , 0 (2654msec, 2653msec, 2652msec, ... 3msec, 2msec, 1msec). right way calculate spread of possible latencies that random, uncoordinated sample have encountered.
i'm embarrassed, because original answer repeated same mistaken assumption warn (with graphs in slides discuss above scenario) in talk titled "how not measure latency" (http://www.infoq.com/presentations/latency-pitfalls , , under "the coordinated omission problem" 32 minutes in). built both jhiccup , hdrhistogram right math avoid problem, why see right results in data.
in fact, had sanity check shown "good" numbers, max times in 1000s of msec, indication measurement system exhibiting coordinated omission...
-- gil.
-------- original answer, "wrong" -----------
can post example .hgrm data set shows this? files looked @ seem show close relationship between total instrumented run time , counts in .hgrm file. simplest sanity test total count * mean (both reported @ bottom of file) should close runtime. data come same histogram produces percentile lines...
note counts @ percentile lines not additive pretty simple reason. 2 percentile levels reported same exact value. quite common long runs have little noise in them. however, math above (only counting delta counts) "should" work right.
[small note, math off 1 line: each percentile line shows percentile , count of values have values @ or below value on line. value should multiplied delta between counts @ next line (n+1) , count @ "this" line (n), should not have made big difference in measurement].
below sample file forma control run took ~1685 seconds, jhiccup started instrumenting 30 seconds run (so ~1655 seconds of instrumented time). using excel on data set, results of:
sum( (totalcountincludingthisvalue[n+1] - totalcountincludingthisvalue[n]) * value[n] )
shows ~16237 seconds, , mean * total count shows ~16591 seconds.
jhiccup histogram report, mon apr 22 15:28:07 cdt 2013 :
value, percentile, totalcountincludingthisvalue
1.01 0.000000000000 47 1.06 0.100000000000 483774 1.06 0.200000000000 483774 1.06 0.300000000000 483774 1.10 0.400000000000 628589 1.11 0.500000000000 804685 1.12 0.550000000000 1024190 1.12 0.600000000000 1024190 1.12 0.650000000000 1024190 1.13 0.700000000000 1388741 1.13 0.750000000000 1388741 1.13 0.775000000000 1388741 1.13 0.800000000000 1388741 1.13 0.825000000000 1388741 1.13 0.850000000000 1388741 1.13 0.875000000000 1388741 1.13 0.887500000000 1388741 1.13 0.900000000000 1388741 1.13 0.912500000000 1388741 1.14 0.925000000000 1471124 1.14 0.937500000000 1471124 1.14 0.943750000000 1471124 1.14 0.950000000000 1471124 1.14 0.956250000000 1471124 1.14 0.962500000000 1471124 1.14 0.968750000000 1471124 1.14 0.971875000000 1471124 1.14 0.975000000000 1471124 1.14 0.978125000000 1492169 1.14 0.981250000000 1492169 1.14 0.984375000000 1492169 1.14 0.985937500000 1492169 1.14 0.987500000000 1492169 1.14 0.989062500000 1492169 1.15 0.990625000000 1500639 1.15 0.992187500000 1500639 1.15 0.992968750000 1500639 1.15 0.993750000000 1500639 1.15 0.994531250000 1500639 1.16 0.995312500000 1504895 1.16 0.996093750000 1504895 1.16 0.996484375000 1504895 1.16 0.996875000000 1504895 1.16 0.997265625000 1504895 1.16 0.997656250000 1504895 1.17 0.998046875000 1506535 1.17 0.998242187500 1506535 1.17 0.998437500000 1506535 1.17 0.998632812500 1506535 1.17 0.998828125000 1506535 1.18 0.999023437500 1507442 1.18 0.999121093750 1507442 1.18 0.999218750000 1507442 1.18 0.999316406250 1507442 1.18 0.999414062500 1507442 1.18 0.999511718750 1507836 1.18 0.999560546875 1507836 1.18 0.999609375000 1507836 1.18 0.999658203125 1507836 1.18 0.999707031250 1507836 1.19 0.999755859375 1508028 1.19 0.999780273438 1508028 1.19 0.999804687500 1508028 1.19 0.999829101563 1508028 1.20 0.999853515625 1508106 1.20 0.999877929688 1508106 1.21 0.999890136719 1508145 1.21 0.999902343750 1508145 1.22 0.999914550781 1508163 1.22 0.999926757813 1508174 1.25 0.999938964844 1508186 1.34 0.999945068359 1508195 1.82 0.999951171875 1508204 2.42 0.999957275391 1508213 3.54 0.999963378906 1508222 4.74 0.999969482422 1508231 5.09 0.999972534180 1508236 5.98 0.999975585938 1508241 6.24 0.999978637695 1508245 7.01 0.999981689453 1508251 7.97 0.999984741211 1508254 8.26 0.999986267090 1508257 8.96 0.999987792969 1508259 9.02 0.999989318848 1508261 9.98 0.999990844727 1508265 10.24 0.999992370605 1508266 11.01 0.999993133545 1508268 11.01 0.999993896484 1508268 11.97 0.999994659424 1508269 13.95 0.999995422363 1508271 14.98 0.999996185303 1508272 14.98 0.999996566772 1508272 16.00 0.999996948242 1508273 16.00 0.999997329712 1508273 17.02 0.999997711182 1508274 18.05 0.999998092651 1508275 18.05 0.999998283386 1508275 18.05 0.999998474121 1508275 18.05 0.999998664856 1508275 19.07 0.999998855591 1508276 19.07 0.999999046326 1508276 19.07 0.999999141693 1508276 19.07 0.999999237061 1508276 19.07 0.999999332428 1508276 19.97 0.999999427795 1508277 19.97 1.000000000000 1508277 #[mean = 1.10, stddeviation = 0.06] #[max = 19.84, total count = 1508277] #[buckets = 25, subbuckets = 256]
Comments
Post a Comment