python - Pandas DataFrame grouped box plot from aggregated results
I want to draw a box plot, but I don't have the raw data, only the aggregated results in a pandas DataFrame.
Is it still possible to draw a box plot from the aggregated results?
If not, what is the closest plot I can get that shows the min, max, mean, median, std-dev etc.? I know I can plot them using a line chart, but I need the boxplots to be grouped/clustered.
Here is the data; the plotting part is missing. Please help. Thanks.
    import matplotlib.pyplot as plt
    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        'group': ['tick tick tick', 'tock tock tock', 'tock tock tock', 'tick tick tick']*3,  # , ['tock tock tock', 'tick tick tick']*6,
        'person': [x*5 for x in list('abc')]*4,
        'median': np.random.randn(12),
        'stddev': np.random.randn(12)
    })
    df["average"] = df["median"]*1.1
    df["minimum"] = df["median"]*0.5
    df["maximum"] = df["median"]*1.6
    df["90%"] = df["maximum"]*0.9
    df["95%"] = df["maximum"]*0.95
    df["99%"] = df["maximum"]*0.99
    df
Update:
I'm one step closer to the result: I have found that this feature has been available since matplotlib 1.4. I'm using matplotlib 1.5, and I have tested it and confirmed that it works for me.
The problem is that I have no clue why it works, or how to adapt my code above to use this new feature. I'll re-post the working code below, in the hope that someone can understand it and put two and two together.
The data I have are median, average, minimum, 90%, 95%, 99%, maximum and stddev, and I hope to chart them all. I took a look at the data structure of logstats in the following code, after the loop for stats, label in zip(logstats, list('abcd')), and found that the fields are:
    [{'cihi': 4.2781254505311281,
      'cilo': 1.6164348064249057,
      'fliers': array([ 19.69118642, 19.01171604]),
      'iqr': 5.1561885723613567,
      'label': 'a',
      'mean': 4.9486856766955922,
      'med': 2.9472801284780168,
      'q1': 1.7655440553898782,
      'q3': 6.9217326277512345,
      'whishi': 12.576334012545718,
      'whislo': 0.24252084924003742},
     {'cihi': 4.3186289184254107,
      'cilo': 1.9963715983778565,
      ...
So, from this and the bxp doc, I'm going to map my data as follows:
- whislo: minimum
- q1: median
- med: average
- mean: 90%
- q3: 95%
- whishi: 99%
- fliers: maximum
To map them, I'll select minimum as whislo, [90%] as mean, [95%] as q3, and [99%] as whishi.
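The mapping above can be sketched in pandas as a column rename followed by a conversion to one dict per row. This is only an illustration: the two-row table and the name COLUMN_TO_BXP_KEY are made up here, and the sketch follows the question's mapping literally (median drawn as q1, average drawn as med), which is a display choice and not the statistical meaning of those keys.

```python
import pandas as pd

# Hypothetical aggregated table using the question's column names.
agg = pd.DataFrame({
    "label": ["a", "b"],
    "minimum": [1.0, 2.0],
    "median": [2.0, 3.0],
    "average": [2.2, 3.3],
    "90%": [2.9, 4.3],
    "95%": [3.0, 4.6],
    "99%": [3.2, 4.8],
    "maximum": [3.3, 5.0],
})

# The question's mapping from aggregated columns to bxp keys (an assumption
# about presentation, not a statistical definition).
COLUMN_TO_BXP_KEY = {
    "minimum": "whislo",
    "median": "q1",
    "average": "med",
    "90%": "mean",
    "95%": "q3",
    "99%": "whishi",
}

# One plain dict per row, keyed by the bxp names.
bxp_stats = agg.rename(columns=COLUMN_TO_BXP_KEY).to_dict("records")

# "fliers" must be a sequence, so the single maximum is wrapped in a list.
for record in bxp_stats:
    record["fliers"] = [record.pop("maximum")]

print(sorted(bxp_stats[0]))
```

The resulting list is in the shape that ax.bxp() accepts, one dict per box.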
...Here is the final result:
    raw_data = {
        'label': ['label_01 init', 'label_02', 'label_03', 'label_04', 'label_05',
                  'label_06', 'label_07', 'label_08', 'label_99'],
        'whislo': [0.18, 2.03, 4.08, 2.09, 2.33, 2.38, 1.97, 2.65, 0.09],
        'q3': [0.5, 4.97, 11.77, 5.71, 12.46, 11.86, 13.84, 16.97, 0.3],
        'mean': [0.4, 4.13, 10.62, 5.1, 10.24, 9.07, 11.96, 15.15, 0.26],
        'whishi': [1.76, 7.64, 20.04, 6.67, 22.46, 21.66, 16.63, 19.69, 1.18],
        'q1': [0.28, 2.96, 7.61, 3.46, 5.81, 5.44, 6.63, 8.99, 0.16],
        'fliers': [5.5, 17.13, 32.89, 7.91, 32.83, 70.68, 24.7, 32.24, 3.35],
    }
    df = pd.DataFrame(raw_data, columns=['label', 'whislo', 'q1', 'mean', 'q3', 'whishi', 'fliers'])
Now the challenge is how to present the above DataFrame in a box plot with multiple levels of grouping. If multiple levels of grouping is too difficult, let's get plotting from a pd DataFrame working first, because my pd DataFrame has the same fields as the required np array. So I tried:
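    fig, ax = plt.subplots()
    # as_matrix() returns a plain NumPy array (it was removed in pandas 1.0; df.values is the equivalent)
    ax.bxp(df.as_matrix(), showmeans=True, showfliers=True, vert=False)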
But I got:

    ...\anaconda3\lib\site-packages\matplotlib\axes\_axes.py in bxp(self, bxpstats, positions, widths, vert, patch_artist, shownotches, showmeans, showcaps, showbox, showfliers, boxprops, whiskerprops, flierprops, medianprops, capprops, meanprops, meanline, manage_xticks)
       3601         for pos, width, stats in zip(positions, widths, bxpstats):
       3602             # try to find a new label
    -> 3603             datalabels.append(stats.get('label', pos))
       3604             # fliers coords
       3605             flier_x = np.ones(len(stats['fliers'])) * pos

    AttributeError: 'numpy.ndarray' object has no attribute 'get'
If I use ax.bxp(df.to_records(), ...), then I get AttributeError: 'record' object has no attribute 'get'.
OK, I got it working, plotting from a pd DataFrame but not with multiple levels of grouping, like this:
    df['fliers'] = ''
    fig, ax = plt.subplots()
    ax.bxp(df.to_dict('records'), showmeans=True, meanline=True, showfliers=False, vert=False)  # shownotches=True,
    plt.show()
Note that the above data is missing the med field. You can add the correct values, or use df['med']=df['q1']*1.2 as a placeholder to make it work.
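As a self-contained illustration of the approach just described, here is a minimal sketch that builds a small DataFrame with the bxp field names, fills in the placeholder med, and hands one dict per row to ax.bxp(). The three rows are taken from the data above; the Agg backend and the empty fliers lists are assumptions made so that the sketch runs headless and on older matplotlib versions. It draws vertical boxes; the vert=False argument used above flips the orientation.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, no display needed
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "label": ["label_01", "label_02", "label_03"],
    "whislo": [0.18, 2.03, 4.08],
    "q1": [0.28, 2.96, 7.61],
    "mean": [0.40, 4.13, 10.62],
    "q3": [0.50, 4.97, 11.77],
    "whishi": [1.76, 7.64, 20.04],
})

# Placeholder median, as suggested above; replace with real medians if available.
df["med"] = df["q1"] * 1.2

# bxp() calls stats.get(...), so each box must be a dict, not an array row.
records = df.to_dict("records")
for record in records:
    record["fliers"] = []  # nothing beyond the whiskers in this sketch

fig, ax = plt.subplots()
artists = ax.bxp(records, showmeans=True, meanline=True, showfliers=False)
ax.set_ylabel("value")
```

The returned dictionary holds the drawn artists (boxes, medians, whiskers, means and so on), which is handy for restyling individual boxes afterwards.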
The working example mentioned in the update:

    import matplotlib
    import matplotlib.pyplot as plt
    import numpy as np
    import pandas as pd

    def test_bxp_with_ylabels():
        np.random.seed(937)
        logstats = matplotlib.cbook.boxplot_stats(
            np.random.lognormal(mean=1.25, sigma=1., size=(37, 4))
        )
        print(logstats)
        # attach a label to each box's stats dict
        for stats, label in zip(logstats, list('abcd')):
            stats['label'] = label

        fig, ax = plt.subplots()
        ax.set_xscale('log')
        ax.bxp(logstats, vert=False)

    test_bxp_with_ylabels()
While waiting for a clarification of your df, consider the following stats structure:

    dic = [{'cihi': 4.2781254505311281,
            'cilo': 1.6164348064249057,
            'fliers': np.array([19.69118642, 19.01171604]),
            'iqr': 5.1561885723613567,
            'mean': 4.9486856766955922,
            'med': 2.9472801284780168,
            'q1': 1.7655440553898782,
            'q3': 6.9217326277512345,
            'whishi': 12.576334012545718,
            'whislo': 0.24252084924003742}]
And here is how your data should map, from the bxp doc:
    Required keys are:

    - ``med``: The median (scalar float).
    - ``q1``: The first quartile (25th percentile) (scalar float).
    - ``q3``: The first quartile (50th percentile) (scalar float).  # here I guess it's rather: 3rd quartile (75th percentile)
    - ``whislo``: Lower bound of the lower whisker (scalar float).
    - ``whishi``: Upper bound of the upper whisker (scalar float).

    Optional keys are:

    - ``mean``: The mean (scalar float). Needed if ``showmeans=True``.
    - ``fliers``: Data beyond the whiskers (sequence of floats). Needed if ``showfliers=True``.
    - ``cilo`` & ``cihi``: Lower and upper confidence intervals about the median. Needed if ``shownotches=True``.
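The key rules quoted above can be checked before plotting. The helper below is a hypothetical convenience, not part of matplotlib: missing_bxp_keys is a name invented for this sketch, and its keyword defaults simply mirror the bxp defaults (fliers shown, means and notches hidden).

```python
REQUIRED_KEYS = ("med", "q1", "q3", "whislo", "whishi")


def missing_bxp_keys(stats, showmeans=False, showfliers=True, shownotches=False):
    """Return the sorted keys that bxp would need but `stats` lacks."""
    needed = set(REQUIRED_KEYS)
    if showmeans:
        needed.add("mean")
    if showfliers:
        needed.add("fliers")
    if shownotches:
        needed.update(("cilo", "cihi"))
    return sorted(needed - set(stats))


# One row shaped like the question's DataFrame, which has no "med" column.
row = {"whislo": 0.18, "q1": 0.28, "mean": 0.4, "q3": 0.5, "whishi": 1.76}
print(missing_bxp_keys(row, showmeans=True, showfliers=False))  # ['med']
```

Running such a check on every row gives a clearer message than the KeyError raised from inside bxp.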
Then, you just have to do:
    fig, ax = plt.subplots(1, 1)
    # dic is already a list of stats dicts, so it is passed as-is
    ax.bxp(dic, showmeans=True)
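So you just need to find a way to build your dic. Note that it does not plot your std, and for the whiskers you need to choose whether they go to 90%, 95% or 99%; you can't have all of those values. In that case you need to add them afterward with something like plt.hlines().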
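For the grouped/clustered layout asked about in the question, one possible approach is to compute the box positions by hand and pass them to bxp, then draw the leftover percentile with hlines as suggested above. The sketch below assumes a made-up dictionary of aggregated values keyed by (group, person); the p99 key, the 0.8 cluster width and the Agg backend are assumptions of this example, not anything required by matplotlib.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, no display needed
import matplotlib.pyplot as plt

# Hypothetical aggregated results: (group, person) -> bxp keys plus one extra percentile.
aggregated = {
    ("tick", "aaaaa"): dict(whislo=1.0, q1=2.0, med=3.0, q3=4.0, whishi=5.0, p99=5.6),
    ("tick", "bbbbb"): dict(whislo=1.5, q1=2.5, med=3.5, q3=4.5, whishi=5.5, p99=6.1),
    ("tock", "aaaaa"): dict(whislo=0.5, q1=1.5, med=2.5, q3=3.5, whishi=4.5, p99=5.0),
    ("tock", "bbbbb"): dict(whislo=2.0, q1=3.0, med=4.0, q3=5.0, whishi=6.0, p99=6.4),
}
groups = sorted({group for group, _ in aggregated})
persons = sorted({person for _, person in aggregated})
width = 0.8 / len(persons)  # each cluster spans 0.8 x-units

stats, positions, extra = [], [], []
for gi, group in enumerate(groups):
    for pi, person in enumerate(persons):
        row = aggregated[(group, person)]
        box = {k: row[k] for k in ("whislo", "q1", "med", "q3", "whishi")}
        box["fliers"] = []  # nothing beyond the whiskers in this sketch
        stats.append(box)
        # Cluster centre is gi; each person's box is offset inside the cluster.
        positions.append(gi + (pi - (len(persons) - 1) / 2) * width)
        extra.append(row["p99"])

fig, ax = plt.subplots()
artists = ax.bxp(stats, positions=positions, widths=width * 0.9, showfliers=False)

# The percentile that did not fit into the box keys, drawn as a dashed tick over each box.
ax.hlines(extra,
          [x - width * 0.45 for x in positions],
          [x + width * 0.45 for x in positions],
          colors="red", linestyles="dashed")

# One tick per cluster instead of one per box.
ax.set_xticks(range(len(groups)))
ax.set_xticklabels(groups)
```

Passing patch_artist=True to bxp and colouring artists["boxes"] per person would be one way to tell the boxes inside a cluster apart.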
HTH