python - Dictionary comprehension to calculate statistics across dict of dicts for each key in inner dicts
I have a dictionary like this:
    property2region2value = {
        'countrya': {'a': 24, 'b': 56, 'c': 78},
        'countryb': {'a': 3, 'b': 98},
        'countryc': {'a': 121, 'b': 12121, 'c': 12989121, 'd': 16171},
        'countryd': {'a': 123, 'b': 1312, 'c': 1231},
        'countrye': {'a': 1011, 'b': 1911},
        'countryf': {'a': 1433, 'b': 19829, 'c': 1132, 'd': 1791},
    }
and I am trying to create multiple dictionaries, each containing one statistic (min, max, std, etc.) of the master dictionary's values for each property ('a', 'b', 'c', etc.) across the countries (e.g. countrya, countryb, etc.).
So, for example, {'a': min of 'a' across countries, 'b': min of 'b' across countries, ...} might be one dictionary.
At the moment, I have a huge loop of code that a) isn't efficient and b) doesn't allow me to calculate statistics easily, e.g. using the np.min() or np.max() functions in numpy.
How can I use a dictionary comprehension to achieve this? My current code for calculating the min and max:
    property2rangemax = {}
    property2rangemin = {}
    for country, property2value in property2region2value.items():
        for property, value in property2value.items():
            if property not in property2rangemax:
                property2rangemax[property] = 0
            if property2rangemax[property] < value:
                property2rangemax[property] = value
            if property not in property2rangemin:
                # seed with the first value seen; seeding with 0 would
                # leave the minimum stuck at 0 for positive data
                property2rangemin[property] = value
            if property2rangemin[property] > value:
                property2rangemin[property] = value
You should use pandas for this task.
Edit:
pandas can accomplish what you want:
    # Assumes: import numpy as np, import pandas as pd,
    # and df = pd.DataFrame(property2region2value)

    In [3]: pd.DataFrame(property2region2value)
    Out[3]:
       countrya  countryb  countryc  countryd  countrye  countryf
    a      24.0       3.0       121     123.0    1011.0      1433
    b      56.0      98.0     12121    1312.0    1911.0     19829
    c      78.0       NaN  12989121    1231.0       NaN      1132
    d       NaN       NaN     16171       NaN       NaN      1791

    In [4]: df.apply(np.min, axis=1)
    Out[4]:
    a       3.0
    b      56.0
    c      78.0
    d    1791.0
    dtype: float64

    In [5]: df.apply(np.mean, axis=1)
    Out[5]:
    a    4.525000e+02
    b    5.887833e+03
    c    3.247890e+06
    d    8.981000e+03
    dtype: float64

    In [6]: mean_dict = df.apply(np.mean, axis=1).to_dict()

    In [7]: mean_dict
    Out[7]: {'a': 452.5, 'b': 5887.833333333333, 'c': 3247890.5, 'd': 8981.0}
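As a side note, the same row-wise reductions are also available directly as DataFrame methods, which skip the NaN entries by default. A minimal sketch, assuming the question's data trimmed to three countries to keep it short; the names min_dict and max_dict are only illustrative:

```python
import pandas as pd

# Sample data trimmed from the question (three countries only).
property2region2value = {
    'countrya': {'a': 24, 'b': 56, 'c': 78},
    'countryb': {'a': 3, 'b': 98},
    'countryc': {'a': 121, 'b': 12121, 'c': 12989121, 'd': 16171},
}

df = pd.DataFrame(property2region2value)

# axis=1 reduces across the country columns, giving one value per property.
# Properties a country lacks show up as NaN and are skipped by default.
min_dict = df.min(axis=1).to_dict()
max_dict = df.max(axis=1).to_dict()

print(min_dict)
print(max_dict)
```

Each result is one dictionary per statistic, keyed by property, which is the shape the question asks for.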
Or, more easily, you can transpose the DataFrame:
    In [20]: df.T
    Out[20]:
                   a        b           c        d
    countrya    24.0     56.0        78.0      NaN
    countryb     3.0     98.0         NaN      NaN
    countryc   121.0  12121.0  12989121.0  16171.0
    countryd   123.0   1312.0      1231.0      NaN
    countrye  1011.0   1911.0         NaN      NaN
    countryf  1433.0  19829.0      1132.0   1791.0

    In [21]: df.T.describe()
    Out[21]:
                     a             b             c             d
    count     6.000000      6.000000  4.000000e+00      2.000000
    mean    452.500000   5887.833333  3.247890e+06   8981.000000
    std     612.768717   8215.770187  6.494154e+06  10168.195513
    min       3.000000     56.000000  7.800000e+01   1791.000000
    25%      48.250000    401.500000  8.685000e+02   5386.000000
    50%     122.000000   1611.500000  1.181500e+03   8981.000000
    75%     789.000000   9568.500000  3.248204e+06  12576.000000
    max    1433.000000  19829.000000  1.298912e+07  16171.000000

    In [22]: df.T.describe().to_dict()
    Out[22]:
    {'a': {'25%': 48.25, '50%': 122.0, '75%': 789.0, 'count': 6.0,
           'max': 1433.0, 'mean': 452.5, 'min': 3.0,
           'std': 612.76871656441472},
     'b': {'25%': 401.5, '50%': 1611.5, '75%': 9568.5, 'count': 6.0,
           'max': 19829.0, 'mean': 5887.833333333333, 'min': 56.0,
           'std': 8215.770187065038},
     'c': {'25%': 868.5, '50%': 1181.5, '75%': 3248203.5, 'count': 4.0,
           'max': 12989121.0, 'mean': 3247890.5, 'min': 78.0,
           'std': 6494153.687626767},
     'd': {'25%': 5386.0, '50%': 8981.0, '75%': 12576.0, 'count': 2.0,
           'max': 16171.0, 'mean': 8981.0, 'min': 1791.0,
           'std': 10168.195513462553}}
And if you want finer control, you can pick and choose:
    In [24]: df.T.describe().loc[['mean','std','min','max'],:]
    Out[24]:
                    a             b             c             d
    mean   452.500000   5887.833333  3.247890e+06   8981.000000
    std    612.768717   8215.770187  6.494154e+06  10168.195513
    min      3.000000     56.000000  7.800000e+01   1791.000000
    max   1433.000000  19829.000000  1.298912e+07  16171.000000

    In [25]: df.T.describe().loc[['mean','std','min','max'],:].to_dict()
    Out[25]:
    {'a': {'max': 1433.0, 'mean': 452.5, 'min': 3.0,
           'std': 612.76871656441472},
     'b': {'max': 19829.0, 'mean': 5887.833333333333, 'min': 56.0,
           'std': 8215.770187065038},
     'c': {'max': 12989121.0, 'mean': 3247890.5, 'min': 78.0,
           'std': 6494153.687626767},
     'd': {'max': 16171.0, 'mean': 8981.0, 'min': 1791.0,
           'std': 10168.195513462553}}
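Another way to pick statistics, sketched here as an alternative rather than as part of the original answer, is to ask for them by name with DataFrame.agg instead of computing the full describe() table and slicing it. The variable names stats, per_property and per_stat are illustrative:

```python
import pandas as pd

property2region2value = {
    'countrya': {'a': 24, 'b': 56, 'c': 78},
    'countryb': {'a': 3, 'b': 98},
    'countryc': {'a': 121, 'b': 12121, 'c': 12989121, 'd': 16171},
    'countryd': {'a': 123, 'b': 1312, 'c': 1231},
    'countrye': {'a': 1011, 'b': 1911},
    'countryf': {'a': 1433, 'b': 19829, 'c': 1132, 'd': 1791},
}

df = pd.DataFrame(property2region2value)

# Transpose so each property is a column, then compute only the
# requested statistics. Missing values (NaN) are skipped.
stats = df.T.agg(['min', 'max', 'mean', 'std'])

# {'a': {'min': ..., 'max': ...}, ...}: one inner dict per property.
per_property = stats.to_dict()

# {'min': {'a': ..., 'b': ...}, ...}: one inner dict per statistic,
# which is the layout the question describes.
per_stat = stats.T.to_dict()

print(per_stat['min'])
print(per_stat['max'])
```

Transposing the small stats table before calling to_dict() is what switches between the "per property" and "per statistic" layouts.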
From the original answer:
Then you can achieve whatever you want:
    In [8]: df.apply(np.min)
    Out[8]:
    countrya      24.0
    countryb       3.0
    countryc     121.0
    countryd     123.0
    countrye    1011.0
    countryf    1132.0
    dtype: float64

    In [9]: df.apply(np.max)
    Out[9]:
    countrya          78.0
    countryb          98.0
    countryc    12989121.0
    countryd        1312.0
    countrye        1911.0
    countryf       19829.0
    dtype: float64

    In [10]: df.apply(np.std)
    Out[10]:
    countrya    2.217105e+01
    countryb    4.750000e+01
    countryc    5.620356e+06
    countryd    5.424170e+02
    countrye    4.500000e+02
    countryf    7.960893e+03
    dtype: float64
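One thing to watch with the standard deviation: np.std defaults to the population formula (ddof=0, divide by n), while pandas' own std() and describe() default to the sample formula (ddof=1, divide by n - 1). That is why np.std gives about 22.17 for countrya here, whereas pandas' default would give about 27.15. A small sketch of the difference, using countrya's three values from the question:

```python
import numpy as np
import pandas as pd

values = [24, 56, 78]  # countrya's values from the question

population_std = np.std(values)        # ddof=0: divide by n
sample_std = np.std(values, ddof=1)    # ddof=1: divide by n - 1
pandas_std = pd.Series(values).std()   # pandas defaults to ddof=1

print(population_std, sample_std, pandas_std)
```

Passing ddof explicitly makes it clear which definition a given result uses.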
You can bring the results back into dictionaries with ease:
    In [11]: df.apply(np.min).to_dict()
    Out[11]:
    {'countrya': 24.0,
     'countryb': 3.0,
     'countryc': 121.0,
     'countryd': 123.0,
     'countrye': 1011.0,
     'countryf': 1132.0}
Go nuts! pandas makes all your data-processing needs easier:
    In [12]: df.describe()
    Out[12]:
            countrya   countryb      countryc     countryd     countrye  \
    count   3.000000   2.000000  4.000000e+00     3.000000     2.000000
    mean   52.666667  50.500000  3.254384e+06   888.666667  1461.000000
    std    27.153882  67.175144  6.489829e+06   664.322462   636.396103
    min    24.000000   3.000000  1.210000e+02   123.000000  1011.000000
    25%    40.000000  26.750000  9.121000e+03   677.000000  1236.000000
    50%    56.000000  50.500000  1.414600e+04  1231.000000  1461.000000
    75%    67.000000  74.250000  3.259408e+06  1271.500000  1686.000000
    max    78.000000  98.000000  1.298912e+07  1312.000000  1911.000000

               countryf
    count      4.000000
    mean    6046.250000
    std     9192.447602
    min     1132.000000
    25%     1357.750000
    50%     1612.000000
    75%     6300.500000
    max    19829.000000
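For completeness, the literal question (a dictionary comprehension with numpy, no pandas) can also be answered along these lines. This is a sketch rather than part of the original answer: the intermediate names properties and property2values and the result names property2min, property2max and property2std are made up for illustration.

```python
import numpy as np

property2region2value = {
    'countrya': {'a': 24, 'b': 56, 'c': 78},
    'countryb': {'a': 3, 'b': 98},
    'countryc': {'a': 121, 'b': 12121, 'c': 12989121, 'd': 16171},
    'countryd': {'a': 123, 'b': 1312, 'c': 1231},
    'countrye': {'a': 1011, 'b': 1911},
    'countryf': {'a': 1433, 'b': 19829, 'c': 1132, 'd': 1791},
}

# Every property name that appears in at least one country.
properties = sorted(
    {p for region in property2region2value.values() for p in region}
)

# For each property, the values from the countries that define it.
property2values = {
    p: [region[p] for region in property2region2value.values() if p in region]
    for p in properties
}

# One dictionary per statistic, keyed by property.
property2min = {p: float(np.min(v)) for p, v in property2values.items()}
property2max = {p: float(np.max(v)) for p, v in property2values.items()}
# ddof=1 gives the sample standard deviation, matching pandas' default.
property2std = {p: float(np.std(v, ddof=1)) for p, v in property2values.items()}

print(property2min)
print(property2max)
```

Collecting the values per property first means any numpy reduction can be swapped in without touching the loop logic.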