In this document, we analyse how US medical insurance costs vary by location and explore possible reasons for the differences. The data covers each client's age, sex, BMI, number of children, smoker status, region of residence and current insurance charges.
We start by reading each column of the data into its own Python list, as follows:
import csv
# One list per column; csv.DictReader yields every value as a string.
age = []
sex = []
bmi = []
children = []
smoker = []
region = []
charges = []
with open('insurance.csv') as insurance:
    medical_records = csv.DictReader(insurance)
    for row in medical_records:
        age.append(row['age'])
        sex.append(row['sex'])
        bmi.append(row['bmi'])
        children.append(row['children'])
        smoker.append(row['smoker'])
        region.append(row['region'])
        charges.append(row['charges'])
We then create an empty dictionary and count the number of clients from each region.
region_count = {}
for item in region:
    if item in region_count:
        region_count[item] += 1
    else:
        region_count[item] = 1
print(region_count)
{'southwest': 325, 'southeast': 364, 'northwest': 325, 'northeast': 324}
Here, we observe that clients are spread quite evenly across the four regions of the US, though there are slightly more clients in the Southeast region.
Now, let us look at how much clients pay on average in each of the four regions.
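The tally above can also be written more compactly with `collections.Counter` from the standard library. The sketch below uses a short hand-made list, `sample_regions`, as a stand-in for the `region` list; its values are illustrative and are not rows from the dataset.

```python
from collections import Counter

# Hypothetical stand-in for the `region` list; not real rows from insurance.csv.
sample_regions = ['southwest', 'southeast', 'southeast',
                  'northwest', 'northeast', 'southeast']

# Counter tallies each distinct value in a single pass over the list.
sample_region_count = Counter(sample_regions)

print(dict(sample_region_count))
```

Applied to the real data, `Counter(region)` should give the same counts as the explicit loop.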
regions_and_charges = list(zip(region, charges))
charges_by_region = {}
for item in regions_and_charges:
    r = item[0]
    c = float(item[1])
    if r in charges_by_region:
        charges_by_region[r] += c
    else:
        charges_by_region[r] = c
average_charge_by_region = {}
for area in region_count.keys():
    average = charges_by_region[area] / region_count[area]
    average_charge_by_region[area] = round(average, 2)
print(average_charge_by_region)
{'southwest': 12346.94, 'southeast': 14735.41, 'northwest': 12417.58, 'northeast': 13406.38}
We can see here that clients in the Southeast region pay, on average, over $1,300 more than those in the Northeast region and over $2,300 more than those in the Southwest and Northwest regions. To see why this is happening, we can take a closer look at the factors affecting insurance costs (in particular, age, sex, BMI, number of children and smoker status).
We start by looking at the average age by region, using the same method as above.
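To make the comparison concrete, the gaps can be computed directly from the averages printed above. This is a small sketch; the dictionary name `gap_to_southeast` is introduced here for illustration only.

```python
# Averages copied from the printed output above.
average_charge_by_region = {'southwest': 12346.94, 'southeast': 14735.41,
                            'northwest': 12417.58, 'northeast': 13406.38}

southeast = average_charge_by_region['southeast']

# Difference between the Southeast average and each other region's average.
gap_to_southeast = {
    area: round(southeast - average, 2)
    for area, average in average_charge_by_region.items()
    if area != 'southeast'
}

print(gap_to_southeast)
```

The gaps come to roughly 1,329 for the Northeast, 2,318 for the Northwest and 2,388 for the Southwest.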
regions_and_ages = list(zip(region, age))
age_by_region = {}
for item in regions_and_ages:
    r = item[0]
    a = int(item[1])
    if r in age_by_region:
        age_by_region[r] += a
    else:
        age_by_region[r] = a
average_age_by_region = {}
for area in region_count.keys():
    average = age_by_region[area] / region_count[area]
    average_age_by_region[area] = round(average, 1)
print(average_age_by_region)
{'southwest': 39.5, 'southeast': 38.9, 'northwest': 39.2, 'northeast': 39.3}
The data suggests that clients in the Southeast region are marginally younger on average. Since we would expect insurance costs to rise with age, age is unlikely to explain the higher insurance costs in the Southeast region.
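The charge and age averages both use the same accumulate-then-divide loop, so the pattern could be factored into one helper. The function below, `average_by_group`, is a hypothetical name and a minimal sketch rather than part of the original analysis; the toy lists are made up to show the call.

```python
def average_by_group(groups, values, ndigits=2):
    """Return the mean of `values` for each distinct label in `groups`."""
    totals = {}
    counts = {}
    for group, value in zip(groups, values):
        number = float(value)  # csv.DictReader yields strings, so convert first
        totals[group] = totals.get(group, 0.0) + number
        counts[group] = counts.get(group, 0) + 1
    return {group: round(totals[group] / counts[group], ndigits)
            for group in totals}

# Made-up toy data, only to demonstrate the call.
toy_regions = ['southeast', 'southwest', 'southeast', 'southwest']
toy_ages = ['30', '40', '50', '20']

print(average_by_group(toy_regions, toy_ages, ndigits=1))
# → {'southeast': 40.0, 'southwest': 30.0}
```

With the real lists, `average_by_group(region, age, ndigits=1)` would be expected to reproduce the averages printed above.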
It is often theorised that females pay more for health insurance. Let us have a look at the percentage of females by region.
regions_and_sex = list(zip(region, sex))
females_by_region = {}
for item in regions_and_sex:
    r = item[0]
    s = item[1]
    # A region is added to the dictionary the first time a female client is seen.
    if r in females_by_region:
        if s == 'female':
            females_by_region[r] += 1
    else:
        if s == 'female':
            females_by_region[r] = 1
female_percentage_by_region = {}
for area in females_by_region.keys():
    females = 100 * females_by_region[area] / region_count[area]
    female_percentage_by_region[area] = round(females, 2)
print(female_percentage_by_region)
{'southwest': 49.85, 'southeast': 48.08, 'northwest': 50.46, 'northeast': 49.69}
The percentage of females in the Southeast region is the lowest of the four, so sex is also unlikely to explain the higher insurance costs there.
While body mass index (BMI) is generally a poor indicator for pricing health insurance, let us see if it plays any role in the significant cost increase in the Southeast region.
regions_and_bmi = list(zip(region, bmi))
bmi_by_region = {}
for item in regions_and_bmi:
    r = item[0]
    b = float(item[1])
    if r in bmi_by_region:
        bmi_by_region[r] += b
    else:
        bmi_by_region[r] = b
average_bmi_by_region = {}
for area in region_count.keys():
    average = bmi_by_region[area] / region_count[area]
    average_bmi_by_region[area] = round(average, 2)
print(average_bmi_by_region)
{'southwest': 30.6, 'southeast': 33.36, 'northwest': 29.2, 'northeast': 29.17}
The average BMI of clients in the Southeast region (33.36) is noticeably higher than in the other regions (29.17 to 30.6). This suggests that BMI may contribute to the higher insurance costs there, but further analysis is needed to confirm it.
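One possible next step is to measure how strongly BMI and charges move together across individual clients. The sketch below computes a Pearson correlation coefficient by hand; `pearson_correlation` is a hypothetical helper, and the toy values are made up (the toy charges are an exact linear function of the toy BMIs, so the result should be very close to 1).

```python
import math

def pearson_correlation(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError('need two sequences of equal length with at least 2 items')
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations, and the root sums of squared deviations.
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError('correlation is undefined when one sequence is constant')
    return covariance / (spread_x * spread_y)

# Made-up toy data: charges are exactly 200 * bmi - 1400.
toy_bmi = [22.0, 27.5, 31.0, 35.5]
toy_charges = [3000.0, 4100.0, 4800.0, 5700.0]

print(round(pearson_correlation(toy_bmi, toy_charges), 4))
```

On the real data the call would be `pearson_correlation([float(b) for b in bmi], [float(c) for c in charges])`. A correlation alone would not show that BMI causes higher charges.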
Let us see if the number of children plays a significant role in the prices of health insurance in the Southeast region.
regions_and_children = list(zip(region, children))
children_by_region = {}
for item in regions_and_children:
    r = item[0]
    a = int(item[1])
    if r in children_by_region:
        children_by_region[r] += a
    else:
        children_by_region[r] = a
average_children_by_region = {}
for area in region_count.keys():
    average = children_by_region[area] / region_count[area]
    average_children_by_region[area] = round(average, 1)
print(average_children_by_region)
{'southwest': 1.1, 'southeast': 1.0, 'northwest': 1.1, 'northeast': 1.0}
The average number of children is nearly identical across all four regions, suggesting that this variable does not explain the regional differences in health insurance prices.
Given the health issues that surround smoking, it makes sense to theorise that a smoker would pay higher insurance premiums. Let us put that to the test by analysing the relevant data.
regions_and_smokers = list(zip(region, smoker))
smokers_by_region = {}
for item in regions_and_smokers:
    r = item[0]
    s = item[1]
    # A region is added to the dictionary the first time a smoker is seen.
    if r in smokers_by_region:
        if s == 'yes':
            smokers_by_region[r] += 1
    else:
        if s == 'yes':
            smokers_by_region[r] = 1
smokers_percentage_by_region = {}
for area in smokers_by_region.keys():
    smokers = 100 * smokers_by_region[area] / region_count[area]
    smokers_percentage_by_region[area] = round(smokers, 2)
print(smokers_percentage_by_region)
{'southwest': 17.85, 'southeast': 25.0, 'northwest': 17.85, 'northeast': 20.68}
It looks like we may have our answer! The percentage of smokers in the Southeast (25.0%) is markedly higher than in the other regions, especially the Southwest and the Northwest (17.85% each). This is consistent with smokers paying higher insurance premiums than non-smokers, but more analysis is needed before we can draw such a conclusion.
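One way to probe this further would be to average charges separately for smokers and non-smokers within each region, so that the smoker effect is not mixed up with the regional one. The helper `mean_charge_by_region_and_smoker` and the toy rows below are hypothetical and serve only to illustrate the grouping.

```python
def mean_charge_by_region_and_smoker(regions, smokers, charges):
    """Return the mean charge for each (region, smoker status) pair."""
    totals = {}
    counts = {}
    for r, s, c in zip(regions, smokers, charges):
        key = (r, s)
        totals[key] = totals.get(key, 0.0) + float(c)
        counts[key] = counts.get(key, 0) + 1
    return {key: round(totals[key] / counts[key], 2) for key in totals}

# Made-up toy rows; not taken from insurance.csv.
toy_regions = ['southeast', 'southeast', 'southeast',
               'southwest', 'southwest', 'southwest']
toy_smokers = ['yes', 'no', 'no', 'yes', 'no', 'no']
toy_charges = ['30000', '8000', '9000', '29000', '7000', '9000']

toy_means = mean_charge_by_region_and_smoker(toy_regions, toy_smokers, toy_charges)
for key, mean in sorted(toy_means.items()):
    print(key, mean)
```

On the real data the call would be `mean_charge_by_region_and_smoker(region, smoker, charges)`. If smokers pay far more than non-smokers in every region, the higher share of smokers in the Southeast would go some way towards explaining its higher average.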
Based on the cursory analysis above, BMI and smoker status appear to be the factors most associated with the higher health insurance prices in the Southeast region of the US. However, this analysis is not enough to quantify their effects, so further statistical analysis (e.g. testing for significant differences) is needed.
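As one illustration of what such a test might look like, the sketch below runs a simple permutation test for a difference in means using only the standard library. It is a swapped-in example technique, not the method of the original analysis; `permutation_test_mean_difference` is a hypothetical helper and the two groups are made-up numbers.

```python
import random

def permutation_test_mean_difference(group_a, group_b, n_permutations=10000, seed=0):
    """Approximate two-sided p-value for a difference in means by shuffling labels."""
    rng = random.Random(seed)  # fixed seed so the result is repeatable
    observed = abs(sum(group_a) / len(group_a) - sum(group_b) / len(group_b))
    pooled = list(group_a) + list(group_b)
    size_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        mean_a = sum(pooled[:size_a]) / size_a
        mean_b = sum(pooled[size_a:]) / (len(pooled) - size_a)
        if abs(mean_a - mean_b) >= observed:
            at_least_as_extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (at_least_as_extreme + 1) / (n_permutations + 1)

# Made-up toy groups with clearly different means.
toy_group_a = [10, 11, 12, 13, 14, 15, 16, 17]
toy_group_b = [30, 31, 32, 33, 34, 35, 36, 37]

print(permutation_test_mean_difference(toy_group_a, toy_group_b))
```

With the real data, the two groups could be, for example, the charges of Southeast clients and the charges of all other clients, each converted to `float` first. A small p-value would indicate that a gap of that size is unlikely to arise from random assignment alone.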