The dataset under investigation contains records of 110,527 medical appointments, each described by 14 variables (characteristics). The most important variable is the "No-show" column, which indicates whether the patient showed up for the appointment. The variables are listed below.
VARIABLE | DESCRIPTION |
---|---|
PatientId | This is a unique identifier of each patient |
AppointmentID | An ID unique to each scheduled appointment |
Gender | Shows the gender (sex) of the patient |
ScheduledDay | Date on which the patient scheduled the medical visit |
AppointmentDay | Date assigned to the patient for the medical visit |
Age | Age of the patient |
Neighbourhood | Place the patient lives |
Scholarship | Indicates if the patient is enrolled in the welfare program or not |
Hipertension | Indicates if the patient is hypertensive or not |
Diabetes | Indicates if the patient is diabetic or not |
Alcoholism | Indicates if the patient is an alcoholic or not |
Handcap | Indicates if the patient is handicapped |
SMS_received | Indicates if the patient received an SMS about the appointment |
No-show | Indicates if the patient missed the appointment ("Yes") or showed up ("No") |
The following questions will be answered in this report:
At the end of this investigation, we will be able to answer all of these questions and, in particular, to get an idea of whether patients on scholarship show up for appointments more than others.
#import statements
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
# Load your data
df=pd.read_csv('KaggleV2-May-2016.csv')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 110527 entries, 0 to 110526
Data columns (total 14 columns):
 #   Column          Non-Null Count   Dtype
---  ------          --------------   -----
 0   PatientId       110527 non-null  float64
 1   AppointmentID   110527 non-null  int64
 2   Gender          110527 non-null  object
 3   ScheduledDay    110527 non-null  object
 4   AppointmentDay  110527 non-null  object
 5   Age             110527 non-null  int64
 6   Neighbourhood   110527 non-null  object
 7   Scholarship     110527 non-null  int64
 8   Hipertension    110527 non-null  int64
 9   Diabetes        110527 non-null  int64
 10  Alcoholism      110527 non-null  int64
 11  Handcap         110527 non-null  int64
 12  SMS_received    110527 non-null  int64
 13  No-show         110527 non-null  object
dtypes: float64(1), int64(8), object(5)
memory usage: 11.8+ MB
df.head()
| | PatientId | AppointmentID | Gender | ScheduledDay | AppointmentDay | Age | Neighbourhood | Scholarship | Hipertension | Diabetes | Alcoholism | Handcap | SMS_received | No-show |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2.987250e+13 | 5642903 | F | 2016-04-29T18:38:08Z | 2016-04-29T00:00:00Z | 62 | JARDIM DA PENHA | 0 | 1 | 0 | 0 | 0 | 0 | No |
1 | 5.589980e+14 | 5642503 | M | 2016-04-29T16:08:27Z | 2016-04-29T00:00:00Z | 56 | JARDIM DA PENHA | 0 | 0 | 0 | 0 | 0 | 0 | No |
2 | 4.262960e+12 | 5642549 | F | 2016-04-29T16:19:04Z | 2016-04-29T00:00:00Z | 62 | MATA DA PRAIA | 0 | 0 | 0 | 0 | 0 | 0 | No |
3 | 8.679510e+11 | 5642828 | F | 2016-04-29T17:29:31Z | 2016-04-29T00:00:00Z | 8 | PONTAL DE CAMBURI | 0 | 0 | 0 | 0 | 0 | 0 | No |
4 | 8.841190e+12 | 5642494 | F | 2016-04-29T16:07:23Z | 2016-04-29T00:00:00Z | 56 | JARDIM DA PENHA | 0 | 1 | 1 | 0 | 0 | 0 | No |
df.describe()
| | PatientId | AppointmentID | Age | Scholarship | Hipertension | Diabetes | Alcoholism | Handcap | SMS_received |
---|---|---|---|---|---|---|---|---|---|
count | 1.105270e+05 | 1.105270e+05 | 110527.000000 | 110527.000000 | 110527.000000 | 110527.000000 | 110527.000000 | 110527.000000 | 110527.000000 |
mean | 1.474963e+14 | 5.675305e+06 | 37.088874 | 0.098266 | 0.197246 | 0.071865 | 0.030400 | 0.022248 | 0.321026 |
std | 2.560949e+14 | 7.129575e+04 | 23.110205 | 0.297675 | 0.397921 | 0.258265 | 0.171686 | 0.161543 | 0.466873 |
min | 3.920000e+04 | 5.030230e+06 | -1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 4.172615e+12 | 5.640286e+06 | 18.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
50% | 3.173180e+13 | 5.680573e+06 | 37.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
75% | 9.439170e+13 | 5.725524e+06 | 55.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
max | 9.999820e+14 | 5.790484e+06 | 115.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 4.000000 | 1.000000 |
sum(df.duplicated())
0
There are 110,527 records and 14 columns.
The data has no missing values.
There are three data types: integer, float, and string (object).
AppointmentDay and ScheduledDay are stored as strings (object) instead of datetime.
There are no duplicate rows.
The minimum Age is -1, which is impossible. This is probably a typing error by the data entry staff, so we replace every negative age with 1.
# Replace negative ages with 1; assign the result back instead of
# calling inplace=True on a column selection, which may not update df
df['Age'] = df['Age'].mask(df['Age'] < 0, 1)
Any patient age below zero (0) is replaced with 1.
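The replacement step can be checked on a small example before applying it to the full data. The sketch below uses a made-up `ages` Series (an assumption standing in for `df['Age']`) to count the invalid entries and show what `Series.mask` returns.

```python
import pandas as pd

# Hypothetical ages standing in for df['Age']; values are illustrative only.
ages = pd.Series([62, -1, 0, 8, 115], name="Age")

# Count the invalid (negative) entries before changing anything.
n_negative = int((ages < 0).sum())

# Series.mask replaces values where the condition is True
# and returns a new Series, leaving `ages` untouched.
cleaned = ages.mask(ages < 0, 1)

print(n_negative)
print(cleaned.tolist())
```

Counting first makes it easy to confirm that only a handful of rows are affected, and that valid ages of 0 are left alone.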
df.rename(columns = {'No-show':'No_show'}, inplace = True)
The name of the last column ('No-show') caused errors on Windows machines when plotting charts such as histograms, so we rename it to 'No_show'.
df['ScheduledDay'] = pd.to_datetime(df['ScheduledDay'])
df['AppointmentDay'] = pd.to_datetime(df['AppointmentDay'])
Here we convert the 'ScheduledDay' and 'AppointmentDay' columns to datetime.
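As a quick illustration of what the conversion gives us, the sketch below parses two timestamps in the same ISO format as the dataset and reads the weekday name from them. The `sample` frame is an assumption: its first row copies values from `df.head()`, and its second row is made up.

```python
import pandas as pd

# Small stand-in frame; the second row is invented for illustration.
sample = pd.DataFrame({
    "ScheduledDay": ["2016-04-29T18:38:08Z", "2016-04-27T08:36:51Z"],
    "AppointmentDay": ["2016-04-29T00:00:00Z", "2016-05-03T00:00:00Z"],
})

# Convert both string columns to timezone-aware datetimes.
for col in ["ScheduledDay", "AppointmentDay"]:
    sample[col] = pd.to_datetime(sample[col])

# Once the columns are datetimes, the .dt accessor becomes available.
day_names = sample["AppointmentDay"].dt.day_name().tolist()

print(sample.dtypes)
print(day_names)
```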
df.Age.mean()
37.08887421173107
df.Age.min()
0
df.Age.max()
115
# normalize=True returns proportions instead of raw counts
Noshow1 = df['No_show'].value_counts(normalize=True)
print(Noshow1)
No     0.798067
Yes    0.201933
Name: No_show, dtype: float64
Noshow = df['No_show'].value_counts()
print(Noshow)
No     88208
Yes    22319
Name: No_show, dtype: int64
import warnings
warnings.simplefilter(action="ignore", category=FutureWarning)
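def without_hue(plot, feature):
    """Annotate each bar of `plot` with its share of `feature` as a percentage."""
    total = len(feature)
    for p in plot.patches:
        percentage = '{:.1f}%'.format(100 * p.get_height() / total)
        # Place the label roughly centred above the bar
        x = p.get_x() + p.get_width() / 2 - 0.05
        y = p.get_y() + p.get_height()
        plot.annotate(percentage, (x, y), size=12)
    plt.show()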
plt.figure(figsize = (7,5))
ax = sns.countplot(x='No_show', data=df)
plt.xticks(size = 12)
plt.xlabel('No_show', size = 12)
plt.yticks(size = 12)
plt.ylabel('count', size = 12)
without_hue(ax,df.No_show);
We can see that only about 20% of patients failed to show up for their appointments.
sns.displot(data=df,x="Scholarship", hue="No_show", kind="kde", height=6,multiple="fill", clip=(0, None),
palette="ch:rot=-.25,hue=1,light=.75");
This suggests that patients on scholarship miss their appointments at a higher rate than patients who are not on scholarship (i.e. a larger share of them did not show up).
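A density plot of a 0/1 column is hard to read precisely, so it can help to compute the no-show share per scholarship group directly. The sketch below is one way to do that with `pd.crosstab`; the `toy` frame is made up for illustration and its rates say nothing about the real data.

```python
import pandas as pd

# Hypothetical rows standing in for df[['Scholarship', 'No_show']].
toy = pd.DataFrame({
    "Scholarship": [0, 0, 0, 0, 1, 1, 1, 1],
    "No_show": ["No", "No", "No", "Yes", "No", "No", "Yes", "Yes"],
})

# normalize="index" turns each row of the table into proportions,
# so the "Yes" column is the no-show rate within each scholarship group.
rates = pd.crosstab(toy["Scholarship"], toy["No_show"], normalize="index")

print(rates)
```

Running the same call on `df` gives the two rates to compare, which is a more direct check than reading them off the plot.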
sns.kdeplot(data=df, x="Age", hue="No_show", multiple="stack");
The chart above shows that, across ages 0-100, most patients show up for their appointments. However, looking at the No_show = Yes curve, patients above 60 years appear less likely to miss their appointments.
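The age pattern can also be summarised numerically by grouping ages into bands and taking the no-show rate in each band. This is a minimal sketch: the band edges and labels are assumptions chosen for illustration, and the `toy` data is made up.

```python
import pandas as pd

# Hypothetical rows standing in for df[['Age', 'No_show']].
toy = pd.DataFrame({
    "Age": [5, 25, 30, 45, 61, 70, 80, 90],
    "No_show": ["Yes", "No", "Yes", "No", "No", "No", "Yes", "No"],
})

# Assumed age bands; right=False makes each band include its lower edge.
bins = [0, 18, 40, 60, 120]
labels = ["0-17", "18-39", "40-59", "60+"]
toy["age_band"] = pd.cut(toy["Age"], bins=bins, labels=labels, right=False)

# The mean of a boolean Series is the share of True values,
# i.e. the no-show rate within each band.
missed = toy["No_show"].eq("Yes")
rate_by_band = missed.groupby(toy["age_band"], observed=False).mean()

print(rate_by_band)
```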
df['Gender'].value_counts().plot(kind='pie');
This shows that there are more female patients than male patients.
df['Age'].value_counts().plot(kind='bar', figsize= (25,8));
This shows that patients aged zero (0) are more numerous than those of any other single age.
df['Neighbourhood'].value_counts().plot(kind='bar', figsize= (25,8));
Jardim Camburi has the highest number of patient appointments
weekdays = df['ScheduledDay'].dt.day_name()
weekdays.value_counts().plot(kind='bar');
Most patients scheduled their appointments on Tuesdays.
weekdaysa = df['AppointmentDay'].dt.day_name()
weekdaysa.value_counts().plot(kind='bar');
Most patients had their appointments set for Wednesdays.
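`value_counts()` sorts by frequency, so the bars appear from busiest to quietest day rather than in calendar order. If calendar order is preferred, the counts can be reindexed before plotting. The sketch below assumes a small made-up `dates` Series in place of `df['AppointmentDay']`.

```python
import pandas as pd

# Hypothetical appointment dates standing in for df['AppointmentDay'].
dates = pd.to_datetime(pd.Series([
    "2016-05-02", "2016-05-03", "2016-05-03", "2016-05-04", "2016-05-06",
]))

order = ["Monday", "Tuesday", "Wednesday", "Thursday",
         "Friday", "Saturday", "Sunday"]

# Count appointments per weekday, then put the days in calendar order;
# days with no appointments get a count of 0.
counts = dates.dt.day_name().value_counts().reindex(order, fill_value=0)

print(counts)
```

Calling `counts.plot(kind='bar')` on the result would then draw the bars from Monday to Sunday.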
Overall, most of the patients are female, and most patients (80%) attended their medical appointments.
We looked at whether patients on scholarship show up for their appointments more often than others; it turned out that those on scholarship miss their appointments more often.
Finally, patients above age 60 seem to show up for their appointments more often.
Because no statistical test was done, this report cannot state why those on scholarship miss their appointments despite the welfare they receive, nor can it establish whether age is a determining factor in whether a patient shows up for an appointment.
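As a pointer for follow-up work, one common way to check whether two categorical variables such as scholarship and no-show are associated is a chi-square test of independence. The sketch below computes the statistic by hand for a 2x2 table; the counts are made up, the helper name `chi_square_2x2` is hypothetical, and no continuity correction is applied. Such a test only indicates association, not cause.

```python
import math
import pandas as pd

def chi_square_2x2(table):
    """Pearson chi-square test for a 2x2 table (1 degree of freedom)."""
    obs = table.to_numpy(dtype=float)
    row_totals = obs.sum(axis=1, keepdims=True)
    col_totals = obs.sum(axis=0, keepdims=True)
    # Expected counts if the two variables were independent.
    expected = row_totals @ col_totals / obs.sum()
    stat = float(((obs - expected) ** 2 / expected).sum())
    # With 1 degree of freedom, the chi-square tail probability
    # equals erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Made-up counts: rows are Scholarship (0, 1), columns are No_show.
table = pd.DataFrame([[80, 20], [30, 20]],
                     index=[0, 1], columns=["No", "Yes"])

stat, p_value = chi_square_2x2(table)
print(round(stat, 4), p_value)
```

On the real data, the table could be built with `pd.crosstab(df['Scholarship'], df['No_show'])`. A library routine such as `scipy.stats.chi2_contingency` is the more usual route; note that it applies a continuity correction to 2x2 tables by default, so its result can differ slightly from this sketch.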