Project: Analysis of No-Show Medical Appointments¶

Data Analyst: Yahaya Yusuf Danladi

Table of Contents¶

Introduction
Data Wrangling
Exploratory Data Analysis
Conclusions

Introduction¶

The data set under investigation contains records of 110,527 medical appointments, each described by 14 variables (characteristics). The most important variable is the "No-show" column, which indicates whether the patient missed the appointment ("Yes") or attended it ("No"). The variables are listed below.

VARIABLE	DESCRIPTION
PatientId	Unique identifier of each patient
AppointmentID	Unique identifier of each scheduled appointment
Gender	Gender (sex) of the patient
ScheduledDay	Date on which the patient booked the medical visit
AppointmentDay	Date on which the medical visit is due to take place
Age	Age of the patient
Neighbourhood	Place where the patient lives
Scholarship	Indicates whether the patient is enrolled in the welfare program or not
Hipertension	Indicates whether the patient is hypertensive or not
Diabetes	Indicates whether the patient is diabetic or not
Alcoholism	Indicates whether the patient is an alcoholic or not
Handcap	Indicates whether the patient is handicapped
SMS_received	Indicates whether the patient received an SMS about the appointment
No-show	Indicates whether the patient missed the appointment ("Yes") or showed up ("No")

The following questions will be answered in this report:

What Percentage of patients did not show up for their Appointment?
Did Patients enrolled in the scholarship program show up for appointment more than others?
Did younger patients show up for their appointment more than older patients?
Which Gender has more Patients?
Which Age has the highest number of patients?
Which Location has the highest number of Patients Appointments?
Which day of the week did most Patients set up their appointment?
Which day of the week is the busiest?

By the end of this investigation, we will be able to answer all of these questions and, in particular, to get an idea of whether patients on scholarship show up for appointments more than others.

In [84]:

#import statements
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Data Wrangling¶

General Properties¶

In [4]:

# Load your data
df=pd.read_csv('KaggleV2-May-2016.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 110527 entries, 0 to 110526
Data columns (total 14 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   PatientId       110527 non-null  float64
 1   AppointmentID   110527 non-null  int64  
 2   Gender          110527 non-null  object 
 3   ScheduledDay    110527 non-null  object 
 4   AppointmentDay  110527 non-null  object 
 5   Age             110527 non-null  int64  
 6   Neighbourhood   110527 non-null  object 
 7   Scholarship     110527 non-null  int64  
 8   Hipertension    110527 non-null  int64  
 9   Diabetes        110527 non-null  int64  
 10  Alcoholism      110527 non-null  int64  
 11  Handcap         110527 non-null  int64  
 12  SMS_received    110527 non-null  int64  
 13  No-show         110527 non-null  object 
dtypes: float64(1), int64(8), object(5)
memory usage: 11.8+ MB

In [3]:

df.head()

Out[3]:

	PatientId	AppointmentID	Gender	ScheduledDay	AppointmentDay	Age	Neighbourhood	Hipertension	Diabetes	No-show
0	2.987250e+13	5642903	F	2016-04-29T18:38:08Z	2016-04-29T00:00:00Z	62	JARDIM DA PENHA	1	0	No
1	5.589980e+14	5642503	M	2016-04-29T16:08:27Z	2016-04-29T00:00:00Z	56	JARDIM DA PENHA	0	0	No
2	4.262960e+12	5642549	F	2016-04-29T16:19:04Z	2016-04-29T00:00:00Z	62	MATA DA PRAIA	0	0	No
3	8.679510e+11	5642828	F	2016-04-29T17:29:31Z	2016-04-29T00:00:00Z	8	PONTAL DE CAMBURI	0	0	No
4	8.841190e+12	5642494	F	2016-04-29T16:07:23Z	2016-04-29T00:00:00Z	56	JARDIM DA PENHA	1	1	No

In [4]:

df.describe()

Out[4]:

	PatientId	AppointmentID	Age	Scholarship	Hipertension	Diabetes	Alcoholism	Handcap	SMS_received
count	1.105270e+05	1.105270e+05	110527.000000	110527.000000	110527.000000	110527.000000	110527.000000	110527.000000	110527.000000
mean	1.474963e+14	5.675305e+06	37.088874	0.098266	0.197246	0.071865	0.030400	0.022248	0.321026
std	2.560949e+14	7.129575e+04	23.110205	0.297675	0.397921	0.258265	0.171686	0.161543	0.466873
min	3.920000e+04	5.030230e+06	-1.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
25%	4.172615e+12	5.640286e+06	18.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
50%	3.173180e+13	5.680573e+06	37.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
75%	9.439170e+13	5.725524e+06	55.000000	0.000000	0.000000	0.000000	0.000000	0.000000	1.000000
max	9.999820e+14	5.790484e+06	115.000000	1.000000	1.000000	1.000000	1.000000	4.000000	1.000000

In [5]:

sum(df.duplicated())

Out[5]:

Observations made about the data¶

There are 110,527 records and a total of 14 columns.

The data has no missing values.

There are three data types: integer (int64), float (float64) and string (object).

AppointmentDay and ScheduledDay are represented as strings (object) instead of datetime.

There are no duplicates.

The minimum Age is -1, which is impossible. This could be a typing error by the data entry staff, so we decided to replace every -1 with 1.

Data Cleaning¶

In [25]:

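# Replace negative ages with 1. Assign the result back to the column:
# calling mask(..., inplace=True) on df['Age'] may not update df itself.
df['Age'] = df['Age'].mask(df['Age'] < 0, 1)

We replace any patient age that is less than zero (0) with 1.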
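
As a minimal sketch of this replacement, the snippet below applies the same `mask` call to a small made-up Series (the values are illustrative and not taken from the data set). `mask` replaces entries where the condition is true and leaves the others unchanged.

```python
import pandas as pd

# Toy ages (illustrative values only); -1 stands in for the invalid record
ages = pd.Series([62, 56, -1, 8, 0], name="Age")

# mask() returns a new Series in which values matching the condition are replaced
cleaned = ages.mask(ages < 0, 1)

print(cleaned.tolist())
```

Note that an age of 0 is left untouched, since only values below zero match the condition.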

In [11]:

df.rename(columns = {'No-show':'No_show'}, inplace = True)

The name of the last column (No-show) caused errors on a Windows machine when plotting charts such as histograms, so we renamed the column from 'No-show' to 'No_show'.

In [64]:

df['ScheduledDay'] = pd.to_datetime(df['ScheduledDay'])
df['AppointmentDay'] = pd.to_datetime(df['AppointmentDay'])

Here we convert the 'ScheduledDay' and 'AppointmentDay' columns to datetime.
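
To illustrate what the conversion gives us, here is a small sketch using two timestamp strings in the same format as the `df.head()` output above. The Series name `raw` is only for illustration. Once parsed, the `.dt` accessor exposes date parts such as the weekday name, which the later sections rely on.

```python
import pandas as pd

# Two timestamp strings in the ISO 8601 format used by the data set
raw = pd.Series(["2016-04-29T18:38:08Z", "2016-04-29T00:00:00Z"])

# The trailing "Z" marks UTC, so the parsed values are timezone-aware
parsed = pd.to_datetime(raw)

print(parsed.dtype)
print(parsed.dt.day_name().tolist())
```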

Exploratory Data Analysis¶

Summary Statistics of Patients' Age¶

In [21]:

df.Age.mean()

Out[21]:

37.08887421173107

In [24]:

df.Age.min()

Out[24]:

In [19]:

df.Age.max()

Out[19]:

What Percentage of patients did not show up for their Appointment?¶

In [61]:

# normalize=True returns proportions instead of raw counts
Noshow1 = df['No_show'].value_counts(normalize=True)
print(Noshow1)

No     0.798067
Yes    0.201933
Name: No_show, dtype: float64

In [54]:

Noshow = df['No_show'].value_counts()
print(Noshow)

No     88208
Yes    22319
Name: No_show, dtype: int64

In [82]:

def without_hue(ax, feature):
    """Label each bar of a count plot with its percentage of all records."""
    total = len(feature)
    for p in ax.patches:
        percentage = '{:.1f}%'.format(100 * p.get_height() / total)
        # Place the label roughly centred on top of the bar
        x = p.get_x() + p.get_width() / 2 - 0.05
        y = p.get_y() + p.get_height()
        ax.annotate(percentage, (x, y), size=12)
    plt.show()

plt.figure(figsize=(7, 5))
# Pass the column by keyword (x=...); newer seaborn versions no longer accept it
# positionally, so the FutureWarning filter is not needed.
ax = sns.countplot(x='No_show', data=df)
plt.xticks(size=12)
plt.xlabel('No_show', size=12)
plt.yticks(size=12)
plt.ylabel('count', size=12)
without_hue(ax, df.No_show)

We can see that only about 20% of patients (22,319 of 110,527) failed to show up for their appointments.

Did Patients enrolled in the scholarship program show up for appointment more than others?¶

In [32]:

sns.displot(data=df,x="Scholarship", hue="No_show", kind="kde", height=6,multiple="fill", clip=(0, None),
            palette="ch:rot=-.25,hue=1,light=.75");

This suggests that patients on scholarship have a higher share of No-show = Yes (i.e. they missed their appointments relatively more often) than patients who are not on scholarship.
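
Because Scholarship is a 0/1 flag, a density plot can be hard to read. As a hedged alternative, the sketch below computes the share of missed appointments per group with a row-normalised cross-tabulation. The rows in `toy` are made up for illustration and are not results from the data set.

```python
import pandas as pd

# Toy data (made-up rows, not the real data set)
toy = pd.DataFrame({
    "Scholarship": [0, 0, 0, 0, 1, 1, 1, 1],
    "No_show": ["No", "No", "No", "Yes", "No", "No", "Yes", "Yes"],
})

# normalize="index" makes each row (each Scholarship group) sum to 1
rates = pd.crosstab(toy["Scholarship"], toy["No_show"], normalize="index")

print(rates)
```

On the notebook's data, the same call with `df["Scholarship"]` and `df["No_show"]` would give the actual share of "Yes" and "No" within each group.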

Did younger patients show up for their appointment more than older patients?¶

In [27]:

sns.kdeplot(data=df, x="Age", hue="No_show", multiple="stack");

The chart above shows that, across the 0-100 age range, most patients showed up for their appointments. However, looking at the No-show (Yes) curve, we can see that patients above 60 years of age appear less likely to miss their appointments.
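
One way to check this reading numerically is to group ages into bands and compute the share of missed appointments in each band. The sketch below does this on made-up rows. The band edges and labels are assumptions chosen for illustration, not values used elsewhere in this report.

```python
import pandas as pd

# Toy data (made-up rows, not the real data set)
toy = pd.DataFrame({
    "Age": [5, 15, 25, 35, 45, 65, 70, 80],
    "No_show": ["No", "Yes", "Yes", "No", "No", "No", "No", "Yes"],
})

# Hypothetical age bands; right=False makes each band include its lower edge
bins = [0, 20, 40, 60, 120]
labels = ["0-19", "20-39", "40-59", "60+"]
toy["AgeBand"] = pd.cut(toy["Age"], bins=bins, labels=labels, right=False)

# Share of missed appointments (No_show == "Yes") within each band
missed = toy["No_show"] == "Yes"
missed_rate = missed.groupby(toy["AgeBand"], observed=False).mean()

print(missed_rate)
```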

Which Gender has more Patients?¶

In [28]:

df['Gender'].value_counts().plot(kind='pie');

This shows that there are more female patients than male patients.

Which Age has the highest number of patients?¶

In [177]:

df['Age'].value_counts().plot(kind='bar', figsize= (25,8));

This shows that patients aged zero (0) are more numerous than patients of any other single age.

Which Location has the highest number of Patients Appointments?¶

In [101]:

df['Neighbourhood'].value_counts().plot(kind='bar',  figsize= (25,8));

Jardim Camburi has the highest number of patient appointments
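
With many categories on the x-axis, the top entry can be hard to read from a wide bar chart. As a small sketch, `value_counts` can also be inspected directly. The toy Series below reuses neighbourhood names that appear in this notebook, but the counts are made up.

```python
import pandas as pd

# Toy column (made-up rows; only the names come from the notebook)
toy = pd.Series(
    ["JARDIM CAMBURI", "JARDIM DA PENHA", "JARDIM CAMBURI",
     "MATA DA PRAIA", "JARDIM DA PENHA", "JARDIM CAMBURI"],
    name="Neighbourhood",
)

counts = toy.value_counts()   # sorted from most to least frequent
top_name = counts.idxmax()    # label with the highest count

print(counts.head(3))
print(top_name)
```

The same two calls work for the Age column, for example `df['Age'].value_counts().idxmax()`.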

Which day of the week did most Patients set up their appointment?¶

In [138]:

weekdays = df['ScheduledDay'].dt.day_name()
weekdays.value_counts().plot(kind='bar');

More patients scheduled their appointments on Tuesdays than on any other day of the week.

Which day of the week is the busiest?¶

In [85]:

weekdaysa = df['AppointmentDay'].dt.day_name()
weekdaysa.value_counts().plot(kind='bar');

Wednesday is the busiest day: more appointments took place on Wednesdays than on any other day of the week.
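
By default, `value_counts` sorts the bars by frequency, so the weekdays appear out of calendar order. A possible refinement, sketched here on made-up dates, is to reindex the counts to a fixed weekday order before plotting. The name `day_order` is introduced only for this example.

```python
import pandas as pd

# Toy appointment dates (made-up, for illustration only)
dates = pd.to_datetime(pd.Series([
    "2016-04-29", "2016-05-03", "2016-05-04", "2016-05-04", "2016-05-02",
]))

day_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

# Count appointments per weekday, then put the days in calendar order;
# days with no appointments get a count of 0
counts = dates.dt.day_name().value_counts().reindex(day_order, fill_value=0)

print(counts)
```

Calling `counts.plot(kind='bar')` on the result would then draw the bars from Monday to Sunday.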

Conclusions¶

Generally, most of the patients are female, and most patients (80%) attended their medical appointments.

We looked at the patients on scholarship to see whether they show up for their appointments more than others, but it turned out that those on scholarship miss their appointments more often than those who are not.

Finally, patients above age 60 seem to show up for their appointments more.

Because no statistical test was done, this report cannot state why those on scholarship miss their appointments despite the welfare they enjoy, nor can it establish whether age is a determining factor in whether a patient shows up for an appointment.
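
As a pointer for future work, one possible test (not performed in this report) is a chi-square test of independence between Scholarship and No_show. The sketch below assumes SciPy is installed and uses made-up counts purely to show the mechanics. It does not report results for this data set.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Made-up contingency table for illustration only (rows: Scholarship 0/1,
# columns: No_show "No"/"Yes"); these are not counts from the data set
observed = pd.DataFrame(
    {"No": [80, 70], "Yes": [20, 30]},
    index=pd.Index([0, 1], name="Scholarship"),
)

# Returns the test statistic, p-value, degrees of freedom and expected counts
chi2, p_value, dof, expected = chi2_contingency(observed)

print(f"chi2={chi2:.3f}, p={p_value:.3f}, dof={dof}")
```

On the notebook's data, the observed table could be built with `pd.crosstab(df["Scholarship"], df["No_show"])`. Even a small p-value would only indicate an association between the two variables, not the reason behind it.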

References¶

https://stackoverflow.com/questions/35692781/python-plotting-percentage-in-seaborn-bar-plot
https://seaborn.pydata.org/