
School Data Cleaning

Cleaning Data Basics | Stage 1

The school principal needs some of the student information in the system cleaned. A new staff member entered the data, and the computers were having issues while they were working on it.

 

We need you to look for errors and fix them if you find any.

 

We will give you only a small sample of the student information, since we still need to figure out where the input process went wrong or what the glitch did.

School Data Set.csv

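import pandas as pd
import numpy as np

# Load the sample of student records and take a first look.
data = pd.read_csv("School Data Set.csv")
data

# Drop the unnamed index column that came in with the file.
data.drop("Unnamed: 0", axis=1, inplace=True)
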

# List the columns, then inspect the distinct values in each one to spot errors.
list(data.columns)

list(data["Major"].unique())

list(data["Class Year"].unique())

list(data["GPA"].unique())

list(data["Dorm"].unique())

list(data["Home_State"].unique())

list(data["Home_Country"].unique())
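The column-by-column inspection above can also be done in one pass. This is a minimal sketch on a small made-up DataFrame (the real file's contents are not shown here); `summarize_columns` is a hypothetical helper, not part of the original solution.

```python
import pandas as pd

# Hypothetical sample standing in for the real data set.
sample = pd.DataFrame({
    "Major": ["bio", "Biology", "math", "Math"],
    "Dorm": ["e", "w", "e", None],
})

def summarize_columns(df):
    """Return {column name: value counts, missing values included}."""
    return {col: df[col].value_counts(dropna=False) for col in df.columns}

summary = summarize_columns(sample)
for col, counts in summary.items():
    print(col)
    print(counts)
```

Counting values, rather than only listing the unique ones, also shows how often each misspelling occurs, which may help separate one-off typos from a systematic glitch.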

​

# Map every misspelled or abbreviated major to its canonical name.
data["Major"] = data["Major"].replace({
    "Computer sci": "Computer Science",
    "comp sci": "Computer Science",
    "computer science": "Computer Science",
    "bio": "Biology",
    "biology": "Biology",
    "business": "Business",
    "econ": "Economics",
    "economics": "Economics",
    "exer science": "Exercise science",
    "exercise science": "Exercise science",
    "math": "Mathematics",
    "Math": "Mathematics",
    "mathematics": "Mathematics",
})

data["Major"].unique()
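The replacement dictionary above lists each spelling variant separately. A possible alternative, sketched here on made-up values, is to normalize case and whitespace first so the lookup table needs only lowercase keys. The `canonical` table and the sample entries are assumptions for illustration.

```python
import pandas as pd

# Hypothetical raw values, modeled on the spellings handled above.
majors = pd.Series(["Computer sci", " comp sci", "BIO", "Math ", "Business"])

# Keys are lowercase, whitespace-stripped spellings; values are canonical names.
canonical = {
    "computer sci": "Computer Science",
    "comp sci": "Computer Science",
    "computer science": "Computer Science",
    "bio": "Biology",
    "biology": "Biology",
    "business": "Business",
    "math": "Mathematics",
    "mathematics": "Mathematics",
}

normalized = majors.str.strip().str.lower()
# map() yields NaN for keys it does not know, so fall back to the original entry.
cleaned = normalized.map(canonical).fillna(majors)
print(cleaned.tolist())
```

The fallback keeps unrecognized entries unchanged, so they still show up in a later `unique()` check instead of silently becoming missing.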

​

# Keep only the last character of each Class Year entry.
correct_class = [value[-1:] for value in data["Class Year"]]

data["Class Year"] = correct_class
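The same last-character extraction can be written with the pandas string accessor. This is a sketch on invented entries (the real column's formats are not shown here), and the added whitespace trimming is an assumption beyond what the step above does.

```python
import pandas as pd

# Hypothetical entries; the actual Class Year values may look different.
class_year = pd.Series(["Year 1", "yr2", "3 ", "Class of 4"])

# Trim stray whitespace, then keep the final character of each entry.
last_char = class_year.astype(str).str.strip().str[-1]
print(last_char.tolist())  # ['1', '2', '3', '4']
```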

​

# The GPA and Dorm columns hold each other's values, so swap their contents.
gpa = list(data["GPA"])
data["GPA"] = data["Dorm"]
data["Dorm"] = gpa

data["Dorm"].unique()
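The swap above goes through a temporary list. One alternative, sketched on a toy DataFrame with made-up values, is to swap the column labels with `rename` and then restore the original column order, which keeps each column's dtype intact.

```python
import pandas as pd

# Hypothetical frame where the two columns hold each other's values.
df = pd.DataFrame({"GPA": ["east", "west"], "Dorm": [3.5, 2.9]})

original_order = list(df.columns)
# Swap the labels, then put the columns back in their original positions.
swapped = df.rename(columns={"GPA": "Dorm", "Dorm": "GPA"})[original_order]

print(swapped)
print(swapped.dtypes)
```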

​

# Keep only the first character of each Dorm entry, then expand it to the full name.
correct_dorm = [value[0] for value in data["Dorm"]]

data["Dorm"] = correct_dorm

data["Dorm"] = data["Dorm"].replace({"e":"East", "w":"West", "s":"South", "n":"North"})

data["Dorm"].unique()
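The first-letter approach above is case-sensitive: an entry starting with an uppercase "E" would not match the key "e". Here is a hedged sketch, on invented entries, of a variant that lowercases the first letter before mapping. Whether the real data contains such entries is not known from this page.

```python
import pandas as pd

# Hypothetical dorm entries with mixed case and stray whitespace.
dorms = pd.Series(["east", "E", "w ", "South", "n"])

names = {"e": "East", "w": "West", "s": "South", "n": "North"}

# Strip whitespace, take the first character, lowercase it, then look it up.
letters = dorms.str.strip().str[0].str.lower()
cleaned_dorms = letters.map(names)
print(cleaned_dorms.tolist())
```

Unlike `replace`, `map` returns NaN for any letter missing from the table, which makes unexpected entries easy to find with `isna()`.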

​

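# Home_State currently holds the country values and vice versa, so swap them.
country = list(data["Home_State"])
data["Home_State"] = data["Home_Country"]
data["Home_Country"] = country

# Recode the 1/0 gym flag as yes/no.
data["Use_Gym"] = data["Use_Gym"].replace({1:"yes", 0:"no"})
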

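# Treat the 8.0 GPA entries as invalid and mark them as missing.
# np.nan keeps the column numeric; the string 'nan' would not count as missing.
data["GPA"] = data["GPA"].replace(8.0, np.nan)

data["GPA"].unique()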
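To illustrate why the replacement value matters, this small sketch (with made-up GPA values) compares replacing with the string `"nan"` against replacing with `np.nan`.

```python
import numpy as np
import pandas as pd

# Hypothetical GPA values; 8.0 stands in for the invalid entry.
gpa = pd.Series([3.2, 8.0, 2.7])

as_string = gpa.replace(8.0, "nan")    # a text placeholder
as_missing = gpa.replace(8.0, np.nan)  # a real missing value

print(as_string.isna().sum())   # the string is not detected as missing
print(as_missing.isna().sum())  # the NaN is detected
print(as_missing.mean())        # numeric operations skip NaN by default
```

With a real missing value, tools such as `isna()`, `dropna()`, and `mean()` behave as expected, while the string version turns the column into mixed text and numbers.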

​

# Save without the row index so no extra "Unnamed: 0" column appears on reload.
data.to_csv("School Data Set Solved.csv", index=False)
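The effect of writing the row index can be checked with a quick round trip. This sketch uses an in-memory buffer and a one-row toy DataFrame instead of the real file.

```python
import io
import pandas as pd

# Hypothetical one-row frame standing in for the cleaned data.
df = pd.DataFrame({"Major": ["Biology"], "GPA": [3.4]})

# Default behavior: the row index is written as an unnamed first column.
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)

# index=False leaves the row index out of the file.
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index)     # ['Unnamed: 0', 'Major', 'GPA']
print(cols_without_index)  # ['Major', 'GPA']
```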

​
