首页 > 代码库 > Data Cleaning 2

Data Cleaning 2

1. When we match a set of data with duplicated values in a column, and we want to use this column as an unify column which is sharing for each database. We are going to filter them into a DataFrame we want.

  class_size = data["class_size"]
  class_size= class_size[class_size["GRADE "] == "09-12" ]
  class_size= class_size[class_size["PROGRAM TYPE"]=="GEN ED"]

2. Once we filtered the column ,we want to condence the duplicated column into one by using groupby() and agg function.

  import numpy as np
  group_by = class_size.groupby(‘DBN‘) #group_by is a special type of data called GroupBy
  class_size = group_by.aggregate(np.mean) # we use aggregate function to deal with the GroupBy types of data .At his moment, the index of class_size will change to the grouped by value (DBN).
  class_size.reset_index(inplace = True) # reset_index allows us to reset the index as a row number - 1
  data[‘class_size‘] = class_size

3. Numeric all the number string by using pd.numeric() function:

  cols = [‘AP Test Takers ‘, ‘Total Exams Taken‘, ‘Number of Exams with scores 3 4 or 5‘]

  for col in cols:
  data["ap_2010"][col] = pd.to_numeric(data["ap_2010"][col],errors = "coerce")

4. After cleanning each dataset, we could like to combine them together so that we can plot them. Normally we use merge() function to combine two dataset.

  combined = data["sat_results"]

  combined = combined.merge(data["ap_2010"],how = "left")
  combined = combined.merge(data["graduation"],how = "inner")
  print(combined.shape)

5. At last, we want to extract some number form certain rows by using apply() function:

  index = combined.index

  def get_first_two_char(data):
    return data[0:2]

  combined["school_dist"] = combined["DBN"].apply(get_first_two_char)#usually once we need to use for loop in the DataFrame, we would like to use apply function to simplieze it.

Data Cleaning 2