7 Data Pre-Processing Methods With SciKit-Learn
Using Python and Google Colab

Data pre-processing is an important part of preparing, organizing, and structuring data for further analysis or Machine Learning model engineering.
# Define X and y:
y = df['RainTomorrow']
X = df.drop('RainTomorrow', axis=1)

If we further inspect y, we will find that it is coded as a string with values ‘Yes’ and ‘No’.
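If a numeric target is needed, one option is scikit-learn's `LabelEncoder`. The sketch below is only an illustration: it uses a small, made-up Series as a stand-in for `df['RainTomorrow']`, and the names `encoder` and `y_encoded` are chosen for this example.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in for df['RainTomorrow'] (values invented for illustration)
y = pd.Series(['No', 'Yes', 'No', 'No', 'Yes'], name='RainTomorrow')

# LabelEncoder sorts the class labels and maps them to 0..n_classes-1
encoder = LabelEncoder()
y_encoded = encoder.fit_transform(y)

print(encoder.classes_)  # learned labels, in sorted order
print(y_encoded)         # integer codes aligned with y
```

With two classes, ‘No’ sorts before ‘Yes’, so ‘No’ maps to 0 and ‘Yes’ to 1. `encoder.inverse_transform` recovers the original strings if they are needed later.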
The RobustScaler class tries to solve this problem by applying a data transformation that subtracts the median and scales the data according to a quantile range (by default the interquartile range, between the 25th and 75th percentiles).
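To make the transformation concrete, here is a minimal sketch on a tiny, made-up feature with one extreme value. The array `X_demo` is hypothetical and not part of the weather dataset. It relies on the default `RobustScaler` settings.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Hypothetical single feature with one extreme outlier (100)
X_demo = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# Defaults: center on the median, scale by the interquartile range
scaler = RobustScaler()
X_scaled = scaler.fit_transform(X_demo)

print(scaler.center_)    # median of the feature
print(scaler.scale_)     # interquartile range of the feature
print(X_scaled.ravel())  # scaled values, outlier included
```

Here the median is 3 and the interquartile range is 2, so the four ordinary values land between -1 and 0.5 while the outlier remains far away at 48.5. Because the median and quartiles barely react to a single extreme value, the outlier does not compress the rest of the data the way it would with a mean and standard deviation.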
Remember that even though this change in scale may seem counter-productive for data visualization, our focus here is on preparing data for building Machine Learning models, not on data visualization.
Let’s try with our sample dataset:

# Import libraries and read the DataFrame:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv('/content/weather.csv')
df

Now we will inspect our variables to check if any of them have a binomial distribution:

# Plot histograms for numerical variables:
fig, axs = plt.subplots(4, 4, figsize=(14, 14))

sns.histplot(data=df, x="MinTemp", kde=True, color="skyblue", ax=axs[0, 0])
sns.histplot(data=df, x="MaxTemp", kde=True, color="skyblue", ax=axs[0, 1])
sns.histplot(data=df, x="Rainfall", kde=True, color="skyblue", ax=axs[0, 2])
sns.histplot(data=df, x="Evaporation", kde=True, color="skyblue", ax=axs[0, 3])

sns.histplot(data=df, x="WindGustSpeed", kde=True, color="skyblue", ax=axs[1, 0])
sns.histplot(data=df, x="WindSpeed9am", kde=True, color="skyblue", ax=axs[1, 1])
sns.histplot(data=df, x="WindSpeed3pm", kde=True, color="skyblue", ax=axs[1, 2])
sns.histplot(data=df, x="Humidity9am", kde=True, color="skyblue", ax=axs[1, 3])

sns.histplot(data=df, x="Humidity3pm", kde=True, color="skyblue", ax=axs[2, 0])
sns.histplot(data=df, x="Pressure9am", kde=True, color="skyblue", ax=axs[2, 1])
sns.histplot(data=df, x="Pressure3pm", kde=True, color="skyblue", ax=axs[2, 2])
sns.histplot(data=df, x="Cloud9am", kde=True, color="skyblue", ax=axs[2, 3])

sns.histplot(data=df, x="Cloud3pm", kd