1  Exploratory Data Analysis

1.1 Introduction

Welcome to the exciting EDA phase of our project! In this section, we’ll dive deep into the 3D Printer Dataset for Mechanical Engineers and uncover fascinating insights. First, we’ll familiarize ourselves with the dataset’s feature and target variables to get a better understanding of the data we’re working with. Then, we’ll explore the data further to identify any potential machine learning modeling issues that we might encounter as we proceed to the regression phase of this project. Are you ready to join us on this journey of discovery? Let’s get started!

1.2 About the Dataset

The 3D Printer Dataset for Mechanical Engineers (Okudan (2016)) is a dataset for those who want to find relations between the input parameters of 3D printers and the output quality, accuracy, and strength of the printed objects. It was generated by Ahmet Okudan, a senior engineer at Numesys and a mechanical engineering researcher at Selcuk University in Turkey.

The dataset has nine feature variables and three target variables.

1.2.1 Features

  • Layer Height (mm): The thickness of each layer of the printed object
  • Wall Thickness (mm): The thickness of the outer shell of the printed object
  • Infill Density (%): The percentage of the interior volume of the printed object that is filled with material
  • Infill Pattern: The shape of the infill material, such as grid, honeycomb, triangles, etc.
  • Nozzle Temperature (°C): The temperature of the extruder nozzle that melts and deposits the material
  • Bed Temperature (°C): The temperature of the platform that supports the printed object
  • Print Speed (mm/s): The speed at which the nozzle moves along the X-Y plane
  • Material: The type of filament material used for printing, such as PLA, ABS, PETG, etc.
  • Fan Speed (%): The speed of the cooling fan that reduces the temperature of the printed object

1.2.2 Targets

  • Roughness (µm): The average deviation of the surface height from the mean plane (see the formula after this list)
  • Tension (ultimate) Strength (MPa): The maximum stress that a material can withstand before breaking
  • Elongation (%): The percentage of increase in length that a material undergoes before breaking
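
For reference, this arithmetic average roughness is conventionally computed as the mean absolute deviation of the measured profile heights \(z_i\) from their mean \(\bar{z}\):

\[
R_a = \frac{1}{n}\sum_{i=1}^{n}\left|z_i - \bar{z}\right|
\]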

1.3 Exploratory Data Analysis

The dataset underwent a thorough exploration and assessment, which included the following steps:

  • Verification of missing values to ensure data completeness.
  • Identification of duplicated rows to prevent data redundancy.
  • Assessment of feature normality to determine the appropriate statistical tests.
  • Detection of outliers among the features to prevent skewed results.
  • Evaluation of class balance and proportions of the categorical features to ensure data representativeness.
  • Investigation of correlations among numerical features to uncover hidden relationships.
  • Examination of multicollinearity among features to prevent model instability.

To facilitate this comprehensive analysis, a central Python class named EDA was constructed. For further information about this class, please refer to the code listing below.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from sklearn.preprocessing import LabelEncoder

from warnings import filterwarnings
filterwarnings(action="ignore", category=FutureWarning)

# Load the dataset in a pandas dataframe
dataset = pd.read_csv("book_assets/dataset/data.csv")

class EDA:
    def __init__(self, df) -> None:
        self.df = df
        # Split the dataset into features and targets
        self.features = df.iloc[:, 0:9]
        self.targets  = df.iloc[:, 9:]

        # Split the features into numerical predictors and categorical predictors
        self.categorical_predictors = self.features.select_dtypes(include="object")
        self.numerical_predictors   = self.features.select_dtypes(include="number")

    def check_missing_counts(self):
        return pd.DataFrame(self.df.isna().sum(), columns=["Missing Counts"])
    

    def check_duplicated_rows(self) -> int:
        return self.df.duplicated().sum()
    

    def check_outliers(self, n_sigma_limit=3):
        # Compute the n-sigma control limits for the numerical columns only
        numeric_df = self.df.select_dtypes(include="number")
        cl = numeric_df.mean()
        ll = cl - n_sigma_limit * numeric_df.std()
        ul = cl + n_sigma_limit * numeric_df.std()

        # Count the datapoints falling outside the limits
        below_ll = (numeric_df < ll).sum()
        above_ul = (numeric_df > ul).sum()

        return pd.concat(
            objs = [cl, ll, ul, below_ll, above_ul],
            axis = 1,
            keys = [
                "Mean",
                "Lower Limit (LL)",
                "Upper Limit (UL)",
                "Below LL",
                "Above UL"]
        )


    def check_normality(self):
        alpha = 0.05
        normality_test_df = pd.DataFrame(
            columns=[
                "Feature",
                "Statistic",
                "p-value",
                "Null Hypothesis"
            ]
        )

        # D'Agostino-Pearson test: a small p-value is evidence against normality
        for predictor in self.numerical_predictors.columns:
            stat, pval = stats.normaltest(self.numerical_predictors[predictor])

            normality_test_df.loc[len(normality_test_df)] = [
                predictor,
                stat,
                pval,
                "Reject" if pval <= alpha else "Fail to Reject"
            ]

        return normality_test_df


    def check_categorical_class_names(self):
        for predictor in self.categorical_predictors:
            print(predictor, self.categorical_predictors[predictor].unique())


    def check_categorical_class_balance(self):
        df = pd.DataFrame(columns=[
            "Predictor",
            "Class",
            "Frequency",
            "Total Observations",
            "Class Proportion"
        ])
        for predictor_name in self.categorical_predictors.columns:
            predictor = self.categorical_predictors[predictor_name]
            predictor_classes = predictor.unique()

            for predictor_class in predictor_classes:
                class_freq = len(predictor[predictor == predictor_class])
                total_items = len(predictor)
                class_prop = class_freq / total_items

                df.loc[len(df)] = [
                    predictor_name,
                    predictor_class,
                    class_freq,
                    total_items,
                    class_prop
                ]

        return df

    
    def check_categorical_class_distribution(self):
        return self.df.groupby(["material", "infill_pattern"]).count().T

    def interpret_corr(self, corr):
        # Map the absolute magnitude of the correlation to a verbal interpretation
        r = np.abs(corr)
        if r == 0:
            return "No Correlation"
        elif r < 0.2:
            return "Very Weak Correlation"
        elif r < 0.4:
            return "Weak Correlation"
        elif r < 0.6:
            return "Moderately Strong Correlation"
        elif r < 0.8:
            return "Strong Correlation"
        elif r < 1.0:
            return "Very Strong Correlation"
        elif r == 1.0:
            return "Perfect Correlation"
        return ""

    def check_corr_of_features_and_targets(self, only_strong=False):
        df = pd.DataFrame(columns=["Target", "Feature", "Pearson (r)", "Remarks"])
        for target in self.targets.columns:
            for feature in self.features.columns:
                if feature in self.numerical_predictors.columns:
                    # Calculate the correlation
                    corr = self.targets[target].corr(self.features[feature])
        
                    # Append the result
                    df.loc[len(df)] = [
                        target,
                        feature,
                        corr,
                        self.interpret_corr(corr)
                    ]

        # Return the final dataframe
        if only_strong:
            return df[(
                    df["Pearson (r)"].abs() >= 0.4
                )].sort_values(by=["Target", "Pearson (r)"])
        else:
            return df.sort_values(by=["Target", "Pearson (r)"])

    def check_assoc_of_features_and_targets(self):
        label_encoder = LabelEncoder()
        df = pd.DataFrame(columns=["Target", "Feature", "Point Biserial (r)", "Remarks"])
        for target in self.targets.columns:
            for feature in self.features.columns:
                if feature in self.categorical_predictors.columns:
                    x = self.features[feature]
                    y =  self.targets[target] 

                    label_encoder.fit(x)

                    stat, pval = stats.pointbiserialr(label_encoder.transform(x), y)

                    df.loc[len(df)] = [
                        target,
                        feature,
                        stat,
                        self.interpret_corr(stat)
                    ]

        return df.sort_values(by=["Target", "Point Biserial (r)"])

    def check_feature_multicorr(self):
        corr = self.numerical_predictors.corr()
        sns.heatmap(corr, annot=True)


    def visualize_numerical_predictors(self):
        # Plot a KDE overlaid on a density histogram for each numerical feature
        for i, column in enumerate(self.numerical_predictors.columns):
            plt.subplot(4, 2, i + 1)
            sns.kdeplot(self.numerical_predictors[column])
            sns.histplot(self.numerical_predictors[column], stat="density")
            plt.tight_layout()

    def visualize_targets(self):
        fig, ax = plt.subplots(3,1)
        fig.tight_layout()
        for i in range(len(self.targets.columns)):
            sns.kdeplot(self.targets.iloc[:, i], ax=ax[i])
            sns.histplot(self.targets.iloc[:, i], ax=ax[i], stat="density")


    def visualize_outliers(self):
        sns.boxplot(data=self.df)
        plt.xticks(rotation=90)

    def visualize_class_distribution(self):
        sns.histplot(self.df, x="material", hue="infill_pattern", multiple='dodge')

    def visualize_targets_dist(self, by):
        for i in range(len(self.targets.columns)):
            plt.subplot(1,3,i+1)
            plt.tight_layout()
            sns.boxplot(data=self.df, x=by, y=self.targets.columns[i])

eda = EDA(df=dataset)

1.3.1 Data Summary

Using the EDA class above, the following observations were made about the dataset:

1.3.2 Missing and Duplicate Values

The dataset was found to have no missing values and no duplicated rows.

eda.check_missing_counts()
Missing Counts
layer_height 0
wall_thickness 0
infill_density 0
infill_pattern 0
nozzle_temperature 0
bed_temperature 0
print_speed 0
material 0
fan_speed 0
roughness 0
tension_strenght 0
elongation 0
print(f"Number of duplicated rows: {eda.check_duplicated_rows()}-rows")
Number of duplicated rows: 0-rows

1.3.3 Normality Check

In terms of its empirical distribution, the dataset was tested for normality using the D'Agostino and Pearson's test implemented in scipy.stats.normaltest. Based on the results, at the 5% significance level we do not have enough evidence to reject the null hypothesis of normality for nozzle_temperature, whereas all of the other numerical predictors showed sufficient evidence against normality. This is expected, since the predictors are machine parameter settings. The target variables, as the KDE plots below suggest, appear approximately normally distributed.

eda.visualize_targets()

Kernel density estimate (KDE) plot and histogram of the target variables. The three variables seem approximately normally distributed but exhibit a bimodal tendency, especially tension_strenght.
eda.visualize_numerical_predictors()

Kernel density estimate (KDE) plot and histogram of the feature variables. All of the feature variables appear roughly uniformly distributed.
eda.check_normality()
Feature Statistic p-value Null Hypothesis
0 layer_height 17.231029 0.000181 Reject
1 wall_thickness 8.034294 0.018004 Reject
2 infill_density 6.650689 0.035960 Reject
3 nozzle_temperature 2.890296 0.235711 Fail to Reject
4 bed_temperature 18.617248 0.000091 Reject
5 print_speed 10.203504 0.006086 Reject
6 fan_speed 18.617248 0.000091 Reject

Tabulated results of the D'Agostino and Pearson's test. Based on the results, at 5% significance we have enough statistical evidence that all of the feature variables except nozzle_temperature were not drawn from a normal distribution; for nozzle_temperature, we do not have enough evidence to reject the null hypothesis of normality.
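
To see how the test behaves on samples with a known distribution, the following minimal sketch (reusing the imports from the listing above) runs scipy.stats.normaltest on synthetic data; the samples are illustrative and not part of the dataset.

# A minimal sketch of scipy.stats.normaltest on synthetic samples.
# These samples are illustrative and not part of the dataset.
rng = np.random.default_rng(seed=42)
normal_sample  = rng.normal(loc=0, scale=1, size=500)    # drawn from a normal distribution
uniform_sample = rng.uniform(low=0, high=1, size=500)    # clearly non-normal

for name, sample in [("normal", normal_sample), ("uniform", uniform_sample)]:
    stat, pval = stats.normaltest(sample)
    # A p-value <= 0.05 is evidence against the null hypothesis of normality
    print(f"{name}: statistic={stat:.3f}, p-value={pval:.4f}")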

1.3.4 Outliers

In this dataset, an outlier was defined as a univariate data point that lies outside the 3-sigma limits (\(\bar{x}\pm3\sigma\)) around the mean. Based on the results of the EDA, the dataset does not contain outliers outside the defined limits.

eda.visualize_outliers()

Boxplot of the numerical variables in the dataset. Based on the above results, only print_speed appears to have an outlier; however, this value does not fall outside the \(3\sigma\) limits.
eda.check_outliers()
Mean Lower Limit (LL) Upper Limit (UL) Below LL Above UL
layer_height 0.106 -0.087190 0.299190 0 0
wall_thickness 5.220 -3.548240 13.988240 0 0
infill_density 53.400 -22.690440 129.490440 0 0
nozzle_temperature 221.500 177.038942 265.961058 0 0
bed_temperature 70.000 48.571429 91.428571 0 0
print_speed 64.000 -25.076899 153.076899 0 0
fan_speed 50.000 -57.142857 157.142857 0 0
roughness 170.580 -126.522388 467.682388 0 0
tension_strenght 20.080 -6.696901 46.856901 0 0
elongation 1.672 -0.692565 4.036565 0 0

Tabulated result of outlier checking. Based on the above results, there were no outliers falling outside the \(3\sigma\) limits.
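
The \(3\sigma\) rule itself is easy to demonstrate. The sketch below (reusing the imports above) applies it to a synthetic series with one injected outlier; the values are illustrative, not from the dataset.

# A minimal sketch of the 3-sigma outlier rule on synthetic data.
rng = np.random.default_rng(seed=0)
x = pd.Series(rng.normal(loc=50, scale=5, size=200))
x.iloc[0] = 90   # inject an obvious outlier

mean, std = x.mean(), x.std()
ll, ul = mean - 3 * std, mean + 3 * std   # the x̄ ± 3σ limits
outliers = x[(x < ll) | (x > ul)]
print(f"limits: [{ll:.2f}, {ul:.2f}], outliers found: {len(outliers)}")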

1.3.5 Balance of Categorical Classes

Upon inspection of the class distribution, it was found that for both material and infill_pattern, the classes were balanced at a 50% proportion each.

eda.check_categorical_class_balance()
Predictor Class Frequency Total Observations Class Proportion
0 infill_pattern grid 25 50 0.5
1 infill_pattern honeycomb 25 50 0.5
2 material abs 25 50 0.5
3 material pla 25 50 0.5

Result of counting class frequencies per categorical feature. Based on the above results, each of the classes has a 50-50 proportion.
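
The same proportions can be cross-checked with a pandas one-liner on the loaded dataframe:

# An equivalent cross-check of the class proportions using value_counts
for col in ["material", "infill_pattern"]:
    print(dataset[col].value_counts(normalize=True))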

eda.visualize_class_distribution()

Hierarchical class distribution. Although each class of each categorical feature has a 50-50 proportion overall, the histogram above shows that the distribution of infill_pattern within each material is not balanced.
eda.check_categorical_class_distribution()
material abs pla
infill_pattern grid honeycomb grid honeycomb
layer_height 15 10 10 15
wall_thickness 15 10 10 15
infill_density 15 10 10 15
nozzle_temperature 15 10 10 15
bed_temperature 15 10 10 15
print_speed 15 10 10 15
fan_speed 15 10 10 15
roughness 15 10 10 15
tension_strenght 15 10 10 15
elongation 15 10 10 15

Tabulation of the hierarchical class distribution, supporting the histogram above.

1.3.6 Correlation of Numeric Features with Targets

In terms of correlation between features and targets, it was found that nozzle_temperature and layer_height had a moderately strong correlation with printing elongation. As nozzle_temperature increased, elongation tended to decrease; in contrast, as layer_height increased, elongation also tended to increase. Furthermore, as layer_height increased, so did the printing roughness. Lastly, as nozzle_temperature decreased, the printing tension strength tended to increase.

eda.check_corr_of_features_and_targets()
Target Feature Pearson (r) Remarks
17 elongation nozzle_temperature -0.527447 Moderately Strong Correlation
18 elongation bed_temperature -0.300871 Weak Correlation
20 elongation fan_speed -0.300871 Weak Correlation
19 elongation print_speed -0.234052 Weak Correlation
16 elongation infill_density 0.159009 Very Weak Correlation
15 elongation wall_thickness 0.176364 Very Weak Correlation
14 elongation layer_height 0.507583 Moderately Strong Correlation
1 roughness wall_thickness -0.226987 Weak Correlation
2 roughness infill_density 0.118389 Very Weak Correlation
5 roughness print_speed 0.121066 Very Weak Correlation
4 roughness bed_temperature 0.192142 Very Weak Correlation
6 roughness fan_speed 0.192142 Very Weak Correlation
3 roughness nozzle_temperature 0.348611 Weak Correlation
0 roughness layer_height 0.801341 Very Strong Correlation
10 tension_strenght nozzle_temperature -0.405908 Moderately Strong Correlation
12 tension_strenght print_speed -0.264590 Weak Correlation
11 tension_strenght bed_temperature -0.252883 Weak Correlation
13 tension_strenght fan_speed -0.252883 Weak Correlation
7 tension_strenght layer_height 0.338230 Weak Correlation
9 tension_strenght infill_density 0.358464 Weak Correlation
8 tension_strenght wall_thickness 0.399849 Weak Correlation

Correlation table between features and targets

eda.check_corr_of_features_and_targets(only_strong=True)
Target Feature Pearson (r) Remarks
17 elongation nozzle_temperature -0.527447 Moderately Strong Correlation
14 elongation layer_height 0.507583 Moderately Strong Correlation
0 roughness layer_height 0.801341 Very Strong Correlation
10 tension_strenght nozzle_temperature -0.405908 Moderately Strong Correlation

Correlation table between features and targets, showing only the pairs with moderately strong correlations and above.
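
Note that pandas .corr() reports only the coefficient. For a single feature-target pair, scipy.stats.pearsonr also returns a p-value against the null hypothesis of zero correlation, as this sketch shows for the strongest pair above:

# pearsonr returns both the coefficient and a p-value, which .corr() omits
r, pval = stats.pearsonr(dataset["layer_height"], dataset["roughness"])
print(f"layer_height vs roughness: r={r:.3f}, p-value={pval:.4g}")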

1.3.7 Association of Categorical Features with Targets

Upon checking the association of the categorical features with the target variables, it was found that neither material nor infill_pattern had a moderately strong association with any of the targets. Looking at the point biserial correlation coefficients, however, elongation, roughness, and tension strength each had a weak correlation with the material used.

eda.check_assoc_of_features_and_targets()
Target Feature Point Biserial (r) Remarks
4 elongation infill_pattern 0.046138 Very Weak Correlation
5 elongation material 0.394737 Weak Correlation
1 roughness material -0.233173 Weak Correlation
0 roughness infill_pattern -0.068340 Very Weak Correlation
2 tension_strenght infill_pattern 0.009054 Very Weak Correlation
3 tension_strenght material 0.289726 Weak Correlation

Association table between features and targets
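
The point biserial coefficient is mathematically equivalent to Pearson's r computed against a 0/1 encoding of the binary class, which the sketch below verifies on the material column:

# The point biserial correlation equals Pearson's r on a 0/1 encoding
# of the binary categorical variable.
material_01 = (dataset["material"] == "pla").astype(int)   # abs -> 0, pla -> 1
r_pb, _ = stats.pointbiserialr(material_01, dataset["elongation"])
r_pearson = np.corrcoef(material_01, dataset["elongation"])[0, 1]
print(f"point biserial: {r_pb:.4f}, pearson on 0/1: {r_pearson:.4f}")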

1.3.8 Multicollinearity Among Numerical Features

Upon checking for collinearity among the features, it was found that bed_temperature and fan_speed were perfectly correlated. Furthermore, nozzle_temperature was strongly and equally correlated with both bed_temperature and fan_speed, since those two are perfectly correlated with each other.

eda.check_feature_multicorr()

Correlation heatmap among the numerical features.
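
A complementary, more quantitative check is the variance inflation factor (VIF). The sketch below assumes statsmodels is installed (it is not used elsewhere in this section); perfectly correlated columns such as bed_temperature and fan_speed produce an infinite VIF.

# A complementary multicollinearity check via variance inflation factors.
# Assumes statsmodels is available; a VIF above roughly 5-10 usually signals
# problematic collinearity, and perfectly correlated columns give inf.
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

X = sm.add_constant(eda.numerical_predictors)
for i, col in enumerate(X.columns):
    if col != "const":
        print(f"{col}: VIF = {variance_inflation_factor(X.values, i):.2f}")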

1.4 Conclusion

In this section, we explored the 3D Printer dataset using both univariate and bivariate EDA methods. Our exploration revealed that the dataset is relatively small, simple, and clean, requiring minimal data preprocessing aside from feature scaling and categorical encoding. In the next section, we will preprocess this dataset and use it in a series of linear models to quantify the impact of each machine setting on printing quality. Stay tuned for our exciting findings!
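
Before we go, here is a tiny, illustrative preview of that preprocessing step: the sketch below combines a StandardScaler for the numerical features with a OneHotEncoder for the categorical ones. It is a sketch only; the actual pipeline is developed in the next section.

# A minimal, illustrative preview of the preprocessing described above
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

preprocess = ColumnTransformer(transformers=[
    ("num", StandardScaler(), eda.numerical_predictors.columns.tolist()),
    ("cat", OneHotEncoder(drop="first"), eda.categorical_predictors.columns.tolist()),
])
X_processed = preprocess.fit_transform(eda.features)
print(X_processed.shape)   # (n_samples, n_numeric + n_one_hot_columns)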