Object Oriented Programming for Data Science

Data science is a hot topic today. We live in a world where more and more data is collected every day, and there is high demand to understand and analyse it. This work is done in Python, JavaScript, Java, Scala, Julia, R, MATLAB, C++ and other languages.

Two programming paradigms are commonly used to write and structure this kind of code. The first is functional programming, understood here loosely as code organised into standalone functions and scripts, and the second is object oriented programming (OOP). The functional style is the more common one in data science, since many data scientists come from backgrounds outside traditional computer science or software engineering. In addition, most data science courses focus on mathematics, data analysis and visualisation, and machine learning models rather than on software design. For these reasons, data scientists often find it hard to follow the principles of OOP.

The goal of this project is to apply object oriented programming to a data science task in Python. It uses the Loan Prediction with 3 Problem Statement dataset from Kaggle, which comes as three CSV files: training data, test data and test targets. The task is treated as a classification problem: predicting whether a loan applicant is eligible for a loan or not. Because OOP is the chosen paradigm, the code is organised into four classes:

ExploratoryDataAnalysis Class
PreProcessing Class
Processing Class
MachineLearning Class

The code developed for the project can be viewed here.

A screenshot of the code can be seen in the figure below.

The first class, ExploratoryDataAnalysis, extracts information about the shape, data types, column names and missing values of the dataset.
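As a rough illustration of what such a class might look like, here is a minimal sketch. The method names and the toy columns are assumptions made for this example and are not taken from the project's repository.

```python
import pandas as pd


class ExploratoryDataAnalysis:
    """Hypothetical sketch: reports basic structural facts about a DataFrame."""

    def __init__(self, data: pd.DataFrame):
        self.data = data

    def shape(self):
        # (number of rows, number of columns)
        return self.data.shape

    def data_types(self):
        # pandas dtype of each column
        return self.data.dtypes

    def column_names(self):
        return list(self.data.columns)

    def missing_values(self):
        # count of null entries per column
        return self.data.isnull().sum()


# Toy data invented for illustration; the real project reads the Kaggle CSV files.
toy = pd.DataFrame({
    "Gender": ["Male", None, "Female"],
    "LoanAmount": [120.0, None, 66.0],
    "Loan_Status": ["Y", "N", "Y"],
})

eda = ExploratoryDataAnalysis(toy)
print(eda.shape())
print(eda.column_names())
print(eda.missing_values())
```

Keeping each summary in its own method means the same object can be reused on the training and test files without repeating the pandas calls.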

The second class, PreProcessing, wraps pandas functions in methods that drop columns, fill null values and encode specific columns of the dataset.
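A minimal sketch of such a class could look like the following. The method names, the choice of integer category codes for encoding, and the toy columns are all assumptions for illustration; the project's actual methods may differ.

```python
import pandas as pd


class PreProcessing:
    """Hypothetical sketch: thin wrappers around pandas cleaning functions."""

    def __init__(self, data: pd.DataFrame):
        # Work on a copy so the caller's DataFrame is left untouched
        self.data = data.copy()

    def drop(self, columns):
        self.data = self.data.drop(columns=columns)
        return self

    def fill_null(self, column, value):
        self.data[column] = self.data[column].fillna(value)
        return self

    def encode(self, column):
        # Replace each category with an integer code (categories in sorted order)
        self.data[column] = self.data[column].astype("category").cat.codes
        return self


# Toy data invented for illustration
toy = pd.DataFrame({
    "Loan_ID": ["A1", "A2", "A3"],
    "Gender": ["Male", None, "Female"],
})

pre = PreProcessing(toy)
pre.drop(["Loan_ID"]).fill_null("Gender", "Male").encode("Gender")
print(pre.data)
```

Returning self from each method is a design choice made here so the calls can be chained; it is not required by the approach.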

The third class, Processing, applies the above methods to both the training and test sets.
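One way to apply the same cleaning steps to both sets is sketched below. This is an assumption about how such a class could be written, not the project's code: in particular, learning the fill value and the encoding from the training set and reusing them on the test set is a choice made for this example, and the toy columns are invented.

```python
import pandas as pd


class Processing:
    """Hypothetical sketch: runs identical cleaning steps on train and test data."""

    def __init__(self, train: pd.DataFrame, test: pd.DataFrame):
        self.train = train.copy()
        self.test = test.copy()

    def drop(self, columns):
        self.train = self.train.drop(columns=columns)
        self.test = self.test.drop(columns=columns)
        return self

    def fill_null(self, column):
        # Most frequent training value, reused for the test set
        fill_value = self.train[column].mode()[0]
        self.train[column] = self.train[column].fillna(fill_value)
        self.test[column] = self.test[column].fillna(fill_value)
        return self

    def encode(self, column):
        # Build the category-to-integer mapping from the training set only
        categories = sorted(self.train[column].dropna().unique())
        mapping = {category: code for code, category in enumerate(categories)}
        self.train[column] = self.train[column].map(mapping)
        self.test[column] = self.test[column].map(mapping)
        return self


# Toy data invented for illustration
train = pd.DataFrame({"Loan_ID": ["A1", "A2", "A3", "A4"],
                      "Gender": ["Male", "Male", None, "Female"]})
test = pd.DataFrame({"Loan_ID": ["B1", "B2"],
                     "Gender": [None, "Female"]})

proc = Processing(train, test)
proc.drop(["Loan_ID"]).fill_null("Gender").encode("Gender")
print(proc.train)
print(proc.test)
```

Sharing one mapping between the two sets keeps a given category on the same integer code in both, which the classifiers rely on.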

The final class, MachineLearning, uses several classification algorithms from scikit-learn. These include Nearest Neighbours, Linear SVM, Logistic Regression, Random Forest and Naïve Bayes, among others.
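The sketch below shows how a class of this kind might train and compare a handful of scikit-learn classifiers. The class layout, the hyperparameters, the use of accuracy as the metric and the synthetic data are all assumptions for illustration; the project itself works on the processed loan data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC


class MachineLearning:
    """Hypothetical sketch: fits several classifiers and reports test accuracy."""

    def __init__(self, X_train, y_train, X_test, y_test):
        self.X_train, self.y_train = X_train, y_train
        self.X_test, self.y_test = X_test, y_test
        # Hyperparameters are illustrative, not tuned
        self.classifiers = {
            "Nearest Neighbours": KNeighborsClassifier(n_neighbors=3),
            "Linear SVM": SVC(kernel="linear"),
            "Logistic Regression": LogisticRegression(max_iter=1000),
            "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
            "Naive Bayes": GaussianNB(),
        }

    def evaluate(self):
        # Fit every classifier on the training set and score it on the test set
        scores = {}
        for name, clf in self.classifiers.items():
            clf.fit(self.X_train, self.y_train)
            predictions = clf.predict(self.X_test)
            scores[name] = accuracy_score(self.y_test, predictions)
        return scores


# Synthetic stand-in for the processed loan data
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

ml = MachineLearning(X_train, y_train, X_test, y_test)
scores = ml.evaluate()
for name, accuracy in scores.items():
    print(f"{name}: {accuracy:.2f}")
```

Holding the classifiers in a dictionary makes it straightforward to add or remove an algorithm without touching the evaluation loop.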

Again, the code can be found in the OOP-Data-Science repository on my GitHub.