Python: Exploratory Data Analysis of Red Wine Dataset

Karuna Rathore
Oct 29, 2024
Updated: Nov 18, 2024

I am thrilled to share my first EDA project, which evaluates the quality of red wine based on its chemical properties and a sensory attribute (Quality).

EDA is about understanding the data at hand. It includes cleaning the data and understanding the relationships among variables.

📊 Steps Involved:

A) Data collection and Overview

The dataset covers red variants of the Portuguese "Vinho Verde" wine. The CSV file was obtained from Kaggle. It contains 12 columns and 1599 samples of red wine: 11 input variables and one output feature, "Quality". Quality is scored on a scale from 0 to 10; the higher the value, the better the red wine sample.

I imported the dataset CSV into Python using Pandas.
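A minimal sketch of the loading step. The file name `winequality-red.csv` is my assumption about the Kaggle download, and the three inline rows are only an illustrative stand-in so the snippet runs without the file.

```python
import io

import pandas as pd

# With the real download you would call something like
# pd.read_csv("winequality-red.csv")  (file name is an assumption).
# Here a tiny inline sample with a subset of the columns stands in for it.
sample_csv = io.StringIO(
    "fixed acidity,volatile acidity,citric acid,alcohol,quality\n"
    "7.4,0.70,0.00,9.4,5\n"
    "7.8,0.88,0.00,9.8,5\n"
    "11.2,0.28,0.56,9.8,6\n"
)
df = pd.read_csv(sample_csv)

print(df.shape)   # (number of rows, number of columns)
print(df.head())  # first few rows
df.info()         # column dtypes and non-null counts
```

On the full file, `df.shape` is where the 1599 rows and 12 columns would show up.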

B) Data Cleaning

The dataset has no missing values.

It has 240 duplicate entries, so I dropped those 240 records before further analysis, leaving 1359 samples.
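The cleaning checks can be sketched as follows. The small DataFrame is made up for illustration (it contains one deliberate duplicate row); on the real data the same calls would be applied to the loaded dataset.

```python
import pandas as pd

# Made-up rows for illustration; row 3 is an exact copy of row 0.
df = pd.DataFrame({
    "alcohol": [9.4, 9.8, 9.8, 9.4, 10.5],
    "pH": [3.51, 3.20, 3.26, 3.51, 3.30],
    "quality": [5, 5, 6, 5, 7],
})

# Count missing values per column.
missing_per_column = df.isnull().sum()

# Count rows that repeat an earlier row exactly.
n_duplicates = int(df.duplicated().sum())

# Keep the first occurrence of each row and renumber the index.
df_clean = df.drop_duplicates().reset_index(drop=True)

print(missing_per_column)
print("duplicates:", n_duplicates)
print("rows after cleaning:", len(df_clean))
```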

C) EDA

Conducted a detailed statistical analysis along with summary statistics.

Quality Insights: The dataset contains 6 distinct quality ratings: 5, 6, 7, 4, 8, and 3. The higher the number, the better the quality of the sample. Quality ranges from 3 to 8, with an average of 5.64.

Visualized key relationships between the input variables and wine quality using a heatmap, bar graph, histograms, pair plot, scatter plots, and box plot. Identified outliers using Matplotlib and Seaborn. The insights are as follows:
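A short sketch of how such summary figures can be obtained with Pandas. The ratings below are invented for illustration, so the printed numbers will not match the real dataset.

```python
import pandas as pd

# Invented quality ratings, for illustration only.
quality = pd.Series([5, 6, 5, 7, 6, 5, 4, 8, 3, 6], name="quality")

distinct_ratings = quality.unique()  # ratings in order of first appearance
counts = quality.value_counts()      # how many samples per rating
summary = quality.describe()         # count, mean, std, min, quartiles, max

print(distinct_ratings)
print(counts)
print(summary)
```

For a whole DataFrame, `df.describe()` gives the same summary for every numeric column at once.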

🎯Correlation Insights- Heat Map🗺

Correlation shows how strongly one variable is linearly related to another in the dataset. In the heatmap, brighter colors indicate positive correlation and darker colors indicate negative correlation.
Input variables such as fixed acidity, citric acid, and alcohol are positively correlated with quality, while volatile acidity, total sulfur dioxide, and pH are negatively correlated with it.
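A sketch of a correlation heatmap, assuming synthetic data in place of the wine file. The data is generated so that quality rises with alcohol and falls with volatile acidity; the `coolwarm` colormap is my choice here, and which colors mean positive or negative depends on the colormap used.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in data (not the real wine measurements).
rng = np.random.default_rng(0)
n = 200
alcohol = rng.normal(10.4, 1.0, n)
volatile_acidity = rng.normal(0.53, 0.18, n)
quality = (
    5.6
    + 0.4 * (alcohol - 10.4)
    - 1.5 * (volatile_acidity - 0.53)
    + rng.normal(0, 0.5, n)
)
df = pd.DataFrame({
    "alcohol": alcohol,
    "volatile acidity": volatile_acidity,
    "quality": quality,
})

# Pairwise Pearson correlations between all numeric columns.
corr = df.corr()

fig, ax = plt.subplots(figsize=(5, 4))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1, ax=ax)
ax.set_title("Correlation heatmap (synthetic data)")
fig.tight_layout()

# Correlation of each input with quality, strongest positive first.
quality_corr = corr["quality"].drop("quality").sort_values(ascending=False)
print(quality_corr)
```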

🎯Imbalanced Data Set: Bar Graph📊

The bar graph shows that the dataset is imbalanced: most of the wines are rated either 5 or 6. These average-quality wines make up the bulk of the dataset.
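One way to check the imbalance numerically and draw the bar graph. The ratings are invented so that 5 and 6 dominate, roughly as described above.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Invented ratings where 5 and 6 dominate, for illustration only.
quality = pd.Series([5] * 10 + [6] * 9 + [7] * 3 + [4, 8, 3], name="quality")

# Samples per rating, ordered by rating, and each rating's share of the total.
counts = quality.value_counts().sort_index()
share = counts / counts.sum()

fig, ax = plt.subplots()
ax.bar(counts.index.astype(str), counts.to_numpy())
ax.set_xlabel("quality")
ax.set_ylabel("number of samples")
ax.set_title("Samples per quality rating (invented data)")

print(share.round(2))
print("share rated 5 or 6:", round(share.loc[[5, 6]].sum(), 2))
```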

🎯Probability Distribution: Histogram📊

The distribution of alcohol content is skewed to the right, with most samples concentrated between 9% and 12%. Higher alcohol content (above 12%) is less common in the dataset.
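Right skew can be read off the histogram and confirmed with the sample skewness, which is positive for a right-skewed distribution. The sketch below uses synthetic values drawn from a shifted gamma distribution, which is only my stand-in for the alcohol column.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic right-skewed values standing in for the alcohol column.
rng = np.random.default_rng(1)
alcohol = pd.Series(8.5 + rng.gamma(shape=2.0, scale=1.0, size=500), name="alcohol")

# Positive skewness means a longer tail on the right.
skewness = alcohol.skew()

fig, ax = plt.subplots()
ax.hist(alcohol, bins=20)
ax.set_xlabel("alcohol (%)")
ax.set_ylabel("number of samples")
ax.set_title("Alcohol distribution (synthetic data)")

print("skewness:", round(skewness, 2))
```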

🎯Pair Plot Insights📉

The pair plot surfaces interesting features of the dataset. It supports univariate, bivariate, and multivariate analysis by comparing each feature with every other. It shows strong linear relationships among some of the variables.

🎯Categorical Plot📉

A box plot was drawn with the output variable quality against the input variable alcohol. It shows a general positive relationship between alcohol content and quality.

Higher-quality wines (7 and 8) generally have higher alcohol content, while lower-quality wines (3 to 5) have lower alcohol content.

There are several outliers in the quality 5 category.
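A sketch of the box plot together with an explicit outlier check. The rule used is the common 1.5 × IQR convention, which is also the default whisker length in Matplotlib box plots; the values are invented, with one deliberately high alcohol value in the quality 5 group.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Invented values; 13.5 in the quality 5 group is a deliberate outlier.
df = pd.DataFrame({
    "quality": [5] * 8 + [7] * 6,
    "alcohol": [9.2, 9.4, 9.5, 9.5, 9.6, 9.8, 10.0, 13.5,
                11.0, 11.2, 11.5, 11.8, 12.0, 12.3],
})

fig, ax = plt.subplots()
sns.boxplot(data=df, x="quality", y="alcohol", ax=ax)
ax.set_title("Alcohol by quality (invented data)")

# Per-group quartiles, broadcast back to each row.
grouped = df.groupby("quality")["alcohol"]
q1 = grouped.transform(lambda s: s.quantile(0.25))
q3 = grouped.transform(lambda s: s.quantile(0.75))
iqr = q3 - q1

# Flag rows outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR] within their own group.
is_outlier = (df["alcohol"] < q1 - 1.5 * iqr) | (df["alcohol"] > q3 + 1.5 * iqr)
outliers = df[is_outlier]

medians = grouped.median()
print(medians)
print(outliers)
```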

🎯Scatterplot Insights📉

The scatter plot shows the relationship between alcohol content (x-axis) and pH level (y-axis), representing the alcohol percentage and acidity of each wine sample respectively. The color gradient represents the wine quality ratings.

Lower pH values indicate higher acidity, and higher values indicate lower acidity.

Alcohol content lies mostly between 9% and 14%. pH values cluster mostly between 3.0 and 3.6, showing that most wine samples are acidic. Higher-quality wines (darker points) appear scattered across the range of alcohol content and pH levels.
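A scatter plot colored by quality can be sketched with Matplotlib as follows. The data is random and only mimics the ranges mentioned above; the reversed colormap `viridis_r` is my assumption, picked so that higher ratings appear darker.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Random stand-in data within the ranges described in the text.
rng = np.random.default_rng(2)
n = 150
df = pd.DataFrame({
    "alcohol": rng.uniform(9.0, 14.0, n),
    "pH": rng.normal(3.3, 0.15, n),
    "quality": rng.integers(3, 9, n),  # integers 3..8 (upper bound excluded)
})

fig, ax = plt.subplots()
# c= maps each point's quality to a color; with viridis_r, higher is darker.
points = ax.scatter(df["alcohol"], df["pH"], c=df["quality"], cmap="viridis_r")
fig.colorbar(points, ax=ax, label="quality")
ax.set_xlabel("alcohol (%)")
ax.set_ylabel("pH")
ax.set_title("Alcohol vs pH colored by quality (random data)")
```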

Tools used: Pandas, Matplotlib, and Seaborn

Further Analysis: These insights set the stage for machine learning models that predict wine quality.

GitHub Link: https://lnkd.in/gxBXtn_e

Project File in HTML Format (linked with my GitHub): https://lnkd.in/gv8z5bEK

For a glimpse of the code, see the links above.
