1 W5 — Descriptive Analysis Techniques in Data Science Bianca R S Sousa Southern States University… | by Dru Macasieb

Sep 11, 2023

Bianca R S Sousa
Southern States University
IT 531 | Data Analytics
Dr. Dru Macasieb
August 20, 2023

1, 2 — Techniques in descriptive analysis — Definition, formula and example

Measures of Central Tendency (Mean, Median, Mode)

Measures of central tendency describe the center, or typical value, of a dataset. They show where the bulk of the data is concentrated. The three main measures are the mean, the median, and the mode.

● Mean: Sum of all values divided by the number of values.
● Median: Middle value in an ordered dataset.
● Mode: Most frequently occurring value.

Example: In data science, the mean, median, and mode help summarize data. For example, the mean salary of a company's employees gives a typical salary, and comparing it with the median shows whether a few very high salaries pull the average up.

Measures of Dispersion (Range, Variance, Standard Deviation)

Measures of dispersion quantify the spread or variability of the data points in a dataset. They show how closely or widely the values are distributed around the center. The three main measures are the range, the variance, and the standard deviation.

● Range: Difference between the maximum and minimum values.
● Variance: Average of the squared differences from the mean.
● Standard Deviation: Square root of the variance.

Example: In data science, the standard deviation is used to understand how much individual data points deviate from the mean, indicating the level of variability.

Five-Number Summary

The Five-Number Summary and the Box Plot are tools commonly used in descriptive statistics to summarize and visually represent the distribution of a dataset. They provide insights into the central tendency, spread, and skewness of the data.

The Five-Number Summary consists of five key values that give a concise summary of the distribution:

● Minimum: The smallest value in the dataset.
● First Quartile (Q1): The median of the lower half of the dataset.
● Median (Q2): The middle value when the dataset is arranged in ascending order.
● Third Quartile (Q3): The median of the upper half of the dataset.
● Maximum: The largest value in the dataset.

The Five-Number Summary gives an overview of the data's range, central tendency, and spread. It is particularly useful for spotting skewness and possible outliers.

Example: A real-life example of a Five-Number Summary could be the heights of students in a class:

● Minimum: The shortest student's height.
● Q1 (First Quartile): The height below which 25% of the students' heights fall.
● Median (Second Quartile): The middle height that separates taller and shorter students.
● Q3 (Third Quartile): The height below which 75% of the students' heights lie.
● Maximum: The tallest student's height.

Box Plot

A Box Plot is a graphical representation of the Five-Number Summary. It displays the distribution of a dataset using a rectangular "box" and "whiskers." The components of a box plot include:

● A rectangular box: The box represents the interquartile range (IQR), which is the range between the first quartile (Q1) and the third quartile (Q3). The length of the box indicates the spread of the middle 50% of the data.
● Whiskers: The whiskers extend from the edges of the box to the smallest and largest values that lie within a certain range, typically 1.5 times the IQR beyond Q1 and Q3. Each whisker is therefore at most 1.5 times the IQR long.
● Outliers: Individual data points outside the whiskers are considered outliers and are plotted individually.
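The five values listed above can be computed with a few lines of standard-library Python. The following is a minimal sketch, not a reference implementation: it follows the median-of-halves convention from the list (for an odd number of values the middle value is left out of both halves), and the function name `five_number_summary` and the student heights are made up for illustration. Other tools may use slightly different quartile conventions and return slightly different Q1 and Q3 values.

```python
from statistics import median


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a non-empty list of numbers.

    Q1 and Q3 are the medians of the lower and upper halves of the sorted
    data. With an odd number of values, the middle value belongs to neither
    half. This is one common convention among several.
    """
    data = sorted(values)
    n = len(data)
    if n == 0:
        raise ValueError("need at least one value")
    half = n // 2
    # With a single value both halves are empty, so fall back to the whole list.
    lower = data[:half] or data
    upper = data[n - half:] or data
    return (data[0], median(lower), median(data), median(upper), data[-1])


# Hypothetical student heights in centimeters (illustrative only).
heights_cm = [150, 152, 155, 158, 160, 163, 165, 170, 181]

minimum, q1, q2, q3, maximum = five_number_summary(heights_cm)
print(minimum, q1, q2, q3, maximum)  # 150 153.5 160 167.5 181
print("IQR:", q3 - q1)               # IQR: 14.0
```

The IQR printed at the end is the length of the box in a box plot of the same data.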
A box plot provides a visual way to identify the spread of the data, the presence of outliers, and the symmetry or skewness of the distribution. It is especially useful for comparing multiple datasets side by side.

In summary, the Five-Number Summary and the Box Plot are valuable tools for understanding a distribution in terms of its central tendency, variability, and outliers. They reveal the key characteristics of a dataset quickly, without inspecting every value.

Frequency Tables and Bar Graphs

A frequency table is a tabular representation of the number of occurrences of items or values in a dataset. It groups the data into categories, intervals, or bins and shows the count (frequency) of data points in each. Frequency tables are commonly used to summarize categorical or grouped numerical data.

Bar graphs, also known as bar charts, represent data with rectangular bars or columns. Each bar typically stands for a category or a range of values, and its length or height corresponds to the frequency or value it represents. Bar graphs are used to compare the frequencies, counts, or totals of different categories or groups, and they are especially useful for categorical data.

No specific formula, but you count occurrences for frequency tables.

Example: In data science, frequency tables and bar graphs help illustrate the distribution of categorical data. For instance, a bar graph can show the distribution of product preferences among customers.

Histograms

A histogram is a graphical representation of the distribution of a continuous or quantitative variable. It consists of a series of bars or bins, where each bin represents a range of values and the height of the bar corresponds to the frequency or relative frequency of data points within that range. Histograms show the shape of the underlying distribution and are especially useful for identifying patterns, central tendency, spread, and potential outliers.

No specific formula, but data is grouped into bins.

Example: In data science, a histogram can show the distribution of test scores in a class. It helps identify patterns, such as whether scores cluster around a certain range.

Scatter Plots

A scatter plot is a graphical representation of the relationship between two continuous variables. It consists of a two-dimensional grid on which each observation is drawn as a single point, positioned according to its values for the two variables. Scatter plots are valuable for identifying patterns, trends, clusters, and the strength of correlation between the two variables.

No specific formula, but data pairs are plotted.

Example: In data science, a scatter plot can illustrate the correlation between study hours and exam scores. It helps identify patterns, trends, and potential relationships between variables.

3 — Example of dataset in real-life

Wine Dataset

The Wine dataset is a classic dataset commonly used for classification and data analysis. It contains 13 attributes that describe the chemical composition of wines, and the goal is often to classify wines into one of three classes based on these attributes.
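Section 1, 2 notes that histograms and scatter plots have no specific formula beyond grouping values into bins and plotting pairs. Before these techniques are applied to the Wine dataset, here is a minimal sketch of the numbers behind both plots: the bin counts a histogram would draw, and the Pearson correlation coefficient often reported alongside a scatter plot. The function names, the bin width, and the study-hours and exam-score values are all assumptions made up for illustration, and Pearson's r is only one way to quantify the correlation the text mentions.

```python
from math import sqrt


def histogram_counts(values, bin_width, start):
    """Count values per bin; each bin covers [left, left + bin_width)."""
    counts = {}
    for v in values:
        k = int((v - start) // bin_width)      # index of the bin containing v
        left = start + k * bin_width           # left edge of that bin
        counts[left] = counts.get(left, 0) + 1
    return dict(sorted(counts.items()))


def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / sqrt(sxx * syy)


# Hypothetical study hours and exam scores (illustrative only).
study_hours = [1, 2, 3, 4, 5, 6]
exam_scores = [52, 58, 61, 70, 74, 83]

print(histogram_counts(exam_scores, bin_width=10, start=50))
# {50: 2, 60: 1, 70: 2, 80: 1}
print(round(pearson_r(study_hours, exam_scores), 2))
```

A value of r close to +1 corresponds to points that rise along a nearly straight line in the scatter plot. A value near 0 means no linear trend, although a non-linear pattern may still be visible in the plot.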
Here is how we can apply the techniques to the Wine dataset:

Five-Number Summary: Each of the 13 attributes in the dataset can have its own five-number summary, describing the minimum, first quartile, median, third quartile, and maximum values.

Box Plot: A box plot for each attribute can help visualize the distribution of values and identify potential outliers.

Histogram: Histograms for the attributes can show the frequency distribution of values and the shape of each distribution.

Scatter Plot: Since there are multiple attributes, you can create scatter plots between pairs of attributes to see whether any patterns or correlations emerge.

Bar Graph: If you are interested in the class distribution, you can create a bar graph to show the count of wines in each class.

4 — Report

The purpose of this report is to perform exploratory data analysis (EDA) on the Wine dataset using statistical and visualization techniques. The dataset contains 13 attributes describing the chemical composition of wines, with the goal of classifying wines into three classes. The techniques used are the Five-Number Summary, Box Plots, Histograms, Scatter Plots, and Bar Graphs.

Findings and Insights:

Five-Number Summary: The five-number summary provided a concise overview of the central tendency and dispersion of each attribute. It revealed the range of values, the median, and the quartiles. For instance, the alcohol content ranges from approximately 11 to 14, with a median around 13.

Box Plots: Box plots allowed us to visualize the spread and identify potential outliers in the data. Outliers were detected in attributes like magnesium and total_phenols. The shape of the distribution varied by attribute, with some symmetric and others skewed.

Histograms: Histograms depicted the frequency distribution of attribute values. They showed the shape of each distribution, such as whether it is normal, skewed, or bimodal. Attributes like flavonoids exhibited a bimodal distribution, indicating the presence of two distinct groups within the data.

Scatter Plots: Scatter plots were created for pairs of attributes to explore potential correlations and patterns. Attributes like alcohol and flavonoids showed a positive correlation, suggesting that wines with higher alcohol content tend to have higher flavonoid levels.

Bar Graphs: The bar graph illustrated the distribution of wine classes. It revealed that the dataset is relatively balanced among the three classes, which is helpful for building classification models.

Insights and Significance

● The attributes have very different ranges, which could affect the performance of machine learning algorithms that are sensitive to feature scale.
● The presence of outliers in certain attributes may need to be addressed during data preprocessing to ensure model robustness.
● Bimodal distributions suggest the existence of distinct subgroups within the dataset, which could impact the classification task.
● Correlations between attributes can provide valuable insights for feature selection and model building.
● The balanced distribution of classes is favorable for classification tasks, as it reduces the risk of model bias towards a particular class.

5 — Challenges

Data Incompleteness and Quality: One of the primary challenges was missing or incomplete data entries. Incomplete attributes could hinder the accuracy of analysis and modeling. Ensuring the quality of the data was also vital, as inaccuracies or outliers could lead to incorrect insights.

Outlier Detection and Handling: The outliers identified through box plots posed the question of how to treat them. Deciding whether an outlier was an erroneous data point or a valid extreme value that should be kept was crucial yet complex.
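The box plot rule used to flag outliers can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the procedure used for the report: the function name `split_outliers`, the multiplier `k=1.5`, the "inclusive" quartile method of `statistics.quantiles`, and the magnesium-like readings are all chosen for the example. The function only separates flagged values from the rest, so that each one can be reviewed rather than dropped automatically.

```python
from statistics import quantiles


def split_outliers(values, k=1.5):
    """Split values into (inliers, outliers) using the k * IQR rule.

    A value is flagged when it lies below Q1 - k * IQR or above Q3 + k * IQR.
    Quartiles come from statistics.quantiles with method="inclusive", which
    is one of several quartile conventions.
    """
    if len(values) < 2:
        raise ValueError("need at least two values to estimate quartiles")
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    inliers = [v for v in values if low <= v <= high]
    outliers = [v for v in values if v < low or v > high]
    return inliers, outliers


# Hypothetical magnesium-like readings (illustrative only, not from the dataset).
readings = [88, 92, 95, 98, 100, 101, 104, 107, 162]

kept, flagged = split_outliers(readings)
print(flagged)  # [162]
```

Whether a flagged value such as 162 is a measurement error or a valid extreme wine is a domain question that the rule itself cannot answer.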
Interpreting Multivariate Relationships: Understanding the relationships between multiple attributes simultaneously, especially in scatter plots, proved challenging. Deciding which correlations were meaningful and which were spurious demanded a thorough understanding of the domain.

Choosing Appropriate Visualizations: Selecting the right visualization technique for each attribute was a challenge. In some cases, histograms were insufficient to represent complex distributions, and alternative approaches had to be considered.

Dimensionality and Feature Selection: The dataset's dimensionality complicated feature selection. Determining which attributes were most relevant for classification while retaining meaningful information required careful consideration.

Handling Bimodal Distributions: Identifying and interpreting bimodal distributions in certain attributes required domain knowledge. The underlying reasons for these patterns were not obvious and demanded further investigation.

Balancing Exploration and Hypothesis Testing: Balancing the need for exploratory analysis with hypothesis testing was challenging. It was essential to derive insights from the data without making premature assumptions that could lead to biased conclusions.

Mitigation and Importance

● Address data incompleteness and quality issues through imputation and validation techniques, which is crucial for accurate analysis.
● Employ robust outlier detection algorithms and domain knowledge to make informed decisions about outlier handling.
● Use statistical methods to quantify and prioritize significant correlations among multiple attributes.
● Experiment with diverse visualization techniques and seek expert advice when dealing with complex distribution patterns.
● Use dimensionality reduction techniques to manage high-dimensional data and enhance model performance.
● Investigate bimodal distributions with the help of domain expertise or subject matter experts.
● Strike a balance between exploratory analysis and hypothesis testing to ensure comprehensive yet unbiased insights.

The challenges encountered during the exploratory data analysis of the Wine dataset underscore the complexity of real-world data exploration. Addressing them improves the accuracy of the insights and prepares the ground for meaningful feature engineering and model development. The understanding gained in the process contributes to the overall success of data-driven decision making.

References:

Faghihnejad, Ali. (November 24, 2021). Wine data set: A Classification Problem. Towards Data Science. https://towardsdatascience.com/wine-data-set-a-classification-problem-983efb3676c9

Surbhi, S. (September 1, 2017). Difference Between Histogram and Bar Graph. https://keydifferences.com/difference-between-histogram-and-bar-graph.html

Written by Dru Macasieb