Data Exploration
What is data exploration?
Data exploration is the initial step in data analysis, in which we use data visualization and statistical techniques to describe the characteristics of a dataset, such as its size, quantity, and accuracy, in order to better understand the nature of the data.
Key aspects of data exploration include:
Understanding the Data Structure:
Examining the types of data (e.g., numerical, categorical, datetime).
Checking for missing values and handling them appropriately.
Understanding the distribution of the data, including the mean, median, mode, variance, and standard deviation.
Data Visualization:
Creating visual representations of the data to uncover underlying patterns.
Using plots like histograms, bar charts, box plots, scatter plots, and heatmaps.
Visualizing correlations between variables using correlation matrices and pair plots.
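As a rough illustration of the structural checks listed above, here is a minimal sketch that inspects value types, counts missing values, and computes a few summary statistics. It uses only the Python standard library. The records, the field names, and the summarize helper are all invented for this example, and None is assumed to mark a missing value.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical toy dataset; None marks a missing value.
records = [
    {"age": 34, "city": "Lagos", "income": 52000.0},
    {"age": 29, "city": "Nairobi", "income": None},
    {"age": None, "city": "Lagos", "income": 61000.0},
    {"age": 41, "city": "Accra", "income": 48000.0},
]


def summarize(rows):
    """Return per-column type names, missing counts, and basic statistics."""
    summary = {}
    for column in rows[0]:
        values = [row[column] for row in rows]
        present = [v for v in values if v is not None]
        info = {
            "types": sorted({type(v).__name__ for v in present}),
            "missing": len(values) - len(present),
        }
        if present and all(isinstance(v, (int, float)) for v in present):
            # Numerical column: report measures of central tendency.
            info["mean"] = mean(present)
            info["median"] = median(present)
        elif present:
            # Categorical column: report the most frequent category.
            info["top"] = Counter(present).most_common(1)[0][0]
        summary[column] = info
    return summary


report = summarize(records)
for column, info in report.items():
    print(column, info)
```

On real datasets, a dataframe library is the more common tool for these checks; the sketch only shows what the checks amount to.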
Normal Distribution
The normal distribution is a theoretical model of how large samples of ratio- or interval-level data will look once they are plotted.
The measures of central tendency are the mean, median, and mode; in a normal distribution, all three are equal.
Mean: The average of the data.
Median: The middle value separating the higher half from the lower half of the data.
Mode: The most frequently occurring value in the data.
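The three measures can be computed with Python's built-in statistics module. The sketch below uses two made-up samples: a symmetric one, where the measures coincide, and the same sample with one extreme value added, where the mean moves but the median and mode do not.

```python
from statistics import mean, median, mode

# Hypothetical symmetric sample: the three measures coincide.
symmetric = [3, 4, 4, 5, 5, 5, 6, 6, 7]
print(mean(symmetric), median(symmetric), mode(symmetric))  # 5 5 5

# The same sample with one extreme value added.
with_outlier = symmetric + [50]
print(mean(with_outlier))    # 9.5, pulled toward the outlier
print(median(with_outlier))  # 5.0, unaffected
print(mode(with_outlier))    # 5, unaffected
```

Comparing the mean with the median in this way is a quick check for outliers or asymmetry during exploration.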
Standard Deviation
Standard deviations are used to measure how much variation exists in a distribution.
A low standard deviation means that the values are close to the mean. A high standard deviation means that the values are spread over a large range.
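To make the contrast concrete, the sketch below compares two invented samples that share the same mean but differ in spread. It assumes the population standard deviation (statistics.pstdev); for a sample drawn from a larger population, statistics.stdev would be the usual choice.

```python
from statistics import mean, pstdev

# Two hypothetical samples with the same mean (50) but different spread.
tight = [48, 49, 50, 51, 52]
spread = [10, 30, 50, 70, 90]

for name, data in (("tight", tight), ("spread", spread)):
    # pstdev is the square root of the mean squared deviation from the mean.
    print(name, "mean:", mean(data), "std dev:", round(pstdev(data), 2))
```

The first sample has a standard deviation of about 1.41 and the second about 28.28, even though both are centered on 50.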
To understand distributions better, it is important to understand modality, symmetry, and peakedness.
Modality
A distribution can have more than one peak. The number of peaks in a distribution determines its modality.
A normal distribution has only one peak (it is unimodal); however, it is possible to have distributions with two peaks (bimodal) or more (multimodal).
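One simple way to estimate modality is to count the local maxima in a frequency table. The sketch below does this under strong simplifying assumptions: the values are integers, the data are not noisy, and a peak is a value whose frequency is strictly higher than that of both neighbours (so flat-topped peaks are not counted). The count_peaks helper and both samples are hypothetical; real data would usually need binning or smoothing first.

```python
from collections import Counter


def count_peaks(values):
    """Count strict local maxima in the frequency table of integer values."""
    counts = Counter(values)
    lo, hi = min(values), max(values)
    # Frequency of every integer from lo to hi, padded with a zero at each end
    # so that the first and last values can also be peaks.
    freqs = [0] + [counts.get(v, 0) for v in range(lo, hi + 1)] + [0]
    peaks = 0
    for i in range(1, len(freqs) - 1):
        if freqs[i] > freqs[i - 1] and freqs[i] > freqs[i + 1]:
            peaks += 1
    return peaks


unimodal = [1, 2, 2, 3, 3, 3, 4, 4, 5]
bimodal = [1, 2, 2, 2, 3, 4, 5, 6, 6, 6, 7]

print(count_peaks(unimodal))  # 1
print(count_peaks(bimodal))   # 2
```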
Skewness
If the two halves of a distribution can be superimposed on each other, so that one is a mirror image of the other, the data is symmetrical.
However, data is not always symmetrical. If the peak is not in the center, one tail of the distribution will be longer than the other, meaning the distribution is skewed.
Skewness is a measure of the asymmetry of a distribution.
For a perfectly symmetrical distribution, the skewness is zero and the mean equals the median.
Positive skewness means most of the data lies to the left, with a longer tail on the right. Negative skewness means most of the data lies to the right, with a longer tail on the left.
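Skewness can be computed in several ways. The sketch below assumes one common definition, the population moment coefficient (the average of the cubed z-scores); statistical libraries may apply bias corrections and return slightly different numbers. The skewness helper and the three samples are made up for illustration.

```python
from statistics import mean, median, pstdev


def skewness(values):
    """Population skewness: the mean of the cubed z-scores."""
    mu = mean(values)
    sigma = pstdev(values)
    return sum(((x - mu) / sigma) ** 3 for x in values) / len(values)


symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 10]          # long tail on the right
left_skewed = [-x for x in right_skewed]    # mirror image: long tail on the left

for name, data in (
    ("symmetric", symmetric),
    ("right_skewed", right_skewed),
    ("left_skewed", left_skewed),
):
    print(name, "skewness:", round(skewness(data), 3),
          "mean:", round(mean(data), 3), "median:", median(data))
```

The symmetric sample scores approximately zero, the right-skewed sample scores positive with its mean above its median, and the left-skewed sample scores negative with its mean below its median.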
Kurtosis
Kurtosis measures whether the bell of the curve is normal, flat, or peaked.
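As with skewness, there is more than one convention for kurtosis. The sketch below assumes excess kurtosis, defined as the average of the fourth power of the z-scores minus 3, so that a normal distribution scores 0. Under this convention, flatter distributions score below zero and more sharply peaked, heavy-tailed ones score above zero. The excess_kurtosis helper and both samples are hypothetical.

```python
from statistics import mean, pstdev


def excess_kurtosis(values):
    """Population excess kurtosis: mean fourth-power z-score minus 3."""
    mu = mean(values)
    sigma = pstdev(values)
    fourth_moment = sum(((x - mu) / sigma) ** 4 for x in values) / len(values)
    return fourth_moment - 3


flat = list(range(1, 11))       # evenly spread values, no pronounced peak
peaked = [5] * 8 + [0, 10]      # most values at the center, two far out

print("flat:", round(excess_kurtosis(flat), 3))      # negative
print("peaked:", round(excess_kurtosis(peaked), 3))  # positive
```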