A scatter plot is a type of data visualization used to display the relationship between two quantitative variables. Each point on a scatter plot represents a pair of values, plotted along two Cartesian axes (horizontal and vertical). The position of each point is determined by the values of the variables being analyzed, with one variable defining the x-axis and the other the y-axis. Scatter plots are fundamental tools in exploratory data analysis, allowing for a quick visual examination of potential correlations, trends, and patterns within data.
Scatter plots are often used to visually estimate correlation between variables. Correlation quantifies the strength and direction of a linear relationship:
To calculate the correlation coefficient `r` for a sample dataset:
r = Σ [(x_i - x̄)(y_i - ȳ)] / √[Σ (x_i - x̄)² * Σ (y_i - ȳ)²]
Here, x_i and y_i represent individual sample values, and x̄ and ȳ are the sample means for the x and y variables, respectively.
A scatter plot can also support the identification of outliers, which are data points that deviate significantly from the overall pattern.
In scatter plots with a discernible trend, a line of best fit, or regression line, can be drawn to represent the relationship. The simplest form, linear regression, fits a straight line to the data points based on the following equation:
y = mx + b
Where `m` is the slope (rate of change in y for each unit increase in x), and `b` is the y-intercept (the value of y when x is zero).
For a dataset with n observations, the slope (m) and y-intercept (b) are calculated as:
These equations enable the line to minimize the vertical distances between the points and the line itself, providing the best linear approximation of the relationship.
Scatter plots are extensively used across scientific research, finance, engineering, and machine learning to visually inspect data before formal modeling. They are particularly valuable in Big Data and Data Science, where identifying relationships, trends, and anomalies can inform data cleaning, feature selection, and model development. Scatter plots also lay the groundwork for more sophisticated analyses by offering an intuitive first look at data distributions and relationships.