One of the most common mistakes people make when analysing or interpreting data is to assume they know what they are seeing, rather than checking what the data really says.
Suppose you have a lot of PR and communications data, a vast cloud of data points, and you can see there might be a pattern in it. Naturally, you want to understand the causes behind the pattern, but there are so many factors within the data that it is difficult to tell for sure what is driving the trend. Don’t panic: some straightforward mathematics called regression analysis can help.
At its simplest, regression analysis is a tool to determine the strength of a correlation within a dataset between one variable (called the dependent variable) and an array of other variables that may influence it (called the independent variables). The independent variables are often external factors such as the local temperature or the timing of a competitor’s promotion.
This is done by plotting a separate graph for each independent variable, with the dependent variable on the Y axis and the independent variable on the X axis, and then finding a line of best fit. If the dataset is small, this is often the last step, since any outliers can be identified and removed and the line of best fit can be judged by eye. With a larger dataset, where this is impractical, regression analysis can be used to find the mean squared error of the set, which can be thought of as the amount by which the data deviates from what the line of best fit suggests.
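To make the line of best fit concrete, here is a minimal sketch in Python using NumPy’s polyfit. The figures are invented purely for illustration, and the variable names (temperature, mentions) are assumptions rather than anything from a real campaign.

```python
import numpy as np

# Made-up illustrative figures, not real campaign data.
temperature = np.array([12.0, 15.0, 18.0, 21.0, 24.0, 27.0])  # independent variable (X)
mentions = np.array([30.0, 36.0, 41.0, 49.0, 53.0, 61.0])     # dependent variable (Y)

# A degree-1 polynomial fit is the least-squares straight line Y = m*X + c.
m, c = np.polyfit(temperature, mentions, 1)

# Y values the line predicts for each observed X.
predicted = m * temperature + c

print(f"slope m = {m:.3f}, intercept c = {c:.3f}")
```

The slope m says how much Y changes for each unit of X, and c is where the line crosses the Y axis.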
This leads to the formula Y=mX+c+E, where Y=mX+c represents the line of best fit and E is the error, the gap between an actual data point and the line. The mean squared error is calculated as follows. For each data point, measure the distance between the actual Y value and the one predicted by the line of best fit. Then find the mean of the squares of all of these distances to get the Mean Squared Error (MSE). The lower the MSE, the more closely the points follow the line and the stronger the correlation between the dependent and independent variables. A very high MSE implies a loose correlation at best, and a very low one implies that the correlation is strong and worth pursuing.
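The MSE calculation described above can be sketched in a few lines of Python. This is an illustration under assumptions: the helper name mean_squared_error and the two small datasets are made up for the example, and the line of best fit comes from NumPy’s polyfit.

```python
import numpy as np

def mean_squared_error(x, y):
    """Fit Y = m*X + c by least squares and return the MSE around that line."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    m, c = np.polyfit(x, y, 1)
    # E for each point: actual Y minus the Y predicted by the line.
    errors = y - (m * x + c)
    # Mean of the squared errors.
    return float(np.mean(errors ** 2))

# Invented example data: one set hugs a line, the other is scattered.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y_tight = [2.1, 3.9, 6.0, 8.1, 9.9]
y_loose = [4.0, 1.0, 9.0, 5.0, 11.0]

print("tight:", mean_squared_error(x, y_tight))
print("loose:", mean_squared_error(x, y_loose))
```

The scattered set should give a much larger MSE than the tight one, which is the signal the article is describing.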
This formula won’t tell you which, if any, of the independent variables actually influences the dependent variable, but it will show how well they match each other. Regression analysis cannot interpret the data. It can’t pick out coincidental trends from true ones, or account for outliers or bad data, so the data should be cleaned beforehand, either by a human or by a program specifically designed to do so. However, it is a very quick method that can evaluate a lot of data efficiently, providing a general ranking of which factors align best with the trends seen in the dependent variable, ready for additional human interpretation.
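As a rough sketch of that ranking step, the Python below fits each factor against the same dependent variable and sorts the factors by MSE, lowest first. The function names (fit_mse, rank_factors), the factor names and the numbers are all hypothetical, and the data is assumed to have been cleaned already.

```python
import numpy as np

def fit_mse(x, y):
    """MSE of y around its least-squares line against x."""
    m, c = np.polyfit(x, y, 1)
    return float(np.mean((y - (m * x + c)) ** 2))

def rank_factors(dependent, factors):
    """Return (factor name, MSE) pairs sorted from best fit to worst."""
    y = np.asarray(dependent, dtype=float)
    scores = {
        name: fit_mse(np.asarray(values, dtype=float), y)
        for name, values in factors.items()
    }
    return sorted(scores.items(), key=lambda item: item[1])

# Invented figures for illustration only.
coverage = [10.0, 14.0, 19.0, 25.0, 30.0, 34.0]
factors = {
    "press_releases": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "local_temperature": [18.0, 11.0, 20.0, 14.0, 17.0, 12.0],
}

ranking = rank_factors(coverage, factors)
for name, mse in ranking:
    print(f"{name}: MSE = {mse:.2f}")
```

Because every factor is scored against the same dependent variable, the MSE values are in the same units and can be compared directly. The ranking is only a shortlist for a human to interpret, not proof of cause.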
MSE scores must be viewed against each other, because a single score means little on its own. It is rare for one factor to be the sole cause of a trend, which makes it worthwhile to find sets of independent variables with relatively low MSEs, then combine them to get a more accurate view of the root cause of the observed trends. For instance, BMI and the number of cigarettes smoked daily will both correlate with cholesterol levels, and also with each other. A combined measure can be calculated from the line of best fit between BMI and smoking and then compared with the cholesterol levels, giving a more accurate picture and helping to show which of the two is more influential.
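One common way to combine factors is multiple linear regression, which fits several independent variables at once. The sketch below swaps that technique in as an illustration; it is not necessarily the exact combined statistic described above. It uses NumPy’s linalg.lstsq, and the BMI, cigarette and cholesterol figures are invented, not real health data.

```python
import numpy as np

# Invented illustrative values, not real measurements.
bmi = np.array([21.0, 24.0, 27.0, 30.0, 33.0, 26.0, 29.0, 23.0])
cigarettes = np.array([0.0, 5.0, 10.0, 20.0, 15.0, 0.0, 25.0, 10.0])
cholesterol = np.array([170.0, 185.0, 205.0, 240.0, 235.0, 190.0, 245.0, 195.0])

def mse_of_fit(columns, y):
    """Least-squares fit of y on the given columns plus an intercept."""
    design = np.column_stack(columns + [np.ones_like(y)])
    coeffs, *_ = np.linalg.lstsq(design, y, rcond=None)
    errors = y - design @ coeffs
    return float(np.mean(errors ** 2)), coeffs

mse_bmi, _ = mse_of_fit([bmi], cholesterol)
mse_cig, _ = mse_of_fit([cigarettes], cholesterol)
mse_both, coeffs = mse_of_fit([bmi, cigarettes], cholesterol)

print(f"BMI only:        MSE = {mse_bmi:.2f}")
print(f"Cigarettes only: MSE = {mse_cig:.2f}")
print(f"Both combined:   MSE = {mse_both:.2f}")
```

The combined fit can never have a higher MSE than either single-factor fit on the same data, so the interesting question is how much it improves. The fitted coefficients are in different units (per BMI point versus per cigarette), so they would need standardising before being read as a measure of which factor is more influential.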
If you want to learn more about regression analysis for PR and communications data or how you can improve your use of data in public relations and communications, then get in touch for a quick chat.
Listen to the audio and subscribe to the PR Futurist podcast.