Media companies and advertisers rely on TV ratings every day to measure the success of TV shows, verify that audience size and composition are delivering against media-buy targets, and make good when the numbers come up short. From that point of view, TV ratings are metrics that measure the past, or at best the present, of TV viewing.
But media companies are also using ratings to predict the future. Ratings set expectations and affect programming decisions from one season to the next. They also help set advertising rates well in advance of when a campaign might actually air. In the U.S., for instance, TV networks sell the majority of their ad inventory for the season at the “upfront,” an annual event held between March and May. This means that the rate for the ads you’re seeing on TV today might have been negotiated more than a year ago.
To predict what a show’s rating might be three, six or 12 months out, researchers use forecasting models. Many of those models have been in service for years with little or no modification. They’ve been successful at predicting ratings and have done a great job of supporting the exchange of billions of advertising dollars each year. But rapid changes in the TV ecosystem are making it increasingly difficult to build reliable models.
Consider the list of recent technology innovations in the media industry: Viewers are increasingly using their laptops, tablets and smartphones to watch content; streaming services like Netflix and Amazon Prime have reached mass adoption; new TV-connected devices are reshaping the big-screen experience. People are time-shifting, streaming and binge-watching—they’re more in control of the media they consume than they’ve ever been. Their behavior is not only more complex, but more unpredictable as well.
At Nielsen, we have access to many data sources that measure how people consume media. Before adding digital TV data into the mix (as input as well as output of our forecasting models), we wanted to examine whether it was possible to first improve how we predicted ratings for traditional TV, using traditional TV data as our only source. Thanks to the Nielsen National People Meter, we have high-quality data that goes back many years, with consistent methodology and a robust panel of nationally representative viewers.
We tapped into this rich data at a very detailed level to create new predictive models. Key input variables included historical Live+7 ratings (ratings that include the live audience, plus viewing up to seven days after the initial broadcast), C3 ratings (commercial ratings that include playback up to three days after broadcast), HUT (the percentage of households using television at a given point in time), reach, household ratings, demographic ratings, day of week, hour of day, and the identity of the network. We then capitalized on machine learning and statistical algorithms (such as ridge regression, random forest and gradient boosting) to identify relevant relationships in the data.
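To make that setup concrete, here is a minimal sketch of how such a pipeline could be wired together in scikit-learn. The column names, data layout and parameter values are illustrative assumptions, not details of our production models; only the three algorithm families come from the text above.

```python
# Illustrative sketch only: hypothetical feature names and settings,
# not Nielsen's actual pipeline.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical input variables mirroring those described above
numeric_features = ["live7_rating_lag", "c3_rating_lag", "hut", "reach",
                    "household_rating_lag", "demo_rating_lag"]
categorical_features = ["network", "day_of_week", "hour_of_day"]

def build_model(estimator):
    """Pair a shared preprocessing step with one learning algorithm."""
    preprocess = ColumnTransformer([
        ("num", StandardScaler(), numeric_features),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features),
    ])
    return make_pipeline(preprocess, estimator)

# The three algorithm families named in the text
models = {
    "ridge": build_model(Ridge(alpha=1.0)),
    "random_forest": build_model(RandomForestRegressor(n_estimators=500)),
    "gradient_boosting": build_model(GradientBoostingRegressor()),
}

def fit_all(train_df: pd.DataFrame, target: str = "live7_rating"):
    """Fit each candidate model on historical ratings data."""
    X = train_df[numeric_features + categorical_features]
    y = train_df[target]
    return {name: model.fit(X, y) for name, model in models.items()}
```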
Working in cooperation with a client, we conducted a number of proof-of-concept studies to test and validate the models we created. We designed our models to predict future ratings at a granular level (hour-blocks for small demographic groups, like males ages 2-5 or females 65+), but we also rolled up those figures to the network level. To understand how our models performed against reality, we used a hold-out period of two quarters and compared both our forecasts and our client’s internal forecasts against actual ratings. For example, we accurately predicted an average Live+7 rating of 1.94 for persons 30-34 on Network A between 9 p.m. and 10 p.m. on Tuesdays during second-quarter 2015, based solely on historical data up to the first quarter of 2014.

Predictions were very accurate at the network level, where our R-squared (the percentage of variance explained) reached 99%, but the task was harder at the more granular hour-block level and for some of the smaller demographic groups. Even at the hour-block level, though, our model’s R-squared still topped 95% and significantly outperformed a model that our client had been relying on up to that point. Across more than 2,000 daytime projections, our forecasts were 41% more accurate as measured by R-squared and 16% more accurate as measured by weighted absolute percentage error (WAPE), two key measures of forecasting accuracy.
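For readers less familiar with these two accuracy measures, the short sketch below shows how R-squared and WAPE can be computed on a hold-out set. The numbers are made up for illustration; only the two metric definitions come from the text above.

```python
# A minimal sketch of the two accuracy measures; values are illustrative.
import numpy as np
from sklearn.metrics import r2_score

def wape(actual: np.ndarray, forecast: np.ndarray) -> float:
    """Weighted absolute percentage error: total absolute error
    divided by total actual ratings."""
    return np.abs(actual - forecast).sum() / np.abs(actual).sum()

# Hold-out comparison: forecasts vs. true ratings (hypothetical arrays)
actual = np.array([1.94, 0.87, 2.10, 1.02])
forecast = np.array([1.90, 0.95, 2.05, 1.10])

print(f"R-squared: {r2_score(actual, forecast):.3f}")
print(f"WAPE:      {wape(actual, forecast):.3%}")
```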
We’ll share more details about those proof-of-concept models and the tests we conducted in an upcoming paper. The key takeaway of this project is that we were able to convert big and noisy behavioral data into predictive modeling features, and to do so in a very efficient (and automated) manner. But every fraction of a rating point carries enormous financial implications, and we need to keep pushing the envelope: adding new input variables (such as ad spend or program-specific data), building ways to quickly adapt to changes in programming packages and channel lineups, testing new regression and classification algorithms, or even combining multiple promising models into one.
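As one illustration of that last idea, the sketch below stacks the three model families from the earlier example into a single ensemble. This is a generic scikit-learn pattern under the assumption of a fully numeric feature matrix, not the specific combination method we might ultimately adopt.

```python
# Illustrative stacking ensemble; assumes X_train/y_train are numeric arrays.
from sklearn.ensemble import (GradientBoostingRegressor, RandomForestRegressor,
                              StackingRegressor)
from sklearn.linear_model import Ridge

ensemble = StackingRegressor(
    estimators=[
        ("ridge", Ridge(alpha=1.0)),
        ("rf", RandomForestRegressor(n_estimators=500)),
        ("gb", GradientBoostingRegressor()),
    ],
    # The final estimator blends the base models' out-of-fold predictions
    final_estimator=Ridge(),
)
# Usage: ensemble.fit(X_train, y_train); ensemble.predict(X_holdout)
```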
While this project focused on traditional TV, it’s worth noting that the impact of digital viewing is already reflected in historical TV ratings, and thus in our predictions as well. But this is an indirect measurement of a cumulative effect, and it is no substitute for a model focused specifically on over-the-top viewing, for instance, or viewing in a smartphone app. In addition to the next steps outlined above, digital data will be an important element in improving our forecasts.
In the end, we also need to recognize that each client has intimate knowledge of its programs, as well as strong intuition about how those programs might be received in the future. That “human element” should not be ignored when we put together predictive models, and it can be especially valuable when reacting to significant and unforeseen changes in the marketplace. A system that integrates rich data, powerful machine learning algorithms and domain expertise can achieve better results than any of those ingredients could deliver on its own.