Please prove the claim that 90% of the world's data has been generated in the past two years, including hard statistics to back it up.
Thank you for your inquiry about big data trends over the last two years. Using regression analysis, I estimate that the last two years (2015 and 2016) account for about 80% of all data ever created, and the last three years (2014-2016) for about 91%. I was not able to confirm the claim that 90% of all data was produced in the last two years. I did find that this claim was made in 2013 at the latest, though, so perhaps the figure is out of date. Here’s my detailed analysis.
METHODOLOGY
To estimate the amount of data produced each year, I made use of an infographic, and I was able to trace the article from which the original claim originates. Surprisingly, most of the available information is in articles that repeat the same few facts and statistics about big data. Since this data is produced mostly by businesses, which are often unwilling to share much information about it, I was not able to find much source data.
I collected the data I did find in this spreadsheet and used it to produce a regression that projects data production for recent years.
DATA PRODUCTION BY YEAR
Most current articles about big data refer to the same basic statistics. The only statistics relevant to your question are those found in an infographic produced by vcloudnews.com and a statement by IBM that most of the other articles on this topic refer to.
Vcloudnews provides valuable statistics for four significant years in the history of data collection. From these values I can calculate an exponential regression with which to approximate the values for the last couple of years.
I used the Keisan Online Calculator to calculate an exponential regression for this data as follows:
GBPS=(.0006)(2.26)^year
My actual calculations used 50 significant digits, which are not shown here, to maintain maximum accuracy. These values are shown in the spreadsheet.
“GBPS” is the gigabytes per second in each year and “year” is the years since 1990.
This regression has a correlation coefficient of .93 (93% when expressed as a percentage), which indicates a close fit to the observed data, so it is a reasonable method for estimating the values for the years 1992-2016.
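As a rough illustration of how an exponential regression of this form can be computed, here is a minimal sketch that fits GBPS = a * b^year by ordinary least squares on the logarithm of the values. This is an assumption about the technique, not a description of how the Keisan calculator works internally, and the function name `fit_exponential` is hypothetical. The sample points are synthetic: they are generated from the equation above at arbitrary years and are not the infographic's values.

```python
import math

def fit_exponential(years, values):
    """Fit values ≈ a * b**year using least squares on log(values)."""
    n = len(years)
    logs = [math.log(v) for v in values]
    mean_x = sum(years) / n
    mean_y = sum(logs) / n
    # Slope and intercept of the best-fit line through (year, log(value)).
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, logs))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Undo the logarithm to recover a and b.
    return math.exp(intercept), math.exp(slope)

# Synthetic check only: points generated from the stated equation at
# arbitrary years since 1990 (assumed for illustration).
sample_years = [2, 7, 12, 23]
sample_values = [0.0006 * 2.26 ** y for y in sample_years]
a, b = fit_exponential(sample_years, sample_values)
print(a, b)
```

Because the synthetic points lie exactly on the curve, the fit should recover the coefficients of the equation above up to floating-point error. Real observations would not fit exactly.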
I used this equation to project values for each year from 1992 to 2016 and converted them into exabytes (EB) per year. I then summed all of these values and, separately, the values for 2015 and 2016. Finally, I divided the 2015-2016 sum by the total to find the approximate percentage.
My projection is that 80% of all data was produced in the last two years, not 90%. I also calculated the value for the last three years (2014-2016) for comparison, and that percentage is 91%. You can observe each of my calculations in the spreadsheet provided.
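The calculation described above can be sketched in a few lines. This is a minimal reconstruction that uses the rounded coefficients from the equation rather than the 50-digit values in the spreadsheet. It assumes a 365-day year and decimal units (1 exabyte = 10^9 gigabytes), and the function names are hypothetical. The unit conversion cancels out of the final ratio, so these assumptions do not affect the percentages.

```python
A = 0.0006          # coefficient from the regression equation
B = 2.26            # yearly growth factor from the regression equation
SECONDS_PER_YEAR = 365 * 24 * 60 * 60   # assumption: leap years ignored
GB_PER_EXABYTE = 1e9                     # assumption: decimal units

def exabytes_in_year(calendar_year):
    """Projected data produced in one calendar year, in exabytes."""
    gbps = A * B ** (calendar_year - 1990)   # "year" is years since 1990
    return gbps * SECONDS_PER_YEAR / GB_PER_EXABYTE

def share_of_recent_years(n_recent, first=1992, last=2016):
    """Fraction of the first..last total produced in the final n_recent years."""
    yearly = {y: exabytes_in_year(y) for y in range(first, last + 1)}
    total = sum(yearly.values())
    recent = sum(yearly[y] for y in range(last - n_recent + 1, last + 1))
    return recent / total

print(round(share_of_recent_years(2) * 100))  # 80
print(round(share_of_recent_years(3) * 100))  # 91
```

For a purely exponential curve, the share of the last n years is approximately 1 - 1/B^n, which is why the result depends almost entirely on the growth factor and not on the starting year.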
Considering that the source of the 90% claim is IBM, it is very likely to be accurate. It is essential to note that this claim was made in 2013 at the latest, since there are articles referring to this statistic in 2013. IBM still makes the same claim on its website, so perhaps IBM has simply neglected to update it. It's very possible that this 90% value was true in 2013 but is no longer true.
Without IBM's data, though, there is no way to prove it either way. I cannot confirm the claim that 90% was produced in the last two years, but it is not unreasonable, since I found percentages of 80% and 91% for the last two and three years respectively. 90% is between those two values, so if the actual values for those particular years are higher than I thought, or if the values for previous years are smaller, then it is very reasonable for the 90% claim to be true.
Statista also has more accurate year by year data available from which a better regression could possibly be produced to confirm the 90% claim. You must be a member of their service to obtain this data though.
POSSIBLE CONTINUED RESEARCH
These might be some great follow-up questions for which more data is available: How much has the data produced by certain companies increased? Facebook, Instagram, Google, and Amazon would be great companies to research regarding data. How much has a certain type of data increased? Text, picture, video, surveillance, usable, and non-usable data would be interesting categories to look into further.
SUMMARY
Overall, the 90% claim is reasonable considering the very clear exponential pattern that data production is currently following, but it may well be outdated since it was first stated in 2013 at the latest. I found that the last two to three years are likely to account for 80-91% of the total data ever produced. This is in the same ballpark as 90%, but does not confirm it. More data about the production of consumer data is available through services like Statista, but only for members, so I was not able to use that data in my predictions.
Thanks for using Wonder! Please let us know if we can help with anything else!