Exploratory Data Analysis on Steam Store Games Dataset
Introduction
As a casual gamer myself, I’m always interested in any development related to computer games or gaming hardware. Ironically, I only recently registered a Steam account, which makes me feel like I’ve been living under a rock. Since I didn’t know much about Steam at all, I thought I’d take this opportunity to see what Steam has to offer by exploring the Steam Store Games dataset.
Data preparation
This dataset is published by Nik Davis on Kaggle. It includes over 27,000 games scraped from Steam, and Nik has done a fantastic job cleaning it up for data analysis.
The dataset was last updated in May 2019, so I removed the games that were released in 2019 to prevent them from skewing the year-to-year comparison. Since I’m only interested in Windows games (sorry, Mac and Linux gamers) in a language that I can understand, I also removed any non-English and non-Windows games from the dataset.
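A minimal sketch of the filtering described above, using only the Python standard library. The field names (release_date, english, platforms), the value formats, and the sample rows are all assumptions made for illustration, so check them against the actual CSV before reusing this.

```python
# Hypothetical sample rows; field names and formats are assumed, not taken from the article.
games = [
    {"name": "Game A", "release_date": "2018-03-01", "english": "1", "platforms": "windows;mac"},
    {"name": "Game B", "release_date": "2019-02-10", "english": "1", "platforms": "windows"},
    {"name": "Game C", "release_date": "2016-07-15", "english": "0", "platforms": "windows;linux"},
    {"name": "Game D", "release_date": "2015-11-20", "english": "1", "platforms": "mac;linux"},
]

def keep(game):
    """Keep English-language Windows games released before 2019."""
    year = int(game["release_date"][:4])       # assumes ISO dates (YYYY-MM-DD)
    platforms = game["platforms"].split(";")   # assumes ';'-separated platform names
    return year < 2019 and game["english"] == "1" and "windows" in platforms

filtered = [g for g in games if keep(g)]
print([g["name"] for g in filtered])
```

Only "Game A" survives here: the other rows fail on release year, language, and platform respectively.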
The player rating information for each game is stored as positive rating count and negative rating count in the dataset. So I put them together as the total rating count and then calculated the positive rate (positive ratings/total ratings), which I use as a standard measure for game quality.
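The positive rate is simple arithmetic, sketched below. Returning None for a game with no ratings at all is my own assumption about how to handle that edge case; the original analysis may have treated it differently.

```python
def positive_rate(positive, negative):
    """Return positive / (positive + negative), or None when a game has no ratings."""
    total = positive + negative
    if total == 0:
        return None  # assumption: leave unrated games without a score instead of dividing by zero
    return positive / total

# A game with 900 positive and 100 negative ratings has a positive rate of 900 / 1000.
print(positive_rate(900, 100))
```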
In terms of the player count (owners) for each game, the dataset provides a standard range instead of an exact number. So I took the lower bound of that range as a more comparable measure.
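Taking the lower bound of an owners range could look like the sketch below. It assumes the range is stored as a string of the form "low-high"; that format and the function name are illustrative assumptions.

```python
def owners_lower_bound(owners_range):
    """Parse a range string such as '10000000-20000000' and return its lower bound."""
    lower, _upper = owners_range.split("-")  # assumes exactly one '-' between the two bounds
    return int(lower)

print(owners_lower_bound("10000000-20000000"))
```

The lower bound is a conservative estimate, and because every game in the same bucket gets the same value, it produces many ties when games are ranked by player base.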
The last change I had to make to the dataset was to parse the category information for a more accurate tally, because Steam allows multiple category descriptions for each game.
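One way to parse a multi-valued category field is to split it into a list of labels. The sketch below assumes the categories are joined with semicolons in a single string; the delimiter and the function name are assumptions.

```python
def parse_categories(raw):
    """Split a ';'-separated category string into a list of clean labels."""
    return [label.strip() for label in raw.split(";") if label.strip()]

print(parse_categories("Single-player;Multi-player;Steam Achievements"))
```

With each game carrying a list of categories, every label can be counted once per game rather than treating the whole string as a single category.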
With these preparation steps out of the way, let’s dig in.
How many games were released each year on Steam?
This was the very first question that popped into my head when I started playing with the dataset. Once I had it plotted out, I realized that this could be the perfect opportunity to take a walk down memory lane for a brief history of Steam.
For those of you who are not familiar with Steam, it is a digital game distribution platform created by Valve. Speaking of Valve, I have to mention the game that put them on the world stage: “Half-Life”, which was released in 1998. Valve launched Steam in 2003 as an update and patch distribution platform for their games. In 2005, Valve started forming publisher partnerships, allowing games from other developers to be distributed via Steam. This change paved the way for the Steam that we know and love today, and we can see healthy growth in the number of games released on Steam in the years that followed. In 2012, Steam Greenlight was launched to attract more game developers and engage players by letting them vote on whether a game would be published on Steam. Steam Greenlight is largely credited as the reason for the exponential growth of Steam.
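The per-year tally behind this question can be sketched with a counter keyed by release year. The dates below are made up for illustration, and the ISO date format is an assumption.

```python
from collections import Counter

# Hypothetical release dates in an assumed YYYY-MM-DD format.
release_dates = ["2003-09-12", "2012-08-30", "2012-11-06", "2017-06-13", "2017-01-05", "2017-10-20"]

# Count how many games were released in each year.
games_per_year = Counter(int(d[:4]) for d in release_dates)
for year in sorted(games_per_year):
    print(year, games_per_year[year])
```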
Did the increased number of games affect the quality of the games?
When Steam Greenlight was replaced by Steam Direct in 2017, players worried that the quality of the games on Steam would go down without their input. It was a reasonable concern given the huge number of games rushed to Steam in recent years. So I plotted the median positive rate of games published in each year as an indicator of game quality over time.
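A sketch of the median positive rate per release year, on made-up numbers. The median is a reasonable choice here because a handful of extremely well or poorly rated games will not drag it around the way they would a mean.

```python
from collections import defaultdict
from statistics import median

# Hypothetical (release_year, positive_rate) pairs.
rated_games = [(2016, 0.80), (2016, 0.60), (2016, 0.70), (2017, 0.50), (2017, 0.75)]

# Group the positive rates by release year.
rates_by_year = defaultdict(list)
for year, rate in rated_games:
    rates_by_year[year].append(rate)

# Take the median of each year's group.
median_rate = {year: median(rates) for year, rates in sorted(rates_by_year.items())}
print(median_rate)
```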
From the look of it, games released in 2018 generally received a higher percentage of positive ratings than those released the previous year, despite the increased number of games. So it seems that phasing out Steam Greenlight did not hurt the quality of the games. Interestingly, when Steam Greenlight was introduced in 2012, the positive rate dropped for two consecutive years, likely because players were overwhelmed by the sudden exposure to a large number of games with varying quality standards. Also, the positive rate took a nosedive from 2004 to 2005, which might have been the push behind the publisher partnerships, where Steam opened its doors to diversify and enrich the game library.
Overall, the positive rate seemed to decline over time, which suggests that the player base is less impressed with the quality of the increasingly large number of game choices at their fingertips. But one could ask: were games getting worse, or were players’ expectations getting higher?
Were games getting worse, or were expectations getting higher?
To answer this question, let’s examine the total player base and the average playtime per game over the years.
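Here is one way these two yearly measures could be computed. I am assuming that the total player base is the sum of each game's owners lower bound, and that the average playtime per game is the mean of the per-game playtime figures, both grouped by release year. The rows are made up.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows: (release_year, owners_lower_bound, average_playtime).
rows = [
    (2012, 1_000_000, 600),
    (2012, 500_000, 200),
    (2013, 2_000_000, 900),
    (2013, 20_000, 100),
]

owners_by_year = defaultdict(int)
playtimes_by_year = defaultdict(list)
for year, owners, playtime in rows:
    owners_by_year[year] += owners            # total player base for games released that year
    playtimes_by_year[year].append(playtime)  # collected so the mean can be taken per year

for year in sorted(owners_by_year):
    print(year, owners_by_year[year], mean(playtimes_by_year[year]))
```

Note that summing owners across games counts a player once for every game they own, so this measures game ownership rather than unique players.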
The growth of the player base in the past decade is undoubtedly impressive. With more casual gamers joining the hardcore gamers, and with more games to choose from, there’s no doubt that rising expectations make it harder for game developers to keep up.
The year 2013 seemed to be a turning point because both player base and their average playtime per game showed a clear decline afterward. From our previous discussion, we know that this is also where the Steam game library started to explode. So it seems that the ballooned game library failed to attract more players or playtime from them. It clearly says something about the game quality in general.
The player base drop in 2018 is likely because the dataset was last updated in May 2019. So games released in 2018 didn’t have much time to accumulate their player base and playtime.
What also caught my attention here is the unreal spike of playtime in 2000. Similar but much smaller spikes also happened in 1998, 2004, and 2013. Let’s zoom in and see what happened in those years.
Which games were the greatest hits?
I plotted the games released in 1998, 2000, 2004, and 2013. The y-axis shows the average playtime per game, and the size of the dots shows the size of the player base for that particular game.
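Before plotting, the games behind each spike can be found by keeping only the spike years and sorting by average playtime. This is a sketch on made-up rows, not the code used for the chart.

```python
spike_years = {1998, 2000, 2004, 2013}

# Hypothetical rows with invented numbers: (name, release_year, average_playtime, owners_lower_bound).
games = [
    ("Game A", 2000, 17000, 10_000_000),
    ("Game B", 2000, 300, 5_000_000),
    ("Game C", 2013, 95000, 200_000),
    ("Game D", 2010, 1200, 1_000_000),
]

# Keep games from the spike years, with the longest average playtime first.
hits = sorted(
    (g for g in games if g[1] in spike_years),
    key=lambda g: g[2],
    reverse=True,
)
for name, year, playtime, owners in hits:
    print(year, name, playtime, owners)
```

In a scatter plot, the third field would map to the y-axis and the fourth to the dot size.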
Now the answer is clear. Those spikes in playtime were driven by some of the greatest hits in gaming history.
“Half-Life” was the cornerstone that brought fame and wealth to Valve in the first place. “Counter-Strike” was the legendary first-person shooter (FPS) game that took the world by storm. That explains the unreal spike of playtime in 2000, and it’s no wonder that its sequel caused another spike in 2004. 2013 was an interesting year, as “The Banner Saga: Factions” recorded an unusually high playtime with a relatively small player base. The very same year also saw the release of “Dota 2”, which has the largest player base on Steam.
Amongst those greatest hits, is there a game that would claim the title of the most popular game in Steam’s history?
If we define a “popular game” as the game with the most players, I think we need to separate free games from paid ones, because price certainly influences how many players a game attracts. The top 10 games in both the free and paid categories have a lot of ties because I took the lower bound of owners as the standard player base indicator. Among those tied games, the percentage of positive ratings changed their positions significantly. In the end, I think any game that made it onto this chart is a winner and deserves a pat on the back.
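The ranking described above can be sketched as a sort on two keys: the owners lower bound first, then the positive rate to break ties. Treating a price of 0 as free is an assumption, and the rows are made up.

```python
# Hypothetical rows: (name, price, owners_lower_bound, positive_rate).
games = [
    ("Free A", 0.0, 10_000_000, 0.85),
    ("Free B", 0.0, 10_000_000, 0.92),
    ("Paid A", 9.99, 5_000_000, 0.70),
    ("Paid B", 19.99, 5_000_000, 0.95),
    ("Paid C", 4.99, 2_000_000, 0.99),
]

def top_games(rows, n=10):
    """Rank by owners (descending), breaking ties with positive rate (descending)."""
    return sorted(rows, key=lambda g: (g[2], g[3]), reverse=True)[:n]

top_free = top_games([g for g in games if g[1] == 0])  # assumption: price 0 means free
top_paid = top_games([g for g in games if g[1] > 0])
print([g[0] for g in top_free])
print([g[0] for g in top_paid])
```

Because the owners figure is bucketed, the positive rate ends up deciding the order within each bucket, which is why it moves games around so much.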
What are the popular game categories?
Since we can’t land on a most popular game, let’s see if we can shed some light on what the most popular game categories are.
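Counting categories across games could be sketched like this, again assuming semicolon-separated category strings. The sample strings are made up, although the labels are in the style Steam uses.

```python
from collections import Counter

# Hypothetical category strings, one per game.
category_strings = [
    "Single-player;Steam Achievements",
    "Single-player;Multi-player;Full controller support",
    "Multi-player;Steam Achievements;Single-player",
]

# Count each category once for every game that lists it.
category_counts = Counter(
    label for raw in category_strings for label in raw.split(";")
)
print(category_counts.most_common(3))
```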
It took me by surprise that the Single-player category took the first spot in this day and age. But it does make sense, as it’s the most effective way to engage a larger player base due to the convenience and accessibility of playing such games. Not to mention that single-player games may be much easier to implement from a developer perspective.
All those Steam features seem to be a successful marketing strategy that adds value and attraction for games. Controller support is also understandably popular, given that young gamers grew up in the era of Xbox and PlayStation. From that perspective, I would expect VR and AR support to climb up the ladder soon.
Next Steps
A dataset that only includes games distributed on a single platform does not represent the history and trends of the entire gaming industry. For example, my all-time favorite, Diablo II, is nowhere to be found in the dataset, because it was not available via Steam. So expanding the game dataset beyond Steam would be my first step going forward.
Also, I intentionally avoided diving into an analysis of free vs. paid games, because no game is truly free, and the industry has evolved toward business models so complicated that we can’t simply label a game as either free or paid anymore. For example, the “free to play” but “pay to win” model certainly blurs the line and makes it hard to track the true cost to players.
All in all, it was a fun journey to explore Steam’s offerings this way. I’m looking forward to an updated version of this dataset, so we can shed some light on the trends during the COVID-19 pandemic.