Rill Treasure Hunt: CIS Twitch Oscar 2020
Recently the streamer mokrivskiy announced the CIS Twitch Oscar 2020 event (held entirely in Russian and featuring Russian-speaking streamers), which includes multiple nominations, among them one called "Breakthrough of the year". As far as I understood, Twitch viewers and streamers proposed the nominees for the contest. Before the event I had been thinking about how to classify Twitch channels into various categories, for example rising stars, declining, and stable. The category "Rising stars" and the nomination "Breakthrough of the year" sound similar to me, so I looked at the twelve nominees to see what growing Twitch channels look like. In this post I will take the opportunity to analyse these channels, prioritizing speed of analysis delivery over building data pipelines and managing infrastructure. The goal is to look at the nominated channels from multiple angles, such as hours streamed and viewed, followers, and viewers.
Rill Stage 2-2. Double Dataframe I: PySpark
Rill Stage 2-1: Ways of command-line data analysis
So far our Rill journey has comprised API exploration and building ingestion pipelines for the Twitch and Giantbomb APIs. The next thing to do with the data is to analyze it. In this part we will answer some questions about the downloaded data with the help of Linux command-line tools: zcat, zgrep, sort, uniq, tr, cut, jq, awk, and GNU Parallel.
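To give a flavour of this style of analysis, here is a minimal sketch combining a few of the tools listed above. The file name streams.json.gz and the field name game_name are illustrative assumptions, not taken from the post:

```bash
# Count the top 10 games in a gzipped newline-delimited JSON dump.
# streams.json.gz and the .game_name field are placeholders for illustration.
zcat streams.json.gz \
  | jq -r '.game_name' \
  | sort \
  | uniq -c \
  | sort -rn \
  | head -n 10
```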
Rill Stage 1-99: Data Collector Scheduling
For the last two years I have been fetching data from the Twitch API using StreamSets Data Collector, and over that time the Twitch API pipelines have been scheduled in various ways: JDBC Query Consumer, a cron script, and orchestration pipelines.
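As a rough illustration of the cron-script option, a crontab entry could kick off a Data Collector pipeline over its REST API. This is only a sketch under assumptions: the host, port, credentials, pipeline ID, and log path below are placeholders, not the actual setup from the post.

```bash
# Hypothetical crontab entry: start an SDC pipeline at the top of every hour
# via the Data Collector REST API (all values below are placeholders).
0 * * * * curl -s -u admin:admin -X POST -H "X-Requested-By: sdc" "http://sdc-host:18630/rest/v1/pipeline/twitch_streams/start" >> /var/log/twitch_pipeline.log 2>&1
```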