Spark

Posts with the tag Spark:

Rill Treasure Hunt: CIS Twitch Oscar 2020

2021-03-14 3211 words 16 minutes Spark rill ClickHouse TimescaleDB Twitch

Recently streamer mokrivskiy announced event CIS Twitch Oscar 2020 (all in Russian and with Russian-speaking streamers), which includes multiple nominations, including a nomination called “Breakthrough of the year”. As far as I understood, Twitch viewers and streamers proposed nominees for the contest. Before the event I was thinking about how to classify Twitch channels into various categories, for example, rising stars, declining, stable. Category “Rising stars” and nomination “Breakthrough of the year” sound similar to me, so I looked at twelve nominees to see how growing Twitch channels look like. In this post I will try to jump in into an opportunity to analyse these channels and try to prioritize speed of analysis delivery over building data pipelines and managing infrastructure. The goal is to look at nominated channels through multiple angles such as hours streamed and viewed, followers, and viewers.

Rill Stage 2-2. Double Dataframe I: PySpark

2021-02-05 2612 words 13 minutes Spark rill Jupyter

Let’s continue our ad hoc data analysis journey with the next tool: Apache Spark and in particular PySpark. In the previous post we used Linux command-line tools to perform a data analysis, which is a hard way for people who do not spend most of their time in terminal. PySpark should be much easier to understand for people who use SQL and Python for data analysis. We will use the same questions as previously about the number of streams per day/month, the number of games per day/month, most popular games and genres. In our setup we will use a Docker container provided by Jupyter (called pyspark-notebook) and run Spark in local mode (and write code in Jupyter notebook).