Posts with the tag rill:

Rill Stage 2-4: Data Goes Into ClickHouse

In our journey with data about streams, we did ad hoc analysis with Linux command-line tools, PySpark, and PostgreSQL (powered by TimescaleDB). Each of these capable tools shines in a particular scenario: Linux command-line tools when nothing else is available, TimescaleDB when PostgreSQL compatibility is a requirement, and PySpark when queries should scale easily to hundreds of machines. But these tools come with drawbacks. The source data is stored in a large number of GZIP-compressed JSON files, which brings some challenges: in the case of PySpark, the initial read (and schema inference) of these files takes noticeable time, and it will take longer as the number of files grows.
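As a rough illustration of that schema-inference cost, here is a minimal PySpark sketch. The file path and field names are hypothetical, not the actual Rill dataset layout; the point is that supplying an explicit schema lets Spark skip decompressing and scanning every file before the first query.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import (
    StructType, StructField, StringType, IntegerType, TimestampType
)

spark = SparkSession.builder.appName("rill-adhoc").getOrCreate()

# Without an explicit schema, Spark must decompress and scan the GZIP JSON
# files just to infer column types, which gets slower as files pile up:
# inferred = spark.read.json("streams/*.json.gz")

# Hypothetical schema for illustration; the real field names may differ.
schema = StructType([
    StructField("user_name", StringType()),
    StructField("game_id", StringType()),
    StructField("viewer_count", IntegerType()),
    StructField("started_at", TimestampType()),
])

# With the schema supplied, the read is lazy and no inference pass is needed.
streams = spark.read.json("streams/*.json.gz", schema=schema)
streams.groupBy("game_id").count().show()
```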

Rill Treasure Hunt: CIS Twitch Oscar 2020

Recently, streamer mokrivskiy announced the CIS Twitch Oscar 2020 event (held in Russian, with Russian-speaking streamers), which includes multiple nominations, among them one called “Breakthrough of the Year”. As far as I understood, Twitch viewers and streamers proposed the nominees for the contest. Before the event I had been thinking about how to classify Twitch channels into categories such as rising stars, declining, and stable. The “rising stars” category and the “Breakthrough of the Year” nomination sound similar to me, so I looked at the twelve nominees to see what growing Twitch channels look like. In this post I will seize the opportunity to analyse these channels, prioritizing speed of analysis delivery over building data pipelines and managing infrastructure. The goal is to look at the nominated channels from multiple angles, such as hours streamed and watched, followers, and viewers.

Rill Stage 2-2. Double Dataframe I: PySpark

Let’s continue our ad hoc data analysis journey with the next tool: Apache Spark, and in particular PySpark. In the previous post we used Linux command-line tools to perform the analysis, which is a hard path for people who do not spend most of their time in a terminal. PySpark should be much easier to follow for people who use SQL and Python for data analysis. We will answer the same questions as before: the number of streams per day and month, the number of games per day and month, and the most popular games and genres. Our setup uses the pyspark-notebook Docker image provided by the Jupyter project, runs Spark in local mode, and keeps the code in a Jupyter notebook.
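For reference, a minimal sketch of that setup, with an illustrative container command and port mapping; the session options are the usual local-mode defaults, not necessarily the exact ones used in the post:

```python
# Launch the container first (illustrative command):
#   docker run -p 8888:8888 -v "$PWD":/home/jovyan/work jupyter/pyspark-notebook
# Then, inside a notebook, build a local-mode Spark session:
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[*]")          # use all local cores, no cluster needed
    .appName("rill-stage-2-2")
    .getOrCreate()
)
print(spark.version)
```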

Rill Stage 2-1: Ways of command-line data analysis

So far our Rill journey has comprised API exploration and building ingestion pipelines for the Twitch and Giantbomb APIs. The next thing to do with the data is to analyze it. In this part we will answer some questions about the downloaded data with the help of Linux command-line tools: zcat, zgrep, sort, uniq, tr, cut, jq, awk, and GNU Parallel.
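As a taste of the kind of counting these tools do, here is a rough Python equivalent of a zcat | jq | sort | uniq -c pipeline; the directory layout and the game_name field are assumptions for illustration, not the actual dataset schema:

```python
import gzip
import json
from collections import Counter
from pathlib import Path

# Rough Python equivalent of:
#   zcat streams/*.json.gz | jq -r '.game_name' | sort | uniq -c | sort -rn
games = Counter()
for path in Path("streams").glob("*.json.gz"):
    with gzip.open(path, "rt") as f:
        for line in f:                      # one JSON record per line
            record = json.loads(line)
            games[record.get("game_name")] += 1

for game, count in games.most_common(10):
    print(count, game)
```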

Rill Stage 1-99: Data Collector Scheduling

For the last two years I have been fetching data from the Twitch API using StreamSets Data Collector, and over that time the Twitch API pipelines have been scheduled in various ways: with the JDBC Query Consumer, with a cron script, and with orchestration pipelines.

Rill Meets StreamSets

In the previous post we explored the Twitch API using the Elixir programming language. We did that exploration in order to plan how to build a process that acquires data from the Twitch API. Data acquisition is a common problem in data analysis and business intelligence. In data warehousing there is a process called ETL (Extract, Transform, Load), which represents how data flows from source systems to destinations. One way to acquire data is to write custom code for each source, which brings challenges of maintenance, flexibility, and reliability; the other is to use one of the systems built to solve the data acquisition problem.