Sizing Sequences with SQL and ClickHouse

Recently, I was planning a data extraction strategy from an API and the goal was to schedule the frequency of data extraction to avoid cached responses, but also to be within rate limits. In order to have data for analysis, I have collected API response every minute for 3 hours. It resulted in 185 files (3 full hours plus several minutes more) with a total size of 6.15 MB saved in JSON New Lines format (and compressed with gzip) and 55746 records. As an analysis tool, I used clickhouse-local. This utility helps us to run SQL against local files without setting up a database, creating tables, and loading data.

Rill Stage 2-4: Data Goes Into ClickHouse

In our journey with data about streams, we did ad hoc analysis with Linux command-line tools, PySpark, and PostgreSQL (powered by TimescaleDB). Those are capable tools that enable analytics in various scenarios: when only Linux command line is available or when PostgreSQL compatibility is a requirement (then TimescaleDB is a good choice) or when queries should scale easily to hundreds of machines, then PySpark shines. But these tools come with their drawbacks. Since source data is stored in quite many GZIP compressed JSON files, it brings some challenges. In case of PySpark, initial read (and schema inference) of these files takes some time (and will take more when the number of files increases).

Rill Treasure Hunt: CIS Twitch Oscar 2020

Recently streamer mokrivskiy announced event CIS Twitch Oscar 2020 (all in Russian and with Russian-speaking streamers), which includes multiple nominations, including a nomination called “Breakthrough of the year”. As far as I understood, Twitch viewers and streamers proposed nominees for the contest. Before the event I was thinking about how to classify Twitch channels into various categories, for example, rising stars, declining, stable. Category “Rising stars” and nomination “Breakthrough of the year” sound similar to me, so I looked at twelve nominees to see how growing Twitch channels look like. In this post I will try to jump in into an opportunity to analyse these channels and try to prioritize speed of analysis delivery over building data pipelines and managing infrastructure. The goal is to look at nominated channels through multiple angles such as hours streamed and viewed, followers, and viewers.

Rill Stage 2-2. Double Dataframe I: PySpark

Let’s continue our ad hoc data analysis journey with the next tool: Apache Spark and in particular PySpark. In the previous post we used Linux command-line tools to perform a data analysis, which is a hard way for people who do not spend most of their time in terminal. PySpark should be much easier to understand for people who use SQL and Python for data analysis. We will use the same questions as previously about the number of streams per day/month, the number of games per day/month, most popular games and genres. In our setup we will use a Docker container provided by Jupyter (called pyspark-notebook) and run Spark in local mode (and write code in Jupyter notebook).

Rill Stage 2-1: Ways of command-line data analysis

So far our Rill journey comprised API exploration and building ingestion pipelines for Twitch and Giantbomb APIs. Next thing to do with data is to analyze it. In this part we will answer some questions about the downloaded data with help of Linux command-line tools: zcat, zgrep, sort, uniq, tr, cut, jq, awk, GNU Parallel.

Rill Stage 1-99: Data collector Scheduling

For the last two years I have been fetching data from Twitch API using StreamSets Data Collector and over the course of these two years Twitch API pipelines were scheduled in various ways: JDBC Query Consumer, Cron script, Orchestration pipelines.

Redshift lessons from re:Invent 2019

On December 2-6 2019 AWS organized conference called AWS re:Invent 2019 during which many announcements from AWS were made and AWS users shared their experiences with AWS platform. Recently I use Amazon Redshift extensively for Data warehousing and was particularly interested in learning more about Redshift tips and best practices. I have watched 12 videos that are related to Redshift (some featured repeated announcements). My focus was not on new features that are in preview, but more on data modelling, query optimization, and data platform architecture (for example, usage of Redshift Spectrum or data ingestion). AWS Big Data Blog already published an article Amazon Redshift at re:Invent 2019 that lists and introduces relevant Amazon Redshift sessions.

TIL: Background Processes and Docker containers

Recently I have set up Apache Airflow in Docker containers. Apart from the container with Apache Airflow backend database server (running PostgreSQL), there are two containers running Apache Airflow Webserver and Apache Airflow Scheduler. The first problem that was encountered is the inability to view task logs through Web UI (while Apache Airflow is configured with LocalExecutor). There is a relevant issue on github repo puckel/docker-airflow. The solution was to attach a volume to each container so that in each container Airflow logs directory is pointed at the same place on the host machine. The second problem is clean up of old logs.

Rill Meets StreamSets

In the previous post we have explored Twitch API using Elixir programming language. We have done our exploration in order to plan how to build a process that acquires data from Twitch API. Data acquisition problem is a common problem in Data analysis and Business intelligence. In data warehousing there is a process called ETL (Extract, Transform, Load), which represents how data flows from source systems to destinations. One way to acquire data is to write custom code for each source (bringing challenges of maintenance, flexibility, reliability). The other way is to use one of systems that were built to solve the data acquisition problem.

Rill Stage 1

The goal of the project Rill is to collect data about online streams from Twitch (and, possibly, other streaming platforms) for further analysis. Set up Twitch client ID according to: The process to obtain data about streams for a particular user looks like this: Find user’s username (e.g., from a Twitch URL) Make a request to Twitch API to convert username to stream id. Make a request to Twitch API to obtain data about user’s stream (is there a live steam, is there a recording being played) In stage 1 we will write a simple functions to explore Twitch API.