Rill Stage 2-4: Data Goes Into ClickHouse

2021-10-06 2920 words 14 minutes ClickHouse rill Python

In our journey with data about streams, we did ad hoc analysis with Linux command-line tools, PySpark, and PostgreSQL (powered by TimescaleDB). These are capable tools that cover different scenarios: command-line tools work when nothing but a Linux shell is available, TimescaleDB is a good choice when PostgreSQL compatibility is a requirement, and PySpark shines when queries should scale easily to hundreds of machines. But these tools come with drawbacks. The source data is stored in many GZIP-compressed JSON files, which brings some challenges. With PySpark, the initial read (and schema inference) of these files takes some time, and it will take longer as the number of files grows.
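To make the cost of schema inference concrete, here is a minimal standard-library sketch. It is not how PySpark is implemented; it only illustrates that learning the set of fields requires decompressing and parsing every file. It assumes each file holds one JSON object per line, and the file names and fields (`stream_id`, `viewers`, `title`) are made up for the illustration.

```python
import gzip
import json
import tempfile
from pathlib import Path


def infer_schema(paths):
    """Return {field name: set of Python type names} seen across all records."""
    schema = {}
    for path in paths:
        # Every file is decompressed and parsed in full before the set of
        # fields is known, so the work grows with the number of files.
        with gzip.open(path, "rt", encoding="utf-8") as fh:
            for line in fh:
                if not line.strip():
                    continue
                record = json.loads(line)
                for key, value in record.items():
                    schema.setdefault(key, set()).add(type(value).__name__)
    return schema


def write_sample_files(directory):
    """Write two tiny gzip-compressed JSON-lines files (hypothetical data)."""
    samples = {
        "part-0.json.gz": [{"stream_id": 1, "viewers": 10}],
        "part-1.json.gz": [{"stream_id": 2, "viewers": 12, "title": "demo"}],
    }
    paths = []
    for name, records in samples.items():
        path = Path(directory) / name
        with gzip.open(path, "wt", encoding="utf-8") as fh:
            for record in records:
                fh.write(json.dumps(record) + "\n")
        paths.append(path)
    return paths


with tempfile.TemporaryDirectory() as tmp:
    schema = infer_schema(write_sample_files(tmp))
```

Note that the `title` field appears only in the second file, so a reader that stopped after the first file would miss it. That is why inference has to touch all of the data, or accept the risk of an incomplete schema.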