Rill Stage 2-1: Ways of command-line data analysis

Linux command-line tools in action

So far our Rill journey comprised API exploration and building ingestion pipelines for Twitch and Giantbomb APIs. Next thing to do with data is to analyze it. In this part we will answer some questions about the downloaded data with help of Linux command-line tools: zcat, zgrep, sort, uniq, tr, cut, jq, awk, GNU Parallel.

Downloaded files are organized in one file per hour of a day stored under folder that respresents hour, which is stored under folder that represents day. For example, path 2020_07_01/00/helix_streams_2020_07_01_00.json.gz contains:

  • file helix_streams_2020_07_01_00.json.gz: Gzip compressed JSON file (format: new-line delimited JSON) for hour 00 of date 2020_07_01 sourced from Twitch Helix API;
  • folder 00: hour of the day in 2-digits 24-hour format;
  • folder 2020_07_01: date in format YYYY_MM_DD.

Used hardware: Core i7-6700HQ, 32GB RAM DDR4 2133, Samsung SSD 860 EVO 1TB.

Number of streams per day / per month

First interesting question is how many unique streams happened per day and per month. For this ad-hoc analysis, we define Twitch stream as a unique value in field “ID”. Let’s start simple with using zcat to extract lines from compressed files, grep ID of streams, and then count number of unique stream IDs. The easiest way to analyze by a particular hour/day/month is to navigate to the respective directory and run some commands.

For July 2020 we have 12_971_288 total captured events for Twitch streams endpoint (Twitch Helix Streams dataset contains data only for streams that had at least 40 viewers).

Let’s count number of streams that happened on 1st July 2020:

1
2
3
4
5
6
zcat 2020_07_01/**/*.json.gz |
grep -oh '"id": "[[:digit:]]\+' | # filter strings where field "id" with numerical value is present
cut -f 4 -d'"' | # remove double quotes and take fourth column
sort -n | # sort output numerically (sort is needed for uniq)
uniq | # remove only duplicates. This will keep only one value of each string.
wc -l # count number of lines

There were 34_898 unique streams on 1st July 2020.

If zgrep is available, then zcat is not required and the command can start with:

1
zgrep -oh '"id": "[[:digit:]]\+' 2020_07_01/**/*.json.gz |

Counting number of streams for a month is done by replacing date with 2020_07_*:

1
zcat 2020_07_*/**/*.json.gz |

There were 939_209 unique streams in July 2020. Monthly calculation took 305 seconds, which feels long.

Measures of duration for command-line commands were taken using time command. On Manjaro Linux time (man time) command that shows real time (wall clock time) had to be installed with sudo pacman -Syy time. Running monthly count pipeline with /usr/bin/time -p ./num_streams.sh gave the following output:

1
2
3
4
5
$ /usr/bin/time -p ./num_streams.sh
939209
real 305.92
user 512.97
sys 47.52

One idea that comes to mind is to parallelize processing and there is a great tool to implement parallel processing called GNU Parallel.

1
2
3
4
5
find $PWD -path '*2020_07_**' -type f | # list all files (absolute path) that have '2020_07_' in path
parallel "zgrep -oH '\"id\": \"[[:digit:]]\+' {} | cut -f4 -d'\"' | sort | uniq" | # run pipeline with zgrep,cut,sort,uniq for as many files as possible in parallel 
sort -n | # perform numerical sort (this is required for uniq)
uniq | # remove only duplicates. This will keep only one value of each string.
wc -l # count number of lines

Benchmarking parallel streams counting script with time shows that parallel script takes almost half time of the original script.

1
2
3
4
5
/usr/bin/time -p ./num_streams.sh
939209
real 160.71
user 845.90
sys 100.70

Number of unique games per day / per month

In the previous part we have used zgrep with regular expression and cut to extract relevant data from JSON strings. Since we know that we work only with JSON strings then we can use tool called jq that makes working with JSON much more convenient. Also we used naive (and implicit) way of detecting dates - navigate to desired directory and run scripts. Now we will display date together with calculated result.

We will extract date from field captured_at, which has the following format “2020-07-01 00:00:02+00:00” (date separated from time by whitespace).

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
zcat /twitch/raw/helix/streams/2020_07_01/**/*.gz |
jq -r 'select(.game_id | length > 0) | [.game_id, .captured_at] | @csv' | # filter records where `game_id` is not empty.
tr -d '"' | # remove double quotes
cut -d' ' -f1 | # split by whitespace and take column 1. This will keep game_id and date part (without time) from `captured_at`.
sort -t',' -b -k 2,2 -k 1,1 | # sort input by date first and then by game_id.
uniq | # deduplicate input. This will result in only one unique occcurence of game_id per day.
cut -d',' -f 2 | # take only the second column (which is date)
uniq -c | # count number of unique occurences of date.  
awk '{print $2, $1}' | # swap first column (count of unique occurences) and second (date)
column --output-separator ' |' --table --table-columns Day,'Number of unique games' # transform input into a table with table columns and separator between columns

In this script awk was used to swap output columns. Command uniq -c outputs count column first and input columns after count column. We prefer to read day and then count left to right, so command awk '{print $2, $1}' was used to swap columns. (Additional thing that we have learned here is that command sort sorts best on multiple columns from left to right (sort -k 2,1 did not return correct results) Script that counts number of unique games per day is formatted using column (man column) and produces a result that looks like this:

1
2
3
4
5
6
Day        |Number of unique games
2020-07-01 |1711
2020-07-02 |1704
2020-07-03 |1669
2020-07-04 |1759
2020-07-05 |1706

Running this script for month of July 2020 via time shows the following metrics:

1
2
3
real 308.09
user 518.12
sys 43.77

Doing the same unique games analysis per month (instead of day) is more difficult because the date needs to be stripped of day part on hyphen character. In order to preserve game_id after the string is split by hyphen we change jq command to output game_id first and then captured_at. this will also force us to swap columns during the script two times.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
zcat /twitch/raw/helix/streams/2020_07_*/**/*.gz |
jq -r 'select(.game_id | length > 0) | [.game_id, .captured_at] | @csv' | # filter records where `game_id` is not empty.
tr -d '"' | # remove double quotes
cut -d'-' -f1,2 | # split by hyphen and take fields 1 and 2. This will keep game_id and only month part of the date (without time and day) from `captured_at`.
awk -F',' -v OFS=',' '{print $2, $1}' | # swap first column (game_id) and second (date)
sort -t',' -b -k 1,2 | # sort input by date first and then by game_id.
uniq | # deduplicate input. This will result in only one unique occcurence of game_id per month.
cut -d',' -f 1 | # take only first column (which is date)
uniq -c | # count number of unique occurences of date. 
awk '{print $2, $1}' | # swap first column (count of unique occurences) and second (date)
column --output-separator ' |' --table --table-columns Month,'Number of unique games' # transform input into a table with table columns and separator between columns

Output of the monthly script looks as following:

1
2
3
4
Month   |Number of unique games
2020-06 |6866
2020-07 |6995
2020-08 |6910

Running monthly script only for month of July 2020 using time gives similar performance results as daily script:

1
2
3
real 327.49
user 593.01
sys 47.19

Running the monthly script for three months (June, July, August) takes much more time:

1
2
3
real 998.15
user 1798.37
sys 145.51

In the previous part we have introduced tools jq, awk, and column to transform data, but we always analyzed only one dataset. Our next question what are most streamed games and their genres.

We could use code from the previous part to find most streamed game_ids, but this time we want to see actual game title. Game titles can be obtained from a dataset that contains data from Twitch Helix Top Games endpoint. Twitch Helix Top Games dataset contains fields id (which is the same as game_id in Twitch Streams dataset) and name (game title). Twitch Streams and Twitch Top Games datasets can be joined on fields game_id and id respectively. To join datasets on command line, we can use Linux command join (man join). Datasets can be joined if they are sorted on the same column. In order to avoid problems with having commas in game names, we will use caret character (^) as separator between columns (implemented using jq function join).

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
# join inputs on second column from first input ("game_id") and first column of the second input ("id")
join -t'^' -1 2 -2 1 -o 1.3,2.2,1.1 \
<(
    zcat /twitch/raw/helix/streams/2020_07_0[1-2]/**/*.json.gz | 
    jq -r 'select(.game_id | length > 0) | [.id, .game_id, .captured_at] | join("^")' | # extract fields "id" and "game_id" from json strings 
    cut -d' ' -f1 | # split input by whitespace and take first column. This will remove time part of the date
    sort -t'^' -b -k 1 | # perform sort starting from first column
    uniq | # deduplicate columns. Uniq needs sort because it deduplicates adjacent lines
    cut -d'^' -f2,3 | # split input by caret ^ and keep the second column and third
    sort -b | # sort input. Now only game_ids are present
    uniq -c | # deduplicate lines and print number of times each value occured in the input 
    sed -e 's/^ *//;s/ /^/' | # this first removes the leading spaces then replaces the first space with a caret ^
    sort -t'^' -b -k 2,2 # sort by second column. This is needed to join on games dataset
) \
<(
    zcat /twitch/raw/helix/top_games/2020_07_01/**/*.gz |
    jq -r '[.id, .name] | join("^")' | # extract fields "id" and "name" from json string
    sort -t'^' -b -k 1,1 | # sort by first column. This is needed to join on streams dataset
    uniq # deduplicate input
) | 
awk -F'^' -v OFS='^' '$3 >= 350' | # filter records that have at least 350 streams
sort -t'^' -b -k 1,1 -k 3n,3r |  # sort by day first and then by number of streams such that highest values are printed first
column -s'^' --output-separator ' |' --table --table-columns Day,'Game name','Number of streams' # pretty print output into a table

In this code, awk is used to filter records. For awk we also had to explicitly define input separator (-F) and output separator (OFS) because caret character is used as delimiter in this script. Finally, the output is nicely formatted into a table. From the table we see that category “Just Chatting” beats actual games by a wide margin.

Output:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
Day        |Game name                        |Number of streams
2020-07-01 |Just Chatting                    |5581
2020-07-01 |Fortnite                         |2652
2020-07-01 |Grand Theft Auto V               |2077
2020-07-01 |League of Legends                |1764
2020-07-01 |Call of Duty: Modern Warfare     |1705
2020-07-01 |VALORANT                         |1466
2020-07-01 |Counter-Strike: Global Offensive |1022
2020-07-01 |Minecraft                        |917
2020-07-01 |World of Warcraft                |835
2020-07-01 |Apex Legends                     |732
2020-07-01 |Music & Performing Arts          |667
2020-07-01 |Dead by Daylight                 |603
2020-07-01 |FIFA 20                          |467
2020-07-01 |Tom Clancy's Rainbow Six: Siege  |460
2020-07-01 |Escape From Tarkov               |440
2020-07-01 |Overwatch                        |437
2020-07-01 |Art                              |426
2020-07-01 |Dota 2                           |420
2020-07-01 |PLAYERUNKNOWN'S BATTLEGROUNDS    |358
2020-07-01 |NBA 2K20                         |351
2020-07-02 |Just Chatting                    |5511
2020-07-02 |Fortnite                         |2470
2020-07-02 |Grand Theft Auto V               |2106
2020-07-02 |League of Legends                |1768
2020-07-02 |Call of Duty: Modern Warfare     |1658
2020-07-02 |VALORANT                         |1407
2020-07-02 |Counter-Strike: Global Offensive |981
2020-07-02 |Minecraft                        |922
2020-07-02 |World of Warcraft                |762
2020-07-02 |Apex Legends                     |710
2020-07-02 |Music & Performing Arts          |656
2020-07-02 |FIFA 20                          |653
2020-07-02 |Dead by Daylight                 |584
2020-07-02 |Overwatch                        |443
2020-07-02 |Tom Clancy's Rainbow Six: Siege  |416
2020-07-02 |Dota 2                           |414
2020-07-02 |Art                              |412
2020-07-02 |Escape From Tarkov               |410
2020-07-02 |PLAYERUNKNOWN'S BATTLEGROUNDS    |377

Analysis of most popular games required one join between streams and top games datasets. Twitch datasets do not contain genre information. We have Giantbomb dataset and Twitch Top Games dataset from Twitch API V5, which contains mapping between Twitch Game IDs and Giantbomb Game IDs. This means that two joins will be required: one join between Twitch V5 Top Games and Giantbomb datasets to get a mapping between Twitch Game IDs and genres, and then between mapping from the previous step and Twitch Helix Streams dataset. We achieve this by piping result of first join command into the second join.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
# first join Twitch Top Games V5 dataset for July and Giantbomb dataset to obtain twitch_game_id and genre
join -t'^' -1 2 -2 1 -o 1.1,2.2 \
 <(
     zcat /twitch/raw/v5/top_games_sdc/2020-07-*/**/*.gz |
     jq --raw-output 'select(.game_giantbomb_id > 0) | [.game__id, .game_giantbomb_id, .game_name] | join("^")' | # valid games have giantbomb_id higher than 0
     sort -t'^' -b -k 2,2 | # sort on join column "game_giantbomb_id"
     uniq
 ) \
 <(
     zcat /giantbomb/raw/game/2020-07-*/**/*.gz |
     jq --raw-output '[.giantbomb_id, .genre] | join("^")' |
     sort -t'^' -b -k 1,1 | # sort on join column "giantbomb_id"
     uniq
 ) |
sort -t'^' -b -k 1,1 |
# now join Streams data with pre-joined Game_id - Genre dataset
join -t'^' -1 1 -2 1 -o 1.2,2.2 \
<(
    zcat /twitch/raw/helix/streams/2020_07_0[1-2]/**/*.json.gz | 
    jq -r 'select(.game_id | length > 0) | [.id, .game_id, .captured_at] | join("^")' | # extract fields "id", "game_id", and "captured_at" from json strings 
    cut -d' ' -f1 | # split input by whitespace and take first column. This will remove time part of the date
    sort -t'^' -b -k 1 | # perform sort starting from first column
    uniq | # deduplicate columns. Uniq needs sort because it deduplicates adjacent lines
    cut -d'^' -f2,3 | # split input by caret ^ and keep the second column and third
    sort -t'^' -b -k 1,1 # sort by second column. This is needed to join on games dataset.
) - |
sort -t'^' -b -k 1,1 -k 2,2 | # sort by day and then by genre
uniq  -c | # count number of streams by genre by day
sed -e 's/^ *//;s/ /^/' | # this first removes the leading spaces then replaces the first space with caret ^.
sort -t'^' -b -k 2,2 -k 1n,1r | # sort by day and then by number of streams in reverse order, such that highest values are printed first
awk -F'^' -v OFS='^' '$1 >= 350' | # filter records where number of streams is higher than 350
awk -F'^' -v OFS='^' '{print $2, $3, $1}' | # re-order columns to output day, genre, number of streams
column -s'^' --output-separator ' |' --table --table-columns Day,'Genre','Number of streams'

Biggest difficulty of this script is the proper definition of column positions for join, sort, cut commands. Output shows that “First-Person Shooter” is the most popular genre (at least on first and second of July). Output:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
Day        |Genre                |Number of streams
2020-07-01 |First-Person Shooter |6189
2020-07-01 |Action               |3345
2020-07-01 |Shooter              |3227
2020-07-01 |Driving/Racing       |3133
2020-07-01 |MOBA                 |2423
2020-07-01 |Role-Playing         |2195
2020-07-01 |Action-Adventure     |2012
2020-07-01 |MMORPG               |1881
2020-07-01 |Simulation           |1656
2020-07-01 |Strategy             |1078
2020-07-01 |Card Game            |862
2020-07-01 |Music/Rhythm         |801
2020-07-01 |Adventure            |736
2020-07-01 |Soccer               |505
2020-07-01 |Fighting             |499
2020-07-02 |First-Person Shooter |6599
2020-07-02 |Action               |3338
2020-07-02 |Driving/Racing       |3123
2020-07-02 |Shooter              |3053
2020-07-02 |MOBA                 |2425
2020-07-02 |Role-Playing         |2077
2020-07-02 |Action-Adventure     |1983
2020-07-02 |MMORPG               |1789
2020-07-02 |Simulation           |1616
2020-07-02 |Strategy             |1077
2020-07-02 |Card Game            |835
2020-07-02 |Music/Rhythm         |825
2020-07-02 |Adventure            |710
2020-07-02 |Soccer               |692
2020-07-02 |Fighting             |487

Hidden challenge: Unix timestamps

One unexpected problem of Twitch V5 Top Games dataset is that value captured_at is stored as Unix timestamp. Example JSON input is:

{"game_name":"League of Legends","game__id":21779,"game_giantbomb_id":24024,"game_localized_name":"League of Legends","game_locale":"en-us","viewers":127440,"channels":2505,"captured_at":1594033202000}

It is much easier to operate on timestamp that is human readable than on a timestamp that looks like 1594033202000. This problem can be solved with jq or awk.

JQ

JQ has built-in functions to work with dates. We can use function strftime to format into string like ‘2020-07-06’. Before transforming Unix timestamp into date string we must divide input by 1000, otherwise resulting will not be correct.

1
2
3
zcat top_games_3173c38b-abc2-4662-9738-56f1609a8095.json.gz |
jq -r 'select(.game_giantbomb_id > 0) | .captured_at |= (. / 1000 | strftime("%Y-%m-%d")) | [.game__id, .game_giantbomb_id, .game_name, .captured_at] | @csv' |
head
AWK

AWK also has built-in time functions. One of them is strftime, which works in the same way as function with the same name in Jq.

1
2
3
4
zcat top_games_3173c38b-abc2-4662-9738-56f1609a8095.json.gz |
jq -r 'select(.game_giantbomb_id > 0) | [.game__id, .game_giantbomb_id, .game_name, .captured_at] | @csv' |
awk -F, -v OFS=',' '{ts=strftime("%Y-%m-%d", $4/1000); print $1, $2, ts}' |
head

Special tool: GNU Datamash

GNU Datamash (website) is another handy Linux command-line tool. So far our questions were answered with counting because it is an easy operation to do in command-line without extra tools. GNU Datamash is an extra tool that adds such operations as grouping, calculation of median, mean, max, min values and some more. Let’s quickly find out what are the top 10 games by a peak number of viewers in July 2020 (based on Twitch V5 Top Games dataset).

1
2
3
4
5
6
zcat /twitch/raw/v5/top_games_sdc/2020-07-*/*/*.gz |
jq -r 'select(.game_giantbomb_id > 0) | [.game_name, .viewers] | join("^")' |
datamash --sort -t'^' --group=1 max 2 |
sort -t'^' -b -k 2,2nr |
head |
column -s'^' --output-separator ' |' --table --table-columns 'Game name','Max viewers'

In this command we tell Datamash to group by first column (“game_name”) and calculate max values for the second column (“viewers”, which is number of viewers).

Output:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
Game name                    |Max viewers
League of Legends            |494890
Minecraft                    |443496
Call of Duty: Modern Warfare |357199
Fortnite                     |328230
Grand Theft Auto V           |327837
Ghost of Tsushima            |312128
Grounded                     |256957
Fall Guys                    |242859
World of Warcraft            |238570
Hyper Scape                  |235309

Summary

We have used Linux command-line tools to answer some questions about Twitch data. Since the input data is in JSON format, we have focused on JSON processing and used jq, which is JSON command-line processor. Some scripts felt a little slow (no surprise here as there are around 12 million records per month for streams) and then we learned how to benchmark our shell scripts with time. Then we improved performance of shell scripts with GNU Parallel. One of the tools that we used is programming language AWK. In our examples, it was used for rather simple operations: change order of columns or filter records based on some condition. Lastly, questions about finding most popular games and genres required us to gather data from multiple datasets in order to answer these questions. Command join was used to combine multiple datasets. And even result of one join operation was used as an input to another join operation to combine more than two datasets.

All questions were essentially about counting something, because it is the easiest thing to do. But there are tools that help with more difficult questions. One of them is GNU Datamash that we briefly mentioned and even AWK and jq can be used to do some calculations (check this SO question) It is possible to answer many questions about data using only command-line tools, but solutions become more and more difficult, especially when multiple datasets are involved. A lot of time was spent on finding a combination of keys for sort that would produce a desired result. Second most time was spent on debugging input column numbers for join command.

Command-line tools are great to use when there is a constrained environment (it is not possible to install other tools) and there is no analytical system available (for example, Data Lake), but they lose readability on more complex analytical tasks.

Sample inputs

Twitch Helix Streams

{"id": "39027381472", "user_id": "189290002", "user_name": "x2Twins", "game_id": "33214", "type": "live", "title": "Viewer Games !play | Code x2Twins #ad | !socials !newvid", "viewer_count": 10446, "started_at": "2020-07-01T06:48:49Z", "language": "en", "tag_ids": ["6ea6bca4-4712-4ab9-a906-e3336a9d8039"], "captured_at": "2020-07-01 13:00:02+00:00"}

Twitch Helix Top Games

{"id": "33214", "name": "Fortnite", "captured_at": "2020-07-01 09:00:02+00:00"}

Twitch V5 Top Games

{"game_name":"Fortnite","game__id":33214,"game_giantbomb_id":37030,"game_localized_name":"Fortnite","game_locale":"en-us","viewers":200889,"channels":12441,"captured_at":1593966412000}

Giantbomb game

{"giantbomb_id":37030,"captured_at":1594166464000,"original_release_date":1500940800000,"name":"Fortnite","guid":"3030-37030","genre":"Shooter"}
cli  rill  linux