diff --git a/.gitignore b/.gitignore index d2fcdb9a4de..64ab02ef165 100644 --- a/.gitignore +++ b/.gitignore @@ -74,6 +74,9 @@ docs/_site # TODO Make the API auto generate and relocate into this api folder for webpage # docs/api +# Input dataset +scripts/ssb/data + # Test Artifacts src/test/scripts/**/*.dmlt src/test/scripts/functions/mlcontextin/ diff --git a/scripts/staging/ssb/Dockerfile b/scripts/staging/ssb/Dockerfile new file mode 100644 index 00000000000..2500d9722f6 --- /dev/null +++ b/scripts/staging/ssb/Dockerfile @@ -0,0 +1,37 @@ +#------------------------------------------------------------- +# +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. +# +#------------------------------------------------------------- + +# Follow the tutorial: https://docs.docker.com/compose/gettingstarted/#step-1-set-up + +FROM postgres:latest +# Init the data and load to the database with a sql script. 
+COPY other/ssb_init.sql /docker-entrypoint-initdb.d/ + +# Copy data into container +#COPY data_dir tmp + +#WORKDIR /tmp +#RUN sed -i 's/|$//' "customer.tbl" +#RUN sed -i 's/|$//' "part.tbl" +#RUN sed -i 's/|$//' "supplier.tbl" +#RUN sed -i 's/|$//' "date.tbl" +#RUN sed -i 's/|$//' "lineorder.tbl" + diff --git a/scripts/staging/ssb/ReadMe.md b/scripts/staging/ssb/ReadMe.md new file mode 100644 index 00000000000..100ce372b16 --- /dev/null +++ b/scripts/staging/ssb/ReadMe.md @@ -0,0 +1,220 @@ +# Star Schema Benchmark (SSB) for SystemDS [SystemDS-3862](https://issues.apache.org/jira/browse/SYSTEMDS-3862) + + +## Foundation +- There are [13 queries already written in SQL](https://github.com/apache/doris/tree/master/tools/ssb-tools/ssb-queries). +- There are existing DML relational algebra operations raSelect(), raJoin() and raGroupBy(). +- Our task is to implement the DML versions of these queries and run them in SystemDS and PostgreSQL. +- There are existing DML query implementations ([pull request](https://github.com/apache/systemds/pull/2280) and [code](https://github.com/apache/systemds/tree/main/scripts/staging/ssb)) from the previous group, which are somewhat slow and contain errors. That group also provided longer scripts to run experiments in SystemDS, PostgreSQL and DuckDB. +## Changes +1. **DML Queries** +- In this project, we improved several DML queries and fixed their errors. +- The major changes are: + - Switching the join algorithm from `sort-merge` to `hash2`. + - Consistently using transformencode() and transformdecode() for string comparisons (previously, this was only done in [q4_3](https://github.com/apache/systemds/tree/main/scripts/staging/ssb/queries/q4_3.dml)), which leads to correct results. +2. **Test Script** +- The main purpose of this project's test script is simply to run the queries. The focus is less on benchmarking execution times, because the queries run very slowly in SystemDS, so their times are not comparable to PostgreSQL and DuckDB. 
The main bottleneck is the join algorithm. +- Thus, the main differences from the previous group are: + - Using the simpler [ssb-dbgen](https://github.com/eyalroz/ssb-dbgen/tree/master) for generating data. + - No full testbench with detailed views for different execution times and databases. + - Running PostgreSQL and SystemDS in Docker containers instead of running them locally. See below. +## Directory structure +``` +ssb/ +├── docker-compose.yaml # Compose file for Docker containers (here for PostgreSQL) +├── Dockerfile +├── other # Some other files (necessary) +├── README.md # This explanation +├── queries/ # DML queries (q1_1.dml ... q4_3.dml) +│ ├── q1_1.dml - q1_3.dml +│ ├── q2_1.dml - q2_3.dml +│ ├── q3_1.dml - q3_4.dml +│ └── q4_1.dml - q4_3.dml +├── shell/ +│ ├── run_script.sh # Main script +└── (sql/ # SQL versions) +``` +## Setup +- First, install [Docker](https://docs.docker.com/get-started/get-docker/), [Docker Compose](https://docs.docker.com/compose/install/) and the necessary libraries. The script does not cover Docker installation. + + For Ubuntu, there are tutorials [for Docker](https://docs.docker.com/engine/install/ubuntu/#install-using-the-repository) and [Docker Compose](https://docs.docker.com/compose/install/linux/#install-using-the-repository) using the apt repository. You can add [Docker Desktop](https://docs.docker.com/desktop/setup/install/linux/ubuntu/), too. + +The shell script covers the installation of the following components. We use Ubuntu and Debian. For other operating systems, please consult the documentation. +- Docker Compose installation for Ubuntu/Debian (for other OS, look [here](https://docs.docker.com/compose/install/)) +- Docker version of the database system [SystemDS](https://apache.github.io/systemds/site/docker) +- Docker Compose version of [PostgreSQL](docker-compose.yaml) based on its [documentation](https://hub.docker.com/_/postgres). 
+- [ssb-dbgen](https://github.com/eyalroz/ssb-dbgen/tree/master) (SSB data set generator `datagen`) + +## Structure of the test system +![diagram](other/dia_ssb_script_structure1.jpg) +The script implements the structure shown above. +The data is generated by datagen and stored locally (on localhost). + +After that, it is copied into two database containers (SystemDS, PostgreSQL) and a local DuckDB database, where the queries are executed. + +## Run the script +### Before running the script +Before running the script, create a .env file to set the PostgreSQL environment variables. +``` +# in .env file +POSTGRES_USER=[YOUR_USERNAME] +POSTGRES_PASSWORD=[YOUR_PASSWORD] +POSTGRES_DB=[YOUR_DB_NAME] +PORT_NUMBER=[YOUR_PORT_NUMBER] +``` + +Mark the script as executable. +``` +$ chmod +x run_script.sh +``` +### Run the script +To run the queries, execute the shell script `run_script.sh` (in the ssb directory). It has the following parameter flags. +1. `-q`: (QUERY_NAME) Name of the query to be executed. + - `all`: executes all queries + - **[QUERY_NAME]** like q1_1 or q1.1: Executes the selected query, e.g., q1_1.dml. Both formats q1_1 and q1.1 are allowed; the name is translated automatically. + - Currently, the following queries are available (q1_1, q1_2, q1_3, q2_1, q2_2, q2_3, q3_1, q3_2, q3_3, q3_4, q4_1, q4_2, q4_3) + - Default: `q2_1` +2. `-s`: (SCALE) The numerical scale factor, e.g., 0.01 or 1. + - Be careful: please do not experiment with scale factors over 0.2 in SystemDS. Its join operation is currently very slow. + - Default: `0.1` +3. `-d`: (DB_SYSTEM) Name of the database system used. + - `all`: executes queries in all three databases. + - `systemds`: SystemDS executes DML scripts with basic output. + - `systemds_stats`: SystemDS executes DML scripts with extended output (--stats). + - `postgres`: PostgreSQL executes SQL queries. + - `duckdb`: DuckDB executes SQL queries. + - Default: `systemds` +4.
`-g`: (GUI_DOCKER) Use the Docker Desktop GUI. Takes no argument; set only the flag `-g`. +5. `-h`: (HELP) Display the script explanation from ReadMe.md. Takes no argument; set only the flag `-h`. +The command line could look like this: +``` +$ ./run_script.sh -q [YOUR_QUERY_NAME] -s [YOUR_SCALE] -d [YOUR_DB_SYSTEM] +``` +Examples: +``` +$ ./run_script.sh -q all -s 0.1 -d all +$ ./run_script.sh -q q4_3 -s 0.1 -d systemds +$ ./run_script.sh -q all -s 1 -d duckdb +$ ./run_script.sh -q q1.1 -s 1 -d postgres -g +``` + +## Example output +Here is how the (abridged) script output could look. +The script does the following steps: +- Loading arguments and environment variables +- Installing packages (and asking permission for it) +- Generating data with datagen (SSB data generator) +- Loading Docker images for SystemDS or PostgreSQL +- Initializing Docker database containers and the DuckDB database +- Loading the SQL schema and data into the databases +- Running the selected queries +``` +user@user1:~/systemds/scripts/staging/ssb$ ./shell/run_script.sh -q q2_3 -s 0.1 -d all -g +=== Test environment for SSB Data === + +g-flag is set. That means, the docker desktop GUI is used. +Arg 0 (SHELL_SCRIPT): ./shell/run_script.sh +Arg 1 (QUERY_NAME): q2_3 +Arg 2 (SCALE): 0.1 +Arg 3 (DB_SYSTEM): all +========== +Install required packages +Check whether the following packages exist: +If only SystemDS: docker 'docker compose' git gcc cmake make +For PostgreSQL: 'docker compose' +For DuckDB: duckdb +If using g-flag [GUI]: docker desktop +========== +Check for existing data directory and prepare the ssb-dbgen +Can we look for new updates of the datagen repository?. If there are, do you want to pull it? (yes/no) +yes +Your answer is 'no' +========== +Build ssb-dbgen and generate data with a given scale factor +[...] +SSB (Star Schema Benchmark) Population Generator (Version 1.0.0) +Copyright Transaction Processing Performance Council 1994 - 2000 +Generating data for part table [pid: 1]: done. 
+Generating data for suppliers table [pid: 1]: done. +[...] +Number of rows of created tables. +Table customer has 3000 rows. +Table part has 20000 rows. +Table supplier has 200 rows. +Table date has 255 rows. +Table lineorder has 600597 rows. +========== +Start the SystemDS docker container. +Docker Desktop is already running +========== +Execute DML queries in SystemDS + +Execute query q2_3.dml +WARNING: Using incubator modules: jdk.incubator.vector +Loading tables from directory: /scripts/data_dir +SUM(lo_revenue) | d_year | p_brand +# FRAME: nrow = 1, ncol = 3 +# C1 C2 C3 +# INT32 INT32 STRING +72081993 1992 MFGR#2239 + + +Q2.3 finished. + +SystemDS Statistics: +Total execution time: 9.924 sec. + +========== +Start the PostgreSQL Docker containter and load data. +Docker Desktop is already running + +Successfully copied 282kB to ssb-postgres-1:/tmp +Load customer table with number_of_rows: +TRUNCATE TABLE +COPY 3000 +Successfully copied 1.7MB to ssb-postgres-1:/tmp +Load part table with number_of_rows: +TRUNCATE TABLE +COPY 20000 +[...] +========== +Execute SQL queries in PostgresSQL +Execute query q2.3.sql +docker exec -i ssb-postgres-1 psql -U userA -d db1 < sql/q2.3.sql + sum | d_year | p_brand +----------+--------+----------- + 72081993 | 1992 | MFGR#2239 +(1 row) + +========== +Start a DuckDB persistent database and load data. +Load customer table +┌────────────────┐ +│ number_of_rows │ +│ int64 │ +├────────────────┤ +│ 3000 │ +└────────────────┘ +Load part table +┌────────────────┐ +│ number_of_rows │ +│ int64 │ +├────────────────┤ +│ 20000 │ +└────────────────┘ +[...] +========== +Execute SQL queries in DuckDB +Execute query q2.3.sql +┌─────────────────┬────────┬───────────┐ +│ sum(lo_revenue) │ d_year │ p_brand │ +│ int128 │ int32 │ varchar │ +├─────────────────┼────────┼───────────┤ +│ 72081993 │ 1992 │ MFGR#2239 │ +└─────────────────┴────────┴───────────┘ +========== +Test bench finished successfully. 
+``` + +## Troubleshooting +- If you encounter Docker problems like "Permission denied" or data not being loaded successfully into the tables, try restarting Docker or removing the container. You can also switch between the standard Docker Engine (without GUI) and Docker Desktop (with GUI) with the flag `-g`. \ No newline at end of file diff --git a/scripts/staging/ssb/docker-compose.yaml b/scripts/staging/ssb/docker-compose.yaml new file mode 100644 index 00000000000..70aec4dcab9 --- /dev/null +++ b/scripts/staging/ssb/docker-compose.yaml @@ -0,0 +1,46 @@ +#------------------------------------------------------------- +# +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements.  See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership.  The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License.  You may obtain a copy of the License at +# +#   http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied.  See the License for the +# specific language governing permissions and limitations +# under the License. +# +#------------------------------------------------------------- + +## The docker compose file to create a postgres instance. +#docker compose up --build +## Or (if that does not work) +#docker compose -f "$[THE_ACTUAL_PATH]/docker-compose.yaml" up -d --build + +## Create a .env file and modify it before each docker compose up. +## in .env file +#POSTGRES_USER=[YOUR_USERNAME] +#POSTGRES_PASSWORD=[YOUR_PASSWORD] +#POSTGRES_DB=[YOUR_DB_NAME] +#PORT_NUMBER=[YOUR_PORT_NUMBER] + +#This docker-compose file is linked to the Dockerfile. + +services: +  postgres: +    build: +      context: . 
+    restart: always +    environment: +      POSTGRES_USER: ${POSTGRES_USER} +      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD} +      POSTGRES_DB: ${POSTGRES_DB} +    ports: +      - "${PORT_NUMBER}:5432" \ No newline at end of file diff --git a/scripts/staging/ssb/other/dia_ssb_script_structure1.jpg b/scripts/staging/ssb/other/dia_ssb_script_structure1.jpg new file mode 100644 index 00000000000..a2e4a739195 Binary files /dev/null and b/scripts/staging/ssb/other/dia_ssb_script_structure1.jpg differ diff --git a/scripts/staging/ssb/other/script_flags_help.txt b/scripts/staging/ssb/other/script_flags_help.txt new file mode 100644 index 00000000000..3e82ce27a73 --- /dev/null +++ b/scripts/staging/ssb/other/script_flags_help.txt @@ -0,0 +1,32 @@ +From ReadMe.md: +To run the queries, execute the shell script `run_script.sh` (in the ssb directory). It has the following parameter flags. +1. `-q`: (QUERY_NAME) Name of the query to be executed. + - `all`: executes all queries + - **[QUERY_NAME]** like q1_1 or q1.1: Executes the selected query, e.g., q1_1.dml. Both formats q1_1 and q1.1 are allowed; the name is translated automatically. + - Currently, the following queries are available (q1_1, q1_2, q1_3, q2_1, q2_2, q2_3, q3_1, q3_2, q3_3, q3_4, q4_1, q4_2, q4_3) + - Default: `q2_1` +2. `-s`: (SCALE) The numerical scale factor, e.g., 0.01 or 1. + - Be careful: please do not experiment with scale factors over 0.2 in SystemDS. Its join operation is currently very slow. + - Default: `0.1` +3. `-d`: (DB_SYSTEM) Name of the database system used. + - `all`: executes queries in all three databases. + - `systemds`: SystemDS executes DML scripts with basic output. + - `systemds_stats`: SystemDS executes DML scripts with extended output (--stats). + - `postgres`: PostgreSQL executes SQL queries. + - `duckdb`: DuckDB executes SQL queries. + - Default: `systemds` +4. `-g`: (GUI_DOCKER) Use the Docker Desktop GUI. Takes no argument; set only the flag `-g`. +5.
`-h`: (HELP) Display the script explanation from ReadMe.md. Takes no argument; set only the flag `-h`. +The command line could look like this: +``` +$ ./run_script.sh -q [YOUR_QUERY_NAME] -s [YOUR_SCALE] -d [YOUR_DB_SYSTEM] +``` +Examples: +``` +$ ./run_script.sh -q all -s 0.1 -d all +$ ./run_script.sh -q q4_3 -s 0.1 -d systemds +$ ./run_script.sh -q all -s 1 -d duckdb +$ ./run_script.sh -q q1.1 -s 1 -d postgres -g +``` + +For more details, take a closer look at ReadMe.md. \ No newline at end of file diff --git a/scripts/staging/ssb/other/ssb_init.sql b/scripts/staging/ssb/other/ssb_init.sql new file mode 100644 index 00000000000..4e28a9cf712 --- /dev/null +++ b/scripts/staging/ssb/other/ssb_init.sql @@ -0,0 +1,117 @@ +-- Use https://github.com/eyalroz/ssb-dbgen/blob/master/doc/ssb.ddl +-- A bit modified. +-- Drop tables if they exist +DROP TABLE IF EXISTS lineorder CASCADE; +DROP TABLE IF EXISTS customer CASCADE; +DROP TABLE IF EXISTS part CASCADE; +DROP TABLE IF EXISTS supplier CASCADE; +DROP TABLE IF EXISTS date CASCADE; + +-- Date dimension +CREATE TABLE date ( + d_datekey INTEGER NOT NULL, + d_date VARCHAR(19) NOT NULL, + d_dayofweek VARCHAR(10) NOT NULL, + d_month VARCHAR(10) NOT NULL, + d_year INTEGER NOT NULL, + d_yearmonthnum INTEGER NOT NULL, + d_yearmonth VARCHAR(8) NOT NULL, + d_daynuminweek INTEGER NOT NULL, + d_daynuminmonth INTEGER NOT NULL, + d_daynuminyear INTEGER NOT NULL, + d_monthnuminyear INTEGER NOT NULL, + d_weeknuminyear INTEGER NOT NULL, + d_sellingseason VARCHAR(13) NOT NULL, + d_lastdayinweekfl VARCHAR(1) NOT NULL, + d_lastdayinmonthfl VARCHAR(1) NOT NULL, + d_holidayfl VARCHAR(1) NOT NULL, + d_weekdayfl VARCHAR(1) NOT NULL +); + +-- Customer dimension +CREATE TABLE customer +( + c_custkey INTEGER NOT NULL, + c_name VARCHAR(25) NOT NULL, + c_address VARCHAR(25) NOT NULL, + c_city VARCHAR(10) NOT NULL, + c_nation VARCHAR(15) NOT NULL, + c_region VARCHAR(12) NOT NULL, + c_phone VARCHAR(15) NOT NULL, + c_mktsegment VARCHAR(10) NOT NULL +); 
+ +-- Part dimension +CREATE TABLE part ( + p_partkey INTEGER NOT NULL, + p_name VARCHAR(22) NOT NULL, + p_mfgr VARCHAR(6), + p_category VARCHAR(7) NOT NULL, + p_brand VARCHAR(9) NOT NULL, + p_color VARCHAR(11) NOT NULL, + p_type VARCHAR(25) NOT NULL, + p_size INTEGER NOT NULL, + p_container VARCHAR(10) NOT NULL +); + +-- Supplier dimension +CREATE TABLE supplier ( + s_suppkey INTEGER NOT NULL, + s_name VARCHAR(25) NOT NULL, + s_address VARCHAR(25) NOT NULL, + s_city VARCHAR(10) NOT NULL, + s_nation VARCHAR(15) NOT NULL, + s_region VARCHAR(12) NOT NULL, + s_phone VARCHAR(15) NOT NULL +); + +-- LineOrder fact table +CREATE TABLE lineorder ( + lo_orderkey INTEGER NOT NULL, + lo_linenumber INTEGER NOT NULL, + lo_custkey INTEGER NOT NULL, + lo_partkey INTEGER NOT NULL, + lo_suppkey INTEGER NOT NULL, + lo_orderdate INTEGER NOT NULL, + lo_orderpriority VARCHAR(15) NOT NULL, + lo_shippriority VARCHAR(1) NOT NULL, + lo_quantity INTEGER NOT NULL, + lo_extendedprice INTEGER NOT NULL, + lo_ordertotalprice INTEGER NOT NULL, + lo_discount INTEGER NOT NULL, + lo_revenue INTEGER NOT NULL, + lo_supplycost INTEGER NOT NULL, + lo_tax INTEGER NOT NULL, + lo_commitdate INTEGER NOT NULL, + lo_shipmode VARCHAR(10) NOT NULL +); + +ALTER TABLE date +ADD PRIMARY KEY(d_datekey); + +ALTER TABLE supplier +ADD PRIMARY KEY(s_suppkey); + +ALTER TABLE customer +ADD PRIMARY KEY (c_custkey); + +ALTER TABLE part +ADD PRIMARY KEY (p_partkey); + +--ALTER TABLE lineorder +--ADD PRIMARY KEY (lo_orderkey); + +--ALTER TABLE lineorder +--ADD FOREIGN KEY (lo_orderdate) REFERENCES date (d_datekey); + +--ALTER TABLE lineorder +--ADD FOREIGN KEY (lo_commitdate) REFERENCES date (d_datekey); + +--ALTER TABLE lineorder +--ADD FOREIGN KEY (lo_suppkey) REFERENCES supplier (s_suppkey); + +--ALTER TABLE lineorder +--ADD FOREIGN KEY (lo_custkey) REFERENCES customer (c_custkey); + +--ALTER TABLE lineorder +--ADD FOREIGN KEY (lo_partkey) REFERENCES part (p_partkey); diff --git a/scripts/staging/ssb/queries/q1_1.dml 
b/scripts/staging/ssb/queries/q1_1.dml new file mode 100644 index 00000000000..992e1691453 --- /dev/null +++ b/scripts/staging/ssb/queries/q1_1.dml @@ -0,0 +1,140 @@ +#------------------------------------------------------------- +# +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. +# +#------------------------------------------------------------- + + + +/* DML-script implementing the ssb query Q1.1 in SystemDS. +**input_dir="/scripts/ssb/data" + +* Run with docker: +docker run -it --rm -v $PWD:/scripts/ apache/systemds:nightly -f /scripts/queries/q1_1.dml -nvargs input_dir="/scripts/data/" + +SELECT SUM(lo_extendedprice * lo_discount) AS REVENUE +FROM lineorder, date +WHERE + lo_orderdate = d_datekey + AND d_year = 1993 + AND lo_discount BETWEEN 1 AND 3 + AND lo_quantity < 25; + +*Please run the original SQL query (eg. in Postgres) +to verify the correctness of DML version. +-> First tests: Works on the dataset with scale factor 0.1. + +*Based on the older implementation. +https://github.com/ghafek/systemds/blob/feature/ssb-benchmark/scripts/ssb/queries/q1_1.dml +In comparison to older version the join method was changed +from sort-merge to hash2 to improve the performance. 
+ +Input parameters: +input_dir - Path to input directory containing the table files (e.g., ./data) +*/ + +# Call ra-modules with ra-functions. +source("./scripts/builtin/raSelection.dml") as raSel +source("./scripts/builtin/raJoin.dml") as raJoin + +# Set input parameters. +input_dir = ifdef($input_dir, "./data"); +print("Loading tables from directory: " + input_dir); + +# Read and load input CSV files from date and lineorder. +date_csv = read(input_dir + "/date.tbl", data_type="frame", format="csv", header=FALSE, sep="|"); +lineorder_csv = read(input_dir + "/lineorder.tbl", data_type="frame", format="csv", header=FALSE, sep="|"); + +# General variables. +hasRows = 1; # If hasRows = 0, the result table is empty. + +# -- Data preparation -- + +# Extract only the necessary columns from date and lineorder table. +# Extracted: COL-1 | COL-5 +# => D_DATEKEY | D_YEAR +date_csv_min = cbind(date_csv[, 1], date_csv[, 5]); +date_matrix_min = as.matrix(date_csv_min); + +# Extracted: COL-6 | COL-9 | COL-10 | COL-12 +# => LO_ORDERDATE | LO_QUANTITY | LO_EXTPRICE | LO_DISCOUNT +lineorder_csv_min = cbind(lineorder_csv[, 6], lineorder_csv[, 9], lineorder_csv[, 10], lineorder_csv[, 12]); +lineorder_matrix_min = as.matrix(lineorder_csv_min); + +# -- Filter the data with RA-SELECTION function. 
+ +# D_YEAR = 1993 +d_year_filt = raSel::m_raSelection(date_matrix_min, col=2, op="==", val=1993); +if( as.scalar(d_year_filt[1,1]) == 0){ + hasRows = 0; +} +# LO_QUANTITY < 25 +if(hasRows){ + lo_filt = raSel::m_raSelection(lineorder_matrix_min, col=2, op="<", val=25); + if( as.scalar(lo_filt[1,1]) == 0){ + hasRows = 0; + } +} +# LO_DISCOUNT BETWEEN 1 AND 3 +if(hasRows){ + lo_filt = raSel::m_raSelection(lo_filt, col=4, op=">=", val=1); + lo_filt = raSel::m_raSelection(lo_filt, col=4, op="<=", val=3); + if( as.scalar(lo_filt[1,1]) == 0){ + hasRows = 0; + } + else{ + # Minimize LO TABLE + # => LO_ORDERDATE | LO_EXTPRICE | LO_DISCOUNT + lo_filt = cbind(lo_filt[, 1], lo_filt[, 3], lo_filt[, 4]); + } +} + +# -- Join -- +# Join LINEORDER and DATE tables with RA-JOIN function +joined_matrix = matrix(0, rows=0, cols=1); +# WHERE LO_ORDERDATE = D_DATEKEY +# => (D-KEY | D-YEAR) | (LO_ORDERDATE | LO_EXTPRICE | LO_DISCOUNT) +if(hasRows){ + joined_matrix = raJoin::m_raJoin(A=d_year_filt, colA=1, B=lo_filt, colB=1, method="hash2"); + if(nrow(joined_matrix[,1]) == 0){ + hasRows = 0; + } +} +# Print the first row. 
+#print(toString(joined_matrix[1,])) + +# -- Aggregation (SUM)-- + +if(hasRows){ + # SUM(lo_extendedprice * lo_discount) AS REVENUE + # Use the joined_matrix with LO_EXTPRICE (COL-4), LO_DISCOUNT (COL-5) + lo_extprice = joined_matrix[, 4]; + lo_disc = joined_matrix[, 5]; + revenue = sum(lo_extprice * lo_disc); + + print("REVENUE") + print(as.integer(revenue)); + + print("\nQ1.1 finished.\n"); +} +else{ + print("REVENUE") + print("The result table has 0 rows.") + print("\nQ1.1 finished.\n"); +} + diff --git a/scripts/staging/ssb/queries/q1_2.dml b/scripts/staging/ssb/queries/q1_2.dml new file mode 100644 index 00000000000..599ee849b29 --- /dev/null +++ b/scripts/staging/ssb/queries/q1_2.dml @@ -0,0 +1,149 @@ +#------------------------------------------------------------- +# +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. +# +#------------------------------------------------------------- + +/* DML-script implementing the ssb query Q1.2 in SystemDS. 
+**input_dir="/scripts/ssb/data" + +* Run with docker: +docker run -it --rm -v $PWD:/scripts/ apache/systemds:nightly -f /scripts/queries/q1_2.dml -nvargs input_dir="/scripts/data/" + +# Open in scripts/ssb/ +../../bin/systemds queries/q1_2.dml -nvargs input_dir="data/" + +SELECT SUM(lo_extendedprice * lo_discount) AS REVENUE +FROM lineorder, date --dates +WHERE + lo_orderdate = d_datekey + AND d_yearmonth = 'Jan1994' + AND lo_discount BETWEEN 4 AND 6 + AND lo_quantity BETWEEN 26 AND 35; + +*Please run the original SQL query (e.g. in Postgres) +to verify the correctness of the DML version. + +*Based on the older implementation. +https://github.com/ghafek/systemds/blob/feature/ssb-benchmark/scripts/ssb/queries/q1_2.dml +*Especially: +https://github.com/ghafek/systemds/blob/feature/ssb-benchmark/scripts/ssb/queries/q4_3.dml +In comparison to the older version, the join method was changed +from sort-merge to hash2 to improve the performance. +A binary column of d_filt (date_filtered) was removed. + +Input parameters: +input_dir - Path to input directory containing the table files (e.g., ./data) +*/ + +# Call ra-modules with ra-functions. +source("./scripts/builtin/raSelection.dml") as raSel +source("./scripts/builtin/raJoin.dml") as raJoin + +# Set input parameters. +input_dir = ifdef($input_dir, "./data"); +print("Loading tables from directory: " + input_dir); +# Read and load input CSV files from date and lineorder. +date_csv = read(input_dir + "/date.tbl", data_type="frame", format="csv", header=FALSE, sep="|"); +lineorder_csv = read(input_dir + "/lineorder.tbl", data_type="frame", format="csv", header=FALSE, sep="|"); + +# General variables. +hasRows = 1; # If hasRows = 0, the result table is empty. + +# -- Data preparation -- + +# Extract only the necessary columns from date and lineorder table. 
+ +# Extracted: COL-6 | COL-9 | COL-10 | COL-12 +# => LO_ORDERDATE | LO_QUANTITY | LO_EXTPRICE | LO_DISCOUNT +lineorder_csv_min = cbind(lineorder_csv[, 6], lineorder_csv[, 9], lineorder_csv[, 10], lineorder_csv[, 12]); +lineorder_matrix_min = as.matrix(lineorder_csv_min); + +# -- Filter the data with RA-SELECTION function. + +# LO_DISCOUNT BETWEEN 4 AND 6 +lo_filt = raSel::m_raSelection(lineorder_matrix_min, col=4, op=">=", val=4); +lo_filt = raSel::m_raSelection(lo_filt, col=4, op="<=", val=6); +if( as.scalar(lo_filt[1,1]) == 0){ + hasRows = 0; +} +# LO_QUANTITY BETWEEN 26 AND 35 +if(hasRows){ + lo_filt = raSel::m_raSelection(lo_filt, col=2, op=">=", val=26); + lo_filt = raSel::m_raSelection(lo_filt, col=2, op="<=", val=35); + if( as.scalar(lo_filt[1,1]) == 0){ + hasRows = 0; + } + else{ + # Minimize LO TABLE + # => LO_ORDERDATE | LO_EXTPRICE | LO_DISCOUNT + lo_filt = cbind(lo_filt[, 1], lo_filt[, 3], lo_filt[, 4]); + } +} + +# -- Filter table over string values. +# Extracted: COL-1 | COL-7 +# D_DATEKEY | D_YEARMONTH +d_filt = matrix(0, rows=0, cols=1); +if(hasRows){ + # Build filtered DATE table (D_YEARMONTH = 'Jan1994') + for (i in 1:nrow(date_csv)) { + if (as.scalar(date_csv[i,7]) == "Jan1994") { + key_val = as.double(as.scalar(date_csv[i,1])); + d_filt = rbind(d_filt, matrix(key_val, rows=1, cols=1)); + } + } + if (nrow(d_filt) == 0) { + hasRows = 0; + } +} + + +# -- Join -- +# Join LINEORDER and DATE tables with RA-JOIN function +joined_matrix = matrix(0, rows=0, cols=1); +# WHERE LO_ORDERDATE = D_DATEKEY +# => (D_DATEKEY) | (LO_ORDERDATE | LO_EXTPRICE | LO_DISCOUNT) +if(hasRows){ + joined_matrix = raJoin::m_raJoin(A=d_filt, colA=1, B=lo_filt, colB=1, method="hash2"); + if(nrow(joined_matrix[,1]) == 0){ + hasRows = 0; + } +} + +# Print the first row. 
+#print(toString(joined_matrix[1,])) + +# -- Aggregation (SUM)-- +if(hasRows){ + # SUM(lo_extendedprice * lo_discount) AS REVENUE + # Use the joined_matrix with LO_EXTPRICE (COL-3), LO_DISCOUNT (COL-4) + lo_extprice = joined_matrix[, 3]; + lo_disc = joined_matrix[, 4]; + revenue = sum(lo_extprice * lo_disc); + + print("REVENUE") + print(as.integer(revenue)); + + print("\nQ1.2 finished.\n"); +} +else{ + print("REVENUE") + print("The result table has 0 rows.") + print("\nQ1.2 finished.\n"); +} \ No newline at end of file diff --git a/scripts/staging/ssb/queries/q1_3.dml b/scripts/staging/ssb/queries/q1_3.dml new file mode 100644 index 00000000000..4a9484da11e --- /dev/null +++ b/scripts/staging/ssb/queries/q1_3.dml @@ -0,0 +1,142 @@ +#------------------------------------------------------------- +# +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements.  See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership.  The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License.  You may obtain a copy of the License at +# +#   http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied.  See the License for the +# specific language governing permissions and limitations +# under the License. +# +#------------------------------------------------------------- + +/* DML-script implementing the ssb query Q1.3 in SystemDS. 
+**input_dir="/scripts/ssb/data" + +* Run with docker: +docker run -it --rm -v $PWD:/scripts/ apache/systemds:nightly -f /scripts/queries/q1_3.dml -nvargs input_dir="/scripts/data/" + +SELECT + SUM(lo_extendedprice * lo_discount) AS REVENUE +FROM lineorder, date +WHERE + lo_orderdate = d_datekey + AND d_weeknuminyear = 6 + AND d_year = 1994 + AND lo_discount BETWEEN 5 AND 7 + AND lo_quantity BETWEEN 26 AND 35; + +*Please run the original SQL query (e.g. in Postgres) +to verify the correctness of the DML version. +-> First tests: Works on the dataset with scale factor 0.1. + +*Based on the older implementation. +https://github.com/ghafek/systemds/blob/feature/ssb-benchmark/scripts/ssb/queries/q1_1.dml +In comparison to the older version, the join method was changed +from sort-merge to hash2 to improve the performance. + +Input parameters: +input_dir - Path to input directory containing the table files (e.g., ./data) +*/ + +# Call ra-modules with ra-functions. +source("./scripts/builtin/raSelection.dml") as raSel +source("./scripts/builtin/raJoin.dml") as raJoin + +# Set input parameters. +input_dir = ifdef($input_dir, "./data"); +print("Loading tables from directory: " + input_dir); + +# Read and load input CSV files from date and lineorder. +date_csv = read(input_dir + "/date.tbl", data_type="frame", format="csv", header=FALSE, sep="|"); +lineorder_csv = read(input_dir + "/lineorder.tbl", data_type="frame", format="csv", header=FALSE, sep="|"); + +# General variables. +hasRows = 1; # If hasRows = 0, the result table is empty. + +# -- Data preparation -- + +# Extract only the necessary columns from date and lineorder table. 
+# Extracted: COL-1 | COL-5 | COL-12 +# => D_DATEKEY | D_YEAR | D_WEEKNUMINYEAR +date_csv_min = cbind(date_csv[, 1], date_csv[, 5], date_csv[, 12]); +date_matrix_min = as.matrix(date_csv_min); + +# Extracted: COL-6 | COL-9 | COL-10 | COL-12 +# => LO_ORDERDATE | LO_QUANTITY | LO_EXTPRICE | LO_DISCOUNT +lineorder_csv_min = cbind(lineorder_csv[, 6], lineorder_csv[, 9], lineorder_csv[, 10], lineorder_csv[, 12]); +lineorder_matrix_min = as.matrix(lineorder_csv_min); + +# -- Filter the data with RA-SELECTION function. + +# WHERE D_YEAR = 1994 +d_filt = raSel::m_raSelection(date_matrix_min, col=2, op="==", val=1994); +# WHERE D_WEEKNUMINYEAR = 6 +d_filt = raSel::m_raSelection(d_filt, col=3, op="==", val=6); +if( as.scalar(d_filt[1,1]) == 0){ + hasRows = 0; +} +# WHERE LO_DISCOUNT BETWEEN 5 AND 7 +if(hasRows){ + lo_filt = raSel::m_raSelection(lineorder_matrix_min, col=4, op=">=", val=5); + lo_filt = raSel::m_raSelection(lo_filt, col=4, op="<=", val=7); + if( as.scalar(lo_filt[1,1]) == 0){ + hasRows = 0; + } +} +# WHERE LO_QUANTITY BETWEEN 26 AND 35 +if(hasRows){ + lo_filt = raSel::m_raSelection(lo_filt, col=2, op=">=", val=26); + lo_filt = raSel::m_raSelection(lo_filt, col=2, op="<=", val=35); + if( as.scalar(lo_filt[1,1]) == 0){ + hasRows = 0; + } + else{ + # Minimize LO TABLE + # => LO_ORDERDATE | LO_EXTPRICE | LO_DISCOUNT + lo_filt = cbind(lo_filt[, 1], lo_filt[, 3], lo_filt[, 4]); + } +} +#print(toString(lo_filt[1,])) + +# -- Join -- +# Join LINEORDER and DATE tables with RA-JOIN function +joined_matrix = matrix(0, rows=0, cols=1); +# WHERE LO_ORDERDATE = D_DATEKEY +# Print the first row. 
+# => (D_DATEKEY | D_YEAR | D_WEEKNUMINYEAR) | (LO_ORDERDATE | LO_EXTPRICE | LO_DISCOUNT) +if(hasRows){ + joined_matrix = raJoin::m_raJoin(A=d_filt, colA=1, B=lo_filt, colB=1, method="hash2"); + if(nrow(joined_matrix[,1]) == 0){ + hasRows = 0; + } +} +#print(toString(joined_matrix[1,])) + +# -- Aggregation (SUM)-- +if(hasRows){ + # SUM(lo_extendedprice * lo_discount) AS REVENUE + # Use the joined_matrix with LO_EXTPRICE (COL-5), LO_DISCOUNT (COL-6) + lo_extprice = joined_matrix[, 5]; + lo_disc = joined_matrix[, 6]; + revenue = sum(lo_extprice * lo_disc); + + print("REVENUE") + print(as.integer(revenue)); + + print("\nQ1.3 finished.\n"); +} +else{ + print("REVENUE") + print("The result table has 0 rows.") + print("\nQ1.3 finished.\n"); +} \ No newline at end of file diff --git a/scripts/staging/ssb/queries/q2_1.dml b/scripts/staging/ssb/queries/q2_1.dml new file mode 100644 index 00000000000..24e70a7d01a --- /dev/null +++ b/scripts/staging/ssb/queries/q2_1.dml @@ -0,0 +1,220 @@ +#------------------------------------------------------------- +# +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. +# +#------------------------------------------------------------- + + +/* DML-script implementing the ssb query Q2.1 in SystemDS. 
+**input_dir="/scripts/ssb/data" + +* Run with docker: +docker run -it --rm -v $PWD:/scripts/ apache/systemds:nightly -f /scripts/queries/q2_1.dml -nvargs input_dir="/scripts/data/" + +SELECT SUM(lo_revenue), d_year, p_brand +FROM lineorder, date, part, supplier +WHERE + lo_orderdate = d_datekey + AND lo_partkey = p_partkey + AND lo_suppkey = s_suppkey + AND p_category = 'MFGR#12' + AND s_region = 'AMERICA' + GROUP BY d_year, p_brand + ORDER BY p_brand; + +*Please run the original SQL query (e.g., in Postgres) +to verify the correctness of the DML version. +-> First tests: Works on the dataset with scale factor 0.1. +-> Sorting on string columns does not work. + +*Based on older implementations. +https://github.com/ghafek/systemds/blob/feature/ssb-benchmark/scripts/ssb/queries/q1_1.dml +*Especially: +https://github.com/ghafek/systemds/blob/feature/ssb-benchmark/scripts/ssb/queries/q4_3.dml +Compared to the older version, the join method was changed +from sort-merge to hash2 to improve performance. + +Input parameters: +input_dir - Path to input directory containing the table files (e.g., ./data) +*/ + +# Call ra-modules with ra-functions. +source("./scripts/builtin/raSelection.dml") as raSel +source("./scripts/builtin/raJoin.dml") as raJoin +source("./scripts/builtin/raGroupby.dml") as raGrp + +# Set input parameters. +input_dir = ifdef($input_dir, "./data"); +print("Loading tables from directory: " + input_dir); + +# Read and load input CSV files from lineorder, date, part, supplier. +lineorder_csv = read(input_dir + "/lineorder.tbl", data_type="frame", format="csv", header=FALSE, sep="|"); +date_csv = read(input_dir + "/date.tbl", data_type="frame", format="csv", header=FALSE, sep="|"); +part_csv = read(input_dir + "/part.tbl", data_type="frame", format="csv", header=FALSE, sep="|"); +supp_csv = read(input_dir + "/supplier.tbl", data_type="frame", format="csv", header=FALSE, sep="|"); + +# General variables.
+general_spec = "{ \"ids\": false, \"recode\": [\"C1\"] }"; +hasRows = 1; # If hasRows = 0, the result table is empty. + +# -- Data preparation -- + +# Extract only the necessary columns from tables. +# Extracted: COL-4 | COL-5 | COL-6 | COL-13 +# => LO_PARTKEY | LO_SUPPKEY | LO_ORDERDATE | LO_REVENUE +lineorder_csv_min = cbind(lineorder_csv[, 4], lineorder_csv[, 5], lineorder_csv[, 6], lineorder_csv[, 13]); +lineorder_matrix_min = as.matrix(lineorder_csv_min); + +# Extracted: COL-1 | COL-5 +# => D_DATEKEY | D_YEAR +date_csv_min = cbind(date_csv[, 1], date_csv[, 5]); +date_matrix_min = as.matrix(date_csv_min); + +# -- Filter tables over string values. + +# Prepare PART table on-the-fly encodings +# Extracted: COL-1 | COL-4 | COL-5 +# P_PARTKEY | P_CATEGORY | P_BRAND +# (only need P_BRAND encoding, filter by P_CATEGORY string) +[part_brand_enc_f, part_brand_meta] = transformencode(target=part_csv[,5], spec=general_spec); +#print(toString(part_brand_enc_f)); + +part_filt_keys = matrix(0, rows=0, cols=1); +part_filt_brand = matrix(0, rows=0, cols=1); +part_filt = matrix(0, rows=0, cols=1); + +# Build filtered PART table (P_CATEGORY = 'MFGR#12'), keeping key and encoded brand +for (i in 1:nrow(part_csv)) { + if (as.scalar(part_csv[i,4]) == "MFGR#12") { + key_val = as.double(as.scalar(part_csv[i,1])); + brand_code = as.double(as.scalar(part_brand_enc_f[i,1])); + part_filt_keys = rbind(part_filt_keys, matrix(key_val, rows=1, cols=1)); + part_filt_brand = rbind(part_filt_brand, matrix(brand_code, rows=1, cols=1)); + } +} +if (nrow(part_filt_keys) == 0) { + hasRows = 0; +} +else{ + part_filt = cbind(part_filt_keys, part_filt_brand); +} + +# Extracted: COL-1 | COL-6 +# S_SUPPKEY | S_REGION +supp_filt = matrix(0, rows=0, cols=1); + +if(hasRows){ + # Build filtered SUPPLIER table (S_REGION = 'AMERICA') + for (i in 1:nrow(supp_csv)) { + if (as.scalar(supp_csv[i,6]) == "AMERICA") { + key_val = as.double(as.scalar(supp_csv[i,1])); + supp_filt = rbind(supp_filt,
matrix(key_val, rows=1, cols=1)); + } + } + if (nrow(supp_filt) == 0) { + hasRows = 0; + } +} +#print(toString(supp_filt[1,])) + +# -- JOIN TABLES WITH RA-JOIN FUNCTION -- + +# Join LINEORDER table with PART, SUPPLIER, DATE tables (star schema) +# Join order does matter! +lo_part = matrix(0, rows=0, cols=1); +lo_part_supp = matrix(0, rows=0, cols=1); +joined_matrix = matrix(0, rows=0, cols=1); +# LINEORDER table with DATE, PART, SUPPLIER is much slower! +# WHERE LO_PARTKEY = P_PARTKEY +if(hasRows){ + lo_part = raJoin::m_raJoin(A=part_filt, colA=1, B=lineorder_matrix_min, colB=1, method="hash2"); + if(nrow(lo_part[,1]) == 0){ + hasRows = 0; + } +} +# WHERE LO_SUPPKEY = S_SUPPKEY +if(hasRows){ + lo_part_supp = raJoin::m_raJoin(A=supp_filt, colA=1, B=lo_part, colB=4, method="hash2"); + if(nrow(lo_part_supp[,1]) == 0){ + hasRows = 0; + } +} +# WHERE LO_ORDERDATE = D_DATEKEY +# (D_DATEKEY | D_YEAR) | (S_SUPPKEY | P_PARTKEY | P_BRAND | LO_PARTKEY | LO_SUPPKEY | LO_ORDERDATE | LO_REVENUE) +# Example: +# 19920325.000 1992.000 17.000 608.000 381.000 608.000 17.000 19920325.000 5702508.000 +if(hasRows){ + joined_matrix = raJoin::m_raJoin(A=date_matrix_min, colA=1, B=lo_part_supp, colB=6, method="hash2"); + if(nrow(joined_matrix[,1]) == 0){ + hasRows = 0; + } +} +#print(toString(joined_matrix[1,])) + +# -- Group-By and Aggregation (SUM)-- + +if(hasRows){ + # Group-By + d_year = joined_matrix[,2] + p_brand = joined_matrix[,5] + lo_revenue = joined_matrix[,9] + + # CALCULATING COMBINATION KEY D_YEAR, P_BRAND + + max_p_brand = max(p_brand); + max_d_year = max(d_year); + + p_brand_scale_f = ceil(max_p_brand) + 1; + d_year_scale_f = ceil(max_d_year) + 1; + + combined_key = d_year * p_brand_scale_f + p_brand; + + group_input = cbind(lo_revenue, combined_key) + + agg_result = raGrp::m_raGroupby(X=group_input, col=2, method="nested-loop"); + + # Aggregation (SUM) + key = agg_result[, 1]; + revenue = rowSums(agg_result[, 2:ncol(agg_result)]); + + # EXTRACTING D_YEAR, P_BRAND + d_year 
= round(floor(key / p_brand_scale_f)); + p_brand = round(key %% p_brand_scale_f); + result = cbind(revenue, d_year, p_brand, key); + + # -- Sorting -- (sorting numeric columns works, but string columns do not) + # ORDER BY P_BRAND ASC + result_ordered = order(target=result, by=3, decreasing=FALSE, index.return=FALSE); + + p_brand_dec = transformdecode(target=result_ordered[,3], spec=general_spec, meta=part_brand_meta); + res = cbind(as.frame(result_ordered[,1]), as.frame(result_ordered[,2]), p_brand_dec) ; + + # Print result + print("SUM(lo_revenue) | d_year | p_brand") + print(res) + + print("\nQ2.1 finished.\n"); +} +else{ + # If the result table has 0 rows, skip group-by and aggregation. + # Print result + print("SUM(lo_revenue) | d_year | p_brand") + print("The result table has 0 rows.") + + print("\nQ2.1 finished.\n"); +} \ No newline at end of file diff --git a/scripts/staging/ssb/queries/q2_2.dml b/scripts/staging/ssb/queries/q2_2.dml new file mode 100644 index 00000000000..8636ea67421 --- /dev/null +++ b/scripts/staging/ssb/queries/q2_2.dml @@ -0,0 +1,224 @@ +#------------------------------------------------------------- +# +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License.
+# +#------------------------------------------------------------- + + +/* DML-script implementing the ssb query Q2.2 in SystemDS. +**input_dir="/scripts/ssb/data" + +* Run with docker: +docker run -it --rm -v $PWD:/scripts/ apache/systemds:nightly -f /scripts/queries/q2_2.dml -nvargs input_dir="/scripts/data/" + +SELECT SUM(lo_revenue), d_year, p_brand +FROM lineorder, date, part, supplier --dates +WHERE + lo_orderdate = d_datekey + AND lo_partkey = p_partkey + AND lo_suppkey = s_suppkey + AND p_brand BETWEEN 'MFGR#2221' AND 'MFGR#2228' + AND s_region = 'ASIA' +GROUP BY d_year, p_brand +ORDER BY d_year, p_brand; + +*Please run the original SQL query (e.g., in Postgres) +to verify the correctness of the DML version. +-> First tests: Works on the dataset with scale factor 0.1. +-> Sorting on string columns does not work. + +*Based on older implementations. +https://github.com/ghafek/systemds/blob/feature/ssb-benchmark/scripts/ssb/queries/q1_1.dml +*Especially: +https://github.com/ghafek/systemds/blob/feature/ssb-benchmark/scripts/ssb/queries/q4_3.dml +Compared to the older version, the join method was changed +from sort-merge to hash2 to improve performance. + +Input parameters: +input_dir - Path to input directory containing the table files (e.g., ./data) +*/ + +# Call ra-modules with ra-functions. +source("./scripts/builtin/raSelection.dml") as raSel +source("./scripts/builtin/raJoin.dml") as raJoin +source("./scripts/builtin/raGroupby.dml") as raGrp + +# Set input parameters. +input_dir = ifdef($input_dir, "./data"); +print("Loading tables from directory: " + input_dir); + +# Read and load input CSV files from lineorder, date, part, supplier.
+lineorder_csv = read(input_dir + "/lineorder.tbl", data_type="frame", format="csv", header=FALSE, sep="|"); +date_csv = read(input_dir + "/date.tbl", data_type="frame", format="csv", header=FALSE, sep="|"); +part_csv = read(input_dir + "/part.tbl", data_type="frame", format="csv", header=FALSE, sep="|"); +supp_csv = read(input_dir + "/supplier.tbl", data_type="frame", format="csv", header=FALSE, sep="|"); + +# General variables. +general_spec = "{ \"ids\": false, \"recode\": [\"C1\"] }"; +hasRows = 1; # If hasRows = 0, the result table is empty. + +# -- Data preparation -- + +# Extract only the necessary columns from tables. +# Extracted: COL-4 | COL-5 | COL-6 | COL-13 +# => LO_PARTKEY | LO_SUPPKEY | LO_ORDERDATE | LO_REVENUE +lineorder_csv_min = cbind(lineorder_csv[, 4], lineorder_csv[, 5], lineorder_csv[, 6], lineorder_csv[, 13]); +lineorder_matrix_min = as.matrix(lineorder_csv_min); + +# Extracted: COL-1 | COL-5 +# => D_DATEKEY | D_YEAR +date_csv_min = cbind(date_csv[, 1], date_csv[, 5]); +date_matrix_min = as.matrix(date_csv_min); + +# -- Filter tables over string values.
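+ +# Note (sketch, hypothetical example): the RA builtins operate on numeric +# matrices only, so string columns are recoded to numeric IDs with +# transformencode() and mapped back with transformdecode() after the +# relational operations. Illustrative round-trip, where F is a hypothetical +# one-column string frame: +# [enc, meta] = transformencode(target=F, spec=general_spec); # strings -> IDs +# dec = transformdecode(target=enc, spec=general_spec, meta=meta); # IDs -> strings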
+ +# Prepare PART table on-the-fly encodings +# Extracted: COL-1 | COL-5 +# P_PARTKEY | P_BRAND +# (only need P_BRAND encoding, filter by P_BRAND string itself) +[part_brand_enc_f, part_brand_meta] = transformencode(target=part_csv[,5], spec=general_spec); +#print(toString(part_brand_enc_f)); + +part_filt_keys = matrix(0, rows=0, cols=1); +part_filt_brand = matrix(0, rows=0, cols=1); +part_filt = matrix(0, rows=0, cols=1); + +# Build filtered PART table (P_BRAND BETWEEN 'MFGR#2221' AND 'MFGR#2228'), keeping key and encoded brand +for (i in 1:nrow(part_csv)) { + p_elem = as.scalar(part_csv[i,5]) + if ( p_elem >= "MFGR#2221" & p_elem <= "MFGR#2228") { + key_val = as.double(as.scalar(part_csv[i,1])); + brand_code = as.double(as.scalar(part_brand_enc_f[i,1])); + part_filt_keys = rbind(part_filt_keys, matrix(key_val, rows=1, cols=1)); + part_filt_brand = rbind(part_filt_brand, matrix(brand_code, rows=1, cols=1)); + } +} +if (nrow(part_filt_keys) == 0) { + hasRows = 0; +} +else{ + part_filt = cbind(part_filt_keys, part_filt_brand); +} + +# Extracted: COL-1 | COL-6 +# S_SUPPKEY | S_REGION +supp_filt = matrix(0, rows=0, cols=1); +if(hasRows){ + # Build filtered SUPPLIER table (S_REGION = 'ASIA') + for (i in 1:nrow(supp_csv)) { + if (as.scalar(supp_csv[i,6]) == "ASIA") { + key_val = as.double(as.scalar(supp_csv[i,1])); + supp_filt = rbind(supp_filt, matrix(key_val, rows=1, cols=1)); + } + } + if (nrow(supp_filt) == 0) { + hasRows = 0; + } +} + +#print("LO,DATE,PART,SUPP") +#print(toString(lineorder_matrix_min[1,])) +#print(toString(date_matrix_min[1,])) +#print(toString(part_filt[1,])) +#print(toString(supp_filt[1,])) + +# -- JOIN TABLES WITH RA-JOIN FUNCTION -- + +# Join LINEORDER table with PART, SUPPLIER, DATE tables (star schema) +# Join order does matter! 
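+# (Note: raJoin materializes every matching row pair, so joining the small, +# pre-filtered PART and SUPPLIER tables into LINEORDER first keeps the +# intermediate results small. Hypothetical slower alternative, shown for +# illustration only and kept commented out: +# joined = raJoin::m_raJoin(A=date_matrix_min, colA=1, B=lineorder_matrix_min, colB=3, method="hash2");)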
+lo_part = matrix(0, rows=0, cols=1); +lo_part_supp = matrix(0, rows=0, cols=1); +joined_matrix = matrix(0, rows=0, cols=1); +# WHERE LO_PARTKEY = P_PARTKEY +if(hasRows){ + lo_part = raJoin::m_raJoin(A=part_filt, colA=1, B=lineorder_matrix_min, colB=1, method="hash2"); + if(nrow(lo_part[,1]) == 0){ + hasRows = 0; + } +} +# WHERE LO_SUPPKEY = S_SUPPKEY +if(hasRows){ + lo_part_supp = raJoin::m_raJoin(A=supp_filt, colA=1, B=lo_part, colB=4, method="hash2"); + if(nrow(lo_part_supp[,1]) == 0){ + hasRows = 0; + } +} +# WHERE LO_ORDERDATE = D_DATEKEY +# (D_DATEKEY | D_YEAR) | (S_SUPPKEY | P_PARTKEY | P_BRAND | LO_PARTKEY | LO_SUPPKEY | LO_ORDERDATE | LO_REVENUE) +if(hasRows){ + joined_matrix = raJoin::m_raJoin(A=date_matrix_min, colA=1, B=lo_part_supp, colB=6, method="hash2"); + if(nrow(joined_matrix[,1]) == 0){ + hasRows = 0; + } +} +#print(toString(joined_matrix[1,])) + +# -- Group-By and Aggregation (SUM)-- + +if(hasRows){ + # Group-By + d_year = joined_matrix[,2] + p_brand = joined_matrix[,5] + lo_revenue = joined_matrix[,9] + + # CALCULATING COMBINATION KEY WITH PRIORITY: 1 D_YEAR, 2 P_BRAND + + max_p_brand = max(p_brand); + max_d_year = max(d_year); + + p_brand_scale_f = ceil(max_p_brand) + 1; + d_year_scale_f = ceil(max_d_year) + 1; + + combined_key = d_year * p_brand_scale_f + p_brand; + # Example: with p_brand_scale_f=1000, (d_year=1993, p_brand=250) -> key=1993250; + # decoding via floor(1993250/1000)=1993 and 1993250%%1000=250 recovers the pair. + + group_input = cbind(lo_revenue, combined_key) + + agg_result = raGrp::m_raGroupby(X=group_input, col=2, method="nested-loop"); + + # Aggregation (SUM) + key = agg_result[, 1]; + revenue = rowSums(agg_result[, 2:ncol(agg_result)]); + + # EXTRACTING D_YEAR, P_BRAND + d_year = round(floor(key / p_brand_scale_f)); + p_brand = round(key %% p_brand_scale_f); + result = cbind(revenue, d_year, p_brand, key); + + # -- Sorting -- (sorting numeric columns works, but string columns do not)
+ # ORDER BY D_YEAR, P_BRAND ASC + result_ordered = order(target=result, by=4, decreasing=FALSE, index.return=FALSE); + + p_brand_dec = transformdecode(target=result_ordered[,3], spec=general_spec, meta=part_brand_meta); + res = cbind(as.frame(result_ordered[,1]), as.frame(result_ordered[,2]), p_brand_dec) ; + + # Print result + print("SUM(lo_revenue) | d_year | p_brand") + print(res) + + print("\nQ2.2 finished.\n"); +} +else{ + # If the result table has 0 rows, skip group-by and aggregation. + # Print result + print("SUM(lo_revenue) | d_year | p_brand") + print("The result table has 0 rows.") + + print("\nQ2.2 finished.\n"); +} \ No newline at end of file diff --git a/scripts/staging/ssb/queries/q2_3.dml b/scripts/staging/ssb/queries/q2_3.dml new file mode 100644 index 00000000000..d7bde49aadd --- /dev/null +++ b/scripts/staging/ssb/queries/q2_3.dml @@ -0,0 +1,218 @@ +#------------------------------------------------------------- +# +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. +# +#------------------------------------------------------------- + +/* DML-script implementing the ssb query Q2.3 in SystemDS. 
+**input_dir="/scripts/ssb/data" + +* Run with docker: +docker run -it --rm -v $PWD:/scripts/ apache/systemds:nightly -f /scripts/queries/q2_3.dml -nvargs input_dir="/scripts/data/" + +SELECT SUM(lo_revenue), d_year, p_brand +FROM lineorder, date, part, supplier --dates +WHERE + lo_orderdate = d_datekey + AND lo_partkey = p_partkey + AND lo_suppkey = s_suppkey + AND p_brand = 'MFGR#2239' + AND s_region = 'EUROPE' +GROUP BY d_year, p_brand +ORDER BY d_year, p_brand; + +*Please run the original SQL query (e.g., in Postgres) +to verify the correctness of the DML version. +-> First tests: Works on the dataset with scale factor 0.1. +-> Sorting on string columns does not work. + +*Based on older implementations. +https://github.com/ghafek/systemds/blob/feature/ssb-benchmark/scripts/ssb/queries/q1_1.dml +*Especially: +https://github.com/ghafek/systemds/blob/feature/ssb-benchmark/scripts/ssb/queries/q4_3.dml +Compared to the older version, the join method was changed +from sort-merge to hash2 to improve performance. + +Input parameters: +input_dir - Path to input directory containing the table files (e.g., ./data) +*/ + +# Call ra-modules with ra-functions. +source("./scripts/builtin/raSelection.dml") as raSel +source("./scripts/builtin/raJoin.dml") as raJoin +source("./scripts/builtin/raGroupby.dml") as raGrp + +# Set input parameters. +input_dir = ifdef($input_dir, "./data"); +print("Loading tables from directory: " + input_dir); + +# Read and load input CSV files from lineorder, date, part, supplier. +lineorder_csv = read(input_dir + "/lineorder.tbl", data_type="frame", format="csv", header=FALSE, sep="|"); +date_csv = read(input_dir + "/date.tbl", data_type="frame", format="csv", header=FALSE, sep="|"); +part_csv = read(input_dir + "/part.tbl", data_type="frame", format="csv", header=FALSE, sep="|"); +supp_csv = read(input_dir + "/supplier.tbl", data_type="frame", format="csv", header=FALSE, sep="|"); + +# General variables.
+general_spec = "{ \"ids\": false, \"recode\": [\"C1\"] }"; +hasRows = 1; # If hasRows = 0, the result table is empty. + +# -- Data preparation -- + +# Extract only the necessary columns from tables. +# Extracted: COL-4 | COL-5 | COL-6 | COL-13 +# => LO_PARTKEY | LO_SUPPKEY | LO_ORDERDATE | LO_REVENUE +lineorder_csv_min = cbind(lineorder_csv[, 4], lineorder_csv[, 5], lineorder_csv[, 6], lineorder_csv[, 13]); +lineorder_matrix_min = as.matrix(lineorder_csv_min); + +# Extracted: COL-1 | COL-5 +# => D_DATEKEY | D_YEAR +date_csv_min = cbind(date_csv[, 1], date_csv[, 5]); +date_matrix_min = as.matrix(date_csv_min); + +# -- Filter tables over string values. + +# Prepare PART table on-the-fly encodings +# Extracted: COL-1 | COL-5 +# P_PARTKEY | P_BRAND +# (only need P_BRAND encoding, filter by P_BRAND string itself) +[part_brand_enc_f, part_brand_meta] = transformencode(target=part_csv[,5], spec=general_spec); +#print(toString(part_brand_enc_f)); + +part_filt_keys = matrix(0, rows=0, cols=1); +part_filt_brand = matrix(0, rows=0, cols=1); +part_filt = matrix(0, rows=0, cols=1); + +# Build filtered PART table (P_BRAND = 'MFGR#2239'), keeping key and encoded brand +for (i in 1:nrow(part_csv)) { + if (as.scalar(part_csv[i,5]) == "MFGR#2239") { + key_val = as.double(as.scalar(part_csv[i,1])); + brand_code = as.double(as.scalar(part_brand_enc_f[i,1])); + part_filt_keys = rbind(part_filt_keys, matrix(key_val, rows=1, cols=1)); + part_filt_brand = rbind(part_filt_brand, matrix(brand_code, rows=1, cols=1)); + } +} +if (nrow(part_filt_keys) == 0) { + hasRows = 0; +} +else{ + part_filt = cbind(part_filt_keys, part_filt_brand); +} + +# Extracted: COL-1 | COL-6 +# S_SUPPKEY | S_REGION +supp_filt = matrix(0, rows=0, cols=1); +if(hasRows){ + # Build filtered SUPPLIER table (S_REGION = 'EUROPE') + for (i in 1:nrow(supp_csv)) { + if (as.scalar(supp_csv[i,6]) == "EUROPE") { + key_val = as.double(as.scalar(supp_csv[i,1])); + supp_filt = rbind(supp_filt, matrix(key_val, rows=1, cols=1));
+ } + } + if (nrow(supp_filt) == 0) { + hasRows = 0; + } +} +#print("LO,DATE,PART,SUPP") +#print(toString(lineorder_matrix_min[1,])) +#print(toString(date_matrix_min[1,])) +#print(toString(part_filt[1,])) +#print(toString(supp_filt[1,])) + +# -- JOIN TABLES WITH RA-JOIN FUNCTION -- + +# Join LINEORDER table with PART, SUPPLIER, DATE tables (star schema) +# Join order does matter! +lo_part = matrix(0, rows=0, cols=1); +lo_part_supp = matrix(0, rows=0, cols=1); +joined_matrix = matrix(0, rows=0, cols=1); +# LINEORDER table with DATE, PART, SUPPLIER is much slower! +# WHERE LO_PARTKEY = P_PARTKEY +if(hasRows){ + lo_part = raJoin::m_raJoin(A=part_filt, colA=1, B=lineorder_matrix_min, colB=1, method="hash2"); + if(nrow(lo_part[,1]) == 0){ + hasRows = 0; + } +} +# WHERE LO_SUPPKEY = S_SUPPKEY +if(hasRows){ + lo_part_supp = raJoin::m_raJoin(A=supp_filt, colA=1, B=lo_part, colB=4, method="hash2"); + if(nrow(lo_part_supp[,1]) == 0){ + hasRows = 0; + } +} +# WHERE LO_ORDERDATE = D_DATEKEY +# (D_DATEKEY | D_YEAR) | (S_SUPPKEY | P_PARTKEY | P_BRAND | LO_PARTKEY | LO_SUPPKEY | LO_ORDERDATE | LO_REVENUE) +if(hasRows){ + joined_matrix = raJoin::m_raJoin(A=date_matrix_min, colA=1, B=lo_part_supp, colB=6, method="hash2"); + if(nrow(joined_matrix[,1]) == 0){ + hasRows = 0; + } +} +#print(toString(joined_matrix[1,])) + +# -- Group-By and Aggregation (SUM)-- +if(hasRows){ + # Group-By + d_year = joined_matrix[,2] + p_brand = joined_matrix[,5] + lo_revenue = joined_matrix[,9] + + # CALCULATING COMBINATION KEY WITH PRIORITY: 1 D_YEAR, 2 P_BRAND + + max_p_brand = max(p_brand); + max_d_year = max(d_year); + + p_brand_scale_f = ceil(max_p_brand) + 1; + d_year_scale_f = ceil(max_d_year) + 1; + + combined_key = d_year * p_brand_scale_f + p_brand; + + group_input = cbind(lo_revenue, combined_key) + + agg_result = raGrp::m_raGroupby(X=group_input, col=2, method="nested-loop"); + + # Aggregation (SUM) + key = agg_result[, 1]; + revenue = rowSums(agg_result[, 2:ncol(agg_result)]); + + # EXTRACTING
D_YEAR, P_BRAND + d_year = round(floor(key / p_brand_scale_f)); + p_brand = round(key %% p_brand_scale_f); + result = cbind(revenue, d_year, p_brand, key); + + # -- Sorting -- (sorting numeric columns works, but string columns do not) + # ORDER BY D_YEAR, P_BRAND ASC + result_ordered = order(target=result, by=4, decreasing=FALSE, index.return=FALSE); + + p_brand_dec = transformdecode(target=result_ordered[,3], spec=general_spec, meta=part_brand_meta); + res = cbind(as.frame(result_ordered[,1]), as.frame(result_ordered[,2]), p_brand_dec) ; + + # Print result + print("SUM(lo_revenue) | d_year | p_brand"); + print(res); + + print("\nQ2.3 finished.\n"); +} +else{ + # If the result table has 0 rows, skip group-by and aggregation. + # Print result + print("SUM(lo_revenue) | d_year | p_brand") + print("The result table has 0 rows.") + + print("\nQ2.3 finished.\n"); +} \ No newline at end of file diff --git a/scripts/staging/ssb/queries/q3_1.dml b/scripts/staging/ssb/queries/q3_1.dml new file mode 100644 index 00000000000..e47e8a87b43 --- /dev/null +++ b/scripts/staging/ssb/queries/q3_1.dml @@ -0,0 +1,254 @@ +#------------------------------------------------------------- +# +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License.
+# +#------------------------------------------------------------- + +/* DML-script implementing the ssb query Q3.1 in SystemDS. +**input_dir="/scripts/ssb/data" + +* Run with docker: +docker run -it --rm -v $PWD:/scripts/ apache/systemds:nightly -f /scripts/queries/q3_1.dml -nvargs input_dir="/scripts/data/" + +SELECT + c_nation, + s_nation, + d_year, + SUM(lo_revenue) AS REVENUE +FROM customer, lineorder, supplier, date --dates +WHERE + lo_custkey = c_custkey + AND lo_suppkey = s_suppkey + AND lo_orderdate = d_datekey + AND c_region = 'ASIA' + AND s_region = 'ASIA' + AND d_year >= 1992 + AND d_year <= 1997 +GROUP BY c_nation, s_nation, d_year +ORDER BY d_year ASC, REVENUE DESC; + +*Please run the original SQL query (e.g., in Postgres) +to verify the correctness of the DML version. +-> First tests: Works on the dataset with scale factor 0.1. +-> Sorting on string columns does not work. + +*Based on older implementations. +https://github.com/ghafek/systemds/blob/feature/ssb-benchmark/scripts/ssb/queries/q1_1.dml +*Especially: +https://github.com/ghafek/systemds/blob/feature/ssb-benchmark/scripts/ssb/queries/q4_3.dml +Compared to the older version, the join method was changed +from sort-merge to hash2 to improve performance. + +Input parameters: +input_dir - Path to input directory containing the table files (e.g., ./data) +*/ + +# Call ra-modules with ra-functions. +source("./scripts/builtin/raSelection.dml") as raSel +source("./scripts/builtin/raJoin.dml") as raJoin +source("./scripts/builtin/raGroupby.dml") as raGrp + +# Set input parameters. +input_dir = ifdef($input_dir, "./data"); +print("Loading tables from directory: " + input_dir); + +# Read and load input CSV files.
+lineorder_csv = read(input_dir + "/lineorder.tbl", data_type="frame", format="csv", header=FALSE, sep="|"); +cust_csv = read(input_dir + "/customer.tbl", data_type="frame", format="csv", header=FALSE, sep="|"); +date_csv = read(input_dir + "/date.tbl", data_type="frame", format="csv", header=FALSE, sep="|"); +supp_csv = read(input_dir + "/supplier.tbl", data_type="frame", format="csv", header=FALSE, sep="|"); + +# General variables. +general_spec = "{ \"ids\": false, \"recode\": [\"C1\"] }"; +hasRows = 1; # If hasRows = 0, the result table is empty. + +# -- Data preparation -- + +# Extract only the necessary columns from tables. +# Extracted: COL-3 | COL-5 | COL-6 | COL-13 +# => LO_CUSTKEY | LO_SUPPKEY | LO_ORDERDATE | LO_REVENUE +lineorder_csv_min = cbind(lineorder_csv[, 3], lineorder_csv[, 5], lineorder_csv[, 6], lineorder_csv[, 13]); +lineorder_matrix_min = as.matrix(lineorder_csv_min); + +# Extracted: COL-1 | COL-5 +# => D_DATEKEY | D_YEAR +date_csv_min = cbind(date_csv[, 1], date_csv[, 5]); +date_matrix_min = as.matrix(date_csv_min); + +# -- Filter tables over string values. 
+ +# WHERE D_YEAR >= 1992 AND D_YEAR <= 1997 +d_filt = raSel::m_raSelection(date_matrix_min, col=2, op=">=", val=1992); +d_filt = raSel::m_raSelection(d_filt, col=2, op="<=", val=1997); +if( as.scalar(d_filt[1,1]) == 0){ + hasRows = 0; +} + +# Prepare SUPPLIER table on-the-fly encodings +# Extracted: COL-1 | COL-5 | COL-6 +# S_SUPPKEY | S_NATION | S_REGION +# (only need S_NATION encoding, filter by S_REGION string) +[supp_nat_enc_f, supp_nat_meta] = transformencode(target=supp_csv[,5], spec=general_spec); + +supp_filt_keys = matrix(0, rows=0, cols=1); +supp_filt_nat = matrix(0, rows=0, cols=1); +supp_filt = matrix(0, rows=0, cols=1); + +if(hasRows){ + # Build filtered SUPPLIER table (S_REGION == 'ASIA') + for (i in 1:nrow(supp_csv)) { + if (as.scalar(supp_csv[i,6]) == "ASIA") { + key_val = as.double(as.scalar(supp_csv[i,1])); + nat_code = as.double(as.scalar(supp_nat_enc_f[i,1])); + supp_filt_keys = rbind(supp_filt_keys, matrix(key_val, rows=1, cols=1)); + supp_filt_nat = rbind(supp_filt_nat, matrix(nat_code, rows=1, cols=1)); + } + } + if (nrow(supp_filt_keys) == 0) { + hasRows = 0; + } + else{ + supp_filt = cbind(supp_filt_keys, supp_filt_nat); + } +} + +# Prepare CUSTOMER table on-the-fly encodings +# Extracted: COL-1 | COL-5 | COL-6 +# C_CUSTKEY | C_NATION | C_REGION +# (only need C_NATION encoding, filter by C_REGION string) +[cust_nat_enc_f, cust_nat_meta] = transformencode(target=cust_csv[,5], spec=general_spec); + +cust_filt_keys = matrix(0, rows=0, cols=1); +cust_filt_nat = matrix(0, rows=0, cols=1); +cust_filt = matrix(0, rows=0, cols=1); + +if(hasRows){ + # Build filtered CUSTOMER table (C_REGION = 'ASIA') + for (i in 1:nrow(cust_csv)) { + if (as.scalar(cust_csv[i,6]) == "ASIA") { + key_val = as.double(as.scalar(cust_csv[i,1])); + nat_code = as.double(as.scalar(cust_nat_enc_f[i,1])); + cust_filt_keys = rbind(cust_filt_keys, matrix(key_val, rows=1, cols=1)); + cust_filt_nat = rbind(cust_filt_nat, matrix(nat_code, rows=1, cols=1)); + } + } + if 
(nrow(cust_filt_keys) == 0) { + hasRows = 0; + } + else{ + cust_filt = cbind(cust_filt_keys,cust_filt_nat); + } +} +#print("LO,DATE,CUST,SUPP") +#print(toString(lineorder_matrix_min[1,])) +#print(toString(date_matrix_min[1,])) +#print(toString(cust_filt[1,])) +#print(toString(supp_filt[1,])) + + +# -- JOIN TABLES WITH RA-JOIN FUNCTION -- + +# Join LINEORDER table with CUST, SUPPLIER, DATE tables (star schema) +# Join order does matter! +lo_cust = matrix(0, rows=0, cols=1); +lo_cust_supp = matrix(0, rows=0, cols=1); +joined_matrix = matrix(0, rows=0, cols=1); +# WHERE LO_CUSTKEY = C_CUSTKEY +if(hasRows){ + lo_cust = raJoin::m_raJoin(A=cust_filt, colA=1, B=lineorder_matrix_min, colB=1, method="hash2"); + if(nrow(lo_cust[,1]) == 0){ + hasRows = 0; + } +} +# WHERE LO_SUPPKEY = S_SUPPKEY +if(hasRows){ + lo_cust_supp = raJoin::m_raJoin(A=supp_filt, colA=1, B=lo_cust, colB=4, method="hash2"); + if(nrow(lo_cust_supp[,1]) == 0){ + hasRows = 0; + } +} +# WHERE LO_ORDERDATE = D_DATEKEY +# (D_DATEKEY | D_YEAR) | (S_SUPPKEY | S_NATION | C_CUSTKEY | C_NATION | +# LO_CUSTKEY | LO_SUPPKEY | LO_ORDERDATE | LO_REVENUE) +if(hasRows){ + joined_matrix = raJoin::m_raJoin(A=d_filt, colA=1, B=lo_cust_supp, colB=7, method="hash2"); + #print(toString(joined_matrix[1,])) + if(nrow(joined_matrix[,1]) == 0){ + hasRows = 0; + } +} +#print(toString(joined_matrix[1,])) + +# -- Group-By and Aggregation (SUM)-- +if(hasRows){ + # Group-By + d_year = joined_matrix[,2]; + s_nat = joined_matrix[,4]; + c_nat = joined_matrix[,6]; + revenue = joined_matrix[,10]; + + # CALCULATING COMBINATION KEY WITH PRIORITY:1 C_NATION, 2 S_NATION, D_YEAR + max_c_nat= max(c_nat); + max_s_nat= max(s_nat); + max_d_year = max(d_year); + + c_nat_scale_f = ceil(max_c_nat) + 1; + s_nat_scale_f = ceil(max_s_nat) + 1; + d_year_scale_f = ceil(max_d_year) + 1; + + combined_key = c_nat * s_nat_scale_f * d_year_scale_f + s_nat * d_year_scale_f + d_year; + + group_input = cbind(revenue, combined_key) + + agg_result = 
raGrp::m_raGroupby(X=group_input, col=2, method="nested-loop"); + #print(toString(agg_result[1,])); + + # Aggregation (SUM) + key = agg_result[, 1]; + revenue = rowSums(agg_result[, 2:ncol(agg_result)]); + + # EXTRACTING C_NATION, S_NATION, D_YEAR + c_nat = round(floor(key / (s_nat_scale_f * d_year_scale_f))); + s_nat = round(floor((key %% (s_nat_scale_f * d_year_scale_f)) / d_year_scale_f)); + d_year = round(key %% d_year_scale_f); + + result = cbind(c_nat, s_nat, d_year, revenue, key) + + # -- Sorting -- -- Sorting int columns works, but strings do not. + # ORDER BY D_YEAR ASC, REVENUE DESC + result_ordered = order(target=result, by=4, decreasing=TRUE, index.return=FALSE); + result_ordered = order(target=result_ordered, by=3, decreasing=FALSE, index.return=FALSE); + + c_nat_dec = transformdecode(target=result_ordered[,1], spec=general_spec, meta=cust_nat_meta); + s_nat_dec = transformdecode(target=result_ordered[,2], spec=general_spec, meta=supp_nat_meta); + + res = cbind(c_nat_dec, s_nat_dec, as.frame(result_ordered[,3]), as.frame(result_ordered[,4])) ; + + # Print result + print("c_nation | s_nation | d_year | REVENUE") + print(res) + + print("\nQ3.1 finished.\n"); +} +else{ + # If the result table has 0 rows, skip group-by and aggregation. + # Print result + print("c_nation | s_nation | d_year | REVENUE") + print("The result table has 0 rows.") + print("\nQ3.1 finished.\n"); +} \ No newline at end of file diff --git a/scripts/staging/ssb/queries/q3_2.dml b/scripts/staging/ssb/queries/q3_2.dml new file mode 100644 index 00000000000..f05c8441846 --- /dev/null +++ b/scripts/staging/ssb/queries/q3_2.dml @@ -0,0 +1,256 @@ +#------------------------------------------------------------- +# +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. 
The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+#-------------------------------------------------------------
+
+
+/* DML-script implementing the SSB query Q3.2 in SystemDS.
+**input_dir="/scripts/ssb/data"
+
+* Run with docker:
+docker run -it --rm -v $PWD:/scripts/ apache/systemds:nightly -f /scripts/queries/q3_2.dml -nvargs input_dir="/scripts/data/"
+
+SELECT
+ c_city,
+ s_city,
+ d_year,
+ SUM(lo_revenue) AS REVENUE
+FROM customer, lineorder, supplier, date -- dates
+WHERE
+ lo_custkey = c_custkey
+ AND lo_suppkey = s_suppkey
+ AND lo_orderdate = d_datekey
+ AND c_nation = 'UNITED STATES'
+ AND s_nation = 'UNITED STATES'
+ AND d_year >= 1992
+ AND d_year <= 1997
+GROUP BY c_city, s_city, d_year
+ORDER BY d_year ASC, REVENUE DESC;
+
+*Please run the original SQL query (e.g. in PostgreSQL)
+to verify the correctness of the DML version.
+-> First tests: works on the dataset with scale factor 0.1.
+-> Sorting does not work.
+
+*Based on older implementations:
+https://github.com/ghafek/systemds/blob/feature/ssb-benchmark/scripts/ssb/queries/q1_1.dml
+*Especially:
+https://github.com/ghafek/systemds/blob/feature/ssb-benchmark/scripts/ssb/queries/q4_3.dml
+Compared to the older version, the join method was changed
+from sort-merge to hash2 to improve performance.
+
+Input parameters:
+input_dir - Path to input directory containing the table files (e.g., ./data)
+*/
+
+# Call ra-modules with ra-functions.
+source("./scripts/builtin/raSelection.dml") as raSel +source("./scripts/builtin/raJoin.dml") as raJoin +source("./scripts/builtin/raGroupby.dml") as raGrp + +# Set input parameters. +input_dir = ifdef($input_dir, "./data"); +print("Loading tables from directory: " + input_dir); + +# Read and load input CSV files. +lineorder_csv = read(input_dir + "/lineorder.tbl", data_type="frame", format="csv", header=FALSE, sep="|"); +cust_csv = read(input_dir + "/customer.tbl", data_type="frame", format="csv", header=FALSE, sep="|"); +date_csv = read(input_dir + "/date.tbl", data_type="frame", format="csv", header=FALSE, sep="|"); +supp_csv = read(input_dir + "/supplier.tbl", data_type="frame", format="csv", header=FALSE, sep="|"); + +# General variables. +general_spec = "{ \"ids\": false, \"recode\": [\"C1\"] }"; +hasRows = 1; # If hasRows = 0, the result table is empty. + +# -- Data preparation -- + +# Extract only the necessary columns from tables. +# Extracted: COL-3 | COL-5 | COL-6 | COL-13 +# => LO_CUSTKEY | LO_SUPPKEY | LO_ORDERDATE | LO_REVENUE +lineorder_csv_min = cbind(lineorder_csv[, 3], lineorder_csv[, 5], lineorder_csv[, 6], lineorder_csv[, 13]); +lineorder_matrix_min = as.matrix(lineorder_csv_min); + +# Extracted: COL-1 | COL-5 +# => D_DATEKEY | D_YEAR +date_csv_min = cbind(date_csv[, 1], date_csv[, 5]); +date_matrix_min = as.matrix(date_csv_min); + +# -- Filter tables over string values. 
+
+# WHERE D_YEAR >= 1992 AND D_YEAR <= 1997
+d_filt = raSel::m_raSelection(date_matrix_min, col=2, op=">=", val=1992);
+d_filt = raSel::m_raSelection(d_filt, col=2, op="<=", val=1997);
+if( as.scalar(d_filt[1,1]) == 0){
+ hasRows = 0;
+}
+
+# Prepare SUPPLIER table on-the-fly encodings
+# Extracted: COL-1 | COL-4 | COL-5
+# S_SUPPKEY | S_CITY | S_NATION
+# (only need S_CITY encoding, filter by S_NATION string)
+[supp_city_enc_f, supp_city_meta] = transformencode(target=supp_csv[,4], spec=general_spec);
+
+supp_filt_keys = matrix(0, rows=0, cols=1);
+supp_filt_city = matrix(0, rows=0, cols=1);
+supp_filt = matrix(0, rows=0, cols=1);
+
+if(hasRows){
+ # Build filtered SUPPLIER table (S_NATION = 'UNITED STATES')
+ for (i in 1:nrow(supp_csv)) {
+ if (as.scalar(supp_csv[i,5]) == "UNITED STATES") {
+ key_val = as.double(as.scalar(supp_csv[i,1]));
+ city_code = as.double(as.scalar(supp_city_enc_f[i,1]));
+ supp_filt_keys = rbind(supp_filt_keys, matrix(key_val, rows=1, cols=1));
+ supp_filt_city = rbind(supp_filt_city, matrix(city_code, rows=1, cols=1));
+ }
+ }
+ if (nrow(supp_filt_keys) == 0) {
+ hasRows = 0;
+ }
+ else{
+ supp_filt = cbind(supp_filt_keys, supp_filt_city);
+ }
+}
+
+# Prepare CUSTOMER table on-the-fly encodings
+# Extracted: COL-1 | COL-4 | COL-5
+# C_CUSTKEY | C_CITY | C_NATION
+# (only need C_CITY encoding, filter by C_NATION string)
+[cust_city_enc_f, cust_city_meta] = transformencode(target=cust_csv[,4], spec=general_spec);
+
+cust_filt_keys = matrix(0, rows=0, cols=1);
+cust_filt_city = matrix(0, rows=0, cols=1);
+cust_filt = matrix(0, rows=0, cols=1);
+
+if(hasRows){
+ # Build filtered CUSTOMER table (C_NATION = 'UNITED STATES')
+ for (i in 1:nrow(cust_csv)) {
+ if (as.scalar(cust_csv[i,5]) == "UNITED STATES") {
+ key_val = as.double(as.scalar(cust_csv[i,1]));
+ city_code = as.double(as.scalar(cust_city_enc_f[i,1]));
+ cust_filt_keys = rbind(cust_filt_keys, matrix(key_val, rows=1, cols=1));
+ cust_filt_city = rbind(cust_filt_city,
matrix(city_code, rows=1, cols=1)); + } + } + if (nrow(cust_filt_keys) == 0) { + hasRows = 0; + } + else{ + cust_filt = cbind(cust_filt_keys,cust_filt_city); + } +} + +#print("LO,DATE,CUST,SUPP") +#print(toString(lineorder_matrix_min[1,])) +#print(toString(d_filt[1,])) +#print(toString(cust_filt[1,])) +#print(toString(supp_filt[1,])) + +# -- JOIN TABLES WITH RA-JOIN FUNCTION -- + +# Join LINEORDER table with CUST, SUPPLIER, DATE tables (star schema) +# Join order does matter! +lo_cust = matrix(0, rows=0, cols=1); +lo_cust_supp = matrix(0, rows=0, cols=1); +joined_matrix = matrix(0, rows=0, cols=1); +# WHERE LO_CUSTKEY = C_CUSTKEY +if(hasRows){ + lo_cust = raJoin::m_raJoin(A=cust_filt, colA=1, B=lineorder_matrix_min, colB=1, method="hash2"); + if(nrow(lo_cust[,1]) == 0){ + hasRows = 0; + } +} +# WHERE LO_SUPPKEY = S_SUPPKEY +if(hasRows){ + lo_cust_supp = raJoin::m_raJoin(A=supp_filt, colA=1, B=lo_cust, colB=4, method="hash2"); + if(nrow(lo_cust_supp[,1]) == 0){ + hasRows = 0; + } +} +# WHERE LO_ORDERDATE = D_DATEKEY +# (D_DATEKEY | D_YEAR) | (S_SUPPKEY | S_CITY | C_CUSTKEY | C_CITY | +# LO_CUSTKEY | LO_SUPPKEY | LO_ORDERDATE | LO_REVENUE) +if(hasRows){ + joined_matrix = raJoin::m_raJoin(A=d_filt, colA=1, B=lo_cust_supp, colB=7, method="hash2"); + #print(toString(joined_matrix[1,])) + if(nrow(joined_matrix[,1]) == 0){ + hasRows = 0; + } +} +#print(toString(joined_matrix[1,])) + +# -- Group-By and Aggregation (SUM)-- + +if(hasRows){ + # Group-By + d_year = joined_matrix[,2]; + s_city = joined_matrix[,4]; + c_city = joined_matrix[,6]; + revenue = joined_matrix[,10]; + + # CALCULATING COMBINATION KEY WITH PRIORITY:1 C_CITY, 2 S_CITY, D_YEAR + max_c_city= max(c_city); + max_s_city= max(s_city); + max_d_year = max(d_year); + + c_city_scale_f = ceil(max_c_city) + 1; + s_city_scale_f = ceil(max_s_city) + 1; + d_year_scale_f = ceil(max_d_year) + 1; + + combined_key = c_city * s_city_scale_f * d_year_scale_f + s_city * d_year_scale_f + d_year; + + group_input = cbind(revenue, 
combined_key) + + agg_result = raGrp::m_raGroupby(X=group_input, col=2, method="nested-loop"); + #print(toString(agg_result[1,])); + + # Aggregation (SUM) + key = agg_result[, 1]; + revenue = rowSums(agg_result[, 2:ncol(agg_result)]); + + # EXTRACTING C_CITY, S_CITY, D_YEAR + c_city = round(floor(key / (s_city_scale_f * d_year_scale_f))); + s_city = round(floor((key %% (s_city_scale_f * d_year_scale_f)) / d_year_scale_f)); + d_year = round(key %% d_year_scale_f); + + result = cbind(c_city, s_city, d_year, revenue, key) + + # -- Sorting -- -- Sorting int columns works, but strings do not. + # ORDER BY D_YEAR ASC, REVENUE DESC + result_ordered = order(target=result, by=4, decreasing=TRUE, index.return=FALSE); + result_ordered = order(target=result_ordered, by=3, decreasing=FALSE, index.return=FALSE); + + c_city_dec = transformdecode(target=result_ordered[,1], spec=general_spec, meta=cust_city_meta); + s_city_dec = transformdecode(target=result_ordered[,2], spec=general_spec, meta=supp_city_meta); + + res = cbind(c_city_dec, s_city_dec, as.frame(result_ordered[,3]), as.frame(result_ordered[,4])) ; + + # Print result + print("c_city | s_city | d_year | REVENUE") + print(res) + + print("\nQ3.2 finished.\n"); +} +else{ + # If the result table has 0 rows, skip group-by and aggregation. + # Print result + print("c_city | s_city | d_year | REVENUE") + print("The result table has 0 rows.") + print("\nQ3.2 finished.\n"); +} \ No newline at end of file diff --git a/scripts/staging/ssb/queries/q3_3.dml b/scripts/staging/ssb/queries/q3_3.dml new file mode 100644 index 00000000000..87c59233e73 --- /dev/null +++ b/scripts/staging/ssb/queries/q3_3.dml @@ -0,0 +1,262 @@ +#------------------------------------------------------------- +# +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. 
The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+#-------------------------------------------------------------
+
+
+/* DML-script implementing the SSB query Q3.3 in SystemDS.
+**input_dir="/scripts/ssb/data"
+
+* Run with docker:
+docker run -it --rm -v $PWD:/scripts/ apache/systemds:nightly -f /scripts/queries/q3_3.dml -nvargs input_dir="/scripts/data/"
+
+SELECT
+ c_city,
+ s_city,
+ d_year,
+ SUM(lo_revenue) AS REVENUE
+FROM customer, lineorder, supplier, date --dates
+WHERE
+ lo_custkey = c_custkey
+ AND lo_suppkey = s_suppkey
+ AND lo_orderdate = d_datekey
+ AND (
+ c_city = 'UNITED KI1'
+ OR c_city = 'UNITED KI5'
+ )
+ AND (
+ s_city = 'UNITED KI1'
+ OR s_city = 'UNITED KI5'
+ )
+ AND d_year >= 1992
+ AND d_year <= 1997
+GROUP BY c_city, s_city, d_year
+ORDER BY d_year ASC, REVENUE DESC;
+
+*Please run the original SQL query (e.g. in PostgreSQL)
+to verify the correctness of the DML version.
+-> First tests: works on the dataset with scale factor 0.1.
+-> Sorting does not work.
+
+*Based on older implementations:
+https://github.com/ghafek/systemds/blob/feature/ssb-benchmark/scripts/ssb/queries/q1_1.dml
+*Especially:
+https://github.com/ghafek/systemds/blob/feature/ssb-benchmark/scripts/ssb/queries/q4_3.dml
+Compared to the older version, the join method was changed
+from sort-merge to hash2 to improve performance.
+ +Input parameters: +input_dir - Path to input directory containing the table files (e.g., ./data) +*/ + +# Call ra-modules with ra-functions. +source("./scripts/builtin/raSelection.dml") as raSel +source("./scripts/builtin/raJoin.dml") as raJoin +source("./scripts/builtin/raGroupby.dml") as raGrp + +# Set input parameters. +input_dir = ifdef($input_dir, "./data"); +print("Loading tables from directory: " + input_dir); + +# Read and load input CSV files. +lineorder_csv = read(input_dir + "/lineorder.tbl", data_type="frame", format="csv", header=FALSE, sep="|"); +cust_csv = read(input_dir + "/customer.tbl", data_type="frame", format="csv", header=FALSE, sep="|"); +date_csv = read(input_dir + "/date.tbl", data_type="frame", format="csv", header=FALSE, sep="|"); +supp_csv = read(input_dir + "/supplier.tbl", data_type="frame", format="csv", header=FALSE, sep="|"); + +# General variables. +general_spec = "{ \"ids\": false, \"recode\": [\"C1\"] }"; +hasRows = 1; # If hasRows = 0, the result table is empty. + +# -- Data preparation -- + +# Extract only the necessary columns from tables. +# Extracted: COL-3 | COL-5 | COL-6 | COL-13 +# => LO_CUSTKEY | LO_SUPPKEY | LO_ORDERDATE | LO_REVENUE +lineorder_csv_min = cbind(lineorder_csv[, 3], lineorder_csv[, 5], lineorder_csv[, 6], lineorder_csv[, 13]); +lineorder_matrix_min = as.matrix(lineorder_csv_min); + +# Extracted: COL-1 | COL-5 +# => D_DATEKEY | D_YEAR +date_csv_min = cbind(date_csv[, 1], date_csv[, 5]); +date_matrix_min = as.matrix(date_csv_min); + +# -- Filter tables over string values. 
+ +# WHERE D_YEAR >= 1992 AND D_YEAR <= 1997 +d_filt = raSel::m_raSelection(date_matrix_min, col=2, op=">=", val=1992); +d_filt = raSel::m_raSelection(d_filt, col=2, op="<=", val=1997); +if( as.scalar(d_filt[1,1]) == 0){ + hasRows = 0; +} + +# Prepare SUPPLIER table on-the-fly encodings +# Extracted: COL-1 | COL-4 +# S_SUPPKEY | S_CITY +# (only need S_CITY encoding, filter by S_CITY string itself) +[supp_city_enc_f, supp_city_meta] = transformencode(target=supp_csv[,4], spec=general_spec); + +supp_filt_keys = matrix(0, rows=0, cols=1); +supp_filt_city = matrix(0, rows=0, cols=1); +supp_filt = matrix(0, rows=0, cols=1); + +if(hasRows){ + # Build filtered SUPPLIER table (S_CITY = 'UNITED KI1' OR S_CITY = 'UNITED KI5') + for (i in 1:nrow(supp_csv)) { + s_elem = as.scalar(supp_csv[i,4]) + if (s_elem == "UNITED KI1" | s_elem == "UNITED KI5") { + key_val = as.double(as.scalar(supp_csv[i,1])); + city_code = as.double(as.scalar(supp_city_enc_f[i,1])); + supp_filt_keys = rbind(supp_filt_keys, matrix(key_val, rows=1, cols=1)); + supp_filt_city = rbind(supp_filt_city, matrix(city_code, rows=1, cols=1)); + } + } + if (nrow(supp_filt_keys) == 0) { + hasRows = 0; + } + else{ + supp_filt = cbind(supp_filt_keys, supp_filt_city); + } +} + +# Prepare CUSTOMER table on-the-fly encodings +# Extracted: COL-1 | COL-4 +# C_CUSTKEY | C_CITY +# (only need C_CITY encoding, filter by C_CITY string itself) +[cust_city_enc_f, cust_city_meta] = transformencode(target=cust_csv[,4], spec=general_spec); + +cust_filt_keys = matrix(0, rows=0, cols=1); +cust_filt_city = matrix(0, rows=0, cols=1); +cust_filt = matrix(0, rows=0, cols=1); + +if(hasRows){ + # Build filtered CUSTOMER table (C_CITY = 'UNITED KI1' OR C_CITY = 'UNITED KI5') + for (i in 1:nrow(cust_csv)) { + c_elem = as.scalar(cust_csv[i,4]) + if (c_elem == "UNITED KI1" | c_elem == "UNITED KI5") { + key_val = as.double(as.scalar(cust_csv[i,1])); + city_code = as.double(as.scalar(cust_city_enc_f[i,1])); + cust_filt_keys = rbind(cust_filt_keys, 
matrix(key_val, rows=1, cols=1)); + cust_filt_city = rbind(cust_filt_city, matrix(city_code, rows=1, cols=1)); + } + } + if (nrow(cust_filt_keys) == 0) { + hasRows = 0; + } + else{ + cust_filt = cbind(cust_filt_keys,cust_filt_city); + } +} + +#print("LO,DATE,CUST,SUPP") +#print(toString(lineorder_matrix_min[1,])) +#print(toString(d_filt[1,])) +#print(toString(cust_filt[1,])) +#print(toString(supp_filt[1,])) + +# -- JOIN TABLES WITH RA-JOIN FUNCTION -- +# Join order does matter! +lo_cust = matrix(0, rows=0, cols=1); +lo_cust_supp = matrix(0, rows=0, cols=1); +joined_matrix = matrix(0, rows=0, cols=1); +# Join LINEORDER table with CUST, SUPPLIER, DATE tables (star schema) +# WHERE LO_CUSTKEY = C_CUSTKEY +if(hasRows){ + lo_cust = raJoin::m_raJoin(A=cust_filt, colA=1, B=lineorder_matrix_min, colB=1, method="hash2"); + if(nrow(lo_cust[,1]) == 0){ + hasRows = 0; + } +} +# WHERE LO_SUPPKEY = S_SUPPKEY +if(hasRows){ + lo_cust_supp = raJoin::m_raJoin(A=supp_filt, colA=1, B=lo_cust, colB=4, method="hash2"); + if(nrow(lo_cust_supp[,1]) == 0){ + hasRows = 0; + } +} +# WHERE LO_ORDERDATE = D_DATEKEY +# (D_DATEKEY | D_YEAR) | (S_SUPPKEY | S_CITY | C_CUSTKEY | C_CITY | +# LO_CUSTKEY | LO_SUPPKEY | LO_ORDERDATE | LO_REVENUE) +if(hasRows){ + joined_matrix = raJoin::m_raJoin(A=d_filt, colA=1, B=lo_cust_supp, colB=7, method="hash2"); + #print(toString(joined_matrix[1,])) + if(nrow(joined_matrix[,1]) == 0){ + hasRows = 0; + } +} + +# -- Group-By and Aggregation (SUM)-- + +if(hasRows){ + # Group-By + d_year = joined_matrix[,2]; + s_city = joined_matrix[,4]; + c_city = joined_matrix[,6]; + revenue = joined_matrix[,10]; + + # CALCULATING COMBINATION KEY WITH PRIORITY:1 C_CITY, 2 S_CITY, D_YEAR + max_c_city= max(c_city); + max_s_city= max(s_city); + max_d_year = max(d_year); + + c_city_scale_f = ceil(max_c_city) + 1; + s_city_scale_f = ceil(max_s_city) + 1; + d_year_scale_f = ceil(max_d_year) + 1; + + combined_key = c_city * s_city_scale_f * d_year_scale_f + s_city * d_year_scale_f + 
d_year; + + group_input = cbind(revenue, combined_key) + + agg_result = raGrp::m_raGroupby(X=group_input, col=2, method="nested-loop"); + #print(toString(agg_result[1,])); + + # Aggregation (SUM) + key = agg_result[, 1]; + revenue = rowSums(agg_result[, 2:ncol(agg_result)]); + + # EXTRACTING C_CITY, S_CITY, D_YEAR + c_city = round(floor(key / (s_city_scale_f * d_year_scale_f))); + s_city = round(floor((key %% (s_city_scale_f * d_year_scale_f)) / d_year_scale_f)); + d_year = round(key %% d_year_scale_f); + + result = cbind(c_city, s_city, d_year, revenue, key) + + # -- Sorting -- -- Sorting int columns works, but strings do not. + # ORDER BY D_YEAR ASC, REVENUE DESC + result_ordered = order(target=result, by=4, decreasing=TRUE, index.return=FALSE); + result_ordered = order(target=result_ordered, by=3, decreasing=FALSE, index.return=FALSE); + + c_city_dec = transformdecode(target=result_ordered[,1], spec=general_spec, meta=cust_city_meta); + s_city_dec = transformdecode(target=result_ordered[,2], spec=general_spec, meta=supp_city_meta); + + res = cbind(c_city_dec, s_city_dec, as.frame(result_ordered[,3]), as.frame(result_ordered[,4])) ; + + # Print result + print("c_city | s_city | d_year | REVENUE") + print(res) + + print("\nQ3.3 finished.\n"); +} +else{ + # If the result table has 0 rows, skip group-by and aggregation. + # Print result + print("c_city | s_city | d_year | REVENUE") + print("The result table has 0 rows.") + print("\nQ3.3 finished.\n"); +} \ No newline at end of file diff --git a/scripts/staging/ssb/queries/q3_4.dml b/scripts/staging/ssb/queries/q3_4.dml new file mode 100644 index 00000000000..278fb2d8c82 --- /dev/null +++ b/scripts/staging/ssb/queries/q3_4.dml @@ -0,0 +1,275 @@ +#------------------------------------------------------------- +# +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. 
See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+#-------------------------------------------------------------
+
+/* DML-script implementing the SSB query Q3.4 in SystemDS.
+**input_dir="/scripts/ssb/data"
+
+## TODO
+Check on empty tables with nrow(), otherwise out-of-bounds accesses occur.
+Especially for q3_2, q3_3, q3_4.
+* Run with docker:
+docker run -it --rm -v $PWD:/scripts/ apache/systemds:nightly -f /scripts/queries/q3_4.dml -nvargs input_dir="/scripts/data/"
+
+SELECT
+ c_city,
+ s_city,
+ d_year,
+ SUM(lo_revenue) AS REVENUE
+FROM customer, lineorder, supplier, date --dates
+WHERE
+ lo_custkey = c_custkey
+ AND lo_suppkey = s_suppkey
+ AND lo_orderdate = d_datekey
+ AND (
+ c_city = 'UNITED KI1'
+ OR c_city = 'UNITED KI5'
+ )
+ AND (
+ s_city = 'UNITED KI1'
+ OR s_city = 'UNITED KI5'
+ )
+ AND d_yearmonth = 'Dec1997'
+GROUP BY c_city, s_city, d_year
+ORDER BY d_year ASC, REVENUE DESC;
+
+*Please run the original SQL query (e.g. in PostgreSQL)
+to verify the correctness of the DML version.
+-> First tests: works on the dataset with scale factor 0.1.
+-> Sorting does not work.
+
+*Based on older implementations:
+https://github.com/ghafek/systemds/blob/feature/ssb-benchmark/scripts/ssb/queries/q1_1.dml +*Especially: +https://github.com/ghafek/systemds/blob/feature/ssb-benchmark/scripts/ssb/queries/q4_3.dml +In comparison to older version the join method was changed +from sort-merge to hash2 to improve the performance. + +Input parameters: +input_dir - Path to input directory containing the table files (e.g., ./data) +*/ + +# Call ra-modules with ra-functions. +source("./scripts/builtin/raSelection.dml") as raSel +source("./scripts/builtin/raJoin.dml") as raJoin +source("./scripts/builtin/raGroupby.dml") as raGrp + +# Set input parameters. +input_dir = ifdef($input_dir, "./data"); +print("Loading tables from directory: " + input_dir); + +# Read and load input CSV files. +lineorder_csv = read(input_dir + "/lineorder.tbl", data_type="frame", format="csv", header=FALSE, sep="|"); +cust_csv = read(input_dir + "/customer.tbl", data_type="frame", format="csv", header=FALSE, sep="|"); +date_csv = read(input_dir + "/date.tbl", data_type="frame", format="csv", header=FALSE, sep="|"); +supp_csv = read(input_dir + "/supplier.tbl", data_type="frame", format="csv", header=FALSE, sep="|"); + +# General variables. +general_spec = "{ \"ids\": false, \"recode\": [\"C1\"] }"; +hasRows = 1; # If hasRows = 0, the result table is empty. + +# -- Data preparation -- + +# Extract only the necessary columns from tables. +# Extracted: COL-3 | COL-5 | COL-6 | COL-13 +# => LO_CUSTKEY | LO_SUPPKEY | LO_ORDERDATE | LO_REVENUE +lineorder_csv_min = cbind(lineorder_csv[, 3], lineorder_csv[, 5], lineorder_csv[, 6], lineorder_csv[, 13]); +lineorder_matrix_min = as.matrix(lineorder_csv_min); + +# -- Filter tables over string values. 
+ +# Extracted: COL-1 | COL-5 | COL-7 +# D_DATEKEY | D_YEAR | D_YEARMONTH +# (only need D_DATEKEY & D_YEAR, filter by D_YEARMONTH string) +# Build filtered DATE table (D_YEARMONTH = 'Dec1997') +d_filt_keys = matrix(0, rows=0, cols=1); +d_filt_year = matrix(0, rows=0, cols=1); +d_filt = matrix(0, rows=0, cols=1); + +for (i in 1:nrow(date_csv)) { + if (as.scalar(date_csv[i,7]) == "Dec1997") { + key_val = as.double(as.scalar(date_csv[i,1])); + year_val = as.double(as.scalar(date_csv[i,5])); + d_filt_keys = rbind(d_filt_keys, matrix(key_val, rows=1, cols=1)); + d_filt_year = rbind(d_filt_year, matrix(year_val, rows=1, cols=1)); + } + } +if (nrow(d_filt_keys) == 0) { + hasRows = 0; +} +else{ + d_filt = cbind(d_filt_keys, d_filt_year); +} + + +# Prepare SUPPLIER table on-the-fly encodings +# Extracted: COL-1 | COL-4 +# S_SUPPKEY | S_CITY +# (only need S_CITY encoding, filter by S_CITY string itself) +[supp_city_enc_f, supp_city_meta] = transformencode(target=supp_csv[,4], spec=general_spec); + +supp_filt_keys = matrix(0, rows=0, cols=1); +supp_filt_city = matrix(0, rows=0, cols=1); +supp_filt = matrix(0, rows=0, cols=1); + +if(hasRows){ + # Build filtered SUPPLIER table (S_CITY = 'UNITED KI1' OR S_CITY = 'UNITED KI5') + for (i in 1:nrow(supp_csv)) { + s_elem = as.scalar(supp_csv[i,4]) + if (s_elem == "UNITED KI1" | s_elem == "UNITED KI5") { + key_val = as.double(as.scalar(supp_csv[i,1])); + city_code = as.double(as.scalar(supp_city_enc_f[i,1])); + supp_filt_keys = rbind(supp_filt_keys, matrix(key_val, rows=1, cols=1)); + supp_filt_city = rbind(supp_filt_city, matrix(city_code, rows=1, cols=1)); + } + } + if (nrow(supp_filt_keys) == 0) { + hasRows = 0; + } + else{ + supp_filt = cbind(supp_filt_keys, supp_filt_city); + } +} + +# Prepare CUSTOMER table on-the-fly encodings +# Extracted: COL-1 | COL-4 +# C_CUSTKEY | C_CITY +# (only need C_CITY encoding, filter by C_CITY string itself) +[cust_city_enc_f, cust_city_meta] = transformencode(target=cust_csv[,4], 
spec=general_spec); + +cust_filt_keys = matrix(0, rows=0, cols=1); +cust_filt_city = matrix(0, rows=0, cols=1); +cust_filt = matrix(0, rows=0, cols=1); + +if(hasRows){ + # Build filtered CUSTOMER table (C_CITY = 'UNITED KI1' OR C_CITY = 'UNITED KI5') + for (i in 1:nrow(cust_csv)) { + c_elem = as.scalar(cust_csv[i,4]) + if (c_elem == "UNITED KI1" | c_elem == "UNITED KI5") { + key_val = as.double(as.scalar(cust_csv[i,1])); + city_code = as.double(as.scalar(cust_city_enc_f[i,1])); + cust_filt_keys = rbind(cust_filt_keys, matrix(key_val, rows=1, cols=1)); + cust_filt_city = rbind(cust_filt_city, matrix(city_code, rows=1, cols=1)); + } + } + if (nrow(cust_filt_keys) == 0) { + hasRows = 0; + } + else{ + cust_filt = cbind(cust_filt_keys,cust_filt_city); + } +} + +#print("LO,DATE,CUST,SUPP") +#print(toString(lineorder_matrix_min[1,])) +#print(toString(d_filt[1,])) +#print(toString(cust_filt[1,])) +#print(toString(supp_filt[1,])) + +# -- JOIN TABLES WITH RA-JOIN FUNCTION -- +# Join LINEORDER table with CUST, SUPPLIER, DATE tables (star schema) +# Join order does matter! 
+lo_cust = matrix(0, rows=0, cols=1); +lo_cust_supp = matrix(0, rows=0, cols=1); +joined_matrix = matrix(0, rows=0, cols=1); +# WHERE LO_CUSTKEY = C_CUSTKEY +if(hasRows){ + lo_cust = raJoin::m_raJoin(A=cust_filt, colA=1, B=lineorder_matrix_min, colB=1, method="hash2"); + if(nrow(lo_cust[,1]) == 0){ + hasRows = 0; + } +} +# WHERE LO_SUPPKEY = S_SUPPKEY +if(hasRows){ + lo_cust_supp = raJoin::m_raJoin(A=supp_filt, colA=1, B=lo_cust, colB=4, method="hash2"); + if(nrow(lo_cust_supp[,1]) == 0){ + hasRows = 0; + } +} + +# WHERE LO_ORDERDATE = D_DATEKEY +# (D_DATEKEY | D_YEAR) | (S_SUPPKEY | S_CITY | C_CUSTKEY | C_CITY | +# LO_CUSTKEY | LO_SUPPKEY | LO_ORDERDATE | LO_REVENUE) +if(hasRows){ + joined_matrix = raJoin::m_raJoin(A=d_filt, colA=1, B=lo_cust_supp, colB=7, method="hash2"); + if(nrow(joined_matrix[,1]) == 0){ + hasRows = 0; + } +} + +# -- Group-By and Aggregation (SUM)-- + +if(hasRows){ + # Group-By + d_year = joined_matrix[,2]; + s_city = joined_matrix[,4]; + c_city = joined_matrix[,6]; + revenue = joined_matrix[,10]; + + # CALCULATING COMBINATION KEY WITH PRIORITY:1 C_CITY, 2 S_CITY, D_YEAR + max_c_city= max(c_city); + max_s_city= max(s_city); + max_d_year = max(d_year); + + c_city_scale_f = ceil(max_c_city) + 1; + s_city_scale_f = ceil(max_s_city) + 1; + d_year_scale_f = ceil(max_d_year) + 1; + + combined_key = c_city * s_city_scale_f * d_year_scale_f + s_city * d_year_scale_f + d_year; + + group_input = cbind(revenue, combined_key) + + agg_result = raGrp::m_raGroupby(X=group_input, col=2, method="nested-loop"); + #print(toString(agg_result[1,])); + + # Aggregation (SUM) + key = agg_result[, 1]; + revenue = rowSums(agg_result[, 2:ncol(agg_result)]); + + # EXTRACTING C_CITY, S_CITY, D_YEAR + c_city = round(floor(key / (s_city_scale_f * d_year_scale_f))); + s_city = round(floor((key %% (s_city_scale_f * d_year_scale_f)) / d_year_scale_f)); + d_year = round(key %% d_year_scale_f); + + result = cbind(c_city, s_city, d_year, revenue, key) + + # -- Sorting -- -- 
Sorting int columns works, but strings do not. + # ORDER BY D_YEAR ASC, REVENUE DESC + result_ordered = order(target=result, by=4, decreasing=TRUE, index.return=FALSE); + result_ordered = order(target=result_ordered, by=3, decreasing=FALSE, index.return=FALSE); + + c_city_dec = transformdecode(target=result_ordered[,1], spec=general_spec, meta=cust_city_meta); + s_city_dec = transformdecode(target=result_ordered[,2], spec=general_spec, meta=supp_city_meta); + + res = cbind(c_city_dec, s_city_dec, as.frame(result_ordered[,3]), as.frame(result_ordered[,4])) ; + + # Print result + print("c_city | s_city | d_year | REVENUE") + print(res) + + print("\nQ3.4 finished.\n"); +} +else{ + # If the result table has 0 rows, skip group-by and aggregation. + # Print result + print("c_city | s_city | d_year | REVENUE") + print("The result table has 0 rows.") + print("\nQ3.4 finished.\n"); +} \ No newline at end of file diff --git a/scripts/staging/ssb/queries/q4_1.dml b/scripts/staging/ssb/queries/q4_1.dml new file mode 100644 index 00000000000..b3787925c35 --- /dev/null +++ b/scripts/staging/ssb/queries/q4_1.dml @@ -0,0 +1,261 @@ +#------------------------------------------------------------- +# +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. 
+#
+#-------------------------------------------------------------
+
+/* DML-script implementing the ssb query Q4.1 in SystemDS.
+**input_dir="/scripts/ssb/data"
+
+* Run with docker:
+docker run -it --rm -v $PWD:/scripts/ apache/systemds:nightly -f /scripts/queries/q4_1.dml -nvargs input_dir="/scripts/data/"
+
+SELECT
+    d_year,
+    c_nation,
+    SUM(lo_revenue - lo_supplycost) AS PROFIT
+FROM date, customer, supplier, part, lineorder -- dates
+WHERE
+    lo_custkey = c_custkey
+    AND lo_suppkey = s_suppkey
+    AND lo_partkey = p_partkey
+    AND lo_orderdate = d_datekey
+    AND c_region = 'AMERICA'
+    AND s_region = 'AMERICA'
+    AND (
+        p_mfgr = 'MFGR#1'
+        OR p_mfgr = 'MFGR#2'
+    )
+GROUP BY d_year, c_nation
+ORDER BY d_year, c_nation;
+
+*Please run the original SQL query (e.g., in Postgres)
+to verify the correctness of the DML version.
+-> First tests: Works on the dataset with scale factor 0.1.
+-> Sorting does not work.
+
+*Based on older implementations.
+https://github.com/ghafek/systemds/blob/feature/ssb-benchmark/scripts/ssb/queries/q1_1.dml
+*Especially:
+https://github.com/ghafek/systemds/blob/feature/ssb-benchmark/scripts/ssb/queries/q4_3.dml
+Compared to the older version, the join method was changed
+from sort-merge to hash2 to improve the performance.
+
+Input parameters:
+input_dir - Path to input directory containing the table files (e.g., ./data)
+*/
+
+# Load the relational-algebra (RA) builtin modules.
+source("./scripts/builtin/raSelection.dml") as raSel
+source("./scripts/builtin/raJoin.dml") as raJoin
+source("./scripts/builtin/raGroupby.dml") as raGrp
+
+# Set input parameters.
+input_dir = ifdef($input_dir, "./data");
+print("Loading tables from directory: " + input_dir);
+
+# Read and load input CSV files.
+lineorder_csv = read(input_dir + "/lineorder.tbl", data_type="frame", format="csv", header=FALSE, sep="|"); +cust_csv = read(input_dir + "/customer.tbl", data_type="frame", format="csv", header=FALSE, sep="|"); +date_csv = read(input_dir + "/date.tbl", data_type="frame", format="csv", header=FALSE, sep="|"); +part_csv = read(input_dir + "/part.tbl", data_type="frame", format="csv", header=FALSE, sep="|"); +supp_csv = read(input_dir + "/supplier.tbl", data_type="frame", format="csv", header=FALSE, sep="|"); + +# General variables. +general_spec = "{ \"ids\": false, \"recode\": [\"C1\"] }"; +hasRows = 1; # If hasRows = 0, the result table is empty. + +# -- Data preparation -- + +# Extract only the necessary columns from tables. +# Extracted: COL-3 | COL-4 | COL-5 | COL-6 | COL-13 | COL-14 +# => LO_CUSTKEY | LO_PARTKEY | LO_SUPPKEY | LO_ORDERDATE | +# LO_REVENUE | LO_SUPPLYCOST +lineorder_csv_min = cbind(lineorder_csv[, 3], lineorder_csv[, 4], lineorder_csv[, 5], lineorder_csv[, 6], lineorder_csv[, 13], lineorder_csv[, 14]); +lineorder_matrix_min = as.matrix(lineorder_csv_min); + +# Extracted: COL-1 | COL-5 +# => D_DATEKEY | D_YEAR +date_csv_min = cbind(date_csv[, 1], date_csv[, 5]); +date_matrix_min = as.matrix(date_csv_min); + +# -- Filter tables over string values. 
+ +# Prepare PART table on-the-fly encodings +# Extracted: COL-1 | COL-3 +# P_PARTKEY | P_MFGR + +# Build filtered PART table (P_MFGR = 'MFGR#1' OR P_MFGR = 'MFGR#2'), keeping key +part_filt = matrix(0, rows=0, cols=1); +for (i in 1:nrow(part_csv)) { + p_elem = as.scalar(part_csv[i,3]) + if ( p_elem == "MFGR#1" | p_elem == "MFGR#2" ) { + key_val = as.double(as.scalar(part_csv[i,1])); + part_filt = rbind(part_filt, matrix(key_val, rows=1, cols=1)); + } +} +if (nrow(part_filt) == 0) { + hasRows = 0; +} + +# Extracted: COL-1 | COL-6 +# S_SUPPKEY | S_REGION +# Build filtered SUPPLIER table (S_REGION = 'AMERICA') +supp_filt = matrix(0, rows=0, cols=1); +if(hasRows){ + for (i in 1:nrow(supp_csv)) { + if (as.scalar(supp_csv[i,6]) == "AMERICA") { + key_val = as.double(as.scalar(supp_csv[i,1])); + supp_filt = rbind(supp_filt, matrix(key_val, rows=1, cols=1)); + } + } + if (nrow(supp_filt) == 0) { + hasRows = 0; + } +} + +# Prepare CUSTOMER table on-the-fly encodings +# Extracted: COL-1 | COL-5 | COL-6 +# C_CUSTKEY | C_NATION | C_REGION +# (only need C_NATION encoding, filter by C_REGION string) +[cust_nat_enc_f, cust_nat_meta] = transformencode(target=cust_csv[,5], spec=general_spec); + +cust_filt_keys = matrix(0, rows=0, cols=1); +cust_filt_nat = matrix(0, rows=0, cols=1); +cust_filt = matrix(0, rows=0, cols=1); + +if(hasRows){ + # Build filtered CUSTOMER table (C_REGION = 'AMERICA') + for (i in 1:nrow(cust_csv)) { + if (as.scalar(cust_csv[i,6]) == "AMERICA") { + key_val = as.double(as.scalar(cust_csv[i,1])); + nat_code = as.double(as.scalar(cust_nat_enc_f[i,1])); + cust_filt_keys = rbind(cust_filt_keys, matrix(key_val, rows=1, cols=1)); + cust_filt_nat = rbind(cust_filt_nat, matrix(nat_code, rows=1, cols=1)); + } + } + if (nrow(cust_filt_keys) == 0) { + hasRows = 0; + } + else{ + cust_filt = cbind(cust_filt_keys,cust_filt_nat); + } +} +#print("LO,DATE,CUST,PART,SUPP") +#print(toString(lineorder_matrix_min[1,])) +#print(toString(date_matrix_min[1,])) 
+#print(toString(cust_filt[1,])) +#print(toString(part_filt[1,])) +#print(toString(supp_filt[1,])) + + +# -- JOIN TABLES WITH RA-JOIN FUNCTION -- +lo_cust = matrix(0, rows=0, cols=1); +lo_cust_supp = matrix(0, rows=0, cols=1); +lo_cust_supp_part = matrix(0, rows=0, cols=1); +joined_matrix = matrix(0, rows=0, cols=1); +# Join LINEORDER table with CUST, SUPPLIER, PART, DATE tables (star schema) +# Join order does matter! +# WHERE LO_CUSTKEY = C_CUSTKEY +if(hasRows){ + lo_cust = raJoin::m_raJoin(A=cust_filt, colA=1, B=lineorder_matrix_min, colB=1, method="hash2"); + if(nrow(lo_cust[,1]) == 0){ + hasRows = 0; + } +} +# WHERE LO_SUPPKEY = S_SUPPKEY +if(hasRows){ + lo_cust_supp = raJoin::m_raJoin(A=supp_filt, colA=1, B=lo_cust, colB=5, method="hash2"); + if(nrow(lo_cust_supp[,1]) == 0){ + hasRows = 0; + } +} +# WHERE LO_PARTKEY = P_PARTKEY +if(hasRows){ + lo_cust_supp_part = raJoin::m_raJoin(A=part_filt, colA=1, B=lo_cust_supp, colB=5, method="hash2"); + if(nrow(lo_cust_supp_part[,1]) == 0){ + hasRows = 0; + } +} +# WHERE LO_ORDERDATE = D_DATEKEY +# (D_DATEKEY | D_YEAR) | (P_PARTKEY | S_SUPPKEY | C_CUSTKEY | C_NATION | +# LO_CUSTKEY | LO_PARTKEY | LO_SUPPKEY | LO_ORDERDATE | LO_REVENUE | LO_SUPPLYCOST) +if(hasRows){ + joined_matrix = raJoin::m_raJoin(A=date_matrix_min, colA=1, B=lo_cust_supp_part, colB=8, method="hash2"); + if(nrow(joined_matrix[,1]) == 0){ + hasRows = 0; + } +} +#print(toString(joined_matrix[1,])) + +# -- Group-By and Aggregation (SUM)-- + +if(hasRows){ + # Group-By + c_nat = joined_matrix[,6] + d_year = joined_matrix[,2] + lo_revenue = joined_matrix[,11] + lo_supplycost = joined_matrix[,12] + profit = lo_revenue - lo_supplycost; + + # CALCULATING COMBINATION KEY WITH PRIORITY:1 D_YEAR, 2 S_NATION + max_d_year = max(d_year); + max_c_nat= max(c_nat); + + d_year_scale_f = ceil(max_d_year) + 1; + c_nat_scale_f = ceil(max_c_nat) + 1; + + combined_key = d_year * c_nat_scale_f + c_nat; + + group_input = cbind(profit, combined_key) + + agg_result = 
raGrp::m_raGroupby(X=group_input, col=2, method="nested-loop"); + #print(toString(agg_result[1,])); + + # Aggregation (SUM) + key = agg_result[, 1]; + profit = rowSums(agg_result[, 2:ncol(agg_result)]); + + # EXTRACTING D_YEAR, C_NATION + d_year = round(floor(key / (c_nat_scale_f))); + c_nat = round(floor((key %% (c_nat_scale_f)))); + + result = cbind(d_year, c_nat, profit, key); + + # -- Sorting -- -- Sorting int columns works, but strings do not. + # ORDER BY D_YEAR, C_NATION ASC + result_ordered = order(target=result, by=4, decreasing=FALSE, index.return=FALSE); + + c_nat_dec = transformdecode(target=result_ordered[,2], spec=general_spec, meta=cust_nat_meta); + + res = cbind(as.frame(result_ordered[,1]), c_nat_dec, as.frame(result_ordered[,3])) ; + + # Print result + print("d_year | c_nation | PROFIT") + print(res) + + print("\nQ4.1 finished.\n"); +} +else{ + # If the result table has 0 rows, skip group-by and aggregation. + # Print result + print("d_year | c_nation | PROFIT") + print("The result table has 0 rows.") + + print("\nQ4.1 finished.\n"); +} \ No newline at end of file diff --git a/scripts/staging/ssb/queries/q4_2.dml b/scripts/staging/ssb/queries/q4_2.dml new file mode 100644 index 00000000000..c873832a2ee --- /dev/null +++ b/scripts/staging/ssb/queries/q4_2.dml @@ -0,0 +1,285 @@ +#------------------------------------------------------------- +# +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. 
You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. +# +#------------------------------------------------------------- + +/* DML-script implementing the ssb query Q4.2 in SystemDS. +**input_dir="/scripts/ssb/data" + +* Run with docker: +docker run -it --rm -v $PWD:/scripts/ apache/systemds:nightly -f /scripts/queries/q4_2.dml -nvargs input_dir="/scripts/data/" + +SELECT + d_year, + s_nation, + p_category, + SUM(lo_revenue - lo_supplycost) AS PROFIT +FROM date, customer, supplier, part, lineorder --dates +WHERE + lo_custkey = c_custkey + AND lo_suppkey = s_suppkey + AND lo_partkey = p_partkey + AND lo_orderdate = d_datekey + AND c_region = 'AMERICA' + AND s_region = 'AMERICA' + AND ( + d_year = 1997 + OR d_year = 1998 + ) + AND ( + p_mfgr = 'MFGR#1' + OR p_mfgr = 'MFGR#2' + ) +GROUP BY d_year, s_nation, p_category +ORDER BY d_year, s_nation, p_category; + +*Please run the original SQL query (eg. in Postgres) +to verify the correctness of DML version. +-> First tests: Works on the dataset with scale factor 0.1. +-> Sorting does not work. + +*Based on older implementations. +https://github.com/ghafek/systemds/blob/feature/ssb-benchmark/scripts/ssb/queries/q1_1.dml +*Especially: +https://github.com/ghafek/systemds/blob/feature/ssb-benchmark/scripts/ssb/queries/q4_3.dml +In comparison to older version the join method was changed +from sort-merge to hash2 to improve the performance. + +Input parameters: +input_dir - Path to input directory containing the table files (e.g., ./data) +*/ + +# Call ra-modules with ra-functions. 
+source("./scripts/builtin/raSelection.dml") as raSel +source("./scripts/builtin/raJoin.dml") as raJoin +source("./scripts/builtin/raGroupby.dml") as raGrp + +# Set input parameters. +input_dir = ifdef($input_dir, "./data"); +print("Loading tables from directory: " + input_dir); + +# Read and load input CSV files. +lineorder_csv = read(input_dir + "/lineorder.tbl", data_type="frame", format="csv", header=FALSE, sep="|"); +cust_csv = read(input_dir + "/customer.tbl", data_type="frame", format="csv", header=FALSE, sep="|"); +date_csv = read(input_dir + "/date.tbl", data_type="frame", format="csv", header=FALSE, sep="|"); +part_csv = read(input_dir + "/part.tbl", data_type="frame", format="csv", header=FALSE, sep="|"); +supp_csv = read(input_dir + "/supplier.tbl", data_type="frame", format="csv", header=FALSE, sep="|"); + +# General variables. +general_spec = "{ \"ids\": false, \"recode\": [\"C1\"] }"; +hasRows = 1; # If hasRows = 0, the result table is empty. + +# -- Data preparation -- + +# Extract only the necessary columns from tables. +# Extracted: COL-3 | COL-4 | COL-5 | COL-6 | COL-13 | COL-14 +# => LO_CUSTKEY | LO_PARTKEY | LO_SUPPKEY | LO_ORDERDATE | +# LO_REVENUE | LO_SUPPLYCOST +lineorder_csv_min = cbind(lineorder_csv[, 3], lineorder_csv[, 4], lineorder_csv[, 5], lineorder_csv[, 6], lineorder_csv[, 13], lineorder_csv[, 14]); +lineorder_matrix_min = as.matrix(lineorder_csv_min); + +# Extracted: COL-1 | COL-5 +# => D_DATEKEY | D_YEAR +date_csv_min = cbind(date_csv[, 1], date_csv[, 5]); +date_matrix_min = as.matrix(date_csv_min); + +# -- Filter tables over string values. 
+
+# WHERE D_YEAR = 1997 OR D_YEAR = 1998
+d_filtA = raSel::m_raSelection(date_matrix_min, col=2, op="==", val=1997);
+d_filtB = raSel::m_raSelection(date_matrix_min, col=2, op="==", val=1998);
+d_filt = matrix(0, rows=0, cols=1);
+d_filt = rbind(d_filtA, d_filtB);
+if(as.scalar(d_filt[1,1]) == 0){
+    hasRows = 0;
+}
+# Prepare PART table on-the-fly encodings.
+# Extracted: COL-1 | COL-3 | COL-4
+#            P_PARTKEY | P_MFGR | P_CATEGORY
+# (only need the P_CATEGORY encoding; filter by the P_MFGR string)
+[part_cat_enc_f, part_cat_meta] = transformencode(target=part_csv[,4], spec=general_spec);
+
+part_filt_keys = matrix(0, rows=0, cols=1);
+part_filt_cat = matrix(0, rows=0, cols=1);
+part_filt = matrix(0, rows=0, cols=1);
+
+if(hasRows){
+    # Build filtered PART table (P_MFGR = 'MFGR#1' OR P_MFGR = 'MFGR#2'), keeping key and encoded category
+    for (i in 1:nrow(part_csv)) {
+        p_elem = as.scalar(part_csv[i,3]);
+        if ( p_elem == "MFGR#1" | p_elem == "MFGR#2" ) {
+            key_val = as.double(as.scalar(part_csv[i,1]));
+            cat_code = as.double(as.scalar(part_cat_enc_f[i,1]));
+            part_filt_keys = rbind(part_filt_keys, matrix(key_val, rows=1, cols=1));
+            part_filt_cat = rbind(part_filt_cat, matrix(cat_code, rows=1, cols=1));
+        }
+    }
+    if (nrow(part_filt_keys) == 0) {
+        hasRows = 0;
+    }
+    else{
+        part_filt = cbind(part_filt_keys, part_filt_cat);
+    }
+}
+# Prepare SUPPLIER table on-the-fly encodings.
+# Extracted: COL-1 | COL-5 | COL-6
+#            S_SUPPKEY | S_NATION | S_REGION
+# (only need the S_NATION encoding; filter by the S_REGION string)
+[supp_nat_enc_f, supp_nat_meta] = transformencode(target=supp_csv[,5], spec=general_spec);
+
+supp_filt_keys = matrix(0, rows=0, cols=1);
+supp_filt_nat = matrix(0, rows=0, cols=1);
+supp_filt = matrix(0, rows=0, cols=1);
+
+if(hasRows){
+    # Build filtered SUPPLIER table (S_REGION = 'AMERICA')
+    for (i in 1:nrow(supp_csv)) {
+        if (as.scalar(supp_csv[i,6]) == "AMERICA") {
+            key_val = as.double(as.scalar(supp_csv[i,1]));
+            nat_code =
as.double(as.scalar(supp_nat_enc_f[i,1])); + supp_filt_keys = rbind(supp_filt_keys, matrix(key_val, rows=1, cols=1)); + supp_filt_nat = rbind(supp_filt_nat, matrix(nat_code, rows=1, cols=1)); + } + } + if (nrow(supp_filt_keys) == 0) { + hasRows = 0; + } + else{ + supp_filt = cbind(supp_filt_keys, supp_filt_nat); + + } +} +# Extracted: COL-1 | COL-6 +# C_CUSTKEY | C_REGION +# Build filtered CUSTOMER table (C_REGION == 'AMERICA') +cust_filt = matrix(0, rows=0, cols=1); +if(hasRows){ + for (i in 1:nrow(cust_csv)) { + if (as.scalar(cust_csv[i,6]) == "AMERICA") { + key_val = as.double(as.scalar(cust_csv[i,1])); + cust_filt = rbind(cust_filt, matrix(key_val, rows=1, cols=1)); + } + } + if (nrow(cust_filt) == 0) { + hasRows = 0; + } +} +#print("LO,DATE,CUST,PART,SUPP") +#print(toString(lineorder_matrix_min[1,])) +#print(toString(date_matrix_min[1,])) +#print(toString(cust_filt[1,])) +#print(toString(part_filt[1,])) +#print(toString(supp_filt[1,])) + +# -- JOIN TABLES WITH RA-JOIN FUNCTION -- + +# Join LINEORDER table with CUST, SUPPLIER, PART, DATE tables (star schema) +# Join order does matter! 
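Each `raJoin` call below is a single-column equi-join whose result lays out the columns of `A` first, followed by the columns of `B` (as the layout comments in the scripts indicate). As an illustrative sketch only (a plain dict-based hash join in Python, not SystemDS's actual `hash2` implementation; the `hash_join` name and the toy rows are hypothetical), the build/probe structure looks like this:

```python
def hash_join(A, colA, B, colB):
    """Equi-join two row lists on one key column (0-indexed here).
    Mirrors the A-columns-first output layout of the DML raJoin calls;
    illustrative only, not SystemDS's actual 'hash2' implementation."""
    # Build phase: hash the (usually smaller) filtered dimension table A.
    buckets = {}
    for row in A:
        buckets.setdefault(row[colA], []).append(row)
    # Probe phase: stream the fact table B and emit matching combinations.
    out = []
    for brow in B:
        for arow in buckets.get(brow[colB], []):
            out.append(arow + brow)
    return out

# Tiny example in the spirit of q4.2: dates filtered to 1997/1998,
# joined with a minimal lineorder on the order-date key.
d_filt = [(19970101, 1997), (19980101, 1998)]        # D_DATEKEY | D_YEAR
lineorder = [(1, 19970101, 100), (2, 19960101, 50)]  # LO_ID | LO_ORDERDATE | LO_REVENUE
joined = hash_join(d_filt, 0, lineorder, 1)
print(joined)  # [(19970101, 1997, 1, 19970101, 100)]
```

Building on the filtered dimension tables and probing with the lineorder fact table is what makes the join order matter: each join can only shrink the fact-table side before the next, more expensive join runs.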
+# WHERE LO_CUSTKEY = C_CUSTKEY +if(hasRows){ + lo_cust = raJoin::m_raJoin(A=cust_filt, colA=1, B=lineorder_matrix_min, colB=1, method="hash2"); + if(nrow(lo_cust[,1]) == 0){ + hasRows = 0; + } +} +# WHERE LO_SUPPKEY = S_SUPPKEY +if(hasRows){ + lo_cust_supp = raJoin::m_raJoin(A=supp_filt, colA=1, B=lo_cust, colB=4, method="hash2"); + if(nrow(lo_cust_supp[,1]) == 0){ + hasRows = 0; + } +} +# WHERE LO_PARTKEY = P_PARTKEY +if(hasRows){ + lo_cust_supp_part = raJoin::m_raJoin(A=part_filt, colA=1, B=lo_cust_supp, colB=5, method="hash2"); + if(nrow(lo_cust_supp_part[,1]) == 0){ + hasRows = 0; + } +} +# WHERE LO_ORDERDATE = D_DATEKEY +# (D_DATEKEY | D_YEAR) | (P_PARTKEY | P_CATEGORY | (S_SUPPKEY | S_NATION | C_CUSTKEY | +# LO_CUSTKEY | LO_PARTKEY | LO_SUPPKEY | LO_ORDERDATE | LO_REVENUE | LO_SUPPLYCOST) +if(hasRows){ + joined_matrix = raJoin::m_raJoin(A=d_filt, colA=1, B=lo_cust_supp_part, colB=9, method="hash2"); + if(nrow(joined_matrix[,1]) == 0){ + hasRows = 0; + } +} +# -- Group-By and Aggregation (SUM)-- + +if(hasRows){ + # Group-By + d_year = joined_matrix[,2] + p_cat = joined_matrix[,4] + s_nat = joined_matrix[,6] + lo_revenue = joined_matrix[,12] + lo_supplycost = joined_matrix[,13] + profit = lo_revenue - lo_supplycost; + + # CALCULATING COMBINATION KEY WITH PRIORITY:1 D_YEAR, 2 S_NATION, 3 P_CATEGORY + max_d_year = max(d_year); + max_s_nat= max(s_nat); + max_p_cat = max(p_cat); + + d_year_scale_f = ceil(max_d_year) + 1; + s_nat_scale_f = ceil(max_s_nat) + 1; + p_cat_scale_f = ceil(max_p_cat) + 1; + + combined_key = d_year * s_nat_scale_f * p_cat_scale_f + s_nat * p_cat_scale_f + p_cat; + + group_input = cbind(profit, combined_key) + + agg_result = raGrp::m_raGroupby(X=group_input, col=2, method="nested-loop"); + #print(toString(agg_result[1,])); + + # Aggregation (SUM) + key = agg_result[, 1]; + profit = rowSums(agg_result[, 2:ncol(agg_result)]); + + # EXTRACTING D_YEAR, S_NATION, P_CATEGORY + d_year = round(floor(key / (s_nat_scale_f * p_cat_scale_f))); + s_nat 
= round(floor((key %% (s_nat_scale_f * p_cat_scale_f)) / p_cat_scale_f)); + p_cat = round(key %% p_cat_scale_f); + + result = cbind(d_year, s_nat, p_cat, profit, key); + + # -- Sorting -- -- Sorting int columns works, but strings do not. + # ORDER BY D_YEAR, S_NATION, P_CATEGORY ASC + result_ordered = order(target=result, by=5, decreasing=FALSE, index.return=FALSE); + + s_nat_dec = transformdecode(target=result_ordered[,2], spec=general_spec, meta=supp_nat_meta); + p_cat_dec = transformdecode(target=result_ordered[,3], spec=general_spec, meta=part_cat_meta); + + res = cbind(as.frame(result_ordered[,1]), s_nat_dec, p_cat_dec, as.frame(result_ordered[,4])) ; + + # Print result + print("d_year | s_nation | p_category | PROFIT"); + print(res); + + print("\nQ4.2 finished.\n"); +} +else{ + # If the result table has 0 rows, skip group-by and aggregation. + # Print result + print("d_year | s_nation | p_category | PROFIT"); + print("The result table has 0 rows."); + + print("\nQ4.2 finished.\n"); +} \ No newline at end of file diff --git a/scripts/staging/ssb/queries/q4_3.dml b/scripts/staging/ssb/queries/q4_3.dml new file mode 100644 index 00000000000..384411432b6 --- /dev/null +++ b/scripts/staging/ssb/queries/q4_3.dml @@ -0,0 +1,267 @@ +#------------------------------------------------------------- +# +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. 
You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. +# +#------------------------------------------------------------- + +/* DML-script implementing the ssb query Q4.3 in SystemDS. +**input_dir="/scripts/ssb/data" + +* Run with docker: +docker run -it --rm -v $PWD:/scripts/ apache/systemds:nightly -f /scripts/queries/q4_3.dml -nvargs input_dir="/scripts/data/" + +SELECT + d_year, + s_city, + p_brand, + SUM(lo_revenue - lo_supplycost) AS PROFIT +FROM date, customer, supplier, part, lineorder -- dates +WHERE + lo_custkey = c_custkey + AND lo_suppkey = s_suppkey + AND lo_partkey = p_partkey + AND lo_orderdate = d_datekey + AND s_nation = 'UNITED STATES' + AND ( + d_year = 1997 + OR d_year = 1998 + ) + AND p_category = 'MFGR#14' +GROUP BY d_year, s_city, p_brand +ORDER BY d_year, s_city, p_brand; + +*Please run the original SQL query (eg. in Postgres) +to verify the correctness of DML version. +-> First tests: Works on the dataset with scale factor 0.1. +-> Sorting does not work. + +*Based on older implementations. +https://github.com/ghafek/systemds/blob/feature/ssb-benchmark/scripts/ssb/queries/q1_1.dml +*Especially: +https://github.com/ghafek/systemds/blob/feature/ssb-benchmark/scripts/ssb/queries/q4_3.dml +In comparison to older version the join method was changed +from sort-merge to hash2 to improve the performance. + +Input parameters: +input_dir - Path to input directory containing the table files (e.g., ./data) +*/ + +# Call ra-modules with ra-functions. 
+source("./scripts/builtin/raSelection.dml") as raSel +source("./scripts/builtin/raJoin.dml") as raJoin +source("./scripts/builtin/raGroupby.dml") as raGrp + +# Set input parameters. +input_dir = ifdef($input_dir, "./data"); +print("Loading tables from directory: " + input_dir); + +# Read and load input CSV files from date and lineorder. +lineorder_csv = read(input_dir + "/lineorder.tbl", data_type="frame", format="csv", header=FALSE, sep="|"); +cust_csv = read(input_dir + "/customer.tbl", data_type="frame", format="csv", header=FALSE, sep="|"); +date_csv = read(input_dir + "/date.tbl", data_type="frame", format="csv", header=FALSE, sep="|"); +part_csv = read(input_dir + "/part.tbl", data_type="frame", format="csv", header=FALSE, sep="|"); +supp_csv = read(input_dir + "/supplier.tbl", data_type="frame", format="csv", header=FALSE, sep="|"); + +# General variables. +general_spec = "{ \"ids\": false, \"recode\": [\"C1\"] }"; +hasRows = 1; # If hasRows = 0, the result table is empty. + +# -- Data preparation -- + +# Extract only the necessary columns from tables. +# Extracted: COL-3 | COL-4 | COL-5 | COL-6 | COL-13 | COL-14 +# => LO_CUSTKEY | LO_PARTKEY | LO_SUPPKEY | LO_ORDERDATE | +# LO_REVENUE | LO_SUPPLYCOST +lineorder_csv_min = cbind(lineorder_csv[, 3], lineorder_csv[, 4], lineorder_csv[, 5], lineorder_csv[, 6], lineorder_csv[, 13], lineorder_csv[, 14]); +lineorder_matrix_min = as.matrix(lineorder_csv_min); + +# Extracted: COL-1 | COL-5 +# => D_DATEKEY | D_YEAR +date_csv_min = cbind(date_csv[, 1], date_csv[, 5]); +date_matrix_min = as.matrix(date_csv_min); + +# Extracted: COL-1 +# => C_CUSTKEY +cust_matrix_min = as.matrix(cust_csv[, 1]); + +# -- Filter tables over string values. 
+
+# WHERE D_YEAR = 1997 OR D_YEAR = 1998
+d_filtA = raSel::m_raSelection(date_matrix_min, col=2, op="==", val=1997);
+d_filtB = raSel::m_raSelection(date_matrix_min, col=2, op="==", val=1998);
+d_filt = rbind(d_filtA, d_filtB);
+if(as.scalar(d_filt[1,1]) == 0){
+    hasRows = 0;
+}
+
+# Prepare PART table on-the-fly encodings.
+# Extracted: COL-1 | COL-5
+#            P_PARTKEY | P_BRAND
+# (only need the P_BRAND encoding; filter by the P_CATEGORY string)
+[part_brand_enc_f, part_brand_meta] = transformencode(target=part_csv[,5], spec=general_spec);
+#print(toString(part_brand_enc_f));
+part_filt_keys = matrix(0, rows=0, cols=1);
+part_filt_brand = matrix(0, rows=0, cols=1);
+part_filt = matrix(0, rows=0, cols=1);
+if(hasRows){
+    # Build filtered PART table (P_CATEGORY = 'MFGR#14'), keeping key and encoded brand
+    for (i in 1:nrow(part_csv)) {
+        p_elem = as.scalar(part_csv[i,4]);
+        if ( p_elem == "MFGR#14" ) {
+            key_val = as.double(as.scalar(part_csv[i,1]));
+            brand_code = as.double(as.scalar(part_brand_enc_f[i,1]));
+            part_filt_keys = rbind(part_filt_keys, matrix(key_val, rows=1, cols=1));
+            part_filt_brand = rbind(part_filt_brand, matrix(brand_code, rows=1, cols=1));
+        }
+    }
+    if (nrow(part_filt_keys) == 0) {
+        hasRows = 0;
+    }
+    else{
+        part_filt = cbind(part_filt_keys, part_filt_brand);
+    }
+}
+#print(part_filt[1,])
+
+# Prepare SUPPLIER table on-the-fly encodings.
+# Extracted: COL-1 | COL-4 | COL-5
+#            S_SUPPKEY | S_CITY | S_NATION
+# (only need the S_CITY encoding; filter by the S_NATION string)
+[supp_city_enc_f, supp_city_meta] = transformencode(target=supp_csv[,4], spec=general_spec);
+
+if(hasRows){
+    # Build filtered SUPPLIER table (S_NATION = 'UNITED STATES')
+    supp_filt_keys = matrix(0, rows=0, cols=1);
+    supp_filt_city = matrix(0, rows=0, cols=1);
+    for (i in 1:nrow(supp_csv)) {
+        if (as.scalar(supp_csv[i,5]) == "UNITED STATES") {
+            key_val = as.double(as.scalar(supp_csv[i,1]));
+            city_code = as.double(as.scalar(supp_city_enc_f[i,1]));
+            supp_filt_keys = rbind(supp_filt_keys,
matrix(key_val, rows=1, cols=1)); + supp_filt_city = rbind(supp_filt_city, matrix(city_code, rows=1, cols=1)); + } + } + if (nrow(supp_filt_keys) == 0) { + hasRows = 0; + } + else{ + supp_filt = cbind(supp_filt_keys, supp_filt_city); + } +} +#print("LO,DATE,CUST,PART,SUPP") +#print(toString(lineorder_matrix_min[1,])) +#print(toString(date_matrix_min[1,])) +#print(toString(cust_matrix_min[1,])) +#print(toString(part_filt[1,])) +#print(toString(supp_filt[1,])) + +# -- JOIN TABLES WITH RA-JOIN FUNCTION -- + +# Join LINEORDER table with PART, SUPPLIER, DATE, CUST tables (star schema) +# Join order does matter! +# WHERE LO_PARTKEY = P_PARTKEY +if(hasRows){ + lo_part = raJoin::m_raJoin(A=part_filt, colA=1, B=lineorder_matrix_min, colB=2, method="hash2"); + if(nrow(lo_part[,1]) == 0){ + hasRows = 0; + } +} +# WHERE LO_SUPPKEY = S_SUPPKEY +if(hasRows){ + lo_part_supp = raJoin::m_raJoin(A=supp_filt, colA=1, B=lo_part, colB=5, method="hash2"); + if(nrow(lo_part_supp[,1]) == 0){ + hasRows = 0; + } +} +# WHERE LO_ORDERDATE = D_DATEKEY +if(hasRows){ + lo_part_supp_date = raJoin::m_raJoin(A=d_filt, colA=1, B=lo_part_supp, colB=8, method="hash2"); + if(nrow(lo_part_supp_date[,1]) == 0){ + hasRows = 0; + } +} +# WHERE LO_CUSTKEY = C_CUSTKEY +# (C_CUSTKEY) | (D_DATEKEY | D_YEAR | S_SUPPKEY | S_CITY | P_PARTKEY | P_BRAND | +# LO_CUSTKEY | LO_PARTKEY | LO_SUPPKEY | LO_ORDERDATE | LO_REVENUE | LO_SUPPLYCOST) +if(hasRows){ + joined_matrix = raJoin::m_raJoin(A=cust_matrix_min, colA=1, B=lo_part_supp_date, colB=7, method="hash2"); + if(nrow(joined_matrix[,1]) == 0){ + hasRows = 0; + } +} +#print(nrow(joined_matrix[,1])); +#print(toString(joined_matrix[1,])) + +# -- Group-By and Aggregation (SUM)-- +if(hasRows){ + # Group-By + d_year = joined_matrix[,3] + s_city = joined_matrix[,5] + p_brand = joined_matrix[,7] + lo_revenue = joined_matrix[,12] + lo_supplycost = joined_matrix[,13] + profit = lo_revenue - lo_supplycost; + + # CALCULATING COMBINATION KEY WITH PRIORITY:1 D_YEAR, 2 S_CITY, 3 
P_BRAND + max_d_year = max(d_year); + max_s_city= max(s_city); + max_p_brand = max(p_brand); + + d_year_scale_f = ceil(max_d_year) + 1; + s_city_scale_f = ceil(max_s_city) + 1; + p_brand_scale_f = ceil(max_p_brand) + 1; + + combined_key = d_year * s_city_scale_f * p_brand_scale_f + s_city * p_brand_scale_f + p_brand; + + group_input = cbind(profit, combined_key) + agg_result = raGrp::m_raGroupby(X=group_input, col=2, method="nested-loop"); + + # Aggregation (SUM) + key = agg_result[, 1]; + profit = rowSums(agg_result[, 2:ncol(agg_result)]); + + # EXTRACTING D_YEAR, S_CITY, P_BRAND + d_year = round(floor(key / (s_city_scale_f * p_brand_scale_f))); + s_city = round(floor((key %% (s_city_scale_f * p_brand_scale_f)) / p_brand_scale_f)); + p_brand = round(key %% p_brand_scale_f); + + result = cbind(d_year, s_city, p_brand, profit, key); + + # -- Sorting -- -- Sorting int columns works, but strings do not. + # ORDER BY D_YEAR, S_CITY, P_BRAND ASC + result_ordered = order(target=result, by=5, decreasing=FALSE, index.return=FALSE); + + s_city_dec = transformdecode(target=result_ordered[,2], spec=general_spec, meta=supp_city_meta); + p_brand_dec = transformdecode(target=result_ordered[,3], spec=general_spec, meta=part_brand_meta); + + res = cbind(as.frame(result_ordered[,1]), s_city_dec, p_brand_dec, as.frame(result_ordered[,4])) ; + + # Print result + print("d_year | s_city | p_brand | PROFIT"); + print(res); + + print("\nQ4.3 finished.\n"); +} +else{ + # If the result table has 0 rows, skip group-by and aggregation. 
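All of the group-by sections (q3.4, q4.1, q4.2, q4.3) use the same mixed-radix trick to fold several grouping columns into one numeric key: each column is multiplied by the product of the scale factors (`ceil(max(col)) + 1`) of the lower-priority columns, and the columns are recovered afterwards with integer division and modulo. A plain-Python sketch of the same arithmetic (the function names and example values are hypothetical, not taken from the dataset):

```python
def encode_key(d_year, s_city, p_brand, s_city_sf, p_brand_sf):
    """Fold three grouping columns into one number (highest priority first)."""
    return d_year * s_city_sf * p_brand_sf + s_city * p_brand_sf + p_brand

def decode_key(key, s_city_sf, p_brand_sf):
    """Invert encode_key with integer division and modulo."""
    d_year = key // (s_city_sf * p_brand_sf)
    s_city = (key % (s_city_sf * p_brand_sf)) // p_brand_sf
    p_brand = key % p_brand_sf
    return d_year, s_city, p_brand

# Scale factors must strictly exceed the columns' maxima,
# e.g. max(s_city) = 249 and max(p_brand) = 999 here.
s_city_sf, p_brand_sf = 250, 1000
key = encode_key(1998, 42, 731, s_city_sf, p_brand_sf)
assert decode_key(key, s_city_sf, p_brand_sf) == (1998, 42, 731)
```

The round trip is lossless only because every scale factor strictly exceeds all values in its column, which the DML guarantees by using `ceil(max(col)) + 1`; this also makes the combined key sort in the same order as the (priority-ordered) column tuple, which is why a single `order()` call on the key column implements the multi-column ORDER BY.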
+ # Print result + print("d_year | s_city | p_brand | PROFIT"); + print("The result table has 0 rows."); + + print("\nQ4.3 finished.\n"); +} diff --git a/scripts/staging/ssb/shell/run_script.sh b/scripts/staging/ssb/shell/run_script.sh new file mode 100755 index 00000000000..a6b78369a00 --- /dev/null +++ b/scripts/staging/ssb/shell/run_script.sh @@ -0,0 +1,372 @@ +#------------------------------------------------------------- +# +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. +# +#------------------------------------------------------------- + +#!/bin/bash +#Mark as executable. +#chmod +x run_script.sh + +# Read the database credentials from .env file. +source $PWD/.env + +# Variables and arguments. +PG_CONTAINER="ssb-postgres-1" + +#https://stackoverflow.com/questions/7069682/how-to-get-arguments-with-flags-in-bash + +#Initial variable values. 
+QUERY_NAME="q2_1"
+SCALE=0.1
+DB_SYSTEM="systemds"
+
+isQflag=0
+isSflag=0
+isDflag=0
+isGflag=0
+
+dml_query_array=("q1_1" "q1_2" "q1_3" "q2_1" "q2_2" "q2_3" "q3_1" "q3_2" "q3_3" "q3_4" "q4_1" "q4_2" "q4_3")
+sql_query_array=("q1.1" "q1.2" "q1.3" "q2.1" "q2.2" "q2.3" "q3.1" "q3.2" "q3.3" "q3.4" "q4.1" "q4.2" "q4.3")
+
+# Colors for output
+GREEN='\033[0;32m'
+BLUE='\033[0;34m'
+RED='\033[0;31m'
+NC='\033[0m' # No Color
+
+echo -e "${BLUE}=== Test environment for SSB Data ===${NC}\n"
+
+#https://unix.stackexchange.com/questions/129391/passing-named-arguments-to-shell-scripts
+# Parsing the argument flags.
+while getopts "q:s:d:gh" opt; do
+    case ${opt} in
+        q) QUERY_NAME="$OPTARG"
+           isQflag=1;;
+        s) SCALE=$OPTARG
+           isSflag=1;;
+        d) DB_SYSTEM="$OPTARG"
+           isDflag=1;;
+        g) isGflag=1;;
+        #h (help) without flags
+        h) echo "Help:"
+           cat < other/script_flags_help.txt
+           echo "Thank you.";;
+        ?) echo "Option ${opt} not found. Try again."
+           echo "Please use: $0 -q [YOUR_QUERY_NAME] -s [YOUR_SCALE] -d [YOUR_DB_SYSTEM]";;
+    esac
+    case $OPTARG in
+        -*) echo "Option ${opt} should have an argument.";;
+    esac
+done
+
+#echo "isQflag=$isQflag"
+#echo "isSflag=$isSflag"
+#echo "isDflag=$isDflag"
+#echo "isGflag=$isGflag"
+if [ ${isQflag} == 0 ]; then
+    echo "Warning: q-flag [QUERY_NAME] is empty. The default q is q2_1."
+fi
+if [ ${isSflag} == 0 ]; then
+    echo "Warning: s-flag [SCALE] is empty. The default s is 0.1."
+fi
+if [ ${isDflag} == 0 ]; then
+    echo "Warning: d-flag [DATABASE] is empty. The default d is systemds."
+fi
+if [ ${isGflag} == 1 ]; then
+    echo "g-flag is set. That means the Docker Desktop GUI is used."
+fi
+
+echo "Arg 0 (SHELL_SCRIPT): $0"
+echo "Arg 1 (QUERY_NAME): ${QUERY_NAME}"
+echo "Arg 2 (SCALE): ${SCALE}"
+echo "Arg 3 (DB_SYSTEM): ${DB_SYSTEM}"
+
+# Check whether the query is valid.
+QUERY_NAME=$(echo "${QUERY_NAME}" | sed 's/\./_/')
+isQuery_valid=0
+if [ "${QUERY_NAME}" != "all" ]; then
+    for q in "${dml_query_array[@]}"; do
+        if [ "${QUERY_NAME}" == "${q}" ]; then
+            isQuery_valid=1
+            break
+        fi
+    done
+    if [ ${isQuery_valid} == 0 ]; then
+        echo -e "Sorry, the query ${QUERY_NAME} is invalid. Valid query names are 'all' and ${dml_query_array[@]}."
+        echo -e "${RED}Test bench terminated unsuccessfully.${NC}"
+        exit
+    fi
+else
+    echo "All queries: ${dml_query_array[@]}"
+fi
+
+# Check for the required packages. If missing, install them.
+isAllowed="no"
+echo "=========="
+echo -e "${GREEN}Install required packages${NC}"
+echo -e "${GREEN}Check whether the following packages exist:${NC}"
+echo "If only SystemDS: docker 'docker compose' git gcc cmake make"
+echo "For PostgreSQL: 'docker compose'"
+echo "For DuckDB: duckdb"
+echo "If using g-flag [GUI]: docker desktop"
+
+if [ ! "$(docker --version)" ]; then
+    echo "Docker is required for this test bench. Please install it manually using the official documentation."
+    exit
+fi
+for package in docker git gcc cmake make; do
+    if [ ! "$(${package} --version)" ]; then
+        echo "${package} package is required for this test bench. Do you want to allow the installation? (yes/no)"
+        read -r isAllowed
+        while true; do
+            if [ "${isAllowed}" == "yes" ] || [ "${isAllowed}" == "y" ]; then
+                echo "Your answer is ${isAllowed}."
+                echo "sudo apt-get install ${package}"
+                sudo apt-get install ${package}
+                break
+            elif [ "${isAllowed}" == "no" ] || [ "${isAllowed}" == "n" ]; then
+                echo -e "${RED}Sorry, we cannot continue with this test bench without the required packages. The test bench is stopped.${NC}"
+                exit
+            else
+                echo "Your answer '${isAllowed}' is neither 'yes' nor 'no'. Please try again."
+                read -r isAllowed
+            fi
+        done
+    fi
+done
+isAllowed="no"
+if [ "${DB_SYSTEM}" != "systemds" ] && [ !
"$(docker compose version)" ]; then + echo "Docker compose is required for this test bench. Do you want to allow the installation? (yes/no)" + read -r isAllowed + while [ "${isAllowed}" != "yes" ] || [ "${isAllowed}" != "y" ]; do + + if [ ${isAllowed} == "yes" ]; then + echo "sudo apt-get install docker-compose-plugin" + sudo apt-get install docker-compose-plugin + elif [ "${isAllowed}" == "no" ] || [ "${isAllowed}" == "n" ]; then + echo -e "${RED}Sorry, we cannot continue with that test bench without the required packages. The test bench is stopped.${NC}" + exit + else + echo "Your answer '${isAllowed}' is neither 'yes' or 'no'. Please try again." + fi + read -r isAllowed + done +fi +isAllowed="no" +if ([ "${DB_SYSTEM}" == "duckdb" ] || [ "${DB_SYSTEM}" == "all" ] ) && [ ! "$(duckdb --version)" ]; then + echo "Duckdb is required for this test bench. Do you want to allow the installation? (yes/no)" + read -r isAllowed + while [ "${isAllowed}" != "yes" ] || [ "${isAllowed}" != "y" ]; do + if [ ${isAllowed} == "yes" ]; then + echo "Your anwser is ${isAllowed}." + echo "curl https://install.duckdb.org | sh" + curl https://install.duckdb.org | sh + elif [ "${isAllowed}" == "no" ] || [ "${isAllowed}" == "n" ]; then + echo -e "${RED}Sorry, we cannot continue with that test bench without the required packages. The test bench is stopped.${NC}" + exit + else + echo "Your answer '${isAllowed}' is neither 'yes' or 'no'. Please try again." + fi + read -r isAllowed + done +fi + +isAllowed="no" +# Use docker desktop GUI +if [ ${isGflag} == 1 ]; then + if [ ! "$(docker desktop version)" ]; then + echo "Docker desktop is required for this test bench. Please install it manually using the official documentation." + exit + fi +fi + +# Check whether the data directory exists. +echo "==========" +echo -e "${GREEN}Check for existing data directory and prepare the ssb-dbgen${NC}" +if [ ! 
-d ssb-dbgen ]; then + git clone https://github.com/eyalroz/ssb-dbgen.git --depth 1 + cd ssb-dbgen +else + cd ssb-dbgen + echo "Can we look for new updates of the datagen repository?. If there are, do you want to pull it? (yes/no)" + read -r isAllowed + while [ "${isAllowed}" != "yes" ] || [ "${isAllowed}" != "y" ]; do + if [ "${isAllowed}" == "yes" ] || [ "${isAllowed}" == "y" ]; then + echo "Your answer is '${isAllowed}'" + echo "git pull" + git pull + break + elif [ "${isAllowed}" == "no" ] || [ "${isAllowed}" == "n" ]; then + echo "Your answer is '${isAllowed}'. No pulls. Use the currently existing version locally." + break + else + echo "Your answer '${isAllowed}' is neither 'yes' or 'no'. Please try again." + read -r isAllowed + fi + done +fi + +echo "==========" +echo -e "${GREEN}Build ssb-dbgen and generate data with a given scale factor${NC}" +# Build the generator +cmake -B ./build && cmake --build ./build +# Run the generator (with -s ) +build/dbgen -b dists.dss -v -s $SCALE +mkdir -p ../data_dir +mv *.tbl ../data_dir + +# Go back to ssb home directory +cd .. +echo "Number of rows of created tables." +for table in customer part supplier date lineorder; do + str1=`wc --lines < data_dir/${table}.tbl` + echo "Table ${table} has ${str1} rows." +done + +# Execute queries in SystemDS docker container. +if [ "${DB_SYSTEM}" == "systemds" ] || [ "${DB_SYSTEM}" == "systemds_stats" ] || [ "${DB_SYSTEM}" == "all" ] ; then + echo "==========" + + echo -e "${GREEN}Start the SystemDS docker container.${NC}" + if [ ${isGflag} == 1 ]; then + docker desktop start + else + sudo systemctl start docker + fi + + if [ ! 
"$(docker images apache/systemds:latest)" ]; then + docker pull apache/systemds:latest + fi + + echo "==========" + + echo -e "${GREEN}Execute DML queries in SystemDS${NC}" + QUERY_NAME=$(echo "${QUERY_NAME}" | sed 's/\./_/') + + #Enable extended outputs with stats in SystemDs + useStats="" + if [ "${DB_SYSTEM}" == "systemds_stats" ]; then + useStats="--stats" + fi + ##all: {"q1_1","q1_2","q1_3","q2_1","q2_2","q2_3","q3_1","q3_2","q3_3","q3_4","q4_1","q4_2","q4_3"} + if [ "${QUERY_NAME}" == "all" ]; then + echo "Execute all 13 queries." + + for q in ${dml_query_array[@]} ; do + echo "Execute query ${q}.dml" + docker run -it --rm -v $PWD:/scripts/ apache/systemds:latest -f /scripts/queries/${q}.dml ${useStats} -nvargs input_dir="/scripts/data_dir" + done + else + echo "Execute query ${QUERY_NAME}.dml" + docker run -it --rm -v $PWD:/scripts/ apache/systemds:latest -f /scripts/queries/${QUERY_NAME}.dml ${useStats} -nvargs input_dir="/scripts/data_dir" + fi +fi + +# Execute queries in PostgreSQL docker container. +if [ "${DB_SYSTEM}" == "postgres" ] || [ "${DB_SYSTEM}" == "all" ] ; then + echo "==========" + echo -e "${GREEN}Start the PostgreSQL Docker containter and load data.${NC}" + + if [ ${isGflag} == 1 ]; then + docker desktop start + else + sudo systemctl start docker + fi + + if [ ! "$(docker images postgres:latest)" ]; then + docker pull postgres:latest + fi + + #Look more in the documentation. + #https://docs.docker.com/reference/cli/docker/container/ls/ + + if [ "$(docker ps -aq --filter name=${PG_CONTAINER})" ]; then + if [ ! "$(docker ps -q --filter name=${PG_CONTAINER})" ]; then + echo "Starting existing container..." + docker start ${PG_CONTAINER} + fi + else + echo "Creating new PostgreSQL container..." 
+        echo "$PWD/docker-compose.yaml"
+        docker compose -f "$PWD/docker-compose.yaml" up -d --build
+        sleep 3
+    fi
+    # Load the data and copy it into the database.
+
+    for table in customer part supplier date lineorder; do
+        #docker exec -i ${PG_CONTAINER} ls
+        docker cp data_dir/${table}.tbl ${PG_CONTAINER}:/tmp
+        echo "Load ${table} table (COPY prints the number of rows):"
+        docker exec -i ${PG_CONTAINER} sed -i 's/|$//' "/tmp/${table}.tbl"
+        docker exec -i ${PG_CONTAINER} psql -U ${POSTGRES_USER} -d ${POSTGRES_DB} -c "TRUNCATE TABLE ${table} CASCADE; COPY ${table} FROM '/tmp/${table}.tbl' DELIMITER '|';"
+    done
+    # Change the query name, e.g. from q1_1 to q1.1.
+    QUERY_NAME=$(echo "${QUERY_NAME}" | sed 's/_/./')
+    echo "=========="
+    echo -e "${GREEN}Execute SQL queries in PostgreSQL${NC}"
+    #all: {"q1.1","q1.2","q1.3","q2.1","q2.2","q2.3","q3.1","q3.2","q3.3","q3.4","q4.1","q4.2","q4.3"}
+    if [ "${QUERY_NAME}" = "all" ]; then
+        echo "Execute all 13 queries."
+        for q in "${sql_query_array[@]}"; do
+            echo "Execute query ${q}.sql"
+            echo "docker exec -i ${PG_CONTAINER} psql -U ${POSTGRES_USER} -d ${POSTGRES_DB} < sql/${q}.sql"
+            docker exec -i ${PG_CONTAINER} psql -U ${POSTGRES_USER} -d ${POSTGRES_DB} < sql/${q}.sql
+        done
+    else
+        echo "Execute query ${QUERY_NAME}.sql"
+        echo "docker exec -i ${PG_CONTAINER} psql -U ${POSTGRES_USER} -d ${POSTGRES_DB} < sql/${QUERY_NAME}.sql"
+        docker exec -i ${PG_CONTAINER} psql -U ${POSTGRES_USER} -d ${POSTGRES_DB} < sql/${QUERY_NAME}.sql
+    fi
+fi
+
+# Execute queries in DuckDB locally.
+if [ "${DB_SYSTEM}" == "duckdb" ] || [ "${DB_SYSTEM}" == "all" ]; then
+
+    echo "=========="
+    echo -e "${GREEN}Start a DuckDB persistent database and load data.${NC}"
+    #https://duckdbsnippets.com/snippets/198/run-sql-file-in-duckdb-cli
+    # Create a DuckDB persistent database file.
+    duckdb shell/test_ssb.duckdb < other/ssb_init.sql
+
+    # Load the data and copy it into the database.
+    for table in customer part supplier date lineorder; do
+        echo "Load ${table} table"
+        duckdb shell/test_ssb.duckdb -c "TRUNCATE TABLE ${table} CASCADE;"
+        duckdb shell/test_ssb.duckdb -c "COPY ${table} FROM 'data_dir/${table}.tbl'; SELECT COUNT(*) AS number_of_rows FROM ${table};"
+    done
+
+    # Change the query name, e.g. from q1_1 to q1.1.
+    QUERY_NAME=$(echo "${QUERY_NAME}" | sed 's/_/./')
+    echo "=========="
+    echo -e "${GREEN}Execute SQL queries in DuckDB${NC}"
+    #all: {"q1.1","q1.2","q1.3","q2.1","q2.2","q2.3","q3.1","q3.2","q3.3","q3.4","q4.1","q4.2","q4.3"}
+    if [ "${QUERY_NAME}" = "all" ]; then
+        echo "Execute all 13 queries."
+        for q in "${sql_query_array[@]}"; do
+            echo "Execute query ${q}.sql"
+            duckdb shell/test_ssb.duckdb < sql/${q}.sql
+        done
+    else
+        echo "Execute query ${QUERY_NAME}.sql"
+        duckdb shell/test_ssb.duckdb < sql/${QUERY_NAME}.sql
+    fi
+
+fi
+echo "=========="
+echo -e "${GREEN}Test bench finished successfully.${NC}"
diff --git a/scripts/staging/ssb/sql/q1.1.sql b/scripts/staging/ssb/sql/q1.1.sql
new file mode 100644
index 00000000000..728c63121bc
--- /dev/null
+++ b/scripts/staging/ssb/sql/q1.1.sql
@@ -0,0 +1,23 @@
+-- Licensed to the Apache Software Foundation (ASF) under one
+-- or more contributor license agreements. See the NOTICE file
+-- distributed with this work for additional information
+-- regarding copyright ownership. The ASF licenses this file
+-- to you under the Apache License, Version 2.0 (the
+-- "License"); you may not use this file except in compliance
+-- with the License. You may obtain a copy of the License at
+--
+-- http://www.apache.org/licenses/LICENSE-2.0
+--
+-- Unless required by applicable law or agreed to in writing,
+-- software distributed under the License is distributed on an
+-- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+-- KIND, either express or implied.
See the License for the +-- specific language governing permissions and limitations +-- under the License. +SELECT SUM(lo_extendedprice * lo_discount) AS REVENUE +FROM lineorder, date --dates (Ssb-dbgen dataset uses "date" instead of "dates") +WHERE + lo_orderdate = d_datekey + AND d_year = 1993 + AND lo_discount BETWEEN 1 AND 3 + AND lo_quantity < 25; diff --git a/scripts/staging/ssb/sql/q1.2.sql b/scripts/staging/ssb/sql/q1.2.sql new file mode 100644 index 00000000000..7445c53e4fc --- /dev/null +++ b/scripts/staging/ssb/sql/q1.2.sql @@ -0,0 +1,23 @@ +-- Licensed to the Apache Software Foundation (ASF) under one +-- or more contributor license agreements. See the NOTICE file +-- distributed with this work for additional information +-- regarding copyright ownership. The ASF licenses this file +-- to you under the Apache License, Version 2.0 (the +-- "License"); you may not use this file except in compliance +-- with the License. You may obtain a copy of the License at +-- +-- http://www.apache.org/licenses/LICENSE-2.0 +-- +-- Unless required by applicable law or agreed to in writing, +-- software distributed under the License is distributed on an +-- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +-- KIND, either express or implied. See the License for the +-- specific language governing permissions and limitations +-- under the License. +SELECT SUM(lo_extendedprice * lo_discount) AS REVENUE +FROM lineorder, date --dates +WHERE + lo_orderdate = d_datekey + AND d_yearmonth = 'Jan1994' + AND lo_discount BETWEEN 4 AND 6 + AND lo_quantity BETWEEN 26 AND 35; \ No newline at end of file diff --git a/scripts/staging/ssb/sql/q1.3.sql b/scripts/staging/ssb/sql/q1.3.sql new file mode 100644 index 00000000000..4f44b0d9f2f --- /dev/null +++ b/scripts/staging/ssb/sql/q1.3.sql @@ -0,0 +1,25 @@ +-- Licensed to the Apache Software Foundation (ASF) under one +-- or more contributor license agreements. 
See the NOTICE file +-- distributed with this work for additional information +-- regarding copyright ownership. The ASF licenses this file +-- to you under the Apache License, Version 2.0 (the +-- "License"); you may not use this file except in compliance +-- with the License. You may obtain a copy of the License at +-- +-- http://www.apache.org/licenses/LICENSE-2.0 +-- +-- Unless required by applicable law or agreed to in writing, +-- software distributed under the License is distributed on an +-- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +-- KIND, either express or implied. See the License for the +-- specific language governing permissions and limitations +-- under the License. +SELECT + SUM(lo_extendedprice * lo_discount) AS REVENUE +FROM lineorder, date --dates +WHERE + lo_orderdate = d_datekey + AND d_weeknuminyear = 6 + AND d_year = 1994 + AND lo_discount BETWEEN 5 AND 7 + AND lo_quantity BETWEEN 26 AND 35; \ No newline at end of file diff --git a/scripts/staging/ssb/sql/q2.1.sql b/scripts/staging/ssb/sql/q2.1.sql new file mode 100644 index 00000000000..785327bbddd --- /dev/null +++ b/scripts/staging/ssb/sql/q2.1.sql @@ -0,0 +1,26 @@ +-- Licensed to the Apache Software Foundation (ASF) under one +-- or more contributor license agreements. See the NOTICE file +-- distributed with this work for additional information +-- regarding copyright ownership. The ASF licenses this file +-- to you under the Apache License, Version 2.0 (the +-- "License"); you may not use this file except in compliance +-- with the License. You may obtain a copy of the License at +-- +-- http://www.apache.org/licenses/LICENSE-2.0 +-- +-- Unless required by applicable law or agreed to in writing, +-- software distributed under the License is distributed on an +-- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +-- KIND, either express or implied. See the License for the +-- specific language governing permissions and limitations +-- under the License. 
+SELECT SUM(lo_revenue), d_year, p_brand +FROM lineorder, date, part, supplier --dates +WHERE + lo_orderdate = d_datekey + AND lo_partkey = p_partkey + AND lo_suppkey = s_suppkey + AND p_category = 'MFGR#12' + AND s_region = 'AMERICA' +GROUP BY d_year, p_brand +ORDER BY p_brand; \ No newline at end of file diff --git a/scripts/staging/ssb/sql/q2.2.sql b/scripts/staging/ssb/sql/q2.2.sql new file mode 100644 index 00000000000..739459b4980 --- /dev/null +++ b/scripts/staging/ssb/sql/q2.2.sql @@ -0,0 +1,26 @@ +-- Licensed to the Apache Software Foundation (ASF) under one +-- or more contributor license agreements. See the NOTICE file +-- distributed with this work for additional information +-- regarding copyright ownership. The ASF licenses this file +-- to you under the Apache License, Version 2.0 (the +-- "License"); you may not use this file except in compliance +-- with the License. You may obtain a copy of the License at +-- +-- http://www.apache.org/licenses/LICENSE-2.0 +-- +-- Unless required by applicable law or agreed to in writing, +-- software distributed under the License is distributed on an +-- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +-- KIND, either express or implied. See the License for the +-- specific language governing permissions and limitations +-- under the License. +SELECT SUM(lo_revenue), d_year, p_brand +FROM lineorder, date, part, supplier --dates +WHERE + lo_orderdate = d_datekey + AND lo_partkey = p_partkey + AND lo_suppkey = s_suppkey + AND p_brand BETWEEN 'MFGR#2221' AND 'MFGR#2228' + AND s_region = 'ASIA' +GROUP BY d_year, p_brand +ORDER BY d_year, p_brand; \ No newline at end of file diff --git a/scripts/staging/ssb/sql/q2.3.sql b/scripts/staging/ssb/sql/q2.3.sql new file mode 100644 index 00000000000..deeb6e64448 --- /dev/null +++ b/scripts/staging/ssb/sql/q2.3.sql @@ -0,0 +1,26 @@ +-- Licensed to the Apache Software Foundation (ASF) under one +-- or more contributor license agreements. 
See the NOTICE file +-- distributed with this work for additional information +-- regarding copyright ownership. The ASF licenses this file +-- to you under the Apache License, Version 2.0 (the +-- "License"); you may not use this file except in compliance +-- with the License. You may obtain a copy of the License at +-- +-- http://www.apache.org/licenses/LICENSE-2.0 +-- +-- Unless required by applicable law or agreed to in writing, +-- software distributed under the License is distributed on an +-- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +-- KIND, either express or implied. See the License for the +-- specific language governing permissions and limitations +-- under the License. +SELECT SUM(lo_revenue), d_year, p_brand +FROM lineorder, date, part, supplier --dates +WHERE + lo_orderdate = d_datekey + AND lo_partkey = p_partkey + AND lo_suppkey = s_suppkey + AND p_brand = 'MFGR#2239' + AND s_region = 'EUROPE' +GROUP BY d_year, p_brand +ORDER BY d_year, p_brand; \ No newline at end of file diff --git a/scripts/staging/ssb/sql/q3.1.sql b/scripts/staging/ssb/sql/q3.1.sql new file mode 100644 index 00000000000..62ef25f4351 --- /dev/null +++ b/scripts/staging/ssb/sql/q3.1.sql @@ -0,0 +1,32 @@ +-- Licensed to the Apache Software Foundation (ASF) under one +-- or more contributor license agreements. See the NOTICE file +-- distributed with this work for additional information +-- regarding copyright ownership. The ASF licenses this file +-- to you under the Apache License, Version 2.0 (the +-- "License"); you may not use this file except in compliance +-- with the License. You may obtain a copy of the License at +-- +-- http://www.apache.org/licenses/LICENSE-2.0 +-- +-- Unless required by applicable law or agreed to in writing, +-- software distributed under the License is distributed on an +-- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +-- KIND, either express or implied. 
See the License for the +-- specific language governing permissions and limitations +-- under the License. +SELECT + c_nation, + s_nation, + d_year, + SUM(lo_revenue) AS REVENUE +FROM customer, lineorder, supplier, date --dates +WHERE + lo_custkey = c_custkey + AND lo_suppkey = s_suppkey + AND lo_orderdate = d_datekey + AND c_region = 'ASIA' + AND s_region = 'ASIA' + AND d_year >= 1992 + AND d_year <= 1997 +GROUP BY c_nation, s_nation, d_year +ORDER BY d_year ASC, REVENUE DESC; \ No newline at end of file diff --git a/scripts/staging/ssb/sql/q3.2.sql b/scripts/staging/ssb/sql/q3.2.sql new file mode 100644 index 00000000000..c961d612e43 --- /dev/null +++ b/scripts/staging/ssb/sql/q3.2.sql @@ -0,0 +1,32 @@ +-- Licensed to the Apache Software Foundation (ASF) under one +-- or more contributor license agreements. See the NOTICE file +-- distributed with this work for additional information +-- regarding copyright ownership. The ASF licenses this file +-- to you under the Apache License, Version 2.0 (the +-- "License"); you may not use this file except in compliance +-- with the License. You may obtain a copy of the License at +-- +-- http://www.apache.org/licenses/LICENSE-2.0 +-- +-- Unless required by applicable law or agreed to in writing, +-- software distributed under the License is distributed on an +-- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +-- KIND, either express or implied. See the License for the +-- specific language governing permissions and limitations +-- under the License. 
+SELECT + c_city, + s_city, + d_year, + SUM(lo_revenue) AS REVENUE +FROM customer, lineorder, supplier, date -- dates +WHERE + lo_custkey = c_custkey + AND lo_suppkey = s_suppkey + AND lo_orderdate = d_datekey + AND c_nation = 'UNITED STATES' + AND s_nation = 'UNITED STATES' + AND d_year >= 1992 + AND d_year <= 1997 +GROUP BY c_city, s_city, d_year +ORDER BY d_year ASC, REVENUE DESC; \ No newline at end of file diff --git a/scripts/staging/ssb/sql/q3.3.sql b/scripts/staging/ssb/sql/q3.3.sql new file mode 100644 index 00000000000..9cabdcc3164 --- /dev/null +++ b/scripts/staging/ssb/sql/q3.3.sql @@ -0,0 +1,38 @@ +-- Licensed to the Apache Software Foundation (ASF) under one +-- or more contributor license agreements. See the NOTICE file +-- distributed with this work for additional information +-- regarding copyright ownership. The ASF licenses this file +-- to you under the Apache License, Version 2.0 (the +-- "License"); you may not use this file except in compliance +-- with the License. You may obtain a copy of the License at +-- +-- http://www.apache.org/licenses/LICENSE-2.0 +-- +-- Unless required by applicable law or agreed to in writing, +-- software distributed under the License is distributed on an +-- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +-- KIND, either express or implied. See the License for the +-- specific language governing permissions and limitations +-- under the License. 
+SELECT + c_city, + s_city, + d_year, + SUM(lo_revenue) AS REVENUE +FROM customer, lineorder, supplier, date --dates +WHERE + lo_custkey = c_custkey + AND lo_suppkey = s_suppkey + AND lo_orderdate = d_datekey + AND ( + c_city = 'UNITED KI1' + OR c_city = 'UNITED KI5' + ) + AND ( + s_city = 'UNITED KI1' + OR s_city = 'UNITED KI5' + ) + AND d_year >= 1992 + AND d_year <= 1997 +GROUP BY c_city, s_city, d_year +ORDER BY d_year ASC, REVENUE DESC; \ No newline at end of file diff --git a/scripts/staging/ssb/sql/q3.4.sql b/scripts/staging/ssb/sql/q3.4.sql new file mode 100644 index 00000000000..093e01c42e5 --- /dev/null +++ b/scripts/staging/ssb/sql/q3.4.sql @@ -0,0 +1,37 @@ +-- Licensed to the Apache Software Foundation (ASF) under one +-- or more contributor license agreements. See the NOTICE file +-- distributed with this work for additional information +-- regarding copyright ownership. The ASF licenses this file +-- to you under the Apache License, Version 2.0 (the +-- "License"); you may not use this file except in compliance +-- with the License. You may obtain a copy of the License at +-- +-- http://www.apache.org/licenses/LICENSE-2.0 +-- +-- Unless required by applicable law or agreed to in writing, +-- software distributed under the License is distributed on an +-- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +-- KIND, either express or implied. See the License for the +-- specific language governing permissions and limitations +-- under the License. 
+SELECT + c_city, + s_city, + d_year, + SUM(lo_revenue) AS REVENUE +FROM customer, lineorder, supplier, date -- dates +WHERE + lo_custkey = c_custkey + AND lo_suppkey = s_suppkey + AND lo_orderdate = d_datekey + AND ( + c_city = 'UNITED KI1' + OR c_city = 'UNITED KI5' + ) + AND ( + s_city = 'UNITED KI1' + OR s_city = 'UNITED KI5' + ) + AND d_yearmonth = 'Dec1997' +GROUP BY c_city, s_city, d_year +ORDER BY d_year ASC, REVENUE DESC; \ No newline at end of file diff --git a/scripts/staging/ssb/sql/q4.1.sql b/scripts/staging/ssb/sql/q4.1.sql new file mode 100644 index 00000000000..6c4dbeb4f21 --- /dev/null +++ b/scripts/staging/ssb/sql/q4.1.sql @@ -0,0 +1,34 @@ +-- Licensed to the Apache Software Foundation (ASF) under one +-- or more contributor license agreements. See the NOTICE file +-- distributed with this work for additional information +-- regarding copyright ownership. The ASF licenses this file +-- to you under the Apache License, Version 2.0 (the +-- "License"); you may not use this file except in compliance +-- with the License. You may obtain a copy of the License at +-- +-- http://www.apache.org/licenses/LICENSE-2.0 +-- +-- Unless required by applicable law or agreed to in writing, +-- software distributed under the License is distributed on an +-- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +-- KIND, either express or implied. See the License for the +-- specific language governing permissions and limitations +-- under the License. 
+SELECT + d_year, + c_nation, + SUM(lo_revenue - lo_supplycost) AS PROFIT +FROM date, customer, supplier, part, lineorder -- dates +WHERE + lo_custkey = c_custkey + AND lo_suppkey = s_suppkey + AND lo_partkey = p_partkey + AND lo_orderdate = d_datekey + AND c_region = 'AMERICA' + AND s_region = 'AMERICA' + AND ( + p_mfgr = 'MFGR#1' + OR p_mfgr = 'MFGR#2' + ) +GROUP BY d_year, c_nation +ORDER BY d_year, c_nation; diff --git a/scripts/staging/ssb/sql/q4.2.sql b/scripts/staging/ssb/sql/q4.2.sql new file mode 100644 index 00000000000..6183b75ee04 --- /dev/null +++ b/scripts/staging/ssb/sql/q4.2.sql @@ -0,0 +1,39 @@ +-- Licensed to the Apache Software Foundation (ASF) under one +-- or more contributor license agreements. See the NOTICE file +-- distributed with this work for additional information +-- regarding copyright ownership. The ASF licenses this file +-- to you under the Apache License, Version 2.0 (the +-- "License"); you may not use this file except in compliance +-- with the License. You may obtain a copy of the License at +-- +-- http://www.apache.org/licenses/LICENSE-2.0 +-- +-- Unless required by applicable law or agreed to in writing, +-- software distributed under the License is distributed on an +-- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +-- KIND, either express or implied. See the License for the +-- specific language governing permissions and limitations +-- under the License. 
+SELECT + d_year, + s_nation, + p_category, + SUM(lo_revenue - lo_supplycost) AS PROFIT +FROM date, customer, supplier, part, lineorder --dates +WHERE + lo_custkey = c_custkey + AND lo_suppkey = s_suppkey + AND lo_partkey = p_partkey + AND lo_orderdate = d_datekey + AND c_region = 'AMERICA' + AND s_region = 'AMERICA' + AND ( + d_year = 1997 + OR d_year = 1998 + ) + AND ( + p_mfgr = 'MFGR#1' + OR p_mfgr = 'MFGR#2' + ) +GROUP BY d_year, s_nation, p_category +ORDER BY d_year, s_nation, p_category; diff --git a/scripts/staging/ssb/sql/q4.3.sql b/scripts/staging/ssb/sql/q4.3.sql new file mode 100644 index 00000000000..20692b043c7 --- /dev/null +++ b/scripts/staging/ssb/sql/q4.3.sql @@ -0,0 +1,35 @@ +-- Licensed to the Apache Software Foundation (ASF) under one +-- or more contributor license agreements. See the NOTICE file +-- distributed with this work for additional information +-- regarding copyright ownership. The ASF licenses this file +-- to you under the Apache License, Version 2.0 (the +-- "License"); you may not use this file except in compliance +-- with the License. You may obtain a copy of the License at +-- +-- http://www.apache.org/licenses/LICENSE-2.0 +-- +-- Unless required by applicable law or agreed to in writing, +-- software distributed under the License is distributed on an +-- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +-- KIND, either express or implied. See the License for the +-- specific language governing permissions and limitations +-- under the License. +SELECT + d_year, + s_city, + p_brand, + SUM(lo_revenue - lo_supplycost) AS PROFIT +FROM date, customer, supplier, part, lineorder -- dates +WHERE + lo_custkey = c_custkey + AND lo_suppkey = s_suppkey + AND lo_partkey = p_partkey + AND lo_orderdate = d_datekey + AND s_nation = 'UNITED STATES' + AND ( + d_year = 1997 + OR d_year = 1998 + ) + AND p_category = 'MFGR#14' +GROUP BY d_year, s_city, p_brand +ORDER BY d_year, s_city, p_brand;