diff --git a/README.md b/README.md
index 9eefed61f..bd59331f2 100644
--- a/README.md
+++ b/README.md
@@ -38,8 +38,74 @@ Additionally, we found that having a close integration between the data warehouse
 - **Bring Your Own Storage:** We felt that customers should own their data and not be locked into a particular storage engine.
 
 ## Quickstart
-Have
+1. Dependencies:
+   - Make sure that you have [Docker Engine](https://docs.docker.com/engine/install/) installed.
+   - Install [Python](https://www.python.org/downloads/) if you haven't already.
+   - Install a [MySQL client](https://dev.mysql.com/downloads/mysql/) on your system.
+   - Have an AWS account with S3 access ready.
+
+2. Clone the repository:
+
+```bash
+git clone https://github.com/buster-so/warehouse.git
+```
+
+3. Run the warehouse:
+
+```bash
+docker compose up -d
+```
+
+4. Populate the `.env` file with AWS credentials provisioned for S3 access. **Note: You can use any S3-compatible storage; you may just need to tweak some of the configs.** Feel free to look at the StarRocks [docs](https://docs.starrocks.com/en-us/main/loading/iceberg/iceberg_external_catalog) or PyIceberg [docs](https://iceberg.apache.org/docs/latest/spark-configuration/) for more information.
+
+5. Connect to the warehouse with any MySQL client.
+
+6. Create the external catalog:
+
+```sql
+CREATE EXTERNAL CATALOG 'public'
+PROPERTIES
+(
+    "type"="iceberg",
+    "iceberg.catalog.type"="rest",
+    "iceberg.catalog.uri"="http://iceberg-rest:8181",
+    "iceberg.catalog.warehouse"="",
+    "aws.s3.access_key"="",
+    "aws.s3.secret_key"="",
+    "aws.s3.region"="",
+    "aws.s3.enable_path_style_access"="true",
+    "client.factory"="com.starrocks.connector.iceberg.IcebergAwsClientFactory"
+);
+```
+
+7. Seed the data. If you want to populate a table with 75M records, you can run the notebook found [here](/python/populate_warehouse.ipynb).
+
+8. Set the catalog:
+
+```sql
+SET CATALOG 'public';
+```
+
+9. Set the database:
+
+```sql
+USE DATABASE 'public';
+```
+
+10. Run a query:
+
+```sql
+SELECT COUNT(*) FROM public.nyc_taxi;
+```
+
+### Optimizations
+
+For data that you expect to be accessed frequently, you can cache it on disk for faster access:
+
+```sql
+CACHE SELECT * FROM public.nyc_taxi WHERE tpep_pickup_datetime > '2022-03-01';
+```
 
 ## Roadmap
diff --git a/python/populate_warehouse.ipynb b/python/populate_warehouse.ipynb
index e90131c91..5f3306b8b 100644
--- a/python/populate_warehouse.ipynb
+++ b/python/populate_warehouse.ipynb
@@ -2,7 +2,7 @@
  "cells": [
   {
    "cell_type": "code",
-   "execution_count": 162,
+   "execution_count": 1,
    "metadata": {},
    "outputs": [
     {
@@ -19,7 +19,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 163,
+   "execution_count": 2,
    "metadata": {},
    "outputs": [],
    "source": [
@@ -28,7 +28,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 164,
+   "execution_count": 3,
    "metadata": {},
    "outputs": [],
    "source": [
@@ -37,7 +37,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 165,
+   "execution_count": 4,
    "metadata": {},
    "outputs": [],
    "source": [
@@ -48,7 +48,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 166,
+   "execution_count": 5,
    "metadata": {},
    "outputs": [],
    "source": [
@@ -57,7 +57,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 167,
+   "execution_count": 6,
    "metadata": {},
    "outputs": [],
    "source": [
@@ -95,7 +95,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 168,
+   "execution_count": 7,
    "metadata": {},
    "outputs": [],
    "source": [
@@ -109,7 +109,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 169,
+   "execution_count": 8,
    "metadata": {},
    "outputs": [],
    "source": [
@@ -123,7 +123,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 170,
+   "execution_count": 9,
    "metadata": {},
    "outputs": [],
    "source": [
@@ -143,14 +143,21 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 171,
+   "execution_count": 10,
    "metadata": {},
    "outputs": [
     {
      "name": "stderr",
      "output_type": "stream",
      "text": [
-      "Appending files: 100%|██████████| 26/26 [10:19<00:00, 23.83s/it, Appended yellow_tripdata_2021-09.parquet]\n"
+      "Appending files: 0%| | 0/26 [00:00
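Quickstart steps 8–10 from the README diff above can also be scripted. A minimal sketch, assuming the `pymysql` driver (any MySQL-compatible client works) and a warehouse reachable on the host and port below — host, port, and credentials are placeholders, not values from the repo:

```python
# The statements mirror quickstart steps 8-10 from the README verbatim.
QUICKSTART_STATEMENTS = [
    "SET CATALOG 'public'",                   # step 8: switch to the external catalog
    "USE DATABASE 'public'",                  # step 9: select the database
    "SELECT COUNT(*) FROM public.nyc_taxi",   # step 10: run a query
]


def run_quickstart(host="127.0.0.1", port=9030, user="root", password=""):
    """Execute the quickstart statements and return the row count.

    Connection details are assumptions for a local deployment; adjust
    them for yours. The import is deferred so this module loads even
    when pymysql is not installed.
    """
    import pymysql

    conn = pymysql.connect(host=host, port=port, user=user, password=password)
    try:
        with conn.cursor() as cur:
            for stmt in QUICKSTART_STATEMENTS:
                cur.execute(stmt)
            (count,) = cur.fetchone()  # result of the final SELECT COUNT(*)
            return count
    finally:
        conn.close()
```

With the warehouse from step 3 running, `run_quickstart()` returns the `nyc_taxi` row count.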
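The notebook's seeding loop (the "Appending files" progress output in the diff above) can be sketched with PyIceberg, assuming the REST catalog URI from step 6 and a list of local parquet file paths — the function name and defaults here are illustrative, not the notebook's actual code:

```python
def append_parquet_files(paths, table_name="public.nyc_taxi",
                         uri="http://localhost:8181"):
    """Append each parquet file to the Iceberg table as one commit.

    Imports are deferred so the module loads without pyiceberg,
    pyarrow, or tqdm installed; the catalog URI and table name are
    assumptions based on the quickstart configuration.
    """
    import pyarrow.parquet as pq
    from pyiceberg.catalog.rest import RestCatalog
    from tqdm import tqdm

    catalog = RestCatalog("default", uri=uri)
    table = catalog.load_table(table_name)
    for path in tqdm(paths, desc="Appending files"):
        # Each append is a separate Iceberg snapshot/commit.
        table.append(pq.read_table(path))
```

Committing per file, as the notebook's per-file progress suggests, keeps memory bounded at one month of trip data at a time.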