Anant / example-cassandra-dsbulk-with-sed-and-awk
License: No License Provided
Language: Awk
Learn how to combine the DataStax Bulk Loader with sed and awk to do Cassandra data operations.
In this walkthrough, we will use dsbulk to unload data from a DataStax Astra instance, transform it with awk and sed, and then load it into a Dockerized Apache Cassandra instance. You can do this walkthrough with any combination of two Cassandra distributions of the following types: Apache Cassandra, DataStax Enterprise, and DataStax Astra, or even move data within a single instance. More on that here. If you are working with deployed instances, you can use their contact points; more on that can be found here.
Clone this repository and cd into it:

git clone https://github.com/Anant/example-cassandra-dsbulk-with-sed-and-awk.git
cd example-cassandra-dsbulk-with-sed-and-awk
To set up DataStax Astra, hit the Create Database button on the dashboard, then the Get Started button. This is a pay-as-you-go plan, but they won't ask for a payment method until you exceed $25 worth of operations on your account. We won't be using nearly that amount, so it's essentially a free Cassandra database in the cloud.
Name the keyspace test, create the database, and wait a couple of minutes for it to spin up and become active.

On dashboard/<your-db-name>, click the Settings menu tab. Select Admin User for the role and hit generate token. To download the Secure Bundle, click the Connect tab in the menu and select Node.js (it doesn't matter which option you pick under Connect using a driver). Move the Secure Bundle into the cloned directory.

Next, load the sample data. Click the Upload Data button and drag the previous_employees_by_title.csv file into the section, then hit the Next button. Name the table previous_employees_by_title, change employee_id from text to uuid, change first_day and last_day to timestamp, set job_title as the Partition Key and employee_id as the Clustering Column, and hit the Next button. Select test as the target keyspace and hit the Next button to begin loading the csv.

Then start a Dockerized Apache Cassandra instance:

docker run --name cassandra -p 9042:9042 -d -v "$(pwd)":/example-cassandra-dsbulk-with-sed-and-awk cassandra:latest
docker exec -it cassandra cqlsh
source '/example-cassandra-dsbulk-with-sed-and-awk/days_worked_by_previous_employees_by_job_title.cql'
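That .cql file creates the destination table. Its schema presumably resembles the following sketch, inferred from the CSV header used later in this walkthrough; the actual file in the repo is authoritative:

```sql
CREATE TABLE IF NOT EXISTS test.days_worked_by_previous_employees_by_job_title (
    job_title text,
    employee_id uuid,
    employee_name text,
    number_of_days_worked int,
    PRIMARY KEY ((job_title), employee_id)
);
```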
We will cover two methods for loading data into Dockerized Apache Cassandra after unloading it from DataStax Astra. The first method breaks the process into two separate steps: unload, then load. The second method does everything in one pipe. The scenario is as follows: someone on our team wants us to take the previous_employees_by_title table and create a new table that records the time worked in days instead of a start time and end time. NOTE: This can be done within the same instance, but for the purposes of showing more of what dsbulk can do, we opted to move data between DataStax Astra and a Dockerized instance of Apache Cassandra.
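For reference, the source table's schema presumably looks like this, based on the column types and keys chosen during the Astra CSV import above (a sketch, not copied from the repo):

```sql
CREATE TABLE test.previous_employees_by_title (
    job_title text,
    employee_id uuid,
    employee_name text,
    first_day timestamp,
    last_day timestamp,
    PRIMARY KEY ((job_title), employee_id)
);
```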
In this command, we unload from DataStax Astra, run an awk script, apply a sed transformation, and then write the output to a CSV. The awk script cleans up the timestamp format of the first_day and last_day columns so that we can use the mktime() function to calculate the duration in seconds. The script then prints the first 3 columns followed by the calculated number of days worked for each row. We also have a conditional that flips negative values to positive; negatives occur because we used a CSV generator that produced random datetimes for those 2 columns across ~100,000 rows. Finally, we use sed to replace the header that awk spits out with the one we want for the destination table, and write the output to a new CSV called days_worked_by_previous_employees_by_job_title.csv.
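The repo's duration_calc.awk is not reproduced here, but based on the description above it presumably looks roughly like this sketch (the field positions and the exact input timestamp format are assumptions; mktime() is a gawk extension, which is why the pipeline invokes gawk):

```awk
# Sketch of a duration_calc.awk-style script (requires gawk for mktime()).
# Assumed input fields: job_title,employee_id,employee_name,first_day,last_day
BEGIN { FS = ","; OFS = "," }
{
    # Rewrite timestamps like "2020-01-31T09:00:00Z" into the
    # "YYYY MM DD HH MM SS" datespec string that mktime() expects.
    first = $4; last = $5
    gsub(/[-:TZ]/, " ", first)
    gsub(/[-:TZ]/, " ", last)
    days = int((mktime(last) - mktime(first)) / 86400)
    # Randomly generated datetimes can put last_day before first_day,
    # so flip negative durations to positive.
    if (days < 0) days = -days
    print $1, $2, $3, days
}
```

Note that the header row also passes through this script; mktime() returns -1 for both non-date header fields, so the fourth output column becomes 0, which would explain the stray 0 in the header that the sed step rewrites.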
NOTE: Replace the placeholders in the command with your own values.
dsbulk unload -k test -t previous_employees_by_title -b "secure-connect-<db>.zip" -u <Client ID> -p <Client Secret> | gawk -F, -f duration_calc.awk | sed 's/job_title,employee_id,employee_name,0/job_title,employee_id,employee_name,number_of_days_worked/' > days_worked_by_previous_employees_by_job_title.csv
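To see what the sed substitution does in isolation, here is a minimal illustration with a hand-typed copy of the mangled header line that awk emits:

```shell
# The header line as awk would emit it, piped through the same sed fix.
printf 'job_title,employee_id,employee_name,0\n' |
  sed 's/job_title,employee_id,employee_name,0/job_title,employee_id,employee_name,number_of_days_worked/'
# -> job_title,employee_id,employee_name,number_of_days_worked
```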
dsbulk load -url days_worked_by_previous_employees_by_job_title.csv -k test -t days_worked_by_previous_employees_by_job_title
select count(*) from test.days_worked_by_previous_employees_by_job_title ;
truncate table test.days_worked_by_previous_employees_by_job_title ;
dsbulk unload -k test -t previous_employees_by_title -b "secure-connect-<db>.zip" -u <Client ID> -p <Client Secret> | gawk -F, -f duration_calc.awk | sed 's/job_title,employee_id,employee_name,0/job_title,employee_id,employee_name,number_of_days_worked/' | dsbulk load -k test -t days_worked_by_previous_employees_by_job_title
select count(*) from test.days_worked_by_previous_employees_by_job_title ;
And that wraps up how we can quickly do some Cassandra data operations using dsbulk, awk, and sed.