
Posts

Featured post

How to test AWS Glue jobs locally / run AWS Glue jobs locally – Part 1

AWS Glue – Local Testing using Apache Spark 2.4.3

Recently, AWS released its Glue libraries on GitHub: https://github.com/awslabs/aws-glue-libs. You can download either Glue 0.9 or Glue 1.0 from the corresponding branch.

glue-1.0: git clone -b glue-1.0 https://github.com/awslabs/aws-glue-libs.git
glue-0.9: git clone https://github.com/awslabs/aws-glue-libs.git

Prerequisites:
Maven 3.6.0 or higher – https://aws-glue-etl-artifacts.s3.amazonaws.com/glue-common/apache-maven-3.6.0-bin.tar.gz
Spark 2.2.x or higher –
https://aws-glue-etl-artifacts.s3.amazonaws.com/glue-1.0/spark-2.4.3-bin-hadoop2.8.tgz
https://aws-glue-etl-artifacts.s3.amazonaws.com/glue-0.9/spark-2.2.1-bin-hadoop2.7.tgz

Step 1: Install and configure Maven and Apache Spark – adjust the paths to match your installation
>> wget https://aws-glue-etl-artifacts.s3.amazonaws.com/glue-1.0/spark-2.4.3-bin-hadoop2.8.tgz
>> vi ~/.bashrc
export M2
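Once Maven, Spark and the Glue libraries are installed, a quick way to confirm the local setup works is a tiny smoke-test script. This is a minimal sketch, assuming aws-glue-libs and PySpark are already on your PYTHONPATH; the column names and the "smoke_test" label are just illustrative.

    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from awsglue.dynamicframe import DynamicFrame

    sc = SparkContext.getOrCreate()
    glueContext = GlueContext(sc)
    spark = glueContext.spark_session

    # Tiny in-memory DataFrame converted to a Glue DynamicFrame, just to prove
    # that the awsglue libraries and the local Spark install are wired together.
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
    dyf = DynamicFrame.fromDF(df, glueContext, "smoke_test")
    dyf.printSchema()
    print("record count:", dyf.count())

If this prints a schema and a count of 2, the local Glue environment is ready for real job scripts.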
Recent posts

Databricks Integration with Azure Application Insight (Log4j) logs

In this cloud era, there is always a need to capture our application logger output in some central location. We are using Azure Application Insights to collect and analyze application logs.

Reference –
https://github.com/AnalyticJeremy
https://medium.com/analytics-vidhya/configure-azure-data-bricks-to-send-events-to-application-insights-simplified-c6effbc3ed6a

I would like to thank Jeremy Peach for creating such a useful listener and Balamurugan Balakreshnan for the Medium post. This post is a simplified summary of sending Spark/Databricks application logs to Azure Application Insights.

Step 1 – Get the required credentials; we need three things:
Application Insights > Instrumentation Key
Log Analytics workspace > Workspace ID
Log Analytics workspace > Agents management > Linux servers > Primary key

Step 2 – Clone the git repo
git clone https://github.com/AnalyticJeremy/Azure-Databricks-Monitoring.git

Step 3 – Edit appinsights_log
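For context, the log messages that end up in Application Insights are the ones your code writes through the cluster's Log4j logger. A minimal sketch of emitting such messages from a Databricks notebook; the logger name "com.example.etl" is illustrative, and it assumes the appender from the cloned repo is already configured on the cluster.

    # Write to the driver's Log4j logger via the JVM gateway exposed by PySpark.
    log4j = spark._jvm.org.apache.log4j
    logger = log4j.LogManager.getLogger("com.example.etl")

    logger.info("ETL job started")
    logger.warn("Row count lower than expected")
    logger.error("Failed to write output")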

Must know things about Apache Spark Structured Streaming before using it!

Spark Structured Streaming in Depth – Introduction

Structured Streaming is a new high-level streaming API. It is purely declarative, based on automatically incrementalizing a static relational query (it supports Spark SQL and DataFrames). It supports end-to-end real-time applications as well as batch (we can even reuse the same real-time code for batch jobs with just a one-line change). Once messages are received by the Structured Streaming engine, they are directly available as a DataFrame, and the engine achieves high performance using Spark SQL's code generation engine. According to the research paper, it outperforms Apache Flink by 2x and Apache Kafka Streams by 90x. The declarative API also supports rollbacks and code updates.

Challenges we faced with the older streaming APIs: depending on the workload or messaging queue (Kafka/RabbitMQ/Kinesis), developers need to think about job triggering modes, state storage in checkpointing, and at-least-once message processing. The focus was more on streaming
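To make the "declarative, same API as batch" point concrete, here is a minimal sketch of a streaming query using the built-in rate source and console sink (both are testing conveniences; the checkpoint path and rows-per-second value are illustrative).

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import window

    spark = SparkSession.builder.appName("StructuredStreamingDemo").getOrCreate()

    # The built-in "rate" source generates (timestamp, value) rows for testing.
    events = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

    # Same declarative DataFrame API as batch: a windowed count over event time.
    counts = events.groupBy(window(events.timestamp, "10 seconds")).count()

    query = (counts.writeStream
             .outputMode("complete")
             .format("console")
             .option("checkpointLocation", "/tmp/ss-checkpoint")  # illustrative path
             .start())
    query.awaitTermination()

Swapping the rate source for Kafka or Kinesis changes only the readStream options; the query itself stays the same, which is the main appeal of the API.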

How to run or install PySpark 3 locally Windows 10 / Mac / Linux / Ubuntu

PySpark 3 – Windows 10 / Mac / Ubuntu

1. Install jupyter and pyspark
pip install jupyter
pip install pyspark

2. Start the jupyter server and run the sample pi example code

    # ref - https://github.com/apache/spark/blob/master/examples/src/main/python/pi.py
    import sys
    from random import random
    from operator import add

    from pyspark.sql import SparkSession

    if __name__ == "__main__":
        spark = SparkSession \
            .builder \
            .appName("PythonPi") \
            .getOrCreate()

        partitions = 100
        n = 100000 * partitions

        def f(_):
            x = random() * 2 - 1
            y = random() * 2 - 1
            return 1 if x ** 2 + y ** 2 <= 1 else 0

        count = spark.sparkContext.parallelize(range(1, n + 1), partitions) \
            .map(f).reduce(add)
        print("Pi is roughly %f" % (4.0 * count / n))

        spark.stop()

3. Check your Spark UI from Jupyter
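For step 3, one quick way to find the UI is to ask the running SparkContext from the same notebook, before calling spark.stop(); a small sketch (on a default local install the address is usually http://localhost:4040):

    # Print the address of the web UI for the active SparkContext.
    print(spark.sparkContext.uiWebUrl)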

Run AWS Glue Job in PyCharm Community Edition – Part 2

Run AWS Glue Job in PyCharm IDE – Community Edition

Step 1: In PyCharm, install PySpark using
>> pip install pyspark==2.4.3

Step 2: Prebuilt AWS Glue-1.0 jar with Python dependencies:
>> Download_Prebuild_Glue_Jar

Step 3: Copy the awsglue folder and the jar file into your PyCharm project
>> https://github.com/awslabs/aws-glue-libs/tree/glue-1.0/awsglue

Step 4: Copy the Python code from my git repository
>> https://github.com/sardetushar/awsglue-pycharm-local-dev

Step 5: Project structure

Step 6: On the console, type the following (make sure to use your own path) – a skeleton of what such a script contains is sketched below
>> python com/mypackage/pack/glue-spark-pycharm-example.py

Step 7: Any issues, comment here :)

In Part 3, we'll see a more advanced example, like AWS Glue-1.0 with the Snowflake database.
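For reference, the script you run in Step 6 generally follows the standard Glue job skeleton. This is a minimal sketch, not the exact contents of the repository's example; it assumes awsglue and PySpark are importable locally, and the job name passed on the command line is up to you.

    import sys

    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    # Standard Glue boilerplate: when running locally you pass the job name yourself,
    # e.g. python glue-spark-pycharm-example.py --JOB_NAME local-test
    args = getResolvedOptions(sys.argv, ["JOB_NAME"])

    sc = SparkContext()
    glueContext = GlueContext(sc)
    spark = glueContext.spark_session

    job = Job(glueContext)
    job.init(args["JOB_NAME"], args)

    # ... your extract/transform/load logic goes here ...

    job.commit()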

Create Cloud SQL Instance on Google Cloud

In this tutorial we will understand how to create a Cloud SQL instance on Google Cloud.

From the Google Cloud Console menu, go to SQL and click Create instance. Next, choose the MySQL or PostgreSQL option and choose the latest second generation. Next, fill in the instance ID and root password, select the region and zone, and click Create.

It will take some time to create the SQL instance. Once you see the green light, click on the instance name, go to the Overview tab, and copy the Public IP address somewhere (e.g., Notepad); we need this public IP address so that we can access the SQL instance from a Google Cloud Compute VM.

Next, let's create MySQL instance users (we can create databases as well). I am going to create "mysqluser": just click the Create user account button and enter a username and password. Below we can see that mysqluser has been created.

Now, the final step is to authorize an IP address for our SQL instance; for that, go to the Connections tab and f
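Once the VM's IP has been authorized on the Connections tab, connecting from the Compute Engine VM looks roughly like the sketch below. This is an illustration only, assuming the PyMySQL client (pip install pymysql); the host IP, password, and database name are placeholders.

    import pymysql

    # Connect to the Cloud SQL instance using the Public IP copied earlier
    # and the "mysqluser" account created above (placeholder credentials).
    conn = pymysql.connect(
        host="203.0.113.10",
        user="mysqluser",
        password="your-password",
        database="mydb",
    )
    with conn.cursor() as cur:
        cur.execute("SELECT VERSION()")
        print(cur.fetchone())
    conn.close()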

Accessing phpmyadmin and MySQL on Google Cloud using SSH putty

In the previous tutorial (Part 1) we learnt how to install LAMP (PHP, MySQL, phpMyAdmin, etc.) using Google Cloud Launcher. Now, let's understand how to access phpMyAdmin using PuTTY SSH.

First, download PuTTYgen (we need this to generate keys for the Google Cloud Compute Engine VM): Download PuttyGen.exe. Click the Generate button, and remember to move your cursor around until the green progress bar is complete. Once it finishes you will see the key; now change the Key comment to bitnami (I am using the bitnami user on the Google Cloud VM), then save the private key somewhere – we need to refer to this key in PuTTY.

Now copy the public key (first red box in the image above) from the PuTTY Key Generator window, go to Google Cloud Console > Compute Engine > VM Instances, click Edit, and paste the SSH public key into the SSH Keys option.

Open PuTTY on your local machine, enter the hostname IP and a session name, and click Save. In PuTTY, go to Connection > Data and enter Auto-
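The walkthrough above uses PuTTY, but the same key-based login can also be scripted. Here is a rough illustration with the paramiko library (not used in the post); it assumes the private key was exported in OpenSSH format from PuTTYgen, and the host IP and key path are placeholders.

    import paramiko

    # Key-based SSH login as the bitnami user, equivalent to the PuTTY session above.
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect("203.0.113.10", username="bitnami", key_filename="bitnami-key-openssh.pem")

    stdin, stdout, stderr = client.exec_command("mysql --version")
    print(stdout.read().decode())
    client.close()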

Creating cloud storage bucket on Google Cloud and Upload data using web console

Let's understand step by step how to create a Google Cloud Storage bucket using the web console.

From the left menu, go to Storage > Browser and click Create bucket. Enter the bucket name and look at the storage class options carefully; you will see that the charges differ for each storage class. Once you finalize the storage class, click Create, and you will see the Upload files, Upload folder and Create folder buttons. Now let's upload files
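The same upload can also be done programmatically with the google-cloud-storage client library, as a rough sketch; the bucket and file names are placeholders, and credentials are assumed to be configured already (for example via gcloud auth).

    from google.cloud import storage

    # Upload a local file into the bucket created through the web console.
    client = storage.Client()
    bucket = client.bucket("my-demo-bucket")
    blob = bucket.blob("uploads/sample.csv")
    blob.upload_from_filename("sample.csv")
    print("Uploaded to gs://my-demo-bucket/uploads/sample.csv")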