Posts

Showing posts from December, 2020

Must-know things about Apache Spark Structured Streaming before using it!

Spark Structured Streaming in Depth – Introduction. Structured Streaming is a new high-level streaming API. It is purely declarative, based on automatically incrementalizing a static relational query (it supports Spark SQL and DataFrames). It supports end-to-end real-time applications as well as batch; the same real-time code can even be reused for a batch job with just a one-line code change. Once messages are received by the Structured Streaming engine, they are directly available as a DataFrame, and high performance is achieved through Spark SQL's code generation engine. According to the research paper, it outperforms Apache Flink by 2x and Apache Kafka Streams by 90x. The declarative API also supports rollbacks and code updates. Challenges we faced with the older streaming APIs: depending on the workload or messaging queue (Kafka/RabbitMQ/Kinesis), developers had to think about job triggering modes, state storage in checkpointing, and at-least-once message processing. Focus was more on streamin…
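
The one-line switch between streaming and batch is easiest to see in code. Below is a minimal sketch (illustrative, not from the post itself) of a Structured Streaming word count over a local socket source; under these assumptions, swapping readStream for read (and writeStream for write) turns the same DataFrame query into a batch job.

# Minimal Structured Streaming sketch (illustrative, not from the post):
# a word count over a socket source, written as an ordinary DataFrame query.
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("StreamingWordCount").getOrCreate()

# Streaming source; for a batch job this line becomes spark.read instead.
lines = spark.readStream \
    .format("socket") \
    .option("host", "localhost") \
    .option("port", 9999) \
    .load()

# The engine automatically incrementalizes this static relational query.
words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

# Streaming sink; for a batch job this becomes counts.write instead.
query = counts.writeStream \
    .outputMode("complete") \
    .format("console") \
    .start()
query.awaitTermination()

To feed the socket source locally, start a listener first (for example, nc -lk 9999 in another terminal) and type some lines; the updated counts appear on the console after each trigger.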

How to run or install PySpark 3 locally on Windows 10 / Mac / Linux / Ubuntu

PySpark 3 - Windows 10 / Mac / Ubuntu

1. Install jupyter and pyspark

pip install jupyter
pip install pyspark

2. Start the Jupyter server and run the sample pi example

# ref - https://github.com/apache/spark/blob/master/examples/src/main/python/pi.py
from random import random
from operator import add

from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession \
        .builder \
        .appName("PythonPi") \
        .getOrCreate()

    partitions = 100
    n = 100000 * partitions

    def f(_):
        # Sample a random point in the [-1, 1] x [-1, 1] square.
        x = random() * 2 - 1
        y = random() * 2 - 1
        # Count the point if it falls inside the unit circle.
        return 1 if x ** 2 + y ** 2 <= 1 else 0

    # Monte Carlo estimate: pi ~= 4 * (points inside circle) / (total points).
    count = spark.sparkContext.parallelize(range(1, n + 1), partitions) \
        .map(f) \
        .reduce(add)
    print("Pi is roughly %f" % (4.0 * count / n))

    spark.stop()

3. Check your Spark UI from Jupyter
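
For step 3, the Spark UI is normally served at http://localhost:4040 while the session is alive (assuming the default port is free); you can print the exact URL from the running session:

# Prints the URL of the Spark UI for this session, e.g. http://localhost:4040
print(spark.sparkContext.uiWebUrl)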