Supercomputing for Big Data :computer: SPARK KAFKA :sparkles:
Stream Processing with Kafka :surfer:
Based on the "ETL Is Dead, Long Live Streams" talk by Neha Narkhede.
Generate real-time insights from all kinds of data sources!
- Topics
- Each message that is fed into Kafka must be part of a topic.
- Each topic can have multiple partitions, so each instance only needs to host a partition rather than the entire topic.
- Each partition is also replicated for fault tolerance.
- Each partition has a leader (which handles reads/writes) and a few followers (which passively replicate the leader).
- Hence, ordering is guaranteed only within a partition.
- Within a partition, records are assigned sequential IDs called offsets.
- Producers
- These apps publish messages to one of the topics.
- The producer chooses which partition to publish to (or uses round robin). See the producer/consumer sketch after this list.
- Consumers
- These apps subscribe to one or more topics and consume the data.
- Broker
- Each running instance of Kafka (on a single machine or as part of a cluster).
- If Kafka brokers are the storehouses (godowns) of a restaurant, the chefs are the consumers.
- Streams API
- Reads a stream (from one or more topics) and produces a stream back to one or more topics.
- Uses Kafka's stateful storage.
- Aggregations etc. :bomb:
- Simple is Beautiful
- It is the T in ETL
- Connector API
- Allows building reusable producers/consumers that connect Kafka to existing data systems.
- It is the E and L in ETL
- Kafka > Traditional Queues
- It allows both queueing (load shared among consumers within a consumer group) and publish-subscribe (messages broadcast to all consumer groups), instead of having to choose one.
- Kafka relies on `zookeeper` to run.
- WOW: Treat past data and incoming stream data the same way! And replace everything else used for data processing.
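As a concrete illustration of topics, partitions, producers, consumers, and consumer groups, here is a minimal sketch that uses the standard Kafka Java client from Scala. The broker address, the topic name `sensor-readings`, and the group id `kitchen` are made-up examples for this sketch, not part of the course material.

```
import java.time.Duration
import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import scala.collection.JavaConverters._

// Producer: publishes messages to a topic. The record key determines the
// partition, so messages with the same key keep their relative order.
val pProps = new Properties()
pProps.put("bootstrap.servers", "localhost:9092")
pProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
pProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
val producer = new KafkaProducer[String, String](pProps)
producer.send(new ProducerRecord[String, String]("sensor-readings", "sensor-1", "3/10/14:1:01,42.0"))
producer.close()

// Consumer: subscribes to the topic. Consumers sharing a group.id split the
// partitions among themselves (queueing); different groups each receive all
// messages (publish-subscribe).
val cProps = new Properties()
cProps.put("bootstrap.servers", "localhost:9092")
cProps.put("group.id", "kitchen")
cProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
cProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
val consumer = new KafkaConsumer[String, String](cProps)
consumer.subscribe(Collections.singletonList("sensor-readings"))
for (record <- consumer.poll(Duration.ofSeconds(1)).asScala)
  println(s"partition=${record.partition} offset=${record.offset} value=${record.value}")
consumer.close()
```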
Lecture 2
- MapReduce:
- Map - Loading of data and defining a set of keys
- Shuffling is automatic
- Reduce - Collect key based data to process and output
- For complex work, chain jobs together (each job will have map-reduce)
- Use a higher level language (Luigi :snake:)
- MapReduce is slow and difficult to master
- Apache Spark to the rescue:
- Advantages
- Parallel distribution
- High level API
- Fault tolerance
- In memory computing
- Spark’s cool libraries: (discussed in next lecture)
- Spark SQL
- Spark Streaming (real time streaming)
- MLlib (Machine Learning)
- GraphX (Graph processing)
- Cluster managers: Standalone scheduler, YARN, Mesos
- Resilient Distributed Datasets (RDDs): Spark's USP and primary data abstraction
- RDD
- Distributed collection of elements
- Parallelize data across cluster
- Enables Caching
- Data spills to disk if memory is exceeded
- Tracks computation for recomputing lost data. (Called lineage)
- Types of operations:
  - Transformations - build up a DAG and are evaluated lazily (`map`, `filter`, `flatMap`)
  - Actions - actually perform the transformations and return a value (`collect`, `take`, `count`, `reduce`, `saveAsTextFile`)
- Use `.toDebugString` to view the RDD transformation DAG.
- Use `rdd.cache()` (equivalent to `rdd.persist(StorageLevel.MEMORY_ONLY)`) to keep an intermediate transformed result in memory so that actions can be applied to it repeatedly.
- Awesome tuple (pair-RDD) transformations: `groupByKey()`, `reduceByKey((x, y) => x + y)`, `sortByKey()`
- `rdd1.join(rdd2)` joins the two RDDs on their keys. (See the word-count sketch below for transformations, actions, and caching in practice.)
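To tie these pieces together, here is a minimal word-count sketch for the Spark shell (where `sc` is the predefined SparkContext). The input path reuses the Lab 1 sensor file, and the small lookup RDD at the end is made up purely to demonstrate `join`.

```
// Transformations only build up the lineage DAG; nothing is computed yet.
val words = sc.textFile("./lab1/sensordata.csv")
  .flatMap(_.split(","))            // "Map" phase: emit individual fields
  .map(w => (w, 1))                 // pair RDD of (key, 1)

val counts = words
  .reduceByKey((x, y) => x + y)     // wide dependency: a shuffle groups equal keys ("Reduce" phase)
  .cache()                          // same as persist(StorageLevel.MEMORY_ONLY)

println(counts.toDebugString)       // inspect the transformation DAG / lineage

// Actions trigger the actual computation.
val total = counts.count()
counts.sortBy(_._2, ascending = false).take(10).foreach(println)

// Pair-RDD join: combines two RDDs on their keys (values become tuples).
val lookup = sc.parallelize(Seq(("sensor-1", "hallway"), ("sensor-2", "kitchen")))
val joined = counts.join(lookup)    // RDD[(String, (Int, String))]
```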
Lab 1
Useful Spark commands and notes:
- Narrow dependency: operations like `map`; wide dependency: operations like `reduce`, which require shuffling data across partitions.
- Lazy evaluation: Not computed till requested. (Till then only the DAG is calculated)
- Create an RDD using `sc.parallelize` to enter Spark's programming paradigm.
- Load CSVs using `sc.textFile("<name>")`.
- Create filtered representations of the RDD using commands like:
```
val filterRDD = raw_data.map(_.split(",")).filter(x => x(1) == "3/10/14:1:01")
filterRDD.foreach(a => println(a.mkString(" ")))
```
- Use DataFrames and Datasets (type-checked DataFrames) to define the schema of the data being read, get type checking, etc. (one way to obtain the `schema` value is sketched below the block):
```
case class SensorData(
  sensorName: String,
  timestamp: Timestamp,
  numA: Double,
  numB: Double,
  numC: Long,
  numD: Double,
  numE: Long,
  numF: Double
)

val ds = spark.read
  .schema(schema)
  .option("timestampFormat", "MM/dd/yy:hh:mm")
  .csv("./lab1/sensordata.csv")
  .as[SensorData]
```
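The `schema` value referenced above is not defined in these notes. A minimal way to derive it from the `SensorData` case class (an assumption, not necessarily what the lab uses) is via Spark's `Encoders`:

```
import java.sql.Timestamp
import org.apache.spark.sql.Encoders

// Assumed helper: derive a StructType schema from the SensorData case class
// instead of writing the StructType out field by field.
val schema = Encoders.product[SensorData].schema
```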
Lecture 1
- 3V model
- Volume
- Velocity
- Variety
- 80% of the data generated is unstructured
- Big data pipeline phases
- Sense/Acquire
- Store/Ingest
- Retrieve/filter
- Analyze
- Visualize
- Spark uses RDDs (in memory), so it is significantly faster than traditional MapReduce (at least for batch processing)
Lab 0
- Hadoop is not a replacement for Relational Database
- It complements online transaction processing and online analytical processing
- Used for structured and unstructured data (large quantities)
- Not good for:
- Hadoop is not good for processing transactions, due to its lack of random access.
- It is not good when the work cannot be parallelized or when there are dependencies within the data, i.e., when record one must be processed before record two.
- It is not good for low latency data access.
- Not good for processing lots of small files
- Ambari provides a GUI for managing a Hadoop cluster.