As the big data space expands and the relevant technology matures, it is necessary to distinguish the roles of the data scientist and the data engineer. Among different and equally valid views, one can argue that data scientists have a math/statistics background that enables them to create efficient algorithms that fit certain big data sets well and extract valuable information from them. By contrast, data engineers understand various technologies and frameworks in depth, with an emphasis on distributed/clustered systems. Using these skills they create data pipelines, which is far from trivial in a clustered computing environment.
As organizations must nowadays embrace a more distributed model to deal with everything from content management to big data, the highly centralized data center is becoming a thing of the past. A computational cluster designed specifically for storing and analyzing huge amounts of unstructured data in a distributed computing environment is therefore often necessary. Such clusters run specialized distributed processing software on low-cost commodity machines and speed up data analysis. They are also highly scalable: if a cluster's processing power is overwhelmed by growing volumes of data, additional nodes can be added to increase throughput. Clusters are also highly resistant to failure because each piece of data is replicated to other nodes, which keeps the data available if one node fails. These properties explain why big data frameworks are generally built on the architectural principle of clustering. In this context, messaging systems have been widely adopted. Messaging is a key technology for efficiently integrating independent applications that run in parallel on a cluster. Its importance is such that it has recently been elevated to a standalone distributed communication infrastructure that offers high throughput, reliable communication and other advanced features that are crucial in the big data era.
Apache Kafka is a popular distributed streaming platform that acts as a messaging queue or an enterprise messaging system. It lets you publish and subscribe to streams of records and process them in a fault-tolerant way as they occur. Kafka addresses a real-time challenge that many software systems face: handling large volumes of information as they arrive and routing them quickly to multiple consumers. Kafka provides seamless integration between producers and consumers without blocking the producers of data, and without producers needing to know who the final data consumers are. Apache Kafka is an open source, distributed publish-subscribe messaging system, designed mainly with the following characteristics:
- Persistent messaging: To derive real value from big data, no information loss can be afforded. Apache Kafka is designed with a so-called retention mechanism that persists all messages to its internal log structures for a certain amount of time.
- High throughput: Keeping big data in mind, Kafka is designed to work on commodity hardware organized in clusters and to support millions of messages per second.
- Distributed: Apache Kafka explicitly supports partitioning messages over Kafka servers and distributing consumption over a cluster of consumer machines while maintaining per-partition ordering semantics.
- Multiple client support: The Apache Kafka system supports easy integration of clients from different platforms such as Java, .NET, PHP, Ruby, and Python.
- Real time: Messages produced by the producer threads should be immediately visible to consumer threads; this feature is critical to event-based systems such as Complex Event Processing (CEP) systems.
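The partitioning and retention characteristics above can be illustrated with a small in-memory model. This is only a sketch, not Kafka's implementation: `ToyTopic` and its methods are hypothetical names, CRC32 stands in for whatever key hash a real client uses, and retention is modelled as a simple time cutoff. It shows that records sharing a key land in the same partition and keep their order there, and that records older than the retention window are dropped.

```python
import zlib
from collections import defaultdict


class ToyTopic:
    """Hypothetical in-memory stand-in for a partitioned, time-retained topic."""

    def __init__(self, num_partitions, retention_seconds):
        self.num_partitions = num_partitions
        self.retention_seconds = retention_seconds
        # One append-only list of (timestamp, key, value) per partition.
        self.partitions = defaultdict(list)

    def partition_for(self, key):
        # Assumption: CRC32 is an illustrative hash, not the one Kafka clients use.
        return zlib.crc32(key.encode("utf-8")) % self.num_partitions

    def append(self, key, value, timestamp):
        # Same key -> same partition, so per-key order is preserved.
        partition = self.partition_for(key)
        self.partitions[partition].append((timestamp, key, value))
        return partition

    def expire(self, now):
        # Drop records older than the retention window; remaining order is kept.
        cutoff = now - self.retention_seconds
        for partition in list(self.partitions):
            log = self.partitions[partition]
            self.partitions[partition] = [r for r in log if r[0] >= cutoff]

    def read(self, partition):
        return [(key, value) for _, key, value in self.partitions[partition]]


topic = ToyTopic(num_partitions=3, retention_seconds=60)
events = [("user-1", "login"), ("user-2", "login"),
          ("user-1", "click"), ("user-1", "logout")]
for t, (key, value) in enumerate(events):
    topic.append(key, value, timestamp=t)

p = topic.partition_for("user-1")
before_expiry = [v for k, v in topic.read(p) if k == "user-1"]

topic.expire(now=62)  # cutoff is t=2, so records from t=0 and t=1 are dropped
after_expiry = [v for k, v in topic.read(p) if k == "user-1"]
```

Ordering is only guaranteed within a partition in this model, which mirrors the per-partition ordering semantics described in the list; records with different keys may be interleaved arbitrarily across partitions.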
Overall, Apache Kafka aims to unify offline and online processing by providing a mechanism for parallel loading into Hadoop systems as well as the ability to partition real-time consumption over a cluster of machines. Kafka can be used efficiently for processing stream data, but from an architectural perspective it is closer to traditional messaging systems. A typical big data aggregation-and-analysis scenario supported by the Apache Kafka messaging system could include different kinds of producers that produce any kind of real-time log data (e.g. web logs, analytics logs, etc.), and different kinds of consumers, such as the following:
- Offline consumers that are consuming messages and storing them in Hadoop or a traditional data warehouse for offline analysis
- Near real-time consumers that are consuming messages and storing them in any NoSQL datastore such as HBase or Cassandra for near real-time analytics
- Real-time consumers that filter messages and trigger alert events for related groups
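The scenario above can be sketched with a toy log in which each consumer tracks its own read position, so the producer never needs to know who is reading. Everything here is an illustrative assumption: `ToyLog`, the consumer names and the record fields are hypothetical, and the "warehouse" and "NoSQL store" are plain Python objects rather than real systems.

```python
class ToyLog:
    """Hypothetical append-only list standing in for a single topic partition."""

    def __init__(self):
        self.records = []
        self.offsets = {}  # consumer name -> next offset to read

    def publish(self, record):
        # The producer only appends; it knows nothing about consumers.
        self.records.append(record)

    def poll(self, consumer, max_records=100):
        # Each consumer advances its own offset independently of the others.
        start = self.offsets.get(consumer, 0)
        batch = self.records[start:start + max_records]
        self.offsets[consumer] = start + len(batch)
        return batch


log = ToyLog()
for line in [
    {"path": "/home", "status": 200},
    {"path": "/cart", "status": 500},
    {"path": "/home", "status": 200},
]:
    log.publish(line)

# Offline consumer: collects everything for a later batch load.
warehouse_batch = log.poll("offline")

# Near real-time consumer: keeps a running count per path,
# standing in for writes to a NoSQL store.
hits_per_path = {}
for record in log.poll("near-real-time"):
    hits_per_path[record["path"]] = hits_per_path.get(record["path"], 0) + 1

# Real-time consumer: filters for server errors and collects alerts.
alerts = [r["path"] for r in log.poll("real-time") if r["status"] >= 500]
```

Because every consumer keeps a separate offset, all three see the full stream, and a slow offline consumer does not hold back the real-time one.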
Although Kafka spans a wide range of applications and use cases, the main objective of this code.learn program is to provide a guide to designing and architecting enterprise-grade streaming applications using Apache Kafka. It includes best practices for building such applications and tackles some common challenges, such as how to use Kafka efficiently to handle high data volumes. The program first takes the participant through what a messaging system is and then provides a thorough introduction to Apache Kafka and its internal details. Once the participant grasps the basics, it continues with more advanced concepts in Apache Kafka such as capacity planning, fault tolerance and security.
In particular, the main learning objectives are the following:
- Introduction to Messaging Systems.
- Introduction to Kafka – the distributed/clustered messaging platform.
- Deep dive into Kafka producers / consumers.
- Build ETL pipelines using Kafka Connect & in-depth technical concepts surrounding it.
- Kafka Cluster deployment.
- Kafka Security aspects (e.g. set up and use SSL authentication in Kafka, configure Kafka clients to work with security).
- Learn the Kafka Streams data processing library and build streaming applications with exactly-once semantics support out of the box.
- Using Kafka in Big Data Applications, which covers how to manage high volumes in Kafka, how to ensure guaranteed message delivery, the best ways to handle failures without any data loss, and some governance principles that can be applied while using Kafka in big data pipelines.
By the end of this seminar, participants are expected to be able to describe in depth Kafka's capabilities and flexibility as a messaging system and as a strong candidate solution for big data applications, as well as to design and implement robust and reliable big data messaging/streaming solutions.
- This specific Code.Learn program lasts 3 days (Thursday, Friday & Saturday) with 16 hours of lectures and hands-on exercises on a real-life project.
Data engineers, BI engineers, DW developers, data integration developers, computer scientists, software engineers and developers are welcome to participate in this code.learn program and make full use of the topics taught to advance their future careers.