Gone are the days when data/ML engineers had to repeatedly package their data processing logic into a Spark application and submit it to the cluster just to test, tweak, and optimize that logic. Spark Connect can now power the local computing environment across all platforms with seamless access to Spark’s cluster computing engine.
Cluster computing on Spark is mainly accessible in one of two ways: through a Spark shell launched on a node that has access to the cluster, or by packaging the desired data processing logic into a Spark application and submitting the latter to the cluster manager via the spark-submit command. However, the submission again has to occur on a node that has access to the cluster.
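For context, here is a minimal sketch of that traditional workflow (the application file name and input path below are hypothetical):

```python
# my_app.py - a self-contained Spark application that must be re-packaged and
# re-submitted to the cluster every time the processing logic changes, e.g.:
#
#   spark-submit --master spark://<cluster-manager-host>:7077 my_app.py
#
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("my-batch-app").getOrCreate()

# Hypothetical processing logic over a hypothetical input path.
df = spark.read.parquet("/data/events")
df.groupBy("event_type").count().show()

spark.stop()
```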
These constraints make it difficult for data engineers to seamlessly test their code on an actual cluster while they construct their data processing logic using Spark APIs. They also prevent data applications from seamlessly leveraging Spark’s cluster computing capabilities on an on-demand basis.
To address these constraints to a certain extent, some standard solutions exist today, such as the Spark Thrift Server and Apache Livy. The Spark Thrift Server (essentially a Thrift service implemented by the Apache Spark community on top of HiveServer2) allows data applications to harness the power of Spark SQL remotely in a regular SQL way over the standard JDBC interface, whereas Livy allows clients to send code snippets and submit applications to a Spark cluster remotely via REST and programmatic APIs.
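For instance, here is a minimal sketch of running a code snippet remotely through Livy’s REST API (the host name is hypothetical; Livy listens on port 8998 by default):

```python
# Submit a PySpark snippet to a remote cluster via Livy's REST API.
import requests

livy_url = "http://livy-host:8998"

# Create an interactive PySpark session on the cluster.
session = requests.post(f"{livy_url}/sessions", json={"kind": "pyspark"}).json()

# In a real client you would poll GET /sessions/{id} until its state is
# "idle" before submitting statements; omitted here for brevity.
statement = requests.post(
    f"{livy_url}/sessions/{session['id']}/statements",
    json={"code": "spark.range(100).count()"},
).json()
```

This works, but as noted next, it is not the same experience as working with DataFrames natively.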
Nonetheless, none of these solutions provides a native execution experience for Spark’s rich DataFrame APIs across all platforms, i.e., the experience one typically gets on a Spark shell. Further, these solutions involve a learning curve, might need custom tweaks in a native Spark application, and might require the installation and maintenance of extra components.
But with Spark Connect, introduced in the latest Spark release, 3.4, one can natively experience and leverage the power of Spark’s cluster computing from a remote setup. Spark Connect is based on a decoupled, gRPC-based client-server architecture in which unresolved logical plans serve as the common contract between client and server.
The architecture is depicted below (Reference: Spark Docs):
The gRPC service (the server) is hosted on the driver in the form of a plugin. Multiple Spark Connect clients can connect to it to execute their respective query plans. In general, the Connect service analyzes, optimizes, and executes the logical plans received from the various clients, and streams the results back to the respective clients.
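Here is a minimal sketch of getting started, assuming a hypothetical driver host name (15002 is Spark Connect’s default port):

```python
# The Spark Connect server is first started on the driver node, e.g.:
#
#   ./sbin/start-connect-server.sh \
#       --packages org.apache.spark:spark-connect_2.12:3.4.0
#
from pyspark.sql import SparkSession

# The client only needs the sc:// connection string; nothing cluster-side
# has to be installed on the client machine.
spark = SparkSession.builder.remote("sc://spark-driver-host:15002").getOrCreate()

print(spark.range(10).count())  # executed remotely on the cluster
```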
Further, Spark Connect provides a thin client library that can be embedded in application servers, IDEs, notebooks, and programming languages. The thin client library allows developers to write data processing logic in their favorite DataFrame APIs and automatically triggers remote evaluation of the underlying query plan when an action is invoked. Once the remote execution is complete, the desired output is available in the same scope, as sketched below.
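A minimal sketch of that behavior (the connection string is again hypothetical):

```python
# Transformations only build an unresolved logical plan on the client side;
# invoking an action ships the plan to the remote driver over gRPC and
# streams the results back.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.remote("sc://spark-driver-host:15002").getOrCreate()

df = spark.range(1_000_000)            # no cluster work triggered yet
evens = df.filter(col("id") % 2 == 0)  # still just a local logical plan

count = evens.count()                  # action: remote execution happens here
print(count)                           # result available in the same scope
```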
The Spark Connect client library actually provides applications with a special SparkSession object that points to a remote Spark driver. This special SparkSession instance encapsulates all of the logic to package and send unresolved query execution plans to the configured driver over the gRPC contract when required, collect the results streamed back from the driver upon successful execution of a plan, and then hand the collected results over to the application.
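As a sketch of an alternative way to point the client at a remote driver, the endpoint can also come from the SPARK_REMOTE environment variable instead of an explicit .remote() call (host name hypothetical):

```python
# With SPARK_REMOTE set, a plain builder yields the special remote-backed
# SparkSession described above.
import os
from pyspark.sql import SparkSession

os.environ["SPARK_REMOTE"] = "sc://spark-driver-host:15002"

spark = SparkSession.builder.getOrCreate()
print(spark.version)
```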
To summarize, it should now be easy to see that, with Spark Connect enabled, the productivity and development experience of data engineers is going to increase multifold. It would also enable anyone to interactively explore large data sets remotely, and lastly, it would open up opportunities to develop rich data applications that seamlessly leverage the remote cluster computing paradigm to enrich customer experience and interactions.
In case of any doubts or queries, or if you have any feedback on this story, you can reach out to me on LinkedIn.