When it comes to writing machine learning algorithms leveraging the Apache Spark framework, the data science community is fairly divided as to which language is best suited for writing programs and applications. As a widely used open source engine for performing in-memory large-scale data processing and machine learning computations, Apache Spark supports applications written in Scala, Python, Java, and R. The Spark engine itself is written in Scala.
Often, the language of choice is decided by the level of comfort, expertise and prior experience of the developer. While this helps with rapid prototyping and development, in the long run it may not be the best choice for developing models based on the business problem being solved and the amount of data processing that would be needed.
Here are a few key considerations to keep in mind before deciding on the language of choice for your machine learning application and modelling for the business problem at hand.
Java and Scala are compiled languages: source code is first compiled to bytecode, which then runs on the Java Virtual Machine (JVM). Because the Spark engine is itself written in Scala, Scala and Java applications run directly on the JVM alongside the engine, with no language bridge in between.
Python and R, on the other hand, are interpreted languages: an interpreter executes the program at run time, and the code is not translated into a JVM language. Instead, PySpark and SparkR drive the JVM-based engine from a separate interpreter process. Any custom Python or R functions applied to the data run in worker processes outside the JVM, so data has to be serialized back and forth between the two.
From a strictly performance perspective, the compiled languages (Java and Scala) generally perform better than the interpreted languages (Python and R). The gap tends to be smaller when the work stays inside Spark's built-in DataFrame operations, which execute in the JVM whichever language calls them. It is therefore prudent to profile the application before assuming that language will be a big factor, especially for small applications where the difference may not matter.
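As a rough illustration of measuring before deciding, the sketch below times two equivalent pure-Python implementations with the standard `timeit` module. The function names and the workload size `N` are hypothetical, and this does not measure Spark itself; it only shows the habit of collecting numbers rather than guessing.

```python
import timeit


def sum_squares_loop(n):
    """Sum of squares using an explicit for loop."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def sum_squares_builtin(n):
    """Same result, using a generator expression and the built-in sum()."""
    return sum(i * i for i in range(n))


N = 100_000  # hypothetical workload size, chosen only for illustration

# Run each implementation 20 times and record the total elapsed seconds.
loop_seconds = timeit.timeit(lambda: sum_squares_loop(N), number=20)
builtin_seconds = timeit.timeit(lambda: sum_squares_builtin(N), number=20)

print(f"explicit loop: {loop_seconds:.3f}s for 20 runs")
print(f"built-in sum : {builtin_seconds:.3f}s for 20 runs")
```

The absolute numbers depend on the machine and interpreter version, so treat them as relative indicators only.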
Concurrency means that several tasks are in progress over the same period, with their execution overlapping or interleaved. The tasks do not have to start or finish at the same time. Parallelism is the special case in which tasks literally execute at the same instant on multiple cores.
As Scala runs on the Java Virtual Machine (JVM), it has full access to the JVM's multi-threading capabilities. Scala code is also not limited to working with threads directly: higher-level options for concurrency include Futures in the standard library and Actors through libraries such as Akka.
Python and R, on the other hand, offer weaker support for multi-threading. In standard CPython builds, the global interpreter lock lets threads overlap while they wait on I/O, but only one thread executes Python code at a time, so CPU-bound work does not spread across cores. R is single-threaded by default. Both languages typically reach parallelism by running multiple processes instead, which adds overhead for memory and for moving data between processes.
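The I/O-bound case can be sketched in plain Python with a thread pool. Here `fake_io_task` is a hypothetical stand-in for a network or disk call, simulated with `time.sleep`; the sketch only checks that the waiting periods of the threads overlap, and says nothing about CPU-bound work.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_io_task(task_id):
    """Simulate a blocking I/O call and record when it started and finished."""
    started = time.perf_counter()
    time.sleep(0.2)  # while sleeping, the thread lets other threads run
    finished = time.perf_counter()
    return task_id, started, finished


# Four worker threads, four tasks: each task gets its own thread.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(fake_io_task, range(4)))

task_ids = [task_id for task_id, _, _ in results]
latest_start = max(started for _, started, _ in results)
earliest_finish = min(finished for _, _, finished in results)

# If the last task started before the first one finished, the waits overlapped.
waits_overlapped = latest_start < earliest_finish
print(task_ids, waits_overlapped)
```

For CPU-bound work, the usual approach in Python is a pool of processes rather than threads, at the cost of copying data between them.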
Type checking is the process of verifying and enforcing the constraints of variables and data types. From an ease of use perspective, dynamically typed languages like Python and R perform type checking at run time, which lets developers write applications quickly. Statically typed languages like Scala and Java perform type checking at compile time, so many errors surface before the program runs. As a result, refactoring and maintaining applications over a long period is a lot easier with Scala and Java than with Python and R.
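A small Python sketch of what run-time type checking means in practice follows. The function `mean_length` is a made-up example; its type hints document the intended types, but the interpreter does not enforce them, so the mistake is only discovered when the offending call executes.

```python
from typing import List


def mean_length(words: List[str]) -> float:
    """Average length of the given strings; the hints are not enforced at run time."""
    return sum(len(w) for w in words) / len(words)


print(mean_length(["spark", "scala"]))  # valid input

# This call contradicts the type hints, yet nothing complains until it actually runs.
try:
    mean_length([1, 2, 3])
    runtime_error = None
except TypeError as exc:
    runtime_error = exc

print(type(runtime_error).__name__, runtime_error)
```

In Scala or Java, the equivalent mistake would be rejected by the compiler. In Python, an optional static checker such as mypy can flag it before the program runs, provided the code carries type hints.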
Java is verbose, meaning Java-based applications need many more lines of code to perform the same operations than Scala, Python or R. Additionally, Java had no Read-Evaluate-Print-Loop (REPL) until JShell arrived in Java 9, and Spark ships interactive shells for Scala, Python and R but not for Java. This makes Java less convenient for exploratory work in popular data science tools like Jupyter Notebook.
As Apache Spark is written in Scala, a good knowledge of Scala helps developers understand, and potentially extend, what Spark does internally. Moreover, new features usually appear first in the Scala and Java APIs. The Python APIs tend to catch up in later versions.
Scala, Python and Java all support both object-oriented and functional programming, although Scala leans most heavily on the functional style. R, on the other hand, is primarily functional and procedural in nature.
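To make the mix of styles concrete, here is a minimal Python sketch that combines a small class with a functional filter/map/reduce pipeline. The `Reading` class and its data are hypothetical and exist only for illustration; the pipeline shape loosely resembles the filter, map and reduce operations that Spark exposes on its collections.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass(frozen=True)
class Reading:
    """Hypothetical sensor reading, used only for illustration."""
    sensor: str
    value: float

    def is_valid(self) -> bool:
        # Object-oriented style: behaviour lives on the object itself.
        return self.value >= 0.0


readings = [Reading("a", 2.0), Reading("b", -1.0), Reading("c", 4.0)]

# Functional style: chain transformations without mutating any state.
valid = filter(lambda r: r.is_valid(), readings)
values = map(lambda r: r.value, valid)
total = reduce(lambda acc, v: acc + v, values, 0.0)

print(total)  # 6.0
```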
Python is more analytics oriented and has a gentler learning curve. Its syntax is less verbose and more readable than that of Scala or Java, making it ideal for those who don't have much programming experience. Scala and Java are more engineering oriented and suit those with a programming background, especially a Java background. R was developed with academics, statisticians and data scientists in mind, and is often used for data visualization and plotting.
The availability of third-party libraries is one area where Python, R and Java have a clear advantage over Scala.
Python and R have a much more mature ecosystem with readily available out-of-the-box packages implementing most of the standard procedures and models that are already broadly adopted across various industries and academia.
With its MLlib, GraphX and Spark Streaming libraries, Spark has made huge strides in making standard implementations available out-of-the-box to Scala developers. However, this ecosystem still has some way to go before it matches the maturity of Python's and R's.
In the end, it all depends on the problem you are trying to solve, your prior experience and capabilities, and the amount of data to be processed. Another consideration is whether you are building a quick prototype or an enterprise-wide application. Python offers quick prototyping and development, while Scala and Java are better choices for processing large data sets and for enterprise-wide deployments. R is a good fit for requirements that are best addressed by R itself, or if you want to move an existing R environment onto the Apache Spark platform.
OpenText™ Magellan™ is a flexible artificial intelligence (AI) and analytics platform that combines open source machine learning, advanced analytics, and enterprise-grade business intelligence (BI) with the ability to acquire, merge, manage, and analyze structured and unstructured Big Data and Big Content stored in your Enterprise Information Management (EIM) systems. Magellan enables machine-assisted decision making, automation, and business optimization.