As we continue looking at Azure Government service updates, it’s worth remembering how important Azure’s HDInsight is in bringing big data to governmental organizations like yours. It’s the only fully managed cloud Hadoop offering that provides open-source clusters for applications like Apache Spark.
Spark is a popular open-source engine for large-scale data processing and analytics, and it has become a major part of HDInsight. Even so, it’s ready for an update.
Microsoft recently announced an upcoming HDInsight 3.6 release. As part of it, Microsoft is looking for feedback on its new Apache Spark 2.1 support.
Because Spark is open source, Microsoft invites you to try out the new features and decide whether you want the update.
What HDInsight Already Provides for Government Organizations
Those of you new to HDInsight should know it allows you to create a big data analysis cluster in mere minutes. You can do this without any upfront costs, which helps immensely if you’re on a limited budget.
For governmental offices like yours, HDInsight’s use of Spark is effective for working with real-time data. Streaming and processing data in this manner is quite easy with Apache Spark, as well as with Kafka and Storm.
Machine learning is possible as well through Spark and R Server. You can even build applications that give clients a more personalized experience.
Now that Spark 2.1 is available, how easy is it to start using? The good news is that it comes with a lot of improvements your development team is going to love.
Getting Started With the New Preview
It’s simple to start using the new HDInsight and Spark 2.1. Just go to your Azure portal and create an HDInsight service. In the portal, you’ll see a list of featured apps with HDInsight at the top.
Once you select it, the portal asks which Spark cluster type you want. Simply select 2.1 from the drop-down menu.
After creating your cluster, you’ll have access to some useful tools, including Jupyter Notebook, a popular open-source web application. This allows you to create and share documents with live code. You can even add equations, visualizations, and text.
With the notebook alone you can do data cleaning, statistical modeling, and machine learning, to name just a few uses.
Structured Streaming in Spark Update
Some of the greatest new features in Spark 2.1 are improved structured streaming and the ability to use Apache Kafka as a streaming source. You’ll find details about both on the official Apache Spark page.
You’ll notice various other improvements as well, including to Spark SQL. Structured streaming is the centerpiece, though, and it now comes with Kafka 0.10 support, complete with an integration guide from Microsoft.
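To make the Kafka support a little more concrete, here is a minimal PySpark sketch of reading a Kafka topic as a structured stream. It is an illustration under assumptions, not a recipe from Microsoft or the Spark project: the function names, the broker address, and the topic name are placeholders, and it assumes the Spark Kafka 0.10 connector package is available on your cluster.

```python
def kafka_source_options(bootstrap_servers, topic):
    """Build the options for Spark's Kafka source (placeholder values supplied by the caller)."""
    return {
        "kafka.bootstrap.servers": bootstrap_servers,  # e.g. "broker-host:9092" (hypothetical)
        "subscribe": topic,                            # topic name is an assumption
        "startingOffsets": "latest",                   # only read records that arrive from now on
    }


def read_kafka_stream(spark, bootstrap_servers, topic):
    """Return a streaming DataFrame with the Kafka key and value decoded as strings."""
    # Imported here so the options helper above works even without Spark installed.
    from pyspark.sql.functions import col

    raw = (
        spark.readStream
        .format("kafka")
        .options(**kafka_source_options(bootstrap_servers, topic))
        .load()
    )
    # The Kafka source delivers key and value as binary, so cast them for readability.
    return raw.select(
        col("key").cast("string").alias("key"),
        col("value").cast("string").alias("value"),
        col("timestamp"),
    )
```

In a Jupyter notebook on the cluster, you would pass in the existing `spark` session, then attach a sink with `writeStream` and call `start()` to begin processing.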
Other improved details related to structured streaming include better metrics, a stable format for offset logs, and observed delay-based event time watermarks.
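If event-time watermarks are new to you, the idea is easier to see in a few lines of plain Python. The sketch below is a deliberately simplified model, not Spark’s implementation: the watermark trails the latest event time seen so far by a fixed delay, and events that arrive with a timestamp older than the watermark are treated as too late. The function name and the sample data are made up for illustration. Real Spark applies the rule to windowed aggregation state rather than to each individual record.

```python
from datetime import datetime, timedelta


def apply_watermark(batches, delay):
    """Simplified model of a delay-based event-time watermark.

    batches: list of micro-batches, each a list of event-time datetimes.
    delay:   how late an event may arrive and still be accepted.
    Returns (accepted, dropped) lists of event times.
    """
    watermark = None        # no watermark until at least one batch has been seen
    max_event_time = None
    accepted, dropped = [], []
    for batch in batches:
        for event_time in batch:
            # The watermark computed from earlier batches applies to this batch.
            if watermark is not None and event_time < watermark:
                dropped.append(event_time)
            else:
                accepted.append(event_time)
            if max_event_time is None or event_time > max_event_time:
                max_event_time = event_time
        if max_event_time is not None:
            # The watermark trails the latest observed event time by the allowed delay.
            watermark = max_event_time - delay
    return accepted, dropped


noon = datetime(2017, 1, 1, 12, 0)
batches = [
    [noon, noon + timedelta(minutes=20)],                         # watermark moves to 12:10
    [noon + timedelta(minutes=5), noon + timedelta(minutes=15)],  # 12:05 arrives too late
]
accepted, dropped = apply_watermark(batches, timedelta(minutes=10))
```

In PySpark 2.1 itself, the equivalent declaration on a streaming DataFrame is `withWatermark("timestamp", "10 minutes")`, where the column name and the delay are whatever fits your data.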
Structured streaming in this edition also supports all of Spark’s file-based formats.
Other Details on the Spark Update
Keep an eye out for Spark’s machine learning capabilities, which are built to be scalable and easy to use. The Spark site documents MLlib, its machine learning library, and the features it brings to your developers. For instance, they’ll be able to use common learning algorithms in a variety of application development situations.
The library also provides featurization, machine learning pipelines, persistence, and numerous utilities.
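Here is a rough PySpark sketch of how featurization, a pipeline, and persistence fit together. Treat it as one possible arrangement rather than a recommended design: the function names, the column names (`text`, `label`), the choice of logistic regression, and the save path are all assumptions made for the example.

```python
TEXT_COLUMN = "text"     # assumed name of the raw text column
LABEL_COLUMN = "label"   # assumed name of the numeric label column


def build_text_pipeline():
    """Assemble a small pipeline: tokenize text, hash it into features, fit a classifier."""
    # Imported here so this module can be loaded on a machine without Spark.
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.feature import HashingTF, Tokenizer

    tokenizer = Tokenizer(inputCol=TEXT_COLUMN, outputCol="words")       # featurization step 1
    hashing_tf = HashingTF(inputCol="words", outputCol="features")       # featurization step 2
    classifier = LogisticRegression(
        featuresCol="features", labelCol=LABEL_COLUMN, maxIter=10
    )
    return Pipeline(stages=[tokenizer, hashing_tf, classifier])


def train_and_save(training_df, path):
    """Fit the pipeline on a DataFrame and persist the fitted model to the given path."""
    model = build_text_pipeline().fit(training_df)
    # Persistence: the whole fitted pipeline is written out and can be reloaded later.
    model.write().overwrite().save(path)
    return model
```

A saved model can later be brought back with `PipelineModel.load(path)` from `pyspark.ml` and applied to new data with `transform`.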
The update additionally improves SparkR (a lightweight front end for using Spark from R), giving you extra machine learning support. It adds new machine learning algorithms, with LDA and Gaussian mixture models being just a couple.
A couple of known bugs in the previous edition of Spark are getting addressed in 2.1, which should bring more of the stability you expect from Microsoft’s Azure platform.