Sure, Apache Spark looks cool, but does it live up to the hype? Is there anything you can actually do with it? As it turns out, there are some impressive use cases going on right now.
Exploratory Analytics
One of the best features of modern programming languages is that many of them offer interactive shells, from Bash to Python to Scala. Instead of a time-consuming write/compile/test/debug cycle, you can try out your ideas in the shell immediately.
Spark takes this idea and applies it to Big Data. You can explore your data interactively using either Python or Scala without having to wait on batch queries. Spark lets you use any kind of data, whether it’s structured, semi-structured, or unstructured. You can also use any kind of programming model you want: imperative, functional, or object-oriented.
The key to this is Spark’s use of Resilient Distributed Datasets, or RDDs. RDDs are held in memory, which is much faster than reading from disk, and Spark can spill to disk when a dataset is too large to fit in memory. If you think this would be a recipe for slow performance with Big Data, think again. Spark uses lazy evaluation, which performs computation only when you need a result, such as printing a value. You can set up complex queries and then run them later.
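Here is a minimal sketch of what that looks like in PySpark. The log path and line format are hypothetical; the point is that filter and map only record lineage, and nothing actually executes until an action such as count() runs.

```python
# A sketch of an interactive session; in the PySpark shell, `sc` already
# exists. The log path and format below are hypothetical.
from pyspark import SparkContext

sc = SparkContext(appName="explore")

lines = sc.textFile("hdfs:///data/events.log")   # hypothetical path
errors = lines.filter(lambda l: "ERROR" in l)    # lazy: nothing runs yet
codes = errors.map(lambda l: l.split()[1])       # still lazy

errors.cache()            # keep the filtered RDD in memory across actions
print(errors.count())     # action: the whole pipeline executes now
print(codes.take(5))      # reuses the cached `errors` RDD
```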
RDDs are also immutable, so there’s no risk that interactive exploration will corrupt your source data. Spark records each RDD’s lineage, the full history of transformations that produced it, so lost data can be recomputed after an error. This makes exploring large datasets safe. You can also connect to your other databases using SQL drivers.
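As a sketch of that last point, here is one way to pull a table in over Spark’s JDBC data source. The connection URL, table, and credentials are placeholder assumptions, and the matching JDBC driver jar has to be available to Spark.

```python
# A hedged sketch of reading an external database table through Spark's
# JDBC data source; URL, table, and credentials are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jdbc-demo").getOrCreate()

customers = (spark.read.format("jdbc")
             .option("url", "jdbc:postgresql://dbhost:5432/sales")
             .option("dbtable", "public.customers")
             .option("user", "analyst")
             .option("password", "secret")
             .load())

customers.createOrReplaceTempView("customers")
spark.sql("SELECT COUNT(*) FROM customers").show()
```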
Machine Learning
Spark offers some powerful machine learning tools. As with exploratory analytics, you can use the interactive REPL (read-evaluate-print loop, a common term for an interactive shell) to develop algorithms in real time. Spark also caches frequently accessed datasets for maximum efficiency. You can develop your own algorithms or use the efficient implementations that ship with MLlib.
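To make that concrete, here is a minimal sketch of training a classifier with the RDD-based MLlib API. The input path, CSV layout, and feature values are assumptions for illustration.

```python
# A minimal MLlib sketch: parse a hypothetical CSV of label + numeric
# features, cache it (it is iterated repeatedly), and train a classifier.
from pyspark import SparkContext
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.classification import LogisticRegressionWithLBFGS

sc = SparkContext(appName="mllib-demo")

def parse(line):
    values = [float(x) for x in line.split(",")]
    return LabeledPoint(values[0], values[1:])

data = sc.textFile("hdfs:///data/transactions.csv").map(parse)
data.cache()   # training makes multiple passes over the data

model = LogisticRegressionWithLBFGS.train(data)
print(model.predict([0.5, 1.2, 3.4]))   # score a new observation
```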
Machine learning is becoming important for threat detection. One client of MapR Technologies, a credit card company, uses Spark to detect potential credit card fraud; another uses it to detect possible network threats.
Real-Time Dashboards
Big Data is no good if you have no way to see it. The goal of Big Data is to sift through large amounts of data for insights that people in your organization can act on, and Apache Spark can power the real-time dashboards that surface them.
While a programmer might be able to use the REPL described earlier to explore data, most people are not going to be willing to learn SQL, Scala, Python, or Spark in order to look for trends.
Spark Streaming lets you perform low-latency, window-based aggregations of your data. Spark can combine streaming and offline databases for an optimal view of a company’s data, enabling dashboards that let users drill down to an easy, graphical, intuitive view. The ability to connect to other databases using SQL drivers rounds out a holistic view of the organization.
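Here is a minimal Spark Streaming sketch of the kind of windowed aggregation that could feed such a dashboard. The socket source, port, and event format are hypothetical.

```python
# Count events per key over a sliding 60-second window, updated every 10
# seconds; the source and comma-separated event format are assumptions.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="dashboard-feed")
ssc = StreamingContext(sc, batchDuration=5)  # 5-second micro-batches
ssc.checkpoint("hdfs:///tmp/checkpoints")    # required by windowed ops

events = ssc.socketTextStream("localhost", 9999)   # hypothetical source
pairs = events.map(lambda e: (e.split(",")[0], 1))

counts = pairs.reduceByKeyAndWindow(lambda a, b: a + b,   # add new batches
                                    lambda a, b: a - b,   # subtract old ones
                                    windowDuration=60,
                                    slideDuration=10)
counts.pprint()   # in practice, push this to the dashboard's data store

ssc.start()
ssc.awaitTermination()
```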
ETL
With the ability to process massive amounts of data quickly, Apache Spark is ideal for data warehouses. While your databases may be structured, in the real world, data can be anything but. You might be looking for a way to clean and transform data coming from sources inside and outside your organization. Apache Spark makes the task much less daunting.
Spark offers a variety of ETL (Extract, Transform, and Load) tools. Spark includes optimized scheduling for the most efficient I/O on the large datasets that data warehousing employs, and its in-memory design keeps the aggregations, shuffles, and other operations ETL jobs depend on fast.
Spark lets you use tools you’re already familiar with. You can use SQL to perform ETL, flattening the learning curve for you and your administrators in getting data into Spark. You can also port Pig scripts to Spark and run Hive queries.
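As a sketch of SQL-driven ETL, the example below extracts raw JSON, transforms it with a query, and loads the result as Parquet. The paths, view name, and column names are assumptions.

```python
# A minimal extract-transform-load pipeline using Spark SQL; all paths and
# column names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("etl-demo").getOrCreate()

raw = spark.read.json("hdfs:///landing/orders.json")   # extract
raw.createOrReplaceTempView("orders")

cleaned = spark.sql("""
    SELECT customer_id,
           CAST(amount AS DOUBLE) AS amount,
           TO_DATE(order_ts)      AS order_date
    FROM orders
    WHERE amount IS NOT NULL
""")                                                   # transform

cleaned.write.mode("overwrite").parquet("hdfs:///warehouse/orders")  # load
```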
Conclusion
With fast in-memory processing, Apache Spark offers up a whole new way to explore and act on your data. The MapR distribution of Spark gives you everything you need to make the best use of your data right out of the box.
For a more in-depth introduction to Spark, read Getting Started with Spark: From Inception to Production, a free interactive eBook by James A. Scott.