5 Ways To Get More Out Of Hadoop
Ashley Stirrup, CMO at Talend, shares his top tips for honing Hadoop skills
For every organisation, time is of the essence – in some cases, a minute can mean the difference between earning and losing millions of pounds.
As organisations increasingly look to speed time to market, anticipate and respond to customers’ needs, and introduce new products and services, they need the peace of mind that comes from knowing their decisions are based on information that’s fresh and accurate. Operating with data that’s a week, a day or even a few hours old can be fatal.
Hadoop, the big data processing framework, is now used extensively to help businesses gain insight in real-time. As a result, growing numbers of developers are looking for ways to optimise its use, deepen that insight and extend their competitive advantage.
If you’re a developer, here are five ways you can sharpen your use of the Hadoop framework:
1. Swap MapReduce for Spark
Simply by moving data integration jobs from MapReduce to Apache Spark, you can complete those jobs around two and a half times faster.
Once you have converted the jobs, adding Spark-specific components for caching and partitioning can increase performance a further five times. From there, increasing the amount of RAM on your hardware lets you do more in-memory and can deliver a 10-fold improvement overall.
Overall, when you combine Hadoop and Spark with your traditional bulk-and-batch data integration jobs, you can dramatically improve performance.
2. Go real-time
It’s one thing to be able to do things in bulk and batch; it’s another to do them in real-time. This is not about understanding what your customers did on your website yesterday; it’s about knowing what they are doing right now – and being able to influence their interactions immediately, before they leave your site.
One of the best things about Spark – and Spark Streaming – is that you now have a single tool set that lets you operate in bulk, in batch and in real-time.
With data integration tools, you can design integration flows with one tool set across all of these systems: pulling in data from historical sources such as Oracle and Salesforce, then blending in real-time streaming data from websites, mobile devices and sensors.
The bulk-and-batch information may be stored in Hadoop, while the real-time information sits in NoSQL databases. Regardless of where the data lives, Spark SQL gives mobile, analytics and web apps a single query interface for searching across all of those sources for the right information.
3. Get smart
So, now you can process data in real-time – but how about intelligently processing data in real-time?
To improve the IQ of your queries, Spark offers built-in machine learning which, for example, lets you personalise web content for each shopper, an approach that can nearly triple the number of page views. Spark’s machine learning capabilities also allow you to deliver targeted offers, which can help double conversion rates. So you are not only creating a better customer experience, you are also driving more revenue.
For example, German retailer OTTO Group is using Spark to predict which online customers will abandon their shopping carts, and then to present them with incentive offers. If you are a £12bn company with the industry-average cart-abandonment rate of 50 to 70 percent, even a small improvement can translate into millions of pounds in extra revenue.
Simple visual design tools (more on those in the next tip) make it possible for companies of any size – not just large retailers like OTTO – to do real-time analytics and deliver an enhanced customer experience.
4. Stop hand coding
Everything discussed in the tips above can be programmed in Spark, in Java or in Scala. But there’s a better way: using a visual design interface can increase development productivity 10 times or more.
Designing jobs with a visual UI also makes it much easier to share work with colleagues. People can look at a job and understand what the integration is doing, which makes collaboration straightforward and development work simple to re-use.
5. Get a head start
You can get started straight away by using a big data sandbox – a virtual machine with Spark pre-loaded and a real-time streaming use case built in. And, if you need it, there’s a simple guide that walks you through the process step by step, making it easy to pick things up and hit the ground running.