Channel: Security – Hortonworks

Announcing Apache Falcon 0.6.0


With YARN as its architectural center, Apache Hadoop continues to attract new engines to run within the data platform, as organizations want to efficiently store their data in a single repository and interact with it for batch, interactive and real-time streaming use cases. As more data flows into and through a Hadoop cluster to feed these engines, Apache Falcon is a crucial framework for simplifying data management and pipeline processing.

Falcon enables data architects to automate the movement and processing of datasets for ingest, pipeline, disaster recovery and data retention use cases.

We recently released Apache Falcon 0.6.0. With this release, the community addressed more than 220 JIRA issues. Among these many bug fixes, improvements and new features, four stand out as particularly important:

  • Authorization with ACLs for entities
  • Enhancements to lineage metadata
  • Cloud archival
  • Falcon recipes

This blog gives an overview of these new features and how they integrate with other Hadoop services. We’ll also touch on additional innovation we plan for upcoming releases.

Authorization with ACL for entities

Now Apache Falcon supports an access control list (ACL) that provides authorization for Feed, Cluster and Process entities. This allows Falcon to leverage existing security work to maintain consistent controls throughout the HDP stack. This security enhancement lays the foundation for broader enterprise adoption and a variety of new use cases that will flow from that.
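To make the entity-level ACL concrete, here is a minimal sketch of submitting a feed whose definition carries an ACL element over Falcon’s REST API. The Falcon host and port, the submit endpoint path, and the ACL attribute names are assumptions for illustration (and the feed definition is abridged), so check them against the Falcon 0.6.0 documentation before relying on them.

```python
# Hedged sketch: submit a Falcon feed entity that carries an entity-level ACL.
# The host/port, endpoint path, and ACL attribute names are assumptions; the
# feed definition below is abridged for illustration.
import requests

FEED_XML = """<feed name="rawEmailFeed" xmlns="uri:falcon:feed:0.1">
  <frequency>hours(1)</frequency>
  <clusters>
    <cluster name="primaryCluster" type="source">
      <validity start="2015-01-01T00:00Z" end="2099-12-31T00:00Z"/>
      <retention limit="days(90)" action="delete"/>
    </cluster>
  </clusters>
  <locations>
    <location type="data" path="/data/raw/emails"/>
  </locations>
  <!-- Only this owner/group may manage the feed once authorization is enabled -->
  <ACL owner="etl-user" group="analytics" permission="0755"/>
  <schema location="/none" provider="none"/>
</feed>"""

resp = requests.post(
    "http://falcon-host:15000/api/entities/submit/feed",   # placeholder host and port
    params={"user.name": "etl-user"},                      # simple-auth mode
    headers={"Content-Type": "text/xml"},
    data=FEED_XML,
)
print(resp.status_code, resp.text)
```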

Enhancements to lineage metadata

This Falcon release provides better access to lineage metadata. This facilitates the quick and efficient search and retrieval of lineage information, which makes it easier to comply with data retention and discoverability regulations.

Cloud archival

Falcon can now archive data to cloud infrastructure such as Amazon S3 and Microsoft Azure. We’re excited about this change because it extends the archive use case for business continuity and ad hoc analysis.

Falcon recipes

A Falcon recipe is a static process template with a parameterized workflow that realizes a specific use case. Recipes are defined in user space, and every recipe is modeled as a Process within Falcon, which then periodically executes the user’s workflow. Because the process and its associated workflow are parameterized, the user provides a properties file with name/value pairs that Falcon substitutes into the workflow definition before scheduling, translating the recipe into a process entity. Recipes enable non-programmers to capture and re-use very complex business logic.
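To give a feel for what that parameterization looks like, the sketch below fills a workflow template from a properties file of name/value pairs, which is essentially the substitution step Falcon performs before scheduling the generated process. The ##name## placeholder syntax and the file names are assumptions made for illustration rather than details taken from the release.

```python
# Minimal sketch of recipe-style parameter substitution, assuming ##name##
# placeholders in the template file; Falcon performs the equivalent step
# itself before scheduling the generated process entity.
import re

def load_properties(path):
    """Parse simple name=value lines into a dict, skipping blanks and comments."""
    props = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#"):
                name, _, value = line.partition("=")
                props[name.strip()] = value.strip()
    return props

def materialize(template_path, props):
    """Replace every ##name## placeholder with its value from the properties."""
    with open(template_path) as f:
        text = f.read()
    return re.sub(r"##(.+?)##", lambda m: props.get(m.group(1), m.group(0)), text)

if __name__ == "__main__":
    props = load_properties("hdfs-replication.properties")        # hypothetical file names
    print(materialize("hdfs-replication-template.xml", props))
```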

Plans for the Future of Falcon

We want to thank the Apache Falcon community for all of its hard work delivering this release. Looking forward to future releases, the Apache Falcon team plans:

  • Usability improvements, with a new UI, REST API additions and enhanced documentation
  • Further strengthening of HA capabilities, allowing Falcon to meet ever more stringent SLAs
  • Integration with Apache Knox for tighter perimeter security and proxy access

Download Apache Falcon and Learn More



Hadoop Security: Is it a Different Paradigm?


This guest blog post is from Srikanth Venkat, director of product management at Dataguise, a Hortonworks security partner.

Plus ça change, plus c’est la même chose
As Jean-Baptiste Alphonse Karr noted, “The more things change, the more they stay the same.” Often, that’s not what we hear when looking at Hadoop security: people tend to call out how different Hadoop is, and how different its security solutions need to be. For protecting data in Hadoop, people will highlight that, with the big changes in Big Data, existing enterprise security models don’t hold, and that a whole new paradigm is required. They’ll call out that the data is different, the processing is different, and the access and data sharing are different. As a security startup focused on the Hadoop market, we are as guilty of this at Dataguise as the next guys.

At the recent Big Data Security and Governance Meetup, Balaji Ganesan of Hortonworks (formerly co-founder of XA Secure) gave a presentation entitled “Apache Ranger: Current State and Future Directions”. The talk focused on how Apache Ranger provides differentiated capabilities around authorization and auditing from an enterprise-wide compliance and governance perspective. I’ll cover some thoughts on those capabilities below. But in hearing Balaji present on authorization models, I realized how many parallels one could draw from other security contexts to what I’ll call necessary “security non-inventions” in the Hadoop context. We’ve been so busy highlighting the new that we’ve forgotten the well known and honed. Yes, there are some new architectural requirements needed around authentication, authorization, and auditing in big data, but the foundational principles stay the same. Some of those fundamentals include:

  • Simplicity matters – I liked some of the decisions made by the Ranger team to balance scalability (inheritance and hierarchical support for users/groups) with simplicity (positive permission model with a borrowed “only-one” applicable model from XACML).
  • Performance matters – By making the embedded policy enforcement points (PEPs) local with a plugin style framework, Ranger nicely solves any latency or performance issues with local security processing for authorization.
  • Coverage matters – Perhaps most interesting to me, as an enterprise security enthusiast, was the breadth and speed with which Ranger – as a 100% open source project – could cover authorization plugins quickly across the fast evolving Hadoop ecosystem (YARN, Solr, Kafka etc. are all expected shortly).

So, back to authorization models: why not re-use existing authorization frameworks in Hadoop? Standards such as XACML (the de facto standard for centralized entitlement management and authorization policy evaluation and enforcement) do not apply well to the Hadoop context, because their policy frameworks are overly complex and hard to set up and they cannot serve distributed deployments. Apache Ranger overcomes these limitations with a centralized security framework that manages fine-grained access control over Hadoop data through policies on files, folders, databases, tables, or columns at the individual user or group level, with users and groups synchronized from external enterprise directories such as AD or LDAP.

As enterprises gradually adopt Hadoop as the enterprise data platform, comprehensive Hadoop security that is pluggable and easy to use becomes a necessity not only for survival, but also for satisfying an increasingly complex compliance and regulatory context as higher-risk data gets stored in Hadoop. This need goes beyond basic authorization via file permissions in HDFS, resource-level access control (via ACLs) for MapReduce, and coarser-grained access control at the service level. It is precisely on this market need for pluggable security that Apache Ranger, the community offshoot of Hortonworks’ XA Secure acquisition, focuses. Apache Ranger delivers a ‘single pane of glass’ for centralized, consistent administration of security policy across Hadoop ecosystem tools and all Hadoop workloads, including batch, interactive SQL and real-time. A minimal policy-creation sketch follows the component list below. Ranger’s architectural components include:

  • The Ranger portal, where users create and update policies that are stored in a policy database, along with an audit server that collects audit data sent from Hadoop components (HDFS, Hive, HBase, etc.) by the Ranger plugins;
  • Ranger plugins that enforce policies for the components; and
  • A user synchronization utility for users and groups from Unix, LDAP or Active Directory.
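To show what administering one of those policies looks like in practice, here is a hedged sketch that creates an HDFS path policy through the Ranger admin REST API of that era. The endpoint path, port, credentials, repository name and JSON field names are assumptions based on the early public (v1) API, so verify them against your Ranger installation.

```python
# Hedged sketch: create an HDFS policy via Ranger's early public REST API.
# Host, port, credentials, repository name, and field names are assumptions.
import requests

policy = {
    "policyName": "sales-data-read",
    "repositoryName": "hadoopdev_hdfs",   # hypothetical HDFS repository name in Ranger
    "repositoryType": "hdfs",
    "resourceName": "/data/sales",
    "isRecursive": True,
    "isEnabled": True,
    "isAuditEnabled": True,
    "permMapList": [
        {"groupList": ["analysts"], "permList": ["Read", "Execute"]},
        {"userList": ["etl-user"], "permList": ["Read", "Write", "Execute"]},
    ],
}

resp = requests.post(
    "http://ranger-admin:6080/service/public/api/policy",
    json=policy,
    auth=("admin", "admin"),               # replace with real admin credentials
)
resp.raise_for_status()
print("Created policy id:", resp.json().get("id"))
```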

In the Hadoop universe, Apache Ranger now enables security personnel to perform richer audit tracking and deeper policy analytics, streamlining governance across the enterprise.


Balaji also outlined several exciting future capabilities that are being planned for Apache Ranger including:

  • Integration with Apache Falcon to leverage data lineage within the cluster
  • Deeper integration with Apache Ambari
  • New and improved permission schemes for cluster components
  • Interactive audit querying through Solr
  • Global tag based policies
  • Tighter support for administration of data protection policies with partner solutions, such as Dataguise.

As enterprises increasingly adopt Hadoop for complex business analytics and decision-making, sensitive customer and corporate data burgeons in the Hadoop ecosystem. Balaji remarked that, in addition to Hadoop toolsets for authorization and auditing such as Apache Ranger, enterprises increasingly need robust data-centric protection (masking, encryption, tokenization) to effectively reduce data privacy risks in a practical, repeatable, and efficient manner and to adequately address insider threats to sensitive data stored in Hadoop. In this arena, partner solutions that offer data-centric protection for Hadoop data, such as Dataguise DgSecure for Hadoop, complement enterprise-ready Hadoop distributions such as those from Hortonworks. DgSecure for Hadoop helps enterprises address their nuanced compliance and risk management needs by providing complete protection across the Hadoop data lifecycle, whether during ingest, in storage, or in use, as shown below.

[Figure: Dataguise DgSecure for Hadoop architecture]

Ultimately, Hadoop security has its challenges: some look exactly like existing security challenges, and some are brand new, introduced by the nature of the architecture, features, and functions of Big Data. For managing authorization, continued innovation through efforts such as Apache Ranger is tackling many of these issues, but it’s comforting to know that some things never change.

Learn More

Voltage SecureStorage Complements HDP for Compliance and Data Protection


Our guest blogger is Carole Murphy, director of product marketing for Voltage SecureStorage at Voltage Security, a Hortonworks Certified Technology Partner.

The demand for Hadoop is accelerating, as enterprises move from proof of concept to full production implementations. With the move to a modern data architecture, data security and compliance have become growing concerns.

Securing data in Hadoop is a hot topic and the Hadoop community is investing and providing value-added capabilities in security and governance. A great example of this is the leading position Hortonworks takes with the authentication, authorization, audit and data protection capabilities delivered by Apache Ranger and Apache Knox.

Now Voltage Security®, the global leader in data-centric security for Hadoop, and a certified Technology Partner with Hortonworks, has announced Voltage SecureStorage™ for volume-level encryption in Hadoop. Voltage SecureStorage™ is available as a stand-alone option on a subscription licensing basis, for those looking for “data-at-rest” encryption. Voltage SecureStorage protects against loss of storage media–through human error or physical theft of the hard drive–and offers an initial security response to meet compliance requirements for data protection in Hadoop.

A significant value-added feature, Voltage Stateless Key Management™, is also included in the standalone subscription offer of Voltage SecureStorage. Voltage Stateless Key Management technology provides keys automatically, enabling granular, role-based access to data, and mapping to existing enterprise policies for data access. It eliminates another requirement of traditional security solutions, the key management database and key storage. Voltage Stateless Key Management saves on server costs and administration overhead by doing away with issues such as key roll-over, back-up, recovery and audit, and delivers high performance and scalability well-matched with Hadoop speeds.

For those customers electing to use basic level, data-at-rest protection now, Voltage SecureStorage provides this coverage economically while giving them the ability to grow their capacity and expand their protection in the future.

For those taking a longer view of the journey toward the data lake, Voltage provides the key management behind volume encryption, but this also enables expansion to other use cases for securing Hadoop data as well as other platforms, making Voltage SecureStorage the first step toward full data-centric security for Hadoop in the enterprise. Voltage delivers data-centric security with the Voltage SecureData Suite for Hadoop, which provides field-level data security in all modes of operation, for data-at-rest, in motion, and in use, and can be extended beyond Hadoop to other platforms and databases. The Voltage SecureData Suite for Hadoop includes both Voltage Stateless Key Management and Voltage SecureStorage.

Voltage offers subscription pricing options to support and align with customer purchasing preferences and different customer needs in procurement, from pilot phase to enterprise-wide deployments. Many customers begin by purchasing Hadoop on a departmental basis and they are looking for pricing options such as this new subscription pricing from Voltage. Subscription pricing includes all necessary infrastructure components, for a single, per node price. Subscription pricing can make it easier to configure, and easier to purchase, by reducing the up-front outlay. This pricing also aligns with the Hortonworks Data Platform (HDP) from a pricing perspective so the customer can make a TCO-based decision for the entire stack.

Voltage SecureStorage provides data-at-rest encryption integrated with Voltage Stateless Key Management, as a standalone option for $500/node/year. (Volume discounts are available upon request. Per node subscription pricing includes standard Voltage support.) Go to voltage.com/hadoop for more information and to purchase Voltage SecureStorage.

Learn More


Open Source Communities: The Developer Kingdom


Since our founding in 2011, Hortonworks has had a fundamental belief: the only way to deliver infrastructure platform technology is completely in open source. Moreover, we believe that collaborative open source software development under the governance model of an entity like the Apache Software Foundation (ASF) is the best way to accelerate innovation for enterprise end users: it brings together the largest number of developers, enables innovation to happen far faster than any single vendor could achieve, and does so in a way that is free of friction for the enterprise.

In the past, open source development was thought of as a bunch of hackers living in basements or working out of garages. Fast-forward to 2015 and, as Stephen O’Grady puts it, developers “have become the most important constituency in business seemingly overnight.” They are some of the highest paid and most valued assets in any modern organization. Developers are the new kingmakers.

No matter if they are individual developers or rockstars in a mega corporation, many of these key talents have found each other in a common and very open community. They are contributing and building out key platform technologies such as Apache Hadoop.

But why? Why contribute? Why give away such important intellectual property?
With platform technologies, the developer just wants it to work and the mega corporation just wants to pursue adjacent opportunities built on it. Every company is becoming a software company and the platform on which they build needs to be stable, reliable and complete. So, in the end, what is good for the platform is GREAT for all.

None of this can happen without the ASF

The ASF is critical in providing the right environment where meritocracy rules and platform technologies can be advanced. It provides valuable stewardship and guide-rails for projects interested in attracting the broadest possible community of involvement.

The ASF recently published the Apache Project Maturity Model document that provides a GREAT overview of “how Apache projects operate, in a concise and high-level way”. It is an important piece of work in this new world of software development, and we encourage open source projects without a clearly defined governance model like the ASF’s to read through and use some of these items in support of their own goals.

Harmonizing upstream project innovation and downstream enterprise product

The Hortonworks development model complements what the ASF provides by defining how the dozens of ASF projects included within the Hortonworks Data Platform (HDP) should work together in concert as part of an enterprise-viable data platform. Understanding and satisfying enterprise requirements that span the ASF projects and/or making tradeoffs about which projects are better suited for specific use cases are ways we enable value above and beyond what the ASF focuses on.

Our development model enables us to get the latest stable innovation into our end users’ hands quickly. It’s been designed to enable a virtuous cycle of open source innovation that flows naturally from upstream ASF projects into downstream enterprise grade HDP releases. This minimizes and/or eliminates drift from the corresponding stable ASF project releases. We also work diligently to ensure bug fixes included in HDP 1.x and 2.x maintenance releases are managed in a way that avoids regressions in future releases.

Our certification process is underpinned by a test suite unique to Hortonworks, comprising tens of thousands of functional, system, and integration tests that are run on some of the world’s largest clusters within our joint engineering partners’ environments across a wide range of Linux and Windows systems. By doing all feature development and bug fix work in the upstream ASF projects, our downstream HDP process can cleanly harmonize a set of stable ASF project releases into a rigorously tested and certified platform suitable for the enterprise.

Focusing on enabling an equitable balance of power, not lock-in

End users have learned that open source can not only meet their requirements, but can also enable them to get out from under the yoke of any single vendor’s agenda. Moreover, active participation in open source is not just the domain of web-scale end users like Yahoo!, Google, Facebook, Netflix, and LinkedIn.

Mainstream enterprises such as Aetna, Merck, and Target are participating in open source initiatives that address such needs as data governance since they are vital to their businesses and industries. While our customers are normally downstream consumers of HDP and our subscription services, efforts like the Data Governance Initiative are enabling customers to move upstream into the ASF process to help prioritize features and develop/test the ASF project code so the released technologies are highly aligned with their requirements.

Since we make HDP available free of charge and derive the majority of our revenue from annual support subscriptions, we realize that “every year is an election year” for our customers. We feel our open source development model yields unmatched efficiency and customer satisfaction since it provides a direct way for customers and partners to participate, drive value, and capture adjacent opportunities.

And we feel our business model establishes an equitable balance of power that de-risks investments and focuses on mutual success.

A Benevolent Benefactor

We at Hortonworks have been part of the open source Apache Hadoop movement from its very beginnings. Our team members have helped incubate many new open source technologies and are well versed in how communities work and how organizations can benefit from them. We endeavor to help guide and extend the Hadoop ecosystem as a benevolent benefactor, ensuring all those in the data center can benefit from it.

Another key principle we were founded on was to enable the ecosystem to adopt and extend Hadoop. We do so in the way we help architect solutions. We do so in the way we partner with the technology leaders. We do so in the way we get involved and represent our customer in the community. And it’s all fueled by the rock star developers we employ and the communities in which they work.

If you’d like to see the community in action for yourself, then please join us at Hadoop Summit Brussels and San Jose where you can engage with end users and the broader Hadoop community to help shape the future of Apache Hadoop!


Get a Jumpstart on your competition with Hadoop


Forrester recently called Apache Hadoop adoption “mandatory” for the enterprise. For most organizations, moving forward with Hadoop is no longer a question of if, but when. Hadoop-powered insight into big data is enabling market disruption in every industry and the market winners are those who handle that data most effectively and at the lowest cost.

As with any new platform, making decisions on how best to implement and for what purpose can be challenging. What data? Which use case? How does this thing work anyway?

Talk to Us About Jumpstart

Success with HDP Jumpstart

Hortonworks created HDP Jumpstart to address these questions. Based on experience working with hundreds of customers worldwide, HDP Jumpstart packages the support, training and services needed to get up and running on Hadoop quickly. Since introducing Jumpstart in late 2014, we’ve honed the offering to match what our most successful customers have done to accelerate their use of Hadoop.

Jumpstart includes an HDP Subscription that bundles expert support with access to self-paced learning on Hortonworks University, and adds up to five days of professional services to install and configure HDP. It’s a formula we’ve seen work over and over again.

Use Case Realization in 90 Days

Recently, a large US insurer got started on their Hadoop journey with Jumpstart. They made a focused effort to deliver an initial use case in 90 days. That project merged customer information residing in multiple systems with new data sources like web activity into a single view, and the insurer saw immediate results: improved customer retention and revenue per customer that translated into faster revenue growth for the company. This success set them up to deploy a series of use cases focused on everything from agent productivity to fraud detection.

The way Hadoop – and HDP in particular – is designed makes it possible to start small and grow the scope and value of a deployment quickly. A Hadoop cluster scales linearly, with new commodity servers added as data storage and processing needs grow. And thanks to HDP’s YARN-based centralized architecture, new applications can plug into the cluster and inherit the resource management, security, governance, and manageability of the cluster.


Steps for Success

Nearly every customer we see starts in one of two ways:

  1. Data architecture optimization – to reduce the cost of their data infrastructure by offloading data storage and processing to Hadoop
  2. Advanced analytic application – to realize the new value from their data by deploying a single application

They continue to add new use cases, each delivering tangible return within a defined timeline, as they progress toward a data lake. Today’s competitive environment demands fast results, incremental progress, and flexibility to alter course as conditions change. And that’s exactly what HDP is designed to support.


Jumpstart Now

If you are considering how to get started with Hadoop, talk to us about HDP Jumpstart. From day one you will have access to Hortonworks support engineers who know how to guide you through your architecture decisions and help you avoid problems down the line. Meanwhile, your team can get started building the Hadoop skills that will be critical for long-term success.

It’s your first step toward using data to disrupt your industry and win.


Secure Analytics in the Modern Data Architecture


At the beginning of February, HP announced their intent to acquire Voltage Security to expand data encryption security solutions for Cloud and Big Data. Today, both companies share their thoughts about the acquisition. Carole Murphy, Director Product Marketing at Voltage Security, and Albert Biketi, Vice President and General Manager at HP Atalla, tell us more about how HP extends the capabilities of every product in the Voltage portfolio, including Voltage’s leadership in securing Hadoop data with data-centric, standards-based technologies.

SIGN UP FOR THE WEBINAR

Voltage’s powerful data-centric protection solutions will join the HP Atalla portfolio, expanding HP’s offerings in data classification, payments security, encryption, tokenization and enterprise key management. With Voltage, HP plans to offer customers unparalleled data protection capabilities built to close the gaps that exist in traditional encryption and tokenization approaches. This is particularly important for enterprises that interact with financial payments systems, manage workloads in the cloud, or whose sensitive data flows into Hadoop for analytics.

HP Security Voltage augments the Hadoop security tools developed by Hortonworks and the open source community with a data-centric security platform that protects sensitive data at rest, in motion and in use, maintaining the value of the data for analytics even in its protected form. Hortonworks and HP Security Voltage are collaborating to provide comprehensive security for the enterprise and to enable rapid and successful adoption of Hadoop.

Join the webinar on March 31st, and learn more about the collaboration between HP Security Voltage and Hortonworks, the industry’s only 100% open source Apache Hadoop based platform.

Learn More


Going from Hadoop Adoption to Hadoop Everywhere


As we are finalizing our preparations for what will surely be another successful Hadoop Summit Europe event, one thing has become unequivocally clear: the Hadoop challenge is no longer about acceptance. It’s no longer about adoption. It’s about Hadoop being pervasive. Hadoop is everywhere.

As Mike Gualtieri of Forrester wrote in a recent report:

Hadoop is a must-have for large enterprises

I couldn’t agree more with Mike’s assessment, and I encourage you to read the report: “Predictions 2015: Hadoop Will Become a Cornerstone of Your Business Technology Agenda”. And if you are attending Hadoop Summit Europe, you’ll get to hear Mike’s thoughts firsthand in his keynote session entitled “Adoption Is the Only Option – 5 Ways Hadoop Is Changing The World And 2 Ways It Will Change Yours.”

Hadoop is Transforming Every Industry

At Hortonworks, we know it’s not just about the technology, but the way technology is utilized to transform industries. An abundance of new types of data combined with Hadoop’s ability to store and process it at lower costs is opening up new opportunities in a variety of industries.


As a leader in the Hadoop ecosystem, we try to shine a light on innovative ways that enterprises utilize HDP to transform businesses in different industries. I encourage you to check out the Vertical Industry Solutions section of our website that highlights a wide range of transformational use cases across the advertising, financial, government, healthcare, insurance, manufacturing, oil & gas, retail, and telecom industries.

There is also an entire track at Hadoop Summit dedicated to the Applications of Hadoop and the Data Driven Business, as well as a number of other sessions in other tracks that highlight ways that enterprises are optimizing their data architectures and building advanced analytic applications that create new business value. I have a hunch that these presentations will be the most well attended sessions of all.

Hadoop Everywhere Means Any Data, Any Application, Anywhere

The vertical industry use cases mentioned above hinge on consolidating existing and new data sources and joining those sources to deliver insights that were previously both technically and economically impossible.


Being able to process “Any Data” has always been a hallmark of Hadoop, whether new data sources such as clickstream, web and social, geo-location, IoT, server logs and other high-volume data sets, or traditional data sets from ERP, CRM, SCM and other existing applications and data systems.

“Any Application” has been made possible by YARN, which opens up Hadoop to existing and new applications. With Apache Hadoop YARN and the proliferation of many types of access methods, from batch to interactive to real-time, it’s never been easier to process and analyze this data. Hortonworks continues to work closely with the ecosystem, via our YARN Ready program, to ensure that popular analytical applications can interoperate with YARN and HDP in a consistent manner. The more than 70 YARN Ready solutions help enterprises leverage their existing skill sets and investments while they make better business decisions by having access to a more complete view of their data.


A number of YARN Ready partners will be represented at Hadoop Summit Europe, as exhibitors or presenters, and there is also an entire Hadoop Access Engines track that will highlight a wide range of batch, interactive, and real-time applications running in Hadoop.

“Anywhere” means that Hadoop can be deployed to your environment of choice. HDP supports Linux and Windows operating systems on commodity servers, appliances and all major cloud platforms, including Microsoft Azure, Amazon Web Services, Google Cloud Platform and OpenStack.


We’re also investing in making the deployment process even simpler for enterprises. Stay tuned for upcoming announcements on this topic and consider attending the Microsoft and Google cloud presentations at Hadoop Summit Europe.

Enterprise Readiness Helps Enable Hadoop Everywhere

While it’s clear that Hadoop is ready for the enterprise, that doesn’t mean that we stop our work on enterprise readiness. In fact, it’s just the opposite. There are more security, governance and operational advancements taking place in the Hadoop ecosystem now than ever before.


Over the coming days, please pay attention to blog posts discussing important new operations, security and governance functionality supported by Hortonworks.

In fact, there is an entire track at Hadoop Summit Europe dedicated to enterprise readiness topics, with speakers from enterprises, ecosystem partners and, of course, Hortonworks.

See the Momentum for Yourself!

This is an exciting time for the Apache Hadoop ecosystem. The community has been at this for a while, and we now see Hadoop being used everywhere. If you’d like to see the Hadoop momentum for yourself and you can’t join us next week in Brussels, then come join us at Hadoop Summit in San Jose starting June 9th.


Announcing Apache Ambari 2.0


Advances in Hadoop security, governance and operations have accelerated adoption of the platform by enterprises everywhere. Apache Ambari is the open source operational platform for provisioning, managing and monitoring Hadoop clusters from a single pane of glass, and with the Apache Ambari 1.7.0 release last year, Ambari made it far easier for enterprises to adopt Hadoop.

Today, we are excited to announce the community release of Apache Ambari 2.0, which will further accelerate enterprise Hadoop usage by addressing the technical challenges that slow adoption the most. Ambari 2.0 includes many features, most notable of which are:

  • Automated rolling upgrades for the HDP Stack
  • Simplified, comprehensive Hadoop security (automated Kerberos setup and Apache Ranger install)
  • Ambari Alerts

Many thanks to all of the contributors and committers who collaborated on this release and resolved more than 1,700 JIRA issues. For the complete list of new features, check out this What’s New in Ambari 2.0 presentation.

Enough of the chit-chat. Here are some details of the exciting new features in Apache Ambari 2.0.

Automated Rolling Upgrades for the HDP Stack

The Hortonworks Dev team did a great job describing rolling upgrades in this blog post. To recap: as enterprises everywhere adopt Hadoop, they deploy more and more mission-critical analytic applications. Because of these mission critical workloads, the platform must undergo minimal to no cluster downtime during upgrades from one version to the next. That means the Hadoop platform needs to be “rolling upgradeable.”

The effort in the open source community to make the Hadoop platform rolling upgradeable goes beyond packaging (even though that is one of the key components of rolling upgrades). The developers need to consider the API compatibility between components, the components need an ability to restart jobs underway on the cluster and the system needs to maintain high availability among the Hadoop components for seamless master component switches during upgrades.

That’s a lot of work. But the Hortonworks Dev team brought it all together with Hortonworks Data Platform 2.2 and the Ambari Automated Rolling Upgrade for HDP Stack capability. This allows Hadoop operators to perform a rolling upgrade from one version of HDP to the next with minimal disruption to the cluster. Ambari orchestrates a series of operations on the cluster (with checks along the way) that help you move components to a newer version.

This only scratches the surface. Stay tuned for subsequent blogs with more details on automated rolling upgrades.

Simplified, Comprehensive Hadoop Security

Ambari 2.0 helps provision, manage and monitor Hadoop security in two ways. First, Ambari now simplifies the setup, configuration and maintenance of Kerberos for strong authentication in the cluster. Second, Ambari now includes support for installing and configuring Apache Ranger for centralized security administration, authorization and audit.

Kerberos has long been the central technology for enabling strong authentication for Hadoop, but Kerberos configuration posed quite a challenge: creating the principals and keytabs, and maintaining those artifacts over time, could be cumbersome.

Ambari 2.0 makes this easier with an automated wizard-driven Kerberos configuration approach that eliminates time-consuming administration tasks. Ambari can work with your existing Kerberos infrastructure, including Active Directory, to automatically generate your cluster’s requisite principals and keytabs. Then, as you expand your cluster with more hosts or new services, Ambari can talk to your Kerberos infrastructure and automatically adjust the cluster configuration.

Apache Ranger is the other side of the security equation, providing centralized management of access control services for administration, authorization and audit. Ranger was added as a GA component in Hortonworks Data Platform 2.2 and now with Ambari 2.0, Ranger can be automatically installed and configured with the rest of your cluster components.

Watch this blog for future posts digging deeper into Kerberos, Apache Ranger and comprehensive security support with Ambari 2.0.

Ambari Alerts

The enterprise Hadoop operator needs maximum visibility into the health of the cluster. As the operational framework for Hadoop, Ambari must provide that visibility out-of-the-box and also flexibly integrate with existing enterprise monitoring systems. Ambari Alerts aims to strike that balance between ease and flexibility.


Ambari Alerts provides centralized management of health alerts and checks for the services in your cluster. Ambari automatically configures the particular set of alerts based on the services installed. As a Hadoop operator, you have control over which alerts are enabled, their thresholds and their reporting output. For maximum flexibility, alert groups and multiple notification targets give you very granular control of the “who, what, why and how” around alerts. This puts both flexibility and power in the hands of the Hadoop operator, who can now:

  • Create and manage multiple notification targets and control who gets notified for which alerts.
  • Filter notification by alert severity and send certain notifications to specific targets based on that severity.
  • Control notification target methods, including support for EMAIL + SNMP so the person being notified can be alerted via their preferred method.

Ambari also exposes alerts REST API endpoints to enable integration with existing systems. There are a few integration patterns in the What’s New in Ambari 2.0 slides to give you a better sense of the possibilities. As one example of the Ambari community rallying around Alerts, our partners at SequenceIQ dove in head-first and have already integrated alerts with Periscope. Be sure to check out what they have done, since it’s a great example of community innovation in action.
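If you would rather script against those endpoints than click through the UI, the hedged sketch below lists alert definitions and flags any alerts that are not OK using the Ambari REST API. The host, credentials and cluster name are placeholders, and the resource paths and field names reflect our reading of the Ambari v1 API, so confirm them against your Ambari version.

```python
# Hedged sketch: read Ambari alert definitions and current alert instances.
# Host, credentials, cluster name, and field names are placeholders/assumptions.
import requests

AMBARI = "http://ambari-host:8080/api/v1"
AUTH = ("admin", "admin")
CLUSTER = "mycluster"

# Alert definitions Ambari configured for the services installed on this cluster
defs = requests.get(f"{AMBARI}/clusters/{CLUSTER}/alert_definitions", auth=AUTH).json()
for item in defs.get("items", []):
    d = item["AlertDefinition"]
    print("definition:", d.get("name"), "-", d.get("label", ""))

# Current alert instances, filtered client-side to anything not OK
alerts = requests.get(f"{AMBARI}/clusters/{CLUSTER}/alerts", auth=AUTH).json()
for item in alerts.get("items", []):
    a = item["Alert"]
    if a.get("state") in ("WARNING", "CRITICAL"):
        print("ALERT:", a.get("definition_name"), a.get("state"), a.get("text", ""))
```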

Download Ambari and Learn More

The Ambari community is already hard at work improving Apache Ambari capabilities to provision, manage and monitor Hadoop clusters. Watch this blog for more news on enhancements to core features and extensibility features. But in the meantime, check out the community release of Ambari 2.0 with the following resources:



Introducing Automated Rolling Upgrades with Apache Ambari 2.0


The recent post by Jayush Luniya announced the community release of Apache Ambari 2.0. One of the three key Ambari features that Jayush discussed was Rolling Upgrades, enabling Hadoop operators to upgrade from one version of HDP to the next, with minimal disruption to the cluster.

The Hortonworks development team worked long and hard to make the Hadoop platform “rolling upgradeable”. That groundwork was available in Hortonworks Data Platform 2.2 as described in this previous post. That breakthrough made rolling upgrades possible, but they were not automated; users still had to follow a set of manual steps to upgrade the cluster in a rolling fashion.

But now with Apache Ambari 2.0, you can automate rolling upgrades, and this post walks through how that works.

Overview

HDP 2.2 takes care of cluster API compatibility, seamless job execution, component high availability and software packaging. The “automation of steps” was the final piece of the puzzle, which fell into place with Ambari 2.0.

HDP has a certified process for executing the upgrade in a certain order so component interdependencies are handled correctly and the software versions can be switched in a rolling fashion. The process follows this order:

  1. Cross-cutting components (such as ZooKeeper and Ranger)
  2. Core master components (such as HDFS NameNode and YARN ResourceManager)
  3. Cluster slave components (such as HDFS DataNodes and YARN NodeManagers)
  4. Ancillary components and clients


The Four-Step Process

Ambari 2.0 automates the process defined by HDP, exposing a wizard for performing a rolling upgrade of the cluster through the Ambari Web UI.


Step 1: Register the New Version

The first part of a rolling upgrade is to register the new version with Ambari. This lets Ambari know that there is new software available and where to get it, which is particularly important in the case of installs when there is no Internet access.

Step 2: Install the software on all hosts

Second, the operator instructs Ambari to install the software on all the hosts in the cluster. This step is deliberately called out because, with any cluster install, the most time-consuming task is getting the software onto the machines. The install happens outside of normal cluster operations, so it causes no downtime, and the end result is that the new software is placed side by side with the currently running (older) version.

Step 3: Perform the upgrade

Once the software is installed, Ambari performs the upgrade by orchestrating the process defined by HDP to roll through the cluster and switch to the new software version. This experience is wizard-driven and shows the user each step in the process as it goes from ZooKeeper Servers and Core Master Components all the way through Client Components.

Ambari has built-in guardrails. At periodic stages of the upgrade, Ambari will automatically run Service checks to validate and verify expected functionality. As an added precaution, there are also points where Ambari will prompt (and encourage) you to confirm that operations are indeed running smoothly. If at any point it seems that things have gone awry, you have the option to Downgrade, which orders Ambari to reverse the operations already performed and return you safely to the original state.


Step 4: Finalize the upgrade

Once the upgrade process is complete, the operator is prompted to finalize. This completes the landing process on the new version. Alternatively, you are given a final chance to Downgrade and remain on the original version.

The automated rolling upgrade feature of Ambari 2.0 also introduces a new concept of an Upgrade Pack to Ambari Stacks. Check out the upgrade pack, and you can see that it defines the steps, tasks and rules that guide the automation. This approach was important to ensure we can extend Ambari in the future as more ecosystem components come under Ambari management.
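For operators who want to watch the rollout from a script rather than the wizard, something like the hedged sketch below can poll the cluster’s registered stack versions over the Ambari REST API and print their state as the upgrade progresses. The resource path and field names are assumptions about how Ambari 2.x exposes this information; treat them as a starting point, not a reference.

```python
# Hedged sketch: poll registered HDP stack versions and their install/upgrade state.
# The resource path and response field names are assumptions to verify.
import time
import requests

AMBARI = "http://ambari-host:8080/api/v1"
AUTH = ("admin", "admin")
CLUSTER = "mycluster"

def show_stack_versions():
    resp = requests.get(f"{AMBARI}/clusters/{CLUSTER}/stack_versions", auth=AUTH)
    resp.raise_for_status()
    for item in resp.json().get("items", []):
        info = item.get("ClusterStackVersions", {})
        print(info.get("stack"), info.get("version"), "state:", info.get("state"))

for _ in range(20):          # poll for roughly ten minutes
    show_stack_versions()    # e.g. INSTALLING -> INSTALLED -> UPGRADING -> CURRENT
    time.sleep(30)
```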

Download Ambari and Learn More

We think you’ll agree this capability is a huge step forward for Ambari. We encourage you to try out Ambari 2.0 today. Check out these resources for more information:


Ambari 2.0 for Deploying Comprehensive Hadoop Security


Hortonworks Data Platform (HDP) provides centralized enterprise services for comprehensive security to enable end-to-end protection, access, compliance and auditing of data in motion and at rest. HDP’s centralized architecture—with Apache Hadoop YARN at its core—also enables consistent operations for provisioning, management, monitoring and deployment of Hadoop clusters in a reliable, enterprise-ready data lake.

But comprehensive security and consistent operations go together, and neither is possible in isolation.

We published two blogs recently announcing Ambari 2.0 and its new ability to manage rolling upgrades. This post will look at those innovations through the security lens, because security, like operations, is a core requirement for enterprise-ready Hadoop.

Security in Hadoop Today

HDP offers comprehensive security across all batch, interactive and real-time workloads and access patterns. Hortonworks is focused on delivering comprehensive security across five pillars: centralized administration, authentication, authorization, audit, and data protection.


HDP provides comprehensive security by way of three key services:

  • Kerberos is an authentication protocol developed at MIT and adopted by the open source community to authenticate users attempting to access Hadoop.
  • Apache Ranger provides centralized security administration for HDFS, Hive, HBase, Storm and Knox as well as fine-grained access control.
  • Apache Knox provides perimeter security for API access and REST services.

Security Setup with Ambari 2.0

Ambari 2.0 represents a significant milestone in the community’s ongoing work to make Hadoop enterprise-ready with easy security setup and administration. Ambari 2.0 can now help administrators automate Kerberos setup for a cluster, install a KDC and create service principals. Administrators can also use Ambari to install the Ranger admin portal and enable the Ranger plugins with a few clicks.

Automated Kerberos integration

Before Ambari 2.0, the Kerberos integration in Hadoop required a combination of manual steps to install and manage these important components:

  • KDC (key distribution center),
  • User and service principals (identities) and
  • Respective keytabs (tokens).

With Ambari 2.0, the entire Kerberos setup process is automated, now with the following:

  • A step-by-step wizard to set up the Kerberos infrastructure
  • Integration with existing MIT KDC or Active Directory infrastructure
  • Deployment, configuration and management of Kerberos Clients
  • First time setup as well as ongoing management for adding new services or nodes
  • Automated creation of principals
  • Automated generation and distribution of keytabs
  • Support for regeneration of keytabs

Ambari 2.0 can automate Kerberos deployment and management for existing clusters already using Kerberos, as well as for users looking to install a new cluster.

Figure 1: Initial screen for Kerberos setup

This Kerberos Overview documentation for Ambari 2.0 contains an overview and step-by-step details on Kerberos setup.

Automated Ranger deployment

Hortonworks introduced Apache Ranger to deliver the vision of coordinated security across Hadoop with centralized administration, fine-grained access control and audit. Apache Ranger’s first release enhanced the existing capabilities of the original code base developed at XA Secure and added support for audit storage in HDFS, authorization and auditing for Apache Storm and Knox, and REST APIs for managing policies.

With Ambari 2.0, administrators can now easily add comprehensive security through Ranger to either an existing or new cluster. Ambari 2.0 adds in the following benefits to Ranger:

  • Automated install of the Ranger policy administration portal and user sync. The policy database (MySQL or Oracle) can be configured and user sync can be integrated with LDAP/AD or Unix.
  • Easy one-click setup of the Ranger plugin for HDFS, Hive, HBase, Storm and Knox
  • Ability to start/stop services through the Ambari UI
  • Ability to disable plugins through the Ambari UI

The following screenshots show a user adding the Ranger service via Ambari.

Figure 2. Ambari screen to add Ranger service

Figure 3: Ambari screen showing already installed and running Ranger service

Hortonworks continues to lead open-source innovation to enable comprehensive data security for Hadoop—making it easier for security administrators to protect their clusters. With Ambari 2.0, we added the automated install and administration of the HDP cluster’s security infrastructure, with support for installing Kerberos, Apache Knox and Apache Ranger.

This innovation highlights what Hortonworks customers appreciate about our 100% open-source Apache Hadoop platform. HDP provides centralized enterprise services for comprehensive security and consistent operations to enable provisioning, management, monitoring and deployment of secure Hadoop clusters.

Hadoop is ready for the enterprise—providing any data, for any application, anywhere.

More About Comprehensive Security and Consistent Operations in HDP

Read recent Ambari posts

Learn more about the Apache projects


Apache Hadoop 2.7.0 Released!


The Apache Hadoop community is happy to announce the release of Apache Hadoop 2.7.0! We want to express our gratitude to every contributor, reviewer and committer.

The Hadoop community fixed 923 JIRAs in total as part of the 2.7.0 release. Of the 923 fixes:

  • 259 were in Hadoop Common
  • 350 were in HDFS
  • 253 were in YARN
  • 61 were in MapReduce

Hadoop 2.7.0 is the first Hadoop release in 2015, following late last year’s 2.6.0. While Hadoop 2.7.0 is not yet ready for production, it enables the community to execute extensive testing and downstream adoption in order to find and address potential incompatibilities and critical issues. A more stable and production ready release of Hadoop 2.7.x will follow soon.

Starting with Hadoop 2.7.0, Apache Hadoop drops support for the JDK6 runtime and supports only JDK 7+ versions.

The release contains a number of significant enhancements. A few notable ones are:

Hadoop Common

  • Windows Azure Storage Blob support (HADOOP-9629), available in trunk for a while, is now integrated into branch-2 and released as part of 2.7.0.

Hadoop HDFS

  • Enable new read/write scenarios in HDFS by adding support for truncate (HDFS-3107) and support for files with variable-length blocks (HDFS-3689); a small WebHDFS truncate sketch follows this list
  • Enforce quotas at the Heterogeneous Storage Type granularity in HDFS (HDFS-7584)
  • Enhance management (HDFS-7424) and monitoring (HDFS-7449) for the NFS Gateway Server
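As a concrete, hedged illustration of the new truncate capability, the sketch below calls WebHDFS to cut a file back to a fixed length; Hadoop 2.7 exposes truncate through WebHDFS alongside the FileSystem API. The NameNode address, user, and the exact operation and parameter names are assumptions to check against the WebHDFS documentation for your release.

```python
# Hedged sketch: truncate an HDFS file via WebHDFS on a Hadoop 2.7 cluster.
# The host/port, user, and the TRUNCATE/newlength names are assumptions to verify.
import requests

NAMENODE = "http://namenode-host:50070"   # default NameNode HTTP port in HDP 2.x
path = "/data/logs/events.log"

resp = requests.post(
    f"{NAMENODE}/webhdfs/v1{path}",
    params={"op": "TRUNCATE", "newlength": 1048576, "user.name": "hdfs"},
)
resp.raise_for_status()
# A false result typically means the last block is still being recovered.
print("truncated immediately:", resp.json().get("boolean"))
```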

Hadoop YARN

  • YARN-3100 – Makes YARN authorization pluggable so that tools like Apache Ranger can provide authorization for YARN job submission operations
  • YARN-1492 – Automatic shared, global caching of YARN localized resources (beta)

Hadoop MapReduce

  • MAPREDUCE-5583 – Adds the ability to limit the size of a running MapReduce job by restricting the maximum number of Map or Reduce tasks running at any point in time (see the sketch after this list)
  • MAPREDUCE-4815 – Speed up Hive, Pig and MapReduce jobs that deal with many output files by making enhancements to the FileOutputCommitter
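For illustration, the hedged sketch below launches a stock example job with the new running-task limits applied. The property names are our reading of MAPREDUCE-5583 and the jar path is a placeholder, so verify both against your cluster before using them.

```python
# Hedged sketch: cap how many map and reduce tasks of a single job run at once.
# The property names are our reading of MAPREDUCE-5583; the jar path is a placeholder.
import subprocess

cmd = [
    "hadoop", "jar",
    "/usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar",  # placeholder path
    "wordcount",
    "-D", "mapreduce.job.running.map.limit=20",     # at most 20 map tasks running at once
    "-D", "mapreduce.job.running.reduce.limit=5",   # at most 5 reduce tasks running at once
    "/data/input", "/data/output",
]
subprocess.run(cmd, check=True)
```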

Additional enhancements include “nntop,” a top-like tool for NameNode (HDFS-6982) and a fast leveldb-based implementation (YARN-2765) for the ResourceManager StateStore. Please see the Hadoop 2.7.0 Release Notes for the full list of features, improvements and bug-fixes.

Many thanks to everyone who contributed to the release, and everyone in the Apache Hadoop community!


Announcing Apache Knox Gateway 0.6.0


With Apache Hadoop YARN as its architectural center, Apache Hadoop continues to attract new engines to run within the data platform, as organizations want to efficiently store their data in a single repository and interact with it in different ways. As YARN propels Hadoop’s emergence as a business-critical data platform, the enterprise requires more stringent data security capabilities. The Apache Knox Gateway (“Knox”) provides HTTP based access to resources of the Hadoop ecosystem so that enterprises can confidently extend Hadoop access to more users, while maintaining compliance with enterprise security policies.
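To make that HTTP access concrete, here is a minimal hedged sketch of listing an HDFS directory through a Knox gateway with WebHDFS and LDAP credentials. The gateway host, port, topology name ("default") and certificate path are assumptions that vary by deployment.

```python
# Hedged sketch: list an HDFS directory via WebHDFS proxied through Knox.
# Gateway host, port, topology name, credentials, and cert path are placeholders.
import requests

KNOX = "https://knox-host:8443/gateway/default"   # https://<gateway>:<port>/gateway/<topology>

resp = requests.get(
    f"{KNOX}/webhdfs/v1/data/sales",
    params={"op": "LISTSTATUS"},
    auth=("ldap-user", "ldap-password"),          # HTTP Basic auth, validated against LDAP/AD
    verify="/etc/knox/gateway.pem",               # the gateway's TLS certificate or a CA bundle
)
resp.raise_for_status()
for status in resp.json()["FileStatuses"]["FileStatus"]:
    print(status["type"], status["pathSuffix"])
```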

On May 6th, the community announced the release of Apache Knox Gateway 0.6.0. With this release, the community addressed over 40 JIRA issues. Knox 0.6.0 delivers many new features, fixes and enhancements. Among these improvements, five stand out as particularly important:

  • Support for the Storm UI REST API
  • Optimized LDAP authentication through caching
  • Configuration-driven REST API integration
  • SSL mutual authentication
  • Improved configuration for load balancers

This blog provides an overview of these new features and how they integrate with other Hadoop services, and offers a preview of enhancements planned for upcoming releases.

Support for Storm UI REST API

The Storm UI daemon provides a REST API that allows users to interact with a Storm cluster. This includes retrieving metrics data and information related to cluster configuration and operations management such as starting or stopping topologies. By providing access to this API through a new Knox routing service for Storm, users can now securely access these resources. This feature will be available as a technical preview in HDP 2.3, to be released in the latter half of this year.

Optimized LDAP Authentication through Caching

In this release of Apache Knox, we reduce the load on the LDAP or Active Directory server and optimize consecutive API calls by reducing the number of authentication attempts. Support for caching LDAP authentication information in Knox also eliminates the need for client sessions via JSESSIONID.

Configuration Driven REST API Integration

The objective of this improvement is for contributors to easily provide routing services to Knox in order to support additional APIs. A basic integration now simply consists of providing a Service Definition file and a Rewrite file. For more involved integrations, users may have to implement service-specific request handling by providing a custom Dispatch class.

The Knox community will benefit from simpler and dynamic REST API integrations through this feature.

SSL Mutual Authentication

SSL Mutual Authentication provides a mechanism for establishing a strong trust between clients and servers based on exchanging certificates to prove their identity. The explicit trust of a client’s certificate using mutual authentication allows Knox to unambiguously identify a client that is presenting a HTTP header or other assertion that needs to be trusted. This feature is especially beneficial for deployments that leverage the pre-authenticated SSO feature.
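Below is a small hedged sketch of what a client presenting its own certificate to Knox might look like; the URL, certificate, key and CA bundle paths are placeholders, and enabling mutual authentication on the gateway side is a separate configuration step not shown here.

```python
# Hedged sketch: call Knox with a client certificate so the gateway can verify
# the caller (mutual TLS). The URL and all file paths are placeholders.
import requests

resp = requests.get(
    "https://knox-host:8443/gateway/default/webhdfs/v1/tmp",
    params={"op": "LISTSTATUS"},
    cert=("/etc/pki/client/app.crt", "/etc/pki/client/app.key"),  # client certificate + key
    verify="/etc/pki/ca/ca-bundle.crt",                           # CA bundle that signed Knox's cert
)
print(resp.status_code)
```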

Improved Configuration for Load Balancers

This feature provides a configuration facility to indicate the “frontend” URL that is used for URL rewriting for deployments with load balancers. This enhancement reduces the processing that the load balancer needs to perform for the rewriting task.

Preview of Features to Come

The Knox release would not have been possible without excellent contributions from the dedicated, talented community members who have done a great job understanding the needs of the user community and trying to deliver on them. Based on demand from the user community, we will continue to focus our efforts in three primary areas:

  • Providing access to the evolving REST API of the modern data architecture

    In future releases, we plan to expose more APIs to new and interesting clients and applications. We believe that our configuration-driven approach for API integrations will facilitate accelerated support for new API integrations.

  • Authentication and SSO capabilities for the REST APIs and web UIs within the Hadoop ecosystem

    We will continue to expand authentication, federation and SSO capabilities to meet user and developer needs. We plan to deliver support for Keystone tokens and JWT capabilities in upcoming releases.

  • Reducing integration burden for authentication and load balancing solutions

    We will introduce new authentication and federation providers to Knox and provide support for advanced load-balancing features, such as relevant aspects of the Forwarded HTTP Extension.

Download Apache Knox Gateway and Learn More


Announcing HDP Developer Portal


Historically, the strength of a platform has lain in the ability of developers to learn, try, and build against its APIs and capabilities. As Apache Hadoop matures as a platform, it’s the creativity and effort of the developer community that drives the innovation making Hadoop a vibrant and impactful foundation of a modern data architecture.

A successful developer community leads to a successful platform, and at Hortonworks we are committed to reducing the friction to speed up the success of our customers. Today we’re announcing the release of developer.hortonworks.com, our new starting point for Hadoop developers looking to learn Hortonworks Data Platform (HDP), roll up their sleeves and start building their first applications. Next, they can tap into portal resources to speed the successful development of new data-centric applications, or extend existing systems with the power of Apache Hadoop.


Innovating while Collaborating

Application development today isn’t about rigid curricula and certifications that “qualify” you to build applications. In the open source world, developers are constantly exploring, learning, trying new things, hearing what others have to say about development patterns, methodologies and solutions, and sharing their own discoveries with their peers. This dynamic nature of application development is one of the key enablers of innovation across the IT industry. One characteristic of organic evolution is the absence of centralized curation helping developers “get started” with a new technology or enabling rapid discovery of the depth and breadth of relevant content. This problem is particularly acute with a rapidly growing and evolving technology such as Apache Hadoop.

Developer.hortonworks.com has two core areas: first, a “New To Hadoop” program that helps developers new to Hadoop learn about Hadoop, try it out with hands-on tutorials on a fully-featured virtualized Hadoop environment (either on your PC or in the cloud), and finally, the resources necessary to start building your own applications. The second area, Hadoop Developer Resources, gathers the critical content and code for the seasoned HDP developer.

Sharing the Wealth of Knowledge

Our team has gathered the reference applications, tutorials, community forums, events and meetups, Stack Overflow discussions, and documentation repositories that speed the time to application delivery. In the coming months, you will see a steady stream of new samples, Ambari Views, Stacks and Blueprints, and tutorials that support new and existing scenarios and functionality in the Hortonworks Data Platform.

Need Something?

If you have any resources you would like to see added, suggestions or comments on the site, or other feedback, please send me an email at mcarter@hortonworks.com.

Keep on coding!

The post Announcing HDP Developer Portal appeared first on Hortonworks.

How to Leverage Big Data Security with Informatica and Hortonworks?


In this guest blog, Sumeet Kumar Agrawal, principal product manager for Big Data Edition product at Informatica, explains how Informatica’s Big Data Edition integrates with Hortonworks’ security projects, and how you can secure your big data projects.

Many companies already use big data technologies like Hadoop in their production environments, storing and analyzing petabytes of data, including transactional data, weblog data, and social media content, to gain better insights about their customers and business. Accenture found in a recent survey that 79 percent of respondents agree that "companies that do not embrace big data will lose their competitive position and may even face extinction."

However, without proper security, your Big Data solution might very well open the doors to breaches that have the potential to cause serious reputational damage and legal repercussions. Hortonworks has led the community in bringing comprehensive security, in open source, to Apache Hadoop. Partners like Informatica can leverage security frameworks in Hadoop to enable users to securely bring in data from external sources, transform it and load it into the different Hadoop components.

Informatica Big Data Edition Integration with Hortonworks Security Projects

Informatica Big Data Edition’s codeless, visual development environment accelerates the ability of organizations to put their Hortonworks Data Platform clusters into production. As an alternative to implementing complex hand coded data movement and transformation, Informatica Big Data Edition enables high-performance data integration and quality pipelines that leverage the full power of each node of your Hadoop cluster with team-based visual tools that can be used by any ETL developer.

Informatica Big Data Edition integrates with the security framework offered within HDP. The following figure shows the security offerings within the latest version of HDP:

[Figure: security offerings within HDP]

Authentication

Kerberos

Kerberos is the most widely adopted authentication technology in the big data space. It is an authentication protocol for trusted hosts on untrusted networks, providing secure authentication between clients, nodes, and services. Starting with Ambari 2.0, Kerberos can be fully deployed using Ambari (http://hortonworks.com/blog/announcing-apache-ambari-2-0/#security).

Informatica Big Data Edition integrates completely with Kerberos. A key aspect of Kerberos integration is the Key Distribution Center (KDC). Informatica supports both Active Directory and MIT-based KDCs.
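
To give a sense of what Kerberos authentication looks like from a client's point of view, here is a minimal sketch (not part of Informatica's product) that lists an HDFS directory over WebHDFS on a Kerberized cluster using SPNEGO. The NameNode host and port, the HDFS path, and the assumption that a ticket has already been obtained with kinit are placeholders for illustration.

    # Minimal sketch: list an HDFS directory on a Kerberized cluster via WebHDFS.
    # Assumes a ticket has already been obtained with kinit and that the
    # requests and requests-kerberos packages are installed; host, port and path are placeholders.
    import requests
    from requests_kerberos import HTTPKerberosAuth, OPTIONAL

    NAMENODE = "http://namenode.example.com:50070"   # placeholder NameNode web address
    PATH = "/user/etl/landing"                       # placeholder HDFS path

    # SPNEGO negotiation: the client presents a Kerberos service ticket, not a password.
    auth = HTTPKerberosAuth(mutual_authentication=OPTIONAL)
    resp = requests.get(NAMENODE + "/webhdfs/v1" + PATH + "?op=LISTSTATUS", auth=auth)
    resp.raise_for_status()

    for status in resp.json()["FileStatuses"]["FileStatus"]:
        print(status["pathSuffix"], status["type"])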

Knox

Knox is designed for applications that use REST APIs, or JDBC/ODBC over HTTP, to access or update data. It is not currently recommended for performance-intensive applications such as Informatica. Knox is also not designed for RPC-based access (Hadoop clients); in that case it is recommended to use Kerberos to authenticate system and end users.

Here is a representative architecture where Knox is deployed over a Hadoop cluster. Knox, in this example, provides perimeter security for users accessing data through applications leveraging REST or HTTP based services.

[Figure: Knox deployed as perimeter security over a Hadoop cluster]
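
To illustrate what perimeter access through Knox looks like for a REST client, here is a rough sketch of the same WebHDFS listing routed through the gateway. The gateway host, the "default" topology name, the demo credentials, and the certificate path are all assumptions for illustration; a real deployment would typically authenticate users against LDAP/AD behind Knox.

    # Rough sketch: the same WebHDFS LISTSTATUS call, proxied through Apache Knox.
    # Gateway host, topology name, credentials, and certificate path are illustrative only.
    import requests

    KNOX = "https://knox.example.com:8443/gateway/default"   # assumes a topology named "default"

    resp = requests.get(
        KNOX + "/webhdfs/v1/user/etl/landing?op=LISTSTATUS",
        auth=("guest", "guest-password"),     # demo LDAP credentials; replace with real ones
        verify="/etc/knox/gateway.pem",       # assumed path to the gateway's SSL certificate
    )
    resp.raise_for_status()
    print(resp.json()["FileStatuses"]["FileStatus"])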

Informatica Big Data Edition provides several rich capabilities, such as mass data ingestion and data preparation on Hadoop, and Knox may not be recommended for some of them. We suggest using Kerberos for authentication when Informatica ETL tools are being leveraged for these kinds of workloads.

Authorization

Apache Ranger

Apache Ranger offers a centralized security framework to manage fine-grained access control over Hadoop data access components like Apache Hive and Apache HBase. Within Hive, there are recommended best practices for setting up policies in Hiveserver2 and Hive CLI. You can find more details in this blog: http://hortonworks.com/blog/best-practices-for-hive-authorization-using-apache-ranger-in-hdp-2-2/

For HiveServer2, Hive authorization does not allow the transform function under SQL standard authorization (and thus under Ranger). Informatica BDE plans to support HiveServer2 in the near future and recommends that customers use storage-based authorization to protect the metastore when using Hive with Informatica. More details on storage-based authorization can be found here.
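
For readers who want a feel for what centralized authorization looks like in practice, the sketch below creates a read-only Hive policy through the Ranger admin REST API. The endpoint shown is the v2 public API available in later Ranger releases; the host, credentials, service name, user, and payload layout are assumptions, and the exact schema varies by Ranger version, so treat this as an outline rather than a reference.

    # Hedged sketch: creating a read-only Hive policy via the Ranger admin REST API.
    # Host, credentials, service name, and payload layout are illustrative assumptions;
    # check the Ranger documentation for the exact schema of your version.
    import json
    import requests

    RANGER_ADMIN = "http://ranger.example.com:6080"   # default Ranger admin port
    AUTH = ("admin", "admin-password")                # placeholder credentials

    policy = {
        "service": "cluster1_hive",                   # assumed Hive service name registered in Ranger
        "name": "staging_read_only",
        "resources": {
            "database": {"values": ["staging"]},
            "table": {"values": ["*"]},
            "column": {"values": ["*"]},
        },
        "policyItems": [
            {
                "users": ["infa_etl"],                # hypothetical ETL service account
                "accesses": [{"type": "select", "isAllowed": True}],
            }
        ],
    }

    resp = requests.post(
        RANGER_ADMIN + "/service/public/v2/api/policy",
        auth=AUTH,
        headers={"Content-Type": "application/json"},
        data=json.dumps(policy),
    )
    resp.raise_for_status()
    print("Created policy id:", resp.json().get("id"))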

Summary

In summary, Informatica and Hortonworks together advance big data security and help organizations keep their data secure as they implement big data projects.

About the Author

Sumeet Kumar Agrawal is a Principal Product Manager for the Big Data Edition product at Informatica. Based in the Bay Area, Sumeet has over eight years of experience working on different Informatica technologies. He is responsible for defining Informatica's big data product strategy and roadmap, and for working with customers to define their big data platforms. His expertise includes the Hadoop ecosystem and security, as well as development-oriented technologies such as Java and web services. Sumeet is also responsible for evaluating Hadoop partner integration technologies for Informatica.

The post How to Leverage Big Data Security with Informatica and Hortonworks? appeared first on Hortonworks.

Announcing Apache Hive 1.2


SQL is the most popular use case for the Hadoop user community, and Apache Hive remains the de facto standard. Earlier this week, the Apache Hive community released Apache Hive 1.2.0.

Already the third release this year, Apache Hive 1.2.0 shows how the Hive developer community continues to improve the project and grow its team, with 11 Hive contributors promoted to committers in the last three months. Dedicated to making Hive enterprise-ready, the community has made improvements in the following areas:

  1. Additional SQL functionality
  2. Security enhancements
  3. Performance gains
  4. Stability and usability

For the complete list of features, improvements, and bug fixes, see the release notes. Here are notable improvements:

SQL

  • Support for SQL Union (Union Distinct) functionality (HIVE-9039)
  • Support for specifying a column list in the INSERT statement, e.g. insert into target(y,z) select * from source (HIVE-9481); a short usage sketch follows this list
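
To show these two additions in context, here is a small sketch that runs them from a Python client against HiveServer2. The host, user, and table and column names are placeholders, and it assumes the pyhive package is installed.

    # Small sketch exercising two Hive 1.2 SQL additions from a Python client.
    # Host, user, and table/column names are placeholders; assumes pyhive is installed
    # and HiveServer2 is reachable on its default port.
    from pyhive import hive

    conn = hive.Connection(host="hiveserver2.example.com", port=10000, username="etl")
    cur = conn.cursor()

    # UNION without ALL now behaves as UNION DISTINCT (HIVE-9039): duplicate rows are removed.
    cur.execute("""
        SELECT id FROM orders_2014
        UNION
        SELECT id FROM orders_2015
    """)
    print(cur.fetchall())

    # Column list on an INSERT (HIVE-9481): the target columns are named explicitly.
    cur.execute("INSERT INTO target (y, z) SELECT y, z FROM source")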

Performance and Optimizer Improvements

  • Grace hash join algorithm for Hive so that Map Joins use disk on overflow instead of failing (HIVE-9277)
  • Predicate PushDown enhancements (HIVE-9069)
  • Improvements in stats for better MapJoin selection, Reducer Parallelism (HIVE-9392, HIVE-10107)
  • Count Distinct Performance Improvements (HIVE-10568)
  • CBO – Better Windowing support (HIVE-10627, HIVE-10686)
  • Changes to comply with SQL:2011 standard for reserved/non-reserved keywords (HIVE-6617)
  • Caching of statistics in HiveServer2 (HIVE-10382)
  • Improve performance of Vector Map Join by using more vectorization techniques (HIVE-9937, HIVE-9824)

Security Improvements

  • Improvements to the Hive authorization plugin API to allow implementations such as Ranger to filter the results of metadata operations such as SHOW TABLES
  • Support for cookie based authentication in HiveServer2 HTTP transport mode (HIVE-9709, HIVE-9710)
  • Support for JDBC driver to enable 2-way SSL / pass additional HTTP headers via intermediate servers such as Knox (HIVE-10477, HIVE-10339)

Administration

  • Cross-cluster warehouse replication support, in conjunction with Falcon.

Usability

  • Improve HS2 logging; allow logging verbosity to be set at the session level (HIVE-10119)
  • New explain plan output geared towards traditional RDBMS users (HIVE-9780)

The post Announcing Apache Hive 1.2 appeared first on Hortonworks.


Hortonworks Q1 Earnings Reflect Hadoop’s Momentum in the Enterprise


Hadoop really is everywhere. In his recent post, “Going from Hadoop Adoption to Hadoop Everywhere” Shaun Connolly made this point and also quoted Forrester’s Mike Gualtieri:

Hadoop is a must-have for large enterprises

Shaun mentioned these key trends in his post:

  • Hadoop is transforming every industry
  • Enterprises are building applications to make use of all kinds of data
  • Hadoop is ready for the enterprise

Earlier this month, we released Hortonworks' first quarter earnings. We are proud of many of the results from the first quarter, but the addition of 105 new subscription customers in Q1 is one of the best indicators of both the momentum of Open Enterprise Hadoop and the power of the Hortonworks model for extending Hadoop adoption across all industries.

For enterprise software sales, the last quarter is typically the most important, and we did feel proud of our Q4 results—the first we announced as a publicly-traded company. But we actually signed 10 more customers this past quarter than we did in Q4 and our total subscriber base passed 400. This increasing momentum is a tribute to the value of the open source community, to our dedicated partners throughout the ecosystem and to all the hard working Hortonworkers.

While not all of the 105 new customers are Fortune 100 companies, the majority of our new customers are very large corporations with established brands, including new customers such as Schlumberger, Hess, Fannie Mae and Yahoo! JAPAN. The opportunity we are seeing is largest in the $10B+ companies, but we are seeing companies of all sizes benefit from adopting HDP to offload data warehouse costs or grow revenues with new analytic applications.

Other indicators highlight the rapid growth we see in our company and in the Hadoop industry:

  • 40% of the Fortune 100 are Hortonworks subscribers, including: 71% of F100 retailers, 75% of F100 telcos, and 43% of F100 banks
  • 167% expansion in Q1 2015 GAAP Total Revenue over the first quarter of 2014

Our successful partnerships with other datacenter technology vendors also show how well our model is working and they highlight the groundswell of interest throughout the extended Hadoop ecosystem:

  • Growing momentum behind the Open Data Platform (ODP) initiative, with IBM and Pivotal joining us to announce that their respective Hadoop-based platforms (IBM Open Platform 4.0 and Pivotal HD 3.0) are now aligned on a common ODP core of Apache Hadoop 2.6 and Apache Ambari. The ODP core simplifies adoption of Apache Hadoop for the enterprise, improves ecosystem interoperability and unlocks customer choice.
  • JPMorgan Chase & Co. and Schlumberger joined the Data Governance Initiative (DGI) to address data stewardship and data lifecycle management in their respective financial services and oil and gas industries. Recall other industry-leading companies like Aetna (insurance), Target (retail), Merck (healthcare) and SAS (technology) joined the collaboration earlier this year.
  • Pivotal announced that Pivotal HAWQ is available and certified on the Hortonworks Data Platform.
  • EMC‘s Isilon OneFS file system is now certified with HDP allowing EMC customers to use HDP on their existing Isilon implementations.
  • Cisco announced it established a software resale agreement with Hortonworks including marketing, sales and training worldwide for Cisco sales and support organizations.

Come Feel the Momentum at Hadoop Summit on June 9th in San Jose!

If you’d like to see the Hadoop momentum for yourself, then come join us at Hadoop Summit in San Jose starting June 9th. This year’s summit will feature thousands of Hadoop users and Hortonworks customers discussing their successes with Hadoop.

In addition to keynote speakers, Summit will host 163 breakout sessions featuring 199 speakers representing 79 organizations.

We are particularly excited to hear from 75 end-user presenters from enterprises using Hadoop to solve real-world business challenges across every major industry. This represents a huge increase in the number of end-user voices compared to what we heard at last year’s Hadoop Summit.

Some of those companies will be represented on the main stage during keynotes. Others will speak on a customer panel that I will moderate. And many more are presenting in breakout sessions.

Here’s a small sample of those breakouts that offer opportunities to hear from successful Hadoop customers:

  • Ernst & Young will discuss anomaly detection for online security, using machine learning
  • Noble Energy plans to share their journey to becoming a data-driven oil & gas company
  • Mercy will share their success with Hadoop for real-time healthcare analytics
  • Aetna will describe their Mosaic project that delivers a single view for data discovery and profiling
  • Verizon Wireless presents how it turned its IT corridor into a multibillion dollar source of revenue

San Jose Summit 2015 promises to be an informational, innovative and entertaining experience for everyone.

Come join us and experience the momentum for yourself.

The post Hortonworks Q1 Earnings Reflect Hadoop’s Momentum in the Enterprise appeared first on Hortonworks.

Hortonworks Data Platform 2.3 – Delivering Transformational Outcomes


Over the past two quarters, Hortonworks has been able to attract over 200 new customers. We are attempting to feed the hunger our customers have shown for Hadoop over the past two years. We are seeing truly transformational business outcomes delivered through the use of Hadoop across all industries. The most prominent use cases are focused on:

  • Data Architecture Optimization – keeping 100% of the data at up to 1/100th of the cost while enriching traditional data warehouse analytics
  • A Single View of customers, products, and supply chains
  • Predictive Analytics – delivering behavioral insight, preventative maintenance, and resource optimization
  • Data Discovery – exploring datasets, uncovering new findings, and operationalizing insights

What we have consistently heard from our customers and partners, as they adopt Hadoop, is that they would like Hortonworks to focus our engineering activities on three key themes: Ease of Use, Enterprise Readiness, and Simplification. During the first half of 2015, we made significant progress on each of these themes and we are ready to share the results. Keep in mind there is much more work to be done and we plan on continuing our efforts throughout the remainder of 2015.

Today Hortonworks proudly announces Hortonworks Data Platform 2.3, which delivers a breakthrough user experience along with increased enterprise readiness across security, governance, and operations. In addition, we are enhancing our support subscription with a new service called Hortonworks SmartSense™.

Breakthrough User Experience

HDP 2.3 eliminates much of the complexity of administering Hadoop and improves developer productivity.

Hortonworks has been leading a truly Open Source and Open Community effort to put a new face on Hadoop, with the goal of eliminating the need for any cluster administrator, developer, or data architect to interact with a command-line interface (CLI). I know there are folks who love their CLI tools and I’m not saying that we should deprecate and remove those, but there are a large number of potential Hadoop users who would prefer to interact with the cluster entirely through a browser.

We actually started this effort with the introduction of Ambari 1.7.0, which delivered an underlying framework to support the development of new Web-based Views. We’ve built on that progress, leveraging the Views framework to deliver a breakthrough user experience for both Hadoop operators and developers. Here are some of the details…

Smart Configuration

We have worked to develop two critical new capabilities for the Hadoop operator. The first is Smart Configuration. For configuration of HDFS, YARN, HBase, and Hive, we have provided an entirely new user experience within Apache Ambari. It is guided, opinionated (in a good way), and more digestible than ever before.

[Screenshot: Smart Configuration in Ambari]

Our customers have told us that this is a giant leap forward from the previous approach to configuring these parameters, and we hope you agree. For the experts out there, don't worry: we still allow you to change, configure, and manipulate all of your favorite settings (which may not appear here) via the Advanced tab. But Smart Configuration provides a much simpler way to configure the most critical and frequently used parameters.

YARN Capacity Scheduler

With Hadoop 2 came YARN, along with a new pluggable scheduler known as the Capacity Scheduler. The Capacity Scheduler allows multiple tenants to securely share a large cluster, such that their applications are allocated resources in a timely manner under the constraints of allocated capacities. As organizations embrace the concept of a data lake, which may support datasets and workloads from many different teams or parts of the organization, the Capacity Scheduler allows the Hadoop operator to define minimum capacity guarantees within the cluster, while each organization can access any excess capacity not being used by others. This approach ultimately provides cost-effective elasticity and ensures service-level agreements can be met for the queries, jobs, and tasks being run on Hadoop.
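
To make those capacity guarantees concrete, here is a minimal, purely illustrative capacity-scheduler.xml sketch that splits a cluster between two hypothetical queues; the queue names and percentages are placeholders, not recommendations.

    <!-- Illustrative capacity-scheduler.xml: two hypothetical queues. "marketing" is
         guaranteed 30% of the cluster and may grow elastically to 50% when capacity
         is idle; queue names and numbers are placeholders. -->
    <configuration>
      <property>
        <name>yarn.scheduler.capacity.root.queues</name>
        <value>engineering,marketing</value>
      </property>
      <property>
        <name>yarn.scheduler.capacity.root.engineering.capacity</name>
        <value>70</value>
      </property>
      <property>
        <name>yarn.scheduler.capacity.root.marketing.capacity</name>
        <value>30</value>
      </property>
      <property>
        <name>yarn.scheduler.capacity.root.marketing.maximum-capacity</name>
        <value>50</value>
      </property>
    </configuration>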

However, the Capacity Scheduler can be complex to configure, requiring manipulation of a fairly sophisticated XML document. The new experience delivers a user interface with sliders to scale values (very similar to how you might manage allocations across your retirement accounts). It delivers a dramatically simpler way to set up queues, and we believe that Hadoop operators will be thrilled with this new approach.

[Screenshot: YARN Capacity Scheduler queue management in Ambari]

Customizable Dashboards

In addition to Smart Configuration, we also spent time sitting side-by-side with a number of our customers’ Hadoop operators to understand what kinds of dashboards and metrics they typically monitor as part of their job to maintain the overall health of their cluster.

Based on those experiences, we have developed customizable dashboards for a number of the most frequently requested components. Customizable dashboards allow each customer to develop a tailored experience for their environment and decide which metrics are most important to track visually.

Here is an example of an HDFS Dashboard in Ambari:

[Screenshot: HDFS dashboard in Ambari]

While we care very much about making the lives of Hadoop operators easier, what are we doing for the folks who write SQL queries, develop Pig scripts, and build data pipelines? How does HDP 2.3 make their lives easier?

We spent many hours partnering with developers and data stewards to answer those questions. We looked at the tools they currently use, listened to their requests, and started down the path of delivering a breakthrough experience for them in open source and within an open community – allowing others to contribute to this effort as well.

The initial focus was on the SQL developer and looking at the most common tasks they perform. Based on what we learned, we developed an integrated experience to:

  • build SQL queries,
  • provide a visual “explain plan,” and
  • allow an extended debugging experience when using the Tez execution engine.

Here is a screenshot of what we’ve developed to make life easier for the SQL developer:

[Screenshot: the new SQL developer experience in Ambari]

HDP 2.3 also provides a Pig Latin editor that brings a modern, browser-based IDE experience to Apache Pig. There is also a file browser for HDFS and an entirely new user experience for Apache Falcon, with a web-forms approach to rapidly develop feeds and processes. The new Falcon UI also allows you to search and browse processes that have executed, visualize lineage, and set up mirroring jobs to replicate files and databases between clusters or to cloud storage such as Microsoft Azure Storage.

While we at Hortonworks have worked within the community to develop and advance these new user experiences, we are far from done. Some compelling new user experiences are still on the horizon, including a hosted version of Cloudbreak which will allow you to launch HDP into cloud-based environments and Apache Zeppelin (incubating) which provides a breakthrough user experience for Apache Spark. Stay tuned for more developments in this area.

Enterprise Readiness: Enhancements to Security, Governance, and Operations

HDP 2.3 delivers new encryption of data-at-rest, extends the data governance initiative with Apache Atlas, and drives forward operational simplification for both on-premise and cloud-based deployments.

Before we go into the details on Security, Governance, and Operations, I want to highlight a couple of critical additions in HDP 2.3.

YARN continues to be the architectural center of Hadoop and HDP 2.3 provides a technical preview of the Solr search engine running on YARN. As we work within the community to harden and complete this critical advancement for search over big data, it will allow customers to reduce their total cost of ownership by deploying Apache Solr within the same cluster as their other workloads – eliminating the need for a “side cluster” dedicated to indexing data and delivering search results. We encourage customers to try this in their non-production clusters and provide feedback on their experience. Thanks to the team at Lucidworks for making this happen and working with us to do this via Apache Slider!

Another important new capability is high availability (HA) configuration options for Apache Storm, Apache Ranger, and Apache Falcon that power many mission-critical applications and services. Each of these components now provides an HA configuration to support business continuity when failures occur.

HDP 2.3 delivers a number of significant security enhancements. The first is HDFS Transparent Data at Rest Encryption. This is a critical feature for Hadoop and we have been performing extensive testing with our customers as part of an extended technical preview.

As part of providing support for HDFS Transparent Data at Rest Encryption, Apache Ranger provides a key management service (KMS) that leverages the Hadoop Key Provider API and can provide a central key service for Hadoop.

[Figure: Ranger KMS and HDFS transparent data at rest encryption]

There is more work to be done related to encrypting data at rest, but we feel confident customers can already adopt a core set of security use cases. We will continue to expand the capabilities and eliminate some remaining limitations over the coming months.

Other important additions to Apache Ranger include centralized authorization for Apache Solr, Apache Kafka, and YARN. Security administrators can now define and manage security policies and capture security audit information for HDFS, Hive, HBase, Knox, and Storm, along with Solr, Kafka, and YARN.

On the auditing front, Ranger now supports using Solr as the backend for indexing audit information and serving real-time query results. The Ranger team has also optimized audit data by summarizing audit events at the source, reducing audit noise and volume.

This screenshot gives you a sense of that simplification:

[Screenshot: Ranger audit view]

For the partner ecosystem, we have been thrilled by the success of Ambari Stacks and the extensibility and flexibility it provided.

As we look to fuel an ever-expanding partner ecosystem, we decided to take a page out of the Ambari extensibility guide and apply it to both Apache Ranger and Apache Knox. In HDP 2.3, Ranger and Knox both provide the ability for partners to define a “stack”. The stack definition allows partners to leverage Ranger’s centralized authorization and auditing capabilities and Knox’s API gateway capabilities without extensive coding.

Hortonworks believes in this kind of open and extensible approach as the best way to maximize the value for both our partners and customers. Expect to see the proof of this in the coming months.

Shifting to data governance, we launched the Data Governance Initiative (DGI) in January of 2015 and then delivered the first set of technology along with an incubator proposal to the Apache Software Foundation in April. HDP 2.3 delivers the core set of metadata services as an outcome of this effort.

This is really the first step on a journey to address data governance in a holistic way for Hadoop. Some of these initial capabilities ease data discovery, with a focus on Hive, and establish a strong foundation for future additions as we look to tackle Kafka and Storm and to integrate dynamic security policies based on the available metadata tags.

In addition to the new user interface elements described earlier, Apache Falcon also enables Apache Hive database replication in HDP 2.3. Previously, Falcon provided support for replication of files (and incremental Hive partitions) between clusters, primarily to support disaster recovery scenarios. Now customers can use Falcon to replicate Hive databases, tables and their underlying metadata–complete with bootstrapping and reliably applying transactions to targets.

Finally on to operations. The pace of innovation in Apache Ambari continues to astonish. As part of HDP 2.3, Ambari supports a significantly wider range of component deployment and monitoring than ever before. This includes the ability to install and manage: Accumulo, Atlas, DataFu, Mahout, and the Phoenix Query Server (in Tech Preview). It also includes an extended ability to configure the NFS Gateway of HDFS. In addition, Ambari now provides support for rack awareness–allowing you to define and manage your data topology by rack.

We introduced the automation for rolling upgrade as part of Ambari 2.0, but this was primarily focused on automating the application of maintenance releases to your running cluster. Now, Ambari expands its reach to support rolling upgrade for feature bearing releases as well. This automates your ability to roll from HDP 2.2 to HDP 2.3.

Following the general availability of HDP 2.3, Cloudbreak will also become generally available. Since Hortonworks’ acquisition of SequenceIQ, the integrated team has been working hard to complete the deployment automation for public clouds including Microsoft Azure, Amazon EC2, and Google Cloud. Our support and guidance will be available to all Hortonworks customers who have an active Enterprise Plus support subscription.

[Screenshot: Cloudbreak cloud deployment]

Proactive Support with Hortonworks SmartSense™

In addition to all of the tremendous platform innovation, Hortonworks is proud to announce Hortonworks SmartSense™, which adds proactive cluster monitoring and delivers critical recommendations to customers who opt into this extended support capability.

The addition of Hortonworks SmartSense further enhances Hortonworks’ world-class support subscriptions for Hadoop.

To adopt Hortonworks SmartSense our customers can simply download the Hortonworks Support Tool (HST) from the support portal and deploy it to their cluster. HST then collects configuration and other operational information about their HDP cluster and packages it up into a bundle.

After the customer uploads this information bundle to the Hortonworks support team, we use our own HDP cluster to analyze all the information it provides. The analysis performs more than 80 distinct checks across the underlying operating system, HDFS, YARN, MapReduce, Tez, and Hive components.

[Figure: Hortonworks SmartSense analysis checks]

Hortonworks SmartSense then delivers the results of the analysis in the form of recommendations to customers via the support portal. Feedback from our customers who have tried beta versions of SmartSense has been tremendously positive, and we believe there is much more we can do to expand this capability.

For example, we plan to integrate the service with Apache Ambari, so that our subscribers can receive recommendations as “recipes” that can be directly applied to the cluster. We also believe that additional predictions on capacity planning and tuning for maximum cluster resource utilization can be delivered via SmartSense.

This is an example of the Hortonworks SmartSense user interface:

[Screenshot: Hortonworks SmartSense user interface]

Of course, there is so much more that I didn’t cover here which is also part of HDP 2.3! There has been meaningful innovation within Hive for supporting Union within queries and using interval types in expressions, additional improvements for HBase and Phoenix, integration of Solr with Storm and HBase to enable near real-time indexing, and more. But, for now, I’ll leave those for subsequent blog posts that will highlight them all in more detail.

In closing, I would like to thank the entire Hortonworks team and the Apache community for the hard work they put in over the past six to eight months. That hard work set the stage for the enterprises adopting Open Enterprise Hadoop for the first time, as much as it will delight those who have been using Hadoop for years.

HDP 2.3 Resources

The post Hortonworks Data Platform 2.3 – Delivering Transformational Outcomes appeared first on Hortonworks.

Driving Business Transformation with Open Enterprise Hadoop


Hadoop isn’t optional for today’s enterprises—that much is clear. But as companies race to get control over the significantly growing volumes of unstructured data in their organizations, they’ve been less certain about the right way to put Hadoop to work in their environment.

We’ve already seen a variety of wrong approaches with proprietary extensions that limit innovation, fragment architectures and trade openness for vendor lock-in. Now a new consensus is forming around an emerging category that drives truly transformational outcomes: Open Enterprise Hadoop.

Hortonworks pioneered this category, and the Global 5000 is rapidly adopting its unique approach. You can see this momentum in our Hortonworks Q1 earnings announcement. We were able to achieve 200 percent growth in customers and 167 percent growth in GAAP revenue.

In fact, Hortonworks’ innovative approach to this market has been noticed by more than just the industry analyst community. Michal Katz from RBC notes in her blog that Hortonworks’ CEO Rob Bearden is joining the select top industry leaders to transform the industry with next generation solutions.

Here’s why so many organizational leaders are making Open Enterprise Hadoop the foundation of their big data strategy.

Open Enterprise Hadoop takes direct aim at those shortcomings that hampered previous approaches to Hadoop in the enterprise. Those earlier attempts typically relied on proprietary extensions of early Hadoop projects, a branching approach that sealed them off from subsequent innovations, locked them into vendor-specific analytics, and often undermined integration with YARN, the open data operating system.

By taking that path, those Hadoop vendors surrendered much of the rapid innovation that comes from open source development, making the platform feel all too much like the legacy technologies that it was supposed to surpass.

Open Enterprise Hadoop solutions keep Hadoop true to its open source heritage—while also adding crucial innovations to meet demanding enterprise requirements.

Instead of creating their own proprietary extensions, vendors in this category rely solely on open source components and on the open community. They harness the powerful processes governed by the Apache Software Foundation and its enterprise-savvy committers—including more than 100 at Hortonworks alone (which employs the most Hadoop committers in the industry).

As a result of this very intentional strategy, Open Enterprise Hadoop solutions:

  • Leverage the full power of open source development. Open Enterprise Hadoop solutions remain “on-the-trunk,” so enterprises benefit from the latest community innovations as soon as they become available.
  • Consolidate data silos. Open Enterprise Hadoop requires that all Hadoop ecosystem projects leverage the Apache Hadoop YARN data operating system. This makes it possible for organizations to access a centralized “data lake” via multiple heterogeneous access methods, support many different users at once, and scale to deployments managing petabytes of data. Open Enterprise Hadoop also ensures full interoperability beyond the Hadoop core through the promotion of open standards for the broad technology ecosystem.
  • Provide robust operations, security, and governance capabilities. Open Enterprise Hadoop vendors make the platform ready to meet those enterprise standards through the work of project committers who combine enterprise savvy with a commitment to open source principles and processes.

The latest version of Hortonworks Data Platform (HDP) will be introduced this month, and it illustrates some major advances only possible with this approach of harnessing the power of the community. Those advances in HDP 2.3 include:

  • A breakthrough user experience. Dramatic reductions in administrative complexity and a vastly improved user experience speed time to value. Fast setup, streamlined configuration, and simple cluster formation help you get the platform up and running quickly, while real-time dashboards make it easier to maximize cluster health.
  • Enhanced security and governance. HDP 2.3 extends our data governance initiative with Apache Atlas. IT can use a single administrative console to set security policy across the entire cluster. Complete capabilities for authentication, authorization, and auditing support full access control and reporting. HDP can encrypt data at rest and in motion.
  • Proactive support. Hortonworks frees up your finite engineering resources from maintaining the internals of the data platform. While you can still self-support HDP, we provide subscriptions for 24 x 7 support along with patches, updates, and other fixes to keep your critical enterprise workloads running.

You can learn more about what’s in HDP 2.3 by reading Tim Hall’s post from earlier today.

What does all this mean to you? As the industry moves toward open Hadoop solutions, Hortonworks is driving the Open Enterprise Hadoop category to deliver transformational outcomes for today’s businesses.

We’re working closely with our 437 customers—and counting—to understand the needs of enterprises across all industries, and then we leverage the power of our leadership in the open source community to innovate the technology according to those priorities. And as a community, we do it faster than any single vendor ever could.

And we’re just getting started. You’ll be hearing a lot more about Open Enterprise Hadoop in the months ahead—and you’ll like what you hear.

Learn More

About the Author

Matthew Morgan is the vice president of global product marketing for Hortonworks. In this role, he leads Hortonworks product marketing, vertical solutions marketing, and worldwide sales enablement. His background includes twenty years in enterprise software, including leading worldwide product marketing organizations for Citrix, HP Software, Mercury Interactive, and Blueprint. Feel free to connect with him on LinkedIn or visit his personal blog.

The post Driving Business Transformation with Open Enterprise Hadoop appeared first on Hortonworks.

New in HDP 2.3: Enterprise Grade HDFS Data At Rest Encryption


Apache Hadoop has emerged as a critical data platform for delivering the business insights hidden in big data. Because Hadoop is a relatively new technology, system administrators hold it to higher security standards. There are several reasons for this scrutiny:

  • The external ecosystem of data repositories and operational systems that feed Hadoop deployments is highly dynamic and can introduce new security threats on a regular basis.
  • Hadoop deployments contain large volumes of diverse data stored over long periods of time. Any breach of this enterprise-wide data can be catastrophic.
  • Hadoop enables users across multiple business units to access, refine, explore and enrich data using different methods, thereby raising the risk of a potential breach.

Security Pillars in Hortonworks Data Platform (HDP)

HDP is the only Hadoop platform offering comprehensive security and centralized administration of security policies across the entire stack. At Hortonworks we take a holistic view of enterprise security requirements and ensure that Hadoop can not only define but also apply a comprehensive policy. HDP leverages Apache Ranger for centralized security administration, authorization and auditing; Kerberos and Apache Knox for authentication and perimeter security; and native and partner solutions for encrypting data over the wire and at rest.

[Figure: security pillars in HDP]

Data at Rest Encryption – State of the Union

In addition to authentication and access control, data protection adds a robust layer of security, by making data unreadable in transit over the network or at rest on a disk.

Compliance regulations, such as HIPAA and PCI, stipulate that encryption is used to protect sensitive patient information and credit card data. Federal agencies and enterprises in compliance driven industries, such as healthcare, financial services and telecom, leverage data at rest encryption as core part of their data protection strategy. Encryption helps protect sensitive data, in case of an external breach or unauthorized access by privileged users.

There are several encryption methods, varying in their degrees of protection. Disk or OS-level encryption is the most basic, protecting against stolen disks. Application-level encryption, on the other hand, provides a higher level of granularity and prevents rogue admin access; however, it adds a layer of complexity to the architecture.

Traditional Hadoop users have relied on disk encryption methods such as dm-crypt for data protection. Although OS-level encryption is transparent to Hadoop, it adds a performance overhead and does not prevent admin users from accessing sensitive data. Hadoop users are now looking to identify and encrypt only sensitive data, a requirement that calls for finer-grained encryption at the data level.

Certifying HDFS Encryption

The HDFS community worked together to build and introduce transparent data encryption in HDFS. The goal was to encrypt specific HDFS files by writing them to HDFS directories known as encryption zones (EZs). The solution is transparent to applications leveraging the HDFS file system, such as Apache Hive and Apache HBase; in other words, no major code change is required for existing applications already running on top of HDFS. One big advantage of encryption in HDFS is that even privileged users, such as the "hdfs" superuser, can be blocked from viewing encrypted data.
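
To make the workflow concrete, here is a rough command-line sketch of setting up an encryption zone. The key name and path are placeholders, and in a real deployment key creation and zone creation are typically performed by different administrative users, with the cluster's key provider pointing at the Ranger KMS.

    # Rough sketch: creating an encryption zone. Key name and path are placeholders;
    # assumes the cluster's key provider is configured to point at the (Ranger) KMS.

    # 1. Create an encryption key in the KMS (typically done by a key administrator).
    hadoop key create ezkey1

    # 2. Create an empty directory and mark it as an encryption zone (hdfs superuser).
    hdfs dfs -mkdir /data/secure
    hdfs crypto -createZone -keyName ezkey1 -path /data/secure

    # 3. Verify; files written under /data/secure are now encrypted transparently.
    hdfs crypto -listZones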

As with any other Hadoop security initiative, we have adopted a phased approach to introducing this feature to customers running HDFS in production environments. After the technical preview announcement earlier this year, the Hortonworks team worked with a select group of customers to gather use cases and perform extensive testing against them. We have also devoted significant development effort to building secure key storage in Ranger, by leveraging the open source Hadoop KMS. Ranger now provides centralized policy administration, key management and auditing for HDFS encryption.

We believe that HDFS encryption, backed by Ranger KMS, is now enterprise ready for specific use cases. We will introduce support for these use cases as part of the HDP 2.3 release.

HDFS encryption in HDP – Components and Scope

[Figure: HDFS encryption components and scope]

The HDFS encryption solution consists of three components (more details on the Apache website here):

  • HDFS encryption/decryption enforcement: HDFS client-level encryption and decryption for files within an encryption zone
  • Key provider API: the API used by the HDFS client to interact with the KMS and retrieve keys
  • Ranger KMS: the open source Hadoop KMS is a proxy that retrieves keys for a client. Working with the community, we have enhanced the Ranger GUI to securely store keys in a database and to centralize policy administration and auditing. (Please refer to the screenshots below.)

[Screenshots: Ranger KMS key management, policy administration and auditing]

We have extensively tested HDFS data at rest encryption across the HDP stack and, as part of the HDP 2.3 release, will provide a detailed set of best practices for using it across various use cases.

We are also working with key encryption partners so that they can integrate their own enterprise ready KMS offerings with HDFS encryption. This offers a broader choice to customers looking to encrypt their data in Hadoop.

Summary

In summary, to encrypt sensitive data, protect against privileged access, and go beyond OS-level encryption, enterprises can now use HDFS transparent encryption. Both HDFS encryption and Ranger KMS are open source, enterprise-ready, and satisfy compliance-sensitive requirements. As such, they facilitate Hadoop adoption among compliance-conscious enterprises.

The post New in HDP 2.3: Enterprise Grade HDFS Data At Rest Encryption appeared first on Hortonworks.

Announcing Apache Ambari 2.0


Advances in Hadoop security, governance and operations have accelerated adoption of the platform by enterprises everywhere. Apache Ambari is the open source operational platform for provisioning, managing and monitoring Hadoop clusters from a single pane of glass, and with the Apache Ambari 1.7.0 release last year, Ambari made it far easier for enterprises to adopt Hadoop.

Today, we are excited to announce the community release of Apache Ambari 2.0, which will further accelerate enterprise Hadoop usage by simplifying the technical challenges that slow adoption the most. Ambari 2.0 includes many features, the most notable of which are:

  • Automated rolling upgrades for the HDP stack
  • Simplified, comprehensive Hadoop security
  • Ambari Alerts

Many thanks to all of the contributors and committers who collaborated on this release and resolved more than 1,700 JIRA issues. For the complete list of new features, check out this What’s New in Ambari 2.0 presentation.

Enough of the chit-chat. Here are some details of the exciting new features in Apache Ambari 2.0.

Automated Rolling Upgrades for the HDP Stack

The Hortonworks Dev team did a great job describing rolling upgrades in this blog post. To highlight, as enterprises everywhere adopt Hadoop, they deploy more and more mission-critical analytic applications. Because of these mission critical workloads, the platform must undergo minimal to no cluster downtime during upgrades from one version to the next. That means the Hadoop platform needs to be “rolling upgradeable.”

The effort in the open source community to make the Hadoop platform rolling upgradeable goes beyond packaging (even though that is one of the key components of rolling upgrades). The developers need to consider the API compatibility between components, the components need an ability to restart jobs underway on the cluster and the system needs to maintain high availability among the Hadoop components for seamless master component switches during upgrades.

That’s a lot of work. But the Hortonworks Dev team brought it all together with Hortonworks Data Platform 2.2 and the Ambari Automated Rolling Upgrade for HDP Stack capability. This allows Hadoop operators to perform a rolling upgrade from one version of HDP to the next with minimal disruption to the cluster. Ambari orchestrates a series of operations on the cluster (with checks along the way) that help you move components to a newer version.

This only scratches the surface. Stay tuned for subsequent blogs with more details on automated rolling upgrades.

Simplified, Comprehensive Hadoop Security

Ambari 2.0 helps provision, manage and monitor Hadoop security in two ways. First, Ambari now simplifies the setup, configuration and maintenance of Kerberos for strong authentication in the cluster. Second, Ambari now includes support for installing and configuring Apache Ranger for centralized security administration, authorization and audit.

Kerberos has long been the central technology for enabling strong authentication in Hadoop, but Kerberos configuration posed quite a challenge in creating the principals and keytabs, and ongoing maintenance of those artifacts could be cumbersome.

Ambari 2.0 makes this easier with an automated wizard-driven Kerberos configuration approach that eliminates time-consuming administration tasks. Ambari can work with your existing Kerberos infrastructure, including Active Directory, to automatically generate your cluster’s requisite principals and keytabs. Then, as you expand your cluster with more hosts or new services, Ambari can talk to your Kerberos infrastructure and automatically adjust the cluster configuration.

Apache Ranger is the other side of the security equation, providing centralized management of access control services for administration, authorization and audit. Ranger was added as a GA component in Hortonworks Data Platform 2.2 and now with Ambari 2.0, Ranger can be automatically installed and configured with the rest of your cluster components.

Watch this blog for future posts digging deeper into Kerberos, Apache Ranger and comprehensive security support with Ambari 2.0.

Ambari Alerts

The enterprise Hadoop operator needs maximum visibility into the health of the cluster. As the operational framework for Hadoop, Ambari must provide that visibility out-of-the-box and also flexibly integrate with existing enterprise monitoring systems. Ambari Alerts aims to strike that balance between ease and flexibility.

[Screenshot: Ambari Alerts]

Ambari Alerts provides centralized management of health alerts and checks for the services in your cluster. Ambari automatically configures the particular set of alerts based on the services installed. As a Hadoop operator, you have control over which alerts are enabled, their thresholds and their reporting output. For maximum flexibility, alert groups and multiple notification targets give you very granular control of the “who, what, why and how” around alerts. This puts both flexibility and power in the hands of the Hadoop operator, who can now:

  • Create and manage multiple notification targets and control who gets notified for which alerts.
  • Filter notification by alert severity and send certain notifications to specific targets based on that severity.
  • Control notification target methods, including support for EMAIL + SNMP so the person being notified can be alerted via their preferred method.

Ambari also exposes alerts REST API endpoints to enable integration with existing systems. There are a few integration patterns in the What’s New in Ambari 2.0 slides to give you a better sense of the possibilities. As one example of the Ambari community rallying around Alerts, our partners at SequenceIQ dove in head-first and have already integrated alerts with Periscope. Be sure to check out what they have done, since it’s a great example of community innovation in action.
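
As a hedged illustration of that integration surface, the sketch below polls the Ambari REST API for current alerts so they could be fed into an existing monitoring system. The host, credentials, cluster name, and response field names are assumptions; verify them against your Ambari version before relying on them.

    # Hedged sketch: polling Ambari's alerts endpoint so an external monitoring system
    # can consume them. Host, credentials, and cluster name are placeholders; verify
    # the field names against your Ambari version's REST API documentation.
    import requests

    AMBARI = "http://ambari.example.com:8080/api/v1"   # default Ambari server port
    AUTH = ("admin", "admin")                          # placeholder credentials
    CLUSTER = "mycluster"

    resp = requests.get(
        AMBARI + "/clusters/" + CLUSTER + "/alerts",
        params={"fields": "Alert/label,Alert/state,Alert/text"},
        auth=AUTH,
    )
    resp.raise_for_status()

    for item in resp.json().get("items", []):
        alert = item.get("Alert", {})
        if alert.get("state") in ("CRITICAL", "WARNING"):
            print(alert.get("state"), "-", alert.get("label"), ":", alert.get("text"))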

Download Ambari and Learn More

The Ambari community is already hard at work improving Apache Ambari's capabilities to provision, manage and monitor Hadoop clusters. Watch this blog for more news on enhancements to core features and extensibility. In the meantime, check out the community release of Ambari 2.0 with the following resources:

The post Announcing Apache Ambari 2.0 appeared first on Hortonworks.
