Quantcast
Channel: Security – Hortonworks
Viewing all 143 articles
Browse latest View live

Continued Innovation in Hadoop Security

$
0
0

Screen Shot 2014-08-11 at 1.07.49 PMWe are in the midst of a data revolution. Hadoop, powered by Apache Hadoop YARN, enables enterprises to store, process, and innovate around data at a scale never seen before making security a critical consideration. Enterprises are looking for a comprehensive approach to security for their data to realize the full potential of the Hadoop platform unleashed by YARN, the architectural center and the data operating system of Hadoop 2.

Hortonworks and the open community continue to work tirelessly to enhance security in Hadoop. Last week, we shared several blogs that highlight the tremendous innovation underway in the areas of authentication, authorization, auditing, and data protection.

We started last week with a blog introducing Apache Argus – incorporating the key IP from XA Secure – and called on the community to collaborate on an even bigger scale. Argus’ vision is to bring comprehensive security across all components in the Hadoop ecosystem making it easier for the enterprise to manage security policies across authorization, audit and other forms of data security. The Argus charter is a bold vision and in the coming months the team will share our approach to solve some of the biggest challenges around Hadoop security.

We highlighted Apache Knox, which helps Hadoop extend the reach of its services to more users securely by providing a gateway for REST/HTTP based services. Vinay Shukla blogged about a common use case of enabling secure ODBC and JDBC access to Hive, through Apache Knox.

We believe Hadoop can mature only in pure open source model with true collaboration across customers and partners—and security is no exception. We are delighted to showcase our partnership with industry leaders in data protection with the guest blog series last week:

  • Protegrity described how to expand Hadoop security with data-centric security across multiple enterprise systems with Protegrity Vaultless Tokenization for maximum usage of secured data with no data residency issues, and Extended HDFS Encryption for transparent AES file encryption.
  • Voltage Security blogged about data-centric security for the protection of sensitive data in Hadoop, from storage level encryption to standards-recognized Voltage Format Preserving Encryption™ (FPE) and Secure Stateless Tokenization™ to maintain referential integrity of de-identified data, enable regulatory compliance, and neutralize data breaches.
  • Dataguise discussed the use of data discovery and protection with DGSecure which scans data in structured, semi-structured or unstructured formats to provide security at the field level via masking or encryption, along with dashboard reporting.

For a key feature—native encryption of data at rest—the Hadoop community has been working to address this gap. To that end, the community is in the process of voting on this feature. When Transparent Data Encryption in HDFS is completed, data in HDFS can be encrypted natively.

The Hadoop community has worked to provide a Key Management Server (KMS) out of box. With the Key Provider API, Hadoop components can easily integrate with the Key Management software of their choice. This API allows enterprises to plug in their existing corporate standard Key Management software to leverage common Key Management across various components in the stack such as Databases, Email, and Hadoop.

What’s Next?

With the investments and commitments across the Hadoop ecosystem, we look forward to the next phase of the data revolution where the customer can leverage the full power of the next generation platform, with the confidence that their data are protected in all phases: ingest, processing, access, and egress.

Stay tuned for next set of blog series on Argus, Knox, Encryption and more..

The post Continued Innovation in Hadoop Security appeared first on Hortonworks.


Hadoop Security in the Enterprise

$
0
0

Zettaset is a Hortonworks partner. In this guest blog, John Armstrong, VP of Marketing at Zettaset Inc., shares Zettaset’s security features and explains why data encryption is vital for data in the Hadoop infrastructure.

Comprehensive Security Across the Hadoop Infrastructure

As big data technologies like Hadoop become widely deployed in production environments, the expectation is that they will meet the enterprise requirements in data governance, operations and security while integrating with existing data center infrastructure.  The technology is not contained within a relatively small, controlled IT environment, but is interfacing with broadly available analytics applications in the business unit.  Data within the Hadoop cluster environment is fluid, and big data is replicated in many places and moves as needed. Security must be consistently applied and enforced across a distributed computing environment.

Enterprises recognize that big data requires a comprehensive and coordinated approach to security, and the open source community with Hortonworks in the lead has embraced this challenge with a number of Apache projects including Apache Knox, Kerberos and recently announce Apache Argus incubator project.

Adrian Lane of Securosis recently penned an excellent article on the differences between traditional databases and big data architectures, and subsequent security challenges, so I needn’t go into detail in this blog. Suffice to say, for organizations handling sensitive data the risks associated with data security and possible non-compliance are too high to ignore.  In the case of a breach, the enterprise will face brand damage control as well as potential impacts on customer confidence and business.  Ideally, the enterprise should be focusing on developing a comprehensive security strategy that includes encryption, fine-grained access control, and security policy enforcement that works well in an environment where data is shared across multiple platforms, Hadoop being one such platform.

Screen Shot 2014-08-11 at 1.07.49 PM

Hortonworks Data Platform (HDP) 2.1  provides a centralized security framework with Apache Argus, Apache Knox and Kerberos to provide authentication, authorization, auditing and administration. However this does not eliminate the need for additional protection against unauthorized access and to support compliance with regulations and mandates such as PCI/DSS, HIPAA, and HITECH.

Encrypting Sensitive Data

Encryption is a highly reliable security method which can be used to protect data-at-rest within the cluster.  Encryption can prevent data exposure even if a server is physically removed from a data center, which is critical for organizations in highly regulated industries such as financial services and healthcare that handle sensitive data.  HIPAA deals with the privacy, security, and transmission of medical information.  The HIPAA Security Rule deals specifically with Electronic Protected Health Information (EPHI), and names addressable and required implementation specifications which include the encryption of a patient’s protected health information.  PCI/DSS imposes similar rules on an individual’s personal and financial information.

A major provider of healthcare services in the U.S. captures terabytes of EPHI on a monthly basis, which is secured and encrypted by Zettaset.  One particular type of information that the customer gathers that fits into the category of unstructured data are physician’s notes, which are typically jotted down on wireless tablets while a doctor is meeting with a patient.  Of course, like all EPHI, this raw data and subsequent information must be strictly secured by law and meet HIPAA privacy requirements. Maintained in a Hadoop database, the healthcare service provider analyzes these notes from thousands of physicians, and derives valuable actionable information.  For example, by correlating the age, sex, and diagnosis of large samples of patients with prescribed care and medication, analysts are able to determine which care regimens deliver the best outcomes for patients with specific ailments.  This information can be used to guide future medical diagnoses and treatments, as well as help the healthcare organization evaluate the efficacy of their medical staff.

Zettaset has developed a KMIP-standard encryption solution, which is compatible with HDP 2.1 (as well as earlier 1.x versions) and other Hadoop and NoSQL databases. Zettaset’s encryption solution is optimized for Hadoop’s distributed architecture, but acknowledges that encryption solutions for centralized RDBMs exist in many organizations as well.  Zettaset takes a standards-based approach that simplifies integration of big data encryption into existing data environments that have a mix of Hadoop and RDBMs, and ensures compatibility with PKCS-compliant hardware security modules (HSMs) that an organization may already have invested in.

Summary

There is a tremendous push in the open community, in partnership with leaders in data security like Zettaset, to provide Hadoop with robust security.  The approach addresses the unique architecture of distributed computing and is designed to meet the security requirements of the enterprise data center and the Hadoop cluster environment.  A comprehensive solution will include the best efforts of the open source community in tandem with proprietary data security solutions that can function across multiple platforms including Hadoop.

Tim O’Reilly, a strong proponent of open-source once stated:

Any successful industry provides a balance of open and proprietary. At the heart of the open PC hardware platform is a proprietary CPU, and a variety of proprietary devices. At the heart of the open Internet are proprietary Cisco routers, and for every open source program, there are proprietary ones as well.

The ideal big data security solution will ultimately consist of best-in-breed solutions driven by customer requirements.

The post Hadoop Security in the Enterprise appeared first on Hortonworks.

Discover HDP 2.1: Webinar Series Wrap Up

$
0
0

This summer, Hortonworks presented the Discover HDP 2.1 Webinar series. Our developers and product managers highlighted the latest innovations in Apache Hadoop and related Apache projects.

We’re grateful to the more than 1,000 attendees whose questions added rich interaction to the pre-planned presentations and demos.

For those of you that missed one of the 30-minute webinars (or those that want to review one they joined live), you can find recordings of all sessions on our What’s New in 2.1 page.

The full list includes:

And we’ve also added a one-hour presentation not included in the original Discover HDP 2.1 series: HDP Advanced Security. Hortonworks security experts Balaji Ganesan and Bosco Durai present HDP Advanced Security’s common interface for central administration of security policy and coordinated enforcement for the entire Hadoop stack across authentication, authorization, audit, and data protection.

HDP Advanced Security will be part of the recently created incubator project, Apache Argus.

The post Discover HDP 2.1: Webinar Series Wrap Up appeared first on Hortonworks.

Securing Hadoop: What Are Your Options?

$
0
0

The open source community, including Hortonworks, has invested heavily in building enterprise grade security for Apache Hadoop. These efforts include Apache Knox for perimeter security, Kerberos for strong authentication and the recently announced Apache Argus incubator that brings a central administration framework for authorization and auditing.

Join Hortonworks and Voltage Security in a webinar on August 27  to learn more.

In multi-platform environments with data coming from many different sources, personally identifiable information, credit card numbers, and intellectual property can land in the Hadoop cluster. The question becomes: how to keep all this sensitive data secure, as it moves into Hadoop, as it is stored, and as it moves beyond Hadoop?

Join Hortonworks and Voltage Security to learn about comprehensive security in Apache Hadoop, and more:

  • Apache Argus: a central policy administration framework across security requirements for authentication, authorization, auditing, and data protection;
  • Data-centric protection technologies that easily integrate with Hive, Sqoop, MapReduce and other interfaces;
  • How to avoid the risks of cyber-attack and leaking of sensitive customer data; and
  • Ways to maintain the value of data for analytics, even in its protected form.

If you are enabling the Modern Data Architecture with Hadoop, protection of sensitive data is an area of security where enterprises need a cross platform solution.  We invite you to learn more about our joint approach at our webinar on Wednesday, August 27. Register here.

Discover and Learn More

The post Securing Hadoop: What Are Your Options? appeared first on Hortonworks.

Deploying HTTPS in HDFS

$
0
0

Haohui Mai is a member of technical staff at Hortonworks in the HDFS group and a core Hadoop committer. In this blog, he explains how to setup HTTPS for HDFS in a Hadoop cluster.

1. Introduction

The HTTP protocol is one of the most widely used protocols in the Internet. Today, Hadoop clusters exchange internal data such as file system images, the quorum journals, and the user data through the HTTP protocol. Since HTTP transfers the data in clear-text, an attacker might be able to tap into the network and put your valuable data at risk.

To protect your data, we have implemented full HTTPS support for HDFS in HDP 2.1. (Thanks to Hortonworks’ Haohui Mai, Suresh Srinivas, and Jing Zhao). At a very high level, HTTPS is the HTTP protocol transported over Secure Socket Layer (SSL/TLS), which prevents wiretapping and man-in-the-middle attack. The rest of this blog post describes how HTTPS works, and how to set it up for HDFS in a Hadoop cluster.

2. Background

Figure 1 describes the basic HTTP / HTTPS communication workflow. In the figure, Alice and Bob want to exchange information. From Alice’s perspective, there are two types of security threats in the communication. First, a malicious third-party, Eve, can tap into the network and sniff all the data passing between Alice and Bob. Second, a malicious party like Charlie can pretend to be Bob and intercept all communication between Alice and Bob.

The HTTPS protocol addresses the above security threats with two techniques. First, HTTPS encrypts all the communication between Alice and Bob to prevent Eve wiretapping the communication. Second, HTTPS requires all participants to prove their identities by presenting their certificates. A certificate works likes a government-issued passport, which includes the full name of the participant, the organization, etc. A trusted certificate authority (CA) signs the certificate to ensure that it is authentic. In our example, Alice verifies the certificate from the remote participant to ensure that she is indeed talking to Bob.

Behind the scene, public-key cryptography is the key technology that enables HTTPS. In public-key cryptography, each party has a paired private key and a public key. Public-key cryptography has an interesting property: one can use either the public or the private key for encryption, but decrypting the data requires the other key in the key pair. It is easy to see that public-key cryptography can implement encryption. Moreover, public keys can be used as a proof of identity because they are notarized by the CA. An in-depth technical introduction to public cryptography can be found here.

The next section describes how to deploy HTTPS in your Apache Hadoop cluster.

hdfs_1

Figure 1: Basic workflow for HTTPS communication. Alice (left) and Bob (right) communicate through an insecure channel. The HTTPS protocol specifies how to secure the communication through cryptography to verify the identities of the participants (i.e., Alice is indeed talking to Bob) and to prevent wiretapping.

3. Deploying HTTPS in HDFS

3.1 Generating the key and the certificate for each machine

The first step of deploying HTTPS is to generate the key and the certificate for each machine in the cluster. You can use Java’s keytool utility to accomplish this task:
$ keytool -keystore -alias localhost -validity -genkey

You need to specify two parameters in the above command:

  • keystore: the keystore file that stores the certificate. The keystore file contains the private key of the certificate; therefore, it needs to be kept safely.
  • validity: the valid time of the certificate in days.

The keytool will ask for more details of the certificate:
Screen Shot 2014-08-29 at 1.34.04 PM

Ensure that common name (CN) matches exactly with the fully qualified domain name (FQDN) of the server. The client compares the CN with the DNS domain name to ensure that it is indeed connecting to the desired server, not the malicious one.

3.2 Creating your own CA

After the first step, each machine in the cluster has a public-private key pair, and a certificate to identify the machine. The certificate, however, is unsigned, which means that an attacker can create such a certificate to pretend to be any machine.

Therefore, it is important to prevent forged certificates by signing them for each machine in the cluster. A certificate authority (CA) is responsible for signing certificates. CA works likes a government that issues passports—the government stamps (signs) each passport so that the passport becomes difficult to forge. Other governments verify the stamps to ensure the passport is authentic. Similarly, the CA signs the certificates, and the cryptography guarantees that a signed certificate is computationally difficult to forge. Thus, as long as the CA is a genuine and trusted authority, the clients have high assurance that they are connecting to the authentic machines.

In this blog we use openssl to generate a new CA certificate:

Screen Shot 2014-08-29 at 1.57.21 PM

The generated CA is simply a public-private key pair and certificate, and it is intended to sign other certificates.

The next step is to add the generated CA to the clients’ truststore so that the clients can trust this CA:

$ keytool -keystore {truststore} -alias CARoot -import -file {ca-cert}

In contrast to the keystore in step 3.1 that stores each machine’s own identity, the truststore of a client stores all the certificates that the client should trust. Importing a certificate into one’s truststore also means that trusting all certificates that are signed by that certificate. As the analogy above, trusting the government (CA) also means that trusting all passports (certificates) that it has issued. This attribute is called the chains of trust, and it is particularly useful when deploying HTTPS on a large Hadoop cluster. You can sign all certificates in the cluster with a single CA, and have all machines share the same truststore that trusts the CA. That way all machines can authenticate all other machines.

3.3 Signing the certificate

The next step is to sign all certificates generated by step 3.1 with the CA generated in step 3.2. First, you need to export the certificate from the keystore:

$ keytool -keystore -alias localhost -certreq -file

Then sign it with the CA:

$ openssl x509 -req -CA -CAkey -in -out -days -CAcreateserial -passin pass:

Finally, you need to import both the certificate of the CA and the signed certificate into the keystore:

$ keytool -keystore -alias CARoot -import -file
$ keytool -keystore -alias localhost -import -file

The definitions of the parameters are the following:

  • keystore: the location of the keystore
  • ca-cert: the certificate of the CA
  • ca-key: the private key of the CA
  • ca-password: the passphrase of the CA
  • cert-file: the exported, unsigned certificate of the server
  • cert-signed: the signed certificate of the server

3.5 Configuring HDFS

The final step is to configure HDFS to use HTTPS. First, you need to specify dfs.http.policy in hdfs-site.xml to start the HTTPS server in the HDFS daemons.

<property>
<name>dfs.http.policy</name>
<value>HTTP_AND_HTTPS</value>
</property>

Three values are possible:

  • HTTP_ONLY: Only HTTP server is started
  • HTTPS_ONLY: Only HTTPS server is started
  • HTTP_AND_HTTPS: Both HTTP and HTTPS server are started

One thing worth noting is that WebHDFS is no longer available when dfs.http.policy is set to HTTPS_ONLY. You will need to use WebHDFS over HTTPS (swebhdfs, contributed by Hortonworks’ Jing Zhao and me) in this configuration, which protects your data that is transferred through webhdfs.

Second, you need to change the ssl-server.xml and ssl-client.xml to tell HDFS about the keystore and the truststore.

ssl-server.xml

<property>
<name>ssl.server.keystore.type</name>
<value>jks</value>
</property>
<property>
<name>ssl.server.keystore.keypassword</name>
<value><password of keystore></value>
</property>
<property>
<name>ssl.server.keystore.location</name>
<value><location of keystore.jks></value>
</property>
<property>
<name>ssl.server.truststore.type</name>
<value>jks</value>
</property>
<property>
<name>ssl.server.truststore.location</name>
<value><location of truststore.jks></value>
</property>
<property>
<name>ssl.server.truststore.password</name>
<value><password of truststore></value>
</property>

ssl-client.xml

<property>
<name>ssl.client.truststore.password</name>
<value><password of truststore></value>
</property>

<property>
<name>ssl.client.truststore.type</name>
<value>jks</value>
</property>
<property>
<name>ssl.client.truststore.location</name>
<value><location of truststore.jks></value>
</property>

The names of the configuration properties are self-explanatory. You can read more information about the configuration here. After restarting the HDFS daemons (NameNode, DataNode and JournalNode), you should have successfully deployed HTTPS in your HDFS cluster.

Closing Thoughts

Deploying HTTPS can improve the security of your Hadoop cluster. This blog describes how HTTPS works and how to set it up for HDFS in your Hadoop cluster. It is our mission to improve the security of Hadoop and to protect your valuable data.

Discover and Learn More

The post Deploying HTTPS in HDFS appeared first on Hortonworks.

Partner Webinar: Sensitive Data Discovery and Security for Hadoop

$
0
0

As more companies turn to Hadoop as a crucial data platform, we are seeing security considerations continuing to play a much bigger role. Dataguise DgSecure works in concert with the Hortonworks Data Platform (HDP) to bring enterprise grade security and insight to Hadoop deployments. Data governance professionals can employ critical security features such as centrally managed authorization and audit, as well as sensitive data discovery, data centric protection and reporting to their Hadoop deployments.

Join us for a webinar featuring Jeremy Stieglitz, VP of Product Management from Dataguise, a pioneer in sensitive data discovery & security for Hadoop and a Hortonworks Certified Technology Partner, and Vinod Nair, Partner Product Management, Hortonworks, as they discuss real world scenarios and showcase various topics including:

  • Centrally managed Hadoop security: Apache Knox for perimeter security, Kerberos for strong authentication and the recently announced Apache Argus incubator that brings a central administration framework for authorization and auditing.
  • The power of sensitive data discovery in Hadoop: Identifying sensitive data elements in the Hadoop cluster enables a laser focused approach to data protection.
  • Apply a tailored security to policy setting & data centric protection in Hadoop: Clicks, not code. Simple, fast creation of protection schemas, including format preserving masking and encryption options.
  • Leveraging rich data & security analytics for securing the Hadoop environment: Real-time reporting and analytics showing you volume, location and protection of sensitive data elements.

Learn More

The post Partner Webinar: Sensitive Data Discovery and Security for Hadoop appeared first on Hortonworks.

Protegrity Avatar: Enterprise class data protection with HDP, exclusively offered to Hortonworks customers

$
0
0

Apache Hadoop has taken a mission critical role in the Modern Data Architecture (MDA) with the advent of Apache Hadoop YARN. YARN has enabled enterprises to store and process data across many execution engines at a scale that has not been possible earlier. This in turn has made security a crucial component of enterprise Hadoop. At Hortonworks we have broken the problem of enterprise security into four key areas of focus: authentication, authorization, auditing and data protection.

Hortonworks has led the open community efforts to bring enterprise grade security in Hadoop with Apache Knox, Kerberos for authentication and Apache Argus to build a centralized framework for support of authorization and auditing. We collaborate with partners like Protegrity to bring enterprise grade data protection in Hadoop to meet today’s complex compliance and regulatory landscape while providing cross platform support for encryption and key management.

It is critical that customers deploying Hadoop consider a data protection solution capable of solving the problem across multiple data platforms – traditional RDBMS, EDW and MPPs and Hadoop. Sometimes the need to decide on enterprise data protection adds unnecessary friction to the process of making the Hadoop decision. The primary points of friction encountered are evaluation of the data protection technology and difficulty in understanding total cost of ownership for Hadoop and data protection.

We are pleased that Protegrity has enabled a frictionless process for deploying data protection with the Hortonworks Data Platform (HDP). Today, Protegrity announced the availability of Protegrity Avatar, which supports advanced encryption for data at rest and in use with fine-grained role-based access controls for Hive, Pig, HBase and MapReduce. It bundles key technologies including Protegrity Vaultless Tokenization and Enterprise Security Administrator. Protegrity Avatar is optimized for and seamlessly integrated with HDP and will be made available for direct download to Hortonworks customers.

The post Protegrity Avatar: Enterprise class data protection with HDP, exclusively offered to Hortonworks customers appeared first on Hortonworks.

HDP 2.2 – A Major Step Forward for Enterprise Hadoop

$
0
0

Hortonworks Data Platform Version 2.2 represents yet another major step forward for Hadoop as the foundation of a Modern Data Architecture. This release incorporates the last six months of innovation and includes more than a hundred new features and closes thousands of issues across Apache Hadoop and its related projects.

HDP2.2.Components

Our approach at Hortonworks is to enable a Modern Data Architecture with YARN as the architectural center, supported by key capabilities required of an enterprise data platform — spanning Governance, Security and Operations. To this end, we work within the governance model of the Apache Software Foundation contributing to and progressing the individual components from the Hadoop ecosystem and ultimately integrating them into the Hortonworks Data Platform (HDP).

HortonworksInvestment

Our investment across all these technologies follows the same pattern.

  • VERTICAL: We integrate the projects within our Hadoop distribution with YARN and HDFS in order to enable HDP to span workloads from batch, interactive, and real time and across both open source and other data access technologies. Some work we deliver in this release to deeply integrate Apache Storm and Apache Spark within Hadoop are representative of this approach.
  • HORIZONTAL: We also ensure the key enterprise requirements of governance, security, and operations can be applied consistently and reliably across all the components within the platform. This allows HDP to meet the same requirements of any other technology in the data center. In HDP 2.2, our work within the Apache Ambari community helped extend integrated operations and we contributed Apache Ranger (Argus) to drive consistent security across Hadoop.
  • AT DEPTH: We deeply integrate HDP with the existing technologies within the data center to augment and enhance existing technologies and capabilities so you can reuse existing skills and resources.

A Comprehensive Data Platform

With YARN as its architectural center, Hadoop continues to attract new engines to run within the data platform, as organizations want to efficiently store their data in a single repository and interact with it simultaneously in different ways. They want SQL, streaming, machine learning, along with traditional batch and more… all in the same cluster. To this end, HDP 2.2 packages many new features. Every component is updated and we have added some key technologies and capabilities to HDP 2.2

 

HDP 2.2 Release Highlights

NEW: Enterprise SQL at Scale in Hadoop

While YARN has allowed new engines to emerge for Hadoop, the most popular integration point with Hadoop continues to be SQL and Apache Hive is still the defacto standard. While many SQL engines for Hadoop have emerged, their differentiation is being rendered obsolete as the open source community surrounds and advances this key engine at an accelerated rate. This release delivers phase 1 of the Stinger.next initiative, a broad, open community based effort to improve speed, scale and SQL semantics.

  • Updated SQL Semantics for Hive Transactions for Update and Delete
    ACID transactions provide atomicity, consistency, isolation, and durability. This helps with streaming and baseline update scenarios for Hive such as modifying dimension tables or other fact tables.
  • Improved Performance of Hive with a Cost Based Optimizer
    The cost based optimizer for Hive, uses statistics to generate several execution plans and then chooses the most efficient path as it relates system resources required to complete the operation. This presents a major performance increase for Hive.

NEW: Data Science within Hadoop with Spark on YARN

Apache Spark has emerged as an elegant, attractive development API allowing developers to rapidly iterate over data via machine learning and other data science techniques. While we have supported Spark as a tech preview for the past few months, in this release we plan to deliver an integrated Spark on YARN with improved integration to Hive 0.13 support and support for ORCFile by year-end. These improvements allow Spark to easily share and deliver data within and around Spark.

NEW: Kafka for processing the Internet of Things

Apache Kafka has quickly become the standard for high-scale, fault-tolerant, publish-subscribe messaging system for Hadoop. It is often used with Storm and Spark so that you can stream events in to Hadoop in real time and its application within the “internet of things” uses cases is tremendous.

New: Apache Ranger (Argus) for comprehensive cluster security policy

With increased adoption of Hadoop, a heightened requirement for a centralized approach to security policy definition and coordinated enforcement has surfaced. As part of HDP 2.2, Apache Ranger (formerly known as Argus) delivers a comprehensive approach to central security policy administration addressing authorization and auditing. Some of the work we have delivered extends Ranger to integrate with Storm and Knox while deepening existing policy enforcement capabilities with Hive and HBase.

New: Extensive improvements to manage & monitor Hadoop

Management and monitoring a cluster continues to be high priority for organizations adopting Hadoop. Our completely open approach via Apache Ambari is unique and we are excited to have Pivotal and HP jump on board to support Ambari with some of the other leaders in the data center like Microsoft and Teradata. In HDP 2.2, over a dozen new features to aid enterprises to manage Hadoop have been added, but some of the biggest include:

  • Extend Ambari with Custom Views
    Ambari Views Framework offers a systematic way to plug-in UI capabilities to surface custom visualization, management and monitoring features in the Ambari Web console. A “view” extends Ambari to allow 3rd parties to plug in new resource types along with the APIs, providers and UI to support them. In other words, a view is an application that is deployed into the Ambari container.
  • Ambari Blueprints deliver a template approach to cluster deployment
    Ambari Blueprints are a declarative definition of a cluster. With a Blueprint, you specify a Stack, the Component layout and the Configurations to materialize a Hadoop cluster instance (via a REST API) without having to use the Ambari Cluster Install Wizard. You can define any stack to be deployed.

NEW: Ensure uptime with Rolling Upgrades

In HDP 2.2 the rolling upgrade feature takes advantage of versioned packages, investments at the core of many of the projects and the underlying HDFS High Availability configuration to enable you to upgrade your cluster software and restart upgraded services, without taking the entire cluster down.

NEW: Automated cloud backup for Microsoft Azure and Amazon S3

Data architects require Hadoop to act like other systems in the data center and business continuity through replication across on-premises and cloud-based storages targets is a critical requirement. In HDP 2.2 we extend the capabilities of Apache Falcon to establish an automated policy for cloud backup to Microsoft Azure or Amazon S3. This is the first step in a broader vision to enable extensive heterogeneous deployment models for Hadoop.

Value in a Completely Open Approach

Hortonworks is 100% committed to open source and the value provided by an active and open community of developers. HDP is the ONLY 100% open source Hadoop distribution and our code goes back into an open ASF governed project with a live and broad community.

Hortonworks leadership is not just in numbers of committers but it is depth and diversity of involvement across the numerous open source projects that comprise our distribution. We are architects and builders and many of our developers are involved across multiple projects either directly as a committer or in partnering with developers across cube walls and across the Apache community. Our investment in Enterprise Hadoop starts with YARN, which allows us to integrate applications vertically within the stack, tying them to the data operating system, but this also allows us to apply consistent capabilities for key enterprise requirements of governance, security and operations.

Availability

A tech preview of HDP 2.2 is available today at hortonwoks.com/hdp

Complete List of HDP 2.2 New Features

Apache Hadoop YARN

  • Slide existing services onto YARN through ‘Slider’
  • GA release of HBase, Accumulo, and Storm on YARN
  • Support long running services: handling of logs, containers not killed when AM dies, secure token renewal, YARN Labels for tagging nodes for specific workloads
  • Support for CPU Scheduling and CPU Resource Isolation through CGroups

Apache Hadoop HDFS

  • Heterogeneous storage: Support for archival tier
  • Rolling Upgrade (This is an item that applies to the entire HDP Stack. YARN, Hive, HBase, everything. We now support comprehensive Rolling Upgrade across the HDP Stack).
  • Multi-NIC Support
  • Heterogeneous storage: Support memory as a storage tier (Tech Preview)
  • HDFS Transparent Data Encryption (Tech Preview)

Apache Hive, Apache Pig, and Apache Tez

  • Hive Cost Based Optimizer: Function Pushdown & Join re-ordering support for other join types: star & bushy.
  • Hive SQL Enhancements including:
    • ACID Support: Insert, Update, Delete
    • Temporary Tables
  • Metadata-only queries return instantly
  • Pig on Tez
  • Including DataFu for use with Pig
  • Vectorized shuffle
  • Tez Debug Tooling & UI

Apache HBase, Apache Phoenix, & Apache Accumulo

  • HBase & Accumulo on YARN via Slider
  • HBase HA
    • Replicas update in real-time
    • Fully supports region split/merge
    • Scan API now supports standby RegionServers
  • HBase Block cache compression
  • HBase optimizations for low latency
  • Phoenix Robust Secondary Indexes
  • Performance enhancements for bulk import into Phoenix
  • Hive over HBase Snapshots
  • Hive Connector to Accumulo
  • HBase & Accumulo wire-level encryption
  • Accumulo multi-datacenter replication

Apache Storm

  • Storm-on-YARN via Slider
  • Ingest & notification for JMS (IBM MQ not supported)
  • Kafka bolt for Storm – supports sophisticated chaining of topologies through Kafka
  • Kerberos support
  • Hive update support – Streaming Ingest
  • Connector improvements for HBase and HDFS
  • Deliver Kafka as a companion component
  • Kafka install, start/stop via Ambari
  • Security Authorization Integration with Ranger

Apache Spark

  • Refreshed Tech Preview to Spark 1.1.0 (available now)
  • ORC File support & Hive 0.13 integration
  • Planned for GA of Spark 1.2.0
  • Operations integration via YARN ATS and Ambari
  • Security: Authentication

Apache Solr

  • Added Banana, a rich and flexible UI for visualizing time series data indexed in Solr

Cascading

  • Cascading 3.0 on Tez distributed with HDP — coming soon

Hue

  • Support for HiveServer 2
  • Support for Resource Manager HA

Apache Falcon

  • Authentication Integration
  • Lineage – now GA. (it’s been a tech preview feature…)
  • Improve UI for pipeline management & editing: list, detail, and create new (from existing elements)
  • Replicate to Cloud – Azure & S3

Apache Sqoop, Apache Flume & Apache Oozie

  • Sqoop import support for Hive types via HCatalog
  • Secure Windows cluster support: Sqoop, Flume, Oozie
  • Flume streaming support: sink to HCat on secure cluster
  • Oozie HA now supports secure clusters
  • Oozie Rolling Upgrade
  • Operational improvements for Oozie to better support Falcon
  • Capture workflow job logs in HDFS
  • Don’t start new workflows for re-run
  • Allow job property updates on running jobs

Apache Knox & Apache Ranger (Argus) & HDP Security

  • Apache Ranger – Support authorization and auditing for Storm and Knox
  • Introducing REST APIs for managing policies in Apache Ranger
  • Apache Ranger –  Support native grant/revoke permissions in Hive and HBase
  • Apache Ranger –  Support Oracle DB and  storing of audit logs in HDFS
  • Apache Ranger to run on Windows environment
  • Apache Knox to protect YARN RM
  • Apache Knox support for HDFS HA
  • Apache Ambari install, start/stop of  Knox

Apache Slider

  • Allow on-demand create and run different versions of heterogeneous applications
  • Allow users to configure different application instances differently
  • Manage operational lifecycle of application instances
  • Expand / shrink application instances
  • Provide application registry for publish and discovery

Apache Ambari

  • Support for HDP 2.2 Stack, including support for Kafka, Knox and Slider
  • Enhancements to Ambari Web configuration management including: versioning, history and revert, setting final properties and downloading client configurations
  • Launch and monitor HDFS rebalance
  • Perform Capacity Scheduler queue refresh
  • Configure High Availability for ResourceManager
  • Ambari Administration framework for managing user and group access to Ambari
  • Ambari Views development framework for customizing the Ambari Web user experience
  • Ambari Stacks for extending Ambari to bring custom Services under Ambari management
  • Ambari Blueprints for automating cluster deployments
  • Performance improvements and enterprise usability guardrails

The post HDP 2.2 – A Major Step Forward for Enterprise Hadoop appeared first on Hortonworks.


HDP Operations: Migrating to the Hortonworks Data Platform

$
0
0

Introduction

Hortonworks University announces a new operationally focused course for Apache Hadoop administrators. This two-day training course is designed for Hadoop administrators who are familiar with administering other Hadoop distributions and are migrating to the Hortonworks Data Platform (HDP). Through a combination of lecture and hands-on exercises you will learn how to install, configure, maintain and scale an HDP cluster

Target Audience

This course is designed for experienced Hadoop administrators and operators who will be responsible for installing, configuring and supporting the Hortonworks Data Platform.

Duration

In this two-day course, we will cover:

  • HDP installation, configuration, and planning
  • Apache components Hive and Tez
  • Backup and recovery
  • Securing HDP
  • High availability
  • Migrating from CDH to HDP

All done through lecture and hands-on lab exercises.

Prerequisites

Attendees should be familiar with the fundamentals of Hadoop, including HDFS and YARN, and have experience administering a Hadoop cluster, including configuring job schedulers, rack awareness, security, and adding/removing nodes. Students should also know how to install and configure the various components in the Hadoop ecosystem like Sqoop, Flume, Hive, Pig and Oozie.

Availability

We anticipate that this course will be ready for delivery in early November.

Learn More

  • For availability for individual seats in our open enrollment classes please visit us at www.hortonworks.com/training.
  • Onsite training is also available to be hosted at your offices. For more information around pricing and private training opportunities contact us at hwuniversity@hortonworks.com

The post HDP Operations: Migrating to the Hortonworks Data Platform appeared first on Hortonworks.

End to End Wire Encryption with Apache Knox

$
0
0

Enterprise Apache Hadoop provides the fundamental data services required to deploy into existing architectures. These include security, governance and operations services, in addition to Hadoop’s original core capabilities for data management and data access. This post focuses on recent work completed in the open source community to enhance the Hadoop security component, with encryption and SSL certificates.

Last year I wrote a blog summarizing wire encryption options in Hortonworks Data Platform (HDP). Since that blog, encryption capabilities in HDP and Hadoop have expanded significantly.

One of these new layers of security for a Hadoop cluster is Apache Knox. With Knox, a Hadoop cluster can now be made securely accessible to a large number of users. Today, Knox allows secure connections to Apache HBase, Apache Hive, Apache Oozie, WebHDFS and WebHCat. In the near future, it will also include support for Apache Hadoop YARN, Apache Ambari, Apache Falcon, and all the REST APIs offered by Hadoop components.

Without Knox, these clients would connect directly to a Hadoop cluster, and the large number of direct client connections poses security disadvantages. The main one is access.

In a typical organization, only a few DBAs connect directly to a database, and all the end-users are routed through a business application that then connects to the database. That intermediate application provides an additional layer of security checks.

Hadoop’s approach with Knox is no different. Many Hadoop deployments use Knox to allow more users to make use of Hadoop’s data and queries without compromising on security. Only a handful of admins can connect directly to their Hadoop clusters, while end-users are routed through Knox.

Apache Knox plays the role of reverse proxy between end-users and Hadoop, providing two connection hops between the client and Hadoop cluster. The first connection is between the client and Knox, and Knox offers out of the box SSL support for this connection. The second connection is between Knox and a given Hadoop component, which requires some configuration.

This blog walks through configuration and steps required to use SSL for the second connection, between Knox and a Hadoop component.

SSL Certificates

SSL connections require a certificate, either self-signed or signed by a Certificate Authority (CA). The process of obtaining self-signed certificates differs slightly from how one obtains CA-signed certificates. When the CA is well-known, there’s no need to import the signer’s certificate into the truststore. For this blog, I will use self-signed certificates; however, wire encryption can also be enabled with a CA-signed certificate. Recently, my colleague Haohui Moi blogged about HTTPS with HDFS and included instructions for a CA-signed certificate.

SSL Between Knox & HBase

The first step is to configure SSL on HBase’s REST server (Stargate). To configure SSL, we will need to create a keystore to hold the SSL certificate. This example uses a self-signed certificate, and a SSL certificate used by a Certificate Authority (CA) makes the configuration steps even easier.

As user HBase (su hbase) create the keystore.
export HOST_NAME=`hostname`
keytool -genkey -keyalg RSA -alias selfsigned -keystore hbase.jks -storepass password -validity 360 -keysize 2048 -dname "CN=$HOST_NAME, OU=Eng, O=Hortonworks, L=Palo Alto, ST=CA, C=US" -keypass password

Make sure the common name portion of the certificate matches the host where the certificate will be deployed. For example, the self-signed SSL certificate I created has the following CN sandbox.hortonworks.com, when the host running HBase is sandbox.hortonworks.com.

“Owner: CN=sandbox.hortonworks.com, OU=Eng, O=HWK, L=PA, ST=CA, C=US
Issuer: CN=sandbox.hortonworks.com, OU=Eng, O=HWK, L=PA, ST=CA, C=US”

We just created a self-signed certificate for use with HBase. Self-signed certificates are rejected during SSL handshake. To get around this, export the certificate and put it in the cacerts file of the JRE used by Knox. (This step is unnecessary when using a certificate issued by a well known CA.)

On the machine running HBase, export HBase’s SSL certificate into a file hbase.crt:
keytool -exportcert -file hbase.crt -keystore hbase.jks -alias selfsigned -storepass password

Copy the hive.crt file to the Node running Knox and run:

keytool -import -file hbase.crt -keystore /usr/jdk64/jdk1.7.0_45/jre/lib/security/cacerts -storepass changeit -alias selfsigned

Make sure the path to cacerts file points to cacerts of JDK used to run Knox gateway.
The default cacerts password is “changeit.”

Configure HBase REST Server for SSL
Using Ambari or another tool used for editing Hadoop configuration properties:

<property>
   <name>hbase.rest.ssl.enabled</name>
   <value>true</value>
 </property>
 <property>
   <name>hbase.rest.ssl.keystore.store</name>
   <value>/path/to/keystore/created/hbase.jks</value>
 </property>
 <property>
    <name>hbase.rest.ssl.keystore.password</name>
    <value>password</value>
 </property>
 <property>
   <name>hbase.rest.ssl.keystore.keypassword</name>
   <value>password</value>
 </property>

Save the configuration and re-start the HBase REST server using either Ambari or with command line as in:
sudo /usr/lib/hbase/bin/hbase-daemon.sh stop rest & sudo /usr/lib/hbase/bin/hbase-daemon.sh start rest -p 60080

Verify HBase REST server over SSL

Replace localhost with whatever is the hostname of your HBase rest server.

curl -H "Accept: application/json" -k https://localhost:60080/

It should output the tables in your HBase. e.g. name.
{“table”:[{"name":"ambarismoketest"}]}

Configure Knox to point to HBase over SSL and re-start Knox

Change the URL of the HBase service for your Knox topology in sandbox.xml to HTTPS e.g, ensure Host matches the host of HBase rest server.

<service> 
    	<role>WEBHBASE</role>
    	<url>https://sandbox.hortonworks.com:60080</url>
</service>

Verify end to end SSL to HBase REST server via Knox

curl -H "Accept: application/json" -iku guest:guest-password -X GET 'https://localhost:8443/gateway/sandbox/hbase/'

Should have an output similar to below:

HTTP/1.1 200 OK
Set-Cookie: JSESSIONID=166l8e9qhpi95ty5le8hni0vf;Path=/gateway/sandbox;Secure;HttpOnly
Expires: Thu, 01 Jan 1970 00:00:00 GMT
Cache-Control: no-cache
Content-Type: application/json
Content-Length: 38
Server: Jetty(8.1.14.v20131031)

{“table”:[{"name":"ambarismoketest"}]}

SSL Between Knox & HiveServer

Create Keystore for Hive

As a Hive user (su hive) create the keystore.

The first step is to configure SSL on HiveServer2. To configure SSL create a keystore to hold the SSL certificate. The example here uses self-signed certificate. Using SSL certificate issued by a Certificate Authority(CA) will make the configuration steps easier by eliminating the need to import the signer’s certificate into the truststore.

On the Node running HiveServer2 run commands:

keytool -genkey -keyalg RSA -alias hive -keystore hive.jks -storepass password -validity 360 -keysize 2048 -dname "CN=sandbox.hortonworks.com, OU=Eng, O=Hortonworks, L=Palo Alto, ST=CA, C=US" -keypass password

keytool -exportcert -file hive.crt -keystore hive.jks -alias hive -storepass password -keypass password

Copy the hive.crt file to the Node running Knox and run:

keytool -import -file hive.crt -keystore /usr/jdk64/jdk1.7.0_45/jre/lib/security/cacerts -storepass changeit -alias hive

Make sure the path to cacerts file points to cacerts of JDK used to run Knox gateway. The default cacerts password is “changeit.”

Configure HiveServer2 for SSL

Using Ambari or other tool used for editing Hadoop configuration, ensure that hive.jks is in a location readable by HiveServer such as /etc/hive/conf

<property>
	<name>hive.server2.use.SSL</name>
	<value>true</value>
</property>
<property>
  <name>hive.server2.keystore.path</name>
  <value>/etc/hive/conf/hive.jks</value>
  <description>path to keystore file</description>
</property>
<property>
  <name>hive.server2.keystore.password</name>
  <value>password</value>
  <description>keystore password</description>
</property>

Save the configuration and re-start HiveServer2

Validate HiveServer2 SSL Configuration
Use Beeline & connect directly to HiveServer2 over SSL

beeline> !connect jdbc:hive2://sandbox:10001/;ssl=true beeline> show tables;

Ensure this connection works.

Configure Knox connection to HiveServer2 over SSL

Change the URL of the Hive service for your Knox topology in sandbox.xml to HTTPS e.g, ensure host matches the host of Hive server.

<service>
    	<role>Hive</role>
    	<url>https://sandbox.hortonworks.com:10001/cliservice</url>
</service>

Validate End to End SSL from Beeline > Knox > HiveServer2

Use Beeline and connect via Knox to HiveServer2 over SSL
beeline> !connect jdbc:hive2://sandbox:8443/;ssl=true;sslTrustStore=/var/lib/knox/data/security/keystores/gateway.jks;trustStorePassword=knox hive.server2.transport.mode=http;hive.server2.thrift.http.path=gateway/sandbox/hive
beeline> show tables;>

SSL Between Knox & WebHDFS

Create Keystore for WebHDFS

execute this shell script or set up environment vairables:

export HOST_NAME=`hostname`
export SERVER_KEY_LOCATION=/etc/security/serverKeys
export CLIENT_KEY_LOCATION=/etc/security/clientKeys
export SERVER_KEYPASS_PASSWORD=password
export SERVER_STOREPASS_PASSWORD=password
export KEYSTORE_FILE=keystore.jks
export TRUSTSTORE_FILE=truststore.jks
export CERTIFICATE_NAME=certificate.cert


export SERVER_TRUSTSTORE_PASSWORD=password
export CLIENT_TRUSTSTORE_PASSWORD=password
export ALL_JKS=all.jks
export YARN_USER=yarn

execute these commands:

mkdir -p $SERVER_KEY_LOCATION ; mkdir -p $CLIENT_KEY_LOCATION
cd $SERVER_KEY_LOCATION;
keytool -genkey -alias $HOST_NAME -keyalg RSA -keysize 2048 -dname "CN=$HOST_NAME,OU=hw,O=hw,L=paloalto,ST=ca,C=us" -keypass $SERVER_KEYPASS_PASSWORD -keystore $KEYSTORE_FILE -storepass $SERVER_STOREPASS_PASSWORD

cd $SERVER_KEY_LOCATION ; keytool -export -alias $HOST_NAME -keystore $KEYSTORE_FILE -rfc -file $CERTIFICATE_NAME -storepass $SERVER_STOREPASS_PASSWORD

keytool -import -trustcacerts -file $CERTIFICATE_NAME -alias $HOST_NAME -keystore $TRUSTSTORE_FILE

Also, import the certificate to the truststore used by Knox which is the JDK’s default cacerts file.

keytool -import -trustcacerts -file certificate.cert -alias $HOST_NAME -keystore /usr/lib/jvm/jre-1.7.0 openjdk.x86_64/lib/security/cacerts

Make sure to point the path to cacerts for the JDK used by Knox in your deployment.

Type yes when the asked to add the certificate to the truststore.

Configure HDFS for SSL

Copy example ssl-server.xml and edit it to use the ssl configuration created in previous step.
cp /etc/hadoop/conf.empty/ssl-server.xml.example /etc/hadoop/conf/ssl-server.xml

And and make sure the following properties are set in /etc/hadoop/conf/ssl-server.xml:

<configuration>
 <property>
   <name>ssl.server.truststore.location</name>
   <value>/etc/security/serverKeys/truststore.jks</value>
 </property>
 <property>
   <name>ssl.server.truststore.password</name>
   <value>password</value>
 </property>
 <property>
   <name>ssl.server.truststore.type</name>
   <value>jks</value>
 </property>
 <property>
   <name>ssl.server.keystore.location</name>
   <value>/etc/security/serverKeys/keystore.jks</value>
 </property>
 <property>
   <name>ssl.server.keystore.password</name>
   <value>password</value>
 </property>
 <property>
   <name>ssl.server.keystore.type</name>
   <value>jks</value>
 </property>
 <property>
   <name>ssl.server.keystore.keypassword</name>
   <value>password</value>
 </property>
</configuration>

Use Ambari to set the following properties in core-site.xml.

hadoop.ssl.require.client.cert=false
hadoop.ssl.hostname.verifier=DEFAULT_AND_LOCALHOST
hadoop.ssl.keystores.factory.class=org.apache.hadoop.security.ssl.FileBasedKeyStoresFactory
hadoop.ssl.server.conf=ssl-server.xml

Use Ambari to set the following properties in hdfs-site.xml.

dfs.http.policy=HTTPS_ONLY
dfs.datanode.https.address=workshop.hortonworks.com:50475

The valid values for dfs.http.policy are HTTPS_ONLY & HTTP_AND_HTTPS.

The valid values for hadoop.ssl.hostname.verifier are DEFAULT, STRICT,STRICT_I6, DEFAULT_AND_LOCALHOST and ALLOW_ALL. Only use ALLOW_ALL in a controlled environment & with caution. And then use ambari to restart all hdfs services.

Configure Knox to connect over SSL to WebHDFS

Make sure /etc/knox/conf/topologies/sandbox.xml (or whatever is the topology for your Knox deployment is) has a valid service address with HTTPS protocol to point to WebHDFS.

<service>
    <role>WEBHDFS</role>
    <url>https://workshop.hortonworks.com:50470/webhdfs</url>
</service>

Validate End to End SSL – Client > Knox > WebHDFS
curl -iku guest:guest-password -X GET 'https://workshop.hortonworks.com:8443/gateway/sandbox/webhdfs/v1/ op=LISTSTATUS'

SSL Between Knox & Oozie

By default, Oozie server runs with properties necessary for SSL configuration. For example,
do a ‘ps’ on your Oozie server (look for the process named Bootstrap) and you will see the following properties:

  • -Doozie.https.port=11443
  • -Doozie.https.keystore.file=/home/oozie/.keystore
  • -Doozie.https.keystore.pass=password

You can change these properties with Ambari in Oozie server config in the advance oozie-env config section. I changed them to point the keystore file to /etc/oozie/conf/keystore.jks. For this blog, I re-used the keystore I created earlier for HDFS and copied /etc/security/serverKeys/keystore.jks to /etc/oozie/conf/keystore.jks

Configure Knox to connect over SSL to Oozie

Make sure /etc/knox/conf/topologies/sandbox.xml (if whatever is the topology for your Knox deployment is) has a valid service address with HTTPS protocol to point to Oozie.

<service>
    <role>OOZIE</role>
    <url>https://workshop.hortonworks.com:11443/oozie</url>
 </service>

Validate End to End SSL – Client > Knox > Oozie
Apache Knox comes with a DSL client that makes operations against a Hadoop cluster trivial to use for exploration and one does not need to handle raw HTTP requests as I used in the previous examples.

Now run the following commands:
cd to /usr/lib/knox & java -jar bin/shell.jar samples/ExampleOozieWorkflow.groovy

Check out the /usr/lib/knox/samples/ExampleOozieWorkflow.groovy for details on what this script does.

Use higher strength SSL cipher suite

Often it is desirable to use a more secure cipher suite for SSL. For example, in the HDP 2.1 Sandbox, the cipher used for SSL between curl client and Knox in my environment is “EDH-RSA-DES-CBC3-SHA.”

JDK ships with ciphers that comply with US exports restricts and that limit the cipher strength. To use higher strength cipher, download UnlimitedJCE policy for your JDK vendor’s website and copy the files into the JRE. For Oracle JDK, you can get unlimited policy files from here.

cp UnlimitedJCEPolicy/* /usr/jdk64/jdk1.7.0_45/jre/lib/security/

If you use Ambari to setup your Hadoop cluster, you already have highest strength cipher allowed for the JDK, since Ambari setups the JDK with unlimited strength JCE policy.

Optional: Verify higher strength cipher with SSLScan

You can use a tool such as OpenSSL or SSLScan to verify that AES256 is now used for SSL.

For example, the command:
openssl s_client -connect localhost:8443
will print the cipher details:

New, TLSv1/SSLv3, Cipher is ECDHE-RSA-AES256-SHA384
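SSLScan gives a similar view of the ciphers the server will negotiate. A minimal invocation, assuming the sslscan tool is installed, is:

sslscan localhost:8443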

Conclusion

SSL is the backbone of wire encryption. In Hadoop, there are multiple channels to move the data and access the cluster. This blog covered key SSL concepts and walked through steps to configure a cluster for end-to-end SSL. These steps are validated on a single node Hadoop cluster. Following these instructions for a multi-node Hadoop cluster will require a few more steps.

This blog covered a lot of ground, and I want to thank my colleague Sumit Gupta for his excellent input. Please let me know if you have questions or comments.

The post End to End Wire Encryption with Apache Knox appeared first on Hortonworks.

Discover HDP 2.2: Join Us for Eight 30-Minute Enterprise Hadoop Webinars


Last week’s release of Hortonworks Data Platform 2.2 is packed with countless new features for Enterprise Hadoop. These include the results of Hortonworks’ investment in VERTICAL integration with YARN and HDFS, and also HORIZONTAL innovation to ensure the key enterprise services of governance, security, and operations can be applied consistently and reliably across all the components within the Apache Hadoop platform.


To guide you through these capabilities, Hortonworks is hosting a new series of eight Thursday webinars beginning on October 23 and running to December 18. We hope you can join us!

You can join any of the webinars listed below, or sign up for all 8 by clicking this button.

REGISTER NOW—THE WHOLE SERIES

Discover HDP 2.2: Comprehensive Hadoop Security with Apache Ranger and Apache Knox
REGISTER NOW – Thursday, October 23, 2014 – 10:00 AM PST

In this 30-minute webinar, Balaji Ganesan, Hortonworks senior director for enterprise security strategy, and Vinay Shukla, director of product management, discuss HDP 2.2’s features for delivering comprehensive security in the platform.

Balaji and Vinay will discuss Apache Ranger and Apache Knox and how they are integrated in HDP 2.2 to provide fine-grained authorization, auditing and API security that can be centrally administered. They will present an overview of HDP 2.2 security, show a brief demo, and leave time for questions at the end.

Discover HDP 2.2: Even Faster SQL Queries with Apache Hive and Stinger.next
REGISTER NOW – Thursday, October 30, 2014 – 10:00 AM PST

Earlier this year, the open source community delivered the Stinger Initiative to improve speed, scale and SQL semantics in Apache Hive. Now Stinger.next is underway, to build on those initial successes.

Join this 30-minute webinar with Hortonworks founder Alan Gates and the Hortonworks Hive product manager Raj Baines. Alan and Raj will discuss SQL queries in HDP 2.2, with ACID transactions and the cost based optimizer. They will also talk about the road ahead for the Stinger.next initiative. As always, we’ll leave time for questions at the end.

Discover HDP 2.2: Apache Falcon for Hadoop Data Governance
REGISTER NOW – Thursday, November 6, 2014 – 10:00 AM PST

Hortonworks Data Platform 2.2 includes Apache Falcon for Hadoop data governance. In this 30-minute webinar, we’ll discuss why the enterprise needs Falcon for governance, and demonstrate data pipeline construction, policies for data retention and management with Apache Ambari. We’ll also discuss new innovations including: integration of user authentication, data lineage, an improved interface for pipeline management, and the new Falcon capability to establish an automated policy for cloud backup to Microsoft Azure or Amazon S3.

Join Apache Falcon committer and PMC member Venkatesh Seetharam and Hortonworks Falcon product manager Andrew Ahn as they discuss Falcon and take questions from attendees.

Discover HDP 2.2: Data Storage Innovations in Hadoop Distributed File System (HDFS)
REGISTER NOW – Thursday, November 13, 2014 – 10:00 AM PST

Hadoop Distributed File System (HDFS) is a core pillar of enterprise Hadoop – providing reliable, scalable and flexible data storage for multiple analytical workloads. HDP 2.2 includes a huge amount of core innovation in HDFS from the Apache community.

In this 30-minute webinar you’ll learn what’s new in HDFS. Rohit Bakhshi, the Hortonworks product manager for HDFS, and Jitendra Pandey, a senior HDFS architect at Hortonworks, will discuss the major innovations in HDFS – covering new heterogeneous storage capabilities, new encryption functionality and enhancements in operational security.

Discover HDP 2.2: Learn What’s New in YARN: Reliability, Scheduling and Isolation
REGISTER NOW – Thursday, November 20, 2014 – 10:00 AM PST

Apache Hadoop YARN is the architectural center of Hadoop that allows multiple data processing engines such as interactive SQL, real-time streaming, data science and batch processing to handle data stored in a single platform.

In this 30-minute webinar, Rohit Bakhshi, product manager at Hortonworks, and Vinod Vavilapalli, who leads YARN development at Hortonworks, will present an overview of YARN and discuss recent YARN innovations for reliability, scheduling and isolation, including:

  • Rolling upgrades
  • Fault tolerance
  • CPU scheduling
  • C-Group isolation

Discover HDP 2.2: Apache Storm and Apache Kafka for Stream Data Processing
REGISTER NOW – Thursday, December 4, 2014 – 10:00 AM PST

Hortonworks Data Platform 2.2 ships with Apache Storm and Apache Kafka for processing stream data in Hadoop. Now Storm runs on YARN with Apache Slider and it includes Kerberos support. The new Apache Kafka bolt for Storm supports sophisticated chaining for real-time analysis.

Join Hortonworks vice president of product management Tim Hall and Taylor Goetz, a Hortonworks committer to Storm, for this 30-minute webinar as they discuss these and other new streaming and security capabilities in HDP 2.2.

Discover HDP 2.2: Apache HBase with YARN and Slider for Fast, NoSQL Data Access
REGISTER NOW – Thursday, December 11, 2014 – 10:00 AM PST

Apache HBase provides low-latency storage for scenarios that require real-time analysis and tabular data for end user applications.

Join this 30-minute webinar to learn from Devaraj Das, Hortonworks co-founder and Apache HBase committer, and Hortonworks product manager Carter Shanklin. Devaraj and Carter will discuss the HBase innovations that are included in HDP 2.2, including: support for Apache Slider; Apache HBase high availability (HA); block cache compression; and wire-level encryption.

Discover HDP 2.2: Using Apache Ambari to Manage Hadoop Clusters
REGISTER NOW – Thursday, December 18, 2014 – 10:00 AM PST

Apache Ambari is a single framework for IT administrators to provision, manage and monitor a Hadoop cluster. Apache Ambari 1.7.0 is included with Hortonworks Data Platform 2.2.

In this 30-minute webinar, learn from the Hortonworks Ambari product manager Jeff Sposetti and Apache Ambari committer Mahadev Konar about new capabilities including:

  • Improvements to Ambari core – such as support for ResourceManager HA
  • Extensions to Ambari platform – introducing Ambari Administration and Ambari Views
  • Enhancements to Ambari Stacks – dynamic configuration recommendations and validations via a “Stack Advisor”

The post Discover HDP 2.2: Join Us for Eight 30-Minute Enterprise Hadoop Webinars appeared first on Hortonworks.

Rebalancing the Security Equation


Joe Travaglini, director of product marketing at Sqrrl and Ely Kahn, vice president of business development at Sqrrl, are our guest bloggers. They explain Sqrrl’s integration with Hortonworks Data Platform (HDP).

There Is No Secure Perimeter

With the dawn of phenomena such as Cloud Computing and Bring Your Own Device (BYOD), it is no longer the case that there is a well-defined perimeter to secure and defend. Data is able to flow inside, outside, and across your network boundaries with limited interference from traditional controls. The “trusted zone” as we know it is a thing of the past.

Furthermore, Big Data is all about breaking down silos and gathering disparate data sources with various security and compliance requirements into a shared platform. While this enables building new types of applications and analytics, it also compounds the risks of data loss events, given the extra gravity these platforms command. In other words, Big Data amplifies the stakes of security.

How will you address this issue? It requires rethinking the approach. We need to embrace the chaos and change the security equation entirely. If we can’t adequately protect the data, why not let it protect itself?

A New Security Paradigm

Data-Centric Security describes the philosophy that all data has embedded within it information that specifies policy, access, and governance. A core principle of the Big Data movement brought a fundamental change to the flow in the data-application lifecycle (i.e., “move the application to the data”, instead of the other way around), and Data-Centric Security involves a similar inversion. Rather than building layer upon layer of rules and protections, and funneling everything through multiple checkpoints to enforce security procedures, Data-Centric Security yields a hardened ecosystem with self-contained policy and distributed enforcement.

Figure 1. DCS Reference Architecture

Sqrrl Leads the Charge

Sqrrl Enterprise, integrated with the Hortonworks Data Platform (HDP), provides comprehensive, end-to-end Data-Centric Security for NoSQL data access. We believe that a Data-Centric Security offering should include:

  • Fine-grained, cell-level security enforcement – the independent access validation of every field of data stored in the system, individually
  • Data labeling capability – the ability to assign visibility labels to data that specify access policy, using a set of rules
  • Policy specification capability – the ability to grant individual or groups of users entitlements to view data that has a particular set of visibility labels
  • Encryption, at-rest and in-motion – ensuring that data is always protected cryptographically, whether resident on disk or traversing the network
  • Secure search – ensuring that data is easily retrievable, and that this convenience does not provide a source of data leakage
  • Auditing – recording every client operation taken against the system

Sqrrl Enterprise is a secure, scalable, and flexible NoSQL database that allows secure integration, exploration, and analyses of disparate datasets. It sits on each data node within the Hadoop cluster and can power secure, real-time analytics and visualizations on Hadoop. Figure 2 outlines how Sqrrl Enterprise integrates with HDP.


Figure 2. Sqrrl/Hortonworks Joint Reference Architecture
Data is first ingested from a variety of sources. Sqrrl Enterprise can support bulk uploads via its MapReduce-based bulk uploader or streaming uploads via its integration with Apache Flume. When data is ingested it is labeled at the “cell-level”. This means that each individual key/value pair (or field in a JSON document) is tagged with a unique security label that dictates who can access that individual piece of data. All data is also encrypted (both in motion and at rest).

Data is then indexed via Sqrrl’s secondary indexing techniques and stored in an enhanced version of Apache Accumulo within HDP (full integration with HDP Accumulo is expected in mid-2015). Sqrrl Enterprise provides users with a powerful query language (referred to as SqrrlQL) to explore the data. A unique feature that Sqrrl provides is that SqrrlQL is fully integrated with cell-level security concepts. This means that users can conduct SQL-like, full-text, or graph searches, and they will only see the pieces of data that they are authorized to see based on how the data is tagged and their authorizations.

Sqrrl Enterprise also provides integrations with other tools, such as Apache Spark, R, Apache Pig, and MapReduce to run predictive analytics, including machine learning, over data stored in the platform. Apache Hive integration is also expected in the future.

Apache Slider is an incubating Hadoop project that enables long-running processes, such as Apache Accumulo, to run on YARN. Since Sqrrl is built on a foundation of Accumulo, YARN support for Sqrrl will come online as Slider graduates to a top-level Apache project.

Integration with Other Hadoop Security Projects

There are also a variety of other Hadoop-related security projects that can complement the capabilities of Sqrrl Enterprise. A previous Hortonworks blog post identified a number of these projects, and below is a list highlighting how Sqrrl Enterprise interfaces with them.

  • Apache Ranger: This project is focused on coordinating security policies across the entire Hadoop stack, and can help ensure policies associated with Sqrrl Enterprise are aligned with the rest of the Hadoop stack.
  • Apache Knox: Knox provides authentication, authorization, audit, and SSO capabilities for the Hadoop stack. Sqrrl Enterprise currently has support for Kerberos, LDAP, Active Directory, SSO, and audit, and the goal is to integrate these capabilities with Knox.

Use Cases

Sqrrl and Hortonworks have partnered to bring powerful Big Data solutions to a variety of large corporations in industries such as telecommunications, healthcare, government, and finance. Below is a description of the joint Sqrrl/Hortonworks solution for a Fortune 100 company.

  • Problem: The Company faces an evolving threat landscape presenting advanced persistent threats (APTs), massive volumes of data, and new levels of attacker sophistication. To confront these threats, the Company sought the capability to perform advanced security analytics on years’ worth of collected data including Internet, active directory, email, USB, and VPN logs. Its current SIEM tools could not scale cost effectively or efficiently to this amount of data. The Company also had security concerns about integrating various datasets in a single location.
  • Sqrrl and Hortonworks Solution: Sqrrl and Hortonworks collaborated to provide a distributed computing and storage platform for the Company’s Security Operation Center. Sqrrl and Hortonworks are cost effectively ingesting and storing massive amounts of disparate cyber data in a single secure “data lake”, which enables better data retention and deeper analysis and visibility into the data. Specifically, Sqrrl Enterprise powers an internal investigations application to query and summarize massive amounts of cyber data from across the organization. Sqrrl Enterprise supports interactive query speeds, keyword searches, streaming results, and encryption of all data at rest and in motion. Sqrrl Enterprise also integrates with HDP to support advanced analytics, such as machine learning. The joint solution relies on Sqrrl’s data-centric security capabilities to enable secure access to the integrated cyber datasets from users across the organization.

Getting Started

There is a quick and simple way to experience the power of Sqrrl Enterprise + HDP. Sqrrl has recently released its Test Drive VM, which is fully integrated and packaged with HDP 2.1, courtesy of the Hortonworks Sandbox. To request access to the VM, please sign up here:

Learn More, Join the Webinar November 12 @10am PT / 1pm ET

Register for our joint webinar: Discovering Patterns for Cyber Defense Using Linked Data Analysis

 

The post Rebalancing the Security Equation appeared first on Hortonworks.

Discover HDP 2.2: Comprehensive Hadoop Security with Apache Ranger and Apache Knox


Last week Hortonworks presented the first of 8 Discover HDP 2.2 webinars: Comprehensive Hadoop Security with Apache Ranger and Apache Knox. Vinay Shukla and Balaji Ganesan hosted this first webinar in the series.

Balaji discussed how to use Apache Ranger for centralized security administration, to set up authorization policies, and to monitor user activity with auditing. He also covered Ranger innovations now included in HDP 2.2:

  • Support for Apache Knox and Apache Storm, for centralized authorization and auditing
  • Deeper integration of Ranger with the Apache Hadoop stack with support for local grant/revoke in HDFS and HBase
  • Ranger’s enterprise readiness, with the introduction of REST APIs for policy management, and scalable storage of audit data in HDFS

Vinay presented Apache Knox and API security for Apache Hadoop. Specifically, Vinay covered how Apache Knox securely extends the reach of Hadoop APIs to anyone in an organization, using any type of device. Vinay also walked through new innovations in Knox that are included in HDP 2.2:

  • Support for the YARN REST API
  • Support for HDFS HA
  • Support for SSL to Hadoop cluster services (WebHDFS, Apache HBase, Apache Hive and Apache Oozie)
  • The Knox Management REST API
  • Integration with Apache Ranger for service-level authorization

Here is the complete recording of the Webinar, including Balaji’s demo of Apache Ranger.

And here are the Presentation slides.

Attend our next Discover HDP 2.2 on Thursday, October 30 at 10am Pacific Time: Even Faster SQL Queries with Apache Hive and Stinger.next

Or register for all remaining webinars in the series.

We’re grateful to the many participants who joined the HDP 2.2 security webinar and asked excellent questions. This is the complete list of questions with their corresponding answers:

Q: Does Apache Ranger affect Apache Hive’s performance?

A: No. Ranger manages the policy centrally, but it pushes enforcement down to the local component. So for Apache Hive authorization, the policy is managed by Ranger but enforced by the Ranger plugin running within Hive, and the integration of Ranger does not impact Hive’s performance. At the same time, Apache Ranger brings great value with the ability to centrally manage access policies across different components in the Hadoop platform.

Q: Does Apache Ranger manage YARN ACLs?

A: Not yet. Externalizing YARN ACLs through Ranger is in the works, but it is not available today in HDP 2.2.

Q: How does Ranger hook into various services to enforce authorization? Do Hive and HBase provide the necessary hooks for Ranger policies?

A: Yes. Apache Ranger provides plugins which embed within the processes of the various components. These plugins use authorization hooks to enforce access control for user requests. In the work we’ve done for HDP 2.2, we’ve made these hooks even better. Apache Hive has a new Hive authorization API, and Ranger has an implementation of that. In the case of HBase, it also has an authorization method where an external co-processor can be used. Ranger provides its own co-processor that is invoked as part of the HBase process and used for authorization. Also, in the case of Apache Knox and Apache Storm, we have used similar authorization hooks within those components. That’s the idea of Ranger: we don’t want to change anything within the components, but we want to use those hooks to externalize the management of the authorization.

Q: How do ODBC and JDBC drivers talk with the Knox API Gateway in a secure way?

A: See this blog for a detailed answer to the question: Secure JDBC and ODBC Clients’ Access to HiveServer2. In summary, when you configure HiveServer2’s Thrift gateway for HTTP transport, ODBC and JDBC calls (for example, from Beeline) can be routed over HTTP. Each request then becomes an HTTP call, and Knox can secure those calls. The main question is how to provide ODBC/JDBC access over HTTP: you enable HTTP transport for the Thrift server and route those calls through Apache Knox. Knox then provides authentication, wire encryption and authorization (through Apache Ranger).
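As an illustration, a Beeline connection through Knox in HTTP transport mode might look like the following (the host, topology name, truststore path and password are placeholders for your environment):

beeline -u "jdbc:hive2://knox.example.com:8443/;ssl=true;sslTrustStore=/path/to/knox-truststore.jks;trustStorePassword=changeit;transportMode=http;httpPath=gateway/default/hive" -n guest -p guest-password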

Q: What protects Apache Ranger’s audit data from intentional alteration or corruption?

A: With Apache Ranger in HDP 2.2 we can store audit data in MySQL, Oracle DB and HDFS, and we can apply an internal process to protect the audit data from being altered by any user other than Apache Ranger. Audit information is accessible only through the Ranger Admin Portal, for specific users with privileges.

Q: How do you secure access to the Management REST API?

A: You can support authentication through Knox by putting Knox in front of these REST APIs. The other model is direct REST API access, if you are not using Knox. You can directly access Ranger’s REST API and use standard security methods such as SSL.

Q: How does the integration work with SiteMinder for secure single sign-on (SSO)?

A: With Knox, we support SSO, so for all the REST APIs that you expose to your Hadoop end users, you can support SSO through Knox. For example, when you deploy Knox, it supports CA SiteMinder, Oracle Access Management Suite or Tivoli Access Manager. You can deploy Knox with an Apache HTTP Server and leverage its integration, or you can integrate directly with Knox.

Visit these pages to learn more

The post Discover HDP 2.2: Comprehensive Hadoop Security with Apache Ranger and Apache Knox appeared first on Hortonworks.

Improve Insight into Your Enterprise Data with Red Hat JBoss Data Virtualization and HDP – Part 2


In part 1, Kenneth Peeples, JBoss technology evangelist and principal marketing manager for Data Virtualization and Fuse Service Works at Red Hat, gave us an overview of the Red Hat and Hortonworks webinar series and offered insights into JBoss Data Virtualization and HDP. He started with an overview of data virtualization with the Hortonworks Data Platform and went over the first use case, Sentiment and Sales Analysis. Today, he describes the three other use cases. For those of you who missed one of our webinar series (or want to review them), you can find recordings of all sessions on the Red Hat partner page.

Use Case 2 from Webinar 2 – Data Firewall with Multiple Hadoop Instances

This use case describes using a data firewall with multiple Hadoop instances. The data virtualization unified view uses roles to display all data, display data for a specific region, and/or mask data.

  • Objective: Secure data according to role, with row-level security and column masking.
  • Problem: Cannot hide region data from region-specific users.
  • Solution: Leverage JBoss Data Virtualization to provide row-level security and masking of columns.

In this use case, we show some of the security capabilities within DV, such as Role Based Access Control (RBAC), Column Masking and Centralized Management of VDB privileges. A summary of the security features in DV:

  • Authentication: Kerberos, LDAP, WS-UsernameToken, HTTP Basic, SAML
  • Authorization: Virtual data views, Role based access control
  • Administration: Centralized management of VDB privileges
  • Audit: Centralized audit logging and dashboard
  • Protection: Row and column masking and SSL encryption (ODBC and JDBC)

Demonstration Detail: Our demonstration will show multiple test cases. We have a super user or admin user, a US user, and an EU user. The admin user has the admin role, the US user has the usaccess role and the EU user has the euaccess role. We have a Virtual Database with the masking and row-level security defined. Two HDP Sandboxes are set up with two tables in each – customer and customer address. The VDB contains two tables – customers and customer addresses. Our three test cases are:

  1. Superuser with admin role
     • All data is available for all the regions for both tables
     • No data is masked
  2. US user with usaccess role
     • Only the US region data is viewable
     • All of the birthdate information is masked
  3. EU user with euaccess role
     • Only the EU region data is viewable
     • All of the birthdate information is masked

    The SQuirreL Client is used to run the different use cases to highlight the row security and column masking.

    Demonstration References

    Use Case 3 from Webinar 3 – Virtual Data Marts with multiple Virtual Databases and the HDP Sandbox

This use case describes using Virtual Data Marts with multiple Virtual Databases and the HDP Sandbox.

    • Objective: Purpose oriented data views for functional teams over a rich variety of semi-structured and structured data.
    • Problem: Data Lakes have large volumes of consolidated clickstream data, product and customer data that need to be constrained for multi-departmental use.
    • Solution: Leverage HDP to mashup clickstream analysis data with product and customer data to better understand customers’ behaviors on the website, and mashup customer and product data to improve product marketing strategy.

Demonstration Detail: Our demonstration uses one HDP Sandbox with two VDBs. The user, product and web log data are all stored in HDP. The two VDBs allow access for the Marketing and Product teams. The Marketing VDB combines the clickstream logs with customer data so that Marketing can find who (what gender, age) is accessing their site and when they drop off. The Product VDB combines the customer and product data so the Product team can see who (location, age, gender) has been buying the products and can make product plans targeting those users. The Data Virtualization Dashboard is used to show the data according to the Marketing or Product teams.
    Demonstration References:

    Use Case 4 from Webinar 3 – Materialized views to Improve Access to Data

This use case describes how to use materialized views to improve access to your data. This use case is still in progress, so the demonstration source and supporting files will be available soon; keep a lookout for them. I want to describe it briefly here.

    • Objective: Improve access to data, especially operational data.
    • Problem: All the legacy and archived data are in the Hadoop data lake. We want to access the most recent, up to the minute, operational data often and quickly.
    • Solution: Use JBoss Data Virtualization to integrate up to the minute data from multiple diverse data sources that can be quickly queried.
      • Use HDP for all data older than today.
      • Use JDV to materialize the data in HDP for faster access and to combine it with the operational VDB.

Demonstration Detail: This demonstration is currently being developed.

      Demonstration References

      • Source and Supporting Files: To be posted
      • Tutorial: To be posted

      Conclusion

In closing, DV and HDP complement each other to give your enterprise the necessary tools and architecture to get the most out of your data. Your legacy data stores as well as big data can be combined into helpful views for easier analysis through a wide range of analytic tools. This will help your enterprise interpret the large amounts of data that continue to grow at an astronomical rate in today's enterprise. Try our demonstrations with the collateral that has been created, and keep watching for more as the partnership between Red Hat and Hortonworks continues to grow.

      Additional Resources

      To learn more, listen to the replays of the webinars listed below:

      The post Improve Insight into Your Enterprise Data with Red Hat JBoss Data Virtualization and HDP – Part 2 appeared first on Hortonworks.

    Chart Your Journey to Scale the Hadoop Summit Brussels


    A Cosmopolitan Metropolis

    Brussels, Belgium, conjures images of a cosmopolitan metropolis, where geopolitical summits are held, where world economic forums are debated, where global European institutions are headquartered, and where citizens and diplomats fluently converse in more than three languages—English, French, Dutch or German, along with other non-official local flavors.


    To this colorful collage, add the image of a Hadoop Summit Europe 2015 for big data developers, practitioners, industry experts, and entrepreneurs, who make a difference in the digital world, who fluently code in multiple programming languages—Java, Python, Scala, C++, Pig, SQL, or R—and innovate and incubate Apache projects.

    Journey to Scale the Hadoop Summit

    In early spring next year, the Hadoop Summit Europe 2015, the leading conference for the Apache Hadoop community, will be held in this metropolis from April 15-16, 2015. Overall, the summit will have keynotes along with tracks divided into six key topical areas including:

    • Committer Track: If you’re a committer and wish to share a deep-dive technical talk with other committers on your Apache Hadoop related project, find out more about the track here and submit your abstract.
    • Data Science & Hadoop: Sessions in this track focus on the practice of data science using Hadoop. If you’re a data scientist or data analyst, find out more about the track here and submit your abstract.
    • Hadoop Governance, Security & Operations: These are the core pillars that make up the requirements of any enterprise Hadoop and modern data architecture. Speakers who wish to share their experience building and deploying big data infrastructure can find out more about the track here and submit their abstract.
    • Hadoop Access Engines: Apache Hadoop YARN, the data operating system and architectural center of Hadoop, has transformed Hadoop into a multi-tenant data platform. In this track speakers will present both foundational and latest trends for YARN and other key data processing and accessing engines. Learn more about the track here and submit your abstract.
    • Applications of Hadoop and Data Driven Business: For speakers who would like to discuss tools, techniques, takeaways, and solutions to deliver business value and competitive advantage from big data or discuss case studies, learn more about the track here and submit your abstract.
    • The Future of Apache Hadoop: Where is Hadoop going? What innovative projects are under incubation? What industry initiatives are underway? If you want to share your vision and ideas with technical leads, architects, committers, and expert users, discover more about the track here and submit your abstract.

In all, you have six topical tracks to choose from, and you may submit abstracts to multiple tracks. The deadline for submitting abstracts is approaching fast: December 5, 2014.

Scale the Hadoop Summit Brussels and savor glamorous Brussels during early spring. Submit your abstract now!

The Hadoop community is looking forward to hearing you speak at this summit.

    The post Chart Your Journey to Scale the Hadoop Summit Brussels appeared first on Hortonworks.


    Announcing Apache Knox Gateway 0.5.0


    With Apache Hadoop YARN as its architectural center, Apache Hadoop continues to attract new engines to run within the data platform, as organizations want to efficiently store their data in a single repository and interact with it for batch, interactive and real-time streaming use cases. More and more independent software vendors (ISVs) are developing applications to run in Hadoop via YARN. This increases the number of users and processing engines that operate simultaneously across a Hadoop cluster, on the same data, at the same time.

    The Apache Knox Gateway (“Knox”) provides perimeter security so that the enterprise can confidently extend Hadoop access to more of those new users while also maintaining compliance with enterprise security policies.

We recently released Apache Knox Gateway 0.5.0. With this release, the community addressed more than 90 JIRA issues. Among these many improvements, four stand out as particularly important:

    • Support for HDFS HA
    • Installation and configuration with Apache Ambari
    • Service-level authorization with Apache Ranger
    • YARN REST API access

    This blog gives an overview of these new features and how they integrate with other Hadoop services. We’ll also touch on additional innovation we plan for upcoming releases.


    Support for HDFS HA

The WebHDFS routing service in Apache Knox now supports HDFS high availability, allowing Knox users to transparently take advantage of the high availability provided by HDFS. HDFS HA provides failover in case of NameNode failure, shielding Knox users if this occurs.

    Installation and configuration with Apache Ambari

    As part of the recent release of Ambari 1.7.0, Knox is now integrated so that Ambari can install and manage Knox’s configuration and process lifecycle. This simplifies Knox install for someone familiar with Ambari and allows Knox to be managed by Ambari.

    Service-level authorization with Apache Ranger

    This improvement empowers Apache Ranger to centrally manage the service-level authorizations enforced by Knox. With this integration, all benefits of Ranger-based centralized policy management are also exposed to Knox users.

    YARN REST API access

    Now the Apache Hadoop YARN’s rich web service REST APIs are accessible through a new Knox routing service for the resource manager. This provides access to the metadata, monitoring and management capabilities of YARN’s application resources and should be useful for many upcoming innovations in the community of Knox developers and users.

    Plans for the Future

    The Apache Knox team and contributors have done a great job understanding the needs of the user community and meeting those in this release. We see the quality and number of features growing with each successive Apache Knox release.

    We will continue to focus on improvements in three primary areas:

    • Providing access to the evolving REST API of the modern data architecture. Exposing more APIs to new and interesting clients and applications.
    • Securing access to the REST APIs for the API consumers in the community. We will be expanding authentication and federation/SSO capabilities to meet user and developer needs. We plan to deliver OAuth 2, Keystone token and JWT capabilities in an upcoming release.
    • Improving developer experience in the Apache Knox API Gateway for two groups of developers:
      • those that want to extend Apache Knox and contribute directly to it with routing services and security provider extensions, and
      • those that want to consume Hadoop resources within their applications, scripts and higher level APIs through the Knox API Gateway.

    Download Apache Knox Gateway and Learn More

    The post Announcing Apache Knox Gateway 0.5.0 appeared first on Hortonworks.

    Adding a Federation Provider to Apache Knox


    The architecture of Hortonworks Data Platform (HDP) matches the blueprint for Enterprise Apache Hadoop, with data management, data access, governance, operations and security. This post focuses on one of those core components: security. Specifically, we will focus on Apache Knox Gateway for securing access to the Hadoop REST APIs.

    Pseudo Federation Provider

    This blog will walk through the process of adding a new provider for establishing the identity of a user. We will use the simple example of the Pseudo authentication mechanism in Hadoop to illustrate ideas for extending the pre-authenticated federation provider that is available out of the box in Apache Knox. This provider is not yet ready for use in a production environment, but the example will highlight the general programming model for adding pre-authenticated federation providers. There is also a companion github project for this article.

    Provider Types

    Apache Knox has two types of providers for establishing the identity of the source of an incoming REST request. One is an Authentication Provider and the other is a Federation Provider.

    Authentication Providers

    Authentication providers are responsible for actually collecting credentials of some sort from the end user. Some examples would be things like HTTP BASIC authentication with username and password that gets authenticated against LDAP or RDBMS. Apache Knox ships with HTTP BASIC authentication against LDAP using Apache Shiro. The Shiro provider can actually be configured in multiple ways.

    Authentication providers are sometimes less than ideal since many organizations only want their users to provide credentials to the specific trusted solutions and to use some sort of SSO or federation of that authentication event across all other applications.

    Federation Providers

    Federation providers, on the other hand, never see the user’s actual credentials. Instead, they validate a token that represents a prior authentication event. This allows for greater isolation and protection of user credentials while still providing some means to verify the trustworthiness of the incoming identity assertions. OAuth 2, SAML assertions, JWT/SWT tokens and header-based identity propagation are all examples of federation providers.

    Out of the box, Apache Knox enables the use of custom headers for propagating things like the user principal and group membership through the HeaderPreAuth federation provider. This is generally useful for solutions such as CA SiteMinder and IBM Tivoli Access Manager. In these sorts of deployments, all traffic to Hadoop would go through the solution gateway, which would then authenticate the user and can inject the request with identity propagation headers.

Network security and the identity management solution together ensure that requests cannot bypass the authenticating solution gateway. This provides a level of trust for accepting the header-based identity assertions. Knox can be configured to provide additional validation through a pluggable mechanism and IP address validation. This ensures that the requests are coming from a configured set of trusted IP addresses, presumably those of the solution gateway.
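For instance, an out-of-the-box HeaderPreAuth configuration with IP address validation might look roughly like the following sketch (the parameter name and address value are illustrative; check the Knox documentation for your release):

    <provider>
        <role>federation</role>
        <name>HeaderPreAuth</name>
        <enabled>true</enabled>
        <param>
            <name>preauth.ip.addresses</name>
            <value>10.0.0.1</value>
        </param>
    </provider>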

    Let’s Add a Federation Provider

    This blog will discuss how to add a new federation provider that will extend the abstract bases that were introduced in the PreAuth provider module. It will be a minimal provider that accepts a request parameter from the incoming request.

    The module and dependencies

    <?xml version="1.0" encoding="UTF-8"?>
    <project xmlns="http://maven.apache.org/POM/4.0.0"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
        <modelVersion>4.0.0</modelVersion>
     
        <groupId>net.minder</groupId>
        <artifactId>gateway-provider-security-pseudo</artifactId>
        <version>0.0.1</version>
     
        <repositories>
            <repository>
                <id>apache.releases</id>
                <url>https://repository.apache.org/content/repositories/releases/</url>
            </repository>
        </repositories>
     
        <dependencyManagement>
            <dependencies>
                <dependency>
                    <groupId>org.apache.knox</groupId>
                    <artifactId>gateway-spi</artifactId>
                    <version>0.5.0</version>
                </dependency>
            </dependencies>
        </dependencyManagement>
        <dependencies>
            <dependency>
                <groupId>org.apache.knox</groupId>
                <artifactId>gateway-spi</artifactId>
                <version>0.5.0</version>
            </dependency>
            <dependency>
                <groupId>org.apache.knox</groupId>
                <artifactId>gateway-util-common</artifactId>
                <version>0.5.0</version>
            </dependency>
            <dependency>
                <groupId>org.apache.knox</groupId>
                <artifactId>gateway-provider-security-preauth</artifactId>
                <version>0.5.0</version>
            </dependency>
            <dependency>
                <groupId>org.eclipse.jetty.orbit</groupId>
                <artifactId>javax.servlet</artifactId>
                <version>3.0.0.v201112011016</version>
            </dependency>
     
            <dependency>
                <groupId>junit</groupId>
                <artifactId>junit</artifactId>
                <version>4.11</version>
                <scope>test</scope>
            </dependency>
            <dependency>
                <groupId>org.easymock</groupId>
                <artifactId>easymock</artifactId>
                <version>3.0</version>
                <scope>test</scope>
            </dependency>
     
            <dependency>
                <groupId>org.apache.knox</groupId>
                <artifactId>gateway-test-utils</artifactId>
                <scope>test</scope>
                <version>0.5.0</version>
            </dependency>
        </dependencies>
     
    </project>
    

    Dependencies

    NOTE: the “version” element must match the version indicated in the pom.xml of the Knox project. Otherwise, building will fail.

    gateway-provider-security-preauth

    This particular federation provider is going to extend the existing PreAuth module with the capability to accept the user.name request parameter as an assertion by a trusted party of the user’s identity. Knox will use the preauth module to leverage the base classes for things like IP address validation.

    gateway-spi

    The gateway-spi dependency pulls in the general interfaces, base classes and utilities that are expected by the Apache Knox gateway. The core GatewayServices are available through the gateway-spi module and other elements of gateway development.

gateway-util-common

The gateway-util-common module provides common utility facilities for developing the gateway product. This is where you find the auditing, JSON and URL utility classes for gateway development.

    javax.servlet from org.eclipse.jetty.orbit

    This module provides the specific classes needed to implement the provider filter.

    junit, easymock and gateway-test-utils

    JUnit, easymock and gateway-test-utils provide the basis for writing REST-based unit tests for the Apache Knox Gateway project. They can be found in all of the existing unit tests for the various modules that make up the gateway offering.

    Apache Knox Topologies

    In Apache Knox, individual Apache Hadoop clusters are represented by descriptors called topologies. These topologies deploy specific endpoints that expose and protect access to the services of the associated Hadoop cluster. The topology descriptor describes the available services and their respective URLs within the actual cluster. It also describes the policy for protecting access to those services.

    The policy is defined through the description of various Providers. Each provider and service within a Knox topology has a role, and provider roles consist of:

    • authentication,
    • federation
    • authorization, and
    • identity assertion

    In this blog we are concerned with a Provider of type “federation.”

    The Pseudo provider makes two assumptions. First, that authentication has happened at the OS level or from within another piece of middleware. The second assumption is that credentials were exchanged with some party other than Knox. This other party will be trusted by the Knox federation provider. The typical provider configuration will look something like this:

    <provider>
        <role>federation</role>
        <name>Pseudo</name>
        <enabled>true</enabled>
    </provider>
    

    Ultimately, an Apache Knox topology manifests as a web application deployed within the gateway process. It exposes and protects the URLs associated with the services of the underlying Hadoop components in each cluster.

    Providers generally interject a ServletFilter into the processing path of the REST API requests that enter the gateway and are dispatched to the Hadoop cluster. The mechanism used to interject the filters, their related configuration and integration into the gateway is the ProviderDeploymentContributor.

    ProviderDeploymentContributor

    /**
    * Licensed to the Apache Software Foundation (ASF) under one
    * or more contributor license agreements.  See the NOTICE file
    * distributed with this work for additional information
    * regarding copyright ownership.  The ASF licenses this file
    * to you under the Apache License, Version 2.0 (the
    * "License"); you may not use this file except in compliance
    * with the License.  You may obtain a copy of the License at
    *
    *     http://www.apache.org/licenses/LICENSE-2.0
    *
    * Unless required by applicable law or agreed to in writing, software
    * distributed under the License is distributed on an "AS IS" BASIS,
    * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    * See the License for the specific language governing permissions and
    * limitations under the License.
    */
    package org.apache.hadoop.gateway.preauth.deploy;
    
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;
    import java.util.Map.Entry;
    
    import org.apache.hadoop.gateway.deploy.DeploymentContext;
    import org.apache.hadoop.gateway.deploy.ProviderDeploymentContributorBase;
    import org.apache.hadoop.gateway.descriptor.FilterParamDescriptor;
    import org.apache.hadoop.gateway.descriptor.ResourceDescriptor;
    import org.apache.hadoop.gateway.topology.Provider;
    import org.apache.hadoop.gateway.topology.Service;
    
    public class PseudoAuthContributor extends
        ProviderDeploymentContributorBase {
      private static final String ROLE = "federation";
      private static final String NAME = "Pseudo";
      private static final String PREAUTH_FILTER_CLASSNAME = "org.apache.hadoop.gateway.preauth.filter.PseudoAuthFederationFilter";
    
      @Override
      public String getRole() {
        return ROLE;
      }
    
      @Override
      public String getName() {
        return NAME;
      }
    
      @Override
      public void contributeFilter(DeploymentContext context, Provider  provider, Service service, 
          ResourceDescriptor resource, List<FilterParamDescriptor> params) {
        // blindly add all the provider params as filter init params
        if (params == null) {
          params = new ArrayList<FilterParamDescriptor>();
        }
        Map<String, String> providerParams = provider.getParams();
        for(Entry<String, String> entry : providerParams.entrySet()) {
          params.add( resource.createFilterParam().name( entry.getKey().toLowerCase() ).value( entry.getValue() ) );
        }
        resource.addFilter().name( getName() ).role( getRole() ).impl(PREAUTH_FILTER_CLASSNAME ).params( params );
      }
    }   
    
    The topology descriptor indicates which DeploymentContributors are required for a given cluster through the role and the name of the providers. The topology deployment machinery within Knox first looks up the required DeploymentContributor by role. In this example, it identifies the provider as being of type federation. It then looks for the federation provider with the name of Pseudo.

    Once the providers have been resolved into the required set of DeploymentContributors, each contributor is given the opportunity to contribute to the construction of the topology web application that exposes and protects the service APIs within the Hadoop cluster.

    This particular DeploymentContributor needs to add the PseudoAuthFederationFilter servlet filter implementation to the topology-specific filter chain. It will also add each of the provider parameters from the topology descriptor as filterConfig parameters. This enables the configuration of the resulting servlet filters from within the topology descriptor while encapsulating the specific implementation details of the provider from the end user.

    PseudoAuthFederationFilter

    /**
     * Licensed to the Apache Software Foundation (ASF) under one
     * or more contributor license agreements.  See the NOTICE file
     * distributed with this work for additional information
     * regarding copyright ownership.  The ASF licenses this file
     * to you under the Apache License, Version 2.0 (the
     * "License"); you may not use this file except in compliance
     * with the License.  You may obtain a copy of the License at
     *
     *     http://www.apache.org/licenses/LICENSE-2.0
     *
     * Unless required by applicable law or agreed to in writing, software
     * distributed under the License is distributed on an "AS IS" BASIS,
     * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
     * See the License for the specific language governing permissions and
     * limitations under the License.
     */
    package org.apache.hadoop.gateway.preauth.filter;
    
    import java.security.Principal;
    import java.util.Set;
    
    import javax.servlet.FilterConfig;
    import javax.servlet.ServletException;
    import javax.servlet.http.HttpServletRequest;
    
    public class PseudoAuthFederationFilter 
      extends AbstractPreAuthFederationFilter {
    
      @Override
      public void init(FilterConfig filterConfig) throws ServletException {
        super.init(filterConfig);
      }
    
      /**
       * @param httpRequest
       */
      @Override
      protected String getPrimaryPrincipal(HttpServletRequest httpRequest) {
        return httpRequest.getParameter("user.name");
      }
    
      /**
       * @param principals
       */
      @Override
      protected void addGroupPrincipals(HttpServletRequest request, 
          Set<Principal> principals) {
        // pseudo auth currently has no assertion of group membership
      }
    }
    

    The PseudoAuthFederationFilter above extends AbstractPreAuthFederationFilter. This particular base class takes care of a number of boilerplate type aspects of pre-authenticated providers that would otherwise have to be done redundantly across providers. The two abstract methods that are specific to each provider are getPrimaryPrincipal and addGroupPrincipals. These methods are called by the base class in order to determine what principals should be created and added to the java Subject that will become the effective user identity for the request processing of the incoming request.

    getPrimaryPrincipal

Implementing the abstract method getPrimaryPrincipal allows the new provider to extract the established identity from the incoming request for the given provider and communicate it back to the AbstractPreAuthFederationFilter. This will then add it to the java Subject being created to represent the user’s identity. For this particular provider, all we have to do is return the request parameter by the name of “user.name.”

    addGroupPrincipals

    Given a set of Principals, the addGroupPrincipals is an opportunity to add additional group principals to the resulting java Subject that will be used to represent the user’s identity. This is done by adding new org.apache.hadoop.gateway.security.GroupPrincipals to the set. For the Pseudo authentication mechanism in Hadoop, there really is no way to communicate the group membership through the request parameters. One could easily envision adding an additional request parameter for this though, “user.groups” for example.
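If one did introduce a hypothetical user.groups request parameter, the override might look roughly like the sketch below (GroupPrincipal and its String constructor are assumed from the org.apache.hadoop.gateway.security package; this is not part of the provider as written in this post):

      @Override
      protected void addGroupPrincipals(HttpServletRequest request,
          Set<Principal> principals) {
        // Hypothetical: split a comma-separated "user.groups" parameter into group principals
        String groups = request.getParameter("user.groups");
        if (groups != null) {
          for (String group : groups.split(",")) {
            principals.add(new GroupPrincipal(group.trim()));
          }
        }
      }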

Configure as an Available Provider

The provider is registered with Knox through a service loader descriptor at resources/META-INF/services/org.apache.hadoop.gateway.deploy.ProviderDeploymentContributor:
    ##########################################################################
    # Licensed to the Apache Software Foundation (ASF) under one
    # or more contributor license agreements.  See the NOTICE file
    # distributed with this work for additional information
    # regarding copyright ownership.  The ASF licenses this file
    # to you under the Apache License, Version 2.0 (the
    # "License"); you may not use this file except in compliance
    # with the License.  You may obtain a copy of the License at
    #
    #     http://www.apache.org/licenses/LICENSE-2.0
    #
    # Unless required by applicable law or agreed to in writing, software
    # distributed under the License is distributed on an "AS IS" BASIS,
    # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    # See the License for the specific language governing permissions and
    # limitations under the License.
    ##########################################################################
    
    org.apache.hadoop.gateway.preauth.deploy.PseudoAuthContributor
    

    Add to Knox as a Gateway Module

    At this point, the module should be able to be built as a standalone module with:
    mvn clean install

    However, we want to extend the Apache Knox Gateway build to include the new module in its build and release processes. In order to do this we will need to add it to a common pom.xml file.

    At the root of the project source tree there is a pom.xml file that defines all of the modules that are official components of the gateway server release. You can find each of these modules in the “modules” element. We need to add our new module declaration there:

    <modules>
      ...
      <module>gateway-provider-security-pseudo</module>
      ...
    </modules>
    

    Then later in the same file we have to add a fuller definition of our module to the dependencyManagement/dependencies element:

    <dependencyManagement>
        <dependencies>
            ...
            <dependency>
                <groupId>${gateway-group}</groupId>
                <artifactId>gateway-provider-security-pseudo</artifactId>
                <version>${gateway-version}</version>
            </dependency>
            ...
        </dependencies>
    </dependencyManagement>
    

    Gateway Release Module Pom.xml

Now, our Pseudo federation provider is building with the gateway project but it isn’t quite included in the gateway server release artifacts. In order to include it in the release archives and make it available to the runtime, we need to add it as a dependency to the appropriate release module. In this case, we are adding it to the pom.xml file within the gateway-release module:

    <dependencies>
        ...
        <dependency>
            <groupId>${gateway-group}</groupId>
            <artifactId>gateway-provider-security-pseudo</artifactId>
        </dependency>
        ...
    </dependencies> 
    

    Note that this is basically the same definition that was added to the root level pom.xml but minus the “version” element.

    Build, Test and Deploy

    At this point, we should have an integrated custom component that can be described for use within the Apache Knox topology descriptor file and engaged in the authentication of incoming requests for resources of the protected Hadoop cluster.

    building

You may use the same Maven command:

    mvn clean install

    This will build and run the gateway unit tests.

You may use the following to not only build and run the tests but to also package up the release artifacts. This is a great way to quickly set up a test instance to manually test your new Knox functionality.

    ant package

    testing

    To install the newly packaged release archive in a GATEWAY_HOME environment:

    ant install-test-home

    This will unzip the release bits into a local ./install directory and do some initial setup tasks to ensure that it is actually runnable.

    We can now start a test ldap server that is seeded with a couple test users:

    ant start-test-ldap

The sample topology files are set up to authenticate against this LDAP server for convenience and can be used as is for a quick sanity test of the install.

    At this point, we can choose to run a test Knox instance or a debug Knox instance. If you want to run a test instance without the ability to connect a debugger then:

    ant start-test-gateway

    You may now test the out-of-box authentication against LDAP using HTTP BASIC by using curl and one of the simpler APIs exposed by Apache Knox:

    curl -ivk --user guest:guest-password https://localhost:8443/gateway/sandbox/webhdfs/v1/tmp?op=LISTSTATUS

    Change Topology Descriptor

    Once the server is up and running and you are able to authenticate with HTTP BASIC against the test LDAP server, you can now change the topology descriptor to leverage your new federation provider.

Find the sandbox.xml file in the install/conf/topologies directory and edit it to reflect your provider type, name and any provider-specific parameters.

    <provider>
       <role>federation</role>
       <name>PseudoProvider</name>
       <enabled>true</enabled>
       <param>
           <name>filter-init-param-name</name>
           <value>value</value>
       </param>
    </provider>
    

    Once your federation provider is configured, just save the topology descriptor. Apache Knox will notice that the file has changed and it will automatically redeploy that particular topology. Any provider params described in the provider element will be added to the PseudoAuthFederationFilter as servlet filter init params. These can be used to configure aspects of the filter’s behavior.
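As a small illustration, the filter could read such an init param in its init method (the parameter name here is the hypothetical one from the descriptor above):

      @Override
      public void init(FilterConfig filterConfig) throws ServletException {
        super.init(filterConfig);
        // read a provider param that Knox exposed as a filter init param
        String value = filterConfig.getInitParameter("filter-init-param-name");
        // use the value to adjust the filter's behavior as needed
      }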

    curl again

    We are now ready to use curl again to test the new federation provider and ensure that it is working as expected:

    curl -ivk "https://localhost:8443/gateway/sandbox/webhdfs/v1/tmp?op=LISTSTATUS&user.name=guest"

    Conclusion

    This blog illustrated a simplified example of implementing a federation provider that accepts the identity established by a previous authentication event and propagates it into the request processing for Hadoop REST APIs inside Apache Knox.

    Extending the pre-authenticated federation provider in this way is a quick and simple path to bringing certain SSO capabilities to Hadoop, providing authenticated access to Hadoop resources through Apache Knox.

    The Knox community is a growing community that welcomes contributions from users interested in extending Knox’s capabilities with useful features.

    NOTE: The provider illustrated in this example has limitations that preclude it from being used in production. Most notably, it has no way to preserve the user’s identity across redirects, because the user.name query parameter is not carried over into the Location header. We would need to add something like a cookie in order to determine the user identity on the redirected request.

    More Resources

    The post Adding a Federation Provider to Apache Knox appeared first on Hortonworks.

    Announcing Apache Ranger 0.4.0


    With Apache Hadoop YARN as its architectural center, Apache Hadoop continues to attract new engines to run within the data platform, as organizations want to efficiently store their data in a single repository and interact with it in different ways. As YARN propels Hadoop’s emergence as a business-critical data platform, the enterprise requires more stringent data security capabilities. Apache Ranger provides many of these, with central security policy administration across authorization, accounting and data protection.

    On November 17, the community announced the release of Apache Ranger 0.4.0. With this release, the team closed 163 JIRA tickets. Ranger 0.4.0 delivers many new features, fixes and enhancements, chief among those are:

    • Contribution of the technology behind XA Secure to the open source Apache Ranger project
    • Support for Apache Storm and Apache Knox
    • REST APIs for the policy manager
    • Support for storing audit events in HDFS

    This blog gives a brief overview of the features in Apache Ranger 0.4.0 and also looks ahead to future plans.


    First Release of open source Apache Ranger

    In May of this year, Hortonworks acquired XA Secure to accelerate the delivery of a holistic, centralized and completely open-source approach to Hadoop security. Hortonworks took the proprietary XA Secure technology and contributed it to the Apache Software Foundation. This investment highlights Hortonworks’ consistent and unwavering commitment to 100% open enterprise Hadoop. XA Secure was one of the first solutions to provide centralized security administration for Hadoop. The Apache Ranger community began with the code contributed by Hortonworks and added other features as part of this release.

    The first release of Apache Ranger is an important milestone in the evolution of Hadoop into a mature enterprise-ready platform. Enterprise users can now securely store all types of data and run multiple workloads with different users, leveraging Ranger’s centralized security administration with fine-grain authorization and on-demand audit data. The community can now innovate to further deliver advanced security capabilities, in a way only possible with an open source platform.

    Support for Apache Storm and Apache Knox

    Apache Ranger now supports administration of access policies for Apache Knox and Apache Storm, extending the Ranger policy administration portal beyond previous support for HDFS, Apache HBase and Apache Hive. Now users can also view audit information for both Storm and Knox in the Ranger portal.

    REST APIs for the Policy Manager

    Enterprise security administrators can now use REST APIs to create, update and delete security policies. This allows enterprise users and partners to integrate Hadoop security into their existing entitlement stores and update policies using their own tools. REST APIs open the door for extended adoption of Ranger within the ecosystem.
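
    For example, a new policy could be pushed to the policy manager with a single HTTP call. The sketch below is illustrative only: the endpoint path, port, credentials and JSON fields are assumptions about the public REST API rather than an exact reference, so consult the Apache Ranger REST API documentation for the precise contract.

    curl -u admin:admin -X POST -H "Content-Type: application/json" \
         -d '{
               "policyName": "hr-data-read",
               "repositoryName": "hadoopdev",
               "repositoryType": "hdfs",
               "resourceName": "/apps/hr",
               "isEnabled": true,
               "permMapList": [ { "groupList": ["hr"], "permList": ["Read"] } ]
             }' \
         http://ranger-admin.example.com:6080/service/public/api/policy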

    Audit logs stored in HDFS

    Lower latency and faster transaction speeds within Hadoop mean an increase in the volume of audit events. To meet this growing need, Apache Ranger now offers the flexibility to store audit logs in HDFS, leveraging Hadoop’s reliable and scalable infrastructure to store and process the underlying audit events. Ranger stores the audit logs in a secure location, accessible only to privileged users.

    Plans for the Future

    The release would not have been possible without excellent contributions from the dedicated, talented community members. The community plans continued execution on the vision of providing comprehensive security within the Hadoop ecosystem, with the plan to extend support to Apache Solr, Kafka, and Spark. We also intend to streamline other areas of security, including authentication and encryption. In the coming weeks, we plan to publish a detailed roadmap on the Ranger wiki or through Apache JIRAs.

    Download Apache Ranger, Take the Tutorial and Watch the Webinar

    The post Announcing Apache Ranger 0.4.0 appeared first on Hortonworks.

    Further Accelerating the Adoption of Enterprise Hadoop


    Hortonworks introduces HDP Operations Ready, HDP Security Ready and HDP Governance Ready certifications to showcase solutions that deeply integrate with enterprise Hadoop.

    Customer adoption of Apache Hadoop continues to accelerate the pace at which the community works to meet the requirements of Enterprise Hadoop. Once the preserve of HDFS and MapReduce alone, Hadoop was opened up by the introduction of Apache Hadoop YARN a little over a year ago, unleashing many new ways to get value from a cluster. The expansion of YARN-enabled systems has demonstrated the need for enterprise-required functionality for security, governance and operations. Without it, enterprises will be less able to make Hadoop part of their modern data architectures.

    When we rolled out the HDP YARN Ready program in June, we were responding to the need expressed by customers and software vendors alike, to ensure technologies that integrate with Hadoop do so by leveraging existing cluster resources in a cooperative way. By any measure the program has been a success, with more than 100 vendors certifying over 70 HDP YARN Ready products to date, and many more coming.

    Because of the breadth of an enterprise Hadoop platform, this growth in HDP YARN Ready applications brought additional questions from customers: How do these technologies integrate with our operations procedures? With our security controls? With the mechanisms we’ve set up for orchestrating data movement?

    While customers are looking for better-integrated technologies, our partners are looking for ways to showcase the work they’ve done to more tightly integrate with Hadoop. As a result, in addition to HDP YARN Ready, solutions can now be tested and verified for:

    • HDP Operations Ready: Delivers assurance that applications can be managed and run on HDP from an operational perspective. Specifically, it integrates with Apache Ambari, whether by using Ambari as a client to an enterprise management system, integrating Ambari-managed Hadoop components via Ambari Stacks, or providing tailored user tools with Ambari Views.
    • HDP Security Ready: Delivers tested and validated integration with the security-related components of the platform. Beyond the ability to work in a Kerberos-enabled cluster, it also works with the Apache Knox gateway and with Apache Ranger for comprehensive security integration and centralized management.
    • HDP Governance Ready: Provides assurance that data is integrated into the platform via automated and managed data pipelines, as described and facilitated by the Apache Falcon data workflow engine.

    These extensions of the Hortonworks Certified Technology Program provide three additional certifications under the umbrella of HDP Certified. This is the most comprehensive certification program for Hadoop platforms available today, and it delivers the level of assurance enterprises need as they move to a modern data architecture.

    HDP Ready Programs

    We are excited about these new certifications and what they represent: a vibrant ecosystem of Hadoop integrations built for the enterprise.

    Additional resources for customers:

    Additional resources for partners or potential partners:

    The post Further Accelerating the Adoption of Enterprise Hadoop appeared first on Hortonworks.

    Apache Ranger Audit Framework


    Introduction

    Apache Ranger provides centralized security for the Enterprise Hadoop ecosystem, including fine-grained access control and a centralized audit mechanism, both essential for Enterprise Hadoop. This blog covers the audit framework options available with Apache Ranger 0.4.0 in HDP 2.2 and how they can be configured.

    The audit framework can be configured to send access audit logs generated by Apache Ranger plug-ins to one or more of the following destinations:

    • RDBMS: MySQL or Oracle
    • HDFS
    • Log4j appender

    Logging to RDBMS

    The Ranger Audit framework supports saving audit logs to an RDBMS. Currently, MySQL and Oracle are the supported databases, with others such as Postgres on the roadmap. Interactive audit reporting in the Ranger Administration Portal reads the audit logs from the RDBMS.

    The database schema for audit logging is generated during the installation of Ranger Admin. Before running setup.sh to set up Ranger Admin, specify the audit database details in the install.properties file, as shown in the example below. (Please refer to the Ranger Admin installation documentation for details of install.properties and setup.sh.)

    DB_FLAVOR=MYSQL
    SQL_COMMAND_INVOKER=mysql
    db_root_password=secretPa5$
    db_host=mysqldb.example.com
    audit_db_name=ranger_audit
    audit_db_user=ranger_audit
    audit_db_password=secretPa5$
    

    During setup of each Ranger plug-in (HDFS/Hive/HBase/Knox/Storm), i.e. before running enable-*-plugin.sh, specify the audit database details in install.properties, as shown in the example below. (Please refer to the Ranger plug-in installation documentation for details of install.properties and enable-*-plugin.sh.) Be sure to provide the same database details used during Ranger Admin setup.

    XAAUDIT.DB.IS_ENABLED=true
    XAAUDIT.DB.FLAVOUR=MYSQL
    XAAUDIT.DB.HOSTNAME=mysqldb.example.com
    XAAUDIT.DB.DATABASE_NAME=ranger_audit
    XAAUDIT.DB.USER_NAME=ranger_audit
    XAAUDIT.DB.PASSWORD=secretPa5$
    

    Audit logging to the RDBMS can be configured to be synchronous or asynchronous. In synchronous mode, the call to audit blocks the thread until the log is committed to the database. In asynchronous mode, the call to audit returns quickly after adding the audit log to an in-memory queue; another thread in the audit framework reads from this queue and saves to the RDBMS. In asynchronous mode, a single database commit can include a number of audit logs (a batch commit), which can result in significant performance improvements. If the in-memory queue is full, the audit log will be dropped; periodic log messages are written to the component log file with the count of dropped audit logs.

    The default mode for audit logging to the RDBMS is asynchronous. To alter the logging mode and other configurations, like the size of the in-memory queue, update xasecure-audit.xml in the CLASSPATH, which is typically in the component’s configuration directory, for example /etc/hadoop/conf/xasecure-audit.xml. For any configuration changes to take effect, restart the component. A list of available configurations is provided in the Configuration section below.
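
    As a quick illustration, the asynchronous behavior can be tuned with entries along the following lines in xasecure-audit.xml. This is only a sketch: it assumes the standard Hadoop-style configuration XML layout for the file, the property names are taken from the Configuration table at the end of this post, and the values shown simply repeat the defaults.

    <configuration>
        <!-- Enable audit logging to the RDBMS -->
        <property>
            <name>xasecure.audit.db.is.enabled</name>
            <value>true</value>
        </property>
        <!-- Send audit logs to the DB asynchronously (batch commits) -->
        <property>
            <name>xasecure.audit.db.is.async</name>
            <value>true</value>
        </property>
        <!-- Size of the in-memory queue; entries are dropped when it is full -->
        <property>
            <name>xasecure.audit.db.async.max.queue.size</name>
            <value>10240</value>
        </property>
        <!-- Maximum interval between commits to the database, in milliseconds -->
        <property>
            <name>xasecure.audit.db.async.max.flush.interval.ms</name>
            <value>5000</value>
        </property>
    </configuration>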

    To handle a higher rate and volume of audit logs in your environment, we suggest you plan for appropriate database sizing, partitioning, and an automated way of purging old logs.

    Logging to HDFS

    The Ranger Audit framework can be configured to store audit logs in HDFS, in JSON format (example below). Audit logs in HDFS can later be processed by other applications, such as Apache Hive, to query and report on them. Please note that the audit reporting functionality in the Ranger Administration Portal currently uses only the audit logs stored in the RDBMS.

    A sample Apache HBase access audit log in JSON format:

    {
       "resource":"tbl_xyz",
       "resType":"table",
       "reqUser":"user1",
       "evtTime":"2014-11-25 22:40:33.946",
       "access":"createTable",
       "result":1,
       "enforcer":"xasecure-acl",
       "repoType":2,
       "repo":"hbasedev",
       "cliIP":"172.18.145.43",
       "action":"createTable",
       "agentHost":"host1",
       "logType":"RangerAudit",
       "id":"eb45f6e8-6737-4174-92f6-45a9beabf5e7"
    }
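
    If the HDFS audit files contain one JSON record per line (the sample above is pretty-printed for readability), they can be queried from Apache Hive without a custom SerDe by treating each line as a string and extracting fields with get_json_object. A minimal sketch, with an illustrative table name and path:

    -- External table over a Ranger HDFS audit directory (path is an example)
    CREATE EXTERNAL TABLE ranger_audit_raw (json_line STRING)
    LOCATION 'hdfs://namenode.example.com:8020/ranger/audit/hbaseRegional/20141125';

    -- Pull individual fields out of each JSON audit record
    SELECT get_json_object(json_line, '$.reqUser')  AS req_user,
           get_json_object(json_line, '$.access')   AS access_type,
           get_json_object(json_line, '$.resource') AS resource,
           get_json_object(json_line, '$.result')   AS result
    FROM ranger_audit_raw;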
    

    During setup of each Ranger plug-in (HDFS/Hive/HBase/Knox/Storm), i.e. before running enable-*-plugin.sh, specify the HDFS audit log properties in install.properties, as shown in the example below. Be sure to create the necessary HDFS destination, local staging and local archive directories with read and write privileges for the plug-in’s user or owner.

    XAAUDIT.HDFS.IS_ENABLED=true
    XAAUDIT.HDFS.DESTINATION_DIRECTORY=hdfs://namenode.example.com:8020/ranger/audit/%app-type%/%time:yyyyMMdd%
    XAAUDIT.HDFS.LOCAL_BUFFER_DIRECTORY=/var/log/hadoop/%app-type%/audit
    XAAUDIT.HDFS.LOCAL_ARCHIVE_DIRECTORY=/var/log/hadoop/%app-type%/audit/archive
    

    More details on the tags supported in the file/directory name specifications are provided later in this section.

    To minimize the performance impact, the call to create an audit log writes the log to a staging file on the host where the component runs. The local staging file is rolled over periodically, every 10 minutes by default. After a rollover, another thread in the audit framework writes or appends the staged file contents to an HDFS file. Depending upon the rollover interval configuration of the HDFS and local staging files, multiple local staged files can be written to the same HDFS file.

    Saving audit logs to the local staging file can be either synchronous or asynchronous. In synchronous mode, the call to audit blocks the thread until the log is written to the staging file. By contrast, in asynchronous mode, the call to audit returns quickly after adding the audit log to an in-memory queue; a separate thread in the audit framework reads from this queue and writes to the local staging file. If the in-memory queue is full when an audit call is made, the audit log will not be recorded; to keep track of unrecorded audit logs, a count of dropped entries is periodically written to the component log.

    As with logging to the RDBMS, the default mode for audit logging to HDFS is asynchronous. The logging mode and other configurations, like the size of the in-memory queue, rollover period, etc., can be changed by updating xasecure-audit.xml in the CLASSPATH (typically in the component’s configuration directory, for example /etc/hadoop/conf/xasecure-audit.xml). For changes to take effect, a restart of the component is required. A list of available configurations is provided in the Configuration section below.
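
    As with the database settings, a brief sketch of the HDFS-related entries in xasecure-audit.xml is shown below. The property names are again taken from the Configuration table at the end of this post, and the values shown simply repeat the defaults:

    <!-- Enable audit logging to HDFS -->
    <property>
        <name>xasecure.audit.hdfs.is.enabled</name>
        <value>true</value>
    </property>
    <!-- Write to the local staging file asynchronously -->
    <property>
        <name>xasecure.audit.hdfs.is.async</name>
        <value>true</value>
    </property>
    <!-- Roll over the local staging file every 10 minutes -->
    <property>
        <name>xasecure.audit.hdfs.config.local.buffer.rollover.interval.seconds</name>
        <value>600</value>
    </property>
    <!-- Roll over the destination HDFS file once a day -->
    <property>
        <name>xasecure.audit.hdfs.config.destination.rollover.interval.seconds</name>
        <value>86400</value>
    </property>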

    To help organize the audit logs in the file system, the Ranger audit framework supports various tags in the file and directory names. At the time of file creation, the audit framework replaces these tags with appropriate values. Here are the details of the tags supported in file and directory names, with a short expansion example after the list:

    1. %hostname%

      Name of the current host in which the audit framework is executing.

    2. %time:date-format-specification%

      Current time formatted using the given specification. For more details on the supported format specification, please refer to Java SimpleDateFormat documentation.

    3. %jvm-instance%

      Unique identifier of the JVM instance in which the audit framework is executing – generated using Java VMID class.

    4. %property:system-property-name%

      Value of the given system property name in the JVM where audit framework is executing.

    5. %env:env-variable-name%

      Value of the given environment variable in the JVM where audit framework is executing.

    6. %app-type%

      Type of the application the audit framework runs in:
      hdfs, hiveServer2, hbaseMaster, hbaseRegional, knox, storm
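
    For instance, with the destination directory specification used in the earlier install.properties example, an audit file written by the HiveServer2 plug-in on 25 November 2014 would land under a path along these lines (the namenode host and date are illustrative):

    hdfs://namenode.example.com:8020/ranger/audit/hiveServer2/20141125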

    Logging using Log4j

    The Ranger Audit framework supports sending audit logs to log4j appender(s). Using this mechanism, you can send Ranger audit logs to any destination that has a log4j appender. To receive the audit logs in JSON format, the component’s log4j configuration should be updated to specify the appender(s) in the following property:

    • log4j.logger.xaaudit=
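
    For example, here is a minimal log4j.properties sketch that routes the xaaudit logger to a dedicated rolling file appender; the appender name and file path are illustrative assumptions:

    # Route Ranger audit events to a dedicated appender; keep them out of the root logger
    log4j.logger.xaaudit=INFO, rangerAudit
    log4j.additivity.xaaudit=false

    # Daily rolling file appender that receives the JSON audit records
    log4j.appender.rangerAudit=org.apache.log4j.DailyRollingFileAppender
    log4j.appender.rangerAudit.File=/var/log/hadoop/ranger-audit.log
    log4j.appender.rangerAudit.DatePattern='.'yyyy-MM-dd
    log4j.appender.rangerAudit.layout=org.apache.log4j.PatternLayout
    log4j.appender.rangerAudit.layout.ConversionPattern=%m%n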

    Configuration

    The Ranger audit framework reads its configuration from xasecure-audit.xml in the CLASSPATH, typically in the conf directory of the Hadoop component in which the Ranger plug-in runs. This file is populated with values provided by the user during Ranger plug-in installation, and it supports additional configuration options beyond the ones available during installation. The supported configurations, along with their default values and notes, are listed below. Please note that for changes to this file to become effective, the component needs to be restarted.

    • xasecure.audit.is.enabled (default: true): Enables or disables audit logging in the Ranger plug-in (true – enable, false – disable).
    • xasecure.audit.db.is.enabled (default: false): Enables or disables audit logging to the RDBMS (true – enable, false – disable).
    • xasecure.audit.db.is.async (default: false): true – send audit logs to the DB asynchronously; false – send them synchronously.
    • xasecure.audit.db.async.max.queue.size (default: 10240): Maximum number of audit logs to keep in the in-memory queue. Audit logs created while the queue is full are dropped.
    • xasecure.audit.db.async.max.flush.interval.ms (default: 5000): Maximum interval between commits to the database.
    • xasecure.audit.db.config.retry.min.interval.ms (default: 15000): Interval between attempts to connect to the database after a failure.
    • xasecure.audit.jpa.javax.persistence.jdbc.driver (default: none): JDBC driver used to connect to the DB. Examples: MySQL – net.sf.log4jdbc.DriverSpy; Oracle – oracle.jdbc.OracleDriver.
    • xasecure.audit.jpa.javax.persistence.jdbc.url (default: none): JDBC URL used to connect to the DB.
    • xasecure.audit.jpa.javax.persistence.jdbc.password (default: none): Password used to connect to the DB.
    • xasecure.audit.hdfs.is.enabled (default: false): Enables or disables audit logging to HDFS (true – enable, false – disable).
    • xasecure.audit.hdfs.is.async (default: false): true – send audit logs to HDFS asynchronously; false – send them synchronously.
    • xasecure.audit.hdfs.async.max.queue.size (default: 10240): Maximum number of audit logs to keep in the in-memory queue. Audit logs created while the queue is full are dropped.
    • xasecure.audit.hdfs.config.destination.directroy (default: none): Absolute path to the HDFS directory in which audit logs should be stored. Supports the file/directory name tags described above.
    • xasecure.audit.hdfs.config.destination.file (default: none): Name of the HDFS file to which audit logs should be written. Supports the file/directory name tags described above.
    • xasecure.audit.hdfs.config.destination.flush.interval.seconds (default: 900, i.e. 15 minutes): Interval between calls to hflush on the destination HDFS file.
    • xasecure.audit.hdfs.config.destination.rollover.interval.seconds (default: 86400, i.e. 1 day): Interval between rollovers of the destination HDFS file.
    • xasecure.audit.hdfs.config.destination.open.retry.interval.seconds (default: 60, i.e. 1 minute): Interval between calls to flush audit logs written to the staging file.
    • xasecure.audit.hdfs.config.local.buffer.rollover.interval.seconds (default: 600, i.e. 10 minutes): Interval between rollovers of the staging file.
    • Local archive directory (default: none): Absolute path to the local directory in which audit log files are stored after being sent to HDFS. Supports the file/directory name tags described above.
    • xasecure.audit.hdfs.config.local.archive.max.file.count (default: none): Maximum number of files to keep in the archive directory.
    • xasecure.audit.log4j.is.enabled (default: false): Enables or disables audit logging to log4j (true – enable, false – disable).
    • xasecure.audit.log4j.is.async (default: false): true – send audit logs to log4j asynchronously; false – send them synchronously.
    • xasecure.audit.log4j.async.max.queue.size (default: 10240): Maximum number of audit logs to keep in the in-memory queue. Audit logs created while the queue is full are dropped.

    The post Apache Ranger Audit Framework appeared first on Hortonworks.
