<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Kevin Risden</title>
    <description>The latest articles on DEV Community by Kevin Risden (@risdenk).</description>
    <link>https://dev.to/risdenk</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F63424%2Fab7ba335-8536-4ffc-a378-a9a0889a61d2.jpeg</url>
      <title>DEV Community: Kevin Risden</title>
      <link>https://dev.to/risdenk</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/risdenk"/>
    <language>en</language>
    <item>
      <title>My Development Environment 2018</title>
      <dc:creator>Kevin Risden</dc:creator>
      <pubDate>Thu, 06 Dec 2018 14:00:00 +0000</pubDate>
      <link>https://dev.to/risdenk/my-development-environment-2018-2b5p</link>
      <guid>https://dev.to/risdenk/my-development-environment-2018-2b5p</guid>
      <description>&lt;h3&gt;
  
  
  Overview
&lt;/h3&gt;

&lt;p&gt;I was asked the other day what my development environment looks like, since I am able to test a lot of different configurations quickly. I am writing this post to capture some of what I do to iterate quickly. First, some background on why it has historically been important for me to be able to switch test environments quickly.&lt;/p&gt;

&lt;h4&gt;
  
  
  Background
&lt;/h4&gt;

&lt;p&gt;I previously worked as a software consultant with &lt;a href="https://www.avalonconsult.com/"&gt;Avalon Consulting, LLC&lt;/a&gt;. We worked on a variety of projects for a number of different clients. Some of the projects were long and others were shorter. I focused primarily on big data and search. &lt;a href="https://hadoop.apache.org/"&gt;Apache Hadoop&lt;/a&gt; with security has a lot of different configurations. It wasn’t practical to spin up cloud environments (hotel wifi sucks) for each little test. This meant I needed to find a way to test things on my 8GB Macbook Pro.&lt;/p&gt;

&lt;h3&gt;
  
  
  Development Laptop
&lt;/h3&gt;

&lt;p&gt;I currently have two laptops for development: a 2012 Macbook Pro with 8GB of RAM that is starting to show its age but was worth every penny, and a work laptop that I won’t go into in much detail. Both laptops are configured very similarly. Key software includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.iterm2.com/"&gt;iTerm2&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://brew.sh/"&gt;Homebrew&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.zsh.org/"&gt;Zsh&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/robbyrussell/oh-my-zsh"&gt;oh-my-zsh&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://git-scm.com/"&gt;git&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.docker.com/docker-for-mac/"&gt;Docker for Mac&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.virtualbox.org/"&gt;VirtualBox&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.vagrantup.com/"&gt;Vagrant&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.jetbrains.com/idea/"&gt;IntelliJ IDEA Ultimate&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.google.com/chrome/"&gt;Chrome&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I use my terminal quite a bit. I use it for git, ssh, docker, vagrant, etc. I typically leave my terminal up at all times since I am usually running something. I jump between Docker and Vagrant/Virtualbox quite a bit. There are lots of security and distributed computing setups where proper hostnames and DNS resolution work better with full virtual machines. There are fewer gotchas if you know you are working with “real” machines instead of fighting with Docker networking and DNS.&lt;/p&gt;
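&lt;p&gt;As an illustration of the full-VM approach, here is a minimal multi-machine Vagrantfile sketch (the box name, hostnames, and IPs are assumptions for illustration, not my exact setup) where each VM gets a real hostname and a private-network address:&lt;/p&gt;

```ruby
# Hypothetical multi-VM sketch: each machine gets a real hostname and a
# private-network IP, so DNS/hostname-sensitive software behaves normally.
Vagrant.configure("2") do |config|
  config.vm.box = "centos/7"

  { "master"  => "192.168.56.10",
    "worker1" => "192.168.56.11" }.each do |name, ip|
    config.vm.define name do |node|
      node.vm.hostname = "#{name}.cluster.test"
      node.vm.network "private_network", ip: ip
      node.vm.provider "virtualbox" do |vb|
        vb.memory = 1024 # keep each VM small so several fit in 8GB of RAM
      end
    end
  end
end
```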

&lt;p&gt;I owe a big shoutout to &lt;a href="https://travis-ci.com/"&gt;Travis CI&lt;/a&gt; since I use them a lot for my open source projects. I typically push a git branch to my GitHub fork and let Travis CI go to work. This allows me to work on multiple things at once when tests take tens of minutes.&lt;/p&gt;
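&lt;p&gt;The branch-and-push workflow needs almost no CI configuration; a hedged sketch of a minimal &lt;code&gt;.travis.yml&lt;/code&gt; for a JVM project (language, JDK, and build command are assumptions) looks like this:&lt;/p&gt;

```yaml
# Hypothetical minimal Travis CI configuration for a Maven-based project
language: java
jdk:
  - openjdk8
script:
  - mvn -B verify  # build and run the test suite for every pushed branch
```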

&lt;h3&gt;
  
  
  Intel NUC Server
&lt;/h3&gt;

&lt;p&gt;I recently added an &lt;a href="https://simplynuc.com/8i5beh-kit/"&gt;Intel NUC&lt;/a&gt; to my development setup to help offload some of the long running tests from my laptop. It also has more RAM and CPU power, which allows me to run continuous integration jobs as well as more Vagrant VMs. Some of the software I have running on my Intel NUC (mostly as Docker containers):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://en.wikipedia.org/wiki/Dnsmasq"&gt;Dnsmasq&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://jenkins.io/"&gt;Jenkins&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.sonatype.com/download-oss-sonatype"&gt;Nexus 3&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://gogs.io/"&gt;Gogs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.sonarqube.org/"&gt;Sonarqube&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Dnsmasq ensures that I get consistent DNS both on my Intel NUC and within my private network. Jenkins runs most of my continuous integration builds. It keeps track of logs and lets me spin up jobs for different purposes (like repeatedly testing a feature branch). Jenkins spins up a separate Docker container for each build so I don’t have to worry about dependency conflicts. Nexus lets me cache Maven repositories, Docker images, static files, and more, so I don’t have to redownload the same dependencies over and over again. Gogs is a standalone Git server that painlessly lets me mirror repos internally, which saves me from pulling big repos over the internet repeatedly. Sonarqube runs additional static analysis checks against some of the Jenkins builds.&lt;/p&gt;
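&lt;p&gt;Most of these services are a single container each. A hedged &lt;code&gt;docker-compose.yml&lt;/code&gt; sketch for two of them (image tags, ports, and volumes are assumptions, not my exact setup):&lt;/p&gt;

```yaml
# Hypothetical compose file for two of the services listed above
version: "3"
services:
  jenkins:
    image: jenkins/jenkins:lts
    ports:
      - "8080:8080"
    volumes:
      - jenkins_home:/var/jenkins_home
  nexus:
    image: sonatype/nexus3
    ports:
      - "8081:8081"
    volumes:
      - nexus_data:/nexus-data
volumes:
  jenkins_home:
  nexus_data:
```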

&lt;h3&gt;
  
  
  Yubikey
&lt;/h3&gt;

&lt;p&gt;I want to talk a little bit about my use of a Yubikey. I had been thinking about getting one for a few years and finally got one when Yubikey 5 came out. I use it all the time now for GPG and SSH. I don’t have to store any private keys on new devices and can even SSH from a Chromebook back to my server if necessary. I configured my Yubikey to handle GPG for both signing and authentication, which allows me to use GPG with SSH as well. The GPG agent takes a little configuring, but once set up you can easily use it for both GPG and SSH. I wish more websites supported U2F instead of OATH/authenticator codes. I like the simplicity and would recommend a Yubikey to most developers.&lt;/p&gt;
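&lt;p&gt;For reference, the gpg-agent side of that setup amounts to only a few lines. This is a hedged sketch (the pinentry path varies by install and is an assumption):&lt;/p&gt;

```
# ~/.gnupg/gpg-agent.conf
enable-ssh-support
pinentry-program /usr/local/bin/pinentry-mac

# In the shell profile, point SSH at the gpg-agent socket:
export SSH_AUTH_SOCK="$(gpgconf --list-dirs agent-ssh-socket)"
gpgconf --launch gpg-agent
```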

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;My setup hasn’t changed too much over the past 5 years when it comes to development laptops. I have started to use more cloud-based automated testing like Travis CI. I added the Intel NUC to be able to do more testing internally across bigger VMs. I will say that I have learned more trying to fit a distributed system on an 8GB RAM laptop than from anything else. (Who else can say they have run Hadoop on 3 Linux VMs and 1 Windows AD VM in 8GB of RAM?) Who knows what is to come in 2019, but I am happy and productive with what I have in 2018.&lt;/p&gt;

</description>
      <category>development</category>
      <category>environment</category>
      <category>2018</category>
    </item>
    <item>
      <title>Apache Hadoop YARN - “Vulnerability” FUD</title>
      <dc:creator>Kevin Risden</dc:creator>
      <pubDate>Tue, 04 Dec 2018 14:00:00 +0000</pubDate>
      <link>https://dev.to/risdenk/apache-hadoop-yarn---vulnerability-fud-936</link>
      <guid>https://dev.to/risdenk/apache-hadoop-yarn---vulnerability-fud-936</guid>
      <description>&lt;h3&gt;
  
  
  Overview
&lt;/h3&gt;

&lt;p&gt;There are reports of an &lt;a href="http://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/YARN.html"&gt;Apache Hadoop YARN&lt;/a&gt; “vulnerability”, but I want to share some details that have been missed by the few articles I’ve come across. Here are a few of the articles/links:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.extrahop.com/company/blog/2018/detect-demonbot-exploiting-hadoop-yarn-remote-code-execution/"&gt;https://www.extrahop.com/company/blog/2018/detect-demonbot-exploiting-hadoop-yarn-remote-code-execution/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/vulhub/vulhub/blob/master/hadoop/unauthorized-yarn/exploit.py"&gt;https://github.com/vulhub/vulhub/blob/master/hadoop/unauthorized-yarn/exploit.py&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Demonbot vulnerability requires an &lt;strong&gt;unsecured&lt;/strong&gt; cluster
&lt;/h3&gt;

&lt;p&gt;The key point I want to make is that the reports mislead the reader into assuming that all Apache Hadoop YARN environments are insecure. This is &lt;strong&gt;false&lt;/strong&gt;. The clusters described have no security enabled and are akin to a house with the front door unlocked. Kerberized clusters are secure since they require a valid user account to be usable. Furthermore, clusters should not be exposed to the internet for most use cases (especially not endpoints that allow remote job submission).&lt;/p&gt;
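&lt;p&gt;For context, “Kerberized” starts with a couple of core-site.xml properties. This is a minimal sketch; a real deployment also needs keytabs, principals, and per-service configuration:&lt;/p&gt;

```
&lt;!-- core-site.xml (minimal sketch) --&gt;
&lt;property&gt;
  &lt;name&gt;hadoop.security.authentication&lt;/name&gt;
  &lt;value&gt;kerberos&lt;/value&gt;
&lt;/property&gt;
&lt;property&gt;
  &lt;name&gt;hadoop.security.authorization&lt;/name&gt;
  &lt;value&gt;true&lt;/value&gt;
&lt;/property&gt;
```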

&lt;h3&gt;
  
  
  Explain the “vulnerability” like I’m five
&lt;/h3&gt;

&lt;p&gt;Imagine that one day you get home and find a whole bunch of extra lamps plugged into your outlets. You are annoyed because the lamps are using your electricity. You remember that you forgot to lock your door before you went on vacation. Instead of someone stealing stuff from your home, they decided to plug in lamps.&lt;/p&gt;

&lt;p&gt;Now you might be thinking that it is expected that something bad would happen if you left your door unlocked while on vacation. This is exactly the same situation as an unsecured Apache Hadoop YARN cluster. No one should leave their cluster unsecured and exposed to the outside world.&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;There have been multiple reports of “big data” endpoints being exposed to the internet without being secured. This has affected Elasticsearch, MongoDB, and others. There is no reason to expose a cluster to the internet without security. Cloudera wrote a blog post that covers the same topic &lt;a href="https://blog.cloudera.com/blog/2018/11/protecting-hadoop-clusters-from-malware-attacks/"&gt;here&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>bigdata</category>
      <category>apache</category>
      <category>hadoop</category>
      <category>yarn</category>
    </item>
    <item>
      <title>Apache Solr - Hide/Redact Sensitive Properties</title>
      <dc:creator>Kevin Risden</dc:creator>
      <pubDate>Tue, 27 Nov 2018 14:00:00 +0000</pubDate>
      <link>https://dev.to/risdenk/apache-solr---hideredact-senstive-properties-56ca</link>
      <guid>https://dev.to/risdenk/apache-solr---hideredact-senstive-properties-56ca</guid>
      <description>&lt;h3&gt;
  
  
  Overview
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://lucene.apache.org/solr" rel="noopener noreferrer"&gt;Apache Solr&lt;/a&gt; is a full text search engine that is built on &lt;a href="https://lucene.apache.org/solr/" rel="noopener noreferrer"&gt;Apache Lucene&lt;/a&gt;. One of the common questions on the &lt;a href="http://lucene.apache.org/solr/community.html#mailing-lists-irc" rel="noopener noreferrer"&gt;solr-user&lt;/a&gt; mailing list (ie: &lt;a href="http://lucene.472066.n3.nabble.com/Disabling-jvm-properties-from-ui-td4413066.html" rel="noopener noreferrer"&gt;here&lt;/a&gt; and &lt;a href="http://lucene.472066.n3.nabble.com/jira-Commented-SOLR-11369-Zookeeper-credentials-are-showed-up-on-the-Solr-Admin-GUI-td4405383.html" rel="noopener noreferrer"&gt;here&lt;/a&gt;) is how to hide sensitive values from the &lt;a href="https://lucene.apache.org/solr/guide/7_5/overview-of-the-solr-admin-ui.html" rel="noopener noreferrer"&gt;Solr UI&lt;/a&gt;. There is a little known setting that enables hiding these sensitive values.&lt;/p&gt;

&lt;h3&gt;
  
  
  Apache Solr and Hiding Sensitive Properties
&lt;/h3&gt;

&lt;p&gt;Apache Solr has a few places where sensitive values can be seen on the Solr UI. The keystore and truststore passwords are two examples that came up as part of &lt;a href="https://issues.apache.org/jira/browse/SOLR-10076" rel="noopener noreferrer"&gt;SOLR-10076&lt;/a&gt;. Starting in Solr 6.6 and 7.0, Solr will hide any property in the &lt;code&gt;/admin/info/system&lt;/code&gt; API that contains the word &lt;code&gt;password&lt;/code&gt; when the system property &lt;code&gt;solr.redaction.system.enabled&lt;/code&gt; is set to true. The &lt;code&gt;/admin/info/system&lt;/code&gt; API is used to power the Solr UI. This works well for most cases, but the implementation is more generic, enabling it to hide any custom property as well.&lt;/p&gt;

&lt;p&gt;The property &lt;code&gt;solr.redaction.system.pattern&lt;/code&gt; is a system property that takes a regular expression. If the regular expression matches the property name then the system property value will be redacted. This can enable hiding sensitive values for custom libraries or other use cases.&lt;/p&gt;

&lt;p&gt;The table below lays out the two properties that can be configured in Solr 6.6 or later.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Property&lt;/th&gt;
&lt;th&gt;Default Value&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;solr.redaction.system.enabled&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;false&lt;/code&gt; in Solr 6.6; &lt;code&gt;true&lt;/code&gt; in Solr 7.0&lt;/td&gt;
&lt;td&gt;Enables or disables the redaction&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;solr.redaction.system.pattern&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;.*password.*&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Regex for the properties to redact&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
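&lt;p&gt;As a hypothetical example, both properties can be passed as system properties at startup (the &lt;code&gt;bin/solr&lt;/code&gt; flags follow the table above, but the exact invocation is an assumption). Since the pattern is an ordinary regex matched against property names, its effect can be sanity-checked with grep:&lt;/p&gt;

```shell
# Hypothetical: start Solr with redaction enabled and a widened pattern
# bin/solr start -Dsolr.redaction.system.enabled=true \
#                -Dsolr.redaction.system.pattern='.*(password|secret).*'

# The widened pattern matches property names containing "password" or
# "secret"; grep over some sample names shows which would be redacted:
printf 'javax.net.ssl.keyStorePassword\nsolr.log.dir\nzk.acl.secret\n' |
  grep -iE '.*(password|secret).*'
```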

&lt;h3&gt;
  
  
  Apache Solr and Hiding Metrics Properties
&lt;/h3&gt;

&lt;p&gt;The &lt;a href="https://lucene.apache.org/solr/guide/7_5/metrics-reporting.html" rel="noopener noreferrer"&gt;Solr Metrics API&lt;/a&gt; can leak sensitive information as well. There is a &lt;a href="https://lucene.apache.org/solr/guide/7_5/metrics-reporting.html#the-metrics-hiddensysprops-element" rel="noopener noreferrer"&gt;&lt;code&gt;hiddenSysProps&lt;/code&gt; configuration&lt;/a&gt; that can prevent certain properties from being exposed via the metrics API. If additional properties need to be hidden, they need to be added to the &lt;code&gt;hiddenSysProps&lt;/code&gt; section.&lt;/p&gt;
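&lt;p&gt;A hedged sketch of that configuration in &lt;code&gt;solr.xml&lt;/code&gt; (the element names follow the metrics-reporting documentation linked above; the specific properties listed are illustrative assumptions):&lt;/p&gt;

```
&lt;metrics&gt;
  &lt;hiddenSysProps&gt;
    &lt;str&gt;javax.net.ssl.keyStorePassword&lt;/str&gt;
    &lt;str&gt;javax.net.ssl.trustStorePassword&lt;/str&gt;
    &lt;str&gt;basicauth&lt;/str&gt;
  &lt;/hiddenSysProps&gt;
&lt;/metrics&gt;
```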

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;Currently, there is limited documentation about the available options for hiding sensitive values. It is frustrating to have to configure hiding sensitive values in two places, but there is hope for improvement. &lt;a href="https://issues.apache.org/jira/browse/SOLR-12976" rel="noopener noreferrer"&gt;SOLR-12976&lt;/a&gt; was created earlier this month to try to address the duplication and documentation.&lt;/p&gt;

</description>
      <category>bigdata</category>
      <category>apache</category>
      <category>solr</category>
      <category>security</category>
    </item>
    <item>
      <title>Apache Solr - Hadoop Authentication Plugin - LDAP</title>
      <dc:creator>Kevin Risden</dc:creator>
      <pubDate>Tue, 20 Nov 2018 14:00:00 +0000</pubDate>
      <link>https://dev.to/risdenk/apache-solr---hadoop-authentication-plugin---ldap-2fa3</link>
      <guid>https://dev.to/risdenk/apache-solr---hadoop-authentication-plugin---ldap-2fa3</guid>
      <description>&lt;h3&gt;
  
  
  Overview
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://lucene.apache.org/solr" rel="noopener noreferrer"&gt;Apache Solr&lt;/a&gt; is a full text search engine that is built on &lt;a href="https://lucene.apache.org/solr/" rel="noopener noreferrer"&gt;Apache Lucene&lt;/a&gt;. One of the questions I’ve been asked about in the past is LDAP support for Apache Solr authentication. While there are commercial offerings that add LDAP support, like &lt;a href="https://lucidworks.com/products/fusion-server/" rel="noopener noreferrer"&gt;Lucidworks Fusion&lt;/a&gt;, Apache Solr doesn’t have an LDAP authentication plugin out of the box. Let’s explore the current state of authentication in Apache Solr.&lt;/p&gt;

&lt;h3&gt;
  
  
  Apache Solr and Authentication
&lt;/h3&gt;

&lt;p&gt;Apache Solr 5.2 released with a pluggable authentication module from &lt;a href="https://issues.apache.org/jira/browse/SOLR-7274" rel="noopener noreferrer"&gt;SOLR-7274&lt;/a&gt;. This paved the way for future authentication implementations such as &lt;code&gt;BasicAuth&lt;/code&gt; (&lt;a href="https://issues.apache.org/jira/browse/SOLR-7692" rel="noopener noreferrer"&gt;SOLR-7692&lt;/a&gt;) and Kerberos (&lt;a href="https://issues.apache.org/jira/browse/SOLR-7468" rel="noopener noreferrer"&gt;SOLR-7468&lt;/a&gt;). In Apache Solr 6.1, delegation token support (&lt;a href="https://issues.apache.org/jira/browse/SOLR-9200" rel="noopener noreferrer"&gt;SOLR-9200&lt;/a&gt;) was added to the Kerberos authentication plugin. Apache Solr 6.4 added a significant feature for hooking the &lt;a href="https://hadoop.apache.org/docs/current/hadoop-auth/Configuration.html" rel="noopener noreferrer"&gt;Hadoop authentication framework&lt;/a&gt; directly into Solr as an authentication plugin (&lt;a href="https://issues.apache.org/jira/browse/SOLR-9513" rel="noopener noreferrer"&gt;SOLR-9513&lt;/a&gt;). There hasn’t been much more work on authentication plugins lately, though work is currently underway on a JWT authentication plugin (&lt;a href="https://issues.apache.org/jira/browse/SOLR-12121" rel="noopener noreferrer"&gt;SOLR-12121&lt;/a&gt;). Each Solr authentication plugin provides additional capabilities for authenticating to Solr.&lt;/p&gt;
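&lt;p&gt;For comparison, the &lt;code&gt;BasicAuth&lt;/code&gt; plugin is configured through a &lt;code&gt;security.json&lt;/code&gt; file along these lines (a sketch; the credentials value is a placeholder for the salted SHA-256 hash described in the reference guide, not a working hash):&lt;/p&gt;

```json
{
  "authentication": {
    "blockUnknown": true,
    "class": "solr.BasicAuthPlugin",
    "credentials": {
      "solr": "BASE64_SHA256_HASH BASE64_SALT"
    }
  }
}
```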

&lt;h3&gt;
  
  
  Hadoop Authentication, LDAP, and Apache Solr
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Hadoop Authentication Framework Overview
&lt;/h4&gt;

&lt;p&gt;The &lt;a href="https://hadoop.apache.org/docs/current/hadoop-auth/Configuration.html" rel="noopener noreferrer"&gt;Hadoop authentication framework&lt;/a&gt; provides additional capabilities through its pluggable backends, which currently include Kerberos, AltKerberos, LDAP, SignerSecretProvider, and multi-scheme. Each can be configured to support varying authentication needs.&lt;/p&gt;

&lt;h4&gt;
  
  
  Apache Solr and Hadoop Authentication Framework
&lt;/h4&gt;

&lt;p&gt;Apache Solr 6.4+ supports the Hadoop authentication framework due to the work of &lt;a href="https://issues.apache.org/jira/browse/SOLR-9513" rel="noopener noreferrer"&gt;SOLR-9513&lt;/a&gt;. The &lt;a href="https://lucene.apache.org/solr/guide/7_5/hadoop-authentication-plugin.html" rel="noopener noreferrer"&gt;Apache Solr reference guide&lt;/a&gt; provides guidance on how to use the Hadoop Authentication Plugin. All the necessary configuration parameters can be passed down to the Hadoop authentication framework. As more backends are added to the Hadoop authentication framework, Apache Solr just needs to upgrade the Hadoop dependency to gain support.&lt;/p&gt;

&lt;h4&gt;
  
  
  Apache Solr 7.5 and LDAP
&lt;/h4&gt;

&lt;p&gt;LDAP support for the Hadoop authentication framework was added in Hadoop 2.8.0 (&lt;a href="https://issues.apache.org/jira/browse/HADOOP-12082" rel="noopener noreferrer"&gt;HADOOP-12082&lt;/a&gt;). Sadly, the Hadoop dependency for Apache Solr 7.5 is only at &lt;a href="https://github.com/apache/lucene-solr/blob/branch_7_5/lucene/ivy-versions.properties#L156" rel="noopener noreferrer"&gt;2.7.4&lt;/a&gt;. This means that when you try to configure the &lt;code&gt;HadoopAuthenticationPlugin&lt;/code&gt; with LDAP, you will get the following error:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Error initializing org.apache.solr.security.HadoopAuthPlugin:
javax.servlet.ServletException: java.lang.ClassNotFoundException: ldap
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;h4&gt;
  
  
  Manually Upgrading the Apache Solr Hadoop Dependency
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; I don’t recommend doing this outside of experimenting and seeing what is possible.&lt;/p&gt;

&lt;p&gt;I put together a &lt;a href="https://github.com/risdenk/test-solr-hadoopauthenticationplugin-ldap" rel="noopener noreferrer"&gt;simple test project&lt;/a&gt; that “manually” replaces the Hadoop 2.7.4 jars with 2.9.1 jars. This was designed to test if it is possible to configure the Solr &lt;code&gt;HadoopAuthenticationPlugin&lt;/code&gt; with LDAP. I was able to configure Solr using the following &lt;code&gt;security.json&lt;/code&gt; file to use the Hadoop 2.9.1 LDAP backend.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
    "authentication": {
        "class": "solr.HadoopAuthPlugin",
        "sysPropPrefix": "solr.",
        "type": "ldap",
        "authConfigs": [
            "ldap.providerurl",
            "ldap.basedn",
            "ldap.enablestarttls"
        ],
        "defaultConfigs": {
            "ldap.providerurl": "ldap://ldap",
            "ldap.basedn": "dc=example,dc=org",
            "ldap.enablestarttls": "false"
        }
    }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;With this configuration and the Hadoop 2.9.1 jars, Apache Solr was protected by LDAP. More testing should be done to see how this works across multiple nodes and what other integration is required. The Hadoop authentication framework has limited support for LDAP, but it should be usable for some use cases.&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;Apache Solr, as of 7.5, has limited support for the Hadoop authentication framework due to its dependency on Apache Hadoop 2.7.4. When the Hadoop dependency is updated (&lt;a href="https://issues.apache.org/jira/browse/SOLR-9515" rel="noopener noreferrer"&gt;SOLR-9515&lt;/a&gt;) in Apache Solr, there will be at least some initial support for LDAP integration out of the box with Solr.&lt;/p&gt;

&lt;h4&gt;
  
  
  References
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://lucene.apache.org/solr/guide/7_5/securing-solr.html" rel="noopener noreferrer"&gt;https://lucene.apache.org/solr/guide/7_5/securing-solr.html&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://lucene.apache.org/solr/guide/7_5/hadoop-authentication-plugin.html" rel="noopener noreferrer"&gt;https://lucene.apache.org/solr/guide/7_5/hadoop-authentication-plugin.html&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://issues.apache.org/jira/browse/SOLR-9513" rel="noopener noreferrer"&gt;https://issues.apache.org/jira/browse/SOLR-9513&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://stackoverflow.com/questions/50647431/ldap-integration-with-solr" rel="noopener noreferrer"&gt;https://stackoverflow.com/questions/50647431/ldap-integration-with-solr&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://community.hortonworks.com/questions/130989/solr-ldap-integration.html" rel="noopener noreferrer"&gt;https://community.hortonworks.com/questions/130989/solr-ldap-integration.html&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/apache/lucene-solr/blob/branch_7_5/lucene/ivy-versions.properties#L156" rel="noopener noreferrer"&gt;https://github.com/apache/lucene-solr/blob/branch_7_5/lucene/ivy-versions.properties#L156&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://issues.apache.org/jira/browse/HADOOP-12082" rel="noopener noreferrer"&gt;https://issues.apache.org/jira/browse/HADOOP-12082&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>bigdata</category>
      <category>apache</category>
      <category>solr</category>
      <category>hadoop</category>
    </item>
    <item>
      <title>Apache Hadoop - TLS and SSL Notes</title>
      <dc:creator>Kevin Risden</dc:creator>
      <pubDate>Thu, 15 Nov 2018 14:00:00 +0000</pubDate>
      <link>https://dev.to/risdenk/apache-hadoop---tls-and-ssl-notes-4ngc</link>
      <guid>https://dev.to/risdenk/apache-hadoop---tls-and-ssl-notes-4ngc</guid>
      <description>&lt;h3&gt;
  
  
  Overview
&lt;/h3&gt;

&lt;p&gt;I’ve collected notes on &lt;a href="https://en.wikipedia.org/wiki/Transport_Layer_Security" rel="noopener noreferrer"&gt;TLS/SSL&lt;/a&gt; for a number of years now. Most of them are related to &lt;a href="https://hadoop.apache.org/" rel="noopener noreferrer"&gt;Apache Hadoop&lt;/a&gt;, but others are more general. I was consulting when the &lt;a href="https://en.wikipedia.org/wiki/POODLE" rel="noopener noreferrer"&gt;POODLE&lt;/a&gt; and &lt;a href="https://en.wikipedia.org/wiki/Heartbleed" rel="noopener noreferrer"&gt;Heartbleed&lt;/a&gt; vulnerabilities were released. Below is a collection of TLS/SSL related references. No guarantee they are up to date, but it helps to have them all in one place.&lt;/p&gt;

&lt;h3&gt;
  
  
  TLS/SSL General
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Great explanation of TLS/SSL: &lt;a href="http://www.zytrax.com/tech/survival/ssl.html" rel="noopener noreferrer"&gt;http://www.zytrax.com/tech/survival/ssl.html&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;SSL Linux certificate location: &lt;a href="http://serverfault.com/questions/62496/ssl-certificate-location-on-unix-linux" rel="noopener noreferrer"&gt;http://serverfault.com/questions/62496/ssl-certificate-location-on-unix-linux&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;SSL vs TLS: &lt;a href="http://security.stackexchange.com/questions/5126/whats-the-difference-between-ssl-tls-and-https" rel="noopener noreferrer"&gt;http://security.stackexchange.com/questions/5126/whats-the-difference-between-ssl-tls-and-https&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Certificate Types
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="http://unmitigatedrisk.com/?p=381" rel="noopener noreferrer"&gt;http://unmitigatedrisk.com/?p=381&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/cm_sg_guide_ssl_certs.html" rel="noopener noreferrer"&gt;http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/cm_sg_guide_ssl_certs.html&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Generating Certificates
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.sslshopper.com/article-most-common-openssl-commands.html" rel="noopener noreferrer"&gt;https://www.sslshopper.com/article-most-common-openssl-commands.html&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://support.ssl.com/Knowledgebase/Article/View/19/0/der-vs-crt-vs-cer-vs-pem-certificates-and-how-to-convert-them" rel="noopener noreferrer"&gt;https://support.ssl.com/Knowledgebase/Article/View/19/0/der-vs-crt-vs-cer-vs-pem-certificates-and-how-to-convert-them&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Existing Certificate and Key to JKS
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="http://stackoverflow.com/questions/11952274/how%C2%ADcan%C2%ADi%C2%ADcreate%C2%ADkeystore%C2%ADfrom%C2%ADan%C2%ADexisting%C2%ADcertificate%C2%ADabc%C2%ADcrt%C2%ADand%C2%ADabc%C2%ADkey%C2%ADfil" rel="noopener noreferrer"&gt;http://stackoverflow.com/questions/11952274/how­can­i­create­keystore­from­an­existing­certificate­abc­crt­and­abc­key­fil&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;openssl pkcs12 ‐export ‐in abc.crt ‐inkey abc.key ‐out abc.p12
keytool ‐importkeystore ‐srckeystore abc.p12 \
        ‐srcstoretype PKCS12 \
        ‐destkeystore abc.jks \
        ‐deststoretype JKS

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Trusting CA Certificates
&lt;/h3&gt;

&lt;h4&gt;
  
  
  OpenSSL
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;update‐ca‐trust force‐enable
cp CERT.pem /etc/pki/tls/source/anchors/
update‐ca‐trust extract

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  OpenLDAP
&lt;/h4&gt;

&lt;p&gt;&lt;code&gt;vi /etc/openldap/ldap.conf&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;...
TLS_CACERT /etc/pki/
# Comment out TLS_CACERTDIR
...

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Java
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/usr/java/JAVA_VERSION/jre/lib/security/cacerts
/etc/pki/ca‐trust/extracted/java/cacerts

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;&lt;a href="https://bugzilla.redhat.com/show_bug.cgi?id=1056224" rel="noopener noreferrer"&gt;https://bugzilla.redhat.com/show_bug.cgi?id=1056224&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  POODLE - SSLv3
&lt;/h3&gt;

&lt;h4&gt;
  
  
  What is POODLE?
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://poodle.io/servers.html" rel="noopener noreferrer"&gt;https://poodle.io/servers.html&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.openssl.org/docs/apps/ciphers.html#SSL%C2%ADv3.0%C2%ADcipher%C2%ADsuites" rel="noopener noreferrer"&gt;https://www.openssl.org/docs/apps/ciphers.html#SSL­v3.0­cipher­suites&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Testing for POODLE
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://chrisburgess.com.au/how-to-test-for-the-sslv3-poodle-vulnerability/" rel="noopener noreferrer"&gt;https://chrisburgess.com.au/how-to-test-for-the-sslv3-poodle-vulnerability/&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Requires a relatively recent version of openssl installed
openssl s_client ‐connect HOST:PORT ‐ssl3
# ‐tls1 ‐tls1_1 ‐tls1_2
curl ‐v3 ‐i ‐X HEAD https://HOST:PORT

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Configuring Hadoop for Cipher Suites and Protocols
&lt;/h3&gt;

&lt;p&gt;Each Hadoop component must be configured, or be running a new enough version, to disable certain SSL protocols and cipher suites.&lt;/p&gt;

&lt;h4&gt;
  
  
  Ambari
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://docs.hortonworks.com/HDPDocuments/HDP3/HDP-3.0.1/configuring-advanced-security-options-for-ambari/content/ambari_sec_optional_configure_ciphers_and_protocols_for_ambari_server.html" rel="noopener noreferrer"&gt;https://docs.hortonworks.com/HDPDocuments/HDP3/HDP-3.0.1/configuring-advanced-security-options-for-ambari/content/ambari_sec_optional_configure_ciphers_and_protocols_for_ambari_server.html&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;security.server.disabled.ciphers=TLS_ECDHE_RSA_WITH_3DES_EDE_CBC_SHA&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;security.server.disabled.protocols=SSL|SSLv2|SSLv3&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h4&gt;
  
  
  Hadoop
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://issues.apache.org/jira/browse/HADOOP-11243" rel="noopener noreferrer"&gt;https://issues.apache.org/jira/browse/HADOOP-11243&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;Hadoop 2.5.2 + 2.6 - Patches SSLFactory for TLSv1&lt;/li&gt;
&lt;li&gt;&lt;code&gt;hadoop.ssl.enabled.protocols=TLSv1&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;(JDK6 can use TLSv1, JDK7+ can use TLSv1,TLSv1.1,TLSv1.2)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;a href="https://issues.apache.org/jira/browse/HADOOP-11218" rel="noopener noreferrer"&gt;&lt;/a&gt;&lt;a href="https://issues.apache.org/jira/browse/HADOOP-11218" rel="noopener noreferrer"&gt;https://issues.apache.org/jira/browse/HADOOP-11218&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;Hadoop 2.8 - Patches SSLFactory for TLSv1.1 and TLSv1.2&lt;/li&gt;
&lt;li&gt;Java 6 doesn’t support TLSv1.1+. Requires Java 7.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;a href="https://issues.apache.org/jira/browse/HADOOP-11260" rel="noopener noreferrer"&gt;&lt;/a&gt;&lt;a href="https://issues.apache.org/jira/browse/HADOOP-11260" rel="noopener noreferrer"&gt;https://issues.apache.org/jira/browse/HADOOP-11260&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;Hadoop 2.5.2 + 2.6 - Patches Jetty to disable SSLv3&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
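&lt;p&gt;For reference, on a patched Hadoop version the protocol list above is set in core-site.xml. This is a sketch only; the value shown assumes JDK 7+ and should be adjusted for your JDK:&lt;/p&gt;

```xml
<!-- Sketch for core-site.xml; this protocol list assumes JDK 7+ -->
<property>
  <name>hadoop.ssl.enabled.protocols</name>
  <value>TLSv1,TLSv1.1,TLSv1.2</value>
</property>
```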

&lt;h4&gt;
  
  
  HTTPFS
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://issues.apache.org/jira/browse/HDFS-7274" rel="noopener noreferrer"&gt;https://issues.apache.org/jira/browse/HDFS-7274&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;Hadoop 2.5.2 + 2.6 - Disables SSLv3 in HTTPFS&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h4&gt;
  
  
  Hive
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://issues.apache.org/jira/browse/HIVE-8675" rel="noopener noreferrer"&gt;https://issues.apache.org/jira/browse/HIVE-8675&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;Hive 0.14 - Removes SSLv3 from supported protocols&lt;/li&gt;
&lt;li&gt;&lt;code&gt;hive.ssl.protocol.blacklist&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;a href="https://issues.apache.org/jira/browse/HIVE-8827" rel="noopener noreferrer"&gt;&lt;/a&gt;&lt;a href="https://issues.apache.org/jira/browse/HIVE-8827" rel="noopener noreferrer"&gt;https://issues.apache.org/jira/browse/HIVE-8827&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;Hive 1.0 - Adds &lt;code&gt;SSLv2Hello&lt;/code&gt; back to supported protocols&lt;/li&gt;
&lt;li&gt;&lt;code&gt;hive.ssl.protocol.blacklist=SSLv2,SSLv3&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
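&lt;p&gt;As a sketch, the blacklist property lands in hive-site.xml. The value below matches HIVE-8827, which keeps &lt;code&gt;SSLv2Hello&lt;/code&gt; available while blocking SSLv2 and SSLv3:&lt;/p&gt;

```xml
<!-- Sketch for hive-site.xml (value per HIVE-8827) -->
<property>
  <name>hive.ssl.protocol.blacklist</name>
  <value>SSLv2,SSLv3</value>
</property>
```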

&lt;h4&gt;
  
  
  Oozie
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://issues.apache.org/jira/browse/OOZIE-2034" rel="noopener noreferrer"&gt;https://issues.apache.org/jira/browse/OOZIE-2034&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;Oozie 4.1.0 - Disable SSLv3&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;a href="https://issues.apache.org/jira/browse/OOZIE-2037" rel="noopener noreferrer"&gt;&lt;/a&gt;&lt;a href="https://issues.apache.org/jira/browse/OOZIE-2037" rel="noopener noreferrer"&gt;https://issues.apache.org/jira/browse/OOZIE-2037&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;Add support for TLSv1.1 and TLSv1.2&lt;/li&gt;
&lt;li&gt;Java 6 doesn’t support TLSv1.1+. Requires Java 7. Depends on OOZIE-2036&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h4&gt;
  
  
  Flume
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://issues.apache.org/jira/browse/FLUME-2520" rel="noopener noreferrer"&gt;https://issues.apache.org/jira/browse/FLUME-2520&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;Flume 1.5.1 - HTTPSource disable SSLv3&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h4&gt;
  
  
  Hue
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://issues.cloudera.org/browse/HUE-2438" rel="noopener noreferrer"&gt;https://issues.cloudera.org/browse/HUE-2438&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;Hue 3.8 - Disable SSLv3&lt;/li&gt;
&lt;li&gt;line 1670 of &lt;code&gt;/usr/lib/hue/desktop/core/src/desktop/lib/wsgiserver.py&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;&lt;code&gt;ctx.set_options(SSL.OP_NO_SSLv2 | SSL.OP_NO_SSLv3)&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ssl_cipher_list = "DEFAULT:!aNULL:!eNULL:!LOW:!EXPORT:!SSLv2"&lt;/code&gt; (default)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h4&gt;
  
  
  Ranger
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://issues.apache.org/jira/browse/RANGER-158" rel="noopener noreferrer"&gt;https://issues.apache.org/jira/browse/RANGER-158&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;Ranger 0.4.0 - Ranger Admin and User Authentication disable SSLv3&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h4&gt;
  
  
  Knox
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://issues.apache.org/jira/browse/KNOX-455" rel="noopener noreferrer"&gt;https://issues.apache.org/jira/browse/KNOX-455&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;Knox 0.5.0 - Disable SSLv3&lt;/li&gt;
&lt;li&gt;&lt;code&gt;ssl.exclude.protocols&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
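&lt;p&gt;A sketch of the corresponding gateway-site.xml entry; the property name comes from KNOX-455:&lt;/p&gt;

```xml
<!-- Sketch for Knox gateway-site.xml -->
<property>
  <name>ssl.exclude.protocols</name>
  <value>SSLv3</value>
</property>
```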

&lt;h4&gt;
  
  
  Storm
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://issues.apache.org/jira/browse/STORM-640" rel="noopener noreferrer"&gt;https://issues.apache.org/jira/browse/STORM-640&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;Storm 0.10.0 - Disable SSLv3&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h3&gt;
  
  
  Resources
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="http://sysadvent.blogspot.co.uk/2010/12/day-3-debugging-ssltls-with-openssl1.html" rel="noopener noreferrer"&gt;http://sysadvent.blogspot.co.uk/2010/12/day-3-debugging-ssltls-with-openssl1.html&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://gist.github.com/jankronquist/6412839" rel="noopener noreferrer"&gt;https://gist.github.com/jankronquist/6412839&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>bigdata</category>
      <category>apache</category>
      <category>hadoop</category>
      <category>tls</category>
    </item>
    <item>
      <title>Apache Knox - Performance Improvements</title>
      <dc:creator>Kevin Risden</dc:creator>
      <pubDate>Tue, 13 Nov 2018 14:00:00 +0000</pubDate>
      <link>https://dev.to/risdenk/apache-knox---performance-improvements-28f6</link>
      <guid>https://dev.to/risdenk/apache-knox---performance-improvements-28f6</guid>
      <description>&lt;h3&gt;
  
  
  TL;DR
&lt;/h3&gt;

&lt;p&gt;Apache Knox 1.2.0 should significantly improve:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/WebHDFS.html" rel="noopener noreferrer"&gt;Apache Hadoop WebHDFS&lt;/a&gt; write performance due to &lt;a href="https://issues.apache.org/jira/browse/KNOX-1521" rel="noopener noreferrer"&gt;KNOX-1521&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://hive.apache.org/" rel="noopener noreferrer"&gt;Apache Hive&lt;/a&gt; and GZip performance due to &lt;a href="https://issues.apache.org/jira/browse/KNOX-1530" rel="noopener noreferrer"&gt;KNOX-1530&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you use Java for TLS, the Java TLS/SSL performance section below is worth reading.&lt;/p&gt;

&lt;h3&gt;
  
  
  Overview
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://knox.apache.org/" rel="noopener noreferrer"&gt;Apache Knox&lt;/a&gt; is a reverse proxy that simplifies security in front of a Kerberos secured &lt;a href="https://hadoop.apache.org/" rel="noopener noreferrer"&gt;Apache Hadoop&lt;/a&gt; cluster and other related components. On the &lt;a href="https://mail-archives.apache.org/mod_mbox/knox-user/201809.mbox/%3CCACEuXj475wey-AzxO%2Bqf162Qe7ChEB8oNj1Hd6O1E4VNd8cH7g%40mail.gmail.com%3E" rel="noopener noreferrer"&gt;knox-user mailing list&lt;/a&gt; and &lt;a href="https://issues.apache.org/jira/browse/KNOX-1221" rel="noopener noreferrer"&gt;Knox Jira&lt;/a&gt;, there have been reports about Apache Knox not performing as expected. Two of the reported cases focused on &lt;a href="https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/WebHDFS.html" rel="noopener noreferrer"&gt;Apache Hadoop WebHDFS&lt;/a&gt; performance specifically. I was able to reproduce the slow downs with Apache Knox although the findings were surprising. This blog details the performance findings as well as improvements that will be in Apache Knox 1.2.0.&lt;/p&gt;

&lt;h3&gt;
  
  
  Reproducing the performance issues
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Apache Hadoop - WebHDFS
&lt;/h4&gt;

&lt;p&gt;I started looking into the two reported WebHDFS performance issues (&lt;a href="https://issues.apache.org/jira/browse/KNOX-1221" rel="noopener noreferrer"&gt;KNOX-1221&lt;/a&gt; and &lt;a href="https://mail-archives.apache.org/mod_mbox/knox-user/201809.mbox/%3CCACEuXj475wey-AzxO%2Bqf162Qe7ChEB8oNj1Hd6O1E4VNd8cH7g%40mail.gmail.com%3E" rel="noopener noreferrer"&gt;knox-user post&lt;/a&gt;). I found that the issue reproduced easily on a VM on my laptop. I tested read and write performance of WebHDFS natively with curl as well as going through Apache Knox. The results as posted to &lt;a href="https://issues.apache.org/jira/browse/KNOX-1221" rel="noopener noreferrer"&gt;KNOX-1221&lt;/a&gt; were as follows:&lt;/p&gt;
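&lt;p&gt;The timing approach can be sketched with curl's built-in transfer metrics. The helper name and URLs below are hypothetical placeholders, and &lt;code&gt;-k&lt;/code&gt; skips certificate verification for a test cluster:&lt;/p&gt;

```shell
# Hypothetical timing helper using curl's transfer metrics
time_transfer() {
  curl -skL -o /dev/null -w '%{speed_download} bytes/s, %{time_total}s total\n' "$1"
}

# Native WebHDFS read vs. the same read through Knox (placeholder URLs):
# time_transfer "http://NAMENODE:50070/webhdfs/v1/tmp/1gb.bin?op=OPEN&user.name=hdfs"
# time_transfer "https://KNOX_HOST:8443/gateway/default/webhdfs/v1/tmp/1gb.bin?op=OPEN"
```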

&lt;p&gt;&lt;strong&gt;WebHDFS Read Performance - 1GB file&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Test Case&lt;/th&gt;
&lt;th&gt;Transfer Speed&lt;/th&gt;
&lt;th&gt;Time&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Native WebHDFS&lt;/td&gt;
&lt;td&gt;252 MB/s&lt;/td&gt;
&lt;td&gt;3.8s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Knox w/o TLS&lt;/td&gt;
&lt;td&gt;264 MB/s&lt;/td&gt;
&lt;td&gt;3.6s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Knox w/ TLS&lt;/td&gt;
&lt;td&gt;54 MB/s&lt;/td&gt;
&lt;td&gt;20s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Parallel Knox w/ TLS&lt;/td&gt;
&lt;td&gt;2 at ~48MB/s&lt;/td&gt;
&lt;td&gt;22s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;WebHDFS Write Performance - 1GB file&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Test Case&lt;/th&gt;
&lt;th&gt;Time&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Native WebHDFS&lt;/td&gt;
&lt;td&gt;2.6s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Knox w/o TLS&lt;/td&gt;
&lt;td&gt;29s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Knox w/ TLS&lt;/td&gt;
&lt;td&gt;50s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The results were surprising since the numbers were all over the map. What was consistent was that Knox performance was poor for reads with TLS and for writes regardless of TLS. Another interesting finding was that parallel reads through Knox did not slow each other down; instead, each connection was limited independently. Details of the analysis are in the Performance Analysis section below.&lt;/p&gt;

&lt;h4&gt;
  
  
  Apache HBase - HBase Rest
&lt;/h4&gt;

&lt;p&gt;After analyzing WebHDFS performance, I decided to look at other services to see if the same slowdowns existed. I looked at Apache HBase Rest as part of &lt;a href="https://issues.apache.org/jira/browse/KNOX-1524" rel="noopener noreferrer"&gt;KNOX-1524&lt;/a&gt;. I tested without TLS for Knox since a TLS slowdown had already been identified with WebHDFS.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scan Performance for 100 thousand rows&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Test Case&lt;/th&gt;
&lt;th&gt;Time&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;HBase shell&lt;/td&gt;
&lt;td&gt;13.9s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;HBase Rest - native&lt;/td&gt;
&lt;td&gt;3.4s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;HBase Rest - Knox&lt;/td&gt;
&lt;td&gt;3.7s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The results were not too surprising. More details of the analysis are in the Performance Analysis section below.&lt;/p&gt;

&lt;h4&gt;
  
  
  Apache Hive - HiveServer2
&lt;/h4&gt;

&lt;p&gt;I also looked into HiveServer2 performance with and without Apache Knox as part of &lt;a href="https://issues.apache.org/jira/browse/KNOX-1524" rel="noopener noreferrer"&gt;KNOX-1524&lt;/a&gt;. The testing below is again without TLS.&lt;/p&gt;
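&lt;p&gt;For context, connecting beeline through Knox in http mode uses a JDBC URL along the lines sketched below. The host, port, and topology name ("default") are placeholders that depend on the Knox deployment:&lt;/p&gt;

```shell
# Hypothetical HiveServer2-over-Knox JDBC URL; adjust host, port, and topology
KNOX_HOST=knox.example.com
JDBC_URL="jdbc:hive2://${KNOX_HOST}:8443/;ssl=true;transportMode=http;httpPath=gateway/default/hive"
echo "$JDBC_URL"

# beeline -u "$JDBC_URL" -e 'select count(*) from some_table'  # needs a live cluster
```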

&lt;p&gt;&lt;strong&gt;Select * performance for 200 thousand rows&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Test Case&lt;/th&gt;
&lt;th&gt;Time&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;hdfs dfs -text&lt;/td&gt;
&lt;td&gt;2.4s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;beeline binary fetchSize=1000&lt;/td&gt;
&lt;td&gt;6.2s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;beeline http fetchSize=1000&lt;/td&gt;
&lt;td&gt;7.5s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;beeline http Knox fetchSize=1000&lt;/td&gt;
&lt;td&gt;9.9s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;beeline binary fetchSize=10000&lt;/td&gt;
&lt;td&gt;7.3s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;beeline http fetchSize=10000&lt;/td&gt;
&lt;td&gt;7.9s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;beeline http Knox fetchSize=10000&lt;/td&gt;
&lt;td&gt;8.5s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This showed there was room for improvement for Hive with Knox as well. Details of the analysis are in the Performance Analysis section below.&lt;/p&gt;

&lt;h3&gt;
  
  
  Performance Analysis
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Apache Hadoop - WebHDFS
&lt;/h4&gt;

&lt;p&gt;While looking at the WebHDFS results, I found that disabling TLS resulted in a big performance gain. Since changing &lt;code&gt;ssl.enabled&lt;/code&gt; in &lt;code&gt;gateway-site.xml&lt;/code&gt; was the only change, TLS had to be the only factor in the read performance differences. I looked into Jetty performance with TLS and found there were known performance issues with the JDK. For more details, see the Java TLS/SSL performance section below.&lt;/p&gt;

&lt;p&gt;The WebHDFS write performance difference could not be attributed to TLS since Knox without TLS was also ~20 seconds slower. I experimented with different buffer sizes and upgrading httpclient before finding the root cause: an issue with &lt;code&gt;UrlRewriteRequestStream&lt;/code&gt; in Apache Knox. &lt;code&gt;InputStream&lt;/code&gt; defines multiple &lt;code&gt;read&lt;/code&gt; methods, and the bulk variants were not implemented, so reads fell back to one byte at a time. For the fix details, see the Knox WebHDFS write performance section below.&lt;/p&gt;
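&lt;p&gt;The cost of falling back to single-byte reads is easy to see outside of Knox. The illustrative shell comparison below copies the same 1 MiB file byte-by-byte and then in buffered chunks; it is an analogy for the missing bulk &lt;code&gt;read&lt;/code&gt; overloads, not the Knox code itself:&lt;/p&gt;

```shell
# Illustration only: byte-at-a-time vs. buffered copies of the same data
f=$(mktemp)
head -c 1048576 /dev/zero > "$f"                   # 1 MiB test file
time dd if="$f" of=/dev/null bs=1 2>/dev/null      # one byte per read()
time dd if="$f" of=/dev/null bs=65536 2>/dev/null  # buffered reads
rm -f "$f"
```

The byte-at-a-time copy takes orders of magnitude more system calls for the same data, which is the same effect the missing overloads had on WebHDFS writes.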

&lt;h4&gt;
  
  
  Apache HBase - HBase Rest
&lt;/h4&gt;

&lt;p&gt;The &lt;a href="https://hbase.apache.org/book.html#shell" rel="noopener noreferrer"&gt;HBase shell&lt;/a&gt; slowness is to be expected since it is written in &lt;a href="https://www.jruby.org/" rel="noopener noreferrer"&gt;JRuby&lt;/a&gt; and not the best tool for working with HBase. Typically the &lt;a href="https://hbase.apache.org/book.html#hbase_apis" rel="noopener noreferrer"&gt;HBase Java API&lt;/a&gt; is used. While looking at the results, there were no big bottlenecks that jumped out from the performance test. There is some overhead due to Apache Knox but much of this is due to the extra hops.&lt;/p&gt;

&lt;h4&gt;
  
  
  Apache Hive - HiveServer2
&lt;/h4&gt;

&lt;p&gt;It took me a few tries to create a test framework that would allow me to test the changes easily. One of the big findings was that Hive is significantly slower than &lt;code&gt;hdfs dfs -text&lt;/code&gt; for the same file, so there is likely room for performance improvements in HiveServer2 itself. Another finding is that HiveServer2 binary and http modes differed significantly with the default &lt;code&gt;fetchSize&lt;/code&gt; of 1000. My guess is that when HTTP compression was added in &lt;a href="https://issues.apache.org/jira/browse/HIVE-17194" rel="noopener noreferrer"&gt;HIVE-17194&lt;/a&gt;, the default &lt;code&gt;fetchSize&lt;/code&gt; should have been increased to improve over-the-wire efficiency. Even ignoring binary mode, there was still a difference between HiveServer2 http mode with and without Apache Knox. Details on the performance improvements are in the sections below.&lt;/p&gt;

&lt;h3&gt;
  
  
  Performance Improvements
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Java - TLS/SSL Performance
&lt;/h4&gt;

&lt;p&gt;There are some performance issues when using the default JDK TLS implementation. I found a few references about the JDK and Jetty.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://nbsoftsolutions.com/blog/the-cost-of-tls-in-java-and-solutions" rel="noopener noreferrer"&gt;https://nbsoftsolutions.com/blog/the-cost-of-tls-in-java-and-solutions&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://nbsoftsolutions.com/blog/dropwizard-1-3-upcoming-tls-improvements" rel="noopener noreferrer"&gt;https://nbsoftsolutions.com/blog/dropwizard-1-3-upcoming-tls-improvements&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://webtide.com/conscrypting-native-ssl-for-jetty/" rel="noopener noreferrer"&gt;https://webtide.com/conscrypting-native-ssl-for-jetty/&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I was able to test with &lt;a href="https://github.com/google/conscrypt/" rel="noopener noreferrer"&gt;Conscrypt&lt;/a&gt; and found that the performance slowdowns for TLS reads and writes went away. I also tested disabling GCM since there are references that GCM can cause performance issues with JDK 8.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.wowza.com/docs/how-to-improve-ssl-performance-with-java-8" rel="noopener noreferrer"&gt;https://www.wowza.com/docs/how-to-improve-ssl-performance-with-java-8&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://stackoverflow.com/questions/25992131/slow-aes-gcm-encryption-and-decryption-with-java-8u20" rel="noopener noreferrer"&gt;https://stackoverflow.com/questions/25992131/slow-aes-gcm-encryption-and-decryption-with-java-8u20&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The results of testing different TLS implementations are below:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Test Case&lt;/th&gt;
&lt;th&gt;Transfer Speed&lt;/th&gt;
&lt;th&gt;Time&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Native WebHDFS&lt;/td&gt;
&lt;td&gt;252MB/s&lt;/td&gt;
&lt;td&gt;3.8s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Knox w/o TLS&lt;/td&gt;
&lt;td&gt;264MB/s&lt;/td&gt;
&lt;td&gt;3.6s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Knox w/ Conscrypt TLS&lt;/td&gt;
&lt;td&gt;245MB/s&lt;/td&gt;
&lt;td&gt;4.2s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Knox w/ TLS no GCM&lt;/td&gt;
&lt;td&gt;125MB/s&lt;/td&gt;
&lt;td&gt;8.7s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Knox w/ TLS&lt;/td&gt;
&lt;td&gt;54.3MB/s&lt;/td&gt;
&lt;td&gt;20s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Switching to a different TLS implementation provider for the JDK can significantly help performance. This applies across the board to any TLS handling in Java. Another option is to terminate TLS connections with a non-Java-based load balancer. Finally, turning off TLS may be acceptable for isolated, performance-specific use cases. All of these options are worth considering when using TLS with Java.&lt;/p&gt;

&lt;h4&gt;
  
  
  Knox - WebHDFS Write Performance
&lt;/h4&gt;

&lt;p&gt;I created &lt;a href="https://issues.apache.org/jira/browse/KNOX-1521" rel="noopener noreferrer"&gt;KNOX-1521&lt;/a&gt; to add the missing &lt;code&gt;read&lt;/code&gt; methods on the &lt;code&gt;UrlRewriteRequestStream&lt;/code&gt; class. This allows the underlying stream to read more efficiently than one byte at a time. With the changes from &lt;a href="https://issues.apache.org/jira/browse/KNOX-1521" rel="noopener noreferrer"&gt;KNOX-1521&lt;/a&gt;, WebHDFS write performance is now much closer to native WebHDFS. The updated write performance results after &lt;a href="https://issues.apache.org/jira/browse/KNOX-1521" rel="noopener noreferrer"&gt;KNOX-1521&lt;/a&gt; are below:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;WebHDFS Write Performance - 1GB file - KNOX-1521&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Test Case&lt;/th&gt;
&lt;th&gt;Time&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Native WebHDFS&lt;/td&gt;
&lt;td&gt;3.3s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Knox w/o TLS&lt;/td&gt;
&lt;td&gt;29s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Knox w/o TLS w/ KNOX-1521&lt;/td&gt;
&lt;td&gt;4.2s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h4&gt;
  
  
  Knox - GZip Handling
&lt;/h4&gt;

&lt;p&gt;I found that Apache Knox had a few issues when it came to handling GZip compressed data. I opened &lt;a href="https://issues.apache.org/jira/browse/KNOX-1530" rel="noopener noreferrer"&gt;KNOX-1530&lt;/a&gt; to address the underlying issues. The big improvement is that Knox will no longer decompress data that doesn’t need to be rewritten. This removes a lot of processing and should improve Knox performance for other use cases as well, like reading compressed files from WebHDFS and serving compressed JS/CSS files for UIs. After &lt;a href="https://issues.apache.org/jira/browse/KNOX-1530" rel="noopener noreferrer"&gt;KNOX-1530&lt;/a&gt; was addressed, the &lt;a href="https://issues.apache.org/jira/browse/KNOX-1524?focusedCommentId=16673639&amp;amp;page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16673639" rel="noopener noreferrer"&gt;performance for Apache Hive HiveServer2 in http mode with and without Apache Knox&lt;/a&gt; was about the same.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Select * performance for 200 thousand rows with &lt;a href="https://issues.apache.org/jira/browse/KNOX-1530" rel="noopener noreferrer"&gt;KNOX-1530&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Test Case&lt;/th&gt;
&lt;th&gt;Time&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;hdfs dfs -text&lt;/td&gt;
&lt;td&gt;2.1s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;beeline binary fetchSize=1000&lt;/td&gt;
&lt;td&gt;5.4s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;beeline http fetchSize=1000&lt;/td&gt;
&lt;td&gt;6.8s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;beeline http Knox fetchSize=1000&lt;/td&gt;
&lt;td&gt;7.7s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;beeline binary fetchSize=10000&lt;/td&gt;
&lt;td&gt;6.8s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;beeline http fetchSize=10000&lt;/td&gt;
&lt;td&gt;7.7s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;beeline http Knox fetchSize=10000&lt;/td&gt;
&lt;td&gt;7.8s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The default &lt;code&gt;fetchSize&lt;/code&gt; of 1000 slows down HTTP mode since repeated requests are needed to fetch all of the results.&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;Reproducing the WebHDFS performance bottleneck showed that Knox performance could be improved. WebHDFS write performance in Apache Knox 1.2.0 should be significantly faster due to the &lt;a href="https://issues.apache.org/jira/browse/KNOX-1521" rel="noopener noreferrer"&gt;KNOX-1521&lt;/a&gt; changes. Hive performance should also be better in Apache Knox 1.2.0 due to the improved GZip handling in &lt;a href="https://issues.apache.org/jira/browse/KNOX-1530" rel="noopener noreferrer"&gt;KNOX-1530&lt;/a&gt;. Apache Knox 1.2.0 should be released soon with these performance improvements and more.&lt;/p&gt;

&lt;p&gt;I posted the performance tests I used &lt;a href="https://github.com/risdenk/knox-performance-tests" rel="noopener noreferrer"&gt;here&lt;/a&gt; so they can be used to find other performance bottlenecks. The benchmarks should be reproducible, and I will use them for more performance testing soon.&lt;/p&gt;

&lt;p&gt;The performance testing done so far compares Knox against the native endpoints rather than aiming for the best absolute numbers. This type of testing found several bottlenecks that have been addressed for Apache Knox 1.2.0. All of the tests so far run without Kerberos authentication for the backend. There could be additional performance bottlenecks when Kerberos authentication is used, and that will be another area I’ll be looking into.&lt;/p&gt;

</description>
      <category>bigdata</category>
      <category>apache</category>
      <category>knox</category>
      <category>security</category>
    </item>
    <item>
      <title>Apache HBase - Thrift 1 Server SPNEGO Improvements</title>
      <dc:creator>Kevin Risden</dc:creator>
      <pubDate>Thu, 08 Nov 2018 14:00:00 +0000</pubDate>
      <link>https://dev.to/risdenk/apache-hbase---thrift-1-server-spnego-improvements-4d9d</link>
      <guid>https://dev.to/risdenk/apache-hbase---thrift-1-server-spnego-improvements-4d9d</guid>
      <description>&lt;h3&gt;
  
  
  Overview
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://hbase.apache.org/"&gt;Apache HBase&lt;/a&gt; provides the ability to perform realtime random read/write access to large datasets. HBase is built on top of &lt;a href="https://hadoop.apache.org/"&gt;Apache Hadoop&lt;/a&gt; and can scale to billions of rows and millions of columns. One of the capabilities of Apache HBase is a &lt;a href="https://hbase.apache.org/book.html#thrift"&gt;thrift server&lt;/a&gt; that provides the ability to interact with HBase from any language that supports &lt;a href="https://thrift.apache.org/"&gt;Thrift&lt;/a&gt;. There are two different versions of the HBase Thrift server v1 and v2. This blog post focuses on v1 since that is the version that integrates with &lt;a href="https://gethue.com/"&gt;Hue&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Apache HBase and Hue
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://gethue.com/the-web-ui-for-hbase-hbase-browser/"&gt;Hue has support for Apache HBase&lt;/a&gt; through the v1 thrift server. The Hue UI allows for easily interacting with HBase for both querying and inserting. It is a quick and easy way to get started with HBase. The downside is that when using the HBase thrift v1 server, there was limited support for Kerberos.&lt;/p&gt;

&lt;h3&gt;
  
  
  HBase Thrift V1 and Kerberos
&lt;/h3&gt;

&lt;p&gt;There have been a few &lt;a href="http://grokbase.com/p/cloudera/cdh-user/133pgawryt/hbase-thrift-with-kerberos-appears-to-ignore-keytab"&gt;posts&lt;/a&gt; about getting the HBase Thrift V1 server to work properly with Kerberos. In many cases, the solution was to merge keytabs for the HTTP principal and the HBase server principal. The other solution was to add the HTTP principal as a proxy user. Both of these solutions require extra work that isn’t necessary. The HTTP principal should only be used for authenticating SPNEGO. The HBase server principal should be used to authenticate with the rest of HBase. I found this out after comparing the Apache Hive HiveServer2 thrift implementation with the HBase thrift server implementation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Improving the HBase Thrift V1 Implementation
&lt;/h3&gt;

&lt;p&gt;I emailed the &lt;a href="http://mail-archives.apache.org/mod_mbox/hbase-user/201801.mbox/%3CCAJU9nmh5YtZ%2BmAQSLo91yKm8pRVzAPNLBU9vdVMCcxHRtRqgoA%40mail.gmail.com%3E"&gt;hbase-user mailing list&lt;/a&gt; to see if my findings were plausible or if I was missing something. Josh Elser reviewed it and said that this change would be useful. I opened &lt;a href="https://issues.apache.org/jira/browse/HBASE-19852"&gt;HBASE-19852&lt;/a&gt; and put together a working patch over the next few months. It turns out the quick patch for our environment took some effort to contribute back to Apache HBase proper. The patch accomplished the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Avoid the existing 401 try/catch block by checking the authorization header up front before checking for Kerberos credentials&lt;/li&gt;
&lt;li&gt;Add &lt;code&gt;hbase.thrift.spnego.principal&lt;/code&gt; and &lt;code&gt;hbase.thrift.spnego.keytab.file&lt;/code&gt; to allow configuring the SPNEGO principal specifically for the Thrift server&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The first change prevents the logs from filling with failed Kerberos authentication messages when the authorization header is empty. The second change allows the SPNEGO principal to be configured in the hbase-site.xml file; the Thrift server then uses the SPNEGO principal and keytab for HTTP authentication. This removes the need to merge keytabs and lets an administrator use existing SPNEGO principals and keytabs already on the host (like ones set up by Ambari).&lt;/p&gt;
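&lt;p&gt;A sketch of the resulting hbase-site.xml entries follows; the principal and keytab path below are examples of what an Ambari-managed host typically already has:&lt;/p&gt;

```xml
<!-- Sketch for hbase-site.xml; principal and keytab path are examples -->
<property>
  <name>hbase.thrift.spnego.principal</name>
  <value>HTTP/_HOST@EXAMPLE.COM</value>
</property>
<property>
  <name>hbase.thrift.spnego.keytab.file</name>
  <value>/etc/security/keytabs/spnego.service.keytab</value>
</property>
```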

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://issues.apache.org/jira/browse/HBASE-19852"&gt;HBASE-19852&lt;/a&gt; was reviewed and merged in June 2018. It is a part of HBase 2.1.0 and greater. The Apache HBase community was great to work with since they were patient while I worked on the patch over a few months. The new configuration options allows the HBase Thrift V1 server to work seemlessly with Kerberos and Hue. There is no longer a need to merge keytabs or perform other workarounds. This change has been in use for over a year now with success using the Hue HBase Browser with HBase and Kerberos.&lt;/p&gt;

</description>
      <category>bigdata</category>
      <category>apache</category>
      <category>hbase</category>
      <category>thrift</category>
    </item>
    <item>
      <title>Apache HBase - Snappy Compression</title>
      <dc:creator>Kevin Risden</dc:creator>
      <pubDate>Tue, 06 Nov 2018 14:00:00 +0000</pubDate>
      <link>https://dev.to/risdenk/apache-hbase---snappy-compression-f6l</link>
      <guid>https://dev.to/risdenk/apache-hbase---snappy-compression-f6l</guid>
      <description>&lt;h3&gt;
  
  
  Overview
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://hbase.apache.org/"&gt;Apache HBase&lt;/a&gt; provides the ability to perform realtime random read/write access to large datasets. HBase is built on top of &lt;a href="https://hadoop.apache.org/"&gt;Apache Hadoop&lt;/a&gt; and can scale to billions of rows and millions of columns. One of the features of HBase is to enable &lt;a href="https://hbase.apache.org/book.html#compression"&gt;different types of compression&lt;/a&gt; for a column family. It is recommended that testing be done for your use case, but this blog shows how &lt;a href="https://en.wikipedia.org/wiki/Snappy_(compression)"&gt;Snappy compression&lt;/a&gt; can reduce storage needs while keeping the same query performance.&lt;/p&gt;

&lt;h3&gt;
  
  
  Evidence
&lt;/h3&gt;

&lt;p&gt;Below are some images from some clusters where testing was done with Snappy compression. The charts show a variety of metrics from storage size to system metrics.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--OhXvv_Ws--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://risdenk.github.io/images/posts/2018-11-06/dev_grafana_hbase_get_mutate_latencies.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--OhXvv_Ws--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://risdenk.github.io/images/posts/2018-11-06/dev_grafana_hbase_get_mutate_latencies.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--linhfWtS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://risdenk.github.io/images/posts/2018-11-06/dev_grafana_hbase_size.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--linhfWtS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://risdenk.github.io/images/posts/2018-11-06/dev_grafana_hbase_size.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--CnQjjbf5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://risdenk.github.io/images/posts/2018-11-06/test_grafana_hbase_get_mutate_latencies.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--CnQjjbf5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://risdenk.github.io/images/posts/2018-11-06/test_grafana_hbase_get_mutate_latencies.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--wfKR38hi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://risdenk.github.io/images/posts/2018-11-06/test_grafana_hbase_size.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--wfKR38hi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://risdenk.github.io/images/posts/2018-11-06/test_grafana_hbase_size.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--TSV8lipJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://risdenk.github.io/images/posts/2018-11-06/test_grafana_system_disk_io.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--TSV8lipJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://risdenk.github.io/images/posts/2018-11-06/test_grafana_system_disk_io.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--t85MyCDo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://risdenk.github.io/images/posts/2018-11-06/test_grafana_system_iowait.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--t85MyCDo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://risdenk.github.io/images/posts/2018-11-06/test_grafana_system_iowait.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--E4qI3voD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://risdenk.github.io/images/posts/2018-11-06/test_grafana_system_user.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--E4qI3voD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://risdenk.github.io/images/posts/2018-11-06/test_grafana_system_user.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;The charts above show more than 80% storage savings with only a slight bump in mutate latencies. The clusters this was tested on were loaded with simulated data and load, and the production data matched these results once deployed. The storage savings also helped backups and disaster recovery, since we didn’t need to move as much data across the wire. References for implementing this yourself, with more options for testing, are below.&lt;/p&gt;

&lt;h4&gt;
  
  
  References
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://community.hortonworks.com/articles/54761/compression-in-hbase.html"&gt;https://community.hortonworks.com/articles/54761/compression-in-hbase.html&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://hadoop-hbase.blogspot.com/2016/02/hbase-compression-vs-blockencoding_17.html"&gt;http://hadoop-hbase.blogspot.com/2016/02/hbase-compression-vs-blockencoding_17.html&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blogs.apache.org/hbase/entry/the_effect_of_columnfamily_rowkey"&gt;https://blogs.apache.org/hbase/entry/the_effect_of_columnfamily_rowkey&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://db-blog.web.cern.ch/blog/zbigniew-baranowski/2017-01-performance-comparison-different-file-formats-and-storage-engines"&gt;https://db-blog.web.cern.ch/blog/zbigniew-baranowski/2017-01-performance-comparison-different-file-formats-and-storage-engines&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://blog.erdemagaoglu.com/post/4605524309/lzo-vs-snappy-vs-lzf-vs-zlib-a-comparison-of"&gt;http://blog.erdemagaoglu.com/post/4605524309/lzo-vs-snappy-vs-lzf-vs-zlib-a-comparison-of&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://hbase.apache.org/book.html#compression"&gt;https://hbase.apache.org/book.html#compression&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://hbase.apache.org/book.html#data.block.encoding.enable"&gt;https://hbase.apache.org/book.html#data.block.encoding.enable&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>bigdata</category>
      <category>apache</category>
      <category>hbase</category>
      <category>snappy</category>
    </item>
    <item>
      <title>Apache Storm - Slow Topology Upload</title>
      <dc:creator>Kevin Risden</dc:creator>
      <pubDate>Thu, 01 Nov 2018 14:00:00 +0000</pubDate>
      <link>https://dev.to/risdenk/apache-storm---slow-topology-upload-2gf0</link>
      <guid>https://dev.to/risdenk/apache-storm---slow-topology-upload-2gf0</guid>
<description>&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; This is an old post from my notes. It may not be applicable anymore, but I am sharing it in case it helps someone.&lt;/p&gt;

&lt;h3&gt;
  
  
  Overview
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://storm.apache.org/"&gt;Apache Storm&lt;/a&gt; after HDP 2.2 seems to have a hard time with large topology jars and takes a while to upload them. There have been a &lt;a href="https://mail-archives.apache.org/mod_mbox/storm-user/201603.mbox/%3CCAPC1M2i3OpKhC3n_+oTJke45Efuxq2PxMVurx71oEU-=Nqd9gQ@mail.gmail.com%3E"&gt;few&lt;/a&gt; &lt;a href="https://community.hortonworks.com/questions/24517/topology%C2%ADcode%C2%ADdistribution%C2%ADtakes%C2%ADtoo%C2%ADmuch%C2%ADtime.html"&gt;reports&lt;/a&gt; of Storm topology jars uploading slowly. I ran into this a few years ago. The fix is to increase the &lt;code&gt;nimbus.thrift.max_buffer_size&lt;/code&gt; setting.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fix
&lt;/h3&gt;

&lt;p&gt;Increase &lt;code&gt;nimbus.thrift.max_buffer_size&lt;/code&gt; from the default of 1048576 (1 MB) to 20485760 (roughly 20 MB).&lt;/p&gt;
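
&lt;p&gt;On a plain Apache Storm install this is a one-line change in &lt;code&gt;storm.yaml&lt;/code&gt; on the Nimbus host, followed by a Nimbus restart (a sketch; if you are on HDP, make the same change through Ambari instead):&lt;/p&gt;

```yaml
# storm.yaml on the Nimbus host (restart Nimbus after changing)
# Default is 1048576 (1 MB); raise it to accommodate large topology jars.
nimbus.thrift.max_buffer_size: 20485760
```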

&lt;h3&gt;
  
  
  References
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://mail-archives.apache.org/mod_mbox/storm-user/201403.mbox/%3CFC98EE12-4AED-4D06-9917-C449B96EB08A@gmail.com%3E"&gt;https://mail-archives.apache.org/mod_mbox/storm-user/201403.mbox/%3CFC98EE12-4AED-4D06-9917-C449B96EB08A@gmail.com%3E&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://stackoverflow.com/questions/27092653/storm-supervisor-connectivity-error-downloading-the-jar-from-nimbus"&gt;http://stackoverflow.com/questions/27092653/storm-supervisor-connectivity-error-downloading-the-jar-from-nimbus&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://qnalist.com/questions/4768442/nimbus-fails-after-uploading-topology-reading-too-large-of-frame-size"&gt;https://qnalist.com/questions/4768442/nimbus-fails-after-uploading-topology-reading-too-large-of-frame-size&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>bigdata</category>
      <category>apache</category>
      <category>storm</category>
      <category>topology</category>
    </item>
    <item>
      <title>Apache Solr - Apache Calcite Avatica Integration</title>
      <dc:creator>Kevin Risden</dc:creator>
      <pubDate>Tue, 30 Oct 2018 14:00:00 +0000</pubDate>
      <link>https://dev.to/risdenk/apache-solr---apache-calcite-avatica-integration-3hjb</link>
      <guid>https://dev.to/risdenk/apache-solr---apache-calcite-avatica-integration-3hjb</guid>
      <description>&lt;h3&gt;
  
  
  Overview
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://lucene.apache.org/solr"&gt;Apache Solr&lt;/a&gt; is a full text search engine that is built on &lt;a href="https://lucene.apache.org/solr/"&gt;Apache Lucene&lt;/a&gt;. One of the capabilities of Apache Solr is to handle SQL like statements. This was introduced in Solr 6.0 and refined in subsequent releases. Initially the SQL support used the &lt;a href="https://github.com/prestodb/presto/blob/master/presto-parser/src/main/java/com/facebook/presto/sql/parser/SqlParser.java"&gt;Presto SQL parser&lt;/a&gt;. This was replaced by &lt;a href="https://calcite.apache.org/"&gt;Apache Calcite&lt;/a&gt; due to Presto not having an optimizer. Calcite provides the ability to push down execution of SQL to Apache Solr.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://calcite.apache.org/avatica/"&gt;Apache Calcite Avatica&lt;/a&gt; is a subproject of Apache Calcite and provides a JDBC driver as well as JDBC server. The Avatica architecture diagram displays how this fits together.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--trGQvjnq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://raw.githubusercontent.com/julianhyde/share/master/slides/avatica-architecture.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--trGQvjnq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://raw.githubusercontent.com/julianhyde/share/master/slides/avatica-architecture.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Apache Solr and Apache Calcite Avatica
&lt;/h3&gt;

&lt;p&gt;Apache Solr has historically built its own JDBC driver implementation. This takes quite a bit of effort since the JDBC specification has a lot of methods that need to be implemented. &lt;a href="https://issues.apache.org/jira/browse/SOLR-9963"&gt;SOLR-9963&lt;/a&gt; was created to try to integrate Apache Calcite Avatica into Solr. This would provide an endpoint for the Avatica JDBC driver and remove the need for a separate Apache Solr JDBC driver implementation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Integrating Apache Calcite Avatica as an Apache Solr Handler
&lt;/h3&gt;

&lt;p&gt;Since Apache Calcite Avatica runs on Jetty, just like Apache Solr, I had the idea to add Avatica as just another handler in Solr. This would expose all the features of Avatica without changing any Solr internals. The Avatica handler could then use the existing Calcite engine within Apache Solr to handle the queries.&lt;/p&gt;

&lt;p&gt;I created &lt;a href="https://issues.apache.org/jira/browse/SOLR-9963"&gt;SOLR-9963&lt;/a&gt; and by early February 2017 I had a working example of the integration of Avatica and Solr. I was able to use the existing Avatica JDBC driver directly with Apache Solr without any issues. Sadly I haven’t had time to finish merging this change yet.&lt;/p&gt;

&lt;h3&gt;
  
  
  Testing Apache Solr with Apache Calcite Avatica Handler
&lt;/h3&gt;

&lt;p&gt;One of the cool features of Apache Calcite Avatica is that you can interact with it over pure REST with a JSON payload. I created a simple test script to show how this was possible even with Apache Solr.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;./test_avatica_solr.sh "http://localhost:8983/solr/test/avatica" "select * from test limit 10"&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;test_avatica_solr.sh&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#!/usr/bin/env bash

set -u
#set -x

AVATICA=$1
SQL=$2

CONNECTION_ID="conn-$(whoami)-$(date +%s)"
MAX_ROW_COUNT=100
NUM_ROWS=2
OFFSET=0

echo "Open connection"
curl -i -w "\n" "$AVATICA" -H "Content-Type: application/json" --data "{\"request\": \"openConnection\",\"connectionId\": \"${CONNECTION_ID}\"}"

# Example of how to set connection properties with info key
#curl -i "$AVATICA" -H "Content-Type: application/json" --data "{\"request\": \"openConnection\",\"connectionId\": \"${CONNECTION_ID}\",\"info\": {\"zk\": \"$ZK\",\"lex\": \"MYSQL\"}}"
echo

echo "Create statement"
STATEMENTRSP=$(curl -s "$AVATICA" -H "Content-Type: application/json" --data "{\"request\": \"createStatement\",\"connectionId\": \"${CONNECTION_ID}\"}")
STATEMENTID=$(echo "$STATEMENTRSP" | jq .statementId)
echo

echo "PrepareAndExecuteRequest"
curl -i -w "\n" "$AVATICA" -H "Content-Type: application/json" --data "{\"request\": \"prepareAndExecute\",\"connectionId\": \"${CONNECTION_ID}\",\"statementId\": $STATEMENTID,\"sql\": \"$SQL\",\"maxRowCount\": ${MAX_ROW_COUNT}, \"maxRowsInFirstFrame\": ${NUM_ROWS}}"
echo

# Loop through all the results
ISDONE=false
while ! $ISDONE; do
  OFFSET=$((OFFSET + NUM_ROWS))
  echo "FetchRequest - Offset=$OFFSET"
  FETCHRSP=$(curl -s "$AVATICA" -H "Content-Type: application/json" --data "{\"request\": \"fetch\",\"connectionId\": \"${CONNECTION_ID}\",\"statementId\": $STATEMENTID,\"offset\": ${OFFSET},\"fetchMaxRowCount\": ${NUM_ROWS}}")
  echo "$FETCHRSP"
  ISDONE=$(echo "$FETCHRSP" | jq .frame.done)
  echo
done

echo "Close statement"
curl -i -w "\n" "$AVATICA" -H "Content-Type: application/json" --data "{\"request\": \"closeStatement\",\"connectionId\": \"${CONNECTION_ID}\",\"statementId\": $STATEMENTID}"
echo

echo "Close connection"
curl -i -w "\n" "$AVATICA" -H "Content-Type: application/json" --data "{\"request\": \"closeConnection\",\"connectionId\": \"${CONNECTION_ID}\"}"
echo

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;h3&gt;
  
  
  What is next?
&lt;/h3&gt;

&lt;p&gt;If this feature looks interesting, add your thoughts to &lt;a href="https://issues.apache.org/jira/browse/SOLR-9963"&gt;SOLR-9963&lt;/a&gt;. If there is interest, we can work towards getting SOLR-9963 merged. The Apache Solr JDBC driver would then need to switch to wrapping an Avatica JDBC driver. Overall this should improve the SQL experience that comes with Apache Solr.&lt;/p&gt;

</description>
      <category>bigdata</category>
      <category>apache</category>
      <category>solr</category>
      <category>calcite</category>
    </item>
    <item>
      <title>Apache Solr - Leading Wildcard Queries and ReversedWildcardFilterFactory</title>
      <dc:creator>Kevin Risden</dc:creator>
      <pubDate>Thu, 25 Oct 2018 14:00:00 +0000</pubDate>
      <link>https://dev.to/risdenk/apache-solr---leading-wildcard-queries-and-reversedwildcardfilterfactory-48kc</link>
      <guid>https://dev.to/risdenk/apache-solr---leading-wildcard-queries-and-reversedwildcardfilterfactory-48kc</guid>
      <description>&lt;h3&gt;
  
  
  Overview
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://lucene.apache.org/solr"&gt;Apache Solr&lt;/a&gt; is a full text search engine that is built on &lt;a href="https://lucene.apache.org/solr/"&gt;Apache Lucene&lt;/a&gt;. Recently, I was looking into performance where the query had leading wildcards. There have been &lt;a href="https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201109.mbox/%3C1315989749353-3335240.post%40n3.nabble.com%3E"&gt;many&lt;/a&gt; &lt;a href="https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201502.mbox/%3CCACtr6ybiKq_nyTdBk_82%3DjErHc3jOkFhC_vEUP9ymcbgCkEm2Q%40mail.gmail.com%3E"&gt;questions&lt;/a&gt; over the years about leading wildcard queries. It was surprising to me that there are few references explaining what leading wildcard queries are and how they are implemented behind the scenes. There are also no references that explain how to verify that leading wildcards are being processed efficiently.&lt;/p&gt;

&lt;p&gt;So in this blog I cover the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What are leading wildcard queries?&lt;/li&gt;
&lt;li&gt;Why are leading wildcard queries inefficient?&lt;/li&gt;
&lt;li&gt;How to improve leading wildcard queries&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ReversedWildcardFilterFactory&lt;/code&gt; Implementation&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  What are leading wildcard queries?
&lt;/h3&gt;

&lt;p&gt;Leading wildcard queries are term queries that use the asterisk (&lt;code&gt;*&lt;/code&gt;) at the beginning of the term. For example, you could look for all colors that end in &lt;code&gt;ed&lt;/code&gt; with &lt;code&gt;color:*ed&lt;/code&gt;. The asterisk (&lt;code&gt;*&lt;/code&gt;) takes the place of one or more characters. There is another variation where the question mark (&lt;code&gt;?&lt;/code&gt;) is used as a placeholder for a single character. I am focusing on leading wildcard queries only and not trailing (i.e. &lt;code&gt;color:re*&lt;/code&gt;) or other combinations (i.e. &lt;code&gt;color:*e*&lt;/code&gt;). For more details, see the &lt;a href="https://lucene.apache.org/solr/guide/7_5/the-standard-query-parser.html#wildcard-searches"&gt;Apache Solr Reference Guide Wildcard Searches page&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why are leading wildcard queries inefficient?
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://lucene.apache.org/"&gt;Apache Lucene&lt;/a&gt;, the library that backs &lt;a href="https://lucene.apache.org/solr/"&gt;Apache Solr&lt;/a&gt; and &lt;a href="https://www.elastic.co/products/elasticsearch"&gt;Elasticsearch&lt;/a&gt;, is designed to search for tokens. &lt;a href="https://lucene.apache.org/core/7_5_0/test-framework/org/apache/lucene/analysis/Token.html"&gt;Tokens&lt;/a&gt; are the representation of a piece of text data after it has been &lt;a href="https://lucene.apache.org/solr/guide/7_5/understanding-analyzers-tokenizers-and-filters.html"&gt;tokenized and analyzed&lt;/a&gt;. Lucene is very good at exact matches since it can efficiently query the index for matches. When leading wildcards are involved, there is a lot more work that needs to be done since the index is not optimized for this type of lookup.&lt;/p&gt;

&lt;p&gt;A leading wildcard query must iterate through all of the terms in the index to see if they match the query. For even moderately sized indices this can be time-consuming. With the asterisk (&lt;code&gt;*&lt;/code&gt;) at the beginning of the query, matches can appear anywhere in the index, so the iteration through the terms cannot stop until it has gone through the entire index. The question mark (&lt;code&gt;?&lt;/code&gt;) can be significantly more performant since Lucene doesn’t have to check as many candidates. This full scan can also cause poor caching behavior if the index doesn’t fit in memory, among other problems.&lt;/p&gt;
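
&lt;p&gt;A toy way to see the difference: in a sorted term dictionary, a trailing wildcard like &lt;code&gt;re*&lt;/code&gt; corresponds to a contiguous prefix range the index can seek to, while a leading wildcard like &lt;code&gt;*ed&lt;/code&gt; has no such range and must test every term. The sketch below fakes a term dictionary with a shell variable and &lt;code&gt;grep&lt;/code&gt;:&lt;/p&gt;

```shell
# Toy "term dictionary" (sorted, like Lucene's term index)
TERMS='blue
bred
green
red'

# Trailing wildcard re* : equivalent to a prefix range, cheap to seek to
echo "$TERMS" | grep -c '^re'    # 1 match (red)

# Leading wildcard *ed : no usable prefix, every term must be checked
echo "$TERMS" | grep -c 'ed$'    # 2 matches (bred, red)
```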

&lt;h3&gt;
  
  
  How to improve leading wildcard queries
&lt;/h3&gt;

&lt;p&gt;The best way to improve leading wildcard queries is to remove them if possible. In many cases, there is a better way to handle the query through different tokenization or analysis. If the use case truly requires leading wildcard queries, there is one trick that can help: reverse the token during indexing, which effectively turns a leading wildcard query into a trailing wildcard query. A trailing wildcard query can be executed much more efficiently since only part of the index needs to be examined.&lt;/p&gt;

&lt;p&gt;Apache Solr has a token filter called the &lt;a href="https://lucene.apache.org/solr/7_5_0/solr-core/org/apache/solr/analysis/ReversedWildcardFilterFactory.html"&gt;&lt;code&gt;ReversedWildcardFilterFactory&lt;/code&gt;&lt;/a&gt; that emits reversed tokens. This can be used when constructing fieldTypes for fields that may need to handle leading wildcard queries. There is an example of this in the &lt;code&gt;_default&lt;/code&gt; &lt;a href="https://lucene.apache.org/solr/guide/7_5/config-sets.html"&gt;config set&lt;/a&gt; called &lt;a href="https://github.com/apache/lucene-solr/blob/branch_7_5/solr/server/solr/configsets/_default/conf/managed-schema#L440"&gt;&lt;code&gt;text_general_rev&lt;/code&gt;&lt;/a&gt;. This shows how to configure the &lt;code&gt;ReversedWildcardFilterFactory&lt;/code&gt; for a field. It is important to note that the &lt;a href="https://lucene.apache.org/solr/guide/7_5/analyzers.html#analysis-phases"&gt;index and query analyzer phases&lt;/a&gt; are different. The &lt;code&gt;ReversedWildcardFilterFactory&lt;/code&gt; MUST only be implemented as an index analyzer. The query side is handled automatically.&lt;/p&gt;

&lt;p&gt;For reference, here is the &lt;code&gt;text_general_rev&lt;/code&gt; fieldType definition:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;fieldType name="text_general_rev" class="solr.TextField" positionIncrementGap="100"&amp;gt;
    &amp;lt;analyzer type="index"&amp;gt;
        &amp;lt;tokenizer class="solr.StandardTokenizerFactory"/&amp;gt;
        &amp;lt;filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" /&amp;gt;
        &amp;lt;filter class="solr.LowerCaseFilterFactory"/&amp;gt;
        &amp;lt;filter class="solr.ReversedWildcardFilterFactory" withOriginal="true"
                maxPosAsterisk="3" maxPosQuestion="2" maxFractionAsterisk="0.33"/&amp;gt;
      &amp;lt;/analyzer&amp;gt;
      &amp;lt;analyzer type="query"&amp;gt;
        &amp;lt;tokenizer class="solr.StandardTokenizerFactory"/&amp;gt;
        &amp;lt;filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/&amp;gt;
        &amp;lt;filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" /&amp;gt;
        &amp;lt;filter class="solr.LowerCaseFilterFactory"/&amp;gt;
    &amp;lt;/analyzer&amp;gt;
&amp;lt;/fieldType&amp;gt;

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;code&gt;ReversedWildcardFilterFactory&lt;/code&gt; Implementation
&lt;/h3&gt;

&lt;p&gt;When the &lt;code&gt;ReversedWildcardFilterFactory&lt;/code&gt; is set up for a field in Solr, two tokens are emitted for each term during indexing: the original and the reversed form. The screenshot below shows the &lt;a href="https://lucene.apache.org/solr/guide/7_5/analysis-screen.html"&gt;Analysis tab&lt;/a&gt; and how the tokens are created for a simple string &lt;code&gt;abcdefg&lt;/code&gt;.&lt;/p&gt;
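
&lt;p&gt;The reversal itself is easy to picture with the Unix &lt;code&gt;rev&lt;/code&gt; utility (a toy sketch; internally Solr also prepends a marker character to reversed tokens so they don’t collide with normal terms):&lt;/p&gt;

```shell
# The indexed reversed form of the token abcdefg
echo 'abcdefg' | rev    # prints: gfedcba

# A leading wildcard query *fg, reversed, becomes the trailing wildcard gf*
echo '*fg' | rev        # prints: gf*
```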

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--7O3b8wHB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://risdenk.github.io/images/posts/2018-10-25/test_analysis.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7O3b8wHB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://risdenk.github.io/images/posts/2018-10-25/test_analysis.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The extra reversed tokens will increase the index size, but this is usually an acceptable tradeoff since the other option is very slow leading wildcard queries.&lt;/p&gt;

&lt;p&gt;When a query uses a field with &lt;code&gt;ReversedWildcardFilterFactory&lt;/code&gt;, Solr &lt;a href="https://github.com/apache/lucene-solr/blob/branch_7_5/solr/core/src/java/org/apache/solr/parser/SolrQueryParserBase.java#L1192"&gt;internally evaluates&lt;/a&gt; whether to search for the original or reversed query string. One annoying part is that, since this optimization is internal to Solr, there is no indication to the user that the query string was reversed. Even with &lt;a href="https://lucene.apache.org/solr/guide/7_5/common-query-parameters.html"&gt;&lt;code&gt;debug=true&lt;/code&gt;&lt;/a&gt;, the parsed query looks the same since the &lt;a href="https://github.com/apache/lucene-solr/blob/branch_7_5/solr/core/src/java/org/apache/solr/parser/SolrQueryParserBase.java#L1213"&gt;&lt;code&gt;AutomatonQuery#toString()&lt;/code&gt; method&lt;/a&gt; doesn’t provide information on the automaton. The screenshot below shows a leading wildcard query with no indication that it is working correctly.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--AO3x9prp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://risdenk.github.io/images/posts/2018-10-25/wildcard_debug_query.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--AO3x9prp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://risdenk.github.io/images/posts/2018-10-25/wildcard_debug_query.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I was able to confirm the expected behavior by remotely debugging a running Solr server. This showed that the query was properly reversing the automaton based on the parameters for the &lt;a href="https://lucene.apache.org/solr/7_5_0/solr-core/org/apache/solr/analysis/ReversedWildcardFilterFactory.html"&gt;&lt;code&gt;ReversedWildcardFilterFactory&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The only place I’ve been able to find in the Solr UI that shows the &lt;code&gt;ReversedWildcardFilterFactory&lt;/code&gt; actually did anything is the &lt;a href="https://lucene.apache.org/solr/guide/7_5/schema-browser-screen.html"&gt;Schema section&lt;/a&gt; under the collection. Select the field and then click the “Load Term Info” button to get details about the underlying terms. The screenshot below shows the terms for the &lt;code&gt;a_txt_rev&lt;/code&gt; field.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--rMwpMrPB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://risdenk.github.io/images/posts/2018-10-25/a_txt_rev_terms.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--rMwpMrPB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://risdenk.github.io/images/posts/2018-10-25/a_txt_rev_terms.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;Solr and the &lt;code&gt;ReversedWildcardFilterFactory&lt;/code&gt; can help improve the performance of leading wildcard queries if they are absolutely required. When I’ve explained over the years that &lt;code&gt;ReversedWildcardFilterFactory&lt;/code&gt; would solve leading wildcard issues, I hadn’t looked at the internals. This post forced me to look at the internals of how Lucene and Solr work with leading wildcards. I checked multiple versions of Solr (4.3.x, 4.10.x, 5.5.x, 6.3.x, and 7.5.x) initially thinking that the query was not using the reversed tokens. It wasn’t until I used a debugger that I could convince myself the query was being handled properly. Better debug logging for this case would have helped tremendously.&lt;/p&gt;

&lt;h4&gt;
  
  
  Solr Setup Reference
&lt;/h4&gt;

&lt;p&gt;I used the following to set up Apache Solr and reproduce all the screenshots above. There are also command line versions for gathering the same information programmatically.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;./bin/solr start -c
./bin/solr create -c test -n basic_configs
echo '1,abcdefg,abcdefg' | ./bin/post -c test -type text/csv -params "fieldnames=id,a_txt,a_txt_rev" -d
curl "http://localhost:8983/solr/test/select?q=*:*"
curl "http://localhost:8983/solr/test/select?q=a_txt:abcdefg&amp;amp;debug=true"
curl "http://localhost:8983/solr/test/select?q=a_txt_rev:abcdefg&amp;amp;debug=true"
curl "http://localhost:8983/solr/test/admin/luke?fl=a_txt_rev&amp;amp;numTerms=2" 

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



</description>
      <category>bigdata</category>
      <category>apache</category>
      <category>solr</category>
      <category>lucene</category>
    </item>
    <item>
      <title>Apache Solr - Running on Apache Hadoop HDFS</title>
      <dc:creator>Kevin Risden</dc:creator>
      <pubDate>Tue, 23 Oct 2018 14:00:00 +0000</pubDate>
      <link>https://dev.to/risdenk/apache-solr---running-on-apache-hadoop-hdfs-17pc</link>
      <guid>https://dev.to/risdenk/apache-solr---running-on-apache-hadoop-hdfs-17pc</guid>
      <description>&lt;h3&gt;
  
  
  Overview
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://lucene.apache.org/solr"&gt;Apache Solr&lt;/a&gt; is a full text search engine that is built on &lt;a href="https://lucene.apache.org/solr/"&gt;Apache Lucene&lt;/a&gt;. I’ve been working with Apache Solr for the past six years. Some of these were pure Solr installations, but many were integrated with Apache Hadoop. This includes both Hortonworks HDP Search as well as Cloudera Search. Performance for Solr on HDFS is a common question so writing this post to help share some of my experience.&lt;/p&gt;

&lt;h3&gt;
  
  
  Apache Hadoop HDFS
&lt;/h3&gt;

&lt;p&gt;Apache Hadoop contains a filesystem called Hadoop Distributed File System (HDFS). HDFS is designed to scale to petabytes of data on commodity hardware. The definition of commodity hardware has changed over the years, but the premise is that the latest and greatest hardware is not needed. HDFS is used by a variety of workloads from Apache HBase to Apache Spark. Performance on HDFS tends to favor large files for both reading and writing. HDFS also uses all available disks for I/O which can be helpful for large clusters.&lt;/p&gt;

&lt;h3&gt;
  
  
  Apache Solr and HDFS
&lt;/h3&gt;

&lt;p&gt;Apache Solr has been able to run on HDFS since the early 4.x versions. Cloudera Search added this capability to be able to use the existing HDFS storage for search. Hortonworks HDP Search, since it is based on Apache Solr, supports HDFS as well. Since HDFS is not a local filesystem, Apache Solr implements a block cache that is designed to cache HDFS blocks in memory. With the HDFS block cache, query performance can be slower than but comparable to local indices. The HDFS block cache is not used for merging, indexing, or read-once use cases, which means there are some areas where Apache Solr on HDFS can be slower.&lt;/p&gt;

&lt;h3&gt;
  
  
  Apache Solr Performance
&lt;/h3&gt;

&lt;p&gt;If you are looking for the best performance with the fewest variations, then SSDs and ample memory are where you should be looking. If you are budget constrained, then spinning disks with enough memory can also provide adequate performance. Solr on HDFS can perform just as well as local disks given the right amount of memory. The common “it depends” caveat comes down to the specific use case. For large-scale analytics, Solr on HDFS performs well. For high-speed indexing, you will need SSDs, since the write performance of Solr on HDFS will not match them.&lt;/p&gt;

&lt;p&gt;Most of the time when dealing with Solr performance issues, I found that the underlying hardware was not the problem. Typically the way the data is indexed or queried has the biggest impact on performance. The standard debugging and improvement techniques there help across all types of hardware.&lt;/p&gt;

&lt;h3&gt;
  
  
  Apache Solr on HDFS - Best Practices
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Shutdown Apache Solr Cleanly
&lt;/h4&gt;

&lt;p&gt;Make sure you give Apache Solr plenty of time to shutdown cleanly. Older versions of the &lt;code&gt;solr&lt;/code&gt; script waited only 5 seconds before shutting down. Increase the sleep time to ensure that you do not leave &lt;code&gt;write.lock&lt;/code&gt; files on HDFS from an unclean shutdown.&lt;/p&gt;
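
&lt;p&gt;Newer versions of the &lt;code&gt;solr&lt;/code&gt; script expose this as a variable in &lt;code&gt;solr.in.sh&lt;/code&gt; (a sketch; on older versions you may have to edit the sleep in the script itself):&lt;/p&gt;

```shell
# solr.in.sh - give Solr up to 180 seconds to stop before being killed
SOLR_STOP_WAIT=180
```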

&lt;h4&gt;
  
  
  Ulimits must be configured correctly
&lt;/h4&gt;

&lt;p&gt;Ensure that you have the proper ulimits for the user running Solr. Ulimits that are too low will cause hard-to-debug failures once Solr runs out of file handles or processes.&lt;/p&gt;
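
&lt;p&gt;For example, assuming Solr runs as a user named &lt;code&gt;solr&lt;/code&gt;, something like the following in &lt;code&gt;/etc/security/limits.conf&lt;/code&gt; raises the open file and process limits (the exact values depend on your workload):&lt;/p&gt;

```shell
# /etc/security/limits.conf - "solr" service user is an assumption
solr soft nofile 65000
solr hard nofile 65000
solr soft nproc 65000
solr hard nproc 65000
```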

&lt;h4&gt;
  
  
  Use a Zookeeper chroot
&lt;/h4&gt;

&lt;p&gt;With Apache Hadoop, many different pieces of software use ZooKeeper. It will help keep things organized if you use a chroot specifically for Solr.&lt;/p&gt;
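
&lt;p&gt;For example (a sketch; the ZooKeeper host names are placeholders), create the chroot once and then reference it in &lt;code&gt;ZK_HOST&lt;/code&gt;:&lt;/p&gt;

```shell
# Create the chroot znode once (bin/solr zk mkroot exists in recent versions)
./bin/solr zk mkroot /solr -z zk1:2181,zk2:2181,zk3:2181

# solr.in.sh - note the /solr chroot suffix on the connection string
ZK_HOST="zk1:2181,zk2:2181,zk3:2181/solr"
```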

&lt;h4&gt;
  
  
  Make a directory on HDFS for Solr
&lt;/h4&gt;

&lt;p&gt;Make a directory on HDFS for Solr that isn’t used for anything else. This will make sure you don’t cause problems with other processes reading/writing from that location. It also makes it possible to set permissions to ensure only the Solr user has access.&lt;/p&gt;
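
&lt;p&gt;For example (a sketch matching the &lt;code&gt;/apps/solr&lt;/code&gt; path used in the example configuration; the owner and permissions are assumptions):&lt;/p&gt;

```shell
# Create a dedicated HDFS directory that only the Solr user can access
hdfs dfs -mkdir -p /apps/solr
hdfs dfs -chown solr:solr /apps/solr
hdfs dfs -chmod 750 /apps/solr
```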

&lt;h4&gt;
  
  
  HDFS Block Cache must be tuned
&lt;/h4&gt;

&lt;p&gt;Ensure that the HDFS Block Cache is enabled and that it is tuned properly. By default the block cache does not have enough slabs for good performance. Each slab of the HDFS block cache is by default 128MB (&lt;code&gt;solr.hdfs.blockcache.blocksperbank&lt;/code&gt;: 16384 * 8KB). The number of slabs determines how much memory will be used for caching. Since the HDFS block cache is stored off heap, Java must also be allowed to allocate up to that amount of direct memory with &lt;code&gt;-XX:MaxDirectMemorySize&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Here is a handy table showing the relationship between the number of slabs, &lt;code&gt;-XX:MaxDirectMemorySize&lt;/code&gt;, and the HDFS block cache size.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;code&gt;-Dsolr.hdfs.blockcache.slab.count&lt;/code&gt;&lt;/th&gt;
&lt;th&gt;&lt;code&gt;-XX:MaxDirectMemorySize&lt;/code&gt;&lt;/th&gt;
&lt;th&gt;HDFS Block Cache Size&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;250MB&lt;/td&gt;
&lt;td&gt;128MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;2GB&lt;/td&gt;
&lt;td&gt;1GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;20&lt;/td&gt;
&lt;td&gt;4GB&lt;/td&gt;
&lt;td&gt;2.5GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;40&lt;/td&gt;
&lt;td&gt;8GB&lt;/td&gt;
&lt;td&gt;5GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;15GB&lt;/td&gt;
&lt;td&gt;12.5GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;200&lt;/td&gt;
&lt;td&gt;30GB&lt;/td&gt;
&lt;td&gt;25GB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;When configured correctly, Solr will print out a calculation of the memory required in the logs like so:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Block cache target memory usage, slab size of [134217728] will allocate [40] slabs and use ~[5368709120] bytes

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
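&lt;p&gt;The numbers in that log line can be reproduced by hand: each slab is &lt;code&gt;solr.hdfs.blockcache.blocksperbank&lt;/code&gt; (16384) blocks of 8KB, and the total is the slab size times the slab count.&lt;/p&gt;

```shell
# Reproduce Solr's block cache sizing math for 40 slabs
slabs=40
slab_bytes=$((16384 * 8192))        # blocksperbank * 8KB block = 134217728 (128MB)
total_bytes=$((slabs * slab_bytes)) # 5368709120 bytes, ~5GB
echo "$slab_bytes $total_bytes"
```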



&lt;h4&gt;
  
  
  Ensure that HDFS Short Circuit Reads are enabled
&lt;/h4&gt;

&lt;p&gt;HDFS Short Circuit Reads allow an HDFS client running on the same node as the data to read block files directly from local disk, bypassing the DataNode and avoiding a network round trip. This can significantly improve read performance.&lt;/p&gt;
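&lt;p&gt;To check whether the HDFS client configuration that Solr will pick up has short circuit reads enabled, the same &lt;code&gt;hdfs getconf&lt;/code&gt; trick used for &lt;code&gt;fs.defaultFS&lt;/code&gt; in the example configuration works. These are the standard HDFS property names; the values depend on your distribution:&lt;/p&gt;

```shell
# Short circuit reads must be enabled in the HDFS client config
hdfs getconf -confKey dfs.client.read.shortcircuit   # should print true

# They also require a DataNode domain socket path to be configured
hdfs getconf -confKey dfs.domain.socket.path
```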

&lt;h4&gt;
  
  
  Example Configuration
&lt;/h4&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Solr HDFS - setup
## Use HDFS by default for its collection data and tlogs
SOLR_OPTS="$SOLR_OPTS -Dsolr.directoryFactory=HdfsDirectoryFactory \
    -Dsolr.lock.type=hdfs \
    -Dsolr.hdfs.home=$(hdfs getconf -confKey fs.defaultFS)/apps/solr \
    -Dsolr.hdfs.confdir=/etc/hadoop/conf"

## If HDFS Kerberos enabled, uncomment the following
#SOLR_OPTS="$SOLR_OPTS -Dsolr.hdfs.security.kerberos.enabled=true \
# -Dsolr.hdfs.security.kerberos.keytabfile=/etc/security/keytabs/solr.keytab \
# -Dsolr.hdfs.security.kerberos.principal=solr@REALM"

# Solr HDFS - performance
## Enable the HDFS Block Cache to take the place of memory mapping files
SOLR_OPTS="$SOLR_OPTS -Dsolr.hdfs.blockcache.enabled=true \
    -Dsolr.hdfs.blockcache.global=true \
    -Dsolr.hdfs.blockcache.read.enabled=true \
    -Dsolr.hdfs.blockcache.write.enabled=false"

## Size the HDFS Block Cache
SOLR_OPTS="$SOLR_OPTS -Dsolr.hdfs.blockcache.blocksperbank=16384 \
    -Dsolr.hdfs.blockcache.slab.count=200"

## Enable direct memory allocation to allocate HDFS Block Cache off heap
SOLR_OPTS="$SOLR_OPTS -Dsolr.hdfs.blockcache.direct.memory.allocation=true \
    -XX:MaxDirectMemorySize=30g -XX:+UseLargePages"

## Enable HDFS Short Circuit reads if possible
### Note: This path is different for Cloudera. It must be the path to the HDFS native libraries
SOLR_OPTS="$SOLR_OPTS -Djava.library.path=:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64:/usr/hdp/current/hadoop-client/lib/native"

## If Near Real Time (NRT) enable HDFS NRT caching directory, uncomment the following
#SOLR_OPTS="$SOLR_OPTS -Dsolr.hdfs.nrtcachingdirectory.enable=true \
# -Dsolr.hdfs.nrtcachingdirectory.maxmergesizemb=16 \
# -Dsolr.hdfs.nrtcachingdirectory.maxcachedmb=192"

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;It is possible to get reasonable performance out of Apache Solr running on Apache Hadoop HDFS. If the budget allows, SSDs will give better performance for both indexing and querying. Given a proper amount of memory, however, even spinning disks will deliver adequate performance for Apache Solr.&lt;/p&gt;

&lt;h4&gt;
  
  
  References
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://wiki.apache.org/solr/SolrPerformanceProblems#SSD"&gt;https://wiki.apache.org/solr/SolrPerformanceProblems#SSD&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://sbdevel.wordpress.com/2013/06/06/memory-is-overrated/"&gt;https://sbdevel.wordpress.com/2013/06/06/memory-is-overrated/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://community.hortonworks.com/questions/27567/write-performance-in-hdfs.html"&gt;https://community.hortonworks.com/questions/27567/write-performance-in-hdfs.html&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.cloudera.com/blog/2014/03/the-truth-about-mapreduce-performance-on-ssds/"&gt;https://blog.cloudera.com/blog/2014/03/the-truth-about-mapreduce-performance-on-ssds/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://community.hortonworks.com/questions/4858/solrcloud-performance-hdfs-indexdata.html"&gt;https://community.hortonworks.com/questions/4858/solrcloud-performance-hdfs-indexdata.html&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.slideshare.net/lucidworks/solr-on-hdfs-final-mark-miller"&gt;https://www.slideshare.net/lucidworks/solr-on-hdfs-final-mark-miller&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://issues.apache.org/jira/browse/SOLR-7393"&gt;https://issues.apache.org/jira/browse/SOLR-7393&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://blog.cloudera.com/blog/2017/06/apache-solr-memory-tuning-for-production/"&gt;http://blog.cloudera.com/blog/2017/06/apache-solr-memory-tuning-for-production/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://blog.cloudera.com/blog/2017/06/solr-memory-tuning-for-production-part-2/"&gt;http://blog.cloudera.com/blog/2017/06/solr-memory-tuning-for-production-part-2/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://community.plm.automation.siemens.com/t5/Developer-Space/Running-Solr-on-S3/td-p/449360"&gt;https://community.plm.automation.siemens.com/t5/Developer-Space/Running-Solr-on-S3/td-p/449360&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>bigdata</category>
      <category>apache</category>
      <category>solr</category>
      <category>hadoop</category>
    </item>
  </channel>
</rss>
