<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Kevin Risden</title>
    <description>The latest articles on DEV Community by Kevin Risden (@risdenk).</description>
    <link>https://dev.to/risdenk</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F63424%2Fab7ba335-8536-4ffc-a378-a9a0889a61d2.jpeg</url>
      <title>DEV Community: Kevin Risden</title>
      <link>https://dev.to/risdenk</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/risdenk"/>
    <language>en</language>
    <item>
      <title>My Development Environment 2018</title>
      <dc:creator>Kevin Risden</dc:creator>
      <pubDate>Thu, 06 Dec 2018 14:00:00 +0000</pubDate>
      <link>https://dev.to/risdenk/my-development-environment-2018-2b5p</link>
      <guid>https://dev.to/risdenk/my-development-environment-2018-2b5p</guid>
      <description>&lt;h3&gt;
  
  
  Overview
&lt;/h3&gt;

&lt;p&gt;I was asked the other day what my development environment looks like, since I am able to test a lot of different configurations quickly. I am writing this post to capture some of what I do to iterate quickly. First, some background on why it has historically been important for me to be able to switch test environments quickly.&lt;/p&gt;

&lt;h4&gt;
  
  
  Background
&lt;/h4&gt;

&lt;p&gt;I previously worked as a software consultant with &lt;a href="https://www.avalonconsult.com/"&gt;Avalon Consulting, LLC&lt;/a&gt;. We worked on a variety of projects for a number of different clients. Some of the projects were long and others were shorter. I focused primarily on big data and search. &lt;a href="https://hadoop.apache.org/"&gt;Apache Hadoop&lt;/a&gt; with security has a lot of different configurations. It wasn’t practical to spin up cloud environments (hotel wifi sucks) for each little test. This meant I needed to find a way to test things on my 8GB Macbook Pro.&lt;/p&gt;

&lt;h3&gt;
  
  
  Development Laptop
&lt;/h3&gt;

&lt;p&gt;I currently have two laptops for development: a 2012 Macbook Pro with 8GB of RAM that is starting to show its age but was worth every penny, and a work laptop that I won’t go into in much detail. Both laptops are configured very similarly. Key software includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.iterm2.com/"&gt;iTerm2&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://brew.sh/"&gt;Homebrew&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.zsh.org/"&gt;Zsh&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/robbyrussell/oh-my-zsh"&gt;oh-my-zsh&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://git-scm.com/"&gt;git&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.docker.com/docker-for-mac/"&gt;Docker for Mac&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.virtualbox.org/"&gt;VirtualBox&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.vagrantup.com/"&gt;Vagrant&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.jetbrains.com/idea/"&gt;IntelliJ IDEA Ultimate&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.google.com/chrome/"&gt;Chrome&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I use my terminal quite a bit. I use it for git, ssh, docker, vagrant, etc. I typically leave my terminal up at all times since I am usually running something. I jump between Docker and Vagrant/Virtualbox quite a bit. There are lots of security and distributed computing setups where proper hostnames and DNS resolution work better with full virtual machines. There are fewer gotchas if you know you are working with “real” machines instead of fighting with Docker networking and DNS.&lt;/p&gt;
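&lt;p&gt;As an illustration of the full-VM approach, here is a minimal multi-machine Vagrantfile sketch (the box name, hostnames, and IPs are assumptions for illustration, not my exact setup) where each VM gets a real hostname and a private-network address:&lt;/p&gt;

```ruby
# Hypothetical multi-VM sketch: each machine gets a real hostname and a
# private-network IP, so DNS/hostname-sensitive software behaves normally.
Vagrant.configure("2") do |config|
  config.vm.box = "centos/7"

  { "master"  => "192.168.56.10",
    "worker1" => "192.168.56.11" }.each do |name, ip|
    config.vm.define name do |node|
      node.vm.hostname = "#{name}.cluster.test"
      node.vm.network "private_network", ip: ip
      node.vm.provider "virtualbox" do |vb|
        vb.memory = 1024 # keep each VM small so several fit in 8GB of RAM
      end
    end
  end
end
```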

&lt;p&gt;I owe a big shoutout to &lt;a href="https://travis-ci.com/"&gt;Travis CI&lt;/a&gt; since I use them a lot for my open source projects. I typically push a git branch to my GitHub fork and let Travis CI go to work. This allows me to work on multiple things at once when tests take tens of minutes.&lt;/p&gt;
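&lt;p&gt;The branch-and-push workflow needs almost no CI configuration; a hedged sketch of a minimal &lt;code&gt;.travis.yml&lt;/code&gt; for a JVM project (language, JDK, and build command are assumptions) looks like this:&lt;/p&gt;

```yaml
# Hypothetical minimal Travis CI configuration for a Maven-based project
language: java
jdk:
  - openjdk8
script:
  - mvn -B verify  # build and run the test suite for every pushed branch
```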

&lt;h3&gt;
  
  
  Intel NUC Server
&lt;/h3&gt;

&lt;p&gt;I recently added an &lt;a href="https://simplynuc.com/8i5beh-kit/"&gt;Intel NUC&lt;/a&gt; to my development setup to help offload some of the long running tests from my laptop. It also has more RAM and CPU power, which allows me to run continuous integration jobs as well as more Vagrant VMs. Some of the software I have running on my Intel NUC (mostly as Docker containers):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://en.wikipedia.org/wiki/Dnsmasq"&gt;Dnsmasq&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://jenkins.io/"&gt;Jenkins&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.sonatype.com/download-oss-sonatype"&gt;Nexus 3&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://gogs.io/"&gt;Gogs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.sonarqube.org/"&gt;Sonarqube&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Dnsmasq ensures that I get consistent DNS both on my Intel NUC and within my private network. Jenkins runs most of my continuous integration builds. It keeps track of logs and lets me spin up jobs for different purposes (like repeatedly testing a feature branch). Jenkins spins up a separate Docker container for each build so I don’t have to worry about dependency conflicts. Nexus lets me cache Maven repositories, Docker images, static files, and more, so I don’t have to redownload the same dependencies over and over again. Gogs is a standalone Git server that painlessly lets me mirror repos internally, which saves me from pulling big repos over the internet repeatedly. Sonarqube runs additional static analysis checks against some of the Jenkins builds.&lt;/p&gt;
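&lt;p&gt;Most of these services are a single container each. A hedged &lt;code&gt;docker-compose.yml&lt;/code&gt; sketch for two of them (image tags, ports, and volumes are assumptions, not my exact setup):&lt;/p&gt;

```yaml
# Hypothetical compose file for two of the services listed above
version: "3"
services:
  jenkins:
    image: jenkins/jenkins:lts
    ports:
      - "8080:8080"
    volumes:
      - jenkins_home:/var/jenkins_home
  nexus:
    image: sonatype/nexus3
    ports:
      - "8081:8081"
    volumes:
      - nexus_data:/nexus-data
volumes:
  jenkins_home:
  nexus_data:
```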

&lt;h3&gt;
  
  
  Yubikey
&lt;/h3&gt;

&lt;p&gt;I want to talk a little bit about my use of a Yubikey. I had been thinking about getting one for a few years and finally got one when Yubikey 5 came out. I use it all the time now for GPG and SSH. I don’t have to store any private keys on new devices and can even SSH from a Chromebook back to my server if necessary. I configured my Yubikey to handle GPG for both signing and authentication, which allows me to use GPG with SSH as well. The GPG agent takes a little configuring, but once set up you can easily use it for both GPG and SSH. I wish more websites supported U2F instead of OATH/authenticator codes. I like the simplicity and would recommend a Yubikey to most developers.&lt;/p&gt;
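&lt;p&gt;For reference, the gpg-agent side of that setup amounts to only a few lines. This is a hedged sketch (the pinentry path varies by install and is an assumption):&lt;/p&gt;

```
# ~/.gnupg/gpg-agent.conf
enable-ssh-support
pinentry-program /usr/local/bin/pinentry-mac

# In the shell profile, point SSH at the gpg-agent socket:
export SSH_AUTH_SOCK="$(gpgconf --list-dirs agent-ssh-socket)"
gpgconf --launch gpg-agent
```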

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;My setup hasn’t changed too much over the past 5 years when it comes to development laptops. I have started to use more cloud-based automated testing like Travis CI. I added the Intel NUC to be able to do more testing internally across bigger VMs. I will say that I have learned more trying to fit a distributed system on an 8GB RAM laptop than from anything else. (Who else can say they have run Hadoop on 3 Linux VMs and 1 Windows AD VM in 8GB of RAM?) Who knows what is to come in 2019, but I am happy and productive with what I have in 2018.&lt;/p&gt;

</description>
      <category>development</category>
      <category>environment</category>
      <category>2018</category>
    </item>
    <item>
      <title>Apache Hadoop YARN - “Vulnerability” FUD</title>
      <dc:creator>Kevin Risden</dc:creator>
      <pubDate>Tue, 04 Dec 2018 14:00:00 +0000</pubDate>
      <link>https://dev.to/risdenk/apache-hadoop-yarn---vulnerability-fud-936</link>
      <guid>https://dev.to/risdenk/apache-hadoop-yarn---vulnerability-fud-936</guid>
      <description>&lt;h3&gt;
  
  
  Overview
&lt;/h3&gt;

&lt;p&gt;There are reports of an &lt;a href="http://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/YARN.html"&gt;Apache Hadoop YARN&lt;/a&gt; “vulnerability”, but I want to share some details that have been missed by the few articles I’ve come across. Here are a few of the articles/links:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.extrahop.com/company/blog/2018/detect-demonbot-exploiting-hadoop-yarn-remote-code-execution/"&gt;https://www.extrahop.com/company/blog/2018/detect-demonbot-exploiting-hadoop-yarn-remote-code-execution/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/vulhub/vulhub/blob/master/hadoop/unauthorized-yarn/exploit.py"&gt;https://github.com/vulhub/vulhub/blob/master/hadoop/unauthorized-yarn/exploit.py&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Demonbot vulnerability requires an &lt;strong&gt;unsecured&lt;/strong&gt; cluster
&lt;/h3&gt;

&lt;p&gt;The key point I want to make is that the reports mislead the reader into assuming that all Apache Hadoop YARN environments are insecure. This is &lt;strong&gt;false&lt;/strong&gt;. The clusters described have no security enabled and are akin to a house with the front door unlocked. Kerberized clusters are secure since they require a valid user account to be usable. Furthermore, clusters should not be exposed to the internet for most use cases (especially not endpoints that allow remote job submission).&lt;/p&gt;
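&lt;p&gt;For context, “Kerberized” starts with a couple of core-site.xml properties. This is a minimal sketch; a real deployment also needs keytabs, principals, and per-service configuration:&lt;/p&gt;

```
&lt;!-- core-site.xml (minimal sketch) --&gt;
&lt;property&gt;
  &lt;name&gt;hadoop.security.authentication&lt;/name&gt;
  &lt;value&gt;kerberos&lt;/value&gt;
&lt;/property&gt;
&lt;property&gt;
  &lt;name&gt;hadoop.security.authorization&lt;/name&gt;
  &lt;value&gt;true&lt;/value&gt;
&lt;/property&gt;
```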

&lt;h3&gt;
  
  
  Explain the “vulnerability” like I’m five
&lt;/h3&gt;

&lt;p&gt;Imagine that one day you get home and find a whole bunch of extra lamps plugged into your outlets. You are annoyed because the lamps are using your electricity. You remember that you forgot to lock your door before you went on vacation. Instead of someone stealing stuff from your home, they decided to plug in lamps.&lt;/p&gt;

&lt;p&gt;Now you might be thinking that it is expected that something bad would happen if you left your door unlocked while on vacation. This is exactly the same situation as an unsecured Apache Hadoop YARN cluster. No one should leave their cluster unsecured and exposed to the outside world.&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;There have been multiple reports of “big data” endpoints being exposed to the internet without being secured. This has affected Elasticsearch, MongoDB, and others. There is no reason to expose a cluster to the internet without security. Cloudera wrote a blog post that covers the same topic &lt;a href="https://blog.cloudera.com/blog/2018/11/protecting-hadoop-clusters-from-malware-attacks/"&gt;here&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>bigdata</category>
      <category>apache</category>
      <category>hadoop</category>
      <category>yarn</category>
    </item>
    <item>
      <title>Apache Solr - Hide/Redact Sensitive Properties</title>
      <dc:creator>Kevin Risden</dc:creator>
      <pubDate>Tue, 27 Nov 2018 14:00:00 +0000</pubDate>
      <link>https://dev.to/risdenk/apache-solr---hideredact-senstive-properties-56ca</link>
      <guid>https://dev.to/risdenk/apache-solr---hideredact-senstive-properties-56ca</guid>
      <description>&lt;h3&gt;
  
  
  Overview
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://lucene.apache.org/solr" rel="noopener noreferrer"&gt;Apache Solr&lt;/a&gt; is a full text search engine that is built on &lt;a href="https://lucene.apache.org/solr/" rel="noopener noreferrer"&gt;Apache Lucene&lt;/a&gt;. One of the common questions on the &lt;a href="http://lucene.apache.org/solr/community.html#mailing-lists-irc" rel="noopener noreferrer"&gt;solr-user&lt;/a&gt; mailing list (ie: &lt;a href="http://lucene.472066.n3.nabble.com/Disabling-jvm-properties-from-ui-td4413066.html" rel="noopener noreferrer"&gt;here&lt;/a&gt; and &lt;a href="http://lucene.472066.n3.nabble.com/jira-Commented-SOLR-11369-Zookeeper-credentials-are-showed-up-on-the-Solr-Admin-GUI-td4405383.html" rel="noopener noreferrer"&gt;here&lt;/a&gt;) is how to hide sensitive values from the &lt;a href="https://lucene.apache.org/solr/guide/7_5/overview-of-the-solr-admin-ui.html" rel="noopener noreferrer"&gt;Solr UI&lt;/a&gt;. There is a little known setting that enables hiding these sensitive values.&lt;/p&gt;

&lt;h3&gt;
  
  
  Apache Solr and Hiding Sensitive Properties
&lt;/h3&gt;

&lt;p&gt;Apache Solr has a few places where sensitive values can be seen on the Solr UI. The keystore and truststore passwords are two examples that came up as part of &lt;a href="https://issues.apache.org/jira/browse/SOLR-10076" rel="noopener noreferrer"&gt;SOLR-10076&lt;/a&gt;. Starting in Solr 6.6 and 7.0, Solr will hide any property in the &lt;code&gt;/admin/info/system&lt;/code&gt; API that contains the word &lt;code&gt;password&lt;/code&gt; when the system property &lt;code&gt;solr.redaction.system.enabled&lt;/code&gt; is set to true. The &lt;code&gt;/admin/info/system&lt;/code&gt; API is used to power the Solr UI. This works well for most cases, but the implementation is more generic, enabling it to hide any custom property as well.&lt;/p&gt;

&lt;p&gt;The property &lt;code&gt;solr.redaction.system.pattern&lt;/code&gt; is a system property that takes a regular expression. If the regular expression matches the property name then the system property value will be redacted. This can enable hiding sensitive values for custom libraries or other use cases.&lt;/p&gt;

&lt;p&gt;The table below lays out the two properties that can be configured in Solr 6.6 or later.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Property&lt;/th&gt;
&lt;th&gt;Default Value&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;solr.redaction.system.enabled&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;false&lt;/code&gt; in Solr 6.6; &lt;code&gt;true&lt;/code&gt; in Solr 7.0&lt;/td&gt;
&lt;td&gt;Enables or disables the redaction&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;solr.redaction.system.pattern&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;.*password.*&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Regex for the properties to redact&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
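&lt;p&gt;As a hypothetical example, both properties can be passed as system properties at startup (the &lt;code&gt;bin/solr&lt;/code&gt; flags follow the table above, but the exact invocation is an assumption). Since the pattern is an ordinary regex matched against property names, its effect can be sanity-checked with grep:&lt;/p&gt;

```shell
# Hypothetical: start Solr with redaction enabled and a widened pattern
# bin/solr start -Dsolr.redaction.system.enabled=true \
#                -Dsolr.redaction.system.pattern='.*(password|secret).*'

# The widened pattern matches property names containing "password" or
# "secret"; grep over some sample names shows which would be redacted:
printf 'javax.net.ssl.keyStorePassword\nsolr.log.dir\nzk.acl.secret\n' |
  grep -iE '.*(password|secret).*'
```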

&lt;h3&gt;
  
  
  Apache Solr and Hiding Metrics Properties
&lt;/h3&gt;

&lt;p&gt;The &lt;a href="https://lucene.apache.org/solr/guide/7_5/metrics-reporting.html" rel="noopener noreferrer"&gt;Solr Metrics API&lt;/a&gt; can leak sensitive information as well. There is a &lt;a href="https://lucene.apache.org/solr/guide/7_5/metrics-reporting.html#the-metrics-hiddensysprops-element" rel="noopener noreferrer"&gt;&lt;code&gt;hiddenSysProps&lt;/code&gt; configuration&lt;/a&gt; that can prevent certain properties from being exposed via the metrics API. If additional properties need to be hidden, they need to be added to the &lt;code&gt;hiddenSysProps&lt;/code&gt; section.&lt;/p&gt;
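&lt;p&gt;A hedged sketch of that configuration in &lt;code&gt;solr.xml&lt;/code&gt; (the element names follow the metrics-reporting documentation linked above; the specific properties listed are illustrative assumptions):&lt;/p&gt;

```
&lt;metrics&gt;
  &lt;hiddenSysProps&gt;
    &lt;str&gt;javax.net.ssl.keyStorePassword&lt;/str&gt;
    &lt;str&gt;javax.net.ssl.trustStorePassword&lt;/str&gt;
    &lt;str&gt;basicauth&lt;/str&gt;
  &lt;/hiddenSysProps&gt;
&lt;/metrics&gt;
```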

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;Currently, there is limited documentation about the available options for hiding sensitive values. It is frustrating to have to configure hiding sensitive values in two places, but there is hope for improvement. &lt;a href="https://issues.apache.org/jira/browse/SOLR-12976" rel="noopener noreferrer"&gt;SOLR-12976&lt;/a&gt; was created earlier this month to try to address the duplication and documentation.&lt;/p&gt;

</description>
      <category>bigdata</category>
      <category>apache</category>
      <category>solr</category>
      <category>security</category>
    </item>
    <item>
      <title>Apache Solr - Hadoop Authentication Plugin - LDAP</title>
      <dc:creator>Kevin Risden</dc:creator>
      <pubDate>Tue, 20 Nov 2018 14:00:00 +0000</pubDate>
      <link>https://dev.to/risdenk/apache-solr---hadoop-authentication-plugin---ldap-2fa3</link>
      <guid>https://dev.to/risdenk/apache-solr---hadoop-authentication-plugin---ldap-2fa3</guid>
      <description>&lt;h3&gt;
  
  
  Overview
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://lucene.apache.org/solr" rel="noopener noreferrer"&gt;Apache Solr&lt;/a&gt; is a full text search engine that is built on &lt;a href="https://lucene.apache.org/solr/" rel="noopener noreferrer"&gt;Apache Lucene&lt;/a&gt;. One of the questions I’ve been asked about in the past is LDAP support for Apache Solr authentication. While there are commercial offerings that add LDAP support, like &lt;a href="https://lucidworks.com/products/fusion-server/" rel="noopener noreferrer"&gt;Lucidworks Fusion&lt;/a&gt;, Apache Solr doesn’t have an LDAP authentication plugin out of the box. Let’s explore the current state of authentication in Apache Solr.&lt;/p&gt;

&lt;h3&gt;
  
  
  Apache Solr and Authentication
&lt;/h3&gt;

&lt;p&gt;Apache Solr 5.2 released with a pluggable authentication module from &lt;a href="https://issues.apache.org/jira/browse/SOLR-7274" rel="noopener noreferrer"&gt;SOLR-7274&lt;/a&gt;. This paved the way for future authentication implementations such as &lt;code&gt;BasicAuth&lt;/code&gt; (&lt;a href="https://issues.apache.org/jira/browse/SOLR-7692" rel="noopener noreferrer"&gt;SOLR-7692&lt;/a&gt;) and Kerberos (&lt;a href="https://issues.apache.org/jira/browse/SOLR-7468" rel="noopener noreferrer"&gt;SOLR-7468&lt;/a&gt;). In Apache Solr 6.1, delegation token support (&lt;a href="https://issues.apache.org/jira/browse/SOLR-9200" rel="noopener noreferrer"&gt;SOLR-9200&lt;/a&gt;) was added to the Kerberos authentication plugin. Apache Solr 6.4 added a significant feature for hooking the &lt;a href="https://hadoop.apache.org/docs/current/hadoop-auth/Configuration.html" rel="noopener noreferrer"&gt;Hadoop authentication framework&lt;/a&gt; directly into Solr as an authentication plugin (&lt;a href="https://issues.apache.org/jira/browse/SOLR-9513" rel="noopener noreferrer"&gt;SOLR-9513&lt;/a&gt;). There hasn’t been much more work on authentication plugins lately, though work is currently underway on a JWT authentication plugin (&lt;a href="https://issues.apache.org/jira/browse/SOLR-12121" rel="noopener noreferrer"&gt;SOLR-12121&lt;/a&gt;). Each Solr authentication plugin provides additional capabilities for authenticating to Solr.&lt;/p&gt;
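&lt;p&gt;For comparison, the &lt;code&gt;BasicAuth&lt;/code&gt; plugin is configured through a &lt;code&gt;security.json&lt;/code&gt; file along these lines (a sketch; the credentials value is a placeholder for the salted SHA-256 hash described in the reference guide, not a working hash):&lt;/p&gt;

```json
{
  "authentication": {
    "blockUnknown": true,
    "class": "solr.BasicAuthPlugin",
    "credentials": {
      "solr": "BASE64_SHA256_HASH BASE64_SALT"
    }
  }
}
```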

&lt;h3&gt;
  
  
  Hadoop Authentication, LDAP, and Apache Solr
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Hadoop Authentication Framework Overview
&lt;/h4&gt;

&lt;p&gt;The &lt;a href="https://hadoop.apache.org/docs/current/hadoop-auth/Configuration.html" rel="noopener noreferrer"&gt;Hadoop authentication framework&lt;/a&gt; provides additional capabilities through its pluggable backends, which currently include Kerberos, AltKerberos, LDAP, SignerSecretProvider, and multi-scheme. Each can be configured to support varying authentication needs.&lt;/p&gt;

&lt;h4&gt;
  
  
  Apache Solr and Hadoop Authentication Framework
&lt;/h4&gt;

&lt;p&gt;Apache Solr 6.4+ supports the Hadoop authentication framework due to the work of &lt;a href="https://issues.apache.org/jira/browse/SOLR-9513" rel="noopener noreferrer"&gt;SOLR-9513&lt;/a&gt;. The &lt;a href="https://lucene.apache.org/solr/guide/7_5/hadoop-authentication-plugin.html" rel="noopener noreferrer"&gt;Apache Solr reference guide&lt;/a&gt; provides guidance on how to use the Hadoop Authentication Plugin. All the necessary configuration parameters can be passed down to the Hadoop authentication framework. As more backends are added to the Hadoop authentication framework, Apache Solr just needs to upgrade the Hadoop dependency to gain support.&lt;/p&gt;

&lt;h4&gt;
  
  
  Apache Solr 7.5 and LDAP
&lt;/h4&gt;

&lt;p&gt;LDAP support for the Hadoop authentication framework was added in Hadoop 2.8.0 (&lt;a href="https://issues.apache.org/jira/browse/HADOOP-12082" rel="noopener noreferrer"&gt;HADOOP-12082&lt;/a&gt;). Sadly, the Hadoop dependency for Apache Solr 7.5 is only at &lt;a href="https://github.com/apache/lucene-solr/blob/branch_7_5/lucene/ivy-versions.properties#L156" rel="noopener noreferrer"&gt;2.7.4&lt;/a&gt;. This means that when you try to configure the &lt;code&gt;HadoopAuthenticationPlugin&lt;/code&gt; with LDAP, you will get the following error:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Error initializing org.apache.solr.security.HadoopAuthPlugin:
javax.servlet.ServletException: java.lang.ClassNotFoundException: ldap
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;h4&gt;
  
  
  Manually Upgrading the Apache Solr Hadoop Dependency
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; I don’t recommend doing this outside of experimenting and seeing what is possible.&lt;/p&gt;

&lt;p&gt;I put together a &lt;a href="https://github.com/risdenk/test-solr-hadoopauthenticationplugin-ldap" rel="noopener noreferrer"&gt;simple test project&lt;/a&gt; that “manually” replaces the Hadoop 2.7.4 jars with 2.9.1 jars. This was designed to test if it is possible to configure the Solr &lt;code&gt;HadoopAuthenticationPlugin&lt;/code&gt; with LDAP. I was able to configure Solr using the following &lt;code&gt;security.json&lt;/code&gt; file to use the Hadoop 2.9.1 LDAP backend.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
    "authentication": {
        "class": "solr.HadoopAuthPlugin",
        "sysPropPrefix": "solr.",
        "type": "ldap",
        "authConfigs": [
            "ldap.providerurl",
            "ldap.basedn",
            "ldap.enablestarttls"
        ],
        "defaultConfigs": {
            "ldap.providerurl": "ldap://ldap",
            "ldap.basedn": "dc=example,dc=org",
            "ldap.enablestarttls": "false"
        }
    }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;With this configuration and the Hadoop 2.9.1 jars, Apache Solr was protected by LDAP. More testing should be done to see how this works across multiple nodes and what other integration is required. The Hadoop authentication framework has limited support for LDAP, but it should be usable for some use cases.&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;Apache Solr, as of 7.5, has limited support for the Hadoop authentication framework due to its dependency on Apache Hadoop 2.7.4. When the Hadoop dependency is updated (&lt;a href="https://issues.apache.org/jira/browse/SOLR-9515" rel="noopener noreferrer"&gt;SOLR-9515&lt;/a&gt;) in Apache Solr, there will be at least some initial support for LDAP integration out of the box with Solr.&lt;/p&gt;

&lt;h4&gt;
  
  
  References
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://lucene.apache.org/solr/guide/7_5/securing-solr.html" rel="noopener noreferrer"&gt;https://lucene.apache.org/solr/guide/7_5/securing-solr.html&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://lucene.apache.org/solr/guide/7_5/hadoop-authentication-plugin.html" rel="noopener noreferrer"&gt;https://lucene.apache.org/solr/guide/7_5/hadoop-authentication-plugin.html&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://issues.apache.org/jira/browse/SOLR-9513" rel="noopener noreferrer"&gt;https://issues.apache.org/jira/browse/SOLR-9513&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://stackoverflow.com/questions/50647431/ldap-integration-with-solr" rel="noopener noreferrer"&gt;https://stackoverflow.com/questions/50647431/ldap-integration-with-solr&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://community.hortonworks.com/questions/130989/solr-ldap-integration.html" rel="noopener noreferrer"&gt;https://community.hortonworks.com/questions/130989/solr-ldap-integration.html&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/apache/lucene-solr/blob/branch_7_5/lucene/ivy-versions.properties#L156" rel="noopener noreferrer"&gt;https://github.com/apache/lucene-solr/blob/branch_7_5/lucene/ivy-versions.properties#L156&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://issues.apache.org/jira/browse/HADOOP-12082" rel="noopener noreferrer"&gt;https://issues.apache.org/jira/browse/HADOOP-12082&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>bigdata</category>
      <category>apache</category>
      <category>solr</category>
      <category>hadoop</category>
    </item>
    <item>
      <title>Apache Hadoop - TLS and SSL Notes</title>
      <dc:creator>Kevin Risden</dc:creator>
      <pubDate>Thu, 15 Nov 2018 14:00:00 +0000</pubDate>
      <link>https://dev.to/risdenk/apache-hadoop---tls-and-ssl-notes-4ngc</link>
      <guid>https://dev.to/risdenk/apache-hadoop---tls-and-ssl-notes-4ngc</guid>
      <description>&lt;h3&gt;
  
  
  Overview
&lt;/h3&gt;

&lt;p&gt;I’ve collected notes on &lt;a href="https://en.wikipedia.org/wiki/Transport_Layer_Security" rel="noopener noreferrer"&gt;TLS/SSL&lt;/a&gt; for a number of years now. Most of them are related to &lt;a href="https://hadoop.apache.org/" rel="noopener noreferrer"&gt;Apache Hadoop&lt;/a&gt;, but others are more general. I was consulting when the &lt;a href="https://en.wikipedia.org/wiki/POODLE" rel="noopener noreferrer"&gt;POODLE&lt;/a&gt; and &lt;a href="https://en.wikipedia.org/wiki/Heartbleed" rel="noopener noreferrer"&gt;Heartbleed&lt;/a&gt; vulnerabilities were released. Below is a collection of TLS/SSL related references. No guarantee they are up to date, but it helps to have them all in one place.&lt;/p&gt;

&lt;h3&gt;
  
  
  TLS/SSL General
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Great explanation of TLS/SSL: &lt;a href="http://www.zytrax.com/tech/survival/ssl.html" rel="noopener noreferrer"&gt;http://www.zytrax.com/tech/survival/ssl.html&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;SSL Linux certificate location: &lt;a href="http://serverfault.com/questions/62496/ssl-certificate-location-on-unix-linux" rel="noopener noreferrer"&gt;http://serverfault.com/questions/62496/ssl-certificate-location-on-unix-linux&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;SSL vs TLS: &lt;a href="http://security.stackexchange.com/questions/5126/whats-the-difference-between-ssl-tls-and-https" rel="noopener noreferrer"&gt;http://security.stackexchange.com/questions/5126/whats-the-difference-between-ssl-tls-and-https&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Certificate Types
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="http://unmitigatedrisk.com/?p=381" rel="noopener noreferrer"&gt;http://unmitigatedrisk.com/?p=381&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/cm_sg_guide_ssl_certs.html" rel="noopener noreferrer"&gt;http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/cm_sg_guide_ssl_certs.html&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Generating Certificates
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.sslshopper.com/article-most-common-openssl-commands.html" rel="noopener noreferrer"&gt;https://www.sslshopper.com/article-most-common-openssl-commands.html&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://support.ssl.com/Knowledgebase/Article/View/19/0/der-vs-crt-vs-cer-vs-pem-certificates-and-how-to-convert-them" rel="noopener noreferrer"&gt;https://support.ssl.com/Knowledgebase/Article/View/19/0/der-vs-crt-vs-cer-vs-pem-certificates-and-how-to-convert-them&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Existing Certificate and Key to JKS
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="http://stackoverflow.com/questions/11952274/how%C2%ADcan%C2%ADi%C2%ADcreate%C2%ADkeystore%C2%ADfrom%C2%ADan%C2%ADexisting%C2%ADcertificate%C2%ADabc%C2%ADcrt%C2%ADand%C2%ADabc%C2%ADkey%C2%ADfil" rel="noopener noreferrer"&gt;http://stackoverflow.com/questions/11952274/how­can­i­create­keystore­from­an­existing­certificate­abc­crt­and­abc­key­fil&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;openssl pkcs12 ‐export ‐in abc.crt ‐inkey abc.key ‐out abc.p12
keytool ‐importkeystore ‐srckeystore abc.p12 \
        ‐srcstoretype PKCS12 \
        ‐destkeystore abc.jks \
        ‐deststoretype JKS

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Trusting CA Certificates
&lt;/h3&gt;

&lt;h4&gt;
  
  
  OpenSSL
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;update‐ca‐trust force‐enable
cp CERT.pem /etc/pki/tls/source/anchors/
update‐ca‐trust extract

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  OpenLDAP
&lt;/h4&gt;

&lt;p&gt;&lt;code&gt;vi /etc/openldap/ldap.conf&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;...
TLS_CACERT /etc/pki/
# Comment out TLS_CACERTDIR
...

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Java
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/usr/java/JAVA_VERSION/jre/lib/security/cacerts
/etc/pki/ca‐trust/extracted/java/cacerts

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;&lt;a href="https://bugzilla.redhat.com/show_bug.cgi?id=1056224" rel="noopener noreferrer"&gt;https://bugzilla.redhat.com/show_bug.cgi?id=1056224&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  POODLE - SSLv3
&lt;/h3&gt;

&lt;h4&gt;
  
  
  What is POODLE?
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://poodle.io/servers.html" rel="noopener noreferrer"&gt;https://poodle.io/servers.html&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.openssl.org/docs/apps/ciphers.html#SSL%C2%ADv3.0%C2%ADcipher%C2%ADsuites" rel="noopener noreferrer"&gt;https://www.openssl.org/docs/apps/ciphers.html#SSL­v3.0­cipher­suites&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Testing for POODLE
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://chrisburgess.com.au/how-to-test-for-the-sslv3-poodle-vulnerability/" rel="noopener noreferrer"&gt;https://chrisburgess.com.au/how-to-test-for-the-sslv3-poodle-vulnerability/&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Requires a relatively recent version of openssl installed
openssl s_client ‐connect HOST:PORT ‐ssl3
# ‐tls1 ‐tls1_1 ‐tls1_2
curl ‐v3 ‐i ‐X HEAD https://HOST:PORT

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Configuring Hadoop for Cipher Suites and Protocols
&lt;/h3&gt;

&lt;p&gt;Each Hadoop component must be configured, or be running a new enough version, to disable certain SSL protocols and cipher suites.&lt;/p&gt;

&lt;h4&gt;
  
  
  Ambari
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://docs.hortonworks.com/HDPDocuments/HDP3/HDP-3.0.1/configuring-advanced-security-options-for-ambari/content/ambari_sec_optional_configure_ciphers_and_protocols_for_ambari_server.html" rel="noopener noreferrer"&gt;https://docs.hortonworks.com/HDPDocuments/HDP3/HDP-3.0.1/configuring-advanced-security-options-for-ambari/content/ambari_sec_optional_configure_ciphers_and_protocols_for_ambari_server.html&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;security.server.disabled.ciphers=TLS_ECDHE_RSA_WITH_3DES_EDE_CBC_SHA&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;security.server.disabled.protocols=SSL|SSLv2|SSLv3&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h4&gt;
  
  
  Hadoop
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://issues.apache.org/jira/browse/HADOOP-11243" rel="noopener noreferrer"&gt;https://issues.apache.org/jira/browse/HADOOP-11243&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;Hadoop 2.5.2 + 2.6 - Patches SSLFactory for TLSv1&lt;/li&gt;
&lt;li&gt;&lt;code&gt;hadoop.ssl.enabled.protocols=TLSv1&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;(JDK6 can use TLSv1, JDK7+ can use TLSv1,TLSv1.1,TLSv1.2)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;a href="https://issues.apache.org/jira/browse/HADOOP-11218" rel="noopener noreferrer"&gt;&lt;/a&gt;&lt;a href="https://issues.apache.org/jira/browse/HADOOP-11218" rel="noopener noreferrer"&gt;https://issues.apache.org/jira/browse/HADOOP-11218&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;Hadoop 2.8 - Patches SSLFactory for TLSv1.1 and TLSv1.2&lt;/li&gt;
&lt;li&gt;Java 6 doesn’t support TLSv1.1+. Requires Java 7.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;a href="https://issues.apache.org/jira/browse/HADOOP-11260" rel="noopener noreferrer"&gt;&lt;/a&gt;&lt;a href="https://issues.apache.org/jira/browse/HADOOP-11260" rel="noopener noreferrer"&gt;https://issues.apache.org/jira/browse/HADOOP-11260&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;Hadoop 2.5.2 + 2.6 - Patches Jetty to disable SSLv3&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
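&lt;p&gt;For reference, on a patched Hadoop version the protocol list above is set in core-site.xml. This is a sketch only; the value shown assumes JDK 7+ and should be adjusted for your JDK:&lt;/p&gt;

```xml
<!-- Sketch for core-site.xml; this protocol list assumes JDK 7+ -->
<property>
  <name>hadoop.ssl.enabled.protocols</name>
  <value>TLSv1,TLSv1.1,TLSv1.2</value>
</property>
```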

&lt;h4&gt;
  
  
  HTTPFS
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://issues.apache.org/jira/browse/HDFS-7274" rel="noopener noreferrer"&gt;https://issues.apache.org/jira/browse/HDFS-7274&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;Hadoop 2.5.2 + 2.6 - Disables SSLv3 in HTTPFS&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h4&gt;
  
  
  Hive
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://issues.apache.org/jira/browse/HIVE-8675" rel="noopener noreferrer"&gt;https://issues.apache.org/jira/browse/HIVE-8675&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;Hive 0.14 - Removes SSLv3 from supported protocols&lt;/li&gt;
&lt;li&gt;&lt;code&gt;hive.ssl.protocol.blacklist&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;a href="https://issues.apache.org/jira/browse/HIVE-8827" rel="noopener noreferrer"&gt;&lt;/a&gt;&lt;a href="https://issues.apache.org/jira/browse/HIVE-8827" rel="noopener noreferrer"&gt;https://issues.apache.org/jira/browse/HIVE-8827&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;Hive 1.0 - Adds &lt;code&gt;SSLv2Hello&lt;/code&gt; back to supported protocols&lt;/li&gt;
&lt;li&gt;&lt;code&gt;hive.ssl.protocol.blacklist=SSLv2,SSLv3&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
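&lt;p&gt;As a sketch, the blacklist property lands in hive-site.xml. The value below matches HIVE-8827, which keeps &lt;code&gt;SSLv2Hello&lt;/code&gt; available while blocking SSLv2 and SSLv3:&lt;/p&gt;

```xml
<!-- Sketch for hive-site.xml (value per HIVE-8827) -->
<property>
  <name>hive.ssl.protocol.blacklist</name>
  <value>SSLv2,SSLv3</value>
</property>
```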

&lt;h4&gt;
  
  
  Oozie
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://issues.apache.org/jira/browse/OOZIE-2034" rel="noopener noreferrer"&gt;https://issues.apache.org/jira/browse/OOZIE-2034&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;Oozie 4.1.0 - Disable SSLv3&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;a href="https://issues.apache.org/jira/browse/OOZIE-2037" rel="noopener noreferrer"&gt;&lt;/a&gt;&lt;a href="https://issues.apache.org/jira/browse/OOZIE-2037" rel="noopener noreferrer"&gt;https://issues.apache.org/jira/browse/OOZIE-2037&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;Add support for TLSv1.1 and TLSv1.2&lt;/li&gt;
&lt;li&gt;Java 6 doesn’t support TLSv1.1+. Requires Java 7. Depends on OOZIE-2036&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h4&gt;
  
  
  Flume
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://issues.apache.org/jira/browse/FLUME-2520" rel="noopener noreferrer"&gt;https://issues.apache.org/jira/browse/FLUME-2520&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;Flume 1.5.1 - HTTPSource disable SSLv3&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h4&gt;
  
  
  Hue
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://issues.cloudera.org/browse/HUE-2438" rel="noopener noreferrer"&gt;https://issues.cloudera.org/browse/HUE-2438&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;Hue 3.8 - Disable SSLv3&lt;/li&gt;
&lt;li&gt;line 1670 of &lt;code&gt;/usr/lib/hue/desktop/core/src/desktop/lib/wsgiserver.py&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;&lt;code&gt;ctx.set_options(SSL.OP_NO_SSLv2 | SSL.OP_NO_SSLv3)&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ssl_cipher_list = "DEFAULT:!aNULL:!eNULL:!LOW:!EXPORT:!SSLv2"&lt;/code&gt; (default)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h4&gt;
  
  
  Ranger
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://issues.apache.org/jira/browse/RANGER-158" rel="noopener noreferrer"&gt;https://issues.apache.org/jira/browse/RANGER-158&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;Ranger 0.4.0 - Ranger Admin and User Authentication disable SSLv3&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h4&gt;
  
  
  Knox
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://issues.apache.org/jira/browse/KNOX-455" rel="noopener noreferrer"&gt;https://issues.apache.org/jira/browse/KNOX-455&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;Knox 0.5.0 - Disable SSLv3&lt;/li&gt;
&lt;li&gt;&lt;code&gt;ssl.exclude.protocols&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
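&lt;p&gt;A sketch of the corresponding gateway-site.xml entry; the property name comes from KNOX-455:&lt;/p&gt;

```xml
<!-- Sketch for Knox gateway-site.xml -->
<property>
  <name>ssl.exclude.protocols</name>
  <value>SSLv3</value>
</property>
```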

&lt;h4&gt;
  
  
  Storm
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://issues.apache.org/jira/browse/STORM-640" rel="noopener noreferrer"&gt;https://issues.apache.org/jira/browse/STORM-640&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;Storm 0.10.0 - Disable SSLv3&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h3&gt;
  
  
  Resources
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="http://sysadvent.blogspot.co.uk/2010/12/day-3-debugging-ssltls-with-openssl1.html" rel="noopener noreferrer"&gt;http://sysadvent.blogspot.co.uk/2010/12/day-3-debugging-ssltls-with-openssl1.html&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://gist.github.com/jankronquist/6412839" rel="noopener noreferrer"&gt;https://gist.github.com/jankronquist/6412839&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>bigdata</category>
      <category>apache</category>
      <category>hadoop</category>
      <category>tls</category>
    </item>
    <item>
      <title>Apache Knox - Performance Improvements</title>
      <dc:creator>Kevin Risden</dc:creator>
      <pubDate>Tue, 13 Nov 2018 14:00:00 +0000</pubDate>
      <link>https://dev.to/risdenk/apache-knox---performance-improvements-28f6</link>
      <guid>https://dev.to/risdenk/apache-knox---performance-improvements-28f6</guid>
      <description>&lt;h3&gt;
  
  
  TL;DR
&lt;/h3&gt;

&lt;p&gt;Apache Knox 1.2.0 should significantly improve:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/WebHDFS.html" rel="noopener noreferrer"&gt;Apache Hadoop WebHDFS&lt;/a&gt; write performance due to &lt;a href="https://issues.apache.org/jira/browse/KNOX-1521" rel="noopener noreferrer"&gt;KNOX-1521&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://hive.apache.org/" rel="noopener noreferrer"&gt;Apache Hive&lt;/a&gt; and GZip performance due to &lt;a href="https://issues.apache.org/jira/browse/KNOX-1530" rel="noopener noreferrer"&gt;KNOX-1530&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you use Java for TLS, the Java TLS/SSL performance section below is worth reading.&lt;/p&gt;

&lt;h3&gt;
  
  
  Overview
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://knox.apache.org/" rel="noopener noreferrer"&gt;Apache Knox&lt;/a&gt; is a reverse proxy that simplifies security in front of a Kerberos secured &lt;a href="https://hadoop.apache.org/" rel="noopener noreferrer"&gt;Apache Hadoop&lt;/a&gt; cluster and other related components. On the &lt;a href="https://mail-archives.apache.org/mod_mbox/knox-user/201809.mbox/%3CCACEuXj475wey-AzxO%2Bqf162Qe7ChEB8oNj1Hd6O1E4VNd8cH7g%40mail.gmail.com%3E" rel="noopener noreferrer"&gt;knox-user mailing list&lt;/a&gt; and &lt;a href="https://issues.apache.org/jira/browse/KNOX-1221" rel="noopener noreferrer"&gt;Knox Jira&lt;/a&gt;, there have been reports about Apache Knox not performing as expected. Two of the reported cases focused on &lt;a href="https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/WebHDFS.html" rel="noopener noreferrer"&gt;Apache Hadoop WebHDFS&lt;/a&gt; performance specifically. I was able to reproduce the slow downs with Apache Knox although the findings were surprising. This blog details the performance findings as well as improvements that will be in Apache Knox 1.2.0.&lt;/p&gt;

&lt;h3&gt;
  
  
  Reproducing the performance issues
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Apache Hadoop - WebHDFS
&lt;/h4&gt;

&lt;p&gt;I started looking into the two reported WebHDFS performance issues (&lt;a href="https://issues.apache.org/jira/browse/KNOX-1221" rel="noopener noreferrer"&gt;KNOX-1221&lt;/a&gt; and &lt;a href="https://mail-archives.apache.org/mod_mbox/knox-user/201809.mbox/%3CCACEuXj475wey-AzxO%2Bqf162Qe7ChEB8oNj1Hd6O1E4VNd8cH7g%40mail.gmail.com%3E" rel="noopener noreferrer"&gt;knox-user post&lt;/a&gt;). I found that the issue reproduced easily on a VM on my laptop. I tested read and write performance of WebHDFS natively with curl as well as going through Apache Knox. The results as posted to &lt;a href="https://issues.apache.org/jira/browse/KNOX-1221" rel="noopener noreferrer"&gt;KNOX-1221&lt;/a&gt; were as follows:&lt;/p&gt;
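&lt;p&gt;The timing approach can be sketched with curl's built-in transfer metrics. The helper name and URLs below are hypothetical placeholders, and &lt;code&gt;-k&lt;/code&gt; skips certificate verification for a test cluster:&lt;/p&gt;

```shell
# Hypothetical timing helper using curl's transfer metrics
time_transfer() {
  curl -skL -o /dev/null -w '%{speed_download} bytes/s, %{time_total}s total\n' "$1"
}

# Native WebHDFS read vs. the same read through Knox (placeholder URLs):
# time_transfer "http://NAMENODE:50070/webhdfs/v1/tmp/1gb.bin?op=OPEN&user.name=hdfs"
# time_transfer "https://KNOX_HOST:8443/gateway/default/webhdfs/v1/tmp/1gb.bin?op=OPEN"
```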

&lt;p&gt;&lt;strong&gt;WebHDFS Read Performance - 1GB file&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Test Case&lt;/th&gt;
&lt;th&gt;Transfer Speed&lt;/th&gt;
&lt;th&gt;Time&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Native WebHDFS&lt;/td&gt;
&lt;td&gt;252 MB/s&lt;/td&gt;
&lt;td&gt;3.8s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Knox w/o TLS&lt;/td&gt;
&lt;td&gt;264 MB/s&lt;/td&gt;
&lt;td&gt;3.6s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Knox w/ TLS&lt;/td&gt;
&lt;td&gt;54 MB/s&lt;/td&gt;
&lt;td&gt;20s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Parallel Knox w/ TLS&lt;/td&gt;
&lt;td&gt;2 at ~48MB/s&lt;/td&gt;
&lt;td&gt;22s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;WebHDFS Write Performance - 1GB file&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Test Case&lt;/th&gt;
&lt;th&gt;Time&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Native WebHDFS&lt;/td&gt;
&lt;td&gt;2.6s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Knox w/o TLS&lt;/td&gt;
&lt;td&gt;29s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Knox w/ TLS&lt;/td&gt;
&lt;td&gt;50s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The results were surprising since the numbers were all over the map. What was consistent was that Knox performance was poor for reads with TLS and for writes regardless of TLS. Another interesting finding was that parallel reads through Knox did not slow each other down; instead, each connection was limited independently. Details of the analysis are in the Performance Analysis section below.&lt;/p&gt;

&lt;h4&gt;
  
  
  Apache HBase - HBase Rest
&lt;/h4&gt;

&lt;p&gt;After analyzing WebHDFS performance, I decided to look at other services to see if the same slowdowns existed. I looked at Apache HBase Rest as part of &lt;a href="https://issues.apache.org/jira/browse/KNOX-1524" rel="noopener noreferrer"&gt;KNOX-1524&lt;/a&gt;. I tested without TLS for Knox since a TLS slowdown had already been identified with WebHDFS.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scan Performance for 100 thousand rows&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Test Case&lt;/th&gt;
&lt;th&gt;Time&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;HBase shell&lt;/td&gt;
&lt;td&gt;13.9s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;HBase Rest - native&lt;/td&gt;
&lt;td&gt;3.4s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;HBase Rest - Knox&lt;/td&gt;
&lt;td&gt;3.7s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The results were not too surprising. More details of the analysis are in the Performance Analysis section below.&lt;/p&gt;

&lt;h4&gt;
  
  
  Apache Hive - HiveServer2
&lt;/h4&gt;

&lt;p&gt;I also looked into HiveServer2 performance with and without Apache Knox as part of &lt;a href="https://issues.apache.org/jira/browse/KNOX-1524" rel="noopener noreferrer"&gt;KNOX-1524&lt;/a&gt;. The testing below is again without TLS.&lt;/p&gt;
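&lt;p&gt;For context, connecting beeline through Knox in http mode uses a JDBC URL along the lines sketched below. The host, port, and topology name ("default") are placeholders that depend on the Knox deployment:&lt;/p&gt;

```shell
# Hypothetical HiveServer2-over-Knox JDBC URL; adjust host, port, and topology
KNOX_HOST=knox.example.com
JDBC_URL="jdbc:hive2://${KNOX_HOST}:8443/;ssl=true;transportMode=http;httpPath=gateway/default/hive"
echo "$JDBC_URL"

# beeline -u "$JDBC_URL" -e 'select count(*) from some_table'  # needs a live cluster
```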

&lt;p&gt;&lt;strong&gt;Select * performance for 200 thousand rows&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Test Case&lt;/th&gt;
&lt;th&gt;Time&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;hdfs dfs -text&lt;/td&gt;
&lt;td&gt;2.4s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;beeline binary fetchSize=1000&lt;/td&gt;
&lt;td&gt;6.2s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;beeline http fetchSize=1000&lt;/td&gt;
&lt;td&gt;7.5s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;beeline http Knox fetchSize=1000&lt;/td&gt;
&lt;td&gt;9.9s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;beeline binary fetchSize=10000&lt;/td&gt;
&lt;td&gt;7.3s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;beeline http fetchSize=10000&lt;/td&gt;
&lt;td&gt;7.9s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;beeline http Knox fetchSize=10000&lt;/td&gt;
&lt;td&gt;8.5s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This showed there was room for improvement for Hive with Knox as well. Details of the analysis are in the Performance Analysis section below.&lt;/p&gt;

&lt;h3&gt;
  
  
  Performance Analysis
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Apache Hadoop - WebHDFS
&lt;/h4&gt;

&lt;p&gt;While looking at the WebHDFS results, I found that disabling TLS resulted in a big performance gain. Since changing &lt;code&gt;ssl.enabled&lt;/code&gt; in &lt;code&gt;gateway-site.xml&lt;/code&gt; was the only change, TLS had to be the only factor in the read performance differences. I looked into Jetty performance with TLS and found there were known performance issues with the JDK. For more details, see the Java TLS/SSL performance section below.&lt;/p&gt;

&lt;p&gt;The WebHDFS write performance difference could not be attributed to TLS since Knox without TLS was also ~20 seconds slower. I experimented with different buffer sizes and upgrading httpclient before finding the root cause: an issue with &lt;code&gt;UrlRewriteRequestStream&lt;/code&gt; in Apache Knox. &lt;code&gt;InputStream&lt;/code&gt; defines multiple &lt;code&gt;read&lt;/code&gt; methods, and the bulk variants were not implemented, so reads fell back to one byte at a time. For the fix details, see the Knox WebHDFS write performance section below.&lt;/p&gt;
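&lt;p&gt;The cost of falling back to single-byte reads is easy to see outside of Knox. The illustrative shell comparison below copies the same 1 MiB file byte-by-byte and then in buffered chunks; it is an analogy for the missing bulk &lt;code&gt;read&lt;/code&gt; overloads, not the Knox code itself:&lt;/p&gt;

```shell
# Illustration only: byte-at-a-time vs. buffered copies of the same data
f=$(mktemp)
head -c 1048576 /dev/zero > "$f"                   # 1 MiB test file
time dd if="$f" of=/dev/null bs=1 2>/dev/null      # one byte per read()
time dd if="$f" of=/dev/null bs=65536 2>/dev/null  # buffered reads
rm -f "$f"
```

The byte-at-a-time copy takes orders of magnitude more system calls for the same data, which is the same effect the missing overloads had on WebHDFS writes.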

&lt;h4&gt;
  
  
  Apache HBase - HBase Rest
&lt;/h4&gt;

&lt;p&gt;The &lt;a href="https://hbase.apache.org/book.html#shell" rel="noopener noreferrer"&gt;HBase shell&lt;/a&gt; slowness is to be expected since it is written in &lt;a href="https://www.jruby.org/" rel="noopener noreferrer"&gt;JRuby&lt;/a&gt; and not the best tool for working with HBase. Typically the &lt;a href="https://hbase.apache.org/book.html#hbase_apis" rel="noopener noreferrer"&gt;HBase Java API&lt;/a&gt; is used. While looking at the results, there were no big bottlenecks that jumped out from the performance test. There is some overhead due to Apache Knox but much of this is due to the extra hops.&lt;/p&gt;

&lt;h4&gt;
  
  
  Apache Hive - HiveServer2
&lt;/h4&gt;

&lt;p&gt;It took me a few tries to create a test framework that would allow me to test the changes easily. One of the big findings was that Hive is significantly slower than &lt;code&gt;hdfs dfs -text&lt;/code&gt; for the same file, so there is likely room for performance improvements in HiveServer2 itself. Another finding is that HiveServer2 binary and http modes differed significantly with the default &lt;code&gt;fetchSize&lt;/code&gt; of 1000. My guess is that when HTTP compression was added in &lt;a href="https://issues.apache.org/jira/browse/HIVE-17194" rel="noopener noreferrer"&gt;HIVE-17194&lt;/a&gt;, the default &lt;code&gt;fetchSize&lt;/code&gt; should have been increased to improve over-the-wire efficiency. Even ignoring binary mode, there was still a difference between HiveServer2 http mode with and without Apache Knox. Details on the performance improvements are in the sections below.&lt;/p&gt;

&lt;h3&gt;
  
  
  Performance Improvements
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Java - TLS/SSL Performance
&lt;/h4&gt;

&lt;p&gt;There are some performance issues when using the default JDK TLS implementation. I found a few references about the JDK and Jetty.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://nbsoftsolutions.com/blog/the-cost-of-tls-in-java-and-solutions" rel="noopener noreferrer"&gt;https://nbsoftsolutions.com/blog/the-cost-of-tls-in-java-and-solutions&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://nbsoftsolutions.com/blog/dropwizard-1-3-upcoming-tls-improvements" rel="noopener noreferrer"&gt;https://nbsoftsolutions.com/blog/dropwizard-1-3-upcoming-tls-improvements&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://webtide.com/conscrypting-native-ssl-for-jetty/" rel="noopener noreferrer"&gt;https://webtide.com/conscrypting-native-ssl-for-jetty/&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I was able to test with &lt;a href="https://github.com/google/conscrypt/" rel="noopener noreferrer"&gt;Conscrypt&lt;/a&gt; and found that the performance slowdowns for TLS reads and writes went away. I also tested disabling GCM since there are references that GCM can cause performance issues with JDK 8.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.wowza.com/docs/how-to-improve-ssl-performance-with-java-8" rel="noopener noreferrer"&gt;https://www.wowza.com/docs/how-to-improve-ssl-performance-with-java-8&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://stackoverflow.com/questions/25992131/slow-aes-gcm-encryption-and-decryption-with-java-8u20" rel="noopener noreferrer"&gt;https://stackoverflow.com/questions/25992131/slow-aes-gcm-encryption-and-decryption-with-java-8u20&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The results of testing different TLS implementations are below:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Test Case&lt;/th&gt;
&lt;th&gt;Transfer Speed&lt;/th&gt;
&lt;th&gt;Time&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Native WebHDFS&lt;/td&gt;
&lt;td&gt;252MB/s&lt;/td&gt;
&lt;td&gt;3.8s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Knox w/o TLS&lt;/td&gt;
&lt;td&gt;264MB/s&lt;/td&gt;
&lt;td&gt;3.6s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Knox w/ Conscrypt TLS&lt;/td&gt;
&lt;td&gt;245MB/s&lt;/td&gt;
&lt;td&gt;4.2s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Knox w/ TLS no GCM&lt;/td&gt;
&lt;td&gt;125MB/s&lt;/td&gt;
&lt;td&gt;8.7s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Knox w/ TLS&lt;/td&gt;
&lt;td&gt;54.3MB/s&lt;/td&gt;
&lt;td&gt;20s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Switching to a different TLS implementation provider for the JDK can significantly help performance. This applies across the board to any TLS handling in Java. Another option is to terminate TLS connections with a non-Java-based load balancer. Finally, turning off TLS may be acceptable for isolated, performance-specific use cases. All of these options are worth considering when using TLS with Java.&lt;/p&gt;

&lt;h4&gt;
  
  
  Knox - WebHDFS Write Performance
&lt;/h4&gt;

&lt;p&gt;I created &lt;a href="https://issues.apache.org/jira/browse/KNOX-1521" rel="noopener noreferrer"&gt;KNOX-1521&lt;/a&gt; to add the missing &lt;code&gt;read&lt;/code&gt; methods on the &lt;code&gt;UrlRewriteRequestStream&lt;/code&gt; class. This allows the underlying stream to read more efficiently than one byte at a time. With the changes from &lt;a href="https://issues.apache.org/jira/browse/KNOX-1521" rel="noopener noreferrer"&gt;KNOX-1521&lt;/a&gt;, WebHDFS write performance is now much closer to native WebHDFS. The updated write performance results after &lt;a href="https://issues.apache.org/jira/browse/KNOX-1521" rel="noopener noreferrer"&gt;KNOX-1521&lt;/a&gt; are below:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;WebHDFS Write Performance - 1GB file - KNOX-1521&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Test Case&lt;/th&gt;
&lt;th&gt;Time&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Native WebHDFS&lt;/td&gt;
&lt;td&gt;3.3s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Knox w/o TLS&lt;/td&gt;
&lt;td&gt;29s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Knox w/o TLS w/ KNOX-1521&lt;/td&gt;
&lt;td&gt;4.2s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h4&gt;
  
  
  Knox - GZip Handling
&lt;/h4&gt;

&lt;p&gt;I found that Apache Knox had a few issues when it came to handling GZip compressed data. I opened &lt;a href="https://issues.apache.org/jira/browse/KNOX-1530" rel="noopener noreferrer"&gt;KNOX-1530&lt;/a&gt; to address the underlying issues. The big improvement is that Knox will no longer decompress data that doesn’t need to be rewritten. This removes a lot of processing and should improve Knox performance for other use cases as well, like reading compressed files from WebHDFS and serving compressed JS/CSS files for UIs. After &lt;a href="https://issues.apache.org/jira/browse/KNOX-1530" rel="noopener noreferrer"&gt;KNOX-1530&lt;/a&gt; was addressed, the &lt;a href="https://issues.apache.org/jira/browse/KNOX-1524?focusedCommentId=16673639&amp;amp;page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16673639" rel="noopener noreferrer"&gt;performance for Apache Hive HiveServer2 in http mode with and without Apache Knox&lt;/a&gt; was about the same.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Select * performance for 200 thousand rows with &lt;a href="https://issues.apache.org/jira/browse/KNOX-1530" rel="noopener noreferrer"&gt;KNOX-1530&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Test Case&lt;/th&gt;
&lt;th&gt;Time&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;hdfs dfs -text&lt;/td&gt;
&lt;td&gt;2.1s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;beeline binary fetchSize=1000&lt;/td&gt;
&lt;td&gt;5.4s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;beeline http fetchSize=1000&lt;/td&gt;
&lt;td&gt;6.8s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;beeline http Knox fetchSize=1000&lt;/td&gt;
&lt;td&gt;7.7s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;beeline binary fetchSize=10000&lt;/td&gt;
&lt;td&gt;6.8s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;beeline http fetchSize=10000&lt;/td&gt;
&lt;td&gt;7.7s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;beeline http Knox fetchSize=10000&lt;/td&gt;
&lt;td&gt;7.8s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The default &lt;code&gt;fetchSize&lt;/code&gt; of 1000 slows down HTTP mode since repeated requests are needed to fetch all of the results.&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;Reproducing the WebHDFS performance bottleneck showed that Knox performance could be improved. WebHDFS write performance in Apache Knox 1.2.0 should be significantly faster due to the &lt;a href="https://issues.apache.org/jira/browse/KNOX-1521" rel="noopener noreferrer"&gt;KNOX-1521&lt;/a&gt; changes. Hive performance should also be better in Apache Knox 1.2.0 due to the improved GZip handling in &lt;a href="https://issues.apache.org/jira/browse/KNOX-1530" rel="noopener noreferrer"&gt;KNOX-1530&lt;/a&gt;. Apache Knox 1.2.0 should be released soon with these performance improvements and more.&lt;/p&gt;

&lt;p&gt;I posted the performance tests I used &lt;a href="https://github.com/risdenk/knox-performance-tests" rel="noopener noreferrer"&gt;here&lt;/a&gt; so they can be used to find other performance bottlenecks. The benchmarks should be reproducible, and I will use them for more performance testing soon.&lt;/p&gt;

&lt;p&gt;The performance testing done so far compares Knox against the native endpoints rather than aiming for the best absolute numbers. This type of testing found several bottlenecks that have been addressed for Apache Knox 1.2.0. All of the tests so far run without Kerberos authentication for the backend. There could be additional performance bottlenecks when Kerberos authentication is used, and that will be another area I’ll be looking into.&lt;/p&gt;

</description>
      <category>bigdata</category>
      <category>apache</category>
      <category>knox</category>
      <category>security</category>
    </item>
    <item>
      <title>Apache HBase - Thrift 1 Server SPNEGO Improvements</title>
      <dc:creator>Kevin Risden</dc:creator>
      <pubDate>Thu, 08 Nov 2018 14:00:00 +0000</pubDate>
      <link>https://dev.to/risdenk/apache-hbase---thrift-1-server-spnego-improvements-4d9d</link>
      <guid>https://dev.to/risdenk/apache-hbase---thrift-1-server-spnego-improvements-4d9d</guid>
      <description>&lt;h3&gt;
  
  
  Overview
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://hbase.apache.org/"&gt;Apache HBase&lt;/a&gt; provides the ability to perform realtime random read/write access to large datasets. HBase is built on top of &lt;a href="https://hadoop.apache.org/"&gt;Apache Hadoop&lt;/a&gt; and can scale to billions of rows and millions of columns. One of the capabilities of Apache HBase is a &lt;a href="https://hbase.apache.org/book.html#thrift"&gt;thrift server&lt;/a&gt; that provides the ability to interact with HBase from any language that supports &lt;a href="https://thrift.apache.org/"&gt;Thrift&lt;/a&gt;. There are two different versions of the HBase Thrift server v1 and v2. This blog post focuses on v1 since that is the version that integrates with &lt;a href="https://gethue.com/"&gt;Hue&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Apache HBase and Hue
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://gethue.com/the-web-ui-for-hbase-hbase-browser/"&gt;Hue has support for Apache HBase&lt;/a&gt; through the v1 thrift server. The Hue UI allows for easily interacting with HBase for both querying and inserting. It is a quick and easy way to get started with HBase. The downside is that when using the HBase thrift v1 server, there was limited support for Kerberos.&lt;/p&gt;

&lt;h3&gt;
  
  
  HBase Thrift V1 and Kerberos
&lt;/h3&gt;

&lt;p&gt;There have been a few &lt;a href="http://grokbase.com/p/cloudera/cdh-user/133pgawryt/hbase-thrift-with-kerberos-appears-to-ignore-keytab"&gt;posts&lt;/a&gt; about getting the HBase Thrift V1 server to work properly with Kerberos. In many cases, the solution was to merge keytabs for the HTTP principal and the HBase server principal. The other solution was to add the HTTP principal as a proxy user. Both of these solutions require extra work that isn’t necessary. The HTTP principal should only be used for authenticating SPNEGO. The HBase server principal should be used to authenticate with the rest of HBase. I found this out after comparing the Apache Hive HiveServer2 thrift implementation with the HBase thrift server implementation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Improving the HBase Thrift V1 Implementation
&lt;/h3&gt;

&lt;p&gt;I emailed the &lt;a href="http://mail-archives.apache.org/mod_mbox/hbase-user/201801.mbox/%3CCAJU9nmh5YtZ%2BmAQSLo91yKm8pRVzAPNLBU9vdVMCcxHRtRqgoA%40mail.gmail.com%3E"&gt;hbase-user mailing list&lt;/a&gt; to see if my findings were plausible or if I was missing something. Josh Elser reviewed it and said that this change would be useful. I opened &lt;a href="https://issues.apache.org/jira/browse/HBASE-19852"&gt;HBASE-19852&lt;/a&gt; and put together a working patch over the next few months. It turns out the quick patch for our environment took some effort to contribute back to Apache HBase proper. The patch accomplished the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Avoid the existing 401 try/catch block by checking the authorization header up front before checking for Kerberos credentials&lt;/li&gt;
&lt;li&gt;Add &lt;code&gt;hbase.thrift.spnego.principal&lt;/code&gt; and &lt;code&gt;hbase.thrift.spnego.keytab.file&lt;/code&gt; to allow configuring the SPNEGO principal specifically for the Thrift server&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The first change prevents the logs from filling with failed Kerberos authentication messages when the authorization header is empty. The second change allows the SPNEGO principal to be configured in the hbase-site.xml file; the Thrift server then uses the SPNEGO principal and keytab for HTTP authentication. This removes the need to merge keytabs and lets an administrator use existing SPNEGO principals and keytabs already on the host (like ones set up by Ambari).&lt;/p&gt;
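&lt;p&gt;A sketch of the resulting hbase-site.xml entries follows; the principal and keytab path below are examples of what an Ambari-managed host typically already has:&lt;/p&gt;

```xml
<!-- Sketch for hbase-site.xml; principal and keytab path are examples -->
<property>
  <name>hbase.thrift.spnego.principal</name>
  <value>HTTP/_HOST@EXAMPLE.COM</value>
</property>
<property>
  <name>hbase.thrift.spnego.keytab.file</name>
  <value>/etc/security/keytabs/spnego.service.keytab</value>
</property>
```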

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://issues.apache.org/jira/browse/HBASE-19852"&gt;HBASE-19852&lt;/a&gt; was reviewed and merged in June 2018. It is a part of HBase 2.1.0 and greater. The Apache HBase community was great to work with since they were patient while I worked on the patch over a few months. The new configuration options allows the HBase Thrift V1 server to work seemlessly with Kerberos and Hue. There is no longer a need to merge keytabs or perform other workarounds. This change has been in use for over a year now with success using the Hue HBase Browser with HBase and Kerberos.&lt;/p&gt;

</description>
      <category>bigdata</category>
      <category>apache</category>
      <category>hbase</category>
      <category>thrift</category>
    </item>
    <item>
      <title>Apache HBase - Snappy Compression</title>
      <dc:creator>Kevin Risden</dc:creator>
      <pubDate>Tue, 06 Nov 2018 14:00:00 +0000</pubDate>
      <link>https://dev.to/risdenk/apache-hbase---snappy-compression-f6l</link>
      <guid>https://dev.to/risdenk/apache-hbase---snappy-compression-f6l</guid>
      <description>&lt;h3&gt;
  
  
  Overview
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://hbase.apache.org/"&gt;Apache HBase&lt;/a&gt; provides the ability to perform realtime random read/write access to large datasets. HBase is built on top of &lt;a href="https://hadoop.apache.org/"&gt;Apache Hadoop&lt;/a&gt; and can scale to billions of rows and millions of columns. One of the features of HBase is to enable &lt;a href="https://hbase.apache.org/book.html#compression"&gt;different types of compression&lt;/a&gt; for a column family. It is recommended that testing be done for your use case, but this blog shows how &lt;a href="https://en.wikipedia.org/wiki/Snappy_(compression)"&gt;Snappy compression&lt;/a&gt; can reduce storage needs while keeping the same query performance.&lt;/p&gt;

&lt;h3&gt;
  
  
  Evidence
&lt;/h3&gt;

&lt;p&gt;Below are some images from some clusters where testing was done with Snappy compression. The charts show a variety of metrics from storage size to system metrics.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--OhXvv_Ws--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://risdenk.github.io/images/posts/2018-11-06/dev_grafana_hbase_get_mutate_latencies.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--OhXvv_Ws--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://risdenk.github.io/images/posts/2018-11-06/dev_grafana_hbase_get_mutate_latencies.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--linhfWtS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://risdenk.github.io/images/posts/2018-11-06/dev_grafana_hbase_size.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--linhfWtS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://risdenk.github.io/images/posts/2018-11-06/dev_grafana_hbase_size.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--CnQjjbf5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://risdenk.github.io/images/posts/2018-11-06/test_grafana_hbase_get_mutate_latencies.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--CnQjjbf5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://risdenk.github.io/images/posts/2018-11-06/test_grafana_hbase_get_mutate_latencies.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--wfKR38hi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://risdenk.github.io/images/posts/2018-11-06/test_grafana_hbase_size.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--wfKR38hi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://risdenk.github.io/images/posts/2018-11-06/test_grafana_hbase_size.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--TSV8lipJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://risdenk.github.io/images/posts/2018-11-06/test_grafana_system_disk_io.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--TSV8lipJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://risdenk.github.io/images/posts/2018-11-06/test_grafana_system_disk_io.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--t85MyCDo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://risdenk.github.io/images/posts/2018-11-06/test_grafana_system_iowait.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--t85MyCDo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://risdenk.github.io/images/posts/2018-11-06/test_grafana_system_iowait.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--E4qI3voD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://risdenk.github.io/images/posts/2018-11-06/test_grafana_system_user.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--E4qI3voD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://risdenk.github.io/images/posts/2018-11-06/test_grafana_system_user.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;The charts above show more than 80% storage savings with only a slight bump in mutate latencies. The clusters this was tested on were loaded with simulated data and load, and the production data matched these results once deployed. The storage savings also helped backups and disaster recovery, since we didn’t need to move as much data across the wire. References for implementing this yourself, with more options for testing, are below.&lt;/p&gt;

&lt;h4&gt;
  
  
  References
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://community.hortonworks.com/articles/54761/compression-in-hbase.html"&gt;https://community.hortonworks.com/articles/54761/compression-in-hbase.html&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://hadoop-hbase.blogspot.com/2016/02/hbase-compression-vs-blockencoding_17.html"&gt;http://hadoop-hbase.blogspot.com/2016/02/hbase-compression-vs-blockencoding_17.html&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blogs.apache.org/hbase/entry/the_effect_of_columnfamily_rowkey"&gt;https://blogs.apache.org/hbase/entry/the_effect_of_columnfamily_rowkey&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://db-blog.web.cern.ch/blog/zbigniew-baranowski/2017-01-performance-comparison-different-file-formats-and-storage-engines"&gt;https://db-blog.web.cern.ch/blog/zbigniew-baranowski/2017-01-performance-comparison-different-file-formats-and-storage-engines&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://blog.erdemagaoglu.com/post/4605524309/lzo-vs-snappy-vs-lzf-vs-zlib-a-comparison-of"&gt;http://blog.erdemagaoglu.com/post/4605524309/lzo-vs-snappy-vs-lzf-vs-zlib-a-comparison-of&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://hbase.apache.org/book.html#compression"&gt;https://hbase.apache.org/book.html#compression&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://hbase.apache.org/book.html#data.block.encoding.enable"&gt;https://hbase.apache.org/book.html#data.block.encoding.enable&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>bigdata</category>
      <category>apache</category>
      <category>hbase</category>
      <category>snappy</category>
    </item>
    <item>
      <title>Apache Storm - Slow Topology Upload</title>
      <dc:creator>Kevin Risden</dc:creator>
      <pubDate>Thu, 01 Nov 2018 14:00:00 +0000</pubDate>
      <link>https://dev.to/risdenk/apache-storm---slow-topology-upload-2gf0</link>
      <guid>https://dev.to/risdenk/apache-storm---slow-topology-upload-2gf0</guid>
<description>&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; This is an old post from my notes. It may not be applicable anymore, but I am sharing it in case it helps someone.&lt;/p&gt;

&lt;h3&gt;
  
  
  Overview
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://storm.apache.org/"&gt;Apache Storm&lt;/a&gt; after HDP 2.2 seems to have a hard time with large topology jars and takes a while to upload them. There have been a &lt;a href="https://mail-archives.apache.org/mod_mbox/storm-user/201603.mbox/%3CCAPC1M2i3OpKhC3n_+oTJke45Efuxq2PxMVurx71oEU-=Nqd9gQ@mail.gmail.com%3E"&gt;few&lt;/a&gt; &lt;a href="https://community.hortonworks.com/questions/24517/topology%C2%ADcode%C2%ADdistribution%C2%ADtakes%C2%ADtoo%C2%ADmuch%C2%ADtime.html"&gt;reports&lt;/a&gt; of Storm topology jars uploading slowly. I ran into this a few years ago. The fix is to increase the &lt;code&gt;nimbus.thrift.max_buffer_size&lt;/code&gt; setting.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fix
&lt;/h3&gt;

&lt;p&gt;Increase &lt;code&gt;nimbus.thrift.max_buffer_size&lt;/code&gt; from the default of 1048576 (1 MB) to 20485760 (roughly 20 MB).&lt;/p&gt;
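
&lt;p&gt;On a plain Apache Storm install this is a one-line change in &lt;code&gt;storm.yaml&lt;/code&gt; on the Nimbus host, followed by a Nimbus restart (a sketch; if you are on HDP, make the same change through Ambari instead):&lt;/p&gt;

```yaml
# storm.yaml on the Nimbus host (restart Nimbus after changing)
# Default is 1048576 (1 MB); raise it to accommodate large topology jars.
nimbus.thrift.max_buffer_size: 20485760
```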

&lt;h3&gt;
  
  
  References
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://mail-archives.apache.org/mod_mbox/storm-user/201403.mbox/%3CFC98EE12-4AED-4D06-9917-C449B96EB08A@gmail.com%3E"&gt;https://mail-archives.apache.org/mod_mbox/storm-user/201403.mbox/%3CFC98EE12-4AED-4D06-9917-C449B96EB08A@gmail.com%3E&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://stackoverflow.com/questions/27092653/storm-supervisor-connectivity-error-downloading-the-jar-from-nimbus"&gt;http://stackoverflow.com/questions/27092653/storm-supervisor-connectivity-error-downloading-the-jar-from-nimbus&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://qnalist.com/questions/4768442/nimbus-fails-after-uploading-topology-reading-too-large-of-frame-size"&gt;https://qnalist.com/questions/4768442/nimbus-fails-after-uploading-topology-reading-too-large-of-frame-size&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>bigdata</category>
      <category>apache</category>
      <category>storm</category>
      <category>topology</category>
    </item>
    <item>
      <title>Apache Solr - Apache Calcite Avatica Integration</title>
      <dc:creator>Kevin Risden</dc:creator>
      <pubDate>Tue, 30 Oct 2018 14:00:00 +0000</pubDate>
      <link>https://dev.to/risdenk/apache-solr---apache-calcite-avatica-integration-3hjb</link>
      <guid>https://dev.to/risdenk/apache-solr---apache-calcite-avatica-integration-3hjb</guid>
      <description>&lt;h3&gt;
  
  
  Overview
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://lucene.apache.org/solr"&gt;Apache Solr&lt;/a&gt; is a full text search engine that is built on &lt;a href="https://lucene.apache.org/solr/"&gt;Apache Lucene&lt;/a&gt;. One of the capabilities of Apache Solr is to handle SQL like statements. This was introduced in Solr 6.0 and refined in subsequent releases. Initially the SQL support used the &lt;a href="https://github.com/prestodb/presto/blob/master/presto-parser/src/main/java/com/facebook/presto/sql/parser/SqlParser.java"&gt;Presto SQL parser&lt;/a&gt;. This was replaced by &lt;a href="https://calcite.apache.org/"&gt;Apache Calcite&lt;/a&gt; due to Presto not having an optimizer. Calcite provides the ability to push down execution of SQL to Apache Solr.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://calcite.apache.org/avatica/"&gt;Apache Calcite Avatica&lt;/a&gt; is a subproject of Apache Calcite and provides a JDBC driver as well as JDBC server. The Avatica architecture diagram displays how this fits together.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--trGQvjnq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://raw.githubusercontent.com/julianhyde/share/master/slides/avatica-architecture.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--trGQvjnq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://raw.githubusercontent.com/julianhyde/share/master/slides/avatica-architecture.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Apache Solr and Apache Calcite Avatica
&lt;/h3&gt;

&lt;p&gt;Apache Solr has historically built its own JDBC driver implementation. This takes quite a bit of effort since the JDBC specification has a lot of methods that need to be implemented. &lt;a href="https://issues.apache.org/jira/browse/SOLR-9963"&gt;SOLR-9963&lt;/a&gt; was created to try to integrate Apache Calcite Avatica into Solr. This would provide an endpoint for the Avatica JDBC driver and remove the need for a separate Apache Solr JDBC driver implementation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Integrating Apache Calcite Avatica as an Apache Solr Handler
&lt;/h3&gt;

&lt;p&gt;Since Apache Calcite Avatica runs on Jetty, just like Apache Solr, I had the idea to add Avatica as just another handler in Solr. This would expose all the features of Avatica without changing any Solr internals. The Avatica handler could then use the existing Calcite engine within Apache Solr to handle the queries.&lt;/p&gt;

&lt;p&gt;I created &lt;a href="https://issues.apache.org/jira/browse/SOLR-9963"&gt;SOLR-9963&lt;/a&gt; and by early February 2017 I had a working example of the integration of Avatica and Solr. I was able to use the existing Avatica JDBC driver directly with Apache Solr without any issues. Sadly I haven’t had time to finish merging this change yet.&lt;/p&gt;

&lt;h3&gt;
  
  
  Testing Apache Solr with Apache Calcite Avatica Handler
&lt;/h3&gt;

&lt;p&gt;One of the cool features of Apache Calcite Avatica is that you can interact with it over pure REST with a JSON payload. I created a simple test script to show how this was possible even with Apache Solr.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;./test_avatica_solr.sh "http://localhost:8983/solr/test/avatica" "select * from test limit 10"&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;test_avatica_solr.sh&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#!/usr/bin/env bash

set -u
#set -x

AVATICA=$1
SQL=$2

CONNECTION_ID="conn-$(whoami)-$(date +%s)"
MAX_ROW_COUNT=100
NUM_ROWS=2
OFFSET=0

echo "Open connection"
curl -i -w "\n" "$AVATICA" -H "Content-Type: application/json" --data "{\"request\": \"openConnection\",\"connectionId\": \"${CONNECTION_ID}\"}"

# Example of how to set connection properties with info key
#curl -i "$AVATICA" -H "Content-Type: application/json" --data "{\"request\": \"openConnection\",\"connectionId\": \"${CONNECTION_ID}\",\"info\": {\"zk\": \"$ZK\",\"lex\": \"MYSQL\"}}"
echo

echo "Create statement"
STATEMENTRSP=$(curl -s "$AVATICA" -H "Content-Type: application/json" --data "{\"request\": \"createStatement\",\"connectionId\": \"${CONNECTION_ID}\"}")
STATEMENTID=$(echo "$STATEMENTRSP" | jq .statementId)
echo

echo "PrepareAndExecuteRequest"
curl -i -w "\n" "$AVATICA" -H "Content-Type: application/json" --data "{\"request\": \"prepareAndExecute\",\"connectionId\": \"${CONNECTION_ID}\",\"statementId\": $STATEMENTID,\"sql\": \"$SQL\",\"maxRowCount\": ${MAX_ROW_COUNT}, \"maxRowsInFirstFrame\": ${NUM_ROWS}}"
echo

# Loop through all the results
ISDONE=false
while ! $ISDONE; do
  OFFSET=$((OFFSET + NUM_ROWS))
  echo "FetchRequest - Offset=$OFFSET"
  FETCHRSP=$(curl -s "$AVATICA" -H "Content-Type: application/json" --data "{\"request\": \"fetch\",\"connectionId\": \"${CONNECTION_ID}\",\"statementId\": $STATEMENTID,\"offset\": ${OFFSET},\"fetchMaxRowCount\": ${NUM_ROWS}}")
  echo "$FETCHRSP"
  ISDONE=$(echo "$FETCHRSP" | jq .frame.done)
  echo
done

echo "Close statement"
curl -i -w "\n" "$AVATICA" -H "Content-Type: application/json" --data "{\"request\": \"closeStatement\",\"connectionId\": \"${CONNECTION_ID}\",\"statementId\": $STATEMENTID}"
echo

echo "Close connection"
curl -i -w "\n" "$AVATICA" -H "Content-Type: application/json" --data "{\"request\": \"closeConnection\",\"connectionId\": \"${CONNECTION_ID}\"}"
echo

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;h3&gt;
  
  
  What is next?
&lt;/h3&gt;

&lt;p&gt;If this feature looks interesting, add your thoughts to &lt;a href="https://issues.apache.org/jira/browse/SOLR-9963"&gt;SOLR-9963&lt;/a&gt;. If there is interest, we can work towards getting SOLR-9963 merged. The Apache Solr JDBC driver would then need to switch to wrapping an Avatica JDBC driver. Overall this should improve the SQL experience that comes with Apache Solr.&lt;/p&gt;

</description>
      <category>bigdata</category>
      <category>apache</category>
      <category>solr</category>
      <category>calcite</category>
    </item>
    <item>
      <title>Apache Solr - Leading Wildcard Queries and ReversedWildcardFilterFactory</title>
      <dc:creator>Kevin Risden</dc:creator>
      <pubDate>Thu, 25 Oct 2018 14:00:00 +0000</pubDate>
      <link>https://dev.to/risdenk/apache-solr---leading-wildcard-queries-and-reversedwildcardfilterfactory-48kc</link>
      <guid>https://dev.to/risdenk/apache-solr---leading-wildcard-queries-and-reversedwildcardfilterfactory-48kc</guid>
      <description>&lt;h3&gt;
  
  
  Overview
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://lucene.apache.org/solr"&gt;Apache Solr&lt;/a&gt; is a full text search engine that is built on &lt;a href="https://lucene.apache.org/solr/"&gt;Apache Lucene&lt;/a&gt;. Recently, I was looking into performance where the query had leading wildcards. There have been &lt;a href="https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201109.mbox/%3C1315989749353-3335240.post%40n3.nabble.com%3E"&gt;many&lt;/a&gt; &lt;a href="https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201502.mbox/%3CCACtr6ybiKq_nyTdBk_82%3DjErHc3jOkFhC_vEUP9ymcbgCkEm2Q%40mail.gmail.com%3E"&gt;questions&lt;/a&gt; over the years about leading wildcard queries. It was surprising to me that there are few references explaining what leading wildcard queries are and how they are implemented behind the scenes. There are also no references that explain how to verify that leading wildcards are being processed efficiently.&lt;/p&gt;

&lt;p&gt;So in this blog I cover the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What are leading wildcard queries?&lt;/li&gt;
&lt;li&gt;Why are leading wildcard queries inefficient?&lt;/li&gt;
&lt;li&gt;How to improve leading wildcard queries&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ReversedWildcardFilterFactory&lt;/code&gt; Implementation&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  What are leading wildcard queries?
&lt;/h3&gt;

&lt;p&gt;Leading wildcard queries are term queries that use the asterisk (&lt;code&gt;*&lt;/code&gt;) at the beginning of the term. For example, you could look for all colors that end in &lt;code&gt;ed&lt;/code&gt; with &lt;code&gt;color:*ed&lt;/code&gt;. The asterisk (&lt;code&gt;*&lt;/code&gt;) takes the place of one or more characters. There is another variation where the question mark (&lt;code&gt;?&lt;/code&gt;) is used as a placeholder for a single character. I am focusing on leading wildcard queries only and not trailing (i.e. &lt;code&gt;color:re*&lt;/code&gt;) or other combinations (i.e. &lt;code&gt;color:*e*&lt;/code&gt;). For more details, see the &lt;a href="https://lucene.apache.org/solr/guide/7_5/the-standard-query-parser.html#wildcard-searches"&gt;Apache Solr Reference Guide Wildcard Searches page&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why are leading wildcard queries inefficient?
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://lucene.apache.org/"&gt;Apache Lucene&lt;/a&gt;, the library that backs &lt;a href="https://lucene.apache.org/solr/"&gt;Apache Solr&lt;/a&gt; and &lt;a href="https://www.elastic.co/products/elasticsearch"&gt;Elasticsearch&lt;/a&gt;, is designed to search for tokens. &lt;a href="https://lucene.apache.org/core/7_5_0/test-framework/org/apache/lucene/analysis/Token.html"&gt;Tokens&lt;/a&gt; are the representation of a piece of text data after it has been &lt;a href="https://lucene.apache.org/solr/guide/7_5/understanding-analyzers-tokenizers-and-filters.html"&gt;tokenized and analyzed&lt;/a&gt;. Lucene is very good at exact matches since it can efficiently query the index for matches. When leading wildcards are involved, there is a lot more work that needs to be done since the index is not optimized for this type of lookup.&lt;/p&gt;

&lt;p&gt;A leading wildcard query must iterate through all of the terms in the index to see if they match the query. For even moderately sized indices this can be time-consuming. With the asterisk (&lt;code&gt;*&lt;/code&gt;) at the beginning of the query, matches can appear anywhere in the index, so the iteration through the terms cannot stop until it has gone through the entire index. The question mark (&lt;code&gt;?&lt;/code&gt;) can be significantly more performant since Lucene doesn’t have to check as many candidates. This full scan can also cause poor caching behavior if the index doesn’t fit in memory, among other problems.&lt;/p&gt;
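
&lt;p&gt;A toy way to see the difference: in a sorted term dictionary, a trailing wildcard like &lt;code&gt;re*&lt;/code&gt; corresponds to a contiguous prefix range the index can seek to, while a leading wildcard like &lt;code&gt;*ed&lt;/code&gt; has no such range and must test every term. The sketch below fakes a term dictionary with a shell variable and &lt;code&gt;grep&lt;/code&gt;:&lt;/p&gt;

```shell
# Toy "term dictionary" (sorted, like Lucene's term index)
TERMS='blue
bred
green
red'

# Trailing wildcard re* : equivalent to a prefix range, cheap to seek to
echo "$TERMS" | grep -c '^re'    # 1 match (red)

# Leading wildcard *ed : no usable prefix, every term must be checked
echo "$TERMS" | grep -c 'ed$'    # 2 matches (bred, red)
```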

&lt;h3&gt;
  
  
  How to improve leading wildcard queries
&lt;/h3&gt;

&lt;p&gt;The best way to improve leading wildcard queries is to remove them if possible. In many cases, there is a better way to handle the query through different tokenization or analysis. If the use case truly requires leading wildcard queries, there is one trick that can help: reverse the token during indexing, which effectively turns a leading wildcard query into a trailing wildcard query. A trailing wildcard query can be executed much more efficiently since only part of the index needs to be examined.&lt;/p&gt;

&lt;p&gt;Apache Solr has a token filter called the &lt;a href="https://lucene.apache.org/solr/7_5_0/solr-core/org/apache/solr/analysis/ReversedWildcardFilterFactory.html"&gt;&lt;code&gt;ReversedWildcardFilterFactory&lt;/code&gt;&lt;/a&gt; that emits reversed tokens. This can be used when constructing fieldTypes for fields that may need to handle leading wildcard queries. There is an example of this in the &lt;code&gt;_default&lt;/code&gt; &lt;a href="https://lucene.apache.org/solr/guide/7_5/config-sets.html"&gt;config set&lt;/a&gt; called &lt;a href="https://github.com/apache/lucene-solr/blob/branch_7_5/solr/server/solr/configsets/_default/conf/managed-schema#L440"&gt;&lt;code&gt;text_general_rev&lt;/code&gt;&lt;/a&gt;. This shows how to configure the &lt;code&gt;ReversedWildcardFilterFactory&lt;/code&gt; for a field. It is important to note that the &lt;a href="https://lucene.apache.org/solr/guide/7_5/analyzers.html#analysis-phases"&gt;index and query analyzer phases&lt;/a&gt; are different. The &lt;code&gt;ReversedWildcardFilterFactory&lt;/code&gt; MUST only be implemented as an index analyzer. The query side is handled automatically.&lt;/p&gt;

&lt;p&gt;For reference, here is the &lt;code&gt;text_general_rev&lt;/code&gt; fieldType definition:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;fieldType name="text_general_rev" class="solr.TextField" positionIncrementGap="100"&amp;gt;
    &amp;lt;analyzer type="index"&amp;gt;
        &amp;lt;tokenizer class="solr.StandardTokenizerFactory"/&amp;gt;
        &amp;lt;filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" /&amp;gt;
        &amp;lt;filter class="solr.LowerCaseFilterFactory"/&amp;gt;
        &amp;lt;filter class="solr.ReversedWildcardFilterFactory" withOriginal="true"
                maxPosAsterisk="3" maxPosQuestion="2" maxFractionAsterisk="0.33"/&amp;gt;
      &amp;lt;/analyzer&amp;gt;
      &amp;lt;analyzer type="query"&amp;gt;
        &amp;lt;tokenizer class="solr.StandardTokenizerFactory"/&amp;gt;
        &amp;lt;filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/&amp;gt;
        &amp;lt;filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" /&amp;gt;
        &amp;lt;filter class="solr.LowerCaseFilterFactory"/&amp;gt;
    &amp;lt;/analyzer&amp;gt;
&amp;lt;/fieldType&amp;gt;

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;code&gt;ReversedWildcardFilterFactory&lt;/code&gt; Implementation
&lt;/h3&gt;

&lt;p&gt;When the &lt;code&gt;ReversedWildcardFilterFactory&lt;/code&gt; is set up for a field in Solr, two tokens are emitted for each term during indexing: the original and the reversed form. The screenshot below shows the &lt;a href="https://lucene.apache.org/solr/guide/7_5/analysis-screen.html"&gt;Analysis tab&lt;/a&gt; and how the tokens are created for a simple string &lt;code&gt;abcdefg&lt;/code&gt;.&lt;/p&gt;
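
&lt;p&gt;The reversal itself is easy to picture with the Unix &lt;code&gt;rev&lt;/code&gt; utility (a toy sketch; internally Solr also prepends a marker character to reversed tokens so they don’t collide with normal terms):&lt;/p&gt;

```shell
# The indexed reversed form of the token abcdefg
echo 'abcdefg' | rev    # prints: gfedcba

# A leading wildcard query *fg, reversed, becomes the trailing wildcard gf*
echo '*fg' | rev        # prints: gf*
```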

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--7O3b8wHB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://risdenk.github.io/images/posts/2018-10-25/test_analysis.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7O3b8wHB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://risdenk.github.io/images/posts/2018-10-25/test_analysis.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The extra reversed tokens will increase the index size, but this is usually an acceptable tradeoff since the other option is very slow leading wildcard queries.&lt;/p&gt;

&lt;p&gt;When a query uses a field with &lt;code&gt;ReversedWildcardFilterFactory&lt;/code&gt;, Solr &lt;a href="https://github.com/apache/lucene-solr/blob/branch_7_5/solr/core/src/java/org/apache/solr/parser/SolrQueryParserBase.java#L1192"&gt;internally evaluates&lt;/a&gt; whether to search for the original or reversed query string. One annoying part is that, since this optimization is internal to Solr, there is no indication to the user that the query string was reversed. Even with &lt;a href="https://lucene.apache.org/solr/guide/7_5/common-query-parameters.html"&gt;&lt;code&gt;debug=true&lt;/code&gt;&lt;/a&gt;, the parsed query looks the same since the &lt;a href="https://github.com/apache/lucene-solr/blob/branch_7_5/solr/core/src/java/org/apache/solr/parser/SolrQueryParserBase.java#L1213"&gt;&lt;code&gt;AutomatonQuery#toString()&lt;/code&gt; method&lt;/a&gt; doesn’t provide information on the automaton. The screenshot below shows a leading wildcard query with no indication that it is working correctly.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--AO3x9prp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://risdenk.github.io/images/posts/2018-10-25/wildcard_debug_query.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--AO3x9prp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://risdenk.github.io/images/posts/2018-10-25/wildcard_debug_query.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I was able to confirm the expected behavior by remotely debugging a running Solr server. This showed that the query was properly reversing the automaton based on the parameters for the &lt;a href="https://lucene.apache.org/solr/7_5_0/solr-core/org/apache/solr/analysis/ReversedWildcardFilterFactory.html"&gt;&lt;code&gt;ReversedWildcardFilterFactory&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The only place I’ve been able to find in the Solr UI that shows the &lt;code&gt;ReversedWildcardFilterFactory&lt;/code&gt; actually did anything is the &lt;a href="https://lucene.apache.org/solr/guide/7_5/schema-browser-screen.html"&gt;Schema section&lt;/a&gt; under the collection. Select the field and then click the “Load Term Info” button to get details about the underlying terms. The screenshot below shows the terms for the &lt;code&gt;a_txt_rev&lt;/code&gt; field.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--rMwpMrPB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://risdenk.github.io/images/posts/2018-10-25/a_txt_rev_terms.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--rMwpMrPB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://risdenk.github.io/images/posts/2018-10-25/a_txt_rev_terms.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;Solr and the &lt;code&gt;ReversedWildcardFilterFactory&lt;/code&gt; can help improve the performance of leading wildcard queries if they are absolutely required. When I’ve explained over the years that &lt;code&gt;ReversedWildcardFilterFactory&lt;/code&gt; would solve leading wildcard issues, I hadn’t looked at the internals. This post forced me to look at the internals of how Lucene and Solr work with leading wildcards. I checked multiple versions of Solr (4.3.x, 4.10.x, 5.5.x, 6.3.x, and 7.5.x) initially thinking that the query was not using the reversed tokens. It wasn’t until I used a debugger that I could convince myself the query was being handled properly. Better debug logging for this case would have helped tremendously.&lt;/p&gt;

&lt;h4&gt;
  
  
  Solr Setup Reference
&lt;/h4&gt;

&lt;p&gt;I used the following to set up Apache Solr and reproduce all the screenshots above. There are also command line versions for gathering the same information programmatically.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;./bin/solr start -c
./bin/solr create -c test -n basic_configs
echo '1,abcdefg,abcdefg' | ./bin/post -c test -type text/csv -params "fieldnames=id,a_txt,a_txt_rev" -d
curl "http://localhost:8983/solr/test/select?q=*:*"
curl "http://localhost:8983/solr/test/select?q=a_txt:abcdefg&amp;amp;debug=true"
curl "http://localhost:8983/solr/test/select?q=a_txt_rev:abcdefg&amp;amp;debug=true"
curl "http://localhost:8983/solr/test/admin/luke?fl=a_txt_rev&amp;amp;numTerms=2" 

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



</description>
      <category>bigdata</category>
      <category>apache</category>
      <category>solr</category>
      <category>lucene</category>
    </item>
    <item>
      <title>Apache Solr - Running on Apache Hadoop HDFS</title>
      <dc:creator>Kevin Risden</dc:creator>
      <pubDate>Tue, 23 Oct 2018 14:00:00 +0000</pubDate>
      <link>https://dev.to/risdenk/apache-solr---running-on-apache-hadoop-hdfs-17pc</link>
      <guid>https://dev.to/risdenk/apache-solr---running-on-apache-hadoop-hdfs-17pc</guid>
      <description>&lt;h3&gt;
  
  
  Overview
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://lucene.apache.org/solr"&gt;Apache Solr&lt;/a&gt; is a full text search engine that is built on &lt;a href="https://lucene.apache.org/solr/"&gt;Apache Lucene&lt;/a&gt;. I’ve been working with Apache Solr for the past six years. Some of these were pure Solr installations, but many were integrated with Apache Hadoop. This includes both Hortonworks HDP Search as well as Cloudera Search. Performance for Solr on HDFS is a common question so writing this post to help share some of my experience.&lt;/p&gt;

&lt;h3&gt;
  
  
  Apache Hadoop HDFS
&lt;/h3&gt;

&lt;p&gt;Apache Hadoop contains a filesystem called Hadoop Distributed File System (HDFS). HDFS is designed to scale to petabytes of data on commodity hardware. The definition of commodity hardware has changed over the years, but the premise is that the latest and greatest hardware is not needed. HDFS is used by a variety of workloads from Apache HBase to Apache Spark. Performance on HDFS tends to favor large files for both reading and writing. HDFS also uses all available disks for I/O which can be helpful for large clusters.&lt;/p&gt;

&lt;h3&gt;
  
  
  Apache Solr and HDFS
&lt;/h3&gt;

&lt;p&gt;Apache Solr has been able to run on HDFS since the early 4.x versions. Cloudera Search added this capability to be able to use the existing HDFS storage for search. Hortonworks HDP Search, since it is based on Apache Solr, supports HDFS as well. Since HDFS is not a local filesystem, Apache Solr implements a block cache that is designed to cache HDFS blocks in memory. With the HDFS block cache, query performance can be slower than but comparable to local indices. The HDFS block cache is not used for merging, indexing, or read-once use cases, which means there are some areas where Apache Solr on HDFS can be slower.&lt;/p&gt;

&lt;h3&gt;
  
  
  Apache Solr Performance
&lt;/h3&gt;

&lt;p&gt;If you are looking for the best performance with the fewest variations, then SSDs and ample memory are where you should be looking. If you are budget constrained, then spinning disks with enough memory can also provide adequate performance. Solr on HDFS can perform just as well as local disks given the right amount of memory. The common “it depends” caveat comes down to the specific use case. For large-scale analytics, Solr on HDFS performs well. For high-speed indexing, you will need SSDs, since the write performance of Solr on HDFS will not match them.&lt;/p&gt;

&lt;p&gt;Most of the time when dealing with Solr performance issues, I found that the underlying hardware was not the problem. Typically the way the data is indexed or queried has the biggest impact on performance. The standard debugging and improvement techniques there help across all types of hardware.&lt;/p&gt;

&lt;h3&gt;
  
  
  Apache Solr on HDFS - Best Practices
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Shutdown Apache Solr Cleanly
&lt;/h4&gt;

&lt;p&gt;Make sure you give Apache Solr plenty of time to shutdown cleanly. Older versions of the &lt;code&gt;solr&lt;/code&gt; script waited only 5 seconds before shutting down. Increase the sleep time to ensure that you do not leave &lt;code&gt;write.lock&lt;/code&gt; files on HDFS from an unclean shutdown.&lt;/p&gt;
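
&lt;p&gt;Newer versions of the &lt;code&gt;solr&lt;/code&gt; script expose this as a variable in &lt;code&gt;solr.in.sh&lt;/code&gt; (a sketch; on older versions you may have to edit the sleep in the script itself):&lt;/p&gt;

```shell
# solr.in.sh - give Solr up to 180 seconds to stop before being killed
SOLR_STOP_WAIT=180
```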

&lt;h4&gt;
  
  
  Ulimits must be configured correctly
&lt;/h4&gt;

&lt;p&gt;Ensure that you have the proper ulimits for the user running Solr. Ulimits that are too low will cause hard-to-debug failures once Solr runs out of file handles or processes.&lt;/p&gt;
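
&lt;p&gt;For example, assuming Solr runs as a user named &lt;code&gt;solr&lt;/code&gt;, something like the following in &lt;code&gt;/etc/security/limits.conf&lt;/code&gt; raises the open file and process limits (the exact values depend on your workload):&lt;/p&gt;

```shell
# /etc/security/limits.conf - "solr" service user is an assumption
solr soft nofile 65000
solr hard nofile 65000
solr soft nproc 65000
solr hard nproc 65000
```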

&lt;h4&gt;
  
  
  Use a Zookeeper chroot
&lt;/h4&gt;

&lt;p&gt;With Apache Hadoop, many different pieces of software use ZooKeeper. It will help keep things organized if you use a chroot specifically for Solr.&lt;/p&gt;
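
&lt;p&gt;For example (a sketch; the ZooKeeper host names are placeholders), create the chroot once and then reference it in &lt;code&gt;ZK_HOST&lt;/code&gt;:&lt;/p&gt;

```shell
# Create the chroot znode once (bin/solr zk mkroot exists in recent versions)
./bin/solr zk mkroot /solr -z zk1:2181,zk2:2181,zk3:2181

# solr.in.sh - note the /solr chroot suffix on the connection string
ZK_HOST="zk1:2181,zk2:2181,zk3:2181/solr"
```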

&lt;h4&gt;
  
  
  Make a directory on HDFS for Solr
&lt;/h4&gt;

&lt;p&gt;Make a directory on HDFS for Solr that isn’t used for anything else. This will make sure you don’t cause problems with other processes reading/writing from that location. It also makes it possible to set permissions to ensure only the Solr user has access.&lt;/p&gt;
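
&lt;p&gt;For example (a sketch matching the &lt;code&gt;/apps/solr&lt;/code&gt; path used in the example configuration; the owner and permissions are assumptions):&lt;/p&gt;

```shell
# Create a dedicated HDFS directory that only the Solr user can access
hdfs dfs -mkdir -p /apps/solr
hdfs dfs -chown solr:solr /apps/solr
hdfs dfs -chmod 750 /apps/solr
```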

&lt;h4&gt;
  
  
  HDFS Block Cache must be tuned
&lt;/h4&gt;

&lt;p&gt;Ensure that the HDFS Block Cache is enabled and that it is tuned properly. By default the block cache does not have enough slabs for good performance. Each slab of the HDFS block cache is by default 128MB (&lt;code&gt;solr.hdfs.blockcache.blocksperbank&lt;/code&gt;: 16384 * 8KB). The number of slabs determines how much memory will be used for caching. Since the HDFS block cache is stored off heap, Java must also be allowed to allocate up to that amount of direct memory with &lt;code&gt;-XX:MaxDirectMemorySize&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Here is a handy table showing the relationship between the number of slabs, &lt;code&gt;-XX:MaxDirectMemorySize&lt;/code&gt;, and the HDFS block cache size.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;code&gt;-Dsolr.hdfs.blockcache.slab.count&lt;/code&gt;&lt;/th&gt;
&lt;th&gt;&lt;code&gt;-XX:MaxDirectMemorySize&lt;/code&gt;&lt;/th&gt;
&lt;th&gt;HDFS Block Cache Size&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;250MB&lt;/td&gt;
&lt;td&gt;128MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;2GB&lt;/td&gt;
&lt;td&gt;1GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;20&lt;/td&gt;
&lt;td&gt;4GB&lt;/td&gt;
&lt;td&gt;2.5GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;40&lt;/td&gt;
&lt;td&gt;8GB&lt;/td&gt;
&lt;td&gt;5GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;15GB&lt;/td&gt;
&lt;td&gt;12.5GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;200&lt;/td&gt;
&lt;td&gt;30GB&lt;/td&gt;
&lt;td&gt;25GB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;When configured correctly, Solr will print out a calculation of the memory required in the logs like so:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Block cache target memory usage, slab size of [134217728] will allocate [40] slabs and use ~[5368709120] bytes

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
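&lt;p&gt;The numbers in that log line can be reproduced by hand: each slab is &lt;code&gt;solr.hdfs.blockcache.blocksperbank&lt;/code&gt; (16384) blocks of 8KB, and the total is the slab size times the slab count.&lt;/p&gt;

```shell
# Reproduce Solr's block cache sizing math for 40 slabs
slabs=40
slab_bytes=$((16384 * 8192))        # blocksperbank * 8KB block = 134217728 (128MB)
total_bytes=$((slabs * slab_bytes)) # 5368709120 bytes, ~5GB
echo "$slab_bytes $total_bytes"
```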



&lt;h4&gt;
  
  
  Ensure that HDFS Short Circuit Reads are enabled
&lt;/h4&gt;

&lt;p&gt;HDFS Short Circuit Reads allow an HDFS client running on the same node as the data to read block files directly from local disk, bypassing the DataNode and avoiding a network round trip. This can significantly improve read performance.&lt;/p&gt;
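&lt;p&gt;To check whether the HDFS client configuration that Solr will pick up has short circuit reads enabled, the same &lt;code&gt;hdfs getconf&lt;/code&gt; trick used for &lt;code&gt;fs.defaultFS&lt;/code&gt; in the example configuration works. These are the standard HDFS property names; the values depend on your distribution:&lt;/p&gt;

```shell
# Short circuit reads must be enabled in the HDFS client config
hdfs getconf -confKey dfs.client.read.shortcircuit   # should print true

# They also require a DataNode domain socket path to be configured
hdfs getconf -confKey dfs.domain.socket.path
```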

&lt;h4&gt;
  
  
  Example Configuration
&lt;/h4&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Solr HDFS - setup
## Use HDFS by default for its collection data and tlogs
SOLR_OPTS="$SOLR_OPTS -Dsolr.directoryFactory=HdfsDirectoryFactory \
    -Dsolr.lock.type=hdfs \
    -Dsolr.hdfs.home=$(hdfs getconf -confKey fs.defaultFS)/apps/solr \
    -Dsolr.hdfs.confdir=/etc/hadoop/conf"

## If HDFS Kerberos enabled, uncomment the following
#SOLR_OPTS="$SOLR_OPTS -Dsolr.hdfs.security.kerberos.enabled=true \
# -Dsolr.hdfs.security.kerberos.keytabfile=/etc/security/keytabs/solr.keytab \
# -Dsolr.hdfs.security.kerberos.principal=solr@REALM"

# Solr HDFS - performance
## Enable the HDFS Block Cache to take the place of memory mapping files
SOLR_OPTS="$SOLR_OPTS -Dsolr.hdfs.blockcache.enabled=true \
    -Dsolr.hdfs.blockcache.global=true \
    -Dsolr.hdfs.blockcache.read.enabled=true \
    -Dsolr.hdfs.blockcache.write.enabled=false"

## Size the HDFS Block Cache
SOLR_OPTS="$SOLR_OPTS -Dsolr.hdfs.blockcache.blocksperbank=16384 \
    -Dsolr.hdfs.blockcache.slab.count=200"

## Enable direct memory allocation to allocate HDFS Block Cache off heap
SOLR_OPTS="$SOLR_OPTS -Dsolr.hdfs.blockcache.direct.memory.allocation=true \
    -XX:MaxDirectMemorySize=30g -XX:+UseLargePages"

## Enable HDFS Short Circuit reads if possible
### Note: This path is different for Cloudera. It must be the path to the HDFS native libraries
SOLR_OPTS="$SOLR_OPTS -Djava.library.path=:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64:/usr/hdp/current/hadoop-client/lib/native"

## If Near Real Time (NRT) enable HDFS NRT caching directory, uncomment the following
#SOLR_OPTS="$SOLR_OPTS -Dsolr.hdfs.nrtcachingdirectory.enable=true \
# -Dsolr.hdfs.nrtcachingdirectory.maxmergesizemb=16 \
# -Dsolr.hdfs.nrtcachingdirectory.maxcachedmb=192"

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;It is possible to get reasonable performance out of Apache Solr running on Apache Hadoop HDFS. If the budget allows, SSDs will give better performance for both indexing and querying. Given a proper amount of memory, however, even spinning disks will deliver adequate performance for Apache Solr.&lt;/p&gt;

&lt;h4&gt;
  
  
  References
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://wiki.apache.org/solr/SolrPerformanceProblems#SSD"&gt;https://wiki.apache.org/solr/SolrPerformanceProblems#SSD&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://sbdevel.wordpress.com/2013/06/06/memory-is-overrated/"&gt;https://sbdevel.wordpress.com/2013/06/06/memory-is-overrated/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://community.hortonworks.com/questions/27567/write-performance-in-hdfs.html"&gt;https://community.hortonworks.com/questions/27567/write-performance-in-hdfs.html&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.cloudera.com/blog/2014/03/the-truth-about-mapreduce-performance-on-ssds/"&gt;https://blog.cloudera.com/blog/2014/03/the-truth-about-mapreduce-performance-on-ssds/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://community.hortonworks.com/questions/4858/solrcloud-performance-hdfs-indexdata.html"&gt;https://community.hortonworks.com/questions/4858/solrcloud-performance-hdfs-indexdata.html&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.slideshare.net/lucidworks/solr-on-hdfs-final-mark-miller"&gt;https://www.slideshare.net/lucidworks/solr-on-hdfs-final-mark-miller&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://issues.apache.org/jira/browse/SOLR-7393"&gt;https://issues.apache.org/jira/browse/SOLR-7393&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://blog.cloudera.com/blog/2017/06/apache-solr-memory-tuning-for-production/"&gt;http://blog.cloudera.com/blog/2017/06/apache-solr-memory-tuning-for-production/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://blog.cloudera.com/blog/2017/06/solr-memory-tuning-for-production-part-2/"&gt;http://blog.cloudera.com/blog/2017/06/solr-memory-tuning-for-production-part-2/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://community.plm.automation.siemens.com/t5/Developer-Space/Running-Solr-on-S3/td-p/449360"&gt;https://community.plm.automation.siemens.com/t5/Developer-Space/Running-Solr-on-S3/td-p/449360&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>bigdata</category>
      <category>apache</category>
      <category>solr</category>
      <category>hadoop</category>
    </item>
  </channel>
</rss>
