<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: shivamanipatil</title>
    <description>The latest articles on DEV Community by shivamanipatil (@q1ra).</description>
    <link>https://dev.to/q1ra</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F331748%2F45851109-1d2e-40b8-bd5d-0de5b17df38f.jpeg</url>
      <title>DEV Community: shivamanipatil</title>
      <link>https://dev.to/q1ra</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/q1ra"/>
    <language>en</language>
    <item>
      <title>Digital certificates And PKI</title>
      <dc:creator>shivamanipatil</dc:creator>
      <pubDate>Sat, 11 Nov 2023 11:54:40 +0000</pubDate>
      <link>https://dev.to/q1ra/digital-certificates-and-pki-3jm6</link>
      <guid>https://dev.to/q1ra/digital-certificates-and-pki-3jm6</guid>
      <description>&lt;p&gt;In this post, I will talk about public key cryptography, how encryption/decryption happens in asymmetric key cryptography, and how digital signatures and certificates are useful.&lt;/p&gt;

&lt;h3&gt;
  
  
  Public key cryptography
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Also, known as asymmetric key cryptography. Asymmetric because there are 2 keys - public and private. &lt;/li&gt;
&lt;li&gt;Public key is available to anyone in public. But the private key remains with the original owner. &lt;/li&gt;
&lt;li&gt;Can be used to encrypt/decrypt data or to verify the authenticity of the data(digital signatures). &lt;/li&gt;
&lt;li&gt;&lt;em&gt;The thing to note is that public/private can be used interchangeably (they are similar in that sense) i.e. data encrypted by one can be decrypted by another. For the sake of convention, in this post, we will use the terms public key for encryption and private key for decryption.&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;RSA is an example of one such cryptosystem.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Encryption and decryption
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;In asymmetric key cryptography. The public key is available to the public. &lt;/li&gt;
&lt;li&gt;The act of encrypting the data involves using some cryptographic algorithm with inputs as the public key and data being encrypted. &lt;/li&gt;
&lt;li&gt;The end result is seemingly random-looking data. Obtaining original data from this random-looking data involves using the cryptographic algorithm with inputs as the private key and encrypted data.&lt;/li&gt;
&lt;li&gt;Since public keys are available to the public, the sender can encrypt the data which can only be decrypted by the receiver.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Digital signatures
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Digital signatures are generally used to verify the integrity of the data. Practically speaking it follows, that party A generates some data known as a digital signature for the actual data. And when sending the actual data this digital signature is also sent with it.
Other parties now can actually verify that the received data is correct and is really sent by party A by using the digital signature.&lt;/li&gt;
&lt;li&gt;The way this works using public key cryptography is : 

&lt;ol&gt;
&lt;li&gt;The sender generates the digital signature(can be thought of as a process similar to encrypting) using its private key (using some crypto algo).&lt;/li&gt;
&lt;li&gt;The receiver now has this digital signature of the data, the data and, the public key of the sender. We have already discussed that the public and private keys are interchangeable in that data encrypted by one can be decrypted using the other.&lt;/li&gt;
&lt;li&gt;Due to this receiver can retrieve the data using the sender's public key and the digital signature. If the sent data by the sender and retrieved data are the same, then the verification is successful, suggesting that the data was indeed sent by the sender.&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;Practically, the same digital signature can be performed on the subset of the data (or even the hash of the data!) and we could still perform similar verification.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Digital certificate
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;This is an application of digital signatures. Used to establish the identity of the owner(of the certificate) and also its public key. Digital certificates will be associated with a pair of public and private keys. The server/owner will have the private key and the public key will be distributed along with the digital certificate.&lt;/li&gt;
&lt;li&gt;A question arises, with just this information in the digital certificate, can some malicious party pose as another entity using a forged digital certificate? i.e. we have yet to establish/verify that the public key really belongs to the said owner.&lt;/li&gt;
&lt;li&gt;To solve this problem we have issuers of the digital certificate, and the issuer itself has its public/private key pair. And their job is to verify the ownership and sign the digital certificate using their private key. This issuer's digital signature is also present in the digital certificate.&lt;/li&gt;
&lt;li&gt;To understand this end-to-end on how an end user can verify the digital cert, consider this flow :

&lt;ol&gt;
&lt;li&gt;The user wants to verify the server A digital certificate. The certificate has server A public key, issuer A details, and issuer A digital signature.&lt;/li&gt;
&lt;li&gt;Assume that the user trusts issuer A and also has its public key. Then it can verify using the issuer's digital signature that the server A digital certificate was indeed signed by the issuer thereby confirming if the digital certificate is valid.&lt;/li&gt;
&lt;li&gt;It can be clearly seen here that a chain of trust is established, that is to verify the digital certificate of server A we have to verify/trust the issuer's digital signature.&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;The issuer here acts as the &lt;em&gt;Certificate Authority(CA)&lt;/em&gt;. And the job of a CA is to verify the identities and issue digital certificates for the entities. In our case, Issuer A was the CA for server A digital certificate. What this really means is that if you trust a CA then you automatically trust certificates signed/issued by this CA!&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;The process of signing a digital certificate by an issuer effectively means signing it using the issuer's private key or using the issuer's own digital certificate(both are the same technically, just different uses of language).&lt;/em&gt; FYI, the Digital certificate owner is also known as subject in technical terms.&lt;/li&gt;
&lt;li&gt;X.509 is one of the standards for public key certificates.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Digital certificate types
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Generally there are 3 types :&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;em&gt;Root certificate&lt;/em&gt; : 

&lt;ul&gt;
&lt;li&gt;These certificates are at the top of the chain of trust and therefore are self-signed (i.e. signed using their own private key).&lt;/li&gt;
&lt;li&gt;These are also trust anchors because the trust for these is to be assumed and we cannot verify or derive the trust for these.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Intermediate certificate&lt;/em&gt; : 

&lt;ul&gt;
&lt;li&gt;These are signed by the root certificate(same as root CA). e.g. issuer A in the above example. &lt;/li&gt;
&lt;li&gt;To verify intermediate certificates, clients will use root CA cert public keys.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Leaf certificate&lt;/em&gt; : 

&lt;ul&gt;
&lt;li&gt;Signed by the intermediate certificate (same as intermediate CA). &lt;/li&gt;
&lt;li&gt;These are terminal nodes i.e. these cannot be used to issue new digital certificates. (The entity owning this is not a CA!)&lt;/li&gt;
&lt;li&gt;The role of intermediate CA is to delegate/proxy the role of signing from root CA.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;I am using CA and their digital certificates interchangeably here. Although the CA is the one actually owning its digital certificate (so technically not the same) the act of signing itself involves the CA digital certificate(or the private key).&lt;/em&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Till now we have discussed how subjects can acquire digital certificates(for their public keys) from the CA. And how those digital certificates can be verified by other parties by establishing the chain of trust till root CA. &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;&lt;em&gt;It is apparent that the applications (e.g. web browsers) need to trust some CA certs for which verification cannot be derived (e.g root CA certs). To facilitate the validation of these operating systems and applications have something called trust stores containing the information about CAs they can trust!&lt;/em&gt;&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Hopefully, this gave you an overview of Public key cryptography, digital signatures, digital certificates and Certification authority. These concepts are widely used. One notable application is TLS, but that is a story for a different post.&lt;/p&gt;

</description>
      <category>security</category>
      <category>cryptography</category>
      <category>computerscience</category>
      <category>cybersecurity</category>
    </item>
    <item>
      <title>What is Wi-Fi anyway?</title>
      <dc:creator>shivamanipatil</dc:creator>
      <pubDate>Thu, 09 Nov 2023 17:20:58 +0000</pubDate>
      <link>https://dev.to/q1ra/what-is-wi-fi-anyway-18n3</link>
      <guid>https://dev.to/q1ra/what-is-wi-fi-anyway-18n3</guid>
      <description>&lt;p&gt;You might ask what you mean by 'What is Wi-Fi?'😕 - I connect my device to the WiFi network and boom I am accessing the internet. Well, I held this abstraction for quite some time and the time had come to delve into a lower abstraction, because why not, there is no harm in knowing some background details. For the sake of not losing my mind, I wouldn't dive into super lower-level details nor I have looked into them 😃. But this should at least give you an understanding of how 'WiFi' operates.&lt;/p&gt;

&lt;h3&gt;
  
  
  Background
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;Let's clear (or loosely define) some background terms&lt;/em&gt; :&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Network&lt;/strong&gt;: Group of computers communicating and sharing information/resources with each other.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Wireless network&lt;/strong&gt;: Network without physical connections(wires, cables etc.).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Physical Layer&lt;/strong&gt;(TCP/IP model Layer 1): The actual medium between the nodes of a network. Maybe an easier thing to imagine would be how bits are sent as signals between the computer nodes are handled by this layer.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data link layer&lt;/strong&gt;(TCP/IP model Layer 2): This layer handles framing, addressing, reliable connectivity, error detection/correction, collision detection etc. It makes sense that when data is being transmitted through the physical layer, some upper layer needs to handle these problems for effective communication.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Frame&lt;/strong&gt;: Unit of transport at the data link layer. Would consist of some headers and the actual data (payload) being sent. Headers would help in establishing the connection(source/destination MAC addresses) and some other control data.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  General scenario
&lt;/h3&gt;

&lt;p&gt;Consider what happens when you try to connect to a WiFi network for the first time through your phone. &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;You see the name of the wifi network listed in your phone's wifi settings. The technical term for this is actually called SSID (service set identifier).&lt;/li&gt;
&lt;li&gt;Some networks would require you to put in a password(most home wifi networks) and others would not(some mall/coffee shop networks).&lt;/li&gt;
&lt;li&gt;You click connect and you are able to access the internet (in most cases!).&lt;/li&gt;
&lt;/ol&gt;

&lt;blockquote&gt;
&lt;p&gt;So a simple question arises what exactly is called Wi-Fi here? &lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;WiFi is a standard used to enable wireless networking. And it operates at the &lt;em&gt;physical layer&lt;/em&gt; and the &lt;em&gt;data link layer&lt;/em&gt;. &lt;/li&gt;
&lt;li&gt;At the physical layer, it enables communication between the nodes using radio waves. At the data link layer, it mostly does the things mentioned in the background section.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;How does my phone even know/discover that a wireless network exists and information about it?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;Routers periodically broadcast information about their network. It contains information such as SSID (the name!), security, signal strength etc. Technically called WiFi beacon signal. &lt;/li&gt;
&lt;li&gt;And since this broadcast is via radio waves anyone within the vicinity can receive them.&lt;/li&gt;
&lt;li&gt;Your phone periodically checks for these signals and lists them. WiFi standards would have defined the structure of the signal in the first place. And since your phone supports the WiFi standard it can interpret it.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;OK I was in a coffee shop and connected to a network without a password then what happened?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;The client (mobile) node sends an association request to the router, the router acknowledges the request and welcomes the client onto the network. It would receive an IP address through DHCP servers.&lt;/li&gt;
&lt;li&gt;Once the IP address is assigned the client is part of the network. The communication required by the client will be handled and routed by the router (e.g. internet access).&lt;/li&gt;
&lt;li&gt;The concern here is that this was an open network so there is no encryption of any sort between the router and the client. Anyone can intercept the WiFi radio signal and read the unencrypted frame payload. You still might be secured if communicating over TLS (Transport layer security). &lt;/li&gt;
&lt;li&gt;But communication happening lower than TLS or using different transport like UDP might still be affected (e.g. DNS queries over UDP).&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Well, what happens when I connect to a protected WiFi network requiring some password?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;In this case, the network would have some type of &lt;strong&gt;Security type&lt;/strong&gt; standard configured. Some examples of it are WPA/WPA2/WPA3. Most probably, you would have seen these on your phone when connecting to a password-protected WiFi network. This information is also broadcast along with wifi beacon signals.&lt;/li&gt;
&lt;li&gt;This password detail is sent along with the association request to the router. With that, a 4-way handshake occurs between the client and the router. The end result would be that the client and router would have the same secret key with them, which will be used to encrypt/decrypt (both with the same key!) the frame payload. Thus, securing our connection.&lt;/li&gt;
&lt;li&gt;These WPA standards typically differ in how the handshakes occur and how the symmetric key is generated.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Got it, let's say I am connected to a protected WiFi network how does the overall flow look like now when the client wants to communicate with some node out there on the internet?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ol&gt;
&lt;li&gt;The client generates the layer 2 frames and the frame payload itself has layer 3 details (IP address details of the node on the internet). The frame payload is encrypted using the symmetric key generated via 4-way handshake. &lt;/li&gt;
&lt;li&gt;The client after looking at the local routing table infers that to send the data onto the internet the network router is the default gateway. So the router's MAC address is filled in the layer 2 destination frame header. And the frame is sent over the physical layer (i.e. radio waves for wifi). &lt;/li&gt;
&lt;li&gt;The router intercepts the frames decrypts the payload and inspects that the IP address belongs to some node over the internet. 4. It then repacks this IP packet into another frame with the destination being the ISP (of course there might be intermediaries).&lt;/li&gt;
&lt;li&gt;And similar thing happens in reverse for the response received (for our client).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I know I have omitted many details, but the point was to get to a level where we have a general idea of how the stuff works. Hope, this is useful to people looking for more details on the working of the WiFi.&lt;/p&gt;

</description>
      <category>computerscience</category>
      <category>wifi</category>
      <category>learning</category>
      <category>web</category>
    </item>
    <item>
      <title>HyperLogLog : Memory vs Accuracy</title>
      <dc:creator>shivamanipatil</dc:creator>
      <pubDate>Tue, 17 May 2022 15:51:53 +0000</pubDate>
      <link>https://dev.to/q1ra/hyperloglog-12bn</link>
      <guid>https://dev.to/q1ra/hyperloglog-12bn</guid>
      <description>&lt;h2&gt;
  
  
  Table of contents
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Streaming and sketching algorithms&lt;/li&gt;
&lt;li&gt;Count Distinct problem&lt;/li&gt;
&lt;li&gt;Hyperloglog(HLL)&lt;/li&gt;
&lt;li&gt;HLL improvements&lt;/li&gt;
&lt;li&gt;Demo&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Streaming and sketching algorithms
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Algorithms for data streams that examine data a few times and have low memory requirements.&lt;/li&gt;
&lt;li&gt;Storing individual elements is not possible in data streams as in most cases the size is enormous.&lt;/li&gt;
&lt;li&gt;We try to get approximate or "good enough" estimates while storing the "sketch" of the data(i.e some representation of data) but not the data itself. The "sketch" size should be much less than the data size.&lt;/li&gt;
&lt;li&gt;In short, we are doing a tradeoff between accuracy and memory.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Count Distinct problem
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Finding the number of unique/distinct elements in a multiset(set with repeated elements). &lt;/li&gt;
&lt;li&gt;e.g., unique visitors to a website and unique IP addresses through a router.&lt;/li&gt;
&lt;li&gt;Simple solution to this could be to keep a map of every element and its count. The main drawback of this approach is the storage of the map, as in the worst case, it is O(n).&lt;/li&gt;
&lt;li&gt;A sketching algorithm could prove helpful for this case as, in most cases, we would not care about 100% accuracy.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Hyperloglog(HLL)
&lt;/h2&gt;

&lt;p&gt;Sketching algorithm designed to solve the count-distinct problem within acceptable error rate. Described in the paper "HyperLogLog: the analysis of near-optimal cardinality estimation algorithm", published by Flajolet, Fusy, Gandouet and Meunier in 2007. Improvements were proposed in "HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm", published by Stefan Heulem, Marc Nunkesse and Alexander Hall.&lt;/p&gt;

&lt;p&gt;Based on simple obervation : &lt;strong&gt;On average, a sequence of k consecutive zeros in binary form of a number will occur once in every 2^k distinct entries.&lt;/strong&gt; E.g. if we have a large enough collection of fixed-width numbers (e.g. 32/64/128 bit) and while going through it and we find a number starting with k consecutive zeros in binary form, we can be almost sure that there are at least 2^k numbers in that collection.&lt;/p&gt;

&lt;h3&gt;
  
  
  Simple Estimator
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;To keep the input data uniform and evenly distributed we use a hash function. Now each entry is of fixed width(e.g 32/64/128 bits).&lt;/li&gt;
&lt;li&gt;While going through the collection we simply keep the count of longest consecutive sequence of zeros. &lt;/li&gt;
&lt;li&gt;Example of simple estimator (credits : engineering.fb.com) : &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--SEuwzzPB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://engineering.fb.com/wp-content/uploads/2018/12/HLL31.png%3Fresize%3D800%2C400" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--SEuwzzPB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://engineering.fb.com/wp-content/uploads/2018/12/HLL31.png%3Fresize%3D800%2C400" alt="Simple Estimator" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We will define a &lt;strong&gt;counter/estimator&lt;/strong&gt; as data structure which holds longest consecutive sequence of zeros for a collection ( R ). &lt;/li&gt;
&lt;li&gt;So, according to our theory this estimator will give estimate/cardinality as E = 2^R. &lt;/li&gt;
&lt;li&gt;Problem with this is that estimates will be only in powers of 2 and there can be lot of variability as only a single value having largest sequence of zeros can affect the total estimate.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Multiple estimators and HLL
&lt;/h3&gt;

&lt;p&gt;To solve the issue faced by simple estimator we will use multiple estimators(let's say of m number) and divide the input collection between them. So, now each estimator has its own ( R ). We get single estimate using harmonic mean of individual estimates of counters.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--OUkUkwWB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1actnz7rc0srab126zjs.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--OUkUkwWB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1actnz7rc0srab126zjs.PNG" alt="Image description" width="433" height="127"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A estimator is basically just a register or a single variable holding longest sequence of consecutive zeros. Each estimator can be thought of as a bin. So we have m bin i.e m registers (or just array of size m really) to imitate m estimator. &lt;/p&gt;

&lt;p&gt;To divide the input collection between the m estimators we can reserve some starting bits of hashed value for bin index and rest bits for counting longest sequence of consecutive zeros.&lt;/p&gt;

&lt;p&gt;Illustration of this (credits : engineering.fb.com) : &lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--52e500vP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://engineering.fb.com/wp-content/uploads/2018/12/HLL5.png%3Fresize%3D800%2C400" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--52e500vP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://engineering.fb.com/wp-content/uploads/2018/12/HLL5.png%3Fresize%3D800%2C400" alt="Bin buckets" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is the basis of the hyperloglog algorithm. The algorithm described in the paper with some error corrections is as follows idea behind it remains same though :&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--cKReNfEh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5z6otk28a6yci9uzv5cu.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--cKReNfEh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5z6otk28a6yci9uzv5cu.PNG" alt="Image description" width="337" height="490"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  HLL parameters and storage analysis
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--hQz4nfw---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/30bdtsnn52gzi7vl8cu8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--hQz4nfw---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/30bdtsnn52gzi7vl8cu8.png" alt="Image description" width="546" height="413"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From above illustration we can clearely see that only 12 KB storage is required for a very large ndistinct value and relatively small error rate.&lt;/p&gt;

&lt;h2&gt;
  
  
  HLL improvements
&lt;/h2&gt;

&lt;p&gt;Improvements were proposed in "HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm", published by Stefan Heulem, Marc Nunkesse and Alexander Hall :&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Using 64 bit hash function&lt;/li&gt;
&lt;li&gt;Estimating small cardinalities&lt;/li&gt;
&lt;li&gt;Sparse representation&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  64 bit hash function
&lt;/h3&gt;

&lt;p&gt;Hash function with L bits will distinguish at most 2^L values. Beyond 4 billion unique entries 32 bit hash function wouldn't be &lt;br&gt;
useful. 64 bit hash function would be usable till 1.8 * 10^19 cardinalities (which should suffice for all cases).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--RzX9cSOr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/i0utdvdvxdpidjiaokih.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--RzX9cSOr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/i0utdvdvxdpidjiaokih.PNG" alt="Image description" width="336" height="229"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here you can see size increased by just m bits or 2 KB(if m = 2^14 with error=0.008125).&lt;/p&gt;

&lt;h3&gt;
  
  
  Estimating small cardinalities
&lt;/h3&gt;

&lt;p&gt;Raw estimate of original hll algorithm gives large error for small cardinalities. To reduce the error original algorithm uses linear counting for cardinalities less than 5*m/2. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://storage.googleapis.com/pub-tools-public-publication-data/pdf/40671.pdf"&gt;Improvement paper&lt;/a&gt; found out that the error for small cardinalities is due to a bias(i.e algorithm overestimates the cardinalities). They also could correct the bias by precomputing the bias values and substracting them from the raw estimate.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--coTq0yzh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/a2m858uspvvjh5qi0bfh.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--coTq0yzh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/a2m858uspvvjh5qi0bfh.PNG" alt="Image description" width="504" height="631"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This figure from &lt;a href="https://storage.googleapis.com/pub-tools-public-publication-data/pdf/40671.pdf"&gt;Improvement paper&lt;/a&gt; shows the bias for raw estimate. Also, it shows linear counting is still better than bias-correction for smaller cardinalities.&lt;/p&gt;

&lt;p&gt;From this they concluded to perform linear counting till 11500 for p = 14 and bias correction after that.&lt;/p&gt;

&lt;h3&gt;
  
  
  Sparse representation
&lt;/h3&gt;

&lt;p&gt;When n &amp;lt;&amp;lt; m, then most of the bins(2^14 if p=14) are empty and memory is wasted. So instead we can store just pairs of (idx, R) values (idx - index of bin, R - longest consecutive zeros sequence observed by that bin). &lt;/p&gt;

&lt;p&gt;And when size of these pairs increases than the normal(dense) mode of m bins then we can convert to the normal(dense) mode. Thing to note is that we can merge pairs with same idx and keep just the pair with largest R.&lt;/p&gt;

&lt;h2&gt;
  
  
  Demo
&lt;/h2&gt;

&lt;p&gt;For the demo I will be using &lt;a href="https://github.com/conversant/postgres_hyperloglog"&gt;postgress_hyperloglog&lt;/a&gt;. This is a postgreSQL extension implementing hyperloglog estimator. You can follow the installation steps from the repo README.&lt;/p&gt;

&lt;h3&gt;
  
  
  Example 1:
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--xPw0lVRh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0ov9xe8py6pi1fqeq7nn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--xPw0lVRh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0ov9xe8py6pi1fqeq7nn.png" alt="Image description" width="880" height="117"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here, we are generating 1,10000 series and then doing hyperloglog_distinct aggregate on these. &lt;br&gt;
hyperloglog_distinct aggregate just creates a hll estimator, adds column elements and returns the estimate.&lt;br&gt;
The hll estimates 9998.401034851891 which is pretty close to 10000 cardinality.&lt;/p&gt;

&lt;h3&gt;
  
  
  Example 2:
&lt;/h3&gt;

&lt;p&gt;You can serialize these estimators, save them and load them again to operate on later.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--iFutQfvA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kh0ugyqne23hzkvmosof.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--iFutQfvA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kh0ugyqne23hzkvmosof.png" alt="Image description" width="880" height="45"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Xa2dhsPS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ajrdzt3qhtrzbp6snjt9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Xa2dhsPS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ajrdzt3qhtrzbp6snjt9.png" alt="Image description" width="880" height="493"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;hyperloglog_accum - creates hll estimator, adds column elements and returns the serialized estimator.&lt;br&gt;
Output in second picture is actually base64 encoded byte string of the estimator.&lt;/p&gt;

&lt;h3&gt;
  
  
  Example 3:
&lt;/h3&gt;

&lt;p&gt;We can also get estimate from this accumulated result(example 2).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--cwSVTJYr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/g1r1296bztb2dqp9n08e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--cwSVTJYr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/g1r1296bztb2dqp9n08e.png" alt="Image description" width="880" height="143"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Internal hll data structure is binary compatible with base64 encoded byte string - so they are interoperable. This way we can perform hyperloglog_get_estimate operation on the hll estimator returned by hyperloglog_accum. &lt;/p&gt;

&lt;p&gt;For other examples you can refer to &lt;a href="https://github.com/conversant/postgres_hyperloglog/tree/master/test/sql"&gt;sql references&lt;/a&gt; and try it out yourself. &lt;/p&gt;

&lt;p&gt;References :&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;"HyperLogLog: the analysis of near-optimal cardinality estimation algorithm", published by Flajolet, Fusy, Gandouet and Meunier&lt;/li&gt;
&lt;li&gt;"HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm", published by Stefan Heulem, Marc Nunkesse and Alexander Hall&lt;/li&gt;
&lt;li&gt;&lt;a href="https://engineering.fb.com/2018/12/13/data-infrastructure/hyperloglog/"&gt;https://engineering.fb.com/2018/12/13/data-infrastructure/hyperloglog/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/conversant/postgres_hyperloglog"&gt;https://github.com/conversant/postgres_hyperloglog&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>bigdata</category>
      <category>algorithms</category>
      <category>hyperloglog</category>
      <category>sketchingalgorithm</category>
    </item>
    <item>
      <title>Spark aggregation with native API's</title>
      <dc:creator>shivamanipatil</dc:creator>
      <pubDate>Mon, 28 Feb 2022 13:26:23 +0000</pubDate>
      <link>https://dev.to/q1ra/spark-aggregation-with-native-apis-47pi</link>
      <guid>https://dev.to/q1ra/spark-aggregation-with-native-apis-47pi</guid>
      <description>&lt;h2&gt;
  
  
  Table of contents
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Spark aggregation Overview&lt;/li&gt;
&lt;li&gt;TypedImperativeAggregate[T] abstract class&lt;/li&gt;
&lt;li&gt;Example&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Spark aggregation Overview
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://spark.apache.org/docs/latest/sql-ref-functions-udf-aggregate.html"&gt;User Defined Aggregate Functions&lt;/a&gt; can be used. But are restrictive and require workarounds even for  basic requirements.&lt;/li&gt;
&lt;li&gt;Aggregates are unevaluable expressions and cannot have eval and doGenCode method.&lt;/li&gt;
&lt;li&gt;Basic requirement would be to use user defined java objects as internal spark aggregation buffer type.&lt;/li&gt;
&lt;li&gt;And, passing extra arguments to aggregates e.g aggregate(col, 0.24)&lt;/li&gt;
&lt;li&gt;Spark provides &lt;strong&gt;TypedImperativeAggregate[T]&lt;/strong&gt; contract for such requirement (imperative as in expressed in terms of imperative initialize, update, and merge methods).&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  TypedImperativeAggregate[T] abstract class
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;&lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="nc"&gt;TestAggregation&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;child&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Expression&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; 
  &lt;span class="k"&gt;extends&lt;/span&gt; &lt;span class="nc"&gt;TypedImperativeAggregate&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;T&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;  &lt;span class="o"&gt;{&lt;/span&gt;

  &lt;span class="c1"&gt;// Check input types&lt;/span&gt;
  &lt;span class="k"&gt;override&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;checkInputDataTypes&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;TypeCheckResult&lt;/span&gt;

  &lt;span class="c1"&gt;// Initialize T&lt;/span&gt;
  &lt;span class="k"&gt;override&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;createAggregationBuffer&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;T&lt;/span&gt;

  &lt;span class="c1"&gt;// Update T with row&lt;/span&gt;
  &lt;span class="k"&gt;override&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;buffer&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;T&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;inputRow&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;InternalRow&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;T&lt;/span&gt;

  &lt;span class="c1"&gt;// Merge Intermediate buffers onto first buffer&lt;/span&gt;
  &lt;span class="k"&gt;override&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;merge&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;buffer&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;T&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;other&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;T&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;T&lt;/span&gt;

  &lt;span class="c1"&gt;// Final value&lt;/span&gt;
  &lt;span class="k"&gt;override&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;eval&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;buffer&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;T&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Any&lt;/span&gt; 

  &lt;span class="k"&gt;override&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;withNewMutableAggBufferOffset&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;newOffset&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Int&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;TestAggregation&lt;/span&gt; 

  &lt;span class="k"&gt;override&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;withNewInputAggBufferOffset&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;newOffset&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Int&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;TestAggregation&lt;/span&gt; 

  &lt;span class="k"&gt;override&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;children&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Seq&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;Expression&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;

  &lt;span class="k"&gt;override&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;nullable&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Boolean&lt;/span&gt;

  &lt;span class="c1"&gt;// Datatype of output&lt;/span&gt;
  &lt;span class="k"&gt;override&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;dataType&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;DataType&lt;/span&gt;

  &lt;span class="k"&gt;override&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;prettyName&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;String&lt;/span&gt;

  &lt;span class="k"&gt;override&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;serialize&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;T&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Array&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;Byte&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt; 

  &lt;span class="k"&gt;override&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;deserialize&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bytes&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Array&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;Byte&lt;/span&gt;&lt;span class="o"&gt;])&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;T&lt;/span&gt; 
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Example
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;case class Average&lt;/code&gt; holds count and sum of elements and also acts as internal aggregate buffer.&lt;/li&gt;
&lt;li&gt;Aggregate takes in a numeric column and an extra argument n and return avg(column) * n.&lt;/li&gt;
&lt;li&gt;In SparkSQL this will look like :
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;multiply_average&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;salary&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;average_salary&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Spark alchemy's NativeFunctionRegistration is used to register functions to spark.&lt;/li&gt;
&lt;li&gt;Aggregate Code :
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;com.swoop.alchemy.spark.expressions.NativeFunctionRegistration&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;org.apache.spark.sql.SparkSession&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;org.apache.spark.&lt;/span&gt;&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="nc"&gt;SparkConf&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;SparkContext&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;org.apache.spark.sql.catalyst.InternalRow&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;org.apache.spark.sql.catalyst.analysis.TypeCheckResult&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;org.apache.spark.sql.catalyst.expressions._&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;org.apache.spark.sql.types._&lt;/span&gt;

&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;java.io.&lt;/span&gt;&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="nc"&gt;ByteArrayInputStream&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;ByteArrayOutputStream&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;ObjectInputStream&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;ObjectOutputStream&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;


&lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Average&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Long&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Long&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;AvgTest&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
                  &lt;span class="n"&gt;child&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Expression&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
                  &lt;span class="n"&gt;nExpression&lt;/span&gt; &lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Expression&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
                  &lt;span class="k"&gt;override&lt;/span&gt; &lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;mutableAggBufferOffset&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
                  &lt;span class="k"&gt;override&lt;/span&gt; &lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;inputAggBufferOffset&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;extends&lt;/span&gt; &lt;span class="nc"&gt;TypedImperativeAggregate&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;Average&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;  &lt;span class="o"&gt;{&lt;/span&gt;

&lt;span class="c1"&gt;//  private lazy val n: Long = nExpression.eval().asInstanceOf[Long]&lt;/span&gt;
  &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;this&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;child&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Expression&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;this&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;child&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Literal&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;this&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;child&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Expression&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;nExpression&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Expression&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;this&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;child&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;nExpression&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;

  &lt;span class="k"&gt;override&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;checkInputDataTypes&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;TypeCheckResult&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="nv"&gt;child&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;dataType&lt;/span&gt; &lt;span class="k"&gt;match&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="nc"&gt;LongType&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nv"&gt;TypeCheckResult&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;TypeCheckSuccess&lt;/span&gt;
      &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="k"&gt;_&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nv"&gt;TypeCheckResult&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;TypeCheckFailure&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="s"&gt;"$prettyName only supports long input"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
  &lt;span class="o"&gt;}&lt;/span&gt;

  &lt;span class="k"&gt;override&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;createAggregationBuffer&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Average&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Average&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
  &lt;span class="o"&gt;}&lt;/span&gt;

  &lt;span class="k"&gt;override&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;buffer&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Average&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;inputRow&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;InternalRow&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Average&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;value&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;child&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;eval&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inputRow&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="nv"&gt;buffer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;sum&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="nv"&gt;value&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;asInstanceOf&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;Long&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;
    &lt;span class="nv"&gt;buffer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;count&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="n"&gt;buffer&lt;/span&gt;
  &lt;span class="o"&gt;}&lt;/span&gt;

  &lt;span class="k"&gt;override&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;merge&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;buffer&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Average&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;other&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Average&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Average&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="nv"&gt;buffer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;sum&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="nv"&gt;other&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;sum&lt;/span&gt;
    &lt;span class="nv"&gt;buffer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;count&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="nv"&gt;other&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;count&lt;/span&gt;
    &lt;span class="n"&gt;buffer&lt;/span&gt;
  &lt;span class="o"&gt;}&lt;/span&gt;

  &lt;span class="k"&gt;override&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;eval&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;buffer&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Average&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Any&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;n&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;nExpression&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;eval&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="py"&gt;asInstanceOf&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;Int&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;
    &lt;span class="o"&gt;((&lt;/span&gt;&lt;span class="nv"&gt;buffer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;sum&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="o"&gt;)/(&lt;/span&gt;&lt;span class="nv"&gt;buffer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;count&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt;
  &lt;span class="o"&gt;}&lt;/span&gt;

  &lt;span class="k"&gt;override&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;withNewMutableAggBufferOffset&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;newOffset&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Int&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;AvgTest&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;
    &lt;span class="nf"&gt;copy&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mutableAggBufferOffset&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="n"&gt;newOffset&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;

  &lt;span class="k"&gt;override&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;withNewInputAggBufferOffset&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;newOffset&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Int&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;AvgTest&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;
    &lt;span class="nf"&gt;copy&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inputAggBufferOffset&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="n"&gt;newOffset&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;

  &lt;span class="k"&gt;override&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;children&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Seq&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;Expression&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Seq&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;child&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;nExpression&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;

  &lt;span class="k"&gt;override&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;nullable&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Boolean&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;

  &lt;span class="c1"&gt;// The result type is the same as the input type.&lt;/span&gt;
  &lt;span class="k"&gt;override&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;dataType&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;DataType&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;child&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;dataType&lt;/span&gt;

  &lt;span class="k"&gt;override&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;prettyName&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;String&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"avg_test"&lt;/span&gt;

  &lt;span class="k"&gt;override&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;serialize&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Average&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Array&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;Byte&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;stream&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;ByteArrayOutputStream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ByteArrayOutputStream&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;oos&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ObjectOutputStream&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="nv"&gt;oos&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;writeObject&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="nv"&gt;oos&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;close&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
    &lt;span class="nv"&gt;stream&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;toByteArray&lt;/span&gt;
  &lt;span class="o"&gt;}&lt;/span&gt;

  &lt;span class="k"&gt;override&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;deserialize&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bytes&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Array&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;Byte&lt;/span&gt;&lt;span class="o"&gt;])&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Average&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;ois&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ObjectInputStream&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ByteArrayInputStream&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bytes&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;value&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;ois&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;readObject&lt;/span&gt;
    &lt;span class="nv"&gt;ois&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;close&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
    &lt;span class="nv"&gt;value&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;asInstanceOf&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;Average&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;
  &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Driver code :
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;&lt;span class="k"&gt;object&lt;/span&gt; &lt;span class="nc"&gt;TestAgg&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;object&lt;/span&gt; &lt;span class="nc"&gt;BegRegister&lt;/span&gt; &lt;span class="k"&gt;extends&lt;/span&gt; &lt;span class="nc"&gt;NativeFunctionRegistration&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;expressions&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Map&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;String&lt;/span&gt;, &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;ExpressionInfo&lt;/span&gt;, &lt;span class="kt"&gt;FunctionBuilder&lt;/span&gt;&lt;span class="o"&gt;)]&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Map&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
      &lt;span class="n"&gt;expression&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;AvgTest&lt;/span&gt;&lt;span class="o"&gt;](&lt;/span&gt;&lt;span class="s"&gt;"multiply_average"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;)&lt;/span&gt;
  &lt;span class="o"&gt;}&lt;/span&gt;
  &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Array&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;String&lt;/span&gt;&lt;span class="o"&gt;])&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Unit&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;conf&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;SparkConf&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="py"&gt;setMaster&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"local[*]"&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="py"&gt;setAppName&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"FirstDemo"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;sc&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;SparkContext&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;conf&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;spark&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;SparkSession&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;builder&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="py"&gt;appName&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Demo"&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="py"&gt;config&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;conf&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="py"&gt;getOrCreate&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;


    &lt;span class="nv"&gt;BegRegister&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;registerFunctions&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
      &lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;df&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;spark&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;read&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;json&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"src/test/resources/employees.json"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
      &lt;span class="nv"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;createOrReplaceTempView&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"employees"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
      &lt;span class="nv"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;show&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
      &lt;span class="cm"&gt;/*
      +-------+------+
      |   name|salary|
      +-------+------+
      |Michael|  3000|
      |   Andy|  4500|
      | Justin|  3500|
      |  Berta|  4000|
      +-------+------+
       */&lt;/span&gt;
      &lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;result&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;spark&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;sql&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"SELECT multiply_average(salary) as average_salary FROM employees"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
      &lt;span class="nv"&gt;result&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;show&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
    &lt;span class="cm"&gt;/*
      +--------------+
      |average_salary|
      +--------------+
      |          3750|
      +--------------+
     */&lt;/span&gt;
      &lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;result1&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;spark&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;sql&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"SELECT multiply_average(salary, 2) as average_salary FROM employees"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
      &lt;span class="nv"&gt;result1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;show&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
      &lt;span class="cm"&gt;/*
      +--------------+
      |average_salary|
      +--------------+
      |          7500|
      +--------------+
     */&lt;/span&gt;
      &lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;result2&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;spark&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;sql&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"SELECT multiply_average(salary, 3) as average_salary FROM employees"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
      &lt;span class="nv"&gt;result2&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;show&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
      &lt;span class="cm"&gt;/*
      +--------------+
      |average_salary|
      +--------------+
      |         11250|
      +--------------+
      */&lt;/span&gt;
  &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Here, &lt;code&gt;nExpression&lt;/code&gt; represents our &lt;code&gt;n&lt;/code&gt; argument. Other lines are self-explanatory. &lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>spark</category>
      <category>apache</category>
      <category>scala</category>
      <category>analytics</category>
    </item>
    <item>
      <title>Spark Catalyst Optimizer and spark Expression basics</title>
      <dc:creator>shivamanipatil</dc:creator>
      <pubDate>Mon, 28 Feb 2022 13:19:10 +0000</pubDate>
      <link>https://dev.to/q1ra/spark-catalyst-optimizer-and-spark-expression-basics-1cd3</link>
      <guid>https://dev.to/q1ra/spark-catalyst-optimizer-and-spark-expression-basics-1cd3</guid>
      <description>&lt;h2&gt;
  
  
  Table of contents
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Overview&lt;/li&gt;
&lt;li&gt;Trees&lt;/li&gt;
&lt;li&gt;Rules&lt;/li&gt;
&lt;li&gt;Expression&lt;/li&gt;
&lt;li&gt;CodegenFallback&lt;/li&gt;
&lt;li&gt;Example of spark native function using Unary expression&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Spark Catalyst Overview
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Core of Spark dataframe API and SQL queries.&lt;/li&gt;
&lt;li&gt;Supports cost based and rule based optimization.&lt;/li&gt;
&lt;li&gt;Built to be extensible : 

&lt;ul&gt;
&lt;li&gt;Adding new optimization techniques and features&lt;/li&gt;
&lt;li&gt;Extending the optimizier for custom use cases&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;At core it uses trees&lt;/li&gt;
&lt;li&gt;On top of it various libraries are written for query processing, optimization and execution.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Trees
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Trees in Catalyst consists of node objects.&lt;/li&gt;
&lt;li&gt;Node - type and zero/more children&lt;/li&gt;
&lt;li&gt;E.g If Literal(v: Int), Attribute(name: String), Add(l: TreeNode, r: TreeNode) are simple node types then x+(2+5) can be represented as Add(Attribute(x), Add(Literal(2), Literal(5))).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--d38iBlp4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bt2j5rxtrqtn6fgwhh7v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--d38iBlp4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bt2j5rxtrqtn6fgwhh7v.png" alt="Catalyst Tree" width="284" height="171"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Rules
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Function from a tree to another tree i.e modifying the tree.&lt;/li&gt;
&lt;li&gt;Replacing a pattern matched subtree with transformation. e.g Add(Literal(2), Literal(5)) =&amp;gt; Literal(7)&lt;/li&gt;
&lt;li&gt;Transform method provided with catalyst tree 
e.g Recursively updates substructures to combine Literals
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;&lt;span class="nv"&gt;tree&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;transform&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="nc"&gt;Add&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Literal&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c1&lt;/span&gt;&lt;span class="o"&gt;),&lt;/span&gt; &lt;span class="nc"&gt;Literal&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c2&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nc"&gt;Literal&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c1&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;c2&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="nc"&gt;So&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;7&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Expression
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Executable node in catalyst tree. Take inputs and evaluates them.&lt;/li&gt;
&lt;li&gt;Can generate java source from which can be used for evaluation (docodegen()).&lt;/li&gt;
&lt;li&gt;Should be deterministic. Like pure functions.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;&lt;span class="n"&gt;scala&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;org.apache.spark.sql.catalyst.expressions.Expression&lt;/span&gt;
&lt;span class="n"&gt;scala&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;org.apache.spark.sql.catalyst.expressions.&lt;/span&gt;&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="nc"&gt;Literal&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Add&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// Expression and eval&lt;/span&gt;
&lt;span class="n"&gt;scala&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;e&lt;/span&gt; &lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Add&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;Literal&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="err"&gt;3&lt;/span&gt;&lt;span class="o"&gt;),&lt;/span&gt; &lt;span class="kt"&gt;Literal&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="err"&gt;4&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;scala&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nv"&gt;e&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;eval&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;res0&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Any&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;7&lt;/span&gt;

&lt;span class="c1"&gt;// Deterministic?&lt;/span&gt;
&lt;span class="n"&gt;scala&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nv"&gt;e&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;deterministic&lt;/span&gt;
&lt;span class="n"&gt;res1&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Boolean&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Type of Expression&lt;/th&gt;
&lt;th&gt;Kind&lt;/th&gt;
&lt;th&gt;Use&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;BinaryExpression&lt;/td&gt;
&lt;td&gt;Abstract class&lt;/td&gt;
&lt;td&gt;2 children&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CodegenFallback&lt;/td&gt;
&lt;td&gt;trait&lt;/td&gt;
&lt;td&gt;Interpreted mode, no code generation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;UnaryExpression&lt;/td&gt;
&lt;td&gt;Abstract Class&lt;/td&gt;
&lt;td&gt;1 child&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LeafExpression&lt;/td&gt;
&lt;td&gt;abstract class&lt;/td&gt;
&lt;td&gt;No children&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Unevaluable&lt;/td&gt;
&lt;td&gt;trait&lt;/td&gt;
&lt;td&gt;Cannot be evaluated to produce a value (neither in interpreted nor code-generated expression evaluations), e.g AggregateExpression&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;ul&gt;
&lt;li&gt;Expression contract :
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;&lt;span class="k"&gt;package&lt;/span&gt; &lt;span class="nn"&gt;org.apache.spark.sql.catalyst.expressions&lt;/span&gt;

    &lt;span class="c1"&gt;// only required methods that have no implementation&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;dataType&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;DataType&lt;/span&gt; &lt;span class="c1"&gt;// Data type of the result of evaluating an expression&lt;/span&gt;

    &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nc"&gt;The&lt;/span&gt; &lt;span class="n"&gt;default&lt;/span&gt; &lt;span class="n"&gt;behavior&lt;/span&gt; &lt;span class="n"&gt;is&lt;/span&gt; &lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="n"&gt;call&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;eval&lt;/span&gt; &lt;span class="n"&gt;method&lt;/span&gt; &lt;span class="n"&gt;of&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;expression&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt; &lt;span class="nc"&gt;Concrete&lt;/span&gt; &lt;span class="n"&gt;expression&lt;/span&gt;
    &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;implementations&lt;/span&gt; &lt;span class="n"&gt;should&lt;/span&gt; &lt;span class="k"&gt;override&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt; &lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt; &lt;span class="n"&gt;actual&lt;/span&gt; &lt;span class="n"&gt;code&lt;/span&gt; &lt;span class="n"&gt;generation&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;doGenCode&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;CodegenContext&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ev&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;ExprCode&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;ExprCode&lt;/span&gt;

    &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nc"&gt;Interpreted&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;non&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;code&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;generated&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="n"&gt;expression&lt;/span&gt; &lt;span class="n"&gt;evaluation&lt;/span&gt;
    &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nc"&gt;Slower&lt;/span&gt; &lt;span class="n"&gt;than&lt;/span&gt; &lt;span class="n"&gt;generated&lt;/span&gt; &lt;span class="nf"&gt;code&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="ss"&gt;'relative&lt;/span&gt;&lt;span class="o"&gt;?')&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;eval&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;InternalRow&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;EmptyRow&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Any&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;nullable&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Boolean&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  CodegenFallback
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Trait derived from Expression which allows expressions to not support java code generation and go full interpreted mode.&lt;/li&gt;
&lt;li&gt;e.g
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;&lt;span class="k"&gt;trait&lt;/span&gt; &lt;span class="nc"&gt;NoCodegenExp&lt;/span&gt; &lt;span class="k"&gt;extends&lt;/span&gt; &lt;span class="nc"&gt;UnaryExpression&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;CodegenFallback&lt;/span&gt; &lt;span class="o"&gt;{}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Example of spark native function using Unary expression
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Here we will write a native function using Codegen and CodegenFallback.&lt;/li&gt;
&lt;li&gt;Codegen example :
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;org.apache.spark.sql.catalyst.expressions.&lt;/span&gt;&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="nc"&gt;Expression&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;ImplicitCastInputTypes&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;UnaryExpression&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;org.apache.spark.sql.catalyst.util.DateTimeUtils&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;org.apache.spark.sql.types.&lt;/span&gt;&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="nc"&gt;DataType&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;DateType&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;org.apache.spark.unsafe.types.UTF8String&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;org.apache.spark.sql.catalyst.expressions.codegen.&lt;/span&gt;&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="nc"&gt;CodegenContext&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;ExprCode&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// Returns beginning of month date for a date&lt;/span&gt;

&lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;BeginningOfMonth&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;startDate&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Expression&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;

  &lt;span class="k"&gt;extends&lt;/span&gt; &lt;span class="nc"&gt;UnaryExpression&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;ImplicitCastInputTypes&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;override&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;child&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Expression&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;startDate&lt;/span&gt;

  &lt;span class="k"&gt;override&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;inputTypes&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Seq&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;DataType&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Seq&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;DateType&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;

  &lt;span class="k"&gt;override&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;dataType&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;DataType&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;DateType&lt;/span&gt;

  &lt;span class="c1"&gt;// .eval calls nullSafeEval if input is non-null else it returns null&lt;/span&gt;
  &lt;span class="k"&gt;override&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;nullSafeEval&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;date&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Any&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Any&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;level&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;DateTimeUtils&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;parseTruncLevel&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;UTF8String&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;fromString&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"MONTH"&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt;
    &lt;span class="nv"&gt;DateTimeUtils&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;truncDate&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;date&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;asInstanceOf&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;Int&lt;/span&gt;&lt;span class="o"&gt;],&lt;/span&gt; &lt;span class="n"&gt;level&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
  &lt;span class="o"&gt;}&lt;/span&gt;

  &lt;span class="k"&gt;override&lt;/span&gt; &lt;span class="k"&gt;protected&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;doGenCode&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;CodegenContext&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ev&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;ExprCode&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;ExprCode&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;level&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;DateTimeUtils&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;parseTruncLevel&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;UTF8String&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;fromString&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"MONTH"&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;dtu&lt;/span&gt;   &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;DateTimeUtils&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;getClass&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;getName&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;stripSuffix&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"$"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;defineCodeGen&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ev&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sd&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="s"&gt;"$dtu.parseTruncLevel($sd, $level)"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
  &lt;span class="o"&gt;}&lt;/span&gt;

  &lt;span class="k"&gt;override&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;prettyName&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;String&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"beginning_of_month"&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;CodegenFallback example :
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;org.apache.spark.sql.catalyst.expressions.&lt;/span&gt;&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="nc"&gt;Expression&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;ImplicitCastInputTypes&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;UnaryExpression&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;org.apache.spark.sql.catalyst.expressions.codegen.CodegenFallback&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;org.apache.spark.sql.catalyst.util.DateTimeUtils&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;org.apache.spark.sql.types.&lt;/span&gt;&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="nc"&gt;DataType&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;DateType&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;org.apache.spark.unsafe.types.UTF8String&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;org.apache.spark.sql.catalyst.expressions.codegen.&lt;/span&gt;&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="nc"&gt;CodegenContext&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;ExprCode&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// Returns beginning of month date for a date&lt;/span&gt;
&lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;BeginningOfMonth&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;startDate&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Expression&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;

  &lt;span class="k"&gt;extends&lt;/span&gt; &lt;span class="nc"&gt;UnaryExpression&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;ImplicitCastInputTypes&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;CodegenFallback&lt;/span&gt;&lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;override&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;child&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Expression&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;startDate&lt;/span&gt;

  &lt;span class="k"&gt;override&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;inputTypes&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Seq&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;DataType&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Seq&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;DateType&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;

  &lt;span class="k"&gt;override&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;dataType&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;DataType&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;DateType&lt;/span&gt;

  &lt;span class="c1"&gt;// .eval calls nullSafeEval if input is non-null else it returns null&lt;/span&gt;
  &lt;span class="k"&gt;override&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;nullSafeEval&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;date&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Any&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Any&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;level&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;DateTimeUtils&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;parseTruncLevel&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;UTF8String&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;fromString&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"MONTH"&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt;
    &lt;span class="nv"&gt;DateTimeUtils&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;truncDate&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;date&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;asInstanceOf&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;Int&lt;/span&gt;&lt;span class="o"&gt;],&lt;/span&gt; &lt;span class="n"&gt;level&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
  &lt;span class="o"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;//  override protected def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {&lt;/span&gt;
&lt;span class="c1"&gt;//    val level = DateTimeUtils.parseTruncLevel(UTF8String.fromString("MONTH"))&lt;/span&gt;
&lt;span class="c1"&gt;//    val dtu   = DateTimeUtils.getClass.getName.stripSuffix("$")&lt;/span&gt;
&lt;span class="c1"&gt;//    defineCodeGen(ctx, ev, sd =&amp;gt; s"$dtu.parseTruncLevel($sd, $level)")&lt;/span&gt;
&lt;span class="c1"&gt;//  }&lt;/span&gt;

  &lt;span class="k"&gt;override&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;prettyName&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;String&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"beginning_of_month"&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;The can be registered and used in spark as :
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;  &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Array&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;String&lt;/span&gt;&lt;span class="o"&gt;])&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Unit&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="o"&gt;....&lt;/span&gt;
    &lt;span class="c1"&gt;// Register the function&lt;/span&gt;
    &lt;span class="k"&gt;object&lt;/span&gt; &lt;span class="nc"&gt;BegRegister&lt;/span&gt; &lt;span class="k"&gt;extends&lt;/span&gt; &lt;span class="nc"&gt;NativeFunctionRegistration&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;expressions&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Map&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;String&lt;/span&gt;, &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;ExpressionInfo&lt;/span&gt;, &lt;span class="kt"&gt;FunctionBuilder&lt;/span&gt;&lt;span class="o"&gt;)]&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Map&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;expression&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;BeginningOfMonth&lt;/span&gt;&lt;span class="o"&gt;](&lt;/span&gt;&lt;span class="s"&gt;"beg_m"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
        &lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
    &lt;span class="nv"&gt;BegRegister&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;registerFunctions&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;spark.implicits._&lt;/span&gt;
    &lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;df&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Seq&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
      &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;Date&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;valueOf&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"2020-01-15"&lt;/span&gt;&lt;span class="o"&gt;)),&lt;/span&gt;
      &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;Date&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;valueOf&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"2020-01-20"&lt;/span&gt;&lt;span class="o"&gt;)),&lt;/span&gt;
      &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="py"&gt;toDF&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"some_date"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="nv"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;show&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
    &lt;span class="nv"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;createTempView&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"dates"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;dfVal&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;spark&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;sql&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"SELECT beg_m(some_date) from dates"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="nv"&gt;dfVal&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;show&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
  &lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;In codegenfallback example CodegenFallback trait is used and doGenCode() method is not required as eval(or nullSafeEval) used instead.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;References :&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html"&gt;https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://jaceklaskowski.gitbooks.io/mastering-spark-sql"&gt;https://jaceklaskowski.gitbooks.io/mastering-spark-sql&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>spark</category>
      <category>apache</category>
      <category>scala</category>
      <category>analytics</category>
    </item>
    <item>
      <title>Integrating Elasticsearch 7 into Django project</title>
      <dc:creator>shivamanipatil</dc:creator>
      <pubDate>Thu, 30 Apr 2020 07:50:50 +0000</pubDate>
      <link>https://dev.to/q1ra/integrating-elasticsearch-7-into-django-project-d84</link>
      <guid>https://dev.to/q1ra/integrating-elasticsearch-7-into-django-project-d84</guid>
      <description>&lt;p&gt;Recently, I had to migrate a Django project from elasticsearch 2.x.x using haystack to latest at that time Elasticsearch 7. It is a Django project maintained by the ns3 team. I wrote a medium article for same &lt;a href="https://medium.com/analytics-vidhya/integrating-elasticsearch-7-to-django-project-c3812de78246"&gt;link&lt;/a&gt;. I hope this is useful for someone working with the same. Comments and suggestions are appreciated. &lt;/p&gt;

</description>
      <category>django</category>
      <category>python</category>
      <category>elasticsearch</category>
      <category>webdev</category>
    </item>
  </channel>
</rss>
