<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Mixpeek</title>
    <description>The latest articles on DEV Community by Mixpeek (@mixpeek-2).</description>
    <link>https://dev.to/mixpeek-2</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F781075%2F1f308c16-07b8-47c1-8865-71fdd8f8bf85.png</url>
      <title>DEV Community: Mixpeek</title>
      <link>https://dev.to/mixpeek-2</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/mixpeek-2"/>
    <language>en</language>
    <item>
      <title>Search text from PDF files stored in an S3 bucket</title>
      <dc:creator>Mixpeek</dc:creator>
      <pubDate>Wed, 27 Jul 2022 23:41:00 +0000</pubDate>
      <link>https://dev.to/mixpeek-2/search-text-from-pdf-files-stored-in-an-s3-bucket-2084</link>
      <guid>https://dev.to/mixpeek-2/search-text-from-pdf-files-stored-in-an-s3-bucket-2084</guid>
      <description>&lt;p&gt;Does your application allow users to upload PDFs? Maybe they upload resumes, waivers, agreements or signed documents. What if they need to search the &lt;em&gt;contents&lt;/em&gt; of these PDFs?&lt;/p&gt;

&lt;h2&gt;
  
  
  As a developer, you have 3 options:
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Search by Filename&lt;/strong&gt;: Look up objects by key, such as the filename &lt;em&gt;[Native]&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Search by Metadata&lt;/strong&gt;: Store the metadata in a separate database to perform queries &lt;em&gt;[Database add-on]&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Full-Text Search&lt;/strong&gt;: Extract the contents into a search engine &lt;em&gt;[OCR, Database, Search add-on]&lt;/em&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;blockquote&gt;
&lt;p&gt;Full Text Search provides the most intuitive user experience, but it’s also the most challenging to build, maintain, and enhance.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjpb9k5g56sywc1c95vzm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjpb9k5g56sywc1c95vzm.png" alt="data diagram" width="561" height="111"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this tutorial, we’ll walk you through best practices for PDF file upload, content extraction via OCR (&lt;a href="https://en.wikipedia.org/wiki/Optical_character_recognition" rel="noopener noreferrer"&gt;Optical Character Recognition&lt;/a&gt;), and searching, so you can easily add full-text PDF search to your application.&lt;/p&gt;

&lt;p&gt;Bonus: At the end, you’ll find a &lt;em&gt;GitHub repository&lt;/em&gt; so you can import the code directly into your application.&lt;/p&gt;

&lt;h2&gt;
  
  
  Store the file
&lt;/h2&gt;

&lt;p&gt;First, we need to download the file from S3 to local disk so we can run our OCR extraction logic on it:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
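&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import boto3

# Replace the placeholder credentials, bucket, and key with your own values
s3_client = boto3.client(
    's3',
    aws_access_key_id='aws_access_key_id',
    aws_secret_access_key='aws_secret_access_key',
    region_name='region_name'
)

bucket_name = 'bucket_name'
s3_file_name = 'prescription.pdf'

# Download the object into a local file with the same name
with open(s3_file_name, 'wb') as file:
    s3_client.download_fileobj(
        bucket_name,
        s3_file_name,
        file
    )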
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
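
&lt;p&gt;S3 keys often carry prefixes such as &lt;em&gt;uploads/2022/resume.pdf&lt;/em&gt;, and opening that path directly for writing fails when the local folders don’t exist. Here is a minimal sketch of one way around this. It assumes a hypothetical helper named download_pdf (not part of boto3) that accepts any client exposing boto3’s download_fileobj method:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import os
import tempfile


def download_pdf(s3_client, bucket_name, s3_key, dest_dir=None):
    """Download one S3 object into dest_dir and return the local path.

    download_pdf is a hypothetical helper, not part of boto3.
    """
    # Fall back to a fresh temporary directory when none is given
    dest_dir = dest_dir or tempfile.mkdtemp(prefix="pdf-search-")
    # Keep only the last path segment so key prefixes need no local folders
    local_path = os.path.join(dest_dir, os.path.basename(s3_key))
    with open(local_path, "wb") as file:
        s3_client.download_fileobj(bucket_name, s3_key, file)
    return local_path
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Calling download_pdf(s3_client, bucket_name, s3_file_name) returns a local path you can hand to the extraction step. Note that two keys with the same final segment would collide in one folder, so use a separate dest_dir per upload if that matters for you.&lt;/p&gt;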
&lt;h2&gt;
  
  
  Extract the contents
&lt;/h2&gt;

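&lt;p&gt;We’ll use the open-source Apache Tika library. Its &lt;a href="https://tika.apache.org/1.9/api/org/apache/tika/parser/AutoDetectParser.html" rel="noopener noreferrer"&gt;AutoDetectParser&lt;/a&gt; class detects the file type and extracts the text, and for scanned pages it can hand off to an OCR (optical character recognition) engine such as Tesseract when one is installed. The tika Python package talks to a Tika server under the hood, so Java needs to be available:&lt;/p&gt;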
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from tika import parser
parsed_pdf_content = parser.from_file(s3_file_name)['content']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
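
&lt;p&gt;The content value can come back as None (for example, for an image-only PDF when no OCR engine is available), and extracted text usually carries a lot of layout whitespace. Here is a small sketch of a cleanup step. It assumes a hypothetical helper named clean_content that takes the dictionary returned by parser.from_file:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import re


def clean_content(parsed):
    """Return whitespace-normalized text from a Tika result dict, or ''.

    clean_content is a hypothetical helper, not part of the tika package.
    """
    # Treat a missing result or a None content value as empty text
    content = (parsed or {}).get("content") or ""
    # Collapse runs of spaces, tabs, and newlines into single spaces
    return re.sub(r"\s+", " ", content).strip()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;For example, parsed_pdf_content = clean_content(parser.from_file(s3_file_name)) gives you a plain string that is always safe to index, even when nothing was extracted.&lt;/p&gt;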
&lt;h2&gt;
  
  
  Insert contents into a search engine
&lt;/h2&gt;

&lt;p&gt;We’re using a self-managed OpenSearch node here, but you can use &lt;a href="https://lucene.apache.org/" rel="noopener noreferrer"&gt;Lucene&lt;/a&gt;, &lt;a href="https://solr.apache.org/" rel="noopener noreferrer"&gt;Solr&lt;/a&gt;, &lt;a href="https://www.elastic.co/" rel="noopener noreferrer"&gt;Elasticsearch&lt;/a&gt;, or &lt;a href="https://www.mongodb.com/atlas/search" rel="noopener noreferrer"&gt;Atlas Search&lt;/a&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Note: if you don’t have OpenSearch installed locally, install it first (the commands below use Homebrew), then run it:&lt;/p&gt;
&lt;/blockquote&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;brew update  
brew install opensearch  
opensearch
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;OpenSearch will now be accessible here: &lt;a href="http://localhost:9200/" rel="noopener noreferrer"&gt;http://localhost:9200&lt;/a&gt;. Let’s build the index and insert the file contents:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
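&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from opensearchpy import OpenSearch

# Named client rather than os to avoid shadowing Python's standard os module
client = OpenSearch("http://localhost:9200/")

index_name = "pdf-search"
doc = {
    "filename": s3_file_name,
    "parsed_pdf_content": parsed_pdf_content
}

# refresh=True makes the document searchable as soon as the call returns
response = client.index(
    index=index_name,
    body=doc,
    id=1,
    refresh=True
)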
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
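
&lt;p&gt;The hard-coded id=1 means every PDF you index overwrites the previous one. One possible fix is to derive the document ID from the filename, so re-indexing a file updates it while different files stay separate. This is a sketch under that assumption; build_index_request is a hypothetical helper, not part of opensearch-py:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import hashlib


def build_index_request(index_name, filename, content):
    """Build keyword arguments for an index call, one document per filename.

    build_index_request is a hypothetical helper, not part of opensearch-py.
    """
    # Hash the filename so the same file always maps to the same document id
    doc_id = hashlib.sha1(filename.encode("utf-8")).hexdigest()
    return {
        "index": index_name,
        "id": doc_id,
        "body": {"filename": filename, "parsed_pdf_content": content},
        "refresh": True,
    }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Pass the result to your OpenSearch client’s index method with ** unpacking, for example index(**build_index_request(index_name, s3_file_name, parsed_pdf_content)).&lt;/p&gt;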
&lt;h2&gt;
  
  
  Creating a PDF search API
&lt;/h2&gt;

&lt;p&gt;We’ll use Flask to create a microservice that searches terms:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from flask import Flask, jsonify, request
from opensearchpy import OpenSearch
from config import *  # expected to define index_name

app = Flask(__name__)
client = OpenSearch("http://localhost:9200/")


@app.route('/search', methods=['GET'])
def search_file():
    # Search term from the ?q= query parameter
    query = request.args.get('q', default=None, type=str)

    # Query payload sent to OpenSearch
    payload = {
        'query': {
            'match': {
                'parsed_pdf_content': query
            }
        }
    }

    response = client.search(
        body=payload,
        index=index_name
    )

    return jsonify(response)


if __name__ == '__main__':
    app.run(host="localhost", port=5011, debug=True)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
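
&lt;p&gt;When a request arrives without a q parameter, the route above passes None into the match query. Here is a minimal sketch of a guard for that case. It assumes a hypothetical helper named parse_search_term that you would call before building the payload:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def parse_search_term(raw):
    """Return a trimmed search term, or None when there is nothing to search.

    parse_search_term is a hypothetical helper, not part of Flask.
    """
    if raw is None:
        return None
    term = raw.strip()
    # An empty or whitespace-only value counts as missing
    return term or None
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Inside search_file, you could call parse_search_term(request.args.get('q')) and return a 400 response with a short error message when it gives back None, rather than querying the search engine.&lt;/p&gt;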

&lt;p&gt;Now we can call the API via:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GET: http://localhost:5011/search?q=SEARCH_TERM
{  
      "_shards": {  
        "failed": 0,   
        "skipped": 0,   
        "successful": 1,   
        "total": 1  
      },   
      "hits": {  
        "hits": [  
          {  
            "_id": "1",   
            "_index": "pdf-search",   
            "_score": 0.29289162,   
            "_source": {  
              "filename": "prescription.pdf",   
              "parsed_pdf_content": "http://localhost:5011/search?q=SEARCH_TERM"  
            }  
          }  
        ],   
        "max_score": 0.29289162,   
        "total": {  
          "relation": "eq",   
          "value": 1  
        }  
      },   
      "timed_out": false,   
      "took": 40  
    }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
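
&lt;p&gt;Most callers only need the matching filenames and their relevance scores, not the full response. The sketch below pulls those out of a response shaped like the one above; summarize_hits is a hypothetical helper name:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def summarize_hits(response):
    """Return a list of (filename, score) pairs from a search response dict.

    summarize_hits is a hypothetical helper, not part of opensearch-py.
    """
    # The matching documents live under hits.hits in the response
    hits = response.get("hits", {}).get("hits", [])
    return [(hit["_source"]["filename"], hit["_score"]) for hit in hits]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;For the response shown above, this yields a single pair: prescription.pdf with a score of 0.29289162.&lt;/p&gt;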

&lt;p&gt;Whoo, we did it! We’ve successfully created an API that offers full-text PDF search.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fme78rv8wbmumpchn4h4z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fme78rv8wbmumpchn4h4z.png" alt="congrats" width="498" height="211"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can download the repo here: &lt;a href="https://github.com/mixpeek/pdf-search-s3" rel="noopener noreferrer"&gt;https://github.com/mixpeek/pdf-search-s3&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  So what’s next?
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Queuing&lt;/strong&gt;: Ensuring concurrent file uploads are not dropped&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security&lt;/strong&gt;: Adding end-to-end encryption to the data pipeline&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enhancements&lt;/strong&gt;: Including more features like fuzzy matching, highlighting, and autocomplete&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rate Limiting&lt;/strong&gt;: Building thresholds so users don’t abuse the system&lt;/li&gt;
&lt;/ul&gt;
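
&lt;p&gt;As a taste of the enhancements item, here is a sketch of a query payload that tolerates small typos and asks the search engine for highlighted snippets. It uses the standard fuzziness and highlight options of the query DSL; build_search_payload is a hypothetical helper name:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def build_search_payload(term, field="parsed_pdf_content"):
    """Build a match query with fuzzy matching and highlighted snippets.

    build_search_payload is a hypothetical helper, not part of opensearch-py.
    """
    return {
        "query": {
            "match": {
                # fuzziness AUTO allows a few character edits based on term length
                field: {"query": term, "fuzziness": "AUTO"}
            }
        },
        # Ask for matching snippets of the same field in each hit
        "highlight": {"fields": {field: {}}},
    }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;You can pass the result as the body of the search call in place of the plain match payload. Each hit then includes a highlight section next to _source.&lt;/p&gt;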

&lt;h2&gt;
  
  
  Everything collapsed into just 2 API calls
&lt;/h2&gt;

&lt;p&gt;If this feels like too much for you to build, maintain, and enhance, &lt;a href="https://mixpeek.com/" rel="noopener noreferrer"&gt;Mixpeek&lt;/a&gt; has you covered.&lt;/p&gt;

&lt;h3&gt;
  
  
  Upload
&lt;/h3&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import requests

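url = "https://api.mixpeek.com/upload"

# Open the PDF in binary mode and send it as a multipart form upload
with open('FILE_NAME.pdf', 'rb') as pdf_file:
    files = [
        ('file', ('FILE_NAME.pdf', pdf_file, 'application/pdf'))
    ]
    response = requests.request("POST", url, files=files)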
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  Search
&lt;/h3&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import requests
url = "https://api.mixpeek.com/search?q=SEARCH_QUERY"
response = requests.request("GET", url)
print(response.text)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
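
&lt;p&gt;Search terms that contain spaces or other reserved characters need to be encoded before they go into the query string. Here is a small sketch that uses only the standard library; build_search_url is a hypothetical helper name, and the base URL is the one from the example above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from urllib.parse import urlencode


def build_search_url(term, base_url="https://api.mixpeek.com/search"):
    """Return the search URL with the term encoded as the q parameter.

    build_search_url is a hypothetical helper for illustration only.
    """
    # urlencode escapes reserved characters and turns spaces into plus signs
    return base_url + "?" + urlencode({"q": term})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The same helper works for the local Flask service if you pass base_url="http://localhost:5011/search".&lt;/p&gt;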

&lt;p&gt;A corresponding &lt;a href="https://www.getpostman.com/collections/6889b6991a8a0f7b774c" rel="noopener noreferrer"&gt;Postman Collection&lt;/a&gt; is available for your convenience.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dash.mixpeek.com/" rel="noopener noreferrer"&gt;Request an API key for free&lt;/a&gt;, and &lt;a href="https://docs.mixpeek.com/#intro" rel="noopener noreferrer"&gt;review the docs&lt;/a&gt; to get started.&lt;/p&gt;

</description>
      <category>opensearch</category>
      <category>pdfsearch</category>
      <category>tika</category>
      <category>search</category>
    </item>
  </channel>
</rss>
