Derek

Extract Words from PDF using PHP

Source: Extract Text from PDF

Step1: Get and Access the License of PHP PDF API

 

ComPDFKit API provides users with 1,000 free PDF API requests. Follow the steps below to get your license and start making API requests.

  1. Register for ComPDFKit API to reach the dashboard, which shows your API keys, the progress of your API plan, and the status of your API requests.

  2. Create a project and get the Public Key and Secret Key.

After your account is created, a default project is created for you. You can create more projects to call ComPDFKit API. All supported PDF APIs are listed on the documentation pages.

Each project has its own Public Key and Secret Key. Remember to use the keys that belong to the project you are calling.

Step2: Authenticate with the PDF API for PDF Text Extraction

Replace publicKey and secretKey with your real keys to get the accessToken. Then use the accessToken to create a task, upload files, extract the PDF text, and get the extracted text as a JSON file.

PHP code example to authenticate with the ComPDFKit PDF text extraction API:

// Replace these placeholders with the keys of your project.
$publicKey = 'YOUR_PUBLIC_KEY';
$secretKey = 'YOUR_SECRET_KEY';

$params = [
    'publicKey' => $publicKey,
    'secretKey' => $secretKey
];
$headers = ['Content-Type: application/json'];
$curl = curl_init();
curl_setopt_array($curl, array(
    CURLOPT_URL => 'https://api-server.compdf.com/server/v1/oauth/token',
    CURLOPT_RETURNTRANSFER => true, // Return the body instead of printing it
    CURLOPT_ENCODING => '',
    CURLOPT_MAXREDIRS => 10,
    CURLOPT_TIMEOUT => 0, // 0 means no timeout
    CURLOPT_FOLLOWLOCATION => true,
    CURLOPT_HTTP_VERSION => CURL_HTTP_VERSION_1_1,
    CURLOPT_CUSTOMREQUEST => 'POST',
    CURLOPT_HTTPHEADER => $headers,
    CURLOPT_POSTFIELDS => json_encode($params)
));
$response = curl_exec($curl);
curl_close($curl);
// The access token is returned under data.accessToken.
$result = json_decode($response, true);
$accessToken = $result['data']['accessToken'];
// Later requests send the token in a Bearer Authorization header.
$bearerToken = "Bearer $accessToken";
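The authentication example reads `$result['data']['accessToken']` directly, so a failed request or a non-JSON body would go unnoticed until a later step. As a minimal sketch, a small guard could check the response first. The function name `extractAccessToken` is hypothetical and not part of ComPDFKit; only the `data.accessToken` path is taken from the example above.

```php
<?php
// Hypothetical helper (not a ComPDFKit function): validates the token
// response before the token is used. Only the data.accessToken path comes
// from the article's example.
function extractAccessToken($response): string
{
    // curl_exec() returns false when the transfer itself fails.
    if (!is_string($response) || $response === '') {
        throw new RuntimeException('Empty or failed HTTP response');
    }
    $result = json_decode($response, true);
    if (!is_array($result)) {
        throw new RuntimeException('Response is not valid JSON');
    }
    $token = $result['data']['accessToken'] ?? null;
    if (!is_string($token) || $token === '') {
        throw new RuntimeException('No accessToken in response');
    }
    return $token;
}
```

With this in place, the last lines of the example could become `$bearerToken = 'Bearer ' . extractAccessToken($response);`.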

Step3: Create Task - Extract PDF Text

Replace accessToken with the token obtained in the previous step. Set language to choose the language of error messages (1: English, 2: Chinese). ComPDFKit PDF API parameters are described on the Quick Start --> Request Description page.

The response data contains the taskId. PHP code example to create a PDF text extraction task:

$language = 1; // Language of error messages: 1 = English, 2 = Chinese
$headers = [
    'Content-Type: application/json',
    'Authorization: ' . $bearerToken
];
$curl = curl_init();
curl_setopt_array($curl, array(
    CURLOPT_URL => 'https://api-server.compdf.com/server/v1/task/pdf/json?language=' . $language,
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_ENCODING => '',
    CURLOPT_MAXREDIRS => 10,
    CURLOPT_TIMEOUT => 0,
    CURLOPT_FOLLOWLOCATION => true,
    CURLOPT_HTTP_VERSION => CURL_HTTP_VERSION_1_1,
    CURLOPT_CUSTOMREQUEST => 'GET',
    CURLOPT_HTTPHEADER => $headers,
));
$response = curl_exec($curl);
curl_close($curl);
// The task ID is returned under data.taskId and is needed in every later step.
$result = json_decode($response, true);
$taskId = $result['data']['taskId'];
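The request URLs in this tutorial are built by string concatenation, which does not escape query values. A possible alternative, shown here only as a sketch, is to build them with PHP's `http_build_query`. The constant `COMPDF_BASE_URL` and the function `buildApiUrl` are hypothetical names; the base URL and paths are the ones used in the examples.

```php
<?php
// Base URL as it appears in the article's examples.
const COMPDF_BASE_URL = 'https://api-server.compdf.com/server/v1';

// Hypothetical helper: joins a path and URL-encoded query parameters
// onto the base URL.
function buildApiUrl(string $path, array $query = []): string
{
    $url = COMPDF_BASE_URL . '/' . ltrim($path, '/');
    if ($query !== []) {
        // Pass '&' explicitly so the result does not depend on php.ini.
        $url .= '?' . http_build_query($query, '', '&');
    }
    return $url;
}
```

For the task creation request, `buildApiUrl('task/pdf/json', ['language' => $language])` produces the same URL as the concatenation above.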

 

Step4: Upload Files for PDF Parser

Replace the following values in the PHP code:

  • PDF file: The PDF you want to extract text from.
  • taskId: Obtained in the task creation step.
  • language: The language in which error information is displayed.
  • accessToken: Obtained in the authentication step.

ComPDFKit API also provides AI and OCR features. You can set the following parameters in this step:

  • type: Which content to extract (0: text, 1: table). Default 0.
  • isAllowOcr: Whether to enable OCR (1: yes, 0: no). Default 0.
  • isOnlyAiTable: Whether to use AI to recognize tables (1: yes, 0: no). Default 0.
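To keep these options in one place, the JSON string for the `parameter` field could be built by a small function. This is only a sketch: `buildExtractParameter` is a hypothetical name, and it assumes the API accepts the three options listed above as integers, as the upload example does.

```php
<?php
// Hypothetical helper: builds the JSON value of the 'parameter' form field
// from the three options listed above, using their documented defaults.
function buildExtractParameter(int $type = 0, bool $allowOcr = false, bool $onlyAiTable = false): string
{
    if ($type !== 0 && $type !== 1) {
        throw new InvalidArgumentException('type must be 0 (text) or 1 (table)');
    }
    return json_encode([
        'type' => $type,                          // 0: text, 1: table
        'isAllowOcr' => $allowOcr ? 1 : 0,        // 1: yes, 0: no
        'isOnlyAiTable' => $onlyAiTable ? 1 : 0,  // 1: yes, 0: no
    ]);
}
```

Calling `buildExtractParameter()` with no arguments requests plain text extraction without OCR, while `buildExtractParameter(1, true)` requests table extraction with OCR enabled.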

PHP code example to upload PDFs for parsing:
···
$params = [
    'taskId' => $taskId, // ID of the task created in Step3
    'file' => new CURLFile($pdfPath), // PDF file to process
    'language' => $language,
    'password' => '',
    // type: 0 = text, 1 = table
    'parameter' => json_encode(['type' => 0, 'isAllowOcr' => 1, 'isContainOcrBg' => 0])
];
// No Content-Type header here: cURL sets multipart/form-data itself
// when CURLOPT_POSTFIELDS is an array.
$headers = [
    'Authorization: ' . $bearerToken
];
$curl = curl_init();
curl_setopt_array($curl, array(
    CURLOPT_URL => 'https://api-server.compdf.com/server/v1/file/upload',
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_ENCODING => '',
    CURLOPT_MAXREDIRS => 10,
    CURLOPT_TIMEOUT => 0,
    CURLOPT_FOLLOWLOCATION => true,
    CURLOPT_HTTP_VERSION => CURL_HTTP_VERSION_1_1,
    CURLOPT_CUSTOMREQUEST => 'POST',
    CURLOPT_HTTPHEADER => $headers,
    CURLOPT_POSTFIELDS => $params
));
$response = curl_exec($curl);
curl_close($curl);
$result = json_decode($response, true);
$fileKey = $result['data']['fileKey'];
···
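Uploading a file that is not a PDF spends one of the free requests on a guaranteed failure, so it may be worth checking the file locally before building the `CURLFile`. The sketch below relies on the fact that a well-formed PDF begins with the bytes `%PDF-`. The function name `looksLikePdf` is hypothetical, and the check is deliberately strict: some PDF readers tolerate a few bytes before the header, which this version rejects.

```php
<?php
// Hypothetical pre-upload check: returns true only if the file exists,
// is readable, and starts with the PDF header bytes "%PDF-".
function looksLikePdf(string $path): bool
{
    if (!is_file($path) || !is_readable($path)) {
        return false;
    }
    $handle = fopen($path, 'rb');
    if ($handle === false) {
        return false;
    }
    $header = fread($handle, 5); // Read only the first five bytes
    fclose($handle);
    return $header === '%PDF-';
}
```

A caller could then skip the upload, or report an error, when `looksLikePdf($pdfPath)` returns false.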

Step5: Process and Extract Text From Uploaded PDF Files

Execute the task to extract words from the PDF you uploaded. Here is the PHP code example:

···
$headers = [
    'Content-Type: application/json',
    'Authorization: ' . $bearerToken
];
$curl = curl_init();
curl_setopt_array($curl, array(
    CURLOPT_URL => 'https://api-server.compdf.com/server/v1/execute/start?language=' . $language . '&taskId=' . $taskId,
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_ENCODING => '',
    CURLOPT_MAXREDIRS => 10,
    CURLOPT_TIMEOUT => 0,
    CURLOPT_FOLLOWLOCATION => true,
    CURLOPT_HTTP_VERSION => CURL_HTTP_VERSION_1_1,
    CURLOPT_CUSTOMREQUEST => 'GET',
    CURLOPT_HTTPHEADER => $headers,
));
$response = curl_exec($curl);
curl_close($curl);
···
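Every request in this tutorial repeats the same block of cURL options. One way to cut the duplication, sketched here under assumptions, is a function that returns the shared options array for `curl_setopt_array`. The name `compdfCurlOptions` is hypothetical, and the 60 second timeout is an assumption of this sketch; the original examples use `0`, which disables the timeout.

```php
<?php
// Hypothetical helper: returns the cURL options that every request in the
// article shares. Requires the PHP curl extension for the CURLOPT_* constants.
function compdfCurlOptions(string $method, string $url, array $headers, $body = null): array
{
    $options = [
        CURLOPT_URL => $url,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_ENCODING => '',
        CURLOPT_MAXREDIRS => 10,
        CURLOPT_TIMEOUT => 60, // Assumption: the article uses 0 (no timeout)
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_HTTP_VERSION => CURL_HTTP_VERSION_1_1,
        CURLOPT_CUSTOMREQUEST => strtoupper($method),
        CURLOPT_HTTPHEADER => $headers,
    ];
    // Only requests with a body (token and upload) set CURLOPT_POSTFIELDS.
    if ($body !== null) {
        $options[CURLOPT_POSTFIELDS] = $body;
    }
    return $options;
}
```

The execute step would then reduce to `curl_setopt_array($curl, compdfCurlOptions('GET', $url, $headers));` followed by `curl_exec` and `curl_close` as before.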

Step6: Get Task Information of PDF Text Extraction

Follow the PHP code example below to obtain the task information. Replace taskId and accessToken with your own values. The extracted result is delivered as a JSON file, a structured format that makes the extracted PDF text easy to reuse.

···
$headers = [
    'Content-Type: application/json',
    'Authorization: ' . $bearerToken
];

$curl = curl_init();
curl_setopt_array($curl, array(
    CURLOPT_URL => 'https://api-server.compdf.com/server/v1/task/taskInfo' . '?taskId=' . $taskId,
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_ENCODING => '',
    CURLOPT_MAXREDIRS => 10,
    CURLOPT_TIMEOUT => 0,
    CURLOPT_FOLLOWLOCATION => true,
    CURLOPT_HTTP_VERSION => CURL_HTTP_VERSION_1_1,
    CURLOPT_CUSTOMREQUEST => 'GET',
    CURLOPT_HTTPHEADER => $headers,
));
$response = curl_exec($curl);
curl_close($curl);
$result = json_decode($response, true);
···
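If the task has not finished when the task information is first requested, the request may need to be repeated. The sketch below shows one generic way to poll. It assumes the task runs asynchronously, which the article does not state, and because the article does not show which field of the taskInfo response reports completion, that check is passed in as a callback. The name `waitForTask` and its parameters are hypothetical.

```php
<?php
// Hypothetical polling helper. $fetchTaskInfo performs the taskInfo request
// and returns the decoded array; $isFinished decides whether the task is done.
function waitForTask(callable $fetchTaskInfo, callable $isFinished, int $maxAttempts = 10, int $delaySeconds = 2): array
{
    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        $info = $fetchTaskInfo();
        if (is_array($info) && $isFinished($info)) {
            return $info;
        }
        // Wait before the next attempt, but not after the last one.
        if ($attempt < $maxAttempts && $delaySeconds > 0) {
            sleep($delaySeconds);
        }
    }
    throw new RuntimeException("Task not finished after $maxAttempts attempts");
}
```

Here `$fetchTaskInfo` would wrap the request shown above and return `$result`, and `$isFinished` would inspect whatever status field the ComPDFKit documentation defines for the taskInfo response.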
