DEV Community

Intelligent Convert
Intelligent Convert

Posted on

How to Automate Extracting Documents from MS Access OLE Object Columns

Extracting documents (like Word, Excel, PDF, or images) that are stored inside MS Access OLE Object fields can be tricky because Access does not store the raw file directly. Instead, it wraps the file in an OLE header that depends on the application used to insert the object (Word, Excel, Paint, etc.). That means you usually have to strip off the OLE wrapper before you can recover the original file.

Here's a detailed guide with options depending on the approach you want to take.

Understanding OLE Storage in Access

When you insert a file into an OLE Object field, Access doesn't just store the file. Instead, it stores:

  • OLE Header - metadata about the embedding application and object type.
  • Raw File Data - the actual content of the file (but wrapped).

So simply reading the binary field gives you a blob with extra bytes at the beginning.

Using VBA inside Access

VBA (Visual Basic for Applications) is a programming language developed by Microsoft that's built into Office applications like Excel, Access, and Word. Therefore, this is the first and most obvious choice to automate extracting documents from MS Access OLE Objects. Use this script to extract the files from within Access.

This will export the binary OLE field as-is. You'll then need to manually strip off the OLE header (see "Stripping the OLE Header" section below).

Using .NET (C# or VB.NET)

VB.NET (Visual Basic .NET) is an object-oriented programming language developed by Microsoft. It's part of the .NET framework and is used to build desktop, web, and mobile applications. Unlike VBA, which is limited to Office apps, VB.NET is a general-purpose language with modern features like inheritance, exception handling, and strong typing.

So, if you are extracting data programmatically outside MS Access, compose a code as follows using ODBC or OleDB connectivity to read the database:

using System;
using System.Data.OleDb;
using System.IO;

class ExtractOLE
{
  static void Main()
  {
    string connStr = @"Provider=Microsoft.ACE.OLEDB.12.0;Data Source=C:\db\mydb.accdb;";
    using (OleDbConnection conn = new OleDbConnection(connStr))
    {
      conn.Open();
      OleDbCommand cmd = new OleDbCommand("SELECT ID, OLEField FROM MyTable WHERE OLEField IS NOT NULL", conn);
      OleDbDataReader reader = cmd.ExecuteReader();

      while (reader.Read())
      {
        byte[] data = (byte[])reader["OLEField"];
        string filePath = $@"C:\ExportedFiles\Doc_{reader["ID"]}.bin";
        File.WriteAllBytes(filePath, data);
        Console.WriteLine("Saved: " + filePath);
      }
    }
  }
} 

Enter fullscreen mode Exit fullscreen mode

This code saves the raw binary including the OLE header. You'll then need to manually strip off the OLE header as described below.

Stripping the OLE Header

The hard part is getting the real file out. The header size varies depending on how the file was inserted:

  • Bitmaps: OLE header usually 78 bytes.
  • Word, Excel, PDF: often larger, unpredictable.

Sometimes MS Access stores the full file name at the start.
In order to deal with it, look for the magic number (file signature) inside the binary blob. For example:

  • PDF → %PDF (25 50 44 46)
  • DOCX/ZIP → PK (50 4B 03 04)
  • PNG → 89 50 4E 47

Once found, strip everything before that offset and save the rest as a new file.

Example in Python (after exporting raw OLE):

def extract_real_file(ole_blob, out_path):
  signatures = {        
    b"%PDF": ".pdf",        
    b"PK\x03\x04": ".docx",
    b"\xD0\xCF\x11\xE0":".doc",  # old MS Office binary        
    b"\x89PNG": ".png"
  }    

  # Find file signature
  for sig, ext in signatures.items():
    idx = ole_blob.find(sig)
    if idx != -1:            
      with open (out_path + ext, "wb") as f:
        f.write(ole_blob[idx:])            
      print(f"Extracted {out_path}{ext}")            
      return    

  print("Unknown format")

# Usage after reading the binary
with open("Doc_1.bin", "rb") as f:
  blob = f.read()
extract_real_file(blob, "C:/ExportedFiles/Doc_1")
Enter fullscreen mode Exit fullscreen mode

Third-Party Tools

If you don't want to deal with OLE headers manually, consider using those tools for full automation of extracting documents from MS Access OLE Objects:

  • Access OLE Export - commercial tool ($95) to extract images and files from OLE Object fields of MS Access database
  • Access-to-MySQL - commercial tool ($79) to migrate MS Access data including OLE Objects to MySQL, MariaDB or Percona (both on premises and cloud platforms)
  • Access-to-PostgreSQL - commercial tool ($79) to migrate MS Access data including OLE Objects to PostgreSQL (both on premises and cloud platforms)

Conclusion

You can extract OLE objects by reading the binary data from MS Access database. What you get isn't a clean file, but a wrapped blob stream. To recover the real document, detect and strip the OLE header (search by magic numbers).

Automation possible in VBA, C#, Python, or any language that can handle binary data.

While it is entirely possible to extract documents from MS Access OLE Object fields programmatically using VBA, .NET, or Python, the process is often cumbersome due to the presence of unpredictable OLE headers. Manually stripping headers requires knowledge of file signatures and may involve trial and error if the database contains a mix of file types.
For organizations that need to extract large numbers of files reliably, third-party OLE extraction tools can save significant time and effort. These utilities are purpose-built to parse the OLE container correctly and export clean files in their original formats, often supporting batch operations and preserving file names. Although they may be commercial, such tools eliminate the need to write and debug custom header-stripping code, making them a practical choice when working under tight deadlines or dealing with non-technical staff.

Top comments (0)