<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Ritika Bhambri</title>
    <description>The latest articles on DEV Community by Ritika Bhambri (@ritika-bhambri).</description>
    <link>https://dev.to/ritika-bhambri</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3842589%2F54e95fc3-c214-4bef-87a1-db7d70c1e87a.png</url>
      <title>DEV Community: Ritika Bhambri</title>
      <link>https://dev.to/ritika-bhambri</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ritika-bhambri"/>
    <language>en</language>
    <item>
      <title>A Practical Guide to Getting Started with Outreachy: From Application to Contributions</title>
      <dc:creator>Ritika Bhambri</dc:creator>
      <pubDate>Mon, 13 Apr 2026 23:08:41 +0000</pubDate>
      <link>https://dev.to/ritika-bhambri/a-practical-guide-to-getting-started-with-outreachy-from-application-to-contributions-2kae</link>
      <guid>https://dev.to/ritika-bhambri/a-practical-guide-to-getting-started-with-outreachy-from-application-to-contributions-2kae</guid>
      <description>&lt;h1&gt;
  
  
  What is Outreachy
&lt;/h1&gt;

&lt;p&gt;Outreachy provides open source internships to people from any background who face underrepresentation, systemic bias, or discrimination in the technical industry where they live. It offers an online, collaborative learning environment and remote mentoring from experienced FOSS contributors.&lt;/p&gt;

&lt;h2&gt;
  
  
  What kinds of projects do Outreachy internships offer?
&lt;/h2&gt;

&lt;p&gt;A common misconception among participants is that Outreachy projects are strictly programming-based. This isn't true. Some Outreachy projects focus on non-programming work, which may involve design, documentation, user experience, marketing, or event planning.&lt;/p&gt;

&lt;p&gt;The Outreachy application is split into three phases: &lt;strong&gt;Initial Application Phase&lt;/strong&gt;, &lt;strong&gt;Contribution Phase&lt;/strong&gt; and &lt;strong&gt;Final Application Phase&lt;/strong&gt;. I will explain how to approach each of these phases in detail.&lt;/p&gt;

&lt;h1&gt;
  
  
  Initial Application Phase
&lt;/h1&gt;

&lt;p&gt;Outreachy runs twice every year. The mid-year applications tentatively open in early February, and the end-of-year applications open in late August. It is a good idea to keep an eye on Outreachy's official website and &lt;strong&gt;subscribe to their mailing list to get updates about future internships.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Eligibility
&lt;/h2&gt;

&lt;p&gt;Before anything else, it is important to check whether you are eligible to participate in an internship cohort. Outreachy has a wide set of rules to determine eligibility:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You must be 18 or older when your internship starts.&lt;/li&gt;
&lt;li&gt;You must be available for a full-time internship. Outreachy interns work 30 hours per week.&lt;/li&gt;
&lt;li&gt;You must not be a past Outreachy or Google Summer of Code intern.&lt;/li&gt;
&lt;li&gt;You must not have another paid/unpaid internship or a full-time job or a full-time contracting position during the Outreachy internship period.&lt;/li&gt;
&lt;li&gt;If you are a student at a university in the Northern Hemisphere, you will only be eligible for the May to August internship cohort. Students in India are considered to be in the Northern Hemisphere, regardless of where their university is located.&lt;/li&gt;
&lt;li&gt;If you are a student at a university in the Southern Hemisphere, you will only be eligible for the December to March internship cohort. Otherwise, if your university is near the equator, you may apply to any internship cohort. Outreachy reviews university term schedules on a case-by-case basis.&lt;/li&gt;
&lt;li&gt;Non-students can apply to either cohort.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These rules may change from time to time. Check the official website to stay up to date.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Remember that the initial phase usually lasts 7-14 days. Applications are accepted on a first-come, first-served basis, so do not wait until the last day to apply.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Initial Phase Essays
&lt;/h2&gt;

&lt;p&gt;The Initial Phase Application involves answering four essay questions. Please do not use generative AI to create or edit your essays. It is important that you write about your lived experiences. Using generative AI, even to polish your essays, can introduce inaccuracies and an inauthentic writing style. The mentors and Outreachy organizers read thousands of applications and are likely to notice if you have used AI for your essays, in which case your application may be cancelled.&lt;/p&gt;

&lt;h1&gt;
  
  
  The Waiting Period
&lt;/h1&gt;

&lt;p&gt;Now that you have submitted your initial application, you wait. If you are anything like me, you will probably be a bundle of nerves during this time, anxious about whether or not your application will be selected. But please, do not do this. Instead, use this time judiciously to hone your technical skills and see if you can get involved with an open source mentoring community.&lt;/p&gt;

&lt;h1&gt;
  
  
  Congratulations, your initial application is selected! The Contribution Period starts.
&lt;/h1&gt;

&lt;p&gt;The big day finally arrives. You receive an email confirming your selection for the contribution phase. Now what?&lt;/p&gt;

&lt;p&gt;You probably want to rush in and start making as many contributions as you can. But wait: do not start working before looking at the complete project list carefully. Narrow it down to 1-2 projects. Join the mentoring channels and introduce yourself. Install and use the project yourself. You may encounter some issues while setting up your project. Do not hesitate to ask questions in these channels. Your mentors will be around to help. Your fellow contributors can help you too, and once you become well versed in everything, you can offer help to others based on what you have learned. After all, this is the spirit of open source: giving back to the community.&lt;/p&gt;

&lt;h2&gt;
  
  
  Reach out to the mentors
&lt;/h2&gt;

&lt;p&gt;After setting up your project, reach out to your mentors and ask them about the available issues. You are encouraged to contact the mentors on the public forums mentioned in the project description. Avoid contacting the mentors privately. It is possible that others have the same doubts as you, and asking questions in the public forums clears them up for everyone.&lt;/p&gt;

&lt;h2&gt;
  
  
  Making Contributions
&lt;/h2&gt;

&lt;p&gt;Start early. Some project mentors find that they have many promising applicants. They may choose to close their project to new applicants. If you wait too long to start, your project may be closed to new applicants.&lt;/p&gt;

&lt;p&gt;Start with a smaller contribution. Then try a more complex contribution. The end goal is to show you have the skills to be a successful intern.&lt;/p&gt;

&lt;h2&gt;
  
  
  Recording Contributions
&lt;/h2&gt;

&lt;p&gt;Applicants are required to record their contributions on the Outreachy website. &lt;strong&gt;You need to have at least one recorded contribution to be allowed to submit a final application.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It is best to record a contribution as soon as you start working on it rather than waiting until the end, when you will be busy wrapping everything up and will not have time to think it through. You can go back and edit your recorded contribution at any time. Ask your mentors if they can review your contributions.&lt;/p&gt;

&lt;h1&gt;
  
  
  Final Application
&lt;/h1&gt;

&lt;p&gt;After recording your contributions, you move to the last phase, the final application phase. The final application asks four questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Past experience with this community.&lt;/li&gt;
&lt;li&gt;Past experience with other communities. &lt;/li&gt;
&lt;li&gt;Relevant Projects&lt;/li&gt;
&lt;li&gt;Outreachy internship project timeline: This question asks you to provide a tentative timeline for your internship project. Go through the project description and familiarize yourself with the provided tasks and milestones. Once you have a clear understanding of the project, create a timeline outlining the tasks you plan to work on during the internship. Break down your project into smaller weekly or bi-weekly tasks. Be prepared to adjust your timeline as needed to accommodate unexpected developments or changes in project scope. Make sure to take into account any time commitments you have during the Outreachy internship round. You can always ask your mentors to review your breakdown of tasks to see if you are on the right track.
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>beginners</category>
      <category>career</category>
      <category>opensource</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Teaching Machines to Understand Documents with Docling</title>
      <dc:creator>Ritika Bhambri</dc:creator>
      <pubDate>Fri, 10 Apr 2026 15:13:27 +0000</pubDate>
      <link>https://dev.to/ritika-bhambri/a-deep-dive-into-docling-35d3</link>
      <guid>https://dev.to/ritika-bhambri/a-deep-dive-into-docling-35d3</guid>
      <description>&lt;h1&gt;
  
  
  Docling Exploration
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;In this experiment I have explored the Docling CLI and used it to parse a PDF and export it to multiple formats. I have also tried various flags to become familiar with the basic commands and functionality of Docling, which is part of the RAG support in Ramalama.&lt;/p&gt;

&lt;h3&gt;
  
  
  Documents Used
&lt;/h3&gt;

&lt;p&gt;For this task I chose the &lt;a href="https://events.linuxfoundation.org/wp-content/uploads/2026/03/sponsor_pytconf26_eu_030526.pdf" rel="noopener noreferrer"&gt;PyTorch Conference brochure&lt;/a&gt; and the &lt;a href="https://arxiv.org/pdf/1706.03762" rel="noopener noreferrer"&gt;Attention Is All You Need&lt;/a&gt; paper. I chose the brochure because it has diverse elements (images, a multi-column layout, a multi-column table with rich formatting, and styled text), which makes it a good way to evaluate Docling's performance across different formats and to test features like table extraction and OCR.&lt;/p&gt;

&lt;p&gt;I also wanted to test the &lt;code&gt;--enrich-formula&lt;/code&gt; flag, but since the brochure has no mathematical formulas I used the &lt;strong&gt;Attention Is All You Need&lt;/strong&gt; research paper for that.&lt;/p&gt;

&lt;h3&gt;
  
  
  Errors Encountered
&lt;/h3&gt;

&lt;p&gt;During this experiment, when I was testing the &lt;code&gt;--force-ocr&lt;/code&gt; flag on the PyTorch Conference brochure, I encountered a &lt;strong&gt;memory allocation failure&lt;/strong&gt; from PyTorch because my system's available RAM and virtual memory were exhausted at that moment.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9wlryn3mj5ipf2yjy8ha.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9wlryn3mj5ipf2yjy8ha.jpg" alt="docling-error" width="800" height="204"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I identified the following likely reasons for this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Docling's pipeline with &lt;code&gt;--force-ocr&lt;/code&gt; is quite memory intensive because it loads heavy models for document understanding, table structure, etc.&lt;/li&gt;
&lt;li&gt;Force OCR processes every page as an image, which spikes memory usage.&lt;/li&gt;
&lt;li&gt;A 7-page brochure rich in images/graphics can still trigger high peak usage.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I tried to manage this by closing all other tabs and applications and by using the pypdfium2 backend, which is much lighter than the default.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;docling&lt;/span&gt; &lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;docling_brochure&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pdf&lt;/span&gt; &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="n"&gt;md&lt;/span&gt; &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;force&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;ocr&lt;/span&gt; &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;force_ocr&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;pdf&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;backend&lt;/span&gt; &lt;span class="n"&gt;pypdfium2&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I still got the same error, which meant that I had to either switch to a simpler PDF (one without tables and with fewer images) or skip some of the flags I wanted to test. To avoid this, I used Google Colab so that I could keep using the brochure and run all the experiments seamlessly.&lt;/p&gt;

&lt;p&gt;I have uploaded the notebook with cleanly marked cells to document each experiment along with the source PDFs and all the outputs in this repository.&lt;/p&gt;

&lt;h2&gt;
  
  
  Installation
&lt;/h2&gt;

&lt;p&gt;Installed Docling using pip &lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo3vqxqqxb0bhpxibnsm7.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo3vqxqqxb0bhpxibnsm7.jpg" alt="Docling-Installation" width="800" height="390"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Checked the version&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkpcogw2p9yusr0udk1mu.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkpcogw2p9yusr0udk1mu.jpg" alt="Docling-version" width="552" height="191"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I also explored the &lt;code&gt;--help&lt;/code&gt; flag to understand all the commands&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9a5wygbra2zsx2vgwwh8.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9a5wygbra2zsx2vgwwh8.jpg" alt="docling-help" width="800" height="636"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Converting the PDF into different formats
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Baseline (Markdown) Format
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="o"&gt;%%&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;
&lt;span class="err"&gt;!&lt;/span&gt;&lt;span class="n"&gt;docling&lt;/span&gt; &lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;docling_brochure&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pdf&lt;/span&gt; \
  &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;baseline&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;

&lt;span class="err"&gt;!&lt;/span&gt;&lt;span class="n"&gt;du&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;baseline&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;docling_brochure&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;md&lt;/span&gt;
&lt;span class="err"&gt;!&lt;/span&gt;&lt;span class="n"&gt;wc&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;l&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;baseline&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;docling_brochure&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;md&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This was the output of the above code&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh6sn3p6ttdr0fwkv1prg.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh6sn3p6ttdr0fwkv1prg.jpg" alt="Baseline-Markdown1" width="800" height="343"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Observations&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Overall output is clean and readable Markdown.&lt;/li&gt;
&lt;li&gt;Wall time: 2 min 26 s&lt;/li&gt;
&lt;li&gt;CPU time: user 347 ms + sys 56.4 ms = total 403 ms&lt;/li&gt;
&lt;li&gt;Generated file: 1.2 MB, 281 lines&lt;/li&gt;
&lt;li&gt;Table structure is fully preserved. I did not observe any broken cells or merged rows.&lt;/li&gt;
&lt;li&gt;Some minor visual disruption was observed: some icon images appear slightly misaligned compared with the original PDF, which uses tightly packed icon and text blocks.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4acczthn2atu5zwxi6d3.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4acczthn2atu5zwxi6d3.jpg" alt="Baseline-visual" width="532" height="672"&gt;&lt;/a&gt;&lt;/p&gt;
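&lt;p&gt;The gap between the wall time (minutes) and the CPU time (milliseconds) is most likely because &lt;code&gt;%%time&lt;/code&gt; reports CPU time for the notebook process only, while the conversion itself runs in a child process started by &lt;code&gt;!&lt;/code&gt;. A minimal sketch of the same effect in plain Python follows. The &lt;code&gt;time_child&lt;/code&gt; helper is hypothetical, and a sleeping child process stands in for the Docling run.&lt;/p&gt;

```python
import subprocess
import sys
import time


def time_child(cmd):
    """Run cmd and return (wall_seconds, parent_cpu_seconds)."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()  # CPU time of this process only
    subprocess.run(cmd, check=True)  # the child does the real work
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    return wall, cpu


# Stand-in child process (an assumption): it sleeps instead of running docling.
wall, cpu = time_child([sys.executable, "-c", "import time; time.sleep(0.3)"])
print(f"wall {wall:.2f} s, parent CPU {cpu * 1000:.1f} ms")
```

&lt;p&gt;The parent spends almost all of its time waiting, so its CPU time stays small no matter how long the child runs. For comparing the formats in this experiment, wall time is therefore the more meaningful number.&lt;/p&gt;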
&lt;h3&gt;
  
  
  JSON
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;--to json&lt;/code&gt; flag serialises the entire internal DoclingDocument object directly to JSON instead of to a human-readable format. This is Docling's native data format. The code for converting the original PDF to JSON:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="o"&gt;%%&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;
&lt;span class="err"&gt;!&lt;/span&gt;&lt;span class="n"&gt;docling&lt;/span&gt; &lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;docling_brochure&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pdf&lt;/span&gt; \
  &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt; \
  &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;

&lt;span class="err"&gt;!&lt;/span&gt;&lt;span class="n"&gt;du&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;docling_brochure&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This was the output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff08jbtrqctermsdc92ne.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff08jbtrqctermsdc92ne.jpg" alt="JSON1" width="800" height="161"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;JSON Schema &lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frhhdm5xhu4ccprx0gzpn.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frhhdm5xhu4ccprx0gzpn.jpg" alt="JSON-Output" width="800" height="577"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h4&gt;
  
  
  JSON Structure Analysis
&lt;/h4&gt;

&lt;p&gt;I wanted to understand the internal document representation created by Docling and to verify that the structural elements (headings, tables, images) were correctly detected and separated, so I wrote a small analysis script:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;outputs/json/docling_brochure.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;doc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Top-level keys : &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;keys&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Text blocks    : &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;texts&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[]))&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Tables         : &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;tables&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[]))&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Pictures       : &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;pictures&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[]))&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Pages          : &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;pages&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="si"&gt;{}&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The Output&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;
Top-level keys : &lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'schema_name'&lt;/span&gt;, &lt;span class="s1"&gt;'version'&lt;/span&gt;, &lt;span class="s1"&gt;'name'&lt;/span&gt;, &lt;span class="s1"&gt;'origin'&lt;/span&gt;, &lt;span class="s1"&gt;'furniture'&lt;/span&gt;, &lt;span class="s1"&gt;'body'&lt;/span&gt;, &lt;span class="s1"&gt;'groups'&lt;/span&gt;, &lt;span class="s1"&gt;'texts'&lt;/span&gt;, &lt;span class="s1"&gt;'pictures'&lt;/span&gt;, &lt;span class="s1"&gt;'tables'&lt;/span&gt;, &lt;span class="s1"&gt;'key_value_items'&lt;/span&gt;, &lt;span class="s1"&gt;'form_items'&lt;/span&gt;, &lt;span class="s1"&gt;'pages'&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;
Text blocks    : 122
Tables         : 1
Pictures       : 25
Pages          : 7
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
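&lt;p&gt;The counts above can be broken down further by element type. The sketch below assumes that each entry in &lt;code&gt;texts&lt;/code&gt; carries a &lt;code&gt;label&lt;/code&gt; field; that key name, the &lt;code&gt;count_labels&lt;/code&gt; helper, and the sample data are illustrative assumptions rather than output from this experiment.&lt;/p&gt;

```python
from collections import Counter


def count_labels(doc):
    """Count items in doc["texts"] by their "label" value (assumed key)."""
    return Counter(item.get("label", "unknown") for item in doc.get("texts", []))


# Tiny hand-made stand-in for the parsed JSON, not real Docling output.
sample = {
    "texts": [
        {"label": "section_header", "text": "Sponsorship"},
        {"label": "text", "text": "Join us."},
        {"label": "text", "text": "Contact"},
    ]
}

for label, n in count_labels(sample).most_common():
    print(f"{label:15} {n}")
```

&lt;p&gt;On the real file, passing the loaded &lt;code&gt;doc&lt;/code&gt; dictionary instead of &lt;code&gt;sample&lt;/code&gt; would show how the 122 text blocks split into headings, paragraphs, list items and so on, provided the assumed key is present.&lt;/p&gt;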



&lt;p&gt;&lt;strong&gt;Observations&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Top-level keys show that Docling uses a comprehensive DoclingDocument schema. Important fields include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;texts&lt;/code&gt; — Individual text items (paragraphs, headings, lists, etc.)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;tables&lt;/code&gt; — Structured tables with cell information&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;pictures&lt;/code&gt; — Detected images and icons&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;furniture&lt;/code&gt;, &lt;code&gt;body&lt;/code&gt;, &lt;code&gt;groups&lt;/code&gt; — Internal layout and hierarchy information&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;pages&lt;/code&gt; — Page-level metadata&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;&lt;p&gt;It correctly identified &lt;strong&gt;1 table&lt;/strong&gt;, &lt;strong&gt;25 images&lt;/strong&gt; and &lt;strong&gt;7 pages&lt;/strong&gt;, matching the original brochure.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;The hierarchical structure of the document (title → sections → subsections → promotional blocks) is preserved in the JSON schema (with text, level, type, children, etc).&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;Wall time: 2 min 7 s&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;CPU time: user 270 ms + sys 38.5 ms = total 308 ms&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;File size: 6.5 MB&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;The file size is significantly larger than all other formats because the JSON contains all image data as base64 and all the structural metadata. It's the most information-dense format.&lt;/p&gt;&lt;/li&gt;

&lt;/ul&gt;
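&lt;p&gt;One way to check how much of the JSON file is embedded image data is to walk the parsed document and add up the length of every string that looks like a data URI. This is only a sketch: it assumes the images are stored as &lt;code&gt;data:&lt;/code&gt; URIs somewhere in the tree, and the &lt;code&gt;data_uri_chars&lt;/code&gt; helper and sample dictionary are made up for illustration.&lt;/p&gt;

```python
import json


def data_uri_chars(node):
    """Recursively sum the length of all strings that start with 'data:'."""
    if isinstance(node, str):
        return len(node) if node.startswith("data:") else 0
    if isinstance(node, dict):
        return sum(data_uri_chars(value) for value in node.values())
    if isinstance(node, list):
        return sum(data_uri_chars(value) for value in node)
    return 0  # numbers, booleans and None carry no image data


# Hand-made stand-in for the parsed JSON, not real Docling output.
sample = {
    "pictures": [{"image": {"uri": "data:image/png;base64,AAAA"}}],
    "texts": [{"text": "hello"}],
}

total = len(json.dumps(sample))
embedded = data_uri_chars(sample)
print(f"{embedded} of {total} characters are embedded data URIs")
```

&lt;p&gt;Because the walk does not depend on specific key names, it should keep working even if the images sit at a different path in the document than the sample suggests.&lt;/p&gt;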

&lt;h3&gt;
  
  
  Plain Text
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;--to text&lt;/code&gt; flag extracts only the raw text content, discarding all structural markup.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="o"&gt;%%&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;
&lt;span class="err"&gt;!&lt;/span&gt;&lt;span class="n"&gt;docling&lt;/span&gt; &lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;docling_brochure&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pdf&lt;/span&gt; \
  &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt; \
  &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;

&lt;span class="err"&gt;!&lt;/span&gt;&lt;span class="n"&gt;du&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;docling_brochure&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;txt&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Output&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqqjgza52mjpuamrsshoq.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqqjgza52mjpuamrsshoq.jpg" alt="Plaintext1" width="800" height="177"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is what the &lt;code&gt;.txt&lt;/code&gt; file looked like:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8d3xysii384qzuazmrfz.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8d3xysii384qzuazmrfz.jpg" alt="PlainText-Output" width="800" height="376"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Observations&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Wall time: 2 min 5 s&lt;/li&gt;
&lt;li&gt;CPU time: user 281 ms + sys 35.7 ms = total 316 ms&lt;/li&gt;
&lt;li&gt;File size: 28 KB&lt;/li&gt;
&lt;li&gt;The file is very small because it contains only characters, with no structural or visual data. I also noticed that there were no &lt;code&gt;&amp;lt;!-- image --&amp;gt;&lt;/code&gt; comments, no base64 data, and no image references whatsoever, because the plain-text format does not export images.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  DocTags
&lt;/h3&gt;

&lt;p&gt;DocTags is Docling's own markup language. It is a structured text format that uses XML-like tags to annotate document elements. It was designed specifically for training vision-language models on document understanding tasks.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="o"&gt;%%&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;
&lt;span class="err"&gt;!&lt;/span&gt;&lt;span class="n"&gt;docling&lt;/span&gt; &lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;docling_brochure&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pdf&lt;/span&gt; \
  &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="n"&gt;doctags&lt;/span&gt; \
  &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;doctags&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;

&lt;span class="err"&gt;!&lt;/span&gt;&lt;span class="n"&gt;du&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;doctags&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;
&lt;span class="err"&gt;!&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;40&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;doctags&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;docling_brochure&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;doctags&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Observation&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Wall time: 2 min 3 s&lt;/li&gt;
&lt;li&gt;CPU time: user 260 ms + sys 43.7 ms = total 304 ms&lt;/li&gt;
&lt;li&gt;Output size: 28 KB&lt;/li&gt;
&lt;li&gt;The output was a lightweight file (28 KB) with the semantic structure and document hierarchy well preserved.&lt;/li&gt;
&lt;li&gt;This output looked quite different from the other formats generated so far, so I took a closer look. Every piece of content is wrapped in small XML-style tags, and location tokens such as &lt;code&gt;&amp;lt;loc_153&amp;gt;&lt;/code&gt; &lt;code&gt;&amp;lt;loc_210&amp;gt;&lt;/code&gt; record where that text or image sits on the page.&lt;/li&gt;
&lt;li&gt;DocTags may be useful for RAG applications that need lightweight processing with basic structure and region-aware retrieval, but for most RAG systems the Markdown or JSON formats are much easier to work with.&lt;/li&gt;
&lt;/ul&gt;
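
&lt;p&gt;A minimal sketch of reading those location tokens in Python. The grouping into four values per element (x1, y1, x2, y2) is an assumption about the DocTags layout that I did not verify in this experiment, and &lt;code&gt;extract_boxes&lt;/code&gt; is a hypothetical helper name, not part of Docling.&lt;/p&gt;

```python
import re

# Matches the numeric part of a location token such as loc_153.
LOC_TOKEN = re.compile(r"loc_(\d+)")


def extract_boxes(doctags_line):
    """Return the location values found on one DocTags line, grouped in fours.

    Assumption: each element carries four location tokens (x1, y1, x2, y2).
    """
    values = [int(v) for v in LOC_TOKEN.findall(doctags_line)]
    # Drop a trailing partial group rather than guessing the missing values.
    usable = len(values) - len(values) % 4
    return [tuple(values[i:i + 4]) for i in range(0, usable, 4)]
```

&lt;p&gt;Passing one line of the &lt;code&gt;.doctags&lt;/code&gt; file to the helper returns a list of 4-tuples, which could then be stored as region metadata next to each chunk.&lt;/p&gt;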

&lt;h3&gt;
  
  
  Image Modes
&lt;/h3&gt;

&lt;h4&gt;
  
  
  HTML Embedded
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="o"&gt;%%&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;
&lt;span class="err"&gt;!&lt;/span&gt;&lt;span class="n"&gt;docling&lt;/span&gt; &lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;docling_brochure&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pdf&lt;/span&gt; \
  &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="n"&gt;html&lt;/span&gt; \
  &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;html_embedded&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;

&lt;span class="err"&gt;!&lt;/span&gt;&lt;span class="n"&gt;du&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;html_embedded&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;docling_brochure&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;html&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Output&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is what the rendered file looked like&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fuser-attachments%2Fassets%2F32782a35-21e6-44aa-91d4-cc3ae7563115" class="article-body-image-wrapper"&gt;&lt;img width="1032" height="860" alt="image" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fuser-attachments%2Fassets%2F32782a35-21e6-44aa-91d4-cc3ae7563115"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  HTML Placeholder
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="o"&gt;%%&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;
&lt;span class="err"&gt;!&lt;/span&gt;&lt;span class="n"&gt;docling&lt;/span&gt; &lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;docling_brochure&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pdf&lt;/span&gt; \
  &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="n"&gt;html&lt;/span&gt; \
  &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;export&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;mode&lt;/span&gt; &lt;span class="n"&gt;placeholder&lt;/span&gt; \
  &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; \
  &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;pdf&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;backend&lt;/span&gt; &lt;span class="n"&gt;pypdfium2&lt;/span&gt; \
  &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;html_placeholder&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;amp;&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;tee&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;html_placeholder&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;txt&lt;/span&gt;

&lt;span class="err"&gt;!&lt;/span&gt;&lt;span class="n"&gt;du&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;html_placeholder&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;docling_brochure&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;html&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Output&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is what the rendered file looked like&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fuser-attachments%2Fassets%2F12cd689e-1743-4379-aa17-395d9748c111" class="article-body-image-wrapper"&gt;&lt;img width="1032" height="843" alt="image" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fuser-attachments%2Fassets%2F12cd689e-1743-4379-aa17-395d9748c111"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Image Mode Size Comparison
&lt;/h4&gt;

&lt;p&gt;I wrote some code to compare the output size of the two image handling modes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;

&lt;span class="c1"&gt;# Size in bytes of each HTML export
&lt;/span&gt;&lt;span class="n"&gt;emb&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getsize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;outputs/html_embedded/docling_brochure.html&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ph&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getsize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;outputs/html_placeholder/docling_brochure.html&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Embedded    : &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;emb&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; bytes (&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;emb&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; KB)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Placeholder : &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;ph&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; bytes (&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;ph&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; KB)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Embedded is &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;emb&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;ph&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;x larger than placeholder&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Output&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;Embedded    : 1,178,502 bytes &lt;span class="o"&gt;(&lt;/span&gt;1150.9 KB&lt;span class="o"&gt;)&lt;/span&gt;
Placeholder : 18,286 bytes &lt;span class="o"&gt;(&lt;/span&gt;17.9 KB&lt;span class="o"&gt;)&lt;/span&gt;

Embedded is 64x larger than placeholder
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Observation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. HTML Embedded&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Maintains good visual layout &amp;amp; headings and keeps all icons and images visible. &lt;/li&gt;
&lt;li&gt;The file is extremely large (about 64× the size of the placeholder version) because every image is embedded as a base64 string.&lt;/li&gt;
&lt;li&gt;These base64 blocks create a lot of noise when chunking and embedding.&lt;/li&gt;
&lt;li&gt;This format may be helpful when using multimodal RAG with vision models.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. HTML Placeholder&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The output file is very small and lightweight (18 KB).&lt;/li&gt;
&lt;li&gt;It has clean text and structure with minimal noise: the visual clutter that adds no value to text-based RAG systems is removed, so the content is easy to chunk.&lt;/li&gt;
&lt;li&gt;This format is useful for text-based RAG systems that involve searching over documents and question answering.&lt;/li&gt;
&lt;/ul&gt;
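
&lt;p&gt;If only the embedded HTML is available, the base64 noise can be removed before chunking. The sketch below is a rough illustration, not something Docling provides: &lt;code&gt;strip_base64_images&lt;/code&gt; and &lt;code&gt;base64_share&lt;/code&gt; are hypothetical helper names, and the regular expression assumes images are embedded as standard &lt;code&gt;data:image/...;base64,...&lt;/code&gt; URIs.&lt;/p&gt;

```python
import re

# Matches an inline image payload such as data:image/png;base64,iVBORw0...
DATA_URI = re.compile(r"data:image/[A-Za-z0-9.+-]+;base64,[A-Za-z0-9+/=]+")


def strip_base64_images(html, placeholder="[image]"):
    """Replace every base64 image payload with a short placeholder."""
    return DATA_URI.sub(placeholder, html)


def base64_share(html):
    """Fraction of the document's characters taken up by base64 payloads."""
    if not html:
        return 0.0
    payload = sum(len(match.group(0)) for match in DATA_URI.finditer(html))
    return payload / len(html)
```

&lt;p&gt;Running &lt;code&gt;base64_share&lt;/code&gt; on the embedded export gives a quick estimate of how much of the file is image data rather than text.&lt;/p&gt;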

&lt;h2&gt;
  
  
  Table Structure Experiments
&lt;/h2&gt;

&lt;p&gt;Docling's &lt;code&gt;--table-mode&lt;/code&gt; flag accepts two values: &lt;code&gt;--table-mode accurate&lt;/code&gt; and &lt;code&gt;--table-mode fast&lt;/code&gt;. After the layout model identifies a table region, a second model called &lt;em&gt;TableFormer&lt;/em&gt; reconstructs the row boundaries, column separators, and content mapping.&lt;/p&gt;

&lt;h3&gt;
  
  
  Accurate Mode
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="o"&gt;%%&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;
&lt;span class="err"&gt;!&lt;/span&gt;&lt;span class="n"&gt;docling&lt;/span&gt; &lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;docling_brochure&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pdf&lt;/span&gt; \
  &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="n"&gt;md&lt;/span&gt; \
  &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;mode&lt;/span&gt; &lt;span class="n"&gt;accurate&lt;/span&gt; \
  &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;table_accurate&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;

&lt;span class="err"&gt;!&lt;/span&gt;&lt;span class="n"&gt;wc&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;l&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;table_accurate&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;docling_brochure&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;md&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Output&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fuser-attachments%2Fassets%2Ffc681433-d440-407f-8539-8efa2610b5d0" class="article-body-image-wrapper"&gt;&lt;img width="897" height="700" alt="image" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fuser-attachments%2Fassets%2Ffc681433-d440-407f-8539-8efa2610b5d0"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Fast Mode
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="o"&gt;%%&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;
&lt;span class="err"&gt;!&lt;/span&gt;&lt;span class="n"&gt;docling&lt;/span&gt; &lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;docling_brochure&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pdf&lt;/span&gt; \
  &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="n"&gt;md&lt;/span&gt; \
  &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;mode&lt;/span&gt; &lt;span class="n"&gt;fast&lt;/span&gt; \
  &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;table_fast&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;

&lt;span class="err"&gt;!&lt;/span&gt;&lt;span class="n"&gt;wc&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;l&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;table_fast&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;docling_brochure&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;md&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Output&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;--table-mode fast&lt;/code&gt; run generated a table with several inconsistencies. The screenshots below show merged cells and the content of one cell bleeding into the next.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fuser-attachments%2Fassets%2Ff1c4aa28-d844-4ab1-ab85-64cd9f02c3a1" class="article-body-image-wrapper"&gt;&lt;img width="627" height="700" alt="image" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fuser-attachments%2Fassets%2Ff1c4aa28-d844-4ab1-ab85-64cd9f02c3a1"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fuser-attachments%2Fassets%2F0c185279-e4c3-4ed1-8857-31efa2034605" class="article-body-image-wrapper"&gt;&lt;img width="895" height="650" alt="image" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fuser-attachments%2Fassets%2F0c185279-e4c3-4ed1-8857-31efa2034605"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Table Corruption Proof
&lt;/h3&gt;

&lt;p&gt;Both runs completed without errors or warnings, and both output files are similar in size. The corruption only became visible when I inspected specific rows directly. Here is the code I wrote to explicitly demonstrate the table corruption in &lt;code&gt;--table-mode fast&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="err"&gt;!&lt;/span&gt;&lt;span class="n"&gt;echo&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;=== ACCURATE ===&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="err"&gt;!&lt;/span&gt;&lt;span class="n"&gt;grep&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Social Media\|Attendee Registration\|Sponsorship Cost&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; \
  &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;table_accurate&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;docling_brochure&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;md&lt;/span&gt;

&lt;span class="err"&gt;!&lt;/span&gt;&lt;span class="n"&gt;echo&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;=== FAST ===&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="err"&gt;!&lt;/span&gt;&lt;span class="n"&gt;grep&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Social Media\|Attendee Registration\|Sponsorship Cost&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; \
  &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;table_fast&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;docling_brochure&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;md&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Output&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;
&lt;span class="o"&gt;===&lt;/span&gt; ACCURATE &lt;span class="o"&gt;===&lt;/span&gt;
146:| Attendee Registration Contact List: Opt-in only                                                                                                                                                                                                              | ✔ &lt;span class="o"&gt;(&lt;/span&gt;List provided pre and post event&lt;span class="o"&gt;)&lt;/span&gt;                                                   | ✔ &lt;span class="o"&gt;(&lt;/span&gt;List provided post event&lt;span class="o"&gt;)&lt;/span&gt;                                                           |                                                                                        |                                                                                        |                                                                                        |                                                                                        |
147:| Social Media Promotion: From PyTorch X handle. All custom posts must be approved by the PyTorch Foundation.                                                                                                                                                  | 1 Custom Post, 1 Group Post, and 1 Re-Post                                             | 1 Group Post and 1 Re-Post                                                             | 1 Group Post                                                                           |                                                                                        |                                                                                        |                                                                                        |
160:| Sponsorship Cost                                                                                                                                                                                                                                             | &lt;span class="nv"&gt;$50&lt;/span&gt;,000                                                                                | &lt;span class="nv"&gt;$35&lt;/span&gt;,000                                                                                | &lt;span class="nv"&gt;$18&lt;/span&gt;,000                                                                                | &lt;span class="nv"&gt;$8&lt;/span&gt;,000                                                                                 | &lt;span class="nv"&gt;$4&lt;/span&gt;,000                                                                                 | &lt;span class="nv"&gt;$4&lt;/span&gt;,000                                                                                 |
&lt;span class="o"&gt;===&lt;/span&gt; FAST &lt;span class="o"&gt;===&lt;/span&gt;
147:| Attendee Registration Contact List: Opt-in only Social Media Promotion: From PyTorch X handle. All custom                                                                         | ✔ &lt;span class="o"&gt;(&lt;/span&gt;List provided pre and post event&lt;span class="o"&gt;)&lt;/span&gt; 1 Custom Post, 1 Group Post, and                  | ✔ &lt;span class="o"&gt;(&lt;/span&gt;List provided post event&lt;span class="o"&gt;)&lt;/span&gt; 1 Group Post and                                          | 1 Group Post                                                                           |                                                                                        |                                                                                   |                                                                                   |
161:| Sponsorship Cost                                                                                                                                                                  |                                                                                        |                                                                                        |                                                                                        |                                                                                        |                                                                                   |                                                                                   |



&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Analysis
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Table Mode: --table-mode accurate vs --table-mode fast&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Performance Results:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Mode&lt;/th&gt;
&lt;th&gt;Wall Time&lt;/th&gt;
&lt;th&gt;Line Count&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Accurate&lt;/td&gt;
&lt;td&gt;2m 3s&lt;/td&gt;
&lt;td&gt;281&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fast&lt;/td&gt;
&lt;td&gt;1m 53s&lt;/td&gt;
&lt;td&gt;283&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Fast mode is 10 seconds faster, but it also generates 2 extra lines: phantom rows created by cell overflow.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Three distinct failures:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Failure 1&lt;/strong&gt; (row merger): Lines 146 and 147 in accurate mode are two separate rows. In fast mode they collapse into one, with both row descriptions concatenated inside a single cell and no separator between them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Failure 2&lt;/strong&gt; (content overflow): The Diamond tier cell in the merged row contains values from both rows joined together and cut off mid-sentence. The "1 Re-Post" value disappeared entirely.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Failure 3&lt;/strong&gt; (pricing row emptied): The Sponsorship Cost row exists in both outputs, but in fast mode every pricing cell is empty. $50,000, $35,000, $18,000, $8,000, $4,000, and $4,000 are all absent. The values were displaced into phantom rows that have no corresponding real row in the document.&lt;/p&gt;
&lt;h3&gt;
  
  
  RAG Implication
&lt;/h3&gt;

&lt;p&gt;A RAG system querying "What is the cost of a Diamond sponsorship?" would retrieve the Sponsorship Cost row from the fast mode output and find empty cells. It would either return nothing or hallucinate a value from training data. Fast mode corrupts the table silently: no errors were raised at any stage.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;--table-mode fast&lt;/code&gt; processes the document in less time, but it is less accurate and produced structurally invalid output for the complex table in this brochure. For documents with large, complex tables that carry the important knowledge, &lt;code&gt;--table-mode accurate&lt;/code&gt; is the right choice.&lt;/p&gt;
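
&lt;p&gt;Because neither run raised an error, a check on the Markdown output itself is one way to catch this before indexing. The sketch below is a heuristic and not part of Docling: &lt;code&gt;rows_with_empty_values&lt;/code&gt; is a hypothetical helper that flags table rows with a label but no values, as happened to the Sponsorship Cost row. It assumes simple pipe tables with no escaped pipes inside cells, and it would also flag a row that is legitimately empty in the source document.&lt;/p&gt;

```python
def split_row(line):
    """Split one Markdown table row into stripped cell strings."""
    text = line.strip()
    # Remove exactly one leading and one trailing pipe, if present.
    if text.startswith("|"):
        text = text[1:]
    if text.endswith("|"):
        text = text[:-1]
    return [cell.strip() for cell in text.split("|")]


def is_separator(cells):
    """True for the header separator row made only of dashes and colons."""
    return all(cell and set(cell).issubset("-:") for cell in cells)


def rows_with_empty_values(markdown):
    """Return the labels of table rows whose value cells are all empty."""
    flagged = []
    for line in markdown.splitlines():
        if not line.lstrip().startswith("|"):
            continue  # not a table row
        cells = split_row(line)
        if len(cells) == 1 or is_separator(cells):
            continue
        label, values = cells[0], cells[1:]
        if label and not any(values):
            flagged.append(label)
    return flagged
```

&lt;p&gt;Comparing the flagged labels from the fast and accurate outputs shows at a glance which rows lost their values.&lt;/p&gt;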
&lt;h2&gt;
  
  
  OCR Related Experiments
&lt;/h2&gt;

&lt;p&gt;Docling's OCR engine (RapidOCR) handles text extraction from image-based regions. Two flags control its OCR behaviour:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;--no-ocr&lt;/code&gt; - disables OCR entirely; text comes only from the PDF's embedded text layer&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--force-ocr&lt;/code&gt; - forces OCR on every region of every page regardless of whether embedded text exists&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  &lt;code&gt;--no-ocr&lt;/code&gt;
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="o"&gt;%%&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;
&lt;span class="err"&gt;!&lt;/span&gt;&lt;span class="n"&gt;docling&lt;/span&gt; &lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;docling_brochure&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pdf&lt;/span&gt; \
  &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="n"&gt;md&lt;/span&gt; \
  &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;no&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;ocr&lt;/span&gt; \
  &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;no_ocr&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;

&lt;span class="err"&gt;!&lt;/span&gt;&lt;span class="n"&gt;du&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;no_ocr&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;docling_brochure&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;md&lt;/span&gt;
&lt;span class="err"&gt;!&lt;/span&gt;&lt;span class="n"&gt;wc&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;l&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;no_ocr&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;docling_brochure&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;md&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Output&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Wall time: 1 min 9 s&lt;/li&gt;
&lt;li&gt;The output is 1.2 MB with 281 lines, the same line count as the baseline, which means that no extra lines or phantom rows were added.&lt;/li&gt;
&lt;li&gt;The table generated in the rendered file was clean and retained original structure.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  &lt;code&gt;--force-ocr&lt;/code&gt;
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="o"&gt;%%&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;
&lt;span class="err"&gt;!&lt;/span&gt;&lt;span class="n"&gt;docling&lt;/span&gt; &lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;docling_brochure&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pdf&lt;/span&gt; \
  &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="n"&gt;md&lt;/span&gt; \
  &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;force&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;ocr&lt;/span&gt; \
  &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;profiling&lt;/span&gt; \
  &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;force_ocr&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;amp;&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;tee&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;force_ocr&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;txt&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Output&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Wall time: 14 minutes 37 seconds&lt;/li&gt;
&lt;li&gt;The table structure was retained, but the check marks in the third-to-last and second-to-last rows were not detected.&lt;/li&gt;
&lt;li&gt;Not only did &lt;code&gt;--force-ocr&lt;/code&gt; take much longer to process the document, it also degraded the quality of the output.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fuser-attachments%2Fassets%2F3782141f-9076-4d30-8817-cc0373ebd58a" class="article-body-image-wrapper"&gt;&lt;img width="940" height="427" alt="Screenshot 2026-04-10 193703" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fuser-attachments%2Fassets%2F3782141f-9076-4d30-8817-cc0373ebd58a"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Observation
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;One peculiar thing I observed is that both OCR modes completely missed the footnotes and small text. Both flags missed the startups, startup sponsorships and non-profit sponsorships footnotes seen in the original brochure below.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fuser-attachments%2Fassets%2F1befb642-ce52-4257-9f59-536ccfa3d930" class="article-body-image-wrapper"&gt;&lt;img width="916" height="171" alt="Screenshot 2026-04-10 193212" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fuser-attachments%2Fassets%2F1befb642-ce52-4257-9f59-536ccfa3d930"&gt;&lt;/a&gt;&lt;/p&gt;
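&lt;p&gt;An observation like this can be confirmed by searching the generated Markdown for phrases that should be present. The following is a minimal sketch, not part of Docling: the helper &lt;code&gt;missing_phrases&lt;/code&gt; and the sample text are hypothetical, and the real input would be the contents of the generated &lt;code&gt;.md&lt;/code&gt; file.&lt;/p&gt;

```python
def missing_phrases(markdown_text, expected_phrases):
    """Return the expected phrases that do not occur in the text (case-insensitive)."""
    # Collapse runs of whitespace so phrases split across lines still match.
    haystack = " ".join(markdown_text.lower().split())
    return [p for p in expected_phrases if p.lower() not in haystack]


# Hypothetical sample standing in for a converted brochure.
sample = "| Tier | Benefit |\n| Startups | Logo on site |"

print(missing_phrases(sample, ["Startups", "non-profit sponsorships"]))
```

&lt;p&gt;Any phrase returned by the helper is a candidate for text that the conversion dropped, such as a footnote.&lt;/p&gt;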
&lt;h2&gt;
  
  
  Enrich formula flag
&lt;/h2&gt;

&lt;p&gt;To test this flag I used the 'Attention Is All You Need' paper. &lt;/p&gt;

&lt;p&gt;Internally, the &lt;code&gt;--enrich-formula&lt;/code&gt; flag activates an additional model pass over regions classified as mathematical formulas, after the standard pipeline has extracted the text. The model is trained to interpret mathematical notation and output it in a structured format such as LaTeX or MathML.&lt;/p&gt;

&lt;p&gt;This is needed for PDFs rich in scientific notation and mathematical formulas because a PDF stores math as arbitrarily positioned characters. For example, the integral sign, the fraction bar and the variables are stored as separate positioned glyphs with no inherent relationship. The enrichment model reconstructs the mathematical meaning from the spatial arrangement of the glyphs.&lt;br&gt;
&lt;/p&gt;
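&lt;p&gt;To make the problem concrete, here is a toy sketch of glyphs stored as positioned characters. It is not how Docling works internally: the coordinates and the &lt;code&gt;flatten&lt;/code&gt; helper are invented for illustration. Reading the glyphs in plain top-to-bottom, left-to-right order loses the fraction structure of "a over b equals c".&lt;/p&gt;

```python
# Each glyph is (character, x, y); y grows downward. Coordinates are made up.
glyphs = [
    ("a", 10, 0),   # numerator, above the fraction bar
    ("-", 10, 5),   # fraction bar
    ("b", 10, 10),  # denominator, below the fraction bar
    ("=", 20, 5),
    ("c", 30, 5),
]


def flatten(glyphs):
    """Naive extraction: order glyphs by row, then column, and join them."""
    ordered = sorted(glyphs, key=lambda g: (g[2], g[1]))
    return "".join(ch for ch, _x, _y in ordered)


print(flatten(glyphs))  # prints: a-=cb
```

&lt;p&gt;The flat string no longer says that &lt;code&gt;a&lt;/code&gt; sits over &lt;code&gt;b&lt;/code&gt;. A formula-aware model has to use the positions to recover that relationship.&lt;/p&gt;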

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="o"&gt;%%&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;
&lt;span class="err"&gt;!&lt;/span&gt;&lt;span class="n"&gt;docling&lt;/span&gt; &lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;Transformer_Paper&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pdf&lt;/span&gt; \
  &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="n"&gt;md&lt;/span&gt; \
  &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;enrich&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;formula&lt;/span&gt; \
  &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;enrich_formula&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Without &lt;code&gt;--enrich-formula&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;Here is the baseline paper without &lt;code&gt;--enrich-formula&lt;/code&gt;. The &lt;strong&gt;Attention formula&lt;/strong&gt; is garbled: the PDF glyphs were extracted as a flat sequence of characters, without preserving the structural relationship between them.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv27bhsinsbnzjstluj5q.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv27bhsinsbnzjstluj5q.jpg" alt="Transformer-baseline" width="518" height="426"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  With &lt;code&gt;--enrich-formula&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;Here is the Transformer paper with &lt;code&gt;--enrich-formula&lt;/code&gt;. The &lt;strong&gt;Attention formula&lt;/strong&gt; is now formatted as LaTeX, which an LLM can parse correctly and a Markdown renderer with math support can display properly.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftc1gmmkkefdxj4a9v8go.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftc1gmmkkefdxj4a9v8go.jpg" alt="Transformer-enriched" width="505" height="382"&gt;&lt;/a&gt;&lt;/p&gt;
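&lt;p&gt;For reference, the scaled dot-product attention formula from the paper can be written in LaTeX as shown here. This is the formula itself, not a copy of Docling's output, whose exact macros and spacing may differ.&lt;/p&gt;

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V
```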

</description>
      <category>ai</category>
      <category>rag</category>
      <category>docling</category>
    </item>
    <item>
      <title>From Windows to Fedora: A Beginner's Guide to the Open Source World</title>
      <dc:creator>Ritika Bhambri</dc:creator>
      <pubDate>Wed, 25 Mar 2026 09:33:21 +0000</pubDate>
      <link>https://dev.to/ritika-bhambri/from-windows-to-fedora-a-beginners-guide-to-the-open-source-world-3lc6</link>
      <guid>https://dev.to/ritika-bhambri/from-windows-to-fedora-a-beginners-guide-to-the-open-source-world-3lc6</guid>
      <description>&lt;p&gt;&lt;em&gt;What is Fedora, why it exists, and why a lifelong Windows user like me decided to make the leap and never looked back.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Why I'm Writing This
&lt;/h2&gt;

&lt;p&gt;For most of my life, Windows was just what computers ran. It was the default, the familiar, the thing that was always just... there. I never questioned it. Open the laptop, see the Windows logo, get to work. Simple. But if you are a developer, you will realise as you move to more complicated projects that this thinking quickly hits a wall.&lt;/p&gt;

&lt;p&gt;There's a particular kind of frustration that only a Windows developer knows: a package that refuses to install because of some missing build tool, a terminal that behaves differently from every tutorial you follow, path issues that make no sense, environment variables that need a restart to take effect. You spend more time fighting the system than actually building things.&lt;/p&gt;

&lt;p&gt;Then Outreachy, a program that opens doors to open-source contribution for underrepresented folks in tech, introduced me to Fedora. I won't lie, the first few days were disorienting. What even is a Linux distribution? What does "open source" actually mean in practice? Why do people get so passionate about an operating system?&lt;/p&gt;

&lt;p&gt;But the more I explored, the more something clicked. Fedora wasn't just a different OS. It was a different philosophy about how software should work, who it should serve, and who gets to build it.&lt;/p&gt;

&lt;p&gt;This blog post is my attempt to share what I've learned, for anyone who, like me, started from zero. If you're a Windows user who is curious about a Linux distribution but doesn't know where to begin, or if you're thinking about applying to Outreachy and want to understand Fedora before you dive in, this one's for you.&lt;/p&gt;




&lt;h2&gt;
  
  
  So, What Is the Fedora Project?
&lt;/h2&gt;

&lt;p&gt;Before I talk about the operating system, it's important to understand something: Fedora is not just software. It's a project and a community.&lt;/p&gt;

&lt;p&gt;The Fedora Project exists with a mission to create "an innovative platform for hardware, clouds, and containers that enables software developers and community members to build tailored solutions for their users." To put it simply, Fedora is a group of people (engineers, designers, writers, testers, and contributors from all over the world) who come together to build free, open-source software for everyone.&lt;/p&gt;

&lt;p&gt;It was founded in 2003 when Red Hat decided to split its Linux distribution into two: Red Hat Enterprise Linux (RHEL) for businesses, and Fedora, a free, community-supported version for everyone else.&lt;/p&gt;

&lt;h2&gt;
  
  
  What actually happens inside Fedora?
&lt;/h2&gt;

&lt;p&gt;On the surface, a new version of Fedora Linux is released every six months, but behind that release is a whole ecosystem of teams working simultaneously. There are teams handling infrastructure, release engineering, quality assurance, documentation, design, localization, and community outreach. Groups like the Design Team work on improving user experience, providing artwork and usability services to the project, while other teams focus on keeping the servers running, reviewing packages, and writing the docs you'll depend on as a new user.&lt;/p&gt;

&lt;p&gt;Beyond the day-to-day work, Fedora also coordinates major community events like Flock to Fedora, an annual contributor conference. These are spaces where contributors meet, collaborate face-to-face, and shape the direction of the project.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Four Foundations - Fedora's Soul in Four Words
&lt;/h2&gt;

&lt;p&gt;The Four Foundations are: Freedom, Friends, Features, and First. Let me tell you what each one actually means.&lt;/p&gt;

&lt;h3&gt;
  
  
  Freedom
&lt;/h3&gt;

&lt;p&gt;The Fedora Project promotes the use of free, open-source software and does not distribute proprietary software, with very limited exceptions for hardware firmware. But Freedom here goes deeper than being free to download. It means you have the right to look inside the software you use, modify it, share it, and build on top of it. Coming from proprietary software, where updates happen to you, settings are hidden behind paywalls, and you never really know what's running in the background, this hit differently.&lt;/p&gt;

&lt;h3&gt;
  
  
  Friends
&lt;/h3&gt;

&lt;p&gt;The Fedora Project cultivates a diverse community open to people from all backgrounds. This is the first thing you will notice: you can arrive as a beginner, asking questions you are almost too embarrassed to ask, and people will show up to help you. There is no gatekeeping.&lt;/p&gt;

&lt;h3&gt;
  
  
  Features
&lt;/h3&gt;

&lt;p&gt;The Fedora community creates technical features that make Fedora powerful, flexible, and usable for a wide spectrum of users. This is really exciting as a developer. You are working at the edge, which means you're always learning something real and relevant.&lt;/p&gt;

&lt;h3&gt;
  
  
  First
&lt;/h3&gt;

&lt;p&gt;Fedora's rapid release cycle enables the community to focus on innovation and maintain the forward momentum of its technical progress. Fedora ships a new version every six months, and it's often the first Linux distribution to introduce technologies that eventually make their way into enterprise systems used by millions.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Find Interesting About Fedora
&lt;/h2&gt;

&lt;p&gt;The deeper I went into Fedora, the more little things surprised me.&lt;/p&gt;

&lt;p&gt;Red Hat employees make up only about 35% of contributors. The majority are independent volunteers from around the world who just care. That genuinely shifted how I saw the whole project.&lt;/p&gt;

&lt;p&gt;Another thing I found fascinating is how Fedora is uniquely positioned in the Linux world. It's not a toy distro for tinkering, and it's not a slow, conservative enterprise system either. It lives right at the edge, shipping new technologies first, often months or years before they become mainstream. The tools you learn on Fedora today are the tools that power real enterprise infrastructure tomorrow. For a beginner who wants to learn things that actually matter, that's a big deal.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Find Confusing About Fedora
&lt;/h2&gt;

&lt;p&gt;Honestly? The terminology.&lt;/p&gt;

&lt;p&gt;When I first arrived, I kept seeing words like SIG, COPR, RPM, Koji, Bodhi, Pagure. Everyone in the community uses these terms casually. As a newcomer, it can feel like walking into a conversation that started without you.&lt;/p&gt;

&lt;p&gt;The other thing that confused me early on was understanding where things happen. Fedora's work is spread across mailing lists, Matrix channels, GitHub, Pagure and more. Figuring out where to ask a question or find the right team took me a while.&lt;/p&gt;

&lt;p&gt;But here's the thing: once you ask, people help. The confusion is temporary and the community is very patient.&lt;/p&gt;




&lt;h2&gt;
  
  
  Advice for Outreachy 2027 Applicants
&lt;/h2&gt;

&lt;p&gt;If you're reading this a year from now, preparing your Outreachy application, here's what I wish someone had told me:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Start before you feel ready&lt;/strong&gt;: You don't need to know Linux deeply to begin. You just need curiosity and a willingness to figure things out.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fedora is bigger than it looks&lt;/strong&gt;: Don't just think of it as an OS. Explore the teams, the community calls, the different editions. The breadth of it is actually what makes it beginner-friendly: there's genuinely something for everyone, whatever your skill set.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ask questions out loud&lt;/strong&gt;: The Fedora community is very welcoming. Your "silly" question is probably someone else's too.&lt;/p&gt;

</description>
      <category>fedora</category>
      <category>opensource</category>
      <category>linux</category>
    </item>
  </channel>
</rss>
