<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: zhenghaoz</title>
    <description>The latest articles on DEV Community by zhenghaoz (@zhenghaoz).</description>
    <link>https://dev.to/zhenghaoz</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3755726%2F21ba6ac2-0f5f-4239-9ba4-c158e52a814c.jpeg</url>
      <title>DEV Community: zhenghaoz</title>
      <link>https://dev.to/zhenghaoz</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/zhenghaoz"/>
    <language>en</language>
    <item>
      <title>Building a Recommender System for GitHub Repositories</title>
      <dc:creator>zhenghaoz</dc:creator>
      <pubDate>Fri, 06 Feb 2026 13:04:22 +0000</pubDate>
      <link>https://dev.to/zhenghaoz/building-a-recommender-system-for-github-repositories-4f1p</link>
      <guid>https://dev.to/zhenghaoz/building-a-recommender-system-for-github-repositories-4f1p</guid>
      <description>&lt;p&gt;We built &lt;a href="https://gitrec.gorse.io" rel="noopener noreferrer"&gt;GitRec&lt;/a&gt;, a recommender system for GitHub repositories. This project not only demonstrates the basic capabilities of the &lt;a href="https://github.com/gorse-io/gorse" rel="noopener noreferrer"&gt;Gorse recommender system&lt;/a&gt; but also helps users discover interesting and useful repositories among the vast number of open-source projects.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;a href="https://gitrec.gorse.io/" rel="noopener noreferrer"&gt;GitRec&lt;/a&gt; has been running for three years, but it underwent a comprehensive upgrade and refactoring for Gorse v0.5 in 2025. The content in this article is based on the latest version of &lt;a href="https://gitrec.gorse.io/" rel="noopener noreferrer"&gt;GitRec&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Data Collection
&lt;/h2&gt;

&lt;p&gt;You can't make bricks without straw. The first step in building a recommender system is to construct a dataset consisting of items, users, and feedback.&lt;/p&gt;

&lt;h3&gt;
  
  
  Items: GitHub Repositories
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;ItemId&lt;/code&gt;: To make URL construction easy, the &lt;code&gt;/&lt;/code&gt; in the repository's full name is replaced with &lt;code&gt;:&lt;/code&gt; and the result is converted to lowercase.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Categories&lt;/code&gt;: The main programming language of the repository, which can be used to filter recommendation results.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Timestamp&lt;/code&gt;: The last update time of the repository.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Labels&lt;/code&gt;: A JSON object composed of two fields:

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Topics&lt;/strong&gt;: Topics added by the repository administrator.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Text Embedding of Description&lt;/strong&gt;: A highlight of GitRec is that it uses OpenAI's &lt;code&gt;text-embedding-v3&lt;/code&gt; model to generate a 512-dimensional embedding vector for the description of each repository. If a repository has no description, GitRec uses &lt;code&gt;gpt-5-nano&lt;/code&gt; to read its &lt;code&gt;README.md&lt;/code&gt; file and generate a single-sentence summary, which is then embedded.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;The following is the item record of the &lt;a href="https://github.com/gorse-io/gorse" rel="noopener noreferrer"&gt;Gorse&lt;/a&gt; repository in Gorse. The &lt;code&gt;Comment&lt;/code&gt; field uses the repository description as a remark for easy viewing in the dashboard.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"ItemId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"gorse-io:gorse"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"IsHidden"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Categories"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"go"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Timestamp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2025-05-24T19:41:09Z"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Labels"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"embedding"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;-0.0913363918662071&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;-0.0101912319660187&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;-0.0689065530896187&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.0137317562475801&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"topics"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"recommender-system"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"collaborative-filtering"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"go"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"knn"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"machine-learning"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Comment"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Gorse open source recommender system engine"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
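&lt;p&gt;The &lt;code&gt;ItemId&lt;/code&gt; convention shown above can be sketched in a few lines of Python. This is only an illustration: the helper names &lt;code&gt;to_item_id&lt;/code&gt; and &lt;code&gt;to_repo_url&lt;/code&gt; are hypothetical and are not taken from the GitRec code base.&lt;/p&gt;

```python
def to_item_id(full_name: str) -> str:
    """Convert a GitHub "owner/repo" name into a GitRec-style ItemId.

    Hypothetical helper: replaces "/" with ":" and lowercases the result,
    as described in the article.
    """
    owner, sep, repo = full_name.strip().partition("/")
    if not sep or not owner or not repo or "/" in repo:
        raise ValueError(f"expected 'owner/repo', got {full_name!r}")
    return f"{owner}:{repo}".lower()


def to_repo_url(item_id: str) -> str:
    """Rebuild a GitHub URL from an ItemId (the original casing is lost)."""
    owner, _, repo = item_id.partition(":")
    return f"https://github.com/{owner}/{repo}"


if __name__ == "__main__":
    item_id = to_item_id("gorse-io/gorse")
    print(item_id)               # ItemId used in the record above
    print(to_repo_url(item_id))  # URL rebuilt from the ItemId
```

&lt;p&gt;Lowercasing gives every repository one canonical ID regardless of how its name is capitalized in a link. This sketch assumes that a lowercase URL still resolves to the same repository.&lt;/p&gt;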



&lt;p&gt;Due to the huge number of repositories on GitHub, GitRec only collects repositories with more than 100 stars to balance cost and coverage. Repositories are mainly collected in two ways:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Crawl the repositories listed on GitHub Trending every day.&lt;/li&gt;
&lt;li&gt;When user feedback arrives, add the repositories it involves to the item database.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Users: Giving up Personal Information
&lt;/h3&gt;

&lt;p&gt;After weighing the options, GitRec chose not to collect any personal information about users beyond their user ID and feedback, for the following reasons:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The personal information of GitHub users is very diverse. Apart from a small amount of structured data such as company and location, users can also provide information as unstructured text, such as a personal biography or a README file. At the same time, many users fill in very little information or none at all, which makes recommendation based on personal information difficult.&lt;/li&gt;
&lt;li&gt;Given today's emphasis on privacy protection, not collecting personal information is a sound choice.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In Gorse, the &lt;code&gt;UserId&lt;/code&gt; of a user record corresponds to the GitHub User ID, while other fields are empty. Whenever a new user signs into GitRec, the system automatically creates a user record.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"UserId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"zhenghaoz"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Labels"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Comment"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;""&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Feedback: Like and Read
&lt;/h3&gt;

&lt;p&gt;The GitHub API only exposes a user's &lt;code&gt;star&lt;/code&gt; behavior on repositories; it does not expose "read" behavior. GitRec therefore provides a browser extension that records which repositories a user browses. The resulting feedback types in GitRec are defined as follows:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Positive Feedback&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;star&lt;/code&gt;: The user stars a repository on GitHub.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;like&lt;/code&gt;: The user clicks the ❤️ button on a repository on the GitRec website.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;read&amp;gt;=3&lt;/code&gt;: The user views a repository at least 3 times.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Read Feedback&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;read&lt;/code&gt;: The user views a repository.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;code&gt;read&lt;/code&gt; feedback is collected by the browser extension, the &lt;code&gt;like&lt;/code&gt; feedback is collected by the GitRec website, and the &lt;code&gt;star&lt;/code&gt; feedback is synchronized via the GitHub API once a day. The &lt;code&gt;read&lt;/code&gt; count accumulates each time a user visits the repository. When the count reaches 3, the system converts it into positive feedback.&lt;/p&gt;
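&lt;p&gt;The counting rule can be pictured with a small sketch. The &lt;code&gt;ReadTracker&lt;/code&gt; class and its in-memory counter are hypothetical; the article does not describe how GitRec or Gorse actually stores read counts, so treat this as an illustration of the threshold logic only.&lt;/p&gt;

```python
from collections import Counter

# From the article: a repository viewed at least 3 times counts as positive.
READ_THRESHOLD = 3


class ReadTracker:
    """Hypothetical in-memory read counter keyed by (user, item)."""

    def __init__(self) -> None:
        self._reads: Counter = Counter()

    def record_read(self, user_id: str, item_id: str) -> bool:
        """Count one view; return True once the pair qualifies as positive."""
        key = (user_id, item_id)
        self._reads[key] += 1
        return self._reads[key] >= READ_THRESHOLD


if __name__ == "__main__":
    tracker = ReadTracker()
    for visit in range(1, 5):
        positive = tracker.record_read("zhenghaoz", "gorse-io:gorse")
        print(visit, positive)
```

&lt;p&gt;Counts are kept per user and per repository, so views of one repository never push another one over the threshold.&lt;/p&gt;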

&lt;h2&gt;
  
  
  Recommendation Pipeline Configuration
&lt;/h2&gt;

&lt;p&gt;The detailed configuration of the GitRec recommendation pipeline can be found in the &lt;a href="https://github.com/gorse-io/gitrec/blob/master/config.toml" rel="noopener noreferrer"&gt;config.toml&lt;/a&gt; file of the repository. The main content is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Non-personalized recommender &lt;code&gt;most_starred_weekly&lt;/code&gt;: The score is calculated by a custom formula, recommending the repositories with the most stars this week.&lt;/li&gt;
&lt;li&gt;Item-to-item recommender &lt;code&gt;neighbors&lt;/code&gt;: Recommend similar items based on the embedding of the repository description.&lt;/li&gt;
&lt;li&gt;User-to-item recommender &lt;code&gt;neighbors&lt;/code&gt;: Recommend items liked by users with high overlap in starred repositories.&lt;/li&gt;
&lt;li&gt;Collaborative filtering recommender keeps the default configuration.&lt;/li&gt;
&lt;/ul&gt;
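&lt;p&gt;To give a feel for the embedding-based item-to-item recommender, here is a brute-force sketch. It assumes cosine similarity as the distance measure; the article does not state which metric or index Gorse uses, and the function names are hypothetical.&lt;/p&gt;

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(target_id, embeddings, k):
    """Return the k item IDs whose embeddings are closest to the target's.

    `embeddings` maps ItemId to the description embedding; this linear scan
    is an assumption for illustration, not how Gorse searches neighbors.
    """
    target = embeddings[target_id]
    scored = [
        (cosine_similarity(target, vector), item_id)
        for item_id, vector in embeddings.items()
        if item_id != target_id
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [item_id for _, item_id in scored[:k]]


if __name__ == "__main__":
    toy = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
    print(most_similar("a", toy, 2))
```

&lt;p&gt;A linear scan like this is only practical for small catalogs. With hundreds of thousands of repositories, an approximate nearest-neighbor index would normally replace it.&lt;/p&gt;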

&lt;p&gt;The ranker merges the above recommendation results with the latest items, then uses a trained factorization machine for the final ranking. If the ranked results are too few, item-to-item recommendations and then the latest items are used, in that order, to fill the remaining slots. The recommendation pipeline is also displayed in the RecFlow editor:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fplbtghtyxcwg5nygt13s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fplbtghtyxcwg5nygt13s.png" alt=" " width="800" height="385"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  User Interface and Interaction
&lt;/h2&gt;

&lt;p&gt;GitRec provides services to users in two ways.&lt;/p&gt;

&lt;h3&gt;
  
  
  Website
&lt;/h3&gt;

&lt;p&gt;The website provides an immersive repository discovery experience modeled after TikTok:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3wxvskijiwqkls3mbjlt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3wxvskijiwqkls3mbjlt.png" alt=" " width="800" height="465"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Explore&lt;/strong&gt;: This is the core function of the website. It displays the &lt;code&gt;README&lt;/code&gt; content of a repository in full screen. From there you can:

&lt;ul&gt;
&lt;li&gt;Click the ❤️ button to record positive feedback for the repository.&lt;/li&gt;
&lt;li&gt;Click the ▶️ button to mark the current repository as "read" and move on to the next recommendation.&lt;/li&gt;
&lt;li&gt;Select a programming language at the top of the page to filter the recommendation results.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Favorites&lt;/strong&gt;: This contains all the repositories you have starred (⭐) on GitHub or liked (❤️) on the GitRec website.&lt;/li&gt;

&lt;/ul&gt;

&lt;h3&gt;
  
  
  Browser Extension
&lt;/h3&gt;

&lt;p&gt;The browser extension seamlessly integrates recommendations into GitHub pages:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwjlpv4qf1r3ysj3v4cct.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwjlpv4qf1r3ysj3v4cct.png" alt=" " width="800" height="545"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;"Explore repositories" on GitHub homepage&lt;/strong&gt;: On the GitHub homepage, the extension injects an "Explore repositories" module to provide personalized repository recommendations. Even without logging into GitRec, the extension can provide preliminary personalized recommendations based on recently starred repository.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;"Related repositories" on repository page&lt;/strong&gt;: When browsing any GitHub repository with more than 100 stars, the extension adds a "Related repositories" column on the right side of the page. This will display several repositories most similar to the current repository.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Read feedback collection&lt;/strong&gt;: The extension records the repositories browsed and sends this read feedback to the GitRec server, collecting no other personal information.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Scale and Cost
&lt;/h2&gt;

&lt;p&gt;As of the end of 2025, GitRec's data scale was as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;More than 240,000 GitHub repositories&lt;/li&gt;
&lt;li&gt;More than 3,000 registered users&lt;/li&gt;
&lt;li&gt;More than 300,000 user feedback records&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The Gorse recommender system and its database are deployed on a single cloud server with 2 CPU cores and 8 GB of RAM, which costs 50 USD per month.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;GitRec is not only a GitHub repository recommender system, but also a testing ground for the Gorse recommender system. You are welcome to try the GitRec &lt;a href="https://gitrec.gorse.io" rel="noopener noreferrer"&gt;website&lt;/a&gt; or install the browser extension (available for &lt;a href="https://chromewebstore.google.com/detail/gitrec/eihokbaeiebdenibjophfipedicippfl" rel="noopener noreferrer"&gt;Chrome&lt;/a&gt;, &lt;a href="https://addons.mozilla.org/zh-CN/firefox/addon/gitrec/" rel="noopener noreferrer"&gt;Firefox&lt;/a&gt;, and &lt;a href="https://microsoftedge.microsoft.com/addons/detail/gitrec/cpcfbfpnagiffgpmfljmcdokmfjffdpa" rel="noopener noreferrer"&gt;Edge&lt;/a&gt;) to support the continued development of the Gorse recommender system.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>github</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
