<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Merrill Cook</title>
    <description>The latest articles on DEV Community by Merrill Cook (@merrillcook2).</description>
    <link>https://dev.to/merrillcook2</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F183089%2F45cff811-e022-4620-9c8b-0cef4a5eee40.png</url>
      <title>DEV Community: Merrill Cook</title>
      <link>https://dev.to/merrillcook2</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/merrillcook2"/>
    <language>en</language>
    <item>
      <title>Meet the 10k Most Important People In The World (According to an AI)</title>
      <dc:creator>Merrill Cook</dc:creator>
      <pubDate>Mon, 11 Jan 2021 22:50:49 +0000</pubDate>
      <link>https://dev.to/merrillcook2/meet-the-10k-most-important-people-in-the-world-according-to-an-ai-16b6</link>
      <guid>https://dev.to/merrillcook2/meet-the-10k-most-important-people-in-the-world-according-to-an-ai-16b6</guid>
      <description>&lt;p&gt;What makes a human important? Their humanity, sure. But what makes you REALLY important? What would a balanced jury of your peers pull out about your life? &lt;/p&gt;

&lt;p&gt;Maybe you try really hard as a parent. Or were part of the making of a product. Maybe you shook up the world in public or maybe just had some particularly happy moments with a few. &lt;/p&gt;

&lt;p&gt;Emily Dickinson hardly left her house. And spent the last two decades of her life refusing visitors. But in the end left an indelible mark on literary history. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Inherent importance of life aside, there’s a potential underlying structure to what we value in all these scenarios. And it’s readable by an AI.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Semantic triples follow the structure of &lt;em&gt;subject — predicate — object&lt;/em&gt;. &lt;/p&gt;

&lt;p&gt;“Steve graduated from Harvard”&lt;/p&gt;

&lt;p&gt;“Sam is 37” &lt;/p&gt;

&lt;p&gt;“Marissa Mayer was the CEO of Yahoo!”&lt;/p&gt;

&lt;p&gt;“My mother is a skilled flautist” &lt;/p&gt;

&lt;p&gt;These are all semantic triples. And drawing inferences from them at scale yields a veritable gold mine of linked data. &lt;/p&gt;

&lt;p&gt;This structure is what provides the underlying organization of a &lt;a href="https://en.wikipedia.org/wiki/Knowledge_graph"&gt;knowledge graph&lt;/a&gt;. For simplicity’s sake, you can think of a knowledge graph as something like a relational database. But at its core it’s composed of nodes (entities) and edges (relationships between entities). &lt;/p&gt;
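
&lt;p&gt;To make the node-and-edge idea concrete, here is a minimal Python sketch that stores the example triples above as plain tuples and groups them by subject. The &lt;code&gt;Triple&lt;/code&gt; type and &lt;code&gt;build_graph&lt;/code&gt; helper are illustrative names chosen for this sketch, not part of any knowledge graph product.&lt;/p&gt;

```python
from collections import defaultdict
from typing import NamedTuple


class Triple(NamedTuple):
    """One subject-predicate-object fact."""
    subject: str
    predicate: str
    obj: str


# The example sentences from above, written as triples.
TRIPLES = [
    Triple("Steve", "graduated from", "Harvard"),
    Triple("Sam", "is", "37"),
    Triple("Marissa Mayer", "was the CEO of", "Yahoo!"),
    Triple("My mother", "is", "a skilled flautist"),
]


def build_graph(triples):
    """Group edges by subject: node -> list of (predicate, object) pairs."""
    graph = defaultdict(list)
    for t in triples:
        graph[t.subject].append((t.predicate, t.obj))
    return dict(graph)


graph = build_graph(TRIPLES)
print(graph["Steve"])
```

&lt;p&gt;Each subject becomes a node, and each (predicate, object) pair is an outgoing edge. A real knowledge graph would also resolve objects like Harvard into nodes of their own.&lt;/p&gt;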

&lt;p&gt;Where most databases historically have been structured to retain the structure of each individual entry (think of a row in a spreadsheet), knowledge graphs are structured around the relationships between entities. This relationship-first structure has long been coveted as a cornerstone of the semantic web. And today we’re seeing it bear fruit at large through tools like Siri, richer search results, data enrichment tools, and more. &lt;/p&gt;

&lt;p&gt;There are probably two public knowledge graphs of particular note. Google’s &lt;a href="https://en.wikipedia.org/wiki/Google_Knowledge_Graph"&gt;Knowledge Graph&lt;/a&gt; is perhaps the most well known and commonly used. Diffbot’s &lt;a href="https://www.diffbot.com/products/knowledge-graph/"&gt;Knowledge Graph&lt;/a&gt; is the largest and most accurate knowledge graph sourced from the public web. &lt;/p&gt;

&lt;p&gt;There’s no public endpoint for consuming all of the relationships in Google’s KG data. So for the purposes of this exploration we used the data from Diffbot’s KG. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;So what does this have to do with importance?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;As previously mentioned, we tend to think individuals are more or less important based on how many lives or entities they’ve touched. And in turn how important those lives and entities are. Whether by proxy (making a product, or a poem), or in person (being a boss, or a friend, or attending something). &lt;/p&gt;

&lt;p&gt;The relationship-first nature of knowledge graphs does a good job of representing the way we actually view the world. And one factor present in Diffbot’s Knowledge Graph is an “importance” score for each entity. This is basically used to determine which entity you’re likelier to mean if you inquire about “apple.” Do you mean Apple Inc. or the fruit? &lt;/p&gt;

&lt;p&gt;Apple Inc. has millions of connections (“edges” in knowledge graph speak). News mentions, many employees, investors, products, reviews. Sure, apples are popular. But in the context of a Knowledge Graph centered around organizations and people, you’re probably after Apple Inc. &lt;/p&gt;

&lt;p&gt;And keep in mind that the Knowledge Graph is sourced from the public web. In essence an AI built to read web pages and infer facts. Surely there are many books out there about apple farming. But that’s not a huge portion of the web. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;So what can we learn from the 10k most important people (“MIPs”)?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No names are named here. But what does it take to have more connections than nearly anyone in the world?&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Education
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---Q7aII0A--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2560/1%2ALkqZgAv72ewqTSx2fUHBmg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---Q7aII0A--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2560/1%2ALkqZgAv72ewqTSx2fUHBmg.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As one would likely expect, certain schools are outsized pipelines to influence.&lt;/p&gt;

&lt;p&gt;Looking at the most commonly attended schools in this cohort, the following are likely to be present more than once in every 200 MIPs. &lt;/p&gt;

&lt;p&gt;In particular: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Harvard University — 1 in 14 MIPs&lt;/li&gt;
&lt;li&gt;Stanford University — 1 in 28 MIPs&lt;/li&gt;
&lt;li&gt;University of California Berkeley — 1 in 32 MIPs&lt;/li&gt;
&lt;li&gt;Massachusetts Institute of Technology — 1 in 52 MIPs&lt;/li&gt;
&lt;li&gt;University of Pennsylvania — 1 in 64 MIPs&lt;/li&gt;
&lt;li&gt;Columbia University — 1 in 85 MIPs&lt;/li&gt;
&lt;li&gt;Yale University — 1 in 88 MIPs&lt;/li&gt;
&lt;li&gt;University of Chicago — 1 in 110 MIPs&lt;/li&gt;
&lt;li&gt;University of Cambridge — 1 in 124 MIPs&lt;/li&gt;
&lt;li&gt;Northwestern University — 1 in 124 MIPs&lt;/li&gt;
&lt;li&gt;University of Oxford — 1 in 127 MIPs&lt;/li&gt;
&lt;li&gt;Cornell University — 1 in 162 MIPs&lt;/li&gt;
&lt;li&gt;University of Illinois — 1 in 165 MIPs&lt;/li&gt;
&lt;li&gt;UCLA — 1 in 191 MIPs&lt;/li&gt;
&lt;li&gt;Brown University — 1 in 196 MIPs&lt;/li&gt;
&lt;/ul&gt;
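
&lt;p&gt;For a sense of scale, the “1 in N” rates above can be turned into rough headcounts. The short Python sketch below assumes a cohort of exactly 10,000 people and simply divides, so the results are approximations rather than figures from the underlying data.&lt;/p&gt;

```python
COHORT_SIZE = 10_000  # assumption: the "10k" cohort is exactly 10,000 people

# A few "1 in N" rates copied from the list above.
ONE_IN = {
    "Harvard University": 14,
    "Stanford University": 28,
    "Massachusetts Institute of Technology": 52,
    "Brown University": 196,
}


def approx_headcount(one_in, cohort=COHORT_SIZE):
    """Convert a '1 in N' rate into an approximate number of people."""
    return round(cohort / one_in)


for school, n in ONE_IN.items():
    print(f"{school}: roughly {approx_headcount(n)} MIPs")
```

&lt;p&gt;By this estimate, Harvard alone accounts for roughly 714 of the 10,000 MIPs, while Brown accounts for about 51.&lt;/p&gt;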

&lt;p&gt;Our 10,000 MIPs attended a total of slightly over 3,000 schools. 65 of these schools were attended by over 30 MIPs each. And the top handful were attended by hundreds of MIPs. &lt;/p&gt;

&lt;p&gt;65% of total MIPs did not attend these 65 premier schools, however. And a small handful did not attend higher education. &lt;/p&gt;

&lt;p&gt;A cluster of pre-collegiate schools also surfaced among individuals whose pre-collegiate training is listed online. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Roughly 1 in 200&lt;/strong&gt; of our MIPs attended Eton College (British prep school). &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Roughly 1 in 375&lt;/strong&gt; of our MIPs attended the Bronx High School of Science. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Roughly 1 in 1000&lt;/strong&gt; of our MIPs attended the following high schools: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Phillips Academy&lt;/li&gt;
&lt;li&gt;Horace Mann School&lt;/li&gt;
&lt;li&gt;Berkeley High School&lt;/li&gt;
&lt;li&gt;Phillips Exeter Academy&lt;/li&gt;
&lt;li&gt;Gaithersburg High School&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;And roughly 1 in 3000&lt;/strong&gt; of our MIPs attended the following: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Stuyvesant High School&lt;/li&gt;
&lt;li&gt;Greeley Central High School&lt;/li&gt;
&lt;li&gt;Towson High School&lt;/li&gt;
&lt;li&gt;Horace Greeley High School&lt;/li&gt;
&lt;li&gt;Saint Ignatius High School&lt;/li&gt;
&lt;li&gt;Beverly Hills High School&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Internationally, clusters were less extreme. But the most common non-American universities attended by our MIPs included: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cambridge University&lt;/li&gt;
&lt;li&gt;Oxford University&lt;/li&gt;
&lt;li&gt;INSEAD&lt;/li&gt;
&lt;li&gt;London School of Economics&lt;/li&gt;
&lt;li&gt;Imperial College London&lt;/li&gt;
&lt;li&gt;Hebrew University of Jerusalem&lt;/li&gt;
&lt;li&gt;University of the Witwatersrand&lt;/li&gt;
&lt;li&gt;Tel Aviv University&lt;/li&gt;
&lt;li&gt;University of Western Ontario&lt;/li&gt;
&lt;li&gt;University of British Columbia&lt;/li&gt;
&lt;li&gt;Indian Institute of Technology&lt;/li&gt;
&lt;li&gt;London Business School&lt;/li&gt;
&lt;li&gt;University of Waterloo&lt;/li&gt;
&lt;li&gt;National University of Singapore&lt;/li&gt;
&lt;li&gt;HEC Paris&lt;/li&gt;
&lt;li&gt;University of London&lt;/li&gt;
&lt;li&gt;McGill University&lt;/li&gt;
&lt;li&gt;University of Manchester&lt;/li&gt;
&lt;li&gt;University of Cape Town&lt;/li&gt;
&lt;li&gt;University of Taiwan&lt;/li&gt;
&lt;li&gt;University College London&lt;/li&gt;
&lt;li&gt;King’s College London&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Skills
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--v4TIwuAa--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2560/1%2A4nb8WsjWWTNgfJgszOk5Yg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--v4TIwuAa--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2560/1%2A4nb8WsjWWTNgfJgszOk5Yg.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At the end of the day, education will only get you so far. In our hyper-specialized economies there are many ways to get ahead. And many problems worth solving. Let’s take a look at the most common skills our MIPs possess. &lt;/p&gt;

&lt;p&gt;In total, our 10k MIPs have listed or attested to roughly 6,000 unique skillsets, suggesting a sizable amount of overlap. &lt;/p&gt;

&lt;p&gt;If you had to guess one single skill that is most prevalent among these individuals, you probably wouldn’t get it. Not even on a multiple choice test. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The single most common skill attributed to our 10,000 MIPs is teaching.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Of every skill attributed to the MIPs, one out of 55 is teaching. That might not be quite what you expect from our empire-creating cadre. But in a larger cluster of human-related skills it starts to make more sense: teaching, management, leadership, human resources management. &lt;/p&gt;

&lt;p&gt;Add to that the fact that a large portion of the individuals in question hold advanced degrees and at one point were university TAs, and perhaps the number isn’t that surprising. &lt;/p&gt;

&lt;p&gt;In descending order, the 50 most common skills attributed to our MIPs include: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Teaching&lt;/li&gt;
&lt;li&gt;Economics&lt;/li&gt;
&lt;li&gt;Management&lt;/li&gt;
&lt;li&gt;Marketing&lt;/li&gt;
&lt;li&gt;Supply Chain Management&lt;/li&gt;
&lt;li&gt;Start-ups&lt;/li&gt;
&lt;li&gt;Strategy&lt;/li&gt;
&lt;li&gt;Sales&lt;/li&gt;
&lt;li&gt;Entrepreneurship&lt;/li&gt;
&lt;li&gt;Leadership&lt;/li&gt;
&lt;li&gt;Law&lt;/li&gt;
&lt;li&gt;Mass Media&lt;/li&gt;
&lt;li&gt;Human Resources Management&lt;/li&gt;
&lt;li&gt;Software Development&lt;/li&gt;
&lt;li&gt;Business Development&lt;/li&gt;
&lt;li&gt;Cloud Technologies&lt;/li&gt;
&lt;li&gt;Strategic Partnerships&lt;/li&gt;
&lt;li&gt;Product Management&lt;/li&gt;
&lt;li&gt;Content Management Systems&lt;/li&gt;
&lt;li&gt;Writing&lt;/li&gt;
&lt;li&gt;Public Speaking&lt;/li&gt;
&lt;li&gt;Advertising&lt;/li&gt;
&lt;li&gt;Mathematics&lt;/li&gt;
&lt;li&gt;Social Media&lt;/li&gt;
&lt;li&gt;Venture Capital&lt;/li&gt;
&lt;li&gt;Mergers and Acquisitions&lt;/li&gt;
&lt;li&gt;Research&lt;/li&gt;
&lt;li&gt;Mobile Technologies&lt;/li&gt;
&lt;li&gt;User Interface&lt;/li&gt;
&lt;li&gt;Ecommerce&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Working through the entire list of skills, three clusters appear: finance-related skills, engineering-related skills, and marketing or public-facing skills. &lt;/p&gt;

&lt;p&gt;The top finance-related skills include: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Economics&lt;/li&gt;
&lt;li&gt;Venture Capital &lt;/li&gt;
&lt;li&gt;Mergers and Acquisitions&lt;/li&gt;
&lt;li&gt;Investing&lt;/li&gt;
&lt;li&gt;And Fundraising&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The top engineering-related skills include: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cloud Technologies&lt;/li&gt;
&lt;li&gt;Mobile Technologies&lt;/li&gt;
&lt;li&gt;Enterprise Software&lt;/li&gt;
&lt;li&gt;Networking Technologies&lt;/li&gt;
&lt;li&gt;And Robotics&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The top public-facing skills: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Marketing&lt;/li&gt;
&lt;li&gt;Sales&lt;/li&gt;
&lt;li&gt;Mass Media&lt;/li&gt;
&lt;li&gt;Public Speaking&lt;/li&gt;
&lt;li&gt;And Online Advertising&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--pQiFM3mZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2560/1%2AbD4zyoaC3Ax26VVmhk9CpQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--pQiFM3mZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2560/1%2AbD4zyoaC3Ax26VVmhk9CpQ.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A large majority of MIPs also specialize.&lt;/strong&gt; While a cluster of skills is shared by many MIPs (as in the illustration above), a majority of skills are one-offs, shared by very few other MIPs or none at all. &lt;/p&gt;

&lt;p&gt;While there are too many specializations to list, to exemplify the range of industries and competency areas represented, a random sample is presented below.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Union negotiations&lt;/li&gt;
&lt;li&gt;eSports&lt;/li&gt;
&lt;li&gt;Phytochemicals&lt;/li&gt;
&lt;li&gt;Quorum Sensing&lt;/li&gt;
&lt;li&gt;Essential Oils&lt;/li&gt;
&lt;li&gt;Federal Budget Management&lt;/li&gt;
&lt;li&gt;Printing Solutions&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Location
&lt;/h3&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--wNDIp7v_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2560/1%2AUxtzcCdTacjdDqTST_CaMQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--wNDIp7v_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2560/1%2AUxtzcCdTacjdDqTST_CaMQ.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;While we’ve just witnessed the year of remote work, location still matters. Particularly in networking-heavy, governmental, research, and capital-intensive industries like manufacturing, MIPs tend to cluster. &lt;/p&gt;

&lt;p&gt;In fact, while many of these individuals have undoubtedly worked remotely for at least part of 2020, &lt;strong&gt;only 1 in 100 have listed remote working&lt;/strong&gt; as a current or past job location. &lt;/p&gt;

&lt;p&gt;Our 10k MIPs are listed as working in a total of 1,800 locations throughout their lives. Considering there are over 4,000 mid-sized cities in the world, this suggests a definite clustering. The most recent location listed for each of our 10k MIPs lowers this number to around 600 cities, with only 36 cities hosting more than 1 in 250 of our MIPs. &lt;/p&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--BsUBLTnJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2560/1%2AaIEpFU88rztPkc7JSYduAQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--BsUBLTnJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2560/1%2AaIEpFU88rztPkc7JSYduAQ.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Of MIPs located in the top 100 MIP-hosting locations in the US, 1 in 3 are in cities in California, 1 in 6 are in New York, and 1 in 15 are in D.C. No other locations come close.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Beyond large financial, research, governmental, and technical hubs, noteworthy small clusters include well-known university towns throughout the United States and Europe. &lt;/p&gt;

&lt;p&gt;Additionally, there are definite “stepping stone” locations among MIPs. These are past locations associated with MIPs. And this range of locations pulls in a range of university towns with the leading few including: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cambridge, MA&lt;/li&gt;
&lt;li&gt;Stanford, CA&lt;/li&gt;
&lt;li&gt;Berkeley, CA&lt;/li&gt;
&lt;li&gt;Princeton, NJ&lt;/li&gt;
&lt;li&gt;Oxford, UK&lt;/li&gt;
&lt;li&gt;New Haven, CT&lt;/li&gt;
&lt;li&gt;Boulder, CO&lt;/li&gt;
&lt;li&gt;Ann Arbor, MI&lt;/li&gt;
&lt;li&gt;Evanston, IL&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Job Titles
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--y4IdAAqR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2560/1%2AtdwVBjOpd4HPe9faB7djAA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--y4IdAAqR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2560/1%2AtdwVBjOpd4HPe9faB7djAA.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Most large-scale impact by MIPs is derived from their work. And while MIP work is, at the end of the day, very wide-ranging, definite clusters appear. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;More than 1 in 8 MIPs work in computing or information science roles&lt;/li&gt;
&lt;li&gt;More than 1 in 8 MIPs work in finance-related industries&lt;/li&gt;
&lt;li&gt;More than 1 in 10 MIPs work in software-related industries&lt;/li&gt;
&lt;li&gt;More than 1 in 20 MIPs work in health care-related industries&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For job titles, many MIPs have accumulated quite a number through the years, and hold several simultaneously. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The single most common job title of our MIPs was board member.&lt;/strong&gt; Though many of these individuals also lead or help lead their own enterprise. &lt;/p&gt;

&lt;p&gt;As one might expect, the top handful of job titles for MIPs include: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Board member&lt;/li&gt;
&lt;li&gt;Chairman of the board&lt;/li&gt;
&lt;li&gt;CEO&lt;/li&gt;
&lt;li&gt;Founder / Co-Founder&lt;/li&gt;
&lt;li&gt;Owner&lt;/li&gt;
&lt;li&gt;Executive Director&lt;/li&gt;
&lt;li&gt;Chancellor&lt;/li&gt;
&lt;li&gt;And Partner&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Roughly half of all current jobs held by MIPs were some derivation of the above titles. For the other half, an exceedingly diverse range of titles emerges. A sampling includes: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Angel Investor&lt;/li&gt;
&lt;li&gt;Lobbyist&lt;/li&gt;
&lt;li&gt;Chief negotiator&lt;/li&gt;
&lt;li&gt;Journalist&lt;/li&gt;
&lt;li&gt;Philosopher&lt;/li&gt;
&lt;li&gt;Governor&lt;/li&gt;
&lt;li&gt;Attorney General&lt;/li&gt;
&lt;li&gt;General&lt;/li&gt;
&lt;li&gt;Bass Player&lt;/li&gt;
&lt;li&gt;Chief Scientist&lt;/li&gt;
&lt;li&gt;Author&lt;/li&gt;
&lt;li&gt;Producer&lt;/li&gt;
&lt;li&gt;Senator&lt;/li&gt;
&lt;li&gt;Rector&lt;/li&gt;
&lt;li&gt;Evangelist&lt;/li&gt;
&lt;li&gt;Bishop&lt;/li&gt;
&lt;li&gt;Head Coach&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  So what have we learned?
&lt;/h3&gt;

&lt;p&gt;On one level, the public visibility of individuals (in this case, facts from the public web) will never capture a truly holistic vision of “important” people. Importance is subjective in and of itself.&lt;/p&gt;

&lt;p&gt;But the ability to structure and quantify relationships at scale is new. Particularly from otherwise unstructured natural language and visuals from around the web. &lt;/p&gt;

&lt;p&gt;This quick illustration validates many things one may have already known. Power and influence cluster. Education matters. There are a few ways to gain large levels of influence, and they tend to revolve around public service, being the best in a particular niche, building a company, or owning things. And this seems to align with a common sense view of who would realistically be able to change a large number of lives. Or have more “touch points” with the world. &lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
    </item>
    <item>
      <title>Two Simple Techniques For Web Scraping Pages With Dynamically-Created CSS Class Names</title>
      <dc:creator>Merrill Cook</dc:creator>
      <pubDate>Mon, 14 Dec 2020 16:19:53 +0000</pubDate>
      <link>https://dev.to/merrillcook2/two-simple-techniques-for-web-scraping-pages-with-dynamically-created-css-class-names-3f4n</link>
      <guid>https://dev.to/merrillcook2/two-simple-techniques-for-web-scraping-pages-with-dynamically-created-css-class-names-3f4n</guid>
      <description>&lt;p&gt;I get to work with a variety of web scraping products and techniques at my job at &lt;a href="https://www.diffbot.com?utm_source=medium&amp;amp;utm_medium=org&amp;amp;utm_campaign=gp"&gt;Diffbot&lt;/a&gt;. Aligned with Diffbot's mission to "structure the world's knowledge" is an initial step of first gathering the underlying data to be structured. Diffbot is one of three western entities that truly crawl the whole public web. So this involves a pretty stellar stack of web crawling, extraction, and parsing tools. &lt;/p&gt;

&lt;p&gt;Even with great tools, one of the challenges with crawling and extracting data from pages at a large scale is you don't really know what structure a page is going to have before you get to it. To this end, Diffbot employs a series of Automatic APIs. These are AI-enabled web extraction APIs that employ a range of techniques from computer vision through NLP to discern what data may be valuable on a page, and then to grab and structure that data.&lt;/p&gt;

&lt;p&gt;Based on our &lt;a href="https://www.youtube.com/watch?v=d58BwcyTwEo&amp;amp;t=465s"&gt;research&lt;/a&gt;, around 90% of the surface of the web can be classified into 20 distinct page types. These can be discussion pages, product pages, article pages, nav pages, organizational "about" pages, and so forth. And typically each "type" of page will share a cluster of characteristics.&lt;/p&gt;

&lt;p&gt;An event page is likely to have a time and date for the event. An article is likely to have an author. A product is likely to have an SKU. By training AI to look for the visual and non-visual fields that a page is likely to have (given its type), you bypass the need to dive into site-specific structural details. &lt;br&gt;
This leads me to my first tip…&lt;/p&gt;
&lt;h2&gt;
  
  
  Tip #1: Don't Use Rule-Based Extraction
&lt;/h2&gt;

&lt;p&gt;Rule-based extraction is fine for small scale scraping, one-off scripts to grab some data, and sites that don't routinely change. But these days a site with data of any value that isn't dynamic to some degree is relatively rare. &lt;br&gt;
Additionally, classifying extraction rules for a given domain doesn't scale to multiple domains. Simply ensuring regularly updated web data from a small group of domains routinely requires a whole team to manage the process. And the process still breaks down. Trust me, we hear this a ton in conversations with current or potential clients. &lt;br&gt;
So you have a few choices for following this tip. Or at least for avoiding what this tip is meant to avoid: &lt;strong&gt;unscalable or regularly broken scrapers&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The first is that you can build a non-rule-centered form of extraction custom to you. There are more free training data sets out there than ever before. Out-of-the-box NLP is improving from a handful of providers. And particularly if you want to focus on a small set of domains, you may be able to pull this off. &lt;br&gt;
Secondly, you can reach out to the small handful of providers who truly offer rule-less web extraction. If you want to extract from a wide range of sites, your sites are regularly changing, or you're seeking a variety of document types, this is likely the way to go. &lt;/p&gt;

&lt;p&gt;Third, you can stick to gathering public web data about particularly well-known sites. At the end of the day this may simply be paying someone else to maintain rule-based extractors for you. But, for example, there's a veritable cottage industry around scraping very specific sites like social media. Their whole business is to provide up-to-date extractors for things like lists of members of a given Facebook group. But these scrape providers won't help if you want to monitor custom domains or the vast majority of the web. &lt;/p&gt;
&lt;h2&gt;
  
  
  Tip #2: If You Have To Use Rule-Based Extraction, Try Out These Advanced Selectors
&lt;/h2&gt;

&lt;p&gt;If you truly can't find a way to extract what you need with one of the options above, there are a few ways you can at least harden your scraper against dynamic content. &lt;/p&gt;

&lt;p&gt;Among Diffbot products, this is what the Custom API is for. It's our only rule-based extractor, and it's essentially for page types unique enough that they don't fit into a major page category, or for cases where you just want to grab specific pieces of information from the page. You can pair it with Crawlbot to apply this API to large numbers of pages at once. &lt;/p&gt;

&lt;p&gt;Alternatively, this type of rule-based selector extraction is how most major extraction services work (like Import.io, plugin web extractors, Octoparse, or if you're rolling your own extractor with something like Selenium or BeautifulSoup). &lt;br&gt;
Now there are a few scenarios where these selectors become useful. Typically if a site is well structured, class and ID names make sense, and you have classed elements inside of classed elements, you're good without these techniques. &lt;/p&gt;

&lt;p&gt;But if you've spent any time with web scraping, don't tell me you haven't occasionally gotten a few of these:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;div&amp;gt;
  &amp;lt;a href="/some/stuff" data-event="ev=filedownload" data-link-event=" Our_Book "&amp;gt;
    &amp;lt;span class=""&amp;gt;Download Our Book&amp;lt;/span&amp;gt;
  &amp;lt;/a&amp;gt;
&amp;lt;/div&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or...&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;div class="Cell-sc-1abjmm4-0 Layout__RailCell-sc-1goy157-1 hcxgdw"&amp;gt;
  &amp;lt;div class="RailGeneric__RailBox-sc-1565s4y-0 iZilXF mt5"&amp;gt;
    ...
  &amp;lt;/div&amp;gt;
  &amp;lt;div class="RailGeneric__AdviceBox-sc-1565s4y-3 kObkOT"&amp;gt;
    ...
  &amp;lt;/div&amp;gt;
&amp;lt;/div&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both of the above stray from regular class declarations and frustrate attempts to extract data using typical selectors. They're both examples of irregular markup, but potentially in inverse ways. &lt;br&gt;
The first example provides very little traditional markup that could be used for typical CSS selectors. &lt;/p&gt;

&lt;p&gt;The second contains very specific class names that are dynamically created in something like React. &lt;/p&gt;

&lt;p&gt;For both, we can use the same handful of advanced CSS selectors to grab the values we want.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CSS Begins With, Ends With, and Contains &lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You won't encounter these CSS selectors very often when building your own site. And maybe that's why they're often overlooked in explanations. But many people don't know that a subset of CSS selector types supports simple, regex-like matching. &lt;br&gt;
Fortunately, these regex-like operators can be applied to HTML attribute/value selectors. &lt;/p&gt;

&lt;p&gt;So in the first example above, something like the following works great:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;a[data-link-event*='Our_Book']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Within CSS, square brackets filter elements by attribute, and follow the general format of:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;element[attribute=value]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This in and of itself doesn't solve either of our issues above. What does is swapping the plain equals sign for one of three regex-like operators: begins with (^=), ends with ($=), or contains (*=). &lt;/p&gt;

&lt;p&gt;In the example above that grabs Our_Book (note that these selectors are case sensitive), the original markup has extra whitespace on either side of the characters. That's where our friend "contains" comes into play. In short, these selectors work like so:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;div[class^="beginsWith"]
div[class$="endsWith"]
div[class*="containsThis"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



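&lt;p&gt;Here, class can be any attribute, and the quoted string is matched against the beginning, the end, or any substring of the attribute's full value. Note that for class, the match runs against the whole attribute string rather than each individual class name, so "contains" is usually the safest choice for dynamically generated names.&lt;/p&gt;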
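
&lt;p&gt;To make the matching rules concrete, here is a minimal Python sketch of how the three operators compare a search string against an attribute's full value. It only mimics the semantics with plain string methods and is not a CSS engine; the &lt;code&gt;attr_matches&lt;/code&gt; helper is a hypothetical name. The sample values are copied from the two markup examples above.&lt;/p&gt;

```python
def attr_matches(value, op, needle):
    """Mimic CSS attribute operators on a single attribute value.

    op is one of "=", "^=", "$=", "*=" (exact, begins with, ends with, contains).
    Per the CSS spec, an empty needle never matches for ^=, $= and *=.
    """
    if op == "=":
        return value == needle
    if not needle:
        return False
    if op == "^=":
        return value.startswith(needle)
    if op == "$=":
        return value.endswith(needle)
    if op == "*=":
        return needle in value
    raise ValueError(f"unsupported operator: {op}")


# Attribute values taken from the two markup examples above.
book_link = " Our_Book "  # data-link-event, padded with whitespace
rail_box = "RailGeneric__RailBox-sc-1565s4y-0 iZilXF mt5"  # class

print(attr_matches(book_link, "=", "Our_Book"))     # exact match fails on the padding
print(attr_matches(book_link, "*=", "Our_Book"))    # contains still matches
print(attr_matches(rail_box, "^=", "RailGeneric"))  # stable prefix of the generated name
print(attr_matches(rail_box, "$=", "mt5"))          # last class in the attribute
```

&lt;p&gt;In selector form, the second markup example could be targeted with something like div[class*="RailGeneric__RailBox"], which ignores the generated suffix entirely.&lt;/p&gt;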

</description>
      <category>datascience</category>
      <category>webscraping</category>
      <category>css</category>
    </item>
  </channel>
</rss>
