DEV Community: Merrill Cook

Meet the 10k Most Important People In The World (According to an AI)

Merrill Cook — Mon, 11 Jan 2021 22:50:49 +0000

What makes a human important? Their humanity, sure. But what makes you REALLY important? What would a balanced jury of your peers pull out about your life?

Maybe you try really hard as a parent. Or were part of the making of a product. Maybe you shook up the world in public or maybe just had some particularly happy moments with a few.

Emily Dickinson hardly left her house. And spent the last two decades of her life refusing visitors. But in the end left an indelible mark on literary history.

*Inherent importance of life aside, there’s a potential underlying structure to what we value in all these scenarios. And it’s readable by an AI.*

Semantic triples follow the structure of subject — predicate — object.

“Steve graduated from Harvard”

“Sam is 37”

“Marissa Mayer was the CEO of Yahoo!”

“My mother is a skilled flautist”

These are all semantic triples. And the act of drawing inferences from them is a veritable gold mine of linked data when done at scale.

This structure is what provides the underlying organization of a knowledge graph. For simplicity’s sake, you can think of a knowledge graph as something like a relational database. But at bottom it’s composed of nodes (entities) and edges (relationships between entities).
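As a rough sketch of how triples turn into nodes and edges, the example sentences above can be loaded into a small adjacency map. The `build_graph` helper and the "inverse of" edge label are made up for illustration; this is not how any particular knowledge graph stores its data.

```python
from collections import defaultdict

# Each triple is (subject, predicate, object), taken from the example sentences.
triples = [
    ("Steve", "graduated from", "Harvard"),
    ("Sam", "is", "37"),
    ("Marissa Mayer", "was the CEO of", "Yahoo!"),
    ("My mother", "is", "a skilled flautist"),
]


def build_graph(triples):
    """Index triples as an adjacency map: node -> list of (predicate, other node)."""
    graph = defaultdict(list)
    for subject, predicate, obj in triples:
        graph[subject].append((predicate, obj))
        # Store the reverse direction too, so an edge can be walked from either end.
        graph[obj].append((f"inverse of: {predicate}", subject))
    return graph


graph = build_graph(triples)
print(graph["Marissa Mayer"])  # [('was the CEO of', 'Yahoo!')]
print(graph["Harvard"])        # [('inverse of: graduated from', 'Steve')]
```

Every subject and object becomes a node, and every predicate becomes an edge between two of them. That is the relationship-first shape in miniature.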

Where most databases have historically been structured to retain the shape of each individual entry (think of a row in a spreadsheet), knowledge graphs are structured around the relationships between entities. This relationship-first structure has long been coveted as a cornerstone of the semantic web. And today we’re just starting to see it bear fruit at large through tools like Siri, richer search results, data enrichment tools, and more.

There are probably two public knowledge graphs of particular note. Google’s Knowledge Graph is perhaps the most well known and commonly used. Diffbot’s Knowledge Graph is the largest and most accurate knowledge graph sourced from the public web.

There’s no public endpoint for consuming all of the relationships in Google’s KG data. So for the purposes of this exploration we used the data from Diffbot’s KG.

*So what does this have to do with importance?*

As previously mentioned, we tend to think individuals are more or less important based on how many lives or entities they’ve touched. And in turn how important those lives and entities are. Whether by proxy (making a product, or a poem), or in person (being a boss, or a friend, or attending something).

The relationship-first nature of knowledge graphs does a good job of representing the way we actually view the world. And one factor present in Diffbot’s Knowledge Graph is an “importance” score for each entity. This is basically used to determine which entity you’re likelier to mean if you inquire about “apple.” Do you mean Apple Inc. or the fruit?

Apple Inc. has millions of connections (“edges” in knowledge graph speak): news mentions, many employees, investors, products, reviews. Sure, apples are popular. But in the context of a Knowledge Graph centered around organizations and people, you’re probably after Apple Inc.
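To make the disambiguation idea concrete, here is a toy sketch that simply counts edges. How the real importance score is computed isn't described here, so treat raw edge count as a crude stand-in. The entities, edges, and the `degree` and `disambiguate` helpers are all invented for illustration.

```python
# Toy edge list; every entity and edge below is invented for illustration only.
edges = [
    ("Apple Inc.", "employs", "Person A"),
    ("Apple Inc.", "makes", "Product B"),
    ("Apple Inc.", "mentioned in", "News article C"),
    ("Investor D", "invested in", "Apple Inc."),
    ("apple (fruit)", "grown in", "Orchard E"),
]


def degree(entity, edges):
    """Count the edges touching an entity, in either direction."""
    return sum(1 for subject, _, obj in edges if entity in (subject, obj))


def disambiguate(candidates, edges):
    """Pick the candidate with the most edges (a crude stand-in for importance)."""
    return max(candidates, key=lambda entity: degree(entity, edges))


best = disambiguate(["Apple Inc.", "apple (fruit)"], edges)
print(best)  # Apple Inc.
```

A fuller version would also weight each edge by how important the neighboring entity is, since importance here depends on whom you touch as well as how many.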

And keep in mind that the Knowledge Graph is sourced from the public web. In essence an AI built to read web pages and infer facts. Surely there are many books out there about apple farming. But that’s not a huge portion of the web.

*So what can we learn from the 10k most important people (“MIPs”)?*

*No names are named here. But what does it take to have more connections than nearly anyone in the world?*

Education

As one would likely expect, certain schools are outsized pipelines to influence.

Looking at the most commonly attended schools in this cohort, each of the following was attended by more than 1 in every 200 MIPs.

In particular:

Harvard University — 1 in 14 MIPs
Stanford University — 1 in 28 MIPs
University of California Berkeley — 1 in 32 MIPs
Massachusetts Institute of Technology — 1 in 52 MIPs
University of Pennsylvania — 1 in 64 MIPs
Columbia University — 1 in 85 MIPs
Yale University — 1 in 88 MIPs
University of Chicago — 1 in 110 MIPs
University of Cambridge — 1 in 124 MIPs
Northwestern University — 1 in 124 MIPs
University of Oxford — 1 in 127 MIPs
Cornell University — 1 in 162 MIPs
University of Illinois — 1 in 165 MIPs
UCLA — 1 in 191 MIPs
Brown University — 1 in 196 MIPs
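For a sense of scale, the "1 in N" figures can be turned back into approximate head counts out of the 10,000-person cohort. This is plain arithmetic on a few of the ratios listed above. The ratios are already rounded, so the counts are estimates, and the `approx_count` helper is just for illustration.

```python
COHORT_SIZE = 10_000  # the "10k MIPs" described in this article

# "1 in N" figures copied from the list above (a few entries only).
one_in = {
    "Harvard University": 14,
    "Stanford University": 28,
    "Massachusetts Institute of Technology": 52,
    "Brown University": 196,
}


def approx_count(n, cohort=COHORT_SIZE):
    """Approximate number of people implied by a '1 in n' ratio."""
    return round(cohort / n)


for school, n in one_in.items():
    print(f"{school}: roughly {approx_count(n)} MIPs ({100 / n:.1f}%)")
```

So "1 in 14" works out to roughly 714 people, while "1 in 196" is closer to 51.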

Our 10,000 MIPs attended a total of slightly over 3,000 schools. 65 of these schools were attended by over 30 MIPs each. And the top handful were each attended by hundreds of MIPs.

65% of total MIPs did not attend these 65 premier schools, however. And a small handful did not attend higher education.

A cluster of pre-collegiate schools also surfaced among individuals whose pre-collegiate training is listed online.

**Roughly 1 in 200** of our MIPs attended Eton College (a British boarding school).

Roughly 1 in 375 of our MIPs attended the Bronx High School of Science.

Roughly 1 in 1000 of our MIPs attended the following high schools:

Phillips Academy
Horace Mann School
Berkeley High School
Phillips Exeter Academy
Gaithersburg High School

And roughly 1 in 3000 of our MIPs attended the following:

Stuyvesant High School
Greeley Central High School
Towson High School
Horace Greeley High School
Saint Ignatius High School
Beverly Hills High School

Internationally, clusters were less extreme. But the most common non-American universities attended by our MIPs included:

Cambridge University
Oxford University
INSEAD
London School of Economics
Imperial College London
Hebrew University of Jerusalem
University of the Witwatersrand
Tel Aviv University
University of Western Ontario
University of British Columbia
Indian Institute of Technology
London Business School
University of Waterloo
National University of Singapore
HEC Paris
University of London
McGill University
University of Manchester
University of Cape Town
University of Taiwan
University College London
King’s College London

Skills

At the end of the day, education will only get you so far. In our hyper-specialized economies there are many ways to get ahead. And many problems worth solving. Let’s take a look at the most common skills our MIPs possess.

In total, our 10k MIPs have listed or attested to roughly 6,000 unique skillsets, suggesting a sizable amount of overlap.

If you had to guess one single skill that is most prevalent among these individuals, you probably wouldn’t get it. Not even on a multiple choice test.

*The single most common skill attributed to our 10,000 MIPs is teaching.*

Of every skill attributed to the MIPs, one out of 55 is teaching. That might not be quite what you expect from our empire-creating cadre. But in a larger cluster of human-related skills it starts to make more sense: teaching, management, leadership, human resources management.

Add to that the fact that a large portion of the individuals in question hold advanced degrees and at one point were university TAs, and perhaps the number isn’t that surprising.

In descending order, the most common skills attributed to our MIPs include:

Teaching
Economics
Management
Marketing
Supply Chain Management
Start-ups
Strategy
Sales
Entrepreneurship
Leadership
Law
Mass Media
Human Resources Management
Software Development
Business Development
Cloud Technologies
Strategic Partnerships
Product Management
Content Management Systems
Writing
Public Speaking
Advertising
Mathematics
Social Media
Venture Capital
Mergers and Acquisitions
Research
Mobile Technologies
User Interface
Ecommerce

Working through the entire list of skills, three clusters appear: finance-related skills, engineering-related skills, and marketing or public-facing skills.

The top finance-related skills include:

Economics
Venture Capital
Mergers and Acquisitions
Investing
And Fundraising

The top engineering-related skills include:

Cloud Technologies
Mobile Technologies
Enterprise Software
Networking Technologies
And Robotics

The top public-facing skills:

Marketing
Sales
Mass Media
Public Speaking
And Online Advertising

**A large majority of MIPs also specialize.** While a cluster of skills is shared by many MIPs (as in the lists above), a majority of skills are one-offs, shared by no or very few other MIPs.

While there are too many specializations to list, to exemplify the range of industries and competency areas represented, a random sample is presented below.

Union negotiations
eSports
Phytochemicals
Quorum Sensing
Essential Oils
Federal Budget Management
Printing Solutions

Location

While we’ve just witnessed the year of remote work, location still matters. Particularly in networking-heavy, governmental, research, and capital-intensive industries like manufacturing, MIPs tend to cluster.

In fact, while many of these individuals have undoubtedly worked remotely for at least part of 2020, only 1 in 100 have listed remote working as a current or past job location.

Our 10k MIPs are listed as working in a total of 1,800 locations throughout their lives. Considering there are over 4,000 mid-sized cities in the world, this suggests a definite clustering. The most recent location listed for each of our 10k MIPs lowers this number to around 600 cities, with only 36 cities hosting more than 1 in 250 of our MIPs.

*Of MIPs located in the top 100 MIP-hosting locations in the US, 1 in 3 are in cities in California, 1 in 6 are in New York, and 1 in 15 are in D.C. No other locations come close.*

Beyond large financial, research, governmental, and technical hubs, noteworthy small clusters include well-known university towns throughout the United States and Europe.

Additionally, there are definite “stepping stone” locations among MIPs. These are past locations associated with MIPs, and they take in a range of university towns. The leading few include:

Cambridge, MA
Stanford, CA
Berkeley, CA
Princeton, NJ
Oxford, UK
New Haven, CT
Boulder, CO
Ann Arbor, MI
Evanston, IL

Job Titles

Most large scale impact by MIPs is derived from their work. And while MIP work is at the end of the day very wide ranging, definite clusters appear.

More than 1 in 8 MIPs work in computing or information science roles
More than 1 in 8 MIPs work in finance-related industries
More than 1 in 10 MIPs work in software-related industries
More than 1 in 20 MIPs work in health care-related industries

For job titles, many MIPs have accumulated quite a number through the years, and hold several simultaneously.

**The single most common job title of our MIPs was board member.** Though many of these individuals also lead or help lead their own enterprise.

As one might expect, the top handful of job titles for MIPs include:

Board member
Chairman of the board
CEO
Founder / Co-Founder
Owner
Executive Director
Chancellor
And Partner

Roughly half of all current jobs held by MIPs were some variation of the above titles. For the other half, an exceedingly diverse range of titles emerges. A sampling includes:

Angel Investor
Lobbyist
Chief negotiator
Journalist
Philosopher
Governor
Attorney General
General
Bass Player
Chief Scientist
Author
Producer
Senator
Rector
Evangelist
Bishop
Head Coach

So what have we learned?

On one level, the public visibility of individuals (in this case, facts from the public web) will never capture a truly holistic picture of “important” people. Importance is subjective in and of itself.

But the ability to structure and quantify relationships at scale is new. Particularly from otherwise unstructured natural language and visuals from around the web.

This quick illustration validates many things one may have already known. Power and influence cluster. Education matters. There are a few ways to gain large levels of influence, and they tend to revolve around public service, being the best in a particular niche, building a company, or owning things. And this seems to align with a common sense view of who would realistically be able to change a large number of lives. Or have more “touch points” with the world.

Two Simple Techniques For Web Scraping Pages With Dynamically-Created CSS Class Names

Merrill Cook — Mon, 14 Dec 2020 16:19:53 +0000

I get to work with a variety of web scraping products and techniques at my job at Diffbot. Aligned with Diffbot's mission to "structure the world's knowledge" is an initial step of first gathering the underlying data to be structured. Diffbot is one of three western entities that truly crawl the whole public web. So this involves a pretty stellar stack of web crawling, extraction, and parsing tools.

Even with great tools, one of the challenges with crawling and extracting data from pages at a large scale is you don't really know what structure a page is going to have before you get to it. To this end, Diffbot employs a series of Automatic APIs. These are AI-enabled web extraction APIs that employ a range of techniques from computer vision through NLP to discern what data may be valuable on a page, and then to grab and structure that data.

Based on our research, around 90% of the surface of the web can be classified into 20 distinct page types. These can be discussion pages, product pages, article pages, nav pages, organizational "about" pages, and so forth. And typically each "type" of page will share a cluster of characteristics.

An event page is likely to have a time and date for the event. An article is likely to have an author. A product is likely to have an SKU. By training AI to look for the visual and non-visual fields that a page is likely to have (given its type), you bypass the need to dive into site-specific structural details.

This leads me to my first tip…

Tip #1: Don't Use Rule-Based Extraction

Rule-based extraction is fine for small scale scraping, one-off scripts to grab some data, and sites that don't routinely change. But these days a site with data of any value that isn't dynamic to some degree is relatively rare.

Additionally, classifying extraction rules for a given domain doesn't scale to multiple domains. Simply ensuring regularly updated web data from a small group of domains routinely requires a whole team to manage the process. And the process still breaks down. Trust me, we hear this a ton in conversations with current or potential clients.

So you have a few choices for following this tip. Or at least for avoiding what this tip is meant to avoid: unscalable or regularly broken scrapers.

The first is that you can build a non-rule-centered form of extraction that's custom to you. There are more free training data sets out there than ever before. Out-of-the-box NLP is improving from a handful of providers. And particularly if you want to focus on a small set of domains, you may be able to pull this off.

Second, you can reach out to the small handful of providers who truly offer rule-less web extraction. If you want to extract from a wide range of sites, your sites are regularly changing, or you're seeking a variety of document types, this is likely the way to go.

Third, you can stick to gathering public web data about particularly well-known sites. At the end of the day this may simply be paying someone else to maintain rule-based extractors for you. But there's a veritable cottage industry around scraping very specific sites like social media, for example. Their whole business is to provide up-to-date extractors for things like lists of members of a given Facebook group. But these scrape providers won't help if you want to monitor custom domains or the vast majority of the web.

Tip #2: If You Have To Use Rule-Based Extraction, Try Out These Advanced Selectors

If you truly can't find a way to extract what you need with one of the options above, there are a few ways you can at least harden your scraping against dynamic content.

Among Diffbot products, this is what the Custom API is for. It's our only rule-based extractor, and it's essentially for page types unique enough that they don't fit into a major page category. Or for when you just want to grab specific pieces of information from the page. You can pair it with Crawlbot to apply this API to large numbers of pages at once.

Alternatively, this type of rule-based selector extraction is how most major extraction services work, including Import.io, Octoparse, and plugin web extractors. It's also what you'll be writing if you're rolling your own extractor with something like Selenium or BeautifulSoup.

Now there are a few scenarios where these selectors become useful. Typically if a site is well structured, class and ID names make sense, and you have classed elements inside of classed elements, you're good without these techniques.

But if you've spent any time with web scraping, don't tell me you haven't occasionally gotten a few of these:

<div>
  <a href="/some/stuff" data-event="ev=filedownload" data-link-event=" Our_Book ">
    <span class="">Download Our Book</span>
  </a>
</div>

Or...

<div class="Cell-sc-1abjmm4-0 Layout__RailCell-sc-1goy157-1 hcxgdw">
  <div class="RailGeneric__RailBox-sc-1565s4y-0 iZilXF mt5">
    ...
  </div>
  <div class="RailGeneric__AdviceBox-sc-1565s4y-3 kObkOT">
    ...
  </div>
</div>

Both of the above stray from regular class declarations and resist attempts to extract data using typical selectors. They're both examples of irregular markup, but potentially in inverse ways.

The first example provides very little traditional markup that could be used for typical CSS selectors.

The second contains very specific class names that are dynamically created in something like React.

For both, we can use the same handful of advanced CSS selectors to grab the values we want.

CSS Begins With, Ends With, and Contains

You won't encounter these CSS selectors very often when building your own site. And maybe that's why they're often overlooked in explanations. But many people don't know that a subset of CSS selector types supports regex-like matching (simple substring tests rather than full regular expressions).

Fortunately, these regex-like operators can be applied to HTML attribute/value selectors.

So in the first example above, something like the following works great:

a[data-link-event*='Our_Book']
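To see why "contains" matters for that first snippet, here is a minimal sketch that parses the markup with Python's standard-library `html.parser` and compares an exact test against a substring test on the attribute value. The `AnchorCollector` class is a hypothetical helper and only stands in for a real selector engine. Note that the attribute name inside the brackets has to be the full name, `data-link-event`.

```python
from html.parser import HTMLParser

MARKUP = """
<div>
  <a href="/some/stuff" data-event="ev=filedownload" data-link-event=" Our_Book ">
    <span class="">Download Our Book</span>
  </a>
</div>
"""


class AnchorCollector(HTMLParser):
    """Collects the attribute dict of every <a> start tag."""

    def __init__(self):
        super().__init__()
        self.anchors = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.anchors.append(dict(attrs))


collector = AnchorCollector()
collector.feed(MARKUP)

value = collector.anchors[0]["data-link-event"]
exact_match = value == "Our_Book"      # mirrors a[data-link-event='Our_Book']
contains_match = "Our_Book" in value   # mirrors a[data-link-event*='Our_Book']
print(repr(value), exact_match, contains_match)  # ' Our_Book ' False True
```

The stray spaces in the markup are enough to defeat an exact `=` comparison, while the substring test still finds the link.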

Within CSS, square brackets are used to filter elements by attribute. They follow the general format of:

element[attribute=value]

This in and of itself doesn't solve either of our issues up there. What does is the inclusion of the three regex-like operators for begins with, ends with, and contains.

In the above example grabbing Our_Book (note that these selectors are case sensitive), the original markup has extra whitespace on either side of the characters. That's where our friend "contains" comes into play. In short, these selectors work like so:

div[class^="beginsWith"]
div[class$="endsWith"]
div[class*="containsThis"]

Here class can be any attribute, and the value string matches the beginning, the end, or some substring of the full attribute value.
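As a sketch of how the three operators behave on the dynamically generated class names from earlier, the snippet below mimics `^=`, `$=`, and `*=` with plain Python string tests over the standard-library `html.parser`. The `AttrMatcher` class and `select` function are hypothetical helpers written for this illustration. In a real scraper you would normally hand the selector string to your library's selector engine instead.

```python
from html.parser import HTMLParser

MARKUP = """
<div class="Cell-sc-1abjmm4-0 Layout__RailCell-sc-1goy157-1 hcxgdw">
  <div class="RailGeneric__RailBox-sc-1565s4y-0 iZilXF mt5">...</div>
  <div class="RailGeneric__AdviceBox-sc-1565s4y-3 kObkOT">...</div>
</div>
"""

# Map each CSS attribute operator to the equivalent Python string test.
OPERATORS = {
    "^=": str.startswith,    # begins with
    "$=": str.endswith,      # ends with
    "*=": str.__contains__,  # contains
}


class AttrMatcher(HTMLParser):
    """Records attribute values of matching elements for one tag[attr OP needle] rule."""

    def __init__(self, want_tag, want_attr, op, needle):
        super().__init__()
        self.want_tag = want_tag
        self.want_attr = want_attr
        self.needle = needle
        self.string_test = OPERATORS[op]
        self.matches = []

    def handle_starttag(self, tag, attrs):
        value = dict(attrs).get(self.want_attr)
        # In CSS an empty needle never matches, so guard against it explicitly.
        if tag == self.want_tag and value and self.needle:
            if self.string_test(value, self.needle):
                self.matches.append(value)


def select(tag, attr, op, needle, markup=MARKUP):
    """Return the attribute values of every element the rule matches."""
    matcher = AttrMatcher(tag, attr, op, needle)
    matcher.feed(markup)
    return matcher.matches


begins = select("div", "class", "^=", "RailGeneric__RailBox")
ends = select("div", "class", "$=", "mt5")
contains = select("div", "class", "*=", "Layout__RailCell")
misses = select("div", "class", "^=", "Layout__RailCell")
print(begins, ends, contains, misses, sep="\n")
```

The last call comes back empty, which is the gotcha worth remembering: these operators test the whole attribute string, not each individual class name. A class that appears second in the list can be found with "contains" but not with "begins with." Matching on the stable prefix (like `RailGeneric__AdviceBox`) rather than the generated suffix is what keeps the selector from breaking when the hash changes.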