DEV Community

Discussion on: Using Google's BigQuery to Better Understand the Python Ecosystem

Felipe Hoffa

Thanks Walker!

A better query to extract all the .ipynb files (I left the results at bigquery.cloud.google.com/table/fh...):

#standardSQL  
SELECT a.id id, size, content, binary, copies,
  sample_repo_name, sample_path
FROM (
  SELECT id, ANY_VALUE(repo_name) sample_repo_name, ANY_VALUE(path) sample_path
  FROM `bigquery-public-data.github_repos.files`
  WHERE ENDS_WITH(path,'.ipynb')
  GROUP BY 1
) a
JOIN `bigquery-public-data.github_repos.contents` b
ON a.id=b.id
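The key step here is the inner GROUP BY: many repositories contain byte-identical files, so the files table has many rows per content id, and ANY_VALUE(repo_name) / ANY_VALUE(path) keeps one representative repo and path per unique file before joining against contents. A minimal local sketch of that deduplication (the rows below are hypothetical examples, not real BigQuery data):

```python
# Hypothetical rows mimicking bigquery-public-data.github_repos.files:
# the same content id appears under multiple repos.
files = [
    {"id": "abc", "repo_name": "alice/nb-demo", "path": "intro.ipynb"},
    {"id": "abc", "repo_name": "bob/fork-demo", "path": "intro.ipynb"},
    {"id": "def", "repo_name": "carol/plots",   "path": "viz.ipynb"},
    {"id": "ghi", "repo_name": "dave/scripts",  "path": "run.py"},
]

def dedupe_ipynb(rows):
    """Keep one (sample_repo_name, sample_path) per unique file id,
    restricted to .ipynb paths -- mirrors the WHERE + GROUP BY + ANY_VALUE."""
    sample = {}
    for row in rows:
        if row["path"].endswith(".ipynb") and row["id"] not in sample:
            sample[row["id"]] = (row["repo_name"], row["path"])
    return sample

print(dedupe_ipynb(files))
# {'abc': ('alice/nb-demo', 'intro.ipynb'), 'def': ('carol/plots', 'viz.ipynb')}
```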
Felipe Hoffa

And more #standardSQL for the imports query:

#standardSQL
SELECT
CASE WHEN package LIKE '%\\n",' THEN SUBSTR(package, 0, LENGTH(package) - 4)
ELSE package
END as package, n
FROM (
  SELECT REGEXP_EXTRACT( line, r'(?:\S+\s+)(\S+)' ) as package, count(*) as n
  FROM (
    SELECT line
    FROM (
      SELECT SPLIT(content, '\n      \"') as lines
      FROM `fh-bigquery.github_extracts.contents_ipynb`
    ), UNNEST(lines) line
    WHERE STARTS_WITH(line, 'import ') OR STARTS_WITH(line, 'from '))
  GROUP BY package
)
ORDER BY n DESC
LIMIT 50;
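Two details of this query are easy to miss: REGEXP_EXTRACT with the pattern (?:\S+\s+)(\S+) captures the second whitespace-delimited token (numpy from "import numpy", collections from "from collections import Counter"), and the outer CASE strips a trailing \n", left clinging to the token because notebook source lines are JSON-encoded. A local sketch of both steps using Python's re (the sample lines are made up):

```python
import re

# Same pattern as the SQL REGEXP_EXTRACT: skip the first token
# ("import" or "from"), capture the second.
PACKAGE_RE = re.compile(r'(?:\S+\s+)(\S+)')

def extract_package(line):
    """Return the imported package name from an import/from line."""
    m = PACKAGE_RE.match(line)
    if not m:
        return None
    package = m.group(1)
    # Notebook source is JSON-encoded, so a literal backslash-n-quote-comma
    # (4 characters) can trail the last token; mirror the SQL CASE and trim it.
    if package.endswith('\\n",'):
        package = package[:-4]
    return package

lines = ['import numpy', 'from collections import Counter', 'import os\\n",']
print([extract_package(l) for l in lines])
# ['numpy', 'collections', 'os']
```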
Walker Harrison

Thanks, Felipe. I'm still honing my SQL skills, so I'm all ears for better solutions. I've got some more BigQuery posts in the pipeline, both here on dev.to and on Medium.