DEV Community

Sahil kashyap
Sahil kashyap

Posted on

1

From File location address, get data based on how many file extension are there and their count

Problem: Assume you have million of rows, which has a address field. it contains URL like this:
*/Volumes/ Macintoh Hard Drive/Users/sahil/Desktop/some.pdf
*/Volumes/ Macintoh Hard Drive/Users/sahil/Desktop/abc.pptx
*/Volumes/ Macintoh Hard Drive/Users/sahil/Desktop/xyz.docx
*/Volumes/ Macintoh Hard Drive/Users/sahil/Desktop/some.mp4
*/Volumes/ Macintoh Hard Drive/Users/sahil/Desktop/w.mp3
*/Volumes/ Macintoh Hard Drive/Users/sahil/Desktop/q.png
table with file location
Now I want to know how many
-pdfs,ppts,mp4,png and so on are present

Below query :
reverse the address, looks for '.' and extracts the string after it and names it extension and counts how many times it encountered the extension

--volume is the column where file address is saved
select REVERSE(SUBSTRING(REVERSE(volume),1,LOCATE('.',REVERSE(volume),1))) as extension, count(1) as count from loutable2 group by extension
Enter fullscreen mode Exit fullscreen mode

query will return a result like this
table showing numbers of file of each entension

Now if you want only pdf data

SELECT * FROM `loutable2` where REVERSE(SUBSTRING(REVERSE(volume),1,LOCATE('.',REVERSE(volume),1))) =".pdf"
Enter fullscreen mode Exit fullscreen mode

now let's say you want this data in excel/csv file
All you do is make a "mysql view"

``CREATE VIEW pdf_details AS SELECT scaned,status,somecolum,volume FROMloutable2` where REVERSE(SUBSTRING(REVERSE(volume),1,LOCATE('.',REVERSE(volume),1))) =".pdf"





```sql
CREATE VIEW pdf_details AS SELECT scaned,status,somecolum,volume FROM `loutable2`  where REVERSE(SUBSTRING(REVERSE(volume),1,LOCATE('.',REVERSE(volume),1))) =".docx"
Enter fullscreen mode Exit fullscreen mode

and then you can use these views and export these in excel or whatever format you want
export mysql data in csv/excel

Image of Timescale

🚀 pgai Vectorizer: SQLAlchemy and LiteLLM Make Vector Search Simple

We built pgai Vectorizer to simplify embedding management for AI applications—without needing a separate database or complex infrastructure. Since launch, developers have created over 3,000 vectorizers on Timescale Cloud, with many more self-hosted.

Read more →

Top comments (0)

Billboard image

The Next Generation Developer Platform

Coherence is the first Platform-as-a-Service you can control. Unlike "black-box" platforms that are opinionated about the infra you can deploy, Coherence is powered by CNC, the open-source IaC framework, which offers limitless customization.

Learn more