DEV Community

Guo-Guang

How to correctly identify a web page?

Why do we need to identify a certain page? A Google Chrome extension needs to react when a certain page is loaded from a website. Currently, we identify the specific page by matching its URL against a regular expression generated from a set of URLs given by Ops people.
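For readers unfamiliar with the setup, here is a minimal sketch of that kind of regex lookup. The partner name `example-shop`, the page keys, and the patterns are all made up for illustration; the real ones would be derived from the URL lists Ops provides, and the real check runs inside the extension rather than in Python.

```python
import re
from typing import Optional

# Hypothetical patterns, one per page we want to react to.
PAGE_PATTERNS = {
    "example-shop:checkout": re.compile(
        r"^https://(www\.)?example-shop\.com/checkout(/.*)?$"
    ),
    "example-shop:cart": re.compile(
        r"^https://(www\.)?example-shop\.com/cart/?$"
    ),
}


def identify_page(url: str) -> Optional[str]:
    """Return the key of the first pattern that matches the URL, or None."""
    for page_id, pattern in PAGE_PATTERNS.items():
        if pattern.match(url):
            return page_id
    return None
```

Every new partner page means another hand-written entry in `PAGE_PATTERNS`, which is where the maintenance cost comes from.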

Using regex has some drawbacks. Ops and marketing people don't know regex, so whenever a contract is signed with a partner, we rely on engineers to write the regex. We are in the e-commerce (EC) field, and there are at least thousands of partners in each country running our services. So far we run our service in 7 countries.

Instead of using regex, I am wondering whether I could use Solr or Elasticsearch to index the URLs, with different weights on specific terms in the URLs. I hope to learn how you would address such a problem.
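To make the weighting idea concrete, here is a toy sketch in plain Python. It does not use Solr or Elasticsearch; it only imitates the idea of boosting certain terms. The weights, the threshold, and the function names are assumptions chosen for illustration.

```python
import re
from typing import Dict, List
from urllib.parse import urlsplit

# Hypothetical term weights for a "checkout" page type. In a search engine
# these would roughly correspond to per-term boosts.
CHECKOUT_WEIGHTS = {"checkout": 3.0, "payment": 3.0, "confirm": 2.0, "cart": 1.0}


def tokenize(url: str) -> List[str]:
    """Split the path and query string of a URL into lowercase alphanumeric terms."""
    parts = urlsplit(url.lower())
    return [t for t in re.split(r"[^a-z0-9]+", parts.path + " " + parts.query) if t]


def score(url: str, weights: Dict[str, float]) -> float:
    """Sum the weights of the distinct terms found in the URL."""
    return sum(weights.get(term, 0.0) for term in set(tokenize(url)))


def looks_like_checkout(url: str, threshold: float = 3.0) -> bool:
    """Treat the URL as a checkout page once its score reaches the threshold."""
    return score(url, CHECKOUT_WEIGHTS) >= threshold
```

A scoring approach like this tolerates small URL changes better than an exact regex, at the cost of possible false positives, so the threshold would need tuning against real URLs.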

Top comments (4)

rhymes

I'm not sure about the size of this system, how it actually works, or how you store those URLs, but I agree on one thing: a regexp probably wouldn't scale well with too much data.

Maybe you can just index them with Elasticsearch and use it to do a full-text search. If you need to tag the URLs, you can probably add that info too. And if you need to answer the question "do I have this URL in my system" without actually using the search engine, you can probably hash every URL and check whether you have it.
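A minimal sketch of the hashing idea, assuming that two URLs count as "the same" once the host is lowercased and the query string, fragment, and trailing slash are dropped. That normalization rule, the example URLs, and the function names are assumptions, not something stated in this thread.

```python
import hashlib
from urllib.parse import urlsplit


def normalize(url: str) -> str:
    """Lowercase the host and drop query string, fragment, and trailing slash."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/") or "/"
    return f"{parts.scheme.lower()}://{parts.netloc.lower()}{path}"


def url_key(url: str) -> str:
    """Return a SHA-256 hex digest of the normalized URL."""
    return hashlib.sha256(normalize(url).encode("utf-8")).hexdigest()


# Hypothetical set of known partner URLs, stored as hashes.
KNOWN_KEYS = {
    url_key("https://example-shop.com/checkout"),
    url_key("https://example-shop.com/payment/confirm"),
}


def is_known(url: str) -> bool:
    """Answer "do I have this URL in my system" with a set lookup."""
    return url_key(url) in KNOWN_KEYS
```

This only answers exact-match questions, so it would not help when a path contains a changing segment such as an order ID.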

Guo-Guang • Edited

Take eBay as an example. Our users install the extension and then want to check out. Before a user is redirected to the payment confirmation page on eBay, the extension needs to send an event to our backend services. Therefore, we need to identify the page and send the event before the payment page loads. The problem is that the page changes over time, which means the URL changes too, so constantly changing the regex is inevitable.

The goal is to have a semi-automated internal service that lets ops and salespeople react to changes without getting engineers involved, even when it is just a small change in the regex.
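One possible direction, sketched under assumptions: let non-engineers enter simple wildcard patterns and convert them to regex automatically with Python's standard `fnmatch.translate`. This is a swapped-in technique for illustration, not what we run today; the rule list, page keys, and `identify_page_from_rules` are hypothetical.

```python
import fnmatch
import re
from typing import Optional

# Hypothetical rules as ops might enter them in a spreadsheet or admin form.
# "*" stands for any run of characters, including "/".
OPS_RULES = [
    ("example-shop:checkout", "https://*.example-shop.com/checkout/*"),
    ("example-shop:payment", "https://pay.example-shop.com/*/confirm"),
]

# Convert each wildcard pattern into a compiled regex once, at load time.
COMPILED_RULES = [
    (page_id, re.compile(fnmatch.translate(pattern)))
    for page_id, pattern in OPS_RULES
]


def identify_page_from_rules(url: str) -> Optional[str]:
    """Return the page key of the first ops rule matching the URL, or None."""
    for page_id, pattern in COMPILED_RULES:
        if pattern.match(url):
            return page_id
    return None
```

Wildcards are far less expressive than regex, and `?` and `[` keep their special meaning in `fnmatch` patterns, so rules containing query strings would need escaping or a different syntax.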

We have around 10 thousand partners in each country. That is not a big number, but maintaining URLs for just the top 30 partners in each country has been a hassle. We hope to cover all partners instead of just the top 30.

rhymes

Hi Guo-Guang, can you elaborate on your question? What do you mean?

Guo-Guang

@rhymes, I just added it. I forgot to add the description.