<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Ben Corcoran</title>
    <description>The latest articles on DEV Community by Ben Corcoran (@benjcorc).</description>
    <link>https://dev.to/benjcorc</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F304144%2F44f29d2d-af13-4a65-9d0b-77d985329c9c.jpg</url>
      <title>DEV Community: Ben Corcoran</title>
      <link>https://dev.to/benjcorc</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/benjcorc"/>
    <language>en</language>
    <item>
      <title>Scraping Wikipedia for Orders of Magnitude</title>
      <dc:creator>Ben Corcoran</dc:creator>
      <pubDate>Mon, 03 Jun 2019 23:18:53 +0000</pubDate>
      <link>https://dev.to/benjcorc/scraping-wikipedia-for-orders-of-magnitude-13e4</link>
      <guid>https://dev.to/benjcorc/scraping-wikipedia-for-orders-of-magnitude-13e4</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--SOoZoDWB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/0%2AU9kX0C5_QBtSJH2p.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--SOoZoDWB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/0%2AU9kX0C5_QBtSJH2p.jpg" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/benjamincorcoran/scrapingWikipediaOrdersOfMagnitude"&gt;&lt;em&gt;Fork the notebook here&lt;/em&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pint.readthedocs.io/en/0.9/"&gt;Pint&lt;/a&gt;is an awesome python package that allows for the easy conversion between units. As it stands it covers nearly all scientific SI units and most imperial ones. This means I can easily convert between fathoms and meters. Pint also makes it really easy to extend a unit, or include a completely new custom one.&lt;/p&gt;

&lt;p&gt;For the standard units you’d use in calculation, there is rarely a simple real-world sense of what a given value of that unit actually means. Number sense is incredibly important for effective science communication. It can be the difference between someone engaging with the work and giving up entirely.&lt;/p&gt;

&lt;p&gt;This rainy afternoon project scrapes &lt;a href="https://en.wikipedia.org/wiki/Category:Orders_of_magnitude"&gt;Wikipedia’s order of magnitude sections&lt;/a&gt; to pull out potentially useful real-world comparisons and generates a Pint config to make these values accessible as units.&lt;/p&gt;

&lt;h4&gt;
  
  
  Scraping Wikipedia
&lt;/h4&gt;

&lt;p&gt;The first thing we need to do is find some useful data. After about two hours of googling, I managed to find Wikipedia’s order of magnitude categories. These are a series of tables for various concepts, such as length, listing example values of that concept at different orders of magnitude.&lt;/p&gt;

&lt;p&gt;Looking at the &lt;a href="https://en.wikipedia.org/wiki/Category:Orders_of_magnitude"&gt;order of magnitude section&lt;/a&gt;, nearly all of the pages have a table with a “Value” and an “Item” column. The Value column contains a numerical value in various units. The Item column is a description of what the Value represents. The ‘data’, ‘numbers’, ‘radiation’, ‘temperature’ and ‘time’ pages deviate from the structure of the other tables. How unlike Wikipedia to be inconsistent! We might have to return to these at a later date.&lt;/p&gt;

&lt;p&gt;For the others, we just need to find a way of pulling the data out of those tables. Thankfully, smarter people than me have already solved that problem. And after they solved it, they went and put stabilisers on the solution. You’ve really got to try to fail at Python nowadays.&lt;/p&gt;

&lt;p&gt;RoboBrowser makes web scraping offensively easy. Let’s open up an instance of RoboBrowser and point it at the page listing the various order of magnitude sections.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import re
import pandas as pd 
import robobrowser

BASEURL = 'https://en.wikipedia.org/wiki/Category:Orders\_of\_magnitude'

rb = robobrowser.RoboBrowser(history=False)
rb.open(BASEURL)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;From this point, we can select all the links on the page using the inbuilt select function. The select function behaves pretty much the same way as jQuery’s $() function. In the #mw-pages section we find all the links. We can then filter those down to just the actual ‘Orders of magnitude’ pages with a small bit of RegEx. Let’s shove all that in a dictionary for later use.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pages = {x.contents[0]:x for x in rb.select('#mw-pages li a') if re.match(r'^Orders',x.contents[0]) is not None}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Now it’s just a matter of iterating over that dictionary, telling RoboBrowser to follow each link and parsing any tables we find on the page. The code below shows exactly how that process works. The lion’s share of what’s been written deals with the eccentricities of each page. Hopefully, the comments provide some clarity as to what is happening at each step.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Initialise lastTableColumns variable. This will be used to capture the structure of 
# the table in the case that the header row is missing. 
lastTableColumns = None

# Create empty dictionary to store processed items
OrdersOfMagnitude = dict()

print("Begining wikipedia scape...\n")

for pageId, link in pages.items():

 # The order of magnitude pages for Pressure and Money have have renamed the 'Value' column. 
 # This dictionary allows for these values to be looked up.

 wikiValueColumns = dict(pressure='Pressure',currency='Money')

 # Get value column header

 if pageId in wikiValueColumns.keys():
 valueColumn = wikiValueColumns[pageId]
 else:
 valueColumn = 'Value'

 # Follow the link to the order of magnitude table 
 rb.follow\_link(link)

 # Select all tables on the page with the .wikitable class
 rawTables = rb.select('.wikitable')

 # Create a list to store parsed page tables
 pageTables = []

 for i, rawTable in enumerate(rawTables):

 # Parse the html table using pandas
 table = pd.read\_html(str(rawTable))[0]

 # Search the parsed columns for names similar to 'Value'
 # and 'Item'. Some of the tables have additional text 
 # in the header. Using the filter/like combo we can avoid
 # manually defining each column name. 

 valColList = table.filter(like=valueColumn).columns
 itmColList = table.filter(like='Item').columns

 # Check that the table has a Value and Item column

 if len(valColList) \&amp;gt; 0 and len(itmColList) \&amp;gt; 0:
 valCol = valColList[0]
 itmCol = itmColList[0]

 # Some pages only show the header on the first table, 
 # in that case use the previously parsed table's header

 elif i\&amp;gt;0 and lastTableColumns is not None and len(lastTableColumns)==len(table.columns):
 table.columns = lastTableColumns
 valCol = table.filter(like=valueColumn).columns[0]
 itmCol = table.filter(like='Item').columns[0]

 # If neither of the above conditions are met then we 
 # cannot parse this table as it doesn't meet our defined
 # structure.

 else:
 continue

 # Set lastTableColumns for part of this table 

 lastTableColumns = table.copy().columns

 # Some tables have the unit in the Value header rather 
 # than within the column. In that instance we want to 
 # pull that information out. Using some RegEx we search
 # for any Value column head with brackets. 

 if len(re.findall(r'\((.\*)\)',valCol)) \&amp;gt;0 :
 tableUnit = re.sub(r'[^A-Za-z0-9\/]','', re.findall(r'\((.\*)\)',valCol)[0])
 else:
 tableUnit = ''

 # Throw away all columns except for the Value and Item
 # and throw away all rows with a null Value.

 table = table[[valCol,itmCol]]
 table = table[table[valCol].notnull()]

 # Split the Value column into a numeric value and a unit
 table[['Value (Numeric)','Unit']] = table[valCol].str.extract(r'([^\s]+)\s\*([^\s]\*)$', expand=True)

 # Standardise the scientific notation, replacing '×10' with 'e'
 table['Value (Numeric)'].replace(regex=True, inplace=True, to\_replace=r'×10', value='e')

 # Filter out any complex values such as '50 to 100' or '20-25'. 
 table=table[table['Value (Numeric)'].map(lambda x: re.match(r'.\*(?:\d\*\.)?\d+[^\de.]+(?:\d\*\.)?\d+',x) is None)]

 # Remove any additional text or symbols that are still present in the string
 table['Value (Numeric)'].replace(regex=True, inplace=True, to\_replace=r'[^.–\-/e−\d]', value='')

 # Convert wikipedia stylistic choice of '−' into the common '-' character.
 table['Value (Numeric)'].replace(regex=True, inplace=True, to\_replace=r'−', value='-')

 # Convert any digits displayed as superscript in the unit to inline powers
 table['Unit'].replace(regex=True, inplace=True, to\_replace=r'((?:-)?\d)', value=r'^\1')

 # Finally pass the string to to\_numeric in order to parse this into a floating 
 # point. Then filter any values that failed to get through the conversion as well 
 # as any values who resolve to 0. 
 table['Value (Numeric)'] = pd.to\_numeric(table['Value (Numeric)'],errors='coerce')
 table = table[table['Value (Numeric)']!=0]
 table = table[table['Value (Numeric)'].notnull()]

 # Extract any wiki references from the Item column for potential future use 
 table[['Reference']] = table[itmCol].str.extract(r'((?:\[.+\])+)$', expand=True)

 # Create a detail column containing the Item text without any references
 table['Detail'] = table[itmCol].replace(regex=True, to\_replace=r'(?:\[.\*\])+', value='')

 # Remove any null detail columns
 table = table[table['Detail'].notnull()]

 # In the case that the unit is defined in the header, add this unit to the unit 
 # column
 table[['Unit']]=table['Unit'].replace('',tableUnit)

 pageTables.append(table)

 # Concatenate all the tables on the page and add this to our dictionary of Orders of Magnitude
 if len(pageTables) \&amp;gt; 0:

 OrdersOfMagnitude[pageId] = pd.concat(pageTables)

 # Print summary of the results. 
 print('{}: {} approximations found.'.format(pageId,len(OrdersOfMagnitude[pageId])))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;When we run the above we get the following output, indicating that our scrape completed error-free.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Begining wikipedia scape...

 Orders of magnitude (acceleration): 52 approximations found.
 Orders of magnitude (angular momentum): 8 approximations found.
 Orders of magnitude (area): 64 approximations found.
 Orders of magnitude (bit rate): 45 approximations found.
 Orders of magnitude (charge): 21 approximations found.
 Orders of magnitude (currency): 43 approximations found.
 Orders of magnitude (current): 37 approximations found.
 Orders of magnitude (energy): 170 approximations found.
 Orders of magnitude (entropy): 1 approximations found.
 Orders of magnitude (force): 33 approximations found.
 Orders of magnitude (frequency): 54 approximations found.
 Orders of magnitude (illuminance): 14 approximations found.
 Orders of magnitude (length): 118 approximations found.
 Orders of magnitude (luminance): 26 approximations found.
 Orders of magnitude (magnetic field): 32 approximations found.
 Orders of magnitude (mass): 156 approximations found.
 Orders of magnitude (molar concentration): 25 approximations found.
 Orders of magnitude (power): 80 approximations found.
 Orders of magnitude (probability): 45 approximations found.
 Orders of magnitude (specific heat capacity): 36 approximations found.
 Orders of magnitude (speed): 96 approximations found.
 Orders of magnitude (voltage): 27 approximations found.
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;h4&gt;
  
  
  Generate format names using TextBlob
&lt;/h4&gt;

&lt;p&gt;Now that we’ve scraped our data, we need to figure out what these values mean. The Item column we picked up from the various tables gives us a decent description. The Item column is not something that would work well as an identifier, though. Some of these information strings go on for two or three sentences.&lt;/p&gt;

&lt;p&gt;We need a way to convert these long strings into usable variable names. We could generate a unique code for each one in the vein of ‘acceleration1’, ‘acceleration2’, ‘acceleration3’ and so on. That doesn’t give us something immediately understandable though. What would be great is if we could pull out just the relevant information from the Item column in the tables we just parsed. Using TextBlob, we can.&lt;/p&gt;

&lt;p&gt;TextBlob (&amp;amp; nltk) is a package that makes natural language processing an absolute breeze. To get the information we want, all we need to do is convert our strings to a TextBlob object. Once we’ve done that, we can pull out all of the noun_phrases in the text and convert those into identifiers. As we’re mostly dealing with tangible ‘things’, referring to them by their noun phrase makes sense. Everything else in the Item text is just waffle.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import nltk
from textblob import TextBlob

# Download natural language processing libraries in order to use TextBlob
nltk.download('brown')
nltk.download('punkt')

def getNouns(string): 

 # Remove any information contained in brackets and any symbols other than alpha numeric ones
 # Then create the textblob object and extract the noun phrases.
 np = TextBlob(re.sub(r'[^a-zA-Z0-9\s]','',re.sub(r'\(.\*\)','',string))).noun\_phrases

 # If any noun phrases are found then join all these phrases into one large string. 
 # Split this string by anything that isn't alpha numeric, filter out anything else
 # that isn't alpha numeric. Capitalize each value and stick all these processed 
 # strings together. The result should be all the noun phrases in camelCase as one 
 # string. 
 if len(np)\&amp;gt;0:
 return ''.join([x.capitalize() for x in re.split('([^a-zA-Z0-9])',' '.join(np)) if x.isalnum()])

# Loop over our scraped orders of magnitude and create our noun phrase camelCase string
for area, data in OrdersOfMagnitude.items():

 # Apply the getNouns function
 data['id']=data['Detail'].apply(getNouns)
 # Throw away any values that were we couldn't create an id
 data=data[data['id'].notnull()]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;h4&gt;
  
  
  Pint and generating configurations
&lt;/h4&gt;

&lt;p&gt;We’ve got our data, we’ve got our labels. Now we just need to throw this into Pint and we’re done. To do that, we need to set up a custom unit registry.&lt;/p&gt;

&lt;p&gt;As with everything in Python nowadays, it’s ridiculously easy. Following the detailed &lt;a href="https://pint.readthedocs.io/en/latest/defining.html"&gt;guidance&lt;/a&gt;, we can see that the format is just id = value*unit. Most of the values we parsed were in units already defined in Pint's base unit registry. This means that all our custom units can be converted to and from any other unit that Pint offers. They will also behave with Pint's &lt;a href="https://pint.readthedocs.io/en/latest/wrapping.html#checking-dimensionality"&gt;dimensionality checks&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Iterate through the rows of our various magnitudes writing out to a text file
# string configurations for pint
with open('pintConfig.txt','w') as out:
 for area, data in OrdersOfMagnitude.items():
 for id, row in data.iterrows():
 pintConfig = '{} = {}\*1{} # {}\n'.format(row['id'],row['Value (Numeric)'],row['Unit'],row['Detail'])
 out.write(pintConfig)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Now that we’ve built our unit registry we can do all sorts of bizarre conversions. Want to see how many space shuttles’ worth of acceleration an adult saltwater crocodile’s bite would provide if it were applied to the mass of the largest Argentinosaurus? Easy, just look below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from pint import UnitRegistry

ureg = UnitRegistry()
ureg.load\_definitions('pintConfig.txt')

x = 1\*ureg.Large67mAdultSaltwaterCrocodile
acc = x/(1\*ureg.LargestArgentinosaurus)
print(acc.to('Shuttle'))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;The answer?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;0.01629664619744922 Shuttle
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Bizarreness aside, we can now easily switch any of our calculations, data analysis, tables or arrays into more comprehensible numbers. As Pint interfaces easily with both pandas and numpy, it can make the publication of data much more user-friendly, saving time and effort when it comes to producing any sort of public-facing data analysis. Plus, you know, 0.016 shuttles, that’s a lot of bite.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Originally published at&lt;/em&gt; &lt;a href="https://benjamincorcoran.com/scraping-wikipedia-for-orders-of-magnitude/"&gt;&lt;em&gt;Benjamin Corcoran&lt;/em&gt;&lt;/a&gt;&lt;em&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>webscraping</category>
      <category>naturallanguage</category>
      <category>sciencecommunicatio</category>
      <category>pandas</category>
    </item>
    <item>
      <title>Understanding the SAS log: Time and Switches</title>
      <dc:creator>Ben Corcoran</dc:creator>
      <pubDate>Sun, 26 May 2019 15:09:49 +0000</pubDate>
      <link>https://dev.to/benjcorc/understanding-the-sas-log-time-and-switches-26lc</link>
      <guid>https://dev.to/benjcorc/understanding-the-sas-log-time-and-switches-26lc</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--NWwgeVWb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/639/0%2A_KWF3c3xPpk9ZKCs.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--NWwgeVWb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/639/0%2A_KWF3c3xPpk9ZKCs.jpg" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you’ve worked with SAS before, you’ve seen the SAS log. It’s a beast, an unwieldy and at times unhelpful beast. But we can tame it. Well, maybe not, but we can hit it with a stick long enough that it becomes docile around tourists. Let’s use the great work of elephant tamers and orca ‘trainers’ as inspiration. We’re diving into the world of the SAS log.&lt;/p&gt;

&lt;h4&gt;
  
  
  Real time vs CPU time
&lt;/h4&gt;

&lt;p&gt;You’ve run SAS code before and seen log lines like this. &lt;a href="https://benjamincorcoran.com/sas-eg-program-timer/"&gt;We’ve looked at this log before&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--7Cd40PCm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/358/0%2ASphad84pUmUgoqJY.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7Cd40PCm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/358/0%2ASphad84pUmUgoqJY.png" alt="Proc Sort Log Results"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And, err, it took 3 minutes? Or did it take 1 minute? What, why are there two times, I don’t…what? If you’re like me then this was accompanied by a hammed-up performative mental breakdown, followed by your boss taking you aside to make sure everything is okay at home. But now that we’ve had our 5 minutes of amateur dramatics, let’s dig into what these two times actually mean.&lt;/p&gt;

&lt;h4&gt;
  
  
  Real time
&lt;/h4&gt;

&lt;p&gt;Let’s start simple: real time. This is a measurement of exactly how much physical time has passed since the job was started. Start a stopwatch and hit run; the time you’d get at the end of execution is the real time.&lt;/p&gt;

&lt;p&gt;Real time is heavily dependent on the resources and load on the system. If there is a huge queue of jobs, you can expect real time to be significantly extended. If the load on the system is low, our real time depends solely on our shitty optimisation.&lt;/p&gt;

&lt;h4&gt;
  
  
  User CPU time
&lt;/h4&gt;

&lt;p&gt;User CPU time is a slightly different measurement: it measures how much ‘time’ the job spent utilising the CPU. More explicitly, how much of the job occurred in the CPU. In the above example, almost ⅓ of the real time was spent inside the CPU; the rest of the time was used for read/write operations and system processes.&lt;/p&gt;

&lt;p&gt;On a system with a single CPU, the user CPU time will always be less than the real time. However, if a machine has multiple processors then it’s more than possible for the user CPU time to be longer than the real time. Woah woah woooah, before the urge to go full tilt Peter Finch overtakes, let’s look at this.&lt;/p&gt;

&lt;p&gt;The CPU time is the sum of the time the job utilises in each CPU. Say you have a machine with four processors. You’ve absolutely smashed your rig/SAS setup; this thing is purring like a cat when it comes to efficient processing. You run a job that takes two seconds of real time, and your setup spreads that computational load across the four processors. If each processor spends one second working on the job then your total user CPU time is four seconds, two seconds longer than your real time.&lt;/p&gt;
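&lt;p&gt;This distinction isn’t unique to SAS. Python’s standard library exposes both clocks, so you can watch real time outrun CPU time while a job sits waiting, just as a single-threaded SAS step would during slow I/O. A rough illustration, not a SAS measurement:&lt;/p&gt;

```python
import time

start_real = time.perf_counter()   # wall clock: 'real time'
start_cpu = time.process_time()    # CPU seconds used by this process

time.sleep(0.2)                              # waiting: real time passes, CPU time barely moves
total = sum(i * i for i in range(200_000))   # computing: both clocks advance

real = time.perf_counter() - start_real
cpu = time.process_time() - start_cpu
print(f'real: {real:.3f}s, cpu: {cpu:.3f}s')  # cpu comes out well below real here
```

&lt;p&gt;Run the arithmetic across several processes at once and the summed CPU clocks can exceed the wall clock, exactly the multi-processor case described above.&lt;/p&gt;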

&lt;p&gt;Now we understand the difference between these two time stats. We’re savant kings of the SAS log, so let’s step this up a notch.&lt;/p&gt;

&lt;h4&gt;
  
  
  FULLSTIMER, or stats for nerds, for nerds.
&lt;/h4&gt;

&lt;p&gt;In our previous example, we got two stats outlining the performance of our proc sort: real time and user CPU time. However, there is an option that can be called to give us even more information about how inefficient our programming is: FULLSTIMER.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--WitA2rLD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/378/0%2At0DKzpYtTlbhNeaG.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--WitA2rLD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/378/0%2At0DKzpYtTlbhNeaG.png" alt="FULLSTIMER SAS Log example"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Ho hoo, look at all these new stats we’ve got to inves…jesus christ, is that a third time measurement?! /)_-). Right, well, we’d best figure out what these ones measure as well.&lt;/p&gt;

&lt;h4&gt;
  
  
  System time
&lt;/h4&gt;

&lt;p&gt;We know that the user CPU time is the time taken for the processor(s) to execute our job. The system time then represents the time taken by the processor(s) to execute the operating system tasks that support our job. More simply, the system time is the time spent managing your request. For example, if your job is waiting for input/output from a slow disk, it might be involuntarily swapped out of the CPU until that I/O resource becomes available. On a multiprocessor system, this also covers time spent splitting up and pulling together the various parallel threads. These system jobs are the ones that contribute to the system time.&lt;/p&gt;

&lt;h4&gt;
  
  
  Memory and OS memory
&lt;/h4&gt;

&lt;p&gt;Memory is fairly straightforward: the amount of memory used in processing the job. This doesn’t include memory used for the job’s overheads, for example running the SAS manager.&lt;/p&gt;

&lt;p&gt;OS memory is the amount of memory that was released to complete the job. In other words, the maximum amount of memory the OS allocated to the job.&lt;/p&gt;

&lt;h4&gt;
  
  
  Paging
&lt;/h4&gt;

&lt;p&gt;Right, come on, let’s look at paging. We’ve all seen it before: page faults here, page something else there. You react with a flash of fear before realising nothing significant has happened. I can still open Excel, everything must be fine; I obviously don’t need to worry about paging.&lt;/p&gt;

&lt;p&gt;A page is essentially memory on disk. In order to conserve space in main memory, the operating system can store information on disk until it is required. When we need that data, it is copied from the disk into memory for the operating system to use. There are entire books devoted to paging, how it works and how it is implemented, so let’s not go into it here. Suffice to say, we’re cheating and storing some of the stuff we should be putting in memory onto the disk. This means the system can multitask more effectively. We can do this because not everything is required to be in memory at the same time.&lt;/p&gt;

&lt;h4&gt;
  
  
  Page faults
&lt;/h4&gt;

&lt;p&gt;A page fault is where we get caught. That thing we pretended was in memory but was actually on disk? Well, they want it now. This means the OS has to do an I/O read from the disk in order to load in this data. If there are a high number of page faults, it’s likely that your system isn’t well optimised. Each page fault requires an expensive I/O read rather than a quick dip in and out of system memory.&lt;/p&gt;

&lt;h4&gt;
  
  
  Page reclaims
&lt;/h4&gt;

&lt;p&gt;Some of the data we need may already be in a different area of memory when we need it. A page reclaim is a page fault that is handled entirely in memory (no I/O operation). The data has already been loaded into memory, used, and then marked as over-writable, but crucially hasn’t yet been overwritten. So we can jump in and grab that data before anyone notices.&lt;/p&gt;

&lt;h4&gt;
  
  
  Page swaps
&lt;/h4&gt;

&lt;p&gt;The page swaps statistic represents the number of times our job is swapped out of main memory, i.e. the instructions for our job have been written to disk so another process can use our slot of memory.&lt;/p&gt;

&lt;h4&gt;
  
  
  Context switches
&lt;/h4&gt;

&lt;p&gt;Nerds are actual nerds, you know. Okay, so you’ve got an amazing multitasking system. We can kick out any job that is currently being processed and pick it up at a later time. This is called preemption. In this system, we deal out discrete chunks of time to each job. Now you’ve got this, what are you going to name the chunks of CPU time you’re allocating? Oh…a time slice? Really? A time slice…fine (nerds).&lt;/p&gt;

&lt;p&gt;A context switch is when our multitasking system switches out one job for another. There are two kinds of context switch: one is voluntary, the other involuntary.&lt;/p&gt;

&lt;h4&gt;
  
  
  Voluntary context switches
&lt;/h4&gt;

&lt;p&gt;These occur when the job hands back the CPU in the middle of processing because it is busy waiting on some other information, typically an I/O read. Voluntary context switches are good: they mean the job is sharing the CPU with other jobs while it waits on a slow I/O read. Too many of them, though, might be a sign that your I/O is hampering the speed at which your system could be processing.&lt;/p&gt;

&lt;h4&gt;
  
  
  Involuntary context switches
&lt;/h4&gt;

&lt;p&gt;These occur when our job has overstayed its welcome. If the job hasn’t completed by the end of its *sigh* time slice, then it is kicked out to be picked up at a later point. This can also happen if a higher-priority job comes in requesting CPU time.&lt;/p&gt;
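&lt;p&gt;None of these counters are unique to SAS. On Unix systems the kernel keeps the same per-process tallies, and Python’s standard library can read them back; a quick sketch for the curious (the resource module is Unix-only and field support varies by platform):&lt;/p&gt;

```python
import resource  # Unix-only standard library module

usage = resource.getrusage(resource.RUSAGE_SELF)

# Minor faults are page reclaims (satisfied in memory); major faults
# needed a disk read. The context switch tallies mirror FULLSTIMER's.
print('page reclaims (minor faults) :', usage.ru_minflt)
print('page faults (major faults)   :', usage.ru_majflt)
print('swaps                        :', usage.ru_nswap)
print('voluntary context switches   :', usage.ru_nvcsw)
print('involuntary context switches :', usage.ru_nivcsw)
```

&lt;p&gt;Run that before and after a chunk of work and the deltas tell the same story FULLSTIMER does.&lt;/p&gt;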

&lt;h4&gt;
  
  
  Block operations
&lt;/h4&gt;

&lt;p&gt;Block operations are the number of times a read or write the size of the buffer occurs. The buffer size, or &lt;em&gt;bufsize&lt;/em&gt;, is a &lt;a href="http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a000131102.htm"&gt;dataset option&lt;/a&gt; that defines the amount of data that can be transferred in a single I/O operation. By default this value is optimised on your system for sequential reads. You may see improved performance for random direct reads by increasing it, but doing so incurs additional memory costs.&lt;/p&gt;

&lt;h4&gt;
  
  
  Input/Output block operations
&lt;/h4&gt;

&lt;p&gt;Now that we’ve defined what a block operation is, it’s fairly simple to see that an input block operation is a &lt;em&gt;bufsize&lt;/em&gt; I/O read and an output block operation is a &lt;em&gt;bufsize&lt;/em&gt; I/O write. Not all of these will be disk operations, as some reads will be of data already stored in memory, as we saw when discussing page reclaims.&lt;/p&gt;

&lt;h4&gt;
  
  
  So what?
&lt;/h4&gt;

&lt;p&gt;Well hey, at least we know what all these things are on FULLSTIMER now. We’ve understood our SAS log a little better. We know that nerds are terrible at naming things. We also have a better understanding of how our system fares with various jobs.&lt;/p&gt;

&lt;p&gt;High values for any of the above statistics can be indicative of poor optimisation, be that in the code, the software or the subsystems. Now that you have a slightly better understanding of the logging, you might be able to narrow down exactly where in the process your job is bottlenecking. Or you can just show off, if your friends are impressed by this sort of knowledge. If they are though, ooof mate, you need better friends.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Originally published at&lt;/em&gt; &lt;a href="https://benjamincorcoran.com/understanding-the-sas-log/"&gt;&lt;em&gt;Benjamin Corcoran&lt;/em&gt;&lt;/a&gt;&lt;em&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>contextswitches</category>
      <category>contextswitch</category>
      <category>blockinputoperatio</category>
      <category>logs</category>
    </item>
    <item>
      <title>Consistency: templating charts with Plotly</title>
      <dc:creator>Ben Corcoran</dc:creator>
      <pubDate>Sun, 19 May 2019 22:10:53 +0000</pubDate>
      <link>https://dev.to/benjcorc/consistency-templating-charts-with-plotly-27aa</link>
      <guid>https://dev.to/benjcorc/consistency-templating-charts-with-plotly-27aa</guid>
      <description>&lt;h4&gt;
  
  
  Exploiting plotly’s template attribute to create consistent charts.
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--tbNU8ZoB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2A5fJ1LHO1H-5e_AvjTeiODw.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--tbNU8ZoB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2A5fJ1LHO1H-5e_AvjTeiODw.jpeg" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Great publications have great charts. The Economist and the New York Times to name but two examples. The reason these stand out to me is &lt;strong&gt;consistency&lt;/strong&gt;. When you look at an Economist chart, you know it’s an Economist chart.&lt;/p&gt;

&lt;p&gt;The best thing about consistency is that your charts don’t even have to look that good. In the above example, we’re not seeing anything special. In fact “Education, years” has way too many annotations and is incredibly busy. But the consistency trumps that. It causes that little spurt of happychem in the back of your brain that comes from spotting some order in the world.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--HKaktvBh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AtnAJ9ccsY-UHB4xbO7lgYg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--HKaktvBh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AtnAJ9ccsY-UHB4xbO7lgYg.png" alt=""&gt;&lt;/a&gt;An example of The Economist’s charting&lt;/p&gt;

&lt;p&gt;A more brand-focused person might suggest that it also helps in establishing a familiarity, the same way any good house styling or logo set might. But really, it’s all about that high.&lt;/p&gt;

&lt;h4&gt;
  
  
  Junkie to dealer
&lt;/h4&gt;

&lt;p&gt;Plotly is a high-level library built on D3. It exploits all of D3’s fantastic charting powers without suffering from its excruciating learning curve. It’s a great charting platform, but one gripe was the difficulty in creating consistent looking charts.&lt;/p&gt;

&lt;p&gt;Styling with plotly is easy enough. The entire plot, data and styling, is contained in a single JSON object. Once you’ve added your data it’s simply a case of setting various attributes until the chart begins to resemble the one that’s in your head. That JSON is then pushed through one of &lt;a href="https://plot.ly/graphing-libraries/"&gt;plotly’s libraries&lt;/a&gt; and hey presto you’ve got your chart.&lt;/p&gt;
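&lt;p&gt;To make that concrete, here is a minimal sketch of such a chart object as a Python dictionary (the data and styling values are purely illustrative):&lt;/p&gt;

```python
import json

# The whole chart lives in one JSON object: traces under "data",
# styling under "layout". All values here are illustrative.
chart = {
    "data": [
        {"type": "bar",
         "x": ["giraffes", "orangutans", "monkeys"],
         "y": [20, 14, 23]}
    ],
    "layout": {
        "title": {"text": "Zoo census"},
        "paper_bgcolor": "pink",
    },
}

# Serialised, this is what gets handed to one of plotly's libraries to render.
chart_json = json.dumps(chart)
```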

&lt;p&gt;The problem is that we have to do this all over again for the next chart. God forbid the next 10. There are of course ways of producing these in batch. Plotly has a python library that makes programmatically producing charts very easy. There isn’t at the moment a way of creating a template within python’s library. So we could try extracting the styling part of the JSON and then reapply it to the next chart. But just writing that out is a faff.&lt;/p&gt;

&lt;p&gt;Let’s say that you do manage to create a workflow that lets you reapply your styling. What happens when you realise that your Comic Sans titles might not be doing you any favours? How do you update all of your live charts to your new theme? Write a script to find all your live charts, download, retheme, reupload, faff, faff, faff.&lt;/p&gt;

&lt;h4&gt;
  
  
  Plotly’s Template attribute
&lt;/h4&gt;

&lt;p&gt;As of plotly 3.4, we have a template attribute we can use to solve all our problems. Documentation is a little thin on the ground at the moment. There are a couple of introductory articles &lt;a href="https://medium.com/@plotlygraphs/introducing-plotly-py-theming-b644109ac9c7"&gt;here&lt;/a&gt; and &lt;a href="https://plot.ly/javascript/layout-template/"&gt;here&lt;/a&gt; that give an overview.&lt;/p&gt;

&lt;p&gt;Essentially, you recreate the chart JSON within itself. The template can have data, layout, annotations and images as elements. Each element applies conditions in the same way as its main chart JSON counterpart. The difference is that if the template defines an element that is already set in the main chart, the main chart's definition takes precedence.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Unthemed
{
 "data": [
 {
 "x": [
 "giraffes", 
 "orangutans", 
 "monkeys"
 ], 
 "y": [
 20, 
 14, 
 23
 ], 
 "type": "bar"
 }
 ],
 "layout":{
 "template":{
 "layout":{

 }
 }
 }
}

#Themed

{
 "data": [
 {
 "x": [
 "giraffes", 
 "orangutans", 
 "monkeys"
 ], 
 "y": [
 20, 
 14, 
 23
 ], 
 "type": "bar"
 }
 ],
 "layout":{
 "template":{
 "layout":{
 "paper\_bgcolor":"pink"
 }
 }
 }
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ZXL1olbx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/839/1%2Al-t0iwwPrcJaMcdpoqAJBg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ZXL1olbx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/839/1%2Al-t0iwwPrcJaMcdpoqAJBg.png" alt=""&gt;&lt;/a&gt;Unthemed&lt;/p&gt;

&lt;p&gt;In the above example, we’re able to set the paper background colour in the template only. The resulting charts behave as we’d expect.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---3TpBDPT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/867/1%2At-spejypR2fJFengI1Q3lQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---3TpBDPT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/867/1%2At-spejypR2fJFengI1Q3lQ.png" alt=""&gt;&lt;/a&gt;Themed&lt;/p&gt;

&lt;p&gt;Had we set the paper background colour directly in the chart’s layout, that direct setting would have overridden the template, and the chart would not have been rendered pink.&lt;/p&gt;
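&lt;p&gt;That precedence rule can be sketched as a recursive merge in which the chart's own values always win. This is an illustration of the behaviour, not plotly's actual implementation:&lt;/p&gt;

```python
# Sketch of the template precedence rule: template values act as defaults,
# and anything set directly on the chart overrides them.
def apply_template(chart_layout, template_layout):
    merged = dict(template_layout)          # start from the template
    for key, value in chart_layout.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = apply_template(value, merged[key])
        else:
            merged[key] = value             # the chart's definition wins
    return merged

template = {"paper_bgcolor": "pink", "font": {"family": "Helvetica", "size": 12}}
chart = {"paper_bgcolor": "white", "font": {"size": 14}}

merged = apply_template(chart, template)
# paper_bgcolor stays "white"; the template only fills the gaps.
```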

&lt;h4&gt;
  
  
  Theming multiple chart types
&lt;/h4&gt;

&lt;p&gt;Just as we can set templates for the chart's layout, we can set default values for parts of the data. In the data section we assign our template to a type of chart. In the example below we've set up a template for the colour of the bars in a bar chart. The bar element's value is a list of dictionaries, one defining each individual trace.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;template":{
 "data":{
 "bar":[{"marker":{"color":"pink"}}],
 }
 "layout":{
 ...
 }
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Here we’ve themed the first trace of the bar chart, setting the colour of the bars to pink. We can extend this for each trace until we’ve built up an entire colour scheme.&lt;/p&gt;
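&lt;p&gt;The way trace defaults line up can be sketched in Python: trace &lt;em&gt;i&lt;/em&gt; picks up the &lt;em&gt;i&lt;/em&gt;-th entry in the template's list for that chart type, cycling around once the list is exhausted (again, a sketch of the behaviour, not plotly's internals):&lt;/p&gt;

```python
from itertools import cycle, islice

# Per-type trace defaults: the template holds a list of default dicts,
# and each trace takes the next entry, cycling when traces outnumber entries.
template_bar = [{"marker": {"color": "pink"}},
                {"marker": {"color": "purple"}}]

traces = [{"type": "bar", "y": [1]},
          {"type": "bar", "y": [2]},
          {"type": "bar", "y": [3]}]

defaults = islice(cycle(template_bar), len(traces))
themed = [{**default, **trace} for default, trace in zip(defaults, traces)]
# First trace pink, second purple, third cycles back to pink.
```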

&lt;p&gt;We can extend this further still to include multiple chart types, giving us a single consistent template that works across all of them while still allowing for differences between types. Below are two charts created from the same template.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"template":{
 "data":{
 "bar":[{"marker":{"color":"pink"}},{"marker":{"color":"purple"}}],
 "scatter":[{"marker":{"color":"green"}},{"marker":{"color":"red"}}]
 }
 "layout":{
 ...
 }
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--peVnK2I8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/807/1%2AlGR6Bq0Uvtk8_3zuIcvuQQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--peVnK2I8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/807/1%2AlGR6Bq0Uvtk8_3zuIcvuQQ.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Fn0gkPpb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/815/1%2A7ex-Tursq3chCE8SE-lvLQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Fn0gkPpb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/815/1%2A7ex-Tursq3chCE8SE-lvLQ.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Separating the data from the design
&lt;/h4&gt;

&lt;p&gt;This is all fine and dandy but we’ve not really solved our problem. We’ve made it a little easier to theme a chart but we still have to stitch together a huge JSON in order for plotly to render it.&lt;/p&gt;

&lt;p&gt;Except we don’t.&lt;/p&gt;

&lt;p&gt;Now that we have our template, we’ve not much use for all the properties that style our chart within the chart itself. So let’s separate the two. We can keep all our chart’s essential data like the x and y values and perhaps some required formatting in the main chart JSON. In a new JSON we put our template object. As the template object isn’t going to change it makes sense to keep the two apart.&lt;/p&gt;

&lt;p&gt;The only point at which we need to combine the two JSONs is when they’re delivered to the end user. As plotly has already gone to the trouble of building a JavaScript library that allows separate data and layout to be rendered together on the fly, it would be foolish not to take advantage. We simply pass our data as data, and our template as the only element of an otherwise empty layout object.&lt;/p&gt;

&lt;p&gt;Below is part of the JavaScript that dynamically renders plotly charts on this &lt;a href="https://www.benjamincorcoran.com"&gt;site&lt;/a&gt;. A small PHP script is called to load chart data from a database; this is combined with a template JSON stored on the site.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;var chartJSON = $.ajax(
 {
 url:'/resources/getChart.php?chartID='+chartID,
 dataType:'json',
 type:'GET'
 });
if(window.tmpltJSON == null){
 window.tmpltJSON = $.ajax(
 {
 url:'/resources/plotlyTemplate.json',
 dataType:'json',
 type:'GET'
 });
};

$.when.apply($,[chartJSON,window.tmpltJSON])
 .then(function(){
 var chart = JSON.parse(chartJSON.responseJSON);
 var tmplt = window.tmpltJSON.responseJSON;
 plotDiv = document.getElementById("pie");
 Plotly.newPlot(plotDiv,chart,tmplt,{'responsive':true});
 }
);
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Now all my charts are themed when rendered for the end user. If I make changes to plotlyTemplate.json, they are immediately picked up by every chart on the site.&lt;/p&gt;

&lt;p&gt;All of which means we never need to worry about styling a chart again. We have a setup that only produces consistency. Well, at least until you make the mistake of opening it on a phone. I mean, how can such a small screen do so much damage?&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Originally published at&lt;/em&gt; &lt;a href="https://benjamincorcoran.com/consistency-templating-charts-with-plotly/"&gt;&lt;em&gt;Benjamin Corcoran&lt;/em&gt;&lt;/a&gt;&lt;em&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>dataviz</category>
      <category>javascript</category>
      <category>data</category>
      <category>datavisualisation</category>
    </item>
    <item>
      <title>SAS Program Timer</title>
      <dc:creator>Ben Corcoran</dc:creator>
      <pubDate>Sun, 12 May 2019 10:28:08 +0000</pubDate>
      <link>https://dev.to/benjcorc/sas-program-timer-2k2g</link>
      <guid>https://dev.to/benjcorc/sas-program-timer-2k2g</guid>
      <description>&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F1%2AErzZrvFargjm132P8ptRFQ.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F1%2AErzZrvFargjm132P8ptRFQ.jpeg"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;SAS is great. Well, SAS is okay. Eh, 70% of the time SAS is fine. The issue with SAS, or rather one of the issues, is its continued and almost dogged refusal to see the wood for the trees. Let’s improve that a little by building a program timer.&lt;/p&gt;

&lt;p&gt;Grab this from &lt;a href="https://gist.github.com/benjamincorcoran/17a97dd5485aa9997c43b8dc3c25d330" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  What we’re currently working with…
&lt;/h3&gt;

&lt;p&gt;Let it not be said that SAS doesn’t make an effort. In fact, it would be a fairly extreme departure from common sense if SAS did not note somewhere the running times for the individual steps of a program. An inspection of any SAS log will show data steps and procedures producing output like the following.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;10 proc sort data=logsample; **[12]**
11 by LastName;
12

NOTE: There were 5 observations read from the dataset
 WORK.LOGSAMPLE. 
NOTE: The data set WORK.LOGSAMPLE has 5 observations and 10
 variables. **[13]**
NOTE: PROCEDURE SORT used:
 real time 0.16 seconds
 cpu time 0.03 seconds
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here we can see that the PROCEDURE SORT used 0.16 seconds of real time, of which 0.03 seconds was CPU time. The difference doesn’t matter for our purposes, but these notes are discussed in more depth elsewhere.&lt;/p&gt;

&lt;p&gt;This is all very nice to know, but in the more usual case that you're working with hundreds of different procedures, data steps and god knows what else, it would be useful to get an overall time for the entire program, not just the individual steps.&lt;/p&gt;

&lt;h3&gt;
  
  
  A SAS Program Timer
&lt;/h3&gt;

&lt;p&gt;First things first, let’s grab the time at which the program was started. Not as simple as you might expect. SAS doesn’t store a macro variable with the time of execution. SAS does have &lt;a href="http://www.sascommunity.org/wiki/Automatic_macro_variables" rel="noopener noreferrer"&gt;automatically generated macro variables&lt;/a&gt; such as &lt;a href="https://support.sas.com/documentation/cdl/en/mcrolref/61885/HTML/default/viewer.htm#a000543691.htm" rel="noopener noreferrer"&gt;SYSTIME&lt;/a&gt;. However, these refer to the time the &lt;strong&gt;session&lt;/strong&gt; was opened, and not when we explicitly started the program.&lt;/p&gt;

&lt;p&gt;As such we need to create our own macro variable. Let’s call it launchTime to avoid any confusion with other macro variables that might be hanging around.&lt;/p&gt;

&lt;h3&gt;
  
  
  Launching and landing
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;%let launchTime = %sysfunc(time());
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If we place the above snippet at the start of our program, we will capture our start time. Next, let’s capture the end time, which fortunately is just the same snippet again! We only need to rename the macro variable.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;%let landTime = %sysfunc(time());

%let timeTaken = %sysevalf(&amp;amp;landTime.-&amp;amp;launchTime.);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We have our two times; the total run time for the entire program is simply the difference. Done! We’ve successfully timed our program. All we need to do is %PUT our timeTaken macro variable. Alas, in its current form, our timeTaken variable is just the number of seconds taken. It would be nice if we could format this into hours, minutes and seconds.&lt;/p&gt;

&lt;h3&gt;
  
  
  Formatting, formatting, formatting
&lt;/h3&gt;

&lt;p&gt;In order to do so, we need to convert this from a time delta (a difference between two times) into something that SAS’s DateTime formats can understand, by adding on the reference point for all SAS DateTime objects: 1st January 1960.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;%let timeTaken = %sysevalf(mdy(1,1,1960)+&amp;amp;landTime.-&amp;amp;launchTime.);

%let timeTakenFmt = %sysfunc(putn(&amp;amp;timeTaken.,e8601tm15.6));
%put &amp;amp;timeTakenFmt.;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then all we need to do is add on our desired format. I’ve picked e8601tm15.6 which gives a nicely formatted time to the millisecond but feel free to use any one of SAS’s &lt;a href="https://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a001263753.htm" rel="noopener noreferrer"&gt;many built-in formats&lt;/a&gt;.&lt;/p&gt;
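&lt;p&gt;For comparison, the same launch/land arithmetic and hours-minutes-seconds formatting can be sketched outside SAS in Python (the names and the fixed timings here are hypothetical):&lt;/p&gt;

```python
# Hypothetical Python analogue of the launchTime/landTime pattern.
# In a real run time.time() would supply both values; fixed numbers
# keep the example deterministic.
launch_time = 1000.0                 # seconds, e.g. from time.time()
land_time = launch_time + 3725.5     # pretend the job took 1h 2m 5.5s

time_taken = land_time - launch_time
hours, rem = divmod(time_taken, 3600)
minutes, seconds = divmod(rem, 60)
formatted = f"{int(hours):02d}:{int(minutes):02d}:{seconds:09.6f}"
# formatted == "01:02:05.500000", an hh:mm:ss string to the microsecond
```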

&lt;p&gt;This is fine, but a bit of a pain in the arse if we have to copy these lines into every script that we work with from now on.&lt;/p&gt;

&lt;h3&gt;
  
  
  Enterprise Guide ain’t all bad.
&lt;/h3&gt;

&lt;p&gt;Enterprise Guide to the rescue! In EG we are able to set scripts that execute at the beginning and end of every program execution.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F695%2F1%2A0kggqpaf6WiQrTlSieLodA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F695%2F1%2A0kggqpaf6WiQrTlSieLodA.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In Options &amp;gt; SAS Programs &amp;gt; Additional SAS Code, there are two checkboxes: &lt;em&gt;‘Insert custom SAS code before submitted code’&lt;/em&gt; and &lt;em&gt;‘Insert custom SAS code after submitted code’&lt;/em&gt;. Place our launchTime macro variable definition in the &lt;em&gt;before&lt;/em&gt; section, and our landTime macro variable definition and %PUT in the &lt;em&gt;after&lt;/em&gt; section, and every program from now on will be timed!&lt;/p&gt;

&lt;h3&gt;
  
  
  Extra Credit
&lt;/h3&gt;

&lt;p&gt;To make this information stand out in the log, add an ASCII box, tab it away from the left-hand side, and give it a title and context.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;%let tab = %str( );

%put ;
%put &amp;amp;tab.############################### ;
%put &amp;amp;tab.# SAS CODE SUMMARY #;
%put &amp;amp;tab.# TIME TAKEN: &amp;amp;timeTakenFmt. #;
%PUT &amp;amp;TAB.# #; 
%PUT &amp;amp;TAB.###############################;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the end, your log should look something like this…&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F531%2F1%2AvraDk7hbkPciwQTj-ZyAcQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F531%2F1%2AvraDk7hbkPciwQTj-ZyAcQ.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Grab this from &lt;a href="https://gist.github.com/benjamincorcoran/17a97dd5485aa9997c43b8dc3c25d330" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;&lt;br&gt;&lt;br&gt;
Website: &lt;a href="https://benjamincorcoran.com" rel="noopener noreferrer"&gt;https://www.benjamincorcoran.com&lt;/a&gt;&lt;/p&gt;

</description>
      <category>sasenterpriseguide</category>
      <category>sass</category>
      <category>programming</category>
      <category>saseg</category>
    </item>
  </channel>
</rss>
