DEV Community: Uman Shahzad

Oxylabs Python SDK

Uman Shahzad — Mon, 24 Jun 2024 06:23:25 +0000

Hello everyone! We've created a Python SDK for the Oxylabs Scraper APIs to help simplify integrating with Oxylabs's APIs.

You can find it here: https://pypi.org/project/oxylabs/

We would appreciate it if Oxylabs users could try it out and share their feedback so we can start addressing it and improving the SDK.

Here's a quick start example:

from oxylabs import RealtimeClient

# Set your Oxylabs API Credentials.
username = "username"
password = "password"

# Initialize the SERP Realtime client with your credentials.
c = RealtimeClient(username, password)

# Use `bing_search` as a source to scrape Bing with nike as a query.
res = c.serp.bing.scrape_search("nike")

print(res.raw)

It works with both the real-time and async integration methods, and makes it especially easy to use the latter, which is otherwise quite tedious.

Source code is at https://github.com/oxylabs/oxylabs-sdk-python

Thank you!

How To Make A Custom Splunk Command

Uman Shahzad — Mon, 05 Feb 2024 23:34:30 +0000

Step 0: Understanding the types of commands

You can think of Splunk custom search commands as specialized command components within Splunk apps that understand your unique needs.

Each command has a specific type that serves as the boundary of how it can work and interact with other commands. Here are the different types:

Streaming commands: Process results one event at a time, applying transformations individually.
Transforming commands: Arrange results into a data table for statistical analysis.
Generating commands: Retrieve information from indexes without modifications.
Dataset processing commands: Require the complete dataset before execution.
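
To make the difference between these types more concrete, here is a minimal plain-Python sketch that does not use Splunk at all. The function names and sample events are made up for illustration; the sketch only mimics how a streaming command handles events one by one, while a transforming command needs all of them to build a table.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, List

Event = Dict[str, str]


def streaming_upper(events: Iterable[Event]) -> Iterator[Event]:
    """Streaming style: handle one event at a time, never looking at the rest."""
    for event in events:
        yield {**event, "host": event["host"].upper()}


def transforming_count_by_host(events: Iterable[Event]) -> List[Dict[str, object]]:
    """Transforming style: consume every event, then return a summary table."""
    counts = Counter(event["host"] for event in events)
    return [{"host": host, "count": n} for host, n in sorted(counts.items())]


# Hypothetical sample events, standing in for search results.
sample = [{"host": "web1"}, {"host": "web2"}, {"host": "web1"}]
streamed = list(streaming_upper(sample))
table = transforming_count_by_host(sample)
```

The streaming function can start yielding as soon as the first event arrives, whereas the transforming function cannot return anything until it has seen the last one.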

From here, let's take everything in steps - by completing the 4 easy steps below, you'll have a functioning custom search command.

Step 1: Setting up your Splunk environment

Let's first get the Splunk home directory under control.

$SPLUNK_HOME is an environment variable used in various places, and its value depends on where you have installed Splunk Enterprise on your machine. Please set it to the right directory before continuing. To check its current value:

On Linux or macOS: Open a terminal window and type echo $SPLUNK_HOME.
On Windows: Open a Command Prompt window and type echo %SPLUNK_HOME%.

Now in order to execute a custom command, we must first have Splunk Enterprise running. Please download the most recent version of Splunk Enterprise.

After downloading and unpacking, you can launch Splunk via the terminal by executing:

$SPLUNK_HOME/bin/splunk start

After this is successful, Splunk Enterprise will be available at http://localhost:8000 in your browser.

To create an app, open the above page and navigate to Manage Apps, click the Create app button, and enter your app's name along with some additional information like folder name, version, author and description.

After pressing "Save!", you'll have a Splunk app and can access the app's directory at $SPLUNK_HOME/etc/apps/my_custom_command_app/.

Now that we have a container for custom commands, you must choose a programming language to write the code of the command in.

Python is the most popular language used among Splunk developers and we'll be using that going forward, but you can also use Java - the choice is yours!

Step 2: Writing the script for your command

Let's assume our custom command name is my_custom_command.

Now that the entire environment for the custom command has been configured, we need to create a script for it at $SPLUNK_HOME/etc/apps/my_custom_command_app/bin/my_custom_command.py. Note that the name of the script (the .py file) is the same as the command name - this is not required, but good practice.

For writing the script we need the Splunk Enterprise SDK for Python, which you can copy and paste into the following folder:

$SPLUNK_HOME/etc/apps/my_custom_command_app/bin/splunklib

Below are examples of each of the different types of commands - streaming, transforming, generating and dataset processing. Pick one and move on to the next step.

You can also feel free to modify the code to play around with it right now.

Streaming

import sys
from splunklib.searchcommands import dispatch, StreamingCommand, Configuration


@Configuration()
class CustomCommand(StreamingCommand):
    def stream(self, events):
        for event in events:
            yield event


dispatch(CustomCommand, sys.argv, sys.stdin, sys.stdout, __name__)

Reporting / Transforming

import sys
from splunklib.searchcommands import dispatch, ReportingCommand, Configuration


@Configuration()
class CustomCommand(ReportingCommand):
    @Configuration()
    def map(self, events):
        pass

    def reduce(self, events):
        pass


dispatch(CustomCommand, sys.argv, sys.stdin, sys.stdout, __name__)

Generating

import sys
from splunklib.searchcommands import dispatch, GeneratingCommand, Configuration


@Configuration()
class CustomCommand(GeneratingCommand):
    def generate(self):
        pass


dispatch(CustomCommand, sys.argv, sys.stdin, sys.stdout, __name__)

Eventing / Data Processing

import sys
from splunklib.searchcommands import dispatch, EventingCommand, Configuration


@Configuration()
class CustomCommand(EventingCommand):
    def transform(self, events):
        pass


dispatch(CustomCommand, sys.argv, sys.stdin, sys.stdout, __name__)

Step 3: Tell Splunk about it

Now how does Splunk know we made a custom command? We register it within our app with two config files - commands.conf and searchbnf.conf. Create these files in your app's default directory.

commands.conf path & content:

$SPLUNK_HOME/etc/apps/my_custom_command_app/default/commands.conf

[my_custom_command]
python.version = python3
filename = my_custom_command.py
chunked = true

searchbnf.conf path & content:

$SPLUNK_HOME/etc/apps/my_custom_command_app/default/searchbnf.conf

[my_custom_command-command]
syntax = [my_custom_command]
shortdesc = [A short description of your custom command]
usage = public

Step 4: Restart Splunk and test it

We can now restart Splunk to apply the new configuration using this command.

$SPLUNK_HOME/bin/splunk restart

You can also do this directly from the UI via Settings > Server Controls > Restart Splunk.

Now you can test it in the "Search and Reporting" app with this SPL command:

index=_internal | my_custom_command

Congratulations! You've successfully created a custom search command in your Splunk app. Now you can enhance and modify the script based on your specific use case.

Bonus: You can check out the Splunk-app-examples which is a treasure chest of ready-made templates and inspiration for your Splunk custom search commands.

Happy Splunking!

High-level Parsing Concepts

Uman Shahzad — Mon, 05 Feb 2024 23:27:51 +0000

This post will teach you some general things you need to know and think about when parsing out structured content inside a buffer. It will particularly provide questions to ask and points to ponder when approaching any parsing problem, along with some real-world examples.

We will not talk about any particular algorithm, e.g. those used in compilers or other sophisticated programs with large and complex parsing steps. However, all of the concepts explained here will be useful in one way or another in any parsing context.

Whole vs. Chunks

One of the most important parameters in deciding a parsing strategy is how the input is received; is it one buffer that contains the entirety of the structure, or a partial buffer (a chunk) that gets more data over time (streaming)?

When dealing with the entirety of the structure in a single buffer at once, the algorithm is usually very simple to express. You can have a simple loop that goes from start to finish in the buffer, and keep state and context data relevant to the problem updated as the loop proceeds. There is rarely a reason to backtrack in the buffer itself, since out-of-band state and context data should be correctly kept up-to-date regarding what's been parsed so far.

When dealing with only chunks of the structure at a time, where chunk boundaries may occur at arbitrary points in the structure, parsing becomes significantly more difficult because the control flow is harder to express; a simple loop is rarely sufficient. However, by asking certain questions and coming up with the right solutions for your particular problem, the parse can be made a lot simpler:

What is the input source?
Is the input going to be retrieved in one big chunk, or in multiple fixed-sized chunks, or in multiple variable-sized chunks? And is the chunk size going to be sufficient for performing all parse steps, or will we need to buffer previous chunks until we get 1 bigger chunk that can be properly parsed?
What non-buffer data (i.e. context/state) is needed to help perform a later parse step given the current chunk?
How is the data encoded inside the buffer going to be output to the user after parsing?
What guarantees do we have about total size? Do we know in advance, or do we need to track how much has been input so far, in order to enforce limits?
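
As a rough illustration of these questions, here is a minimal sketch of a parser for newline-terminated records that arrive in variable-sized chunks. The LineParser class, its feed method and the 1024-byte limit are hypothetical names and values chosen for this example, not part of any library.

```python
from typing import List


class LineParser:
    """Parses newline-terminated records from chunks that may split a record anywhere."""

    def __init__(self, max_total_bytes: int = 1024) -> None:
        self.pending = b""                      # bytes of a record that is not complete yet
        self.seen = 0                           # total bytes received so far
        self.max_total_bytes = max_total_bytes  # assumed limit, enforced while streaming

    def feed(self, chunk: bytes) -> List[bytes]:
        """Accept one chunk and return every record it completes."""
        self.seen += len(chunk)
        if self.seen > self.max_total_bytes:
            raise ValueError("input exceeds the configured size limit")
        data = self.pending + chunk
        # Everything before the last newline is complete; the rest stays buffered.
        *complete, self.pending = data.split(b"\n")
        return complete


parser = LineParser()
records: List[bytes] = []
for chunk in (b"al", b"pha\nbe", b"ta\n", b"gam"):
    records.extend(parser.feed(chunk))
```

The buffer (pending) answers the "will we need to buffer previous chunks" question, while the counter (seen) is the non-buffer state needed to enforce a total size limit when the size is not known in advance.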

Examples

In each example, we ask ourselves relevant questions and answer them in order to come to a reasonable parsing strategy.

Stream Parsing JSON

Q: What is the input source?

Assume the input source to be a network socket using a streaming protocol like TCP.

Q: What is the chunk size, and can all parsing steps succeed on a chunk of that size, or will buffering be needed?

Data from the network socket will come in streams, and JSON itself contains elements of varying lengths, so a single size will not suffice for all possible entities, nor does it align itself well with the streaming input model of the socket. So, we will use whatever comes from the socket first, and if it doesn't contain enough data for 1 successful parse step, we will buffer that data and ask for more, appending to the buffer until we can successfully complete a parse step. Then we can deallocate the buffer and start with a new one for more data, until the end of the JSON representation.

Q: What non-buffer data is needed?

We need several pieces of state information. For example:

Has the JSON document just started?
Has the JSON document just ended?
Have we just read in a key in an object, and are now expecting a value?
Have we just read in a value and a comma, and are now expecting another key (if in an object) or a value (if in an array)?
Has a new array just started?
Has a new object just started?
Has an array just ended?
Has an object just ended?

Additionally, we need to know how deep we are in the JSON hierarchy, and whether the parent layers were arrays or objects. This is important because as we "close" an object or array, we need to know whether the new scope is within an object or array, so we can properly confirm whether the scope will be closed appropriately, and whether to expect a key or value.

In other words, lots of data is needed outside of the buffer itself as we parse. We cannot ensure a fully correct parse otherwise.
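
A minimal sketch of the depth-tracking part of that state could look like the following. It only follows strings and brackets, not full JSON grammar, and the NestingTracker name and the default depth of 200 are assumptions made for this example.

```python
from typing import List

OPENERS = {"{": "object", "[": "array"}
CLOSERS = {"}": "object", "]": "array"}


class NestingTracker:
    """Tracks which object/array scopes are open across chunk boundaries."""

    def __init__(self, max_depth: int = 200) -> None:
        self.stack: List[str] = []  # one entry per open scope, innermost last
        self.max_depth = max_depth
        self.in_string = False      # are we inside a double-quoted string?
        self.escaped = False        # was the previous string character a backslash?

    def feed(self, chunk: str) -> None:
        for ch in chunk:
            if self.in_string:
                if self.escaped:
                    self.escaped = False
                elif ch == "\\":
                    self.escaped = True
                elif ch == '"':
                    self.in_string = False
            elif ch == '"':
                self.in_string = True
            elif ch in OPENERS:
                if len(self.stack) >= self.max_depth:
                    raise ValueError("nesting is deeper than the configured limit")
                self.stack.append(OPENERS[ch])
            elif ch in CLOSERS:
                # The scope being closed must match the innermost open scope.
                if not self.stack or self.stack[-1] != CLOSERS[ch]:
                    raise ValueError(f"unexpected {ch!r}")
                self.stack.pop()


tracker = NestingTracker()
tracker.feed('{"a": [1, {"b": "}')  # the chunk ends in the middle of a string
stack_mid_stream = list(tracker.stack)
tracker.feed('"}]}')                # the rest of the document
```

Because the state lives outside the buffer, the first chunk can be thrown away once it has been fed, even though it ended inside a string value.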

Q: How is the data encoded inside the buffer going to be output to the user after parsing?

In the interest of making the fewest assumptions about the user's desired non-buffer data format, we will only choose to output individual data types which have a 1-to-1 correspondence between JSON and the programming language used.

So, if we just completed the parse for a number, and it was in floating-point format, we can use one of f64_t or f128_t depending upon the size of the number. If it was an unsigned number, we will output a u64_t. If it was a negative number, then s64_t. If a string (whether key or value), then a str_t. And so on.

We do not discriminate against objects or arrays; the user will have to check the current JSON parse state to figure out in what context (whether object or array) the data was found, in case they need to construct an array or object of their own format (which may be a hashtable, a linked list, or any other format) in their application.

Q: What guarantees do we have about total size?

When receiving from the network, especially if received from the open internet where anyone may send a message, malicious or otherwise, we cannot make many guarantees about the total input size.

Instead, we can set internal limits, such as, for example:

A maximum string size of 1MB.
A maximum object/array hierarchy depth of 200.
A maximum total document size of 5MB.

There may be more limits, and all of these limits will be configurable. So a user trying to parse trusted JSON from disk can choose to remove the document size limit completely because they may be parsing gigabytes worth of JSON. Or a user who knows that some data coming from a web request will contain strings of sizes up to 5MB can change the string size limit and the total document size limit to compensate for that use case.
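
One possible way to express such configurable limits is sketched below. The ParseLimits name, its field names and the choice of 1MB = 1024 * 1024 bytes are assumptions for illustration; None is used here to mean "no limit".

```python
from dataclasses import dataclass
from typing import Optional

MB = 1024 * 1024  # assumed definition of a megabyte for this sketch


@dataclass
class ParseLimits:
    max_string_bytes: Optional[int] = 1 * MB
    max_depth: Optional[int] = 200
    max_document_bytes: Optional[int] = 5 * MB


def within(limit: Optional[int], value: int) -> bool:
    """Return True if value respects the limit; None disables the check."""
    return limit is None or value <= limit


# Defaults for untrusted input from the network.
network_limits = ParseLimits()
# Trusted JSON from disk: no total document size limit.
disk_limits = ParseLimits(max_document_bytes=None)
# Web requests known to carry strings of up to 5MB.
big_string_limits = ParseLimits(max_string_bytes=5 * MB, max_document_bytes=8 * MB)
```

The parser would call a check like within() each time it updates its running counters, rather than only once at the end of the document.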

On-Disk CSV File Parsing

Q: What is the input source?

The input source is a file on a local filesystem. The entirety of the file will be loaded before parsing starts.

Q: What is the chunk size, and can all parsing steps succeed on a chunk of that size, or will buffering be needed?

Because we have the entirety of the CSV data in-memory, our "chunk" is really the entire file and so can be trivially parsed successfully by any parsing step.

Q: What non-buffer data is needed?

We can keep track of several pieces of state information, some necessary for correct parsing and others necessary for debugging/etc. For example:

How many total columns are expected from the header?
What column are we in right now?
What row are we in right now?
Are we inside of a double-quoted cell?
Has the header been parsed already?
Did a new row just start?
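
The following is a minimal sketch of a single-pass loop over the whole buffer that keeps this kind of state. It assumes "\n" line endings and double quotes escaped by doubling them; parse_csv is a name made up for this example, and real-world code would more likely rely on an existing CSV library.

```python
from typing import List


def parse_csv(text: str) -> List[List[str]]:
    """Parse an entire CSV buffer in one pass, keeping state outside the buffer."""
    rows: List[List[str]] = []
    row: List[str] = []    # cells of the row we are currently in
    cell: List[str] = []   # characters of the cell we are currently in
    in_quotes = False      # are we inside a double-quoted cell?
    i = 0
    while i < len(text):
        ch = text[i]
        if in_quotes:
            if ch == '"' and i + 1 < len(text) and text[i + 1] == '"':
                cell.append('"')  # a doubled quote is a literal quote
                i += 1
            elif ch == '"':
                in_quotes = False
            else:
                cell.append(ch)
        elif ch == '"':
            in_quotes = True
        elif ch == ",":
            row.append("".join(cell))
            cell = []
        elif ch == "\n":
            row.append("".join(cell))
            rows.append(row)
            row, cell = [], []
        else:
            cell.append(ch)
        i += 1
    if cell or row:  # last row without a trailing newline
        row.append("".join(cell))
        rows.append(row)
    # The header tells us how many columns every other row must have.
    expected = len(rows[0]) if rows else 0
    for number, parsed in enumerate(rows[1:], start=2):
        if len(parsed) != expected:
            raise ValueError(f"row {number}: expected {expected} columns, got {len(parsed)}")
    return rows


parsed_rows = parse_csv('name,note\n"Smith, J","said ""hi"""\n')
```

Note that the loop never moves backwards in the buffer; the in_quotes flag and the current row and cell carry everything it needs.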

Q: How is the data encoded inside the buffer going to be output to the user after parsing?

Because CSV only specifies data as strings, by default we will parse each cell as a buf_t. The user can then perform further parsing on the individual cell depending upon which column it was in. For example, it may contain a number or a boolean value which can be converted to an s32_t or a bool_t.
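
As a small sketch of that second step, the user could keep a table of converters keyed by column name. The column names, the CONVERTERS table and the accepted boolean spellings below are all hypothetical.

```python
from typing import Callable, Dict, List


def to_bool(cell: str) -> bool:
    """Convert a cell to a boolean, accepting a few assumed spellings."""
    lowered = cell.strip().lower()
    if lowered in ("true", "1"):
        return True
    if lowered in ("false", "0"):
        return False
    raise ValueError(f"not a boolean: {cell!r}")


# Hypothetical per-column converters; columns not listed stay as strings.
CONVERTERS: Dict[str, Callable[[str], object]] = {"age": int, "active": to_bool}


def convert_row(header: List[str], row: List[str]) -> Dict[str, object]:
    """Apply the converter registered for each column to the matching cell."""
    return {name: CONVERTERS.get(name, str)(cell) for name, cell in zip(header, row)}


converted = convert_row(["name", "age", "active"], ["Ada", "36", "true"])
```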

Q: What guarantees do we have about total size?

We know the exact size of the full CSV file because that information is available via the filesystem.

Structure Assumptions

The source of data may or may not be reliable. Getting data from the local filesystem that was written by an internal system poses an entirely different parsing problem than getting data from the network sent by an arbitrary host following some protocol defined in a public IETF RFC.

When receiving data for which we can make very strong guarantees regarding structure, parsing can generally avoid looking for error conditions and anomalous circumstances. But if the data comes from an untrusted arbitrary source, little to no assumption can be made about what's being received.

If few assumptions can be made about the data, what to do depends on the particular problem. Generally, however, it is useful to put yourself in the shoes of an adversary who can input arbitrary data into your parsing system, sometimes fuzzy and sometimes semi-structured.

When the consequence of a crash caused by an internal error in the parsing system from an unhandled case is low, it is generally acceptable to allow it and solve it iteratively, especially if the particular parsing problem is very complex, such as when parsing a programming language locally with a compiler.

JWT Revocation

Uman Shahzad — Mon, 05 Feb 2024 23:26:13 +0000

Among the many criticisms leveled at stateless tokens, specifically JWT, the only one I ever found reasonable is that stateless tokens usually have no built-in or simple means of revocation.

JWT lovers respond that a short enough expiry is enough of a revocation scheme for them. JWT haters counter that it isn't a revocation scheme at all, because nothing is actually revoked at the moment you need to revoke it.

Both sides make sense to me. Some application owners using JWT just want simplicity, and find the JWT ecosystem wide enough to easily include a "fast" authentication scheme in their application (ignoring the fact that some of those JWT libraries literally do database lookups for each request alongside verifying the JWT statelessly, practically nullifying the performance gain!). But it's also true that they don't really have a solid revocation scheme, and an attacker who gets their hands on the wrong JWT can wreak havoc pretty quickly while your customers ask you why you can't stop them.

So in that context, let's consider what an actual revocation scheme would look like, where one doesn't sacrifice the performance benefits of JWT. Disclaimer: any strong and scalable revocation scheme won't make JWTs "simple" in any sense anymore, especially because I don't know of any 1-click ecosystem solution for it yet (yes, you can probably make a business out of this). So if the reason you use JWT is for its simplicity, that'll be lost as soon as you want a strong revocation scheme and are required to build your own. Just know what you're getting into.

In a nutshell, and at a high level, here's what the revocation scheme looks like:

A distributed database/key-value store that allows publishing changes to subscribers/listeners/watchers/whatever-lingo-it-uses.
An application that performs authentication using JWTs which can subscribe/listen/watch/etc the distributed store on a separate thread.
An in-memory data structure (hashmap) that the application can use to locally store and look-up blacklisted JWTs.

Engineers should already understand how this all connects. But just to detail it out:

The distributed store holds blacklisted JWTs. An admin should be able to update it easily, and the changes should propagate to all replicas of that store (e.g. globally) extremely fast.
Applications that do authentication with JWT will write some logic to start a separate thread upon start-up which simply subscribes/listens/watches on the appropriate channel/key/etc and each time a message comes in, runs logic to update an in-memory data structure such as a hashmap to mirror the database store's state of blacklisted JWTs. The action of subscribing/listening/watching happens through a persistent TCP connection - you don't have to make a request every second to detect updates, you get "pushed" the update by the distributed store.
Authentication logic is modified so that on every request that requires authentication using JWTs, the in-memory data structure is queried to find if the request's token matches or not - this should be super fast, because the data structure is on the application's heap just like many other variables it'd work with while serving the request. If there's a match, we execute request rejection logic.
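
To sketch how the application side could fit together, the example below uses a queue.Queue as a stand-in for the distributed store's push channel, since the real client API depends on which store is chosen. The RevocationList class, the watch function, the message format and the token identifiers are all hypothetical; the identifier could be the token itself or something like its jti claim.

```python
import queue
import threading
from typing import Optional, Set, Tuple

Message = Optional[Tuple[str, str]]  # (action, token_id), or None to stop watching


class RevocationList:
    """In-memory mirror of the store's blacklist, shared between threads."""

    def __init__(self) -> None:
        self._revoked: Set[str] = set()
        self._lock = threading.Lock()

    def apply(self, action: str, token_id: str) -> None:
        with self._lock:
            if action == "revoke":
                self._revoked.add(token_id)
            elif action == "allow":
                self._revoked.discard(token_id)

    def is_revoked(self, token_id: str) -> bool:
        with self._lock:
            return token_id in self._revoked


def watch(updates: "queue.Queue[Message]", revocations: RevocationList) -> None:
    """Runs on a separate thread and applies each pushed update as it arrives."""
    while True:
        message = updates.get()  # blocks until the "store" pushes something
        if message is None:
            break
        revocations.apply(*message)


updates: "queue.Queue[Message]" = queue.Queue()
revocations = RevocationList()
watcher = threading.Thread(target=watch, args=(updates, revocations), daemon=True)
watcher.start()

# Simulate an admin committing changes to the store.
updates.put(("revoke", "token-123"))
updates.put(("revoke", "token-456"))
updates.put(("allow", "token-456"))
updates.put(None)
watcher.join()
```

The authentication path would then call revocations.is_revoked(...) after verifying the JWT's signature and expiry, and reject the request on a match.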

Revocation (or even re-allowance) happens at the time the admin commits the update to the distributed store, so that's basically as fast as you can get as long as your revocation scheme is human-guided.

The obvious downside though is the added complexity, which is why this is a bit troublesome. But as alluded to earlier, this is something that can be offered as a central service by some cloud provider or similar, with client libraries that could require 1 or 2 lines of logic to get going. There may be some authentication-focused businesses who may already be offering this.

The Absolute Minimum Every Software Developer Must Know About Pointers

Uman Shahzad — Mon, 05 Feb 2024 23:24:03 +0000

The concept of pointers regularly confuses beginner programmers. But pointers are fundamental to understanding how sophisticated memory management works in programs, so their importance cannot be avoided. In this post, we will use C-like syntax as a tool to help demonstrate pointer concepts.

The examples we shall encounter are simple and, especially the more complex ones, not necessarily what happens in the real world. A C-like syntax is used only to give a taste of what the code might look like when using pointers. What is much, much more important is understanding what pointers are all about as a general concept. When you understand that, then the real-world use cases will start to look and feel obvious and easy.

What really is a pointer?

On 64-bit systems, a pointer is nothing but an 8-byte number (analogously, a 4-byte number on 32-bit systems) that is supposed to be interpreted as an address in memory.

When we say a pointer "points to" something, we mean that the address the pointer contains is the address of that "something" in memory.

A pointer that points to a value can be "dereferenced" to get the actual memory contents that the value holds.

A "dangling" pointer is a pointer that used to point to a memory location that had "valid" user-initiated data, but now that memory location is no longer "valid" for some reason or another. A memory location could become invalid if, for example, the memory gets deallocated by the application and reclaimed by the operating system. And so a pointer pointing to such a location now points to an invalid memory location, and is thus "dangling" unless it is reassigned to a valid location.

A "nil" pointer is a pointer whose address value refers to an address that will never be possibly valid in the lifetime of a program. In most modern programs with a pointer concept, the nil pointer always has the value 0, although it can technically be anything as long as the address can never be valid. In the examples in this document, we use the 0 address as if it were usable, but only for illustration. In the real world, the 0 address will generally not be accessible, and is also usually what all NULL or nil pointers have as their value.

Pointing to a Value

Let us consider a pointer p that points to v:

int v;
int *p;

v = 5;
p = &v;

An example illustration of how memory would look after this code runs is as follows:

ADDR    NAME    VAL
0       v       +---------+
                |    5    |
                +---------+
4       p       +---------+
                |    0    |
                +---------+
12      N/A     N/A
16      N/A     N/A
20      N/A     N/A
24      N/A     N/A
...     ...     ...

v is the name of a 32-bit signed number whose value is currently 5. It is at the very start of memory, at address 0. Since v takes 4 bytes, that means it occupies addresses 0, 1, 2 and 3 in total. The next value after v will start at address 4.

You will notice that we are incrementing addresses by 1 whenever 8 bits (or 1 byte) has passed in memory; this is how addresses are always counted - we don't increment an address when only 1 bit has passed or only after more than 1 byte has passed. Each address represents a unique byte in memory, where address A-1 is 1 byte before address A and address A+1 is 1 byte ahead of address A.

p is the name of a 64-bit address whose value is currently 0. Since 0 is the address of v, p is said to point to v. Since p is 64-bits long, it takes 8 bytes and therefore 8 unique address locations, specifically the addresses 4, 5, 6, 7, 8, 9, 10 and 11. The next value after p will start at address 12.

If we were to dereference p with *p while it holds address 0, the result of evaluating that expression would give us 5, which is the value found at address 0.
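
For readers who want to try this without a C compiler, the same idea can be sketched with Python's standard ctypes module. This is only an approximation of the C code above; the variable names are ours, and the real address will be some large number chosen by the system rather than 0.

```python
import ctypes

v = ctypes.c_int(5)    # roughly: int v = 5;
p = ctypes.pointer(v)  # roughly: int *p = &v;

address_of_v = ctypes.addressof(v)
address_in_p = ctypes.addressof(p.contents)  # the address that p holds
value_before = p.contents.value              # roughly: *p

p.contents.value = 7                         # roughly: *p = 7;
value_after = v.value                        # v changed, since p points to it
```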

Pointing to a List of Values

If we had 3 values v1, v2 and v3, all of which are consecutive in memory and are the same length and type, we can use a pointer p to point to the first value in this list of values.

int v1, v2, v3;
int *p;

v1 = 51;
v2 = 52;
v3 = 53;

p = &v1;
ASSERT_EQ(*p, 51);
p += 1;
ASSERT_EQ(*p, 52);
p += 1;
ASSERT_EQ(*p, 53);

If the above code ran and v1, v2 and v3 happened to be next to each other in memory in that order, the assertions in the code above would hold true. (Note that in C there is no such guarantee; instead, a C user would use an array type which contains these 3 values, which will tell the compiler to explicitly put the 3 values consecutively in memory. However, we do not consider array types in this section for simplicity.)

Now, when p = &v1 in the above code occurs, this is an example illustration of what memory would look like:

ADDR    NAME    VAL
0       v1      +---------+
                |    51   |
                +---------+
4       v2      +---------+
                |    52   |
                +---------+
8       v3      +---------+
                |    53   |
                +---------+
12      p       +---------+
                |    0    |
                +---------+
20      N/A     N/A
24      N/A     N/A
28      N/A     N/A
32      N/A     N/A
...     ...     ...

When *p is run, this is dereferencing the pointer to get the value at address 0. Since p is specifically an int pointer (note that it was declared as int *p and not as bool_t *p or some other type), it is assumed that the value at address 0 is an int, so *p translates into 51, the int value found at address 0. We will discuss in the "Pointer Aliasing" section what would happen if p was a different pointer type but pointed to an int nonetheless.

When p += 1 occurs, since p was an int pointer and int is 4 bytes long, it is automatically incremented by 4 bytes; the address inside p is updated from 0 to 4.

Note that in C, incrementing any pointer of any type by any amount is always scaled by how long its type is. So if p was a pointer to an 8-byte number instead of a 4-byte one, p += 1 would increment by 8 addresses rather than 4. In Mlang, the operation is never automatically scaled by the type's length, so adding 1 to any pointer type will always explicitly add 1 address.

After p is incremented so that the address in it becomes 4, dereferencing it with *p gives us the value found at address 4 which is currently 52. A similar thing happens again in the next increment and dereference.
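
The illustration above can be simulated by treating a bytearray as memory and plain integers as addresses. This is only a model of what the C code does; the deref_int helper and the little-endian layout are assumptions of the sketch.

```python
import struct

memory = bytearray(32)  # pretend this is all of memory, addresses 0..31
# Place v1, v2 and v3 at addresses 0, 4 and 8 as 4-byte integers.
struct.pack_into("<iii", memory, 0, 51, 52, 53)
INT_SIZE = struct.calcsize("<i")  # 4 bytes


def deref_int(address: int) -> int:
    """Read the 4-byte integer stored at the given address (like *p)."""
    return struct.unpack_from("<i", memory, address)[0]


p = 0              # p = &v1;
first = deref_int(p)
p += 1 * INT_SIZE  # p += 1; C applies this scaling for us
second = deref_int(p)
p += 1 * INT_SIZE
third = deref_int(p)
```

The explicit multiplication by INT_SIZE is exactly the scaling that C performs automatically for an int pointer.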

Aside: Pointer Aliasing

Before continuing, as a thought experiment, consider what would happen if we had another pointer p2 which was declared as bool_t *p2, which is a boolean pointer, as such:

int v = 50;
int *p;
bool_t *p2;

p = &v;
p2 = (bool_t *)&v;

Since a bool_t is only 1 byte long, *p2 reads just the single byte at the address in p2 and interprets it as a boolean.

This can be problematic, since we did not intend to put a boolean value in v. If it is done nonetheless, then only the first byte of v will be retrieved and interpreted as a boolean value.
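
What "the first byte of v" contains depends on the machine's byte order, which is a detail beyond the scope of this post. The sketch below, using Python's struct module as a stand-in for raw memory, shows both possibilities for the value 50.

```python
import struct

# The 4 bytes of the int 50 as laid out on a little-endian machine.
little_endian = struct.pack("<i", 50)
# The same value as laid out on a big-endian machine.
big_endian = struct.pack(">i", 50)

# Reading through a 1-byte pointer only sees the byte at the lowest address.
first_byte_little = little_endian[0]
first_byte_big = big_endian[0]

as_bool_little = bool(first_byte_little)
as_bool_big = bool(first_byte_big)
```

So the very same aliasing code could yield a true value on one machine and a false value on another.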

This may seem troubling, but it is possible in C by force if the programmer wishes. This is rarely done in this way exactly, but if it is, the reason for doing so should be well-documented and well-reasoned, and only after considering the use of a language-level facility like union in C, which allows representing multiple different types in the same memory location safely. The details of union are out of scope for this document.

Compilers may have issues performing certain optimizations if more than 1 pointer points to the same value but with different types, because fewer assumptions can be made about what's happening. So, they introduce the concept of a "strict" pointer aliasing rule which requires that all pointers pointing to the same memory location are of the same type, otherwise the program is considered as having undefined behavior. An exception is made for the char * and void * pointer types, however, which can be used to refer to the same memory location as any other pointer type, without breaking this rule.

In Mlang, no strict pointer aliasing rule exists and the associated optimizations are not made on purpose; the programmer must perform the relevant optimization manually or make it explicit to the compiler that they desire a particular optimization. In our experience, this makes translation from the language to assembly a lot more predictable. We prefer the user to have more control over the compiler than have the compiler assume it can always do better than the user. The compiler is only a tool.

Pointing to a List of Variable-Sized Values

Instead of numbers, consider trying to store characters in memory consecutively, which can then be interpreted together as representing a human-readable sentence.

char *s = "Hello";

s is a pointer to a character, and when the "Hello" expression is evaluated, it returns a character pointer to the start location of the string. In particular, it returns the address of the "H" in "Hello".

ADDR    NAME    VAL
0       N/A     +---------+
                |    H    |
                |    e    |
                |    l    |
                |    l    |
                |    o    |
                |   \0    |
                +---------+
6       N/A     N/A
7       N/A     N/A
8       s       +---------+
                |    0    |
                +---------+
16      N/A     N/A
20      N/A     N/A
...     ...     ...

The memory for "Hello" is stored starting at address 0, and goes on until address 5. This is 1 address further than you might expect because, in languages like C, strings declared in the source code are automatically appended with a NUL byte (the '\0' character), which is the 0 value in 7-bit ASCII. We adopt strings of this format (often called "C-strings" because of their ubiquity in the C language) for this lesson.

The actual pointer s starts at address 8 (see the section "Aside: Alignment" for why s doesn't start at address 6, which is unused in this example). The value it contains is address 0, which points to the "H" character.

Now, in order to access the remaining characters in the string, we can simply increment s by 1, access the character, and repeat, until we reach the special NUL byte character '\0', which indicates that we have reached the end of the string. Indeed, this methodology is exactly the technique used to get the total length of C-strings, because there is no other way otherwise.
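
That length-finding loop can be modeled as follows, again treating a bytearray as memory and an integer as the pointer. The c_strlen name is ours; it mirrors the idea of C's strlen rather than being its implementation.

```python
# "Hello" followed by its NUL byte at address 0, then some unused memory.
memory = bytearray(b"Hello\x00") + bytearray(10)


def c_strlen(memory: bytearray, address: int) -> int:
    """Count the bytes from address up to (not including) the first NUL byte."""
    length = 0
    while memory[address + length] != 0:
        length += 1
    return length


s = 0                         # s holds the address of 'H'
length = c_strlen(memory, s)
second_char = chr(memory[s + 1])  # roughly: *(s+1)
```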

Now that we understand how to point to strings, it's natural that as our programs have more use cases, they will require multiple strings that are somehow related, and must be stored "together" in some way, but still remain separate strings.

One methodology for doing this is to store all the strings that we feel are related next to each other in memory, just as we did with numbers.

Here is one weird trick one can pull to achieve this:

char *s = "Hi\0Bye";

ADDR    NAME    VAL
0       N/A     +---------+
                |    H    |
                |    i    |
                |   \0    |
                +---------+
3       N/A     +---------+
                |    B    |
                |    y    |
                |    e    |
                |   \0    |
                +---------+
7       N/A     N/A
8       s       +---------+
                |    0    |
                +---------+
16      N/A     N/A
20      N/A     N/A
24      N/A     N/A
28      N/A     N/A
...     ...     ...

Here, we manually put a NUL byte between "Hi" and "Bye", making them look like 2 C-strings in memory, and having a pointer s point to "H". s can be used to iterate through the "list of strings" in this way.

This is obviously problematic; as the number and size of our strings increase, this strategy becomes unmanageable in practice. Even worse, what happens if you want to dynamically grow one of these strings? For example, what if "Hi" needs to be converted to "Hi, brother"?

So instead, let us consider a more general format. Instead of requiring that the string content itself gets stored next to each other in memory, we instead store a list of character pointers together in memory, and have them each point to one unique string which could be stored anywhere and managed separately.

For example:

char *s1 = "Hi";
char *s2 = "Bye";
char **p = &s1;

ADDR    NAME    VAL
0       N/A     +---------+
                |    H    |
                |    i    |
                |   \0    |
                +---------+
8       s1      +---------+
                |    0    |
                +---------+
16      s2      +---------+
                |   44    |
                +---------+
24      p       +---------+
                |    8    |
                +---------+
32      N/A     N/A
36      N/A     N/A
40      N/A     N/A
44      N/A     +---------+
                |    B    |
                |    y    |
                |    e    |
                |   \0    |
                +---------+
48      N/A     N/A
52      N/A     N/A
...     ...     ...

Here, we assume that s1 and s2 are stored contiguously in memory just as if they were in an array together (C does not guarantee this for two separate variables; a real array such as char *strs[2] does). p is stored after them. The two strings, "Hi" and "Bye", are stored far apart from each other. s1 and s2 contain the addresses of these two strings, and p points to s1.

What p essentially is now is a "pointer to an array of character pointers". Each of the character pointers (s1 and s2) is itself a "pointer to a character array". Using p, we can access both "Hi" and "Bye":

char *s;

// s will be equal to `s1`.
s = *p;
ASSERT_EQ(*(s+0), 'H');
ASSERT_EQ(*(s+1), 'i');
ASSERT_EQ(*(s+2), '\0');

// s will be equal to `s2`.
s = *(p+1);
ASSERT_EQ(*(s+0), 'B');
ASSERT_EQ(*(s+1), 'y');
ASSERT_EQ(*(s+2), 'e');
ASSERT_EQ(*(s+3), '\0');

s is a character pointer that takes on the value of s1 and then s2, one after the other. When it first holds the value of s1 (and so points to "Hi"), it is dereferenced using offsets of 0, 1 and 2 to assert that the values at those addresses are 'H', 'i' and '\0' respectively.

The same is then done when s is changed to hold the value of s2.

Note that syntax of the form *(<ptr>+<number>), such as *(s+1), is the long form of what is written in C as s[1]. The expression s[1] is defined to mean *(s+1), and it does exactly that: 1 is added to s (since s is a character pointer, the actual address increases by only 1), and then the resulting address is dereferenced to get the value stored there, which is a character in one of our strings.
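The following sketch puts the two notations side by side. It makes two assumptions that are not in the text above: ASSERT_EQ is defined here as a thin wrapper over the standard assert, and the character pointers live in a real array (strs) so that p+1 is guaranteed to land on the second pointer.

```c
#include <assert.h>

/* Assumed definition for this sketch; the text does not define ASSERT_EQ. */
#define ASSERT_EQ(a, b) assert((a) == (b))

void demo_string_table(void) {
    /* A real array guarantees that the two pointers are contiguous. */
    const char *strs[] = { "Hi", "Bye" };
    const char **p = strs;          /* p points at strs[0] */

    /* p[i] and *(p + i) name the same array element. */
    ASSERT_EQ(p[0], *(p + 0));
    ASSERT_EQ(p[1], *(p + 1));

    /* s holds the same address as strs[1], i.e. it points at 'B'. */
    const char *s = p[1];
    ASSERT_EQ(s[0], *(s + 0));
    ASSERT_EQ(s[0], 'B');
    ASSERT_EQ(s[1], 'y');
    ASSERT_EQ(s[2], 'e');
    ASSERT_EQ(s[3], '\0');

    /* Both levels can be indexed in one expression. */
    ASSERT_EQ(p[1][2], *(*(p + 1) + 2));
    ASSERT_EQ(p[0][1], 'i');
}
```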

Aside: Alignment

Consider having 3 numbers in memory, two 8-bit numbers and one 32-bit number, laid out as follows:

ADDR    NAME    VAL
0       N/A     +---------+
                |    5    |
                +---------+
1       N/A     +---------+
                |    9    |
                +---------+
2       N/A     +---------+
                |   999   |
                +---------+
6       N/A     N/A
8       N/A     N/A
12      N/A     N/A
16      N/A     N/A
...     ...     ...

When retrieving the 8-bit number 5 at address 0, or the 8-bit number 9 at address 1, everything is alright and works as expected when using assembly instructions appropriate for that size of access.

But, depending upon the CPU ISA, trying to access the 32-bit number at address 2 can cause a program to crash, or perform the load more slowly than the other loads, or lose load atomicity. If the number were "aligned", this would not happen.

A value is "aligned" in memory if it is stored starting at an address that is divisible by some constant, where that constant is usually equal to the size of the value in bytes. If that is indeed the constant in question, we say that the value is "naturally aligned", or has "natural alignment".

So, accessing the first two 8-bit numbers works because their size in bytes is 1, and both addresses 0 and 1 are divisible by 1. They are therefore naturally aligned. Accessing the 32-bit number fails, is slow, or loses atomicity because its size in bytes is 4, and 2 is not divisible by 4, so it is unaligned. Valid addresses for it would be, for example, 0, 4, 8, 12, 16, etc.

So when a value is stored at an address that is divisible by its size in bytes, it is naturally aligned, and none of these issues occur.
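The divisibility rule is small enough to write down directly. In this sketch, is_naturally_aligned and load_u32 are names chosen for illustration. The second helper shows one common portable way to read a 32-bit value from a possibly unaligned address in C: copy its bytes into an aligned variable with memcpy instead of dereferencing a misaligned pointer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* True if addr is divisible by size, where size is the value's size in
 * bytes and must be nonzero. */
bool is_naturally_aligned(uintptr_t addr, size_t size) {
    return addr % size == 0;
}

/* Reads 4 bytes starting at src, which may be unaligned, by copying them
 * into an aligned local variable. */
uint32_t load_u32(const unsigned char *src) {
    uint32_t value;
    memcpy(&value, src, sizeof value);
    return value;
}

void demo_alignment(void) {
    /* The three numbers from the first diagram. */
    assert(is_naturally_aligned(0, 1));    /* 8-bit value at address 0  */
    assert(is_naturally_aligned(1, 1));    /* 8-bit value at address 1  */
    assert(!is_naturally_aligned(2, 4));   /* 32-bit value at address 2 */
    assert(is_naturally_aligned(4, 4));    /* where it could live       */

    /* Recreate the unaligned layout in a byte buffer: 5, 9, then 999. */
    unsigned char memory[8] = { 5, 9 };
    uint32_t n = 999;
    memcpy(&memory[2], &n, sizeof n);
    assert(load_u32(&memory[2]) == 999);
}
```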

The reason for this seeming anomaly is a detail of how CPUs and memory chips work, and is out of scope for this document: just understand that the hardware either requires it or strongly favors it, so in software, we make sure to align our values.

Because of this, if we had a structure that looks like this:

struct {
    u8_t a;
    u8_t b;
    u32_t c;
};

The correct memory layout would look like this after taking alignment into account:

ADDR    NAME    VAL
0       a       +---------+ <---+ start of struct
                |    5    |     |
                +---------+     |
1       b       +---------+     |
                |    9    |     |
                +---------+     |
2       N/A     N/A             |
3       N/A     N/A             |
4       c       +---------+     |
                |   999   |     |
                +---------+ <---+ end of struct
8       N/A     N/A
12      N/A     N/A
16      N/A     N/A
...     ...     ...

We say that addresses 2 and 3 (i.e. the two extra bytes between b and c) are "padding" bytes - the compiler will automatically add this padding when the structure is used anywhere.
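The padding can be observed from inside a program with offsetof and sizeof. The sketch below assumes u8_t and u32_t are aliases for the fixed-width types in stdint.h, gives the structure a tag (S) so it can be named, and assumes a typical target where a 32-bit integer is 4-byte aligned; the exact numbers are up to the compiler and ABI.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed aliases; the text does not show how u8_t and u32_t are defined. */
typedef uint8_t  u8_t;
typedef uint32_t u32_t;

struct S {
    u8_t  a;
    u8_t  b;
    u32_t c;
};

/* Number of padding bytes the compiler placed between b and c. */
size_t padding_before_c(void) {
    return offsetof(struct S, c) - (offsetof(struct S, b) + sizeof(u8_t));
}

void demo_struct_layout(void) {
    /* Expected on a typical target with 4-byte alignment for u32_t. */
    assert(offsetof(struct S, a) == 0);
    assert(offsetof(struct S, b) == 1);
    assert(offsetof(struct S, c) == 4);
    assert(padding_before_c() == 2);
    assert(sizeof(struct S) == 8);
}
```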

On x64 systems, the program will generally not crash if values are unaligned, but access will be costlier and atomicity is no longer guaranteed. In some cases, an engineer may decide that these costs are worth paying in order to avoid adding padding.

For example, if the engineer is faced with a use case where memory usage is the costliest factor, and losing atomicity and having slightly slower accesses is worth the trade, then the compiler can be told to purposely not add padding to achieve alignment.

Obviously, it is the default to always align values, whether on the heap, stack or anywhere in memory. But say, for example, you have a system that stores millions of the following structures in memory, but accesses them very infrequently and always in a single-threaded fashion:

struct A {
    u8_t a;     // 1 byte
    u64_t b;    // 8 bytes
};              // = 9 bytes total?

struct B {
    u64_t a;    // 8 bytes
    u8_t b;     // 1 byte
};              // = 9 bytes total?

If alignment is required, then both structures will consume 16 bytes of memory: structure A needs an extra 7 bytes of padding between a and b, and structure B gets an extra 7 bytes after b, so that when instances of structure B sit back to back (as in an array), the a of each one still starts at an address divisible by 8.

If alignment is turned off for both structures, then both take 9 bytes and look in memory exactly as they're written in the code. However, note that values in memory which surround instances of structure A and B will still have to be aligned themselves (unless they too were allowed to be unaligned).
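How the compiler is told to skip padding is not part of standard C. One widely supported option is the #pragma pack directive, accepted by GCC, Clang and MSVC, which the sketch below uses. The Packed* names, the typedefs, and the sizes asserted are assumptions for a typical 64-bit target where a 64-bit integer is 8-byte aligned.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed aliases for the types used in the text. */
typedef uint8_t  u8_t;
typedef uint64_t u64_t;

/* Default layout: the compiler pads both structures. */
struct A { u8_t a; u64_t b; };
struct B { u64_t a; u8_t b; };

/* Compiler extension: lay out members with 1-byte alignment, so no padding
 * is inserted. pack(pop) restores the previous setting. */
#pragma pack(push, 1)
struct PackedA { u8_t a; u64_t b; };
struct PackedB { u64_t a; u8_t b; };
#pragma pack(pop)

void demo_packed_structs(void) {
    /* Typical 64-bit target: 7 padding bytes in each aligned structure. */
    assert(sizeof(struct A) == 16);
    assert(sizeof(struct B) == 16);
    assert(sizeof(struct PackedA) == 9);
    assert(sizeof(struct PackedB) == 9);

    /* Members are still accessed by name; the compiler emits whatever
     * instructions the unaligned member needs. */
    struct PackedA x = { 1, 2 };
    assert(x.a == 1);
    assert(x.b == 2);
}
```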

The engineer faced with this dilemma in such a memory-intensive system may opt for allowing these structures to be unaligned, thus saving 7 bytes per structure, which is a 43.75% memory saving (7 of every 16 bytes) - a big amount if we're talking about having gigabytes' worth of these structures! If 20 GiB is used by these structures when aligned, the unaligned versions will consume 11.25 GiB instead.
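The arithmetic behind those figures, written out as a small sketch (the function names are made up for this example): the saving is the 7 padding bytes out of each 16-byte aligned structure, and the packed total is the aligned total scaled by 9/16.

```c
#include <assert.h>
#include <stddef.h>

/* Fraction of memory saved by going from aligned_size to packed_size bytes. */
double saved_fraction(size_t aligned_size, size_t packed_size) {
    return (double)(aligned_size - packed_size) / (double)aligned_size;
}

/* Total memory used after packing, in the same unit as aligned_total. */
double packed_total(double aligned_total, size_t aligned_size, size_t packed_size) {
    return aligned_total * (double)packed_size / (double)aligned_size;
}

void demo_savings(void) {
    /* 7 / 16 = 0.4375, i.e. 43.75% */
    assert(saved_fraction(16, 9) == 0.4375);
    /* 20 GiB * 9 / 16 = 11.25 GiB */
    assert(packed_total(20.0, 16, 9) == 11.25);
}
```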

Oxylabs Go SDK

Uman Shahzad — Mon, 05 Feb 2024 09:22:16 +0000

Hey everyone! We've created an Oxylabs Go SDK that'll allow users to more easily utilize the Oxylabs SERP APIs (and more later) via a Golang library.

You can find it here: https://github.com/mslmio/oxylabs-sdk-go

This is just an MVP and we'll be working on improving it over time with more features.

Would appreciate if Oxylabs users could try it out and give their feedback so we can start addressing it and improving the SDK.

Here's a quick start example:

package main

import (
    "fmt"

    "github.com/mslmio/oxylabs-sdk-go/serp"
)

func main() {
    // Set your Oxylabs API Credentials.
    const username = "username"
    const password = "password"

    // Initialize the SERP realtime client with your credentials.
    c := serp.Init(username, password)

    // Use `google_search` as a source to scrape Google with adidas as a query.
    res, err := c.ScrapeGoogleSearch(
        "adidas",
    )
    if err != nil {
        panic(err)
    }

    fmt.Printf("Results: %+v\n", res)
}

It works with both the real-time and async integration methods, and makes it especially easy to use the latter, which is otherwise quite tedious.

Please see detailed documentation at https://pkg.go.dev/github.com/mslmio/oxylabs-sdk-go

Source code is at https://github.com/mslmio/oxylabs-sdk-go

Thank you!