DEV Community

Cover image for DSA for Data Scientists
Niladri Das
Niladri Das

Posted on • Updated on

DSA for Data Scientists

"The data structure is an essential foundation for computer practitioners."
— Niladri Das


What is computer science

Computer science is often difficult to define. This may be due to the unfortunate use of the “computer” word. As you probably know, computer science is more than just the study of computers. Although computers as tools play an important supporting role in the discipline, they are just tools.

Computer science is the study of problems, their solution, and the solutions that result from problem-solving. Given a problem, the goal of a computer scientist is to develop an algorithm, a list of instructions that can be used to solve any instance of the problem that may arise. An algorithm can solve a problem by following its limited process.

Computer science can be thought of as the study of algorithms. However, we must be careful to include the fact that some problems may not have solutions. Although proving that this statement is correct is beyond the scope of this article, the fact that some problems are unsolvable is important to those studying computer science. So we can define computer science as a science that studies solutions to problems that can be solved and problems that cannot be solved.

Usually, we say that a problem is computable when describing the problem and solution. A problem is computable if an algorithm exists to solve it. Another definition of computer science is that computer science is the study of computable and non-computable problems, and whether there is an algorithm to solve them. You will notice that the “computer” word does not appear at all until the solution is machine-independent.

Computer science, as it involves the problem-solving process itself, is also the study of abstraction. Abstraction allows us to view problems and solutions in a way that separates the so-called logical and physical perspectives. The basic idea is the same as our common examples.

Let's say you probably already drive to school or work. As a driver, as a user of the car. For the car to take you to your destination, you will have some interaction with the car. Enter the car, insert the key, ignition, shift, brake, accelerate and steer. From an abstract perspective, we can say that what you are seeing is the logical perspective of the car. You are using the features provided by Car Designer to transport you from one location to another. These functions are sometimes called interfaces.

On the other hand, a mechanic working on a car has a completely different perspective. Not only does he know how to drive a car, he must know all the details necessary to make the features we take for granted work. He needs to understand how the engine works, how the gearbox changes speed, how the temperature is controlled, etc. This is called the physical perspective, and details occur in“under the hood”.

The same thing happens when we use computers. Most people use their computers to write documents, send and receive email, surf the Internet, play music, store images, and play games without knowing the details that make these applications work. They view computers from a logical or user perspective. Computer scientists, programmers, technical support personnel, and system administrators look at computers from very different perspectives. They must know the details of how the operating system works, how to configure network protocols, and How to write various scripts to control functions. They must be able to control the underlying details.

What these two examples have in common is that the user-space abstraction, sometimes called the client, does not need to know the details, as long as the user knows how the interface works. This interface is the way for users to communicate with the underlying layer. Another example of abstraction is the Python Mathematics module. Once we import the module, we can perform the calculation:

>>>import math
>>>math.sqrt(16)
4.0
>>>
Enter fullscreen mode Exit fullscreen mode

This is an example of procedural abstraction. We don't necessarily know how to calculate square roots, but we know what the function is and how to use it. If we perform the import correctly, we can assume that the function will give us the correct results. We know someone has implemented a solution to the square root problem, but we just need to know how to use it. This is sometimes called the “black box Child” view. Let's briefly describe the interface: the name of the function, what it requires (parameters), and what it will return. Details are hidden inside (see picture 1)

1st


What is programming

Programming is the process of encoding algorithms into symbols, a programming language so that they can be executed by a computer. Although many programming languages and different types of computers exist, the first step is to have a solution. There is no program without algorithms.

Computer science is not the study of programming. However, programming is an important ability for computer scientists. Programming is often the representation we create for a solution. Therefore, this linguistic representation and the process of creating it became a fundamental part of the discipline.

An algorithm describes a solution generated based on problem instance data and a set of steps required to produce the expected results. A programming language must provide a representation method for representing procedures and data. For this purpose, it provides control structures and data types.

Control structures allow algorithmic steps to be expressed conveniently and unambiguously. At a minimum, the algorithm needs to perform sequential processing, decision selection, and repeated control iterations. As long as the language provides these basic statements, it can be used for algorithmic representation.

All data items in a computer are represented in binary form. To give these strings meaning, we need data types. Data types provide an interpretation of this binary data so that we can think about the data in terms of the problem it solves. These low-level built-in data types (sometimes called primitive data types) provide the basis for algorithm development.

For example, most programming languages provide data types for integers. Binary data in memory can be interpreted as integers and can be given an (23,654 and -19) associated meaning. In addition, the data type describes the operation in which the data item participates.

For integers, operations such as addition, subtraction, and multiplication are common. We expect numeric data to participate in these arithmetic operations. Often the difficulty we encounter is that the problem and its solution are very complex. These simple, language-provided structures and data types, while adequate for representing complex solutions, are often at a disadvantage in how we approach problems. We need some methods to control this complexity and provide us with better solutions.


Why learn data structures and abstract data types

To manage problem complexity and the problem-solving process, computer scientists use abstractions that allow them to focus on the “overall situation” Without getting lost in the details. By creating a model of the problem domain, we can leverage a better and more efficient problem-solving process. These models allow us to describe the data our algorithms will process more consistently.

Previously, we referred to procedural abstraction as a process that hides the details of a specific function to allow the user or client to view it at a high level. We now turn our attention to a similar idea, that of data abstraction. Abstract data type(sometimes abbreviated to ADT) is a logical description of how we view the data and the allowed operations, without regard to how to implement them. This means we only care about what the data represents, regardless of how it will ultimately be structured. By providing this level of abstraction, we create an encapsulation around the data. By encapsulating implementation details, we hide them from the user's view. This is called information hiding.

Abstract data types and how to operate on them. Users interact with the interface using operations specified by abstract data types. Abstract data types are what users interact with the shell. The implementation is hidden deeper. Users don't care about implementation details.

The implementation of abstract data types (often called data structures) will require us to use some collection of program constructs and primitive data types to provide a physical view of the data. As we discussed earlier, the separation of these two perspectives will allow us to define the problem into complex data models without giving out details about how the model is constructed. This provides an implementation-independent view of the data. Because there are typically many different ways to implement an abstract data type, this implementation independence allows programmers to switch implementation details without changing the way users of the data interact with it. Users can continue to focus on the problem-solving process.


Why learn algorithms

Computer scientists often learn through experience. We learn by watching others solve problems and by solving problems ourselves. Exposure to different problem-solving techniques and seeing different algorithm designs can help us take on the next challenging problem. By thinking about many different algorithms, we can begin to develop pattern recognition so that the next time a similar problem arises, we will be better able to solve it.

Algorithms are often completely different from each other. Consider what we saw earlier sqrt example of. There may be many different ways to implement the details to compute the square root function. One algorithm can use fewer resources than another. An algorithm may require 10 times to return results. We want some way to compare these two solutions. Even if they both work, one may be better than the other “better”. We recommend using a more efficient algorithm, or one that simply works faster or uses less memory. When we study algorithms, we can learn analysis techniques that allow us to compare and contrast solutions based solely on their characteristics rather than on the characteristics of the program or computer used to implement them.

In the worst case, we might have an intractable problem, meaning that no algorithm can solve the problem in a realistic amount of time. It is important to be able to distinguish between those problems that have a solution, those problems that do not have a solution, and those problems that have a solution but require too much time or other resources to work reasonably well.

There are often trade-offs and decisions we need to make. As computer scientists, in addition to our problem-solving abilities, we also need to understand solution evaluation techniques. In the end, there are usually many ways to solve the problem. Once a solution is found, we will compare it over and over again and decide if it is a good solution.


Review Python Base

In this section, we will review the Python programming language and provide some more detailed examples. What if you are Python If you are new to this, or if you need more information on any of the topics presented, we recommend that you refer to Python language reference PythonTutorial. Our goal here is to re-understand the Python language and reinforce some concepts that will be central to later chapters.

Python Is a modern, easy-to-learn object-oriented programming language. It has a powerful set of built-in data types and easy-to-use control structures. because Python is an interpreted language, so it is easier to inspect by simply viewing and describing interactive sessions. You should remember that the interpreter displays the familiar >>> prompt, and then calculates the Python Sentences. For example,

>>> print("Algorithms and Data Structures") 
    Algorithms and Data Structures 
    >>> 

Enter fullscreen mode Exit fullscreen mode

Show prompt, print result and next prompt.


Target

Understand the importance of algorithms. Use large symbols to represent execution times. Understand Python Large for lists and dictionaries. Execution times understanding Python. How data implementation affects algorithmic analysis. Learn how to perform simple benchmark testing with Python.


What is Algorithmic Analysis

A common phenomenon is that students who are new to computer science will compare their programs with others. You may also notice that computer programs look very similar, especially simple programs. An interesting question often arises. When two programs solve the same problem but look different, which one is better? To answer this question, we need to remember that there is an important difference between a program and the underlying algorithm that the program represents. As we discussed, an algorithm is a general, problem-solving list of instructions. It is a method used to solve any instance of a problem that, given specific inputs, produces a desired result. A program, on the other hand, is an algorithm that has been encoded in a programming language. Depending on the programmer and programming language used, there may be many programs for the same algorithm. To explore this difference further, please refer to the ActiveCode 1 function shown in. This function solves a familiar problem, before calculating The sum of integers. The algorithm is initialized using an accumulator variable. then iterate integers, adding each to the accumulator.

def calculate_sum(m):
    result = 0 
    for x in range(1, m+1):
        result = result + x

    return result 

    print(calculate_sum(10))

Enter fullscreen mode Exit fullscreen mode

ActiveCode 1

Look now ActiveCode 2 function. At first glance, it may seem strange, but upon further inspection, you can see that this function essentially does the same thing as the previous program. The reason for this unintuitiveness is poor coding habits. Instead of using good identifier names to improve readability, we used an extra assignment statement in the iteration step that wasn't necessary.

def calculate_total(jerry):
    total_sum = 0
    for jill in range(1, jerry+1):
        jack = jill
        total_sum = total_sum + jack
    return total_sum

    print(calculate_total(10))
Enter fullscreen mode Exit fullscreen mode

ActiveCode 2

Earlier we asked the question of which function is better, the answer depends on your criteria. If you are concerned about readability, the function sumOfN is better than calculate_total good. You've probably seen many examples in introductory programming courses where one of their goals is to help you write easy-to-read and understanding procedures. In this course, we are interested in how to represent algorithms (although we hope you will continue to strive to write readable, easy-to-understand code).

Algorithm analysis involves comparing algorithms based on the amount of computing resources each algorithm uses. We compare two algorithms and say one is better than the other because it is more efficient in using resources, or simply uses fewer resources. From this perspective, the above two functions look very resemblance. They all use essentially the same algorithm to solve the problem. At this point, what is more important is how we think about computing resources in the true sense. There are two methods. One is to consider the space or memory required by the algorithm. The space required is usually determined by the problem itself. However, the algorithm will have some special space requirements, and we can explain these changes in detail.

As an alternative to space requirements, we can analyze algorithms based on time. This measure is sometimes called the algorithm's ‘execution time’ or ’operation hours‘. We can measure functions by benchmarking SumOfN execution time. This means that we will record the program calculation required for actual time. exist Python, we can benchmark the function by recording the start time and end time relative to the system. There is a time function in the time module which will return the system clock time in seconds. By calling the time function at the start and end times and calculating the difference, you can get an exact number of seconds (in most cases).

import time 

    def sumOfN2(n): 
        start = time.time() 
        theSum = 0 
        for i in range(1, n+1): 
            theSum = theSum + i 
        end = time.time() 
        return theSum, end-start 
Enter fullscreen mode Exit fullscreen mode

Listing 1

Listing 1 A time function is embedded, and the function returns an array containing the result and the elapsed time. If we execute this function5 times, before each calculation10,000 The sum of integers will give the following result:

for i in range(5): 
        result, time_taken = sumOfN(10000)
        print("Sum is %d required %10.7f seconds" % (result, time_taken))
Enter fullscreen mode Exit fullscreen mode

We found that the time is fairly consistent, it takes on average to execute this code0.0019Second. If we run the calculation before100,000What about functions of integers?

for i in range(5): 
    print("Sum is %d required %10.7f seconds" % sumOfN(100000))
Enter fullscreen mode Exit fullscreen mode

Again, despite being longer, the time required for each run is very consistent, averaging more than 10 times. forn equal to1,000,000,we got:

for i in range(5): 
    print("Sum is %d required %10.7f seconds" % sumOfN(1000000))
Enter fullscreen mode Exit fullscreen mode

In this case, the average is also approximately the previous time 10 times. consider nowActiveCode 3, which shows different ways of solving summation problems Σ1 i=1 (n)(n+1)

 def sumOfN3(n): 
     return (n*(n+1))/2 
     print(sumOfN3(10)) 
Enter fullscreen mode Exit fullscreen mode

ActiveCode 3

If we are right sumOfN3 to do the same benchmark, using 5 different n (10,000, 100,000, 1,000,000, 100,000,000), We get the following results:

Sum is 50005000 required 0.00000095 seconds 
    Sum is 5000050000 required 0.00000191 seconds 
    Sum is 500000500000 required 0.00000095 seconds 
    Sum is 50000005000000 required 0.00000095 seconds 
    Sum is 5000000050000000 required 0.00000119 seconds
Enter fullscreen mode Exit fullscreen mode

There are two points of concern. Firstly, the time recorded above is shorter than any of the above examples. In addition, their time and It doesn't matter, it seems unaffected Impact.

But what does this benchmark tell us? We can intuitively see that using the iterative solution requires more work because some steps are repeated. This may be why it takes longer. In addition, the time required for iterations increases incrementally. Another problem is that if we run this function on a different computer or use a different programming language, we may get different results. If you use an old computer, it may take longer to complete sumOfN3.

We need a better way to characterize the execution time of these algorithms. Benchmarks measure the actual execution time. It doesn't provide us with a useful measurement because it depends on the specific machine, program, time, compiler and programming language. Instead, we want features that are independent of the program or computer used. However, benchmark metrics will help to judge algorithms individually and can be used to compare algorithms between solutions.


Big O symbol

When we try to characterize the efficiency of an algorithm in terms of execution time, and independent of any particular program or computer, it is important to quantify the number of operations, or steps, required by the algorithm. Choosing the appropriate basic unit of calculation is a complex issue and will depend on how the algorithm is implemented. As with the previous summation algorithm, a good basic unit of calculation is counting executed statements. exist sumOfN In the count of assignment statements is 1 (theSum = 0) plus n value (we execute theSum=theSum+i number of times).

We pass the function T express T(n)=1 + n parameters n usually called ‘scale of the problem’, we call it ‘T(n) is to solve the problem size of The time spent i.e. 1+n step size’. In the summation function above, It makes sense to express the problem size. We can say, 100,000 the sum of integers 1000 The problem is huge. Therefore, it takes longer. Our goal is to show how the execution time of an algorithm relative to the problem size changes depending on the size of the module.

Computer scientists prefer to take this analysis technique even further. It turns out that the number of steps is less important than the certainty T(n) The most important part comes first. In other words, when the problem size becomes larger, T(n) Some parts of the function will have more weight than other parts. The magnitude of a function represents The values that increase the fastest for those parts. orders of magnitude often called largeOsymbol, written as O(f(n)). It represents an approximation to the actual number of steps in the calculation. function (n) provided T(n) The most important part of the representation.

In the above example, T(n)=1+n. when n As it gets larger, the constant1 ay attention,1 forT(n) It's definitely important. But when When gets larger, our approximation is accurate without it.becomes less and less important to the result. If we are looking for (n) An approximation of, we can delete1, the running time is (n).

As another example, suppose that for some algorithms, the determined number of steps is T(n)=5n^2 +27n+1005. when n Very young, For example, 1 or 2, constant1005 Seems to be the main part of the function. However, withn get bigger, n^2 This is becoming increasingly important. When When is large, the other two items become unimportant in the role they play in determining the final result. n When getting larger, to approximate (n), we can ignore other terms and only focus on 5n^2. coefficient5 It also becomes unimportant. we say, T(n) Has an order of magnitude of f(n)=n^2. or O( n^2 )。

Although we didn't see this in the summation example, sometimes the performance of an algorithm depends on the exact values of the data rather than the size of the problem. For algorithms of this type, we need to characterize their performance in terms of best case, worst case, or average case. The worst case refers to a specific data set where the algorithm performs particularly poorly. The same algorithm may have very good performance on different data sets. In most cases, algorithm execution efficiency lies between the two extremes (average case). Computer scientists need to understand these differences so that they are not misled by a particular situation.

As you learn algorithms, some common order-of-magnitude functions will appear over and over again. See Table 1. To determine which of these functions is the most important, we need to see how they compare to each other as they get larger.

2nd

Table 1

As a final example, suppose we have a Listing 2 code snippet. Although this program does not do anything, it is useful for us to get actual code and performance analysis.

# Assign values to variables
    a = 5
    b = 6
    c = 10

    # Loop over range n for three variables i, j, and k
    for i in range(n):
        for j in range(n):
            # Perform operations on i and j
            x = i * i
            y = j * j
            z = i * j

        # Another loop over range n for variable k
        for k in range(n):
            # Perform operations on k
            w = a * k + 45
            v = b * b

    # Assign a value to variable d
    d = 33
Enter fullscreen mode Exit fullscreen mode

Listing 2

The allocation operand is divided into the sum of four terms. The first term is a constant 3, Represents the three assignment statements at the beginning of the fragment. The second item is 3n^2 because due to nested iteration, there are three statements executed n^2 Second-rate. The third item is Two statements iterations Second-rate. Finally, the fourth term a constant1, represents the final assignment statement. Finally concluded T(n)=3+3n^2 +2n+1 = 3n^2 + 2n+4, by looking at the index, we can see ^2 The terms are explicit, so this code snippet is O(n^ 2 ). When n increases all other terms as well as the coefficients on the principal term can be ignored.


An example of a palindrome string check

A good example of an algorithm showing different magnitudes is the palindrome check of strings. One string is the palindrome of another string. If the second string is just a rearrangement of the first, for example, 'heart' and 'earth' It's a palindrome string.' python' and 'Typhon' Too. For simplicity, we assume that the two strings in question have equal lengths and that they are given by 26 A collection of lowercase letters. Our goal is to write a Boolean function that takes two strings as arguments and returns whether they are palindromes.


Solution1: Examine

Our first solution to the palindrome problem is to check whether the first string appears in the second string. If every character can be detected, the two strings must be palindromes. This can be done by using None Replace characters to complete the check. But Python Strings are immutable, so the first step is to convert the second string to a list. Each character in the first string can be checked in the second list by checking whether the element exists, and if it exists, replace it with None.

See ActiveCode1

def anagramSolution1(s1, s2):
    alist = list(s2)
    pos1 = 0
    stillOK = True

    while pos1 < len(s1) and stillOK:
        pos2 = 0
        found = False

        while pos2 < len(alist) and not found:
            if s1[pos1] == alist[pos2]:
                found = True
            else:
                pos2 = pos2 + 1

        if found:
            alist[pos2] = None
        else:
            stillOK = False

        pos1 = pos1 + 1

    return stillOK

    print(anagramSolution1('abcd', 'dcba'))
Enter fullscreen mode Exit fullscreen mode

ActiveCode1

To analyze this algorithm, we observe that each character of string 1 s1 will be iterated over. In each iteration, every character of string 2 s2 in the list of length n locations will be visited once to match with the character from 1 s1. The number of visits can be expressed as the sum of integers from 1 to n, which is an arithmetic series with a common difference of 1.

Therefore, the sum of the series can be written as:

∑ 1 ⋅ ( + 1 ) ∑ i=1 ni⋅(i+1)

We can expand this expression:

∑  = 1  (  2 +  ) = ∑  = 1   2 + ∑  = 1   ∑ i=1 n(i 2 +i)=∑ i=1 ni 2 +∑ i=1 ni
Enter fullscreen mode Exit fullscreen mode

Using the formula for the sum of squares and the sum of an arithmetic series, we get:

 ⋅ (  + 1 ) ⋅ ( 2  + 1 ) 6 +  ⋅ (  + 1 ) 2 6 n⋅(n+1)⋅(2n+1)+ 2 n⋅(n+1)
Enter fullscreen mode Exit fullscreen mode

Now, factoring out ⋅ ( + 1 ) n⋅(n+1) from both terms:

 ⋅ (  + 1 ) ( 2  + 1 6 + 1 2 ) n⋅(n+1)( 6 2n+1+ 2 1)
Enter fullscreen mode Exit fullscreen mode

Simplifying the expression inside the parentheses, we get:

 ⋅ (  + 1 ) ⋅ ( 1 2 + 1 2 ) =  ⋅ (  + 1 ) ⋅ 1 2 n⋅(n+1)⋅( 2 1+ 2 1)=n⋅(n+1)⋅ 2 1
Enter fullscreen mode Exit fullscreen mode

Thus, the sum of the series from 1 i=1 to n of ⋅ ( + 1 ) i⋅(i+1) is ⋅ ( + 1 ) 2 2 n⋅(n+1).

As n gets larger, the term n^2 dominates over n^(1/2). Consequently, the term n^(1/2) becomes negligible compared to n^2. Therefore, the complexity of this algorithm can be approximated as O(n^2)


Solution2: Sort and compare

Another solution is to exploit the fact that evens1, and s2Different, they are palindromes only if they are composed of the same characters. So, if we arrange each string alphabetically, starting with a arrive With, if two strings are the same, the two strings are palindromes.

See ActiveCode2.

def anagramSolution2(s1, s2):
    alist1 = list(s1)
    alist2 = list(s2)
    alist1.sort()
    alist2.sort()
    pos = 0
    matches = True
    while pos < len(s1) and matches:
        if alist1[pos] == alist2[pos]:
            pos = pos + 1
        else:
            matches = False
    return matches

    print(anagramSolution2('abcde', 'edcba'))
Enter fullscreen mode Exit fullscreen mode

ActiveCode2

First, you might think that this algorithm is O(n) since there is only a simple iteration to compare the sorted characters. However, calling Python Sorting is not without cost. As we will see in later chapters, sorting is usually O(n^2 )or O(nlogn). So sorting operations cost more than iteration. In the end, the algorithm has the same order of magnitude as the sorting process.


Solution3: Exhaustive method

A powerful way to solve this type of problem is to exhaust all possibilities. For palindrome detection, we can generate a 1A list of all palindrome strings, and then check whether there is s2. This approach is a little difficult when s1 When generating all possible strings, the first position hasn There is a possibility that the second position has n-1 species, the third position hasn-3 Plant, etc. The total is n∗(n−1)∗(n−2)∗...∗3∗2∗1n∗(n−1)∗(n−2)∗...∗3∗2∗1, Right now n!. Although some strings may be repeated, the program cannot know this in advance, so it will still generate! string.

It turns out that! Compare n^2 Growth is still fast ifs1have 20 characters long, there will be 20! = 2,432,902,008,176,640,000 A string is generated. If we process one possible string per second, then we need 77,146,816,596 It takes years to complete the entire list. So this is not a good solution.


Solution4: Counting and comparing

Our final solution to the palindrome is to use two palindrome strings with the same, b, c Wait for the fact. The first thing we count is the number of times each letter appears. Because there are 26 possible characters, we use a length of 26 A list with one position for each possible character. Each time a specific character is seen, the counter at that position is incremented. Finally, if the counters of the two lists are the same, the string is a palindrome string.

See ActiveCode 3

def anagramSolution4(s1, s2):
    c1 = [0] * 26
    c2 = [0] * 26

    for char in s1:
        pos = ord(char) - ord('a')
        c1[pos] += 1

    for char in s2:
        pos = ord(char) - ord('a')
        c2[pos] += 1

    j = 0
    stillOK = True

    while j < 26 and stillOK:
        if c1[j] == c2[j]:
            j += 1
        else:
            stillOK = False

    return stillOK

    print(anagramSolution4('apple', 'pleap'))
Enter fullscreen mode Exit fullscreen mode

ActiveCode 3

Again, this solution has multiple iterations, but unlike the first solution, it is not nested. Both iterations are, The third iteration, comparing the two count lists, requires 26 steps, because there are 26 letters. total T(n)=2n+26T(n)=2n+26, Right now O(n), we found a linear magnitude algorithm to solve this problem.

Before concluding this example, let's discuss the space cost. Although the last solution performs in linear time, it requires additional storage to hold the two character count lists. In other words, the algorithm sacrifices space for time.

In many cases, you need to make a trade-off between space and time. In this case, the extra space doesn't matter, but if there are millions of characters, it does. As a computer scientist, when given a particular algorithm, it is up to you to decide how to use computing resources.


Python Data structure performance

Now you are big O Gain an understanding of algorithms and the differences between different functions. The goal of this section is to tell you, Python Large, for list and dictionary operations performance. We will then do some time-based experiments to illustrate the cost of each data structure and the benefits of using these data structures. It is important to understand the efficiency of these data structures because they are the building blocks used to implement other data structures in this book. In this section, we will not explain why this performance is. In later chapters, you'll see some possible implementations of lists and dictionaries, and how performance depends on the implementation.


List

Python Designers have many choices when implementing list data structures. Each of these choices may affect the performance of list operations. To help them make the right choice, they looked at how list data structures were most commonly used and optimized the implementation so that the most common operations were very fast. Of course, they also try to make less common operations fast, but when tradeoffs need to be made, performance for less common operations is usually sacrificed in favour of more common operations.

Two common operations are indexing and assigning to index positions. No matter how big the list is, both operations take the same amount of time. When such operations are independent of the size of the list, they are O(1).

Another very common programming task is adding a list. There are two ways to create longer lists, you can use the append Method or the concatenation operator. append the way O(1). However, the concatenation operator is(k), in k is the size of the list to be concatenated. This is for you because it can help you make your programs more efficient by choosing the right tools.

Let's look at four different ways we can generate a0startednlist of numbers. First, we will try for Loop through and create the list and then we will useappend rather than splicing. Next, we create the list using a list builder, and finally, and most obviously, by calling the list constructor wrapper range function.

import timeit

    def test1():
        l = []
        for i in range(1000):
            l = l + [i]

    def test2():
        l = []
        for i in range(1000):
            l.append(i)

    def test3():
        l = [i for i in range(1000)]

    def test4():
        l = list(range(1000))

    # Measure the time taken by each function
    print("Test 1:", timeit.timeit("test1()", setup="from __main__ import test1", number=1000))
    print("Test 2:", timeit.timeit("test2()", setup="from __main__ import test2", number=1000))
    print("Test 3:", timeit.timeit("test3()", setup="from __main__ import test3", number=1000))
    print("Test 4:", timeit.timeit("test4()", setup="from __main__ import test4", number=1000))
Enter fullscreen mode Exit fullscreen mode

To capture the time each of our functions takes to execute, we will use the Python Timeit module. Timeit The module is designed to allow Python Developers to make cross-platform timing measurements by running functions in a consistent environment and using the operating system's timing mechanisms as similarly as possible.

To use timeit, you need to create a timeit object whose parameters are two Python sentences. The first parameter is the time you want the execution to take; the second parameter is a statement that will be run once to set up the test. Then timeit module will calculate the time required to execute the statement. By default, timeit will try to run the statement a million times. When it completes, it returns the time as a floating-point value representing the total number of seconds.

Since it executes the statement a million times, it can use the read result as the number of microseconds to execute the test once. You can also pass Timer a parameter named number, which allows you to specify the number of times to execute the test statement. The following shows each test function running our 1000 How long does it take.

from timeit import Timer

    t1 = Timer("test1()", "from __main__ import test1")
    print("concat ", t1.timeit(number=1000), "milliseconds")

    t2 = Timer("test2()", "from __main__ import test2")
    print("append ", t2.timeit(number=1000), "milliseconds")

    t3 = Timer("test3()", "from __main__ import test3")
    print("comprehension ", t3.timeit(number=1000), "milliseconds")

    t4 = Timer("test4()", "from __main__ import test4")
    print("list range ", t4.timeit(number=1000), "milliseconds")
Enter fullscreen mode Exit fullscreen mode

In the above example, we have function calls test1(), test2(), etc., for which we are measuring the execution time using the timer function. The setup statement may seem unusual, so let's elaborate. You may be familiar with the from ... import statement, typically used at the beginning of a Python program file. In this case, from main import test1 imports test1 from the main namespace into the Timer function's setup namespace. This is done to ensure that the function is tested in a clean environment without any unintended interference from other variables.

It is evident from the experiments above that the append operation is significantly faster than concatenation. The other two methods, list comprehension and list(range()), have similar speeds, both being faster than concatenation.

Finally, the times displayed above include some overhead from calling the function itself. However, we can assume that the function call overhead is consistent across all cases, enabling us to make meaningful comparisons. Therefore, while the reported concatenation time of 6.54 milliseconds may not be entirely accurate, it represents the time taken for the concatenation operation. To obtain a more accurate measurement, one could test the time it takes to call an empty function and subtract it from the reported time.

Now that we have demonstrated how to perform performance testing, consider a scenario where we want to compare the runtime of different operations on a list, such as pop. The pop operation has different time complexities depending on its position in the list. When called at the end of the list, it operates in constant time ( 1 ) O(1). However, when called on the first element or anywhere in between, it operates in linear time ( ) O(n). This is because, in Python, when an item is removed from the beginning of the list, the remaining elements must shift one position closer to the start. Therefore, indexing operations are ( 1 ) O(1). Python's list implementation involves trade-offs, and the implementers carefully choose efficient solutions.

3rd

As a way to demonstrate the performance difference, we use a timeit. Let’s experiment. Our goal is to verify that from the end of the list elements and start from pop element properties. Likewise, we also want to measure the impact of different list sizes on this time. What we expect to see is that the time it takes to pop from the end of the list will remain the same even as the list grows. The time to pop elements from the beginning of the list will increase as the list grows.

Listing 4 shows two pop Comparisons of ways. As seen from the first example, popping from the end requires 0.0003 milliseconds. It costs to pop up from the start 4.82 milliseconds, for a 200 List of elements of 10,000, a difference of 16000 times.

Listing 4 A few points to note, first, from main import_x, Although we don't define a function, we do want to be able to use list objects in our tests, This approach allows us to count only a single pop statement, obtaining the most precise measurement of time for that operation. because the timer is repeated 1000 times, the list is decremented in size each time it is looped1. However since the initial list size is 200 million, we only reduce the overall size by 0.05%.

import timeit

    # Define timers for pop(0) and pop()
    popzero = timeit.Timer("x.pop(0)", "from __main__ import x")
    popend = timeit.Timer("x.pop()", "from __main__ import x")

    # Create a list with 2 million elements
    x = list(range(2000000))

    # Measure the time taken for pop(0) operation
    time_popzero = popzero.timeit(number=1000)
    print("Time taken for pop(0):", time_popzero)

    # Measure the time taken for pop() operation
    time_popend = popend.timeit(number=1000)
    print("Time taken for pop():", time_popend)
Enter fullscreen mode Exit fullscreen mode

Listing 4

Although our first test showed pop(0) Compared to pop() slow but it doesn't prove pop(0) yes O(n), pop() yes O(1). To verify this, we need to look at the effect of a series of calls to list sizes.

from time import Timer

    # Define timers for pop(0) and pop()
    popzero = Timer("x.pop(0)", "from __main__ import x")
    popend = Timer("x.pop()", "from __main__ import x")

    print("pop(0)        pop()")
    print("------------------------")

    # Loop over list sizes from 1 million to 100 million in steps of 1 million
    for i in range(1000000, 100000001, 1000000):
        x = list(range(i))

        # Measure time for pop(0)
        pz = popzero.timeit(number=1000)

        # Measure time for pop()
        pt = popend.timeit(number=1000)

        # Print results
        print("%15.5f   %15.5f" % (pz, pt))
Enter fullscreen mode Exit fullscreen mode

Listing 5


Dictionary

In Python, the second major data structure is the dictionary. Unlike lists, dictionaries allow access to items by key rather than position. Later in the book, you'll encounter various implementations of dictionaries. Operations such as get and set in dictionaries have a time complexity of (O(1)). Another essential operation is contains, which checks if a key is present in the dictionary, also with a time complexity of (O(1)). The efficiency of all dictionary operations is summarized in Table 3.

An important aspect of dictionary performance is that the efficiencies provided in the table are for average performance. In some rare cases, operations such as contains, get item, and set item can degenerate into (O(n)). We will delve into these cases in the following chapters.

4th

Table 3

In our final experiment, we will compare lists and dictionaries containing Operational performance. During this process, we will confirm the list contains operator O(n), and operator O(1). We will list a series of numbers in the experiment. Then randomly select a number and check if the number is in the list. If our performance table is correct, the larger the list, the longer it should take to determine whether any number is contained in the list.

Listing 6 This comparison is achieved. Notice that we do the same thing with the numbers in the container. The difference is that in the 7 On the line is a list, No.9 on the line is a dictionary.

import timeit
    import random

    for i in range(10000, 1000001, 20000):
        t = timeit.Timer("random.randrange(%d) in x" % i, "from __main__ import random, x")

    x = list(range(i))
    lst_time = t.timeit(number=1000)

    x = {j: None for j in range(i)}
    d_time = t.timeit(number=1000)

    print("%d,%10.3f,%10.3f" % (i, lst_time, d_time))
Enter fullscreen mode Exit fullscreen mode

Listing 6


Summarize

Algorithmic analysis is an independent method of measuring algorithms.
Big O notation allows the classification of algorithms by their main parts according to the size of the problem.


Target

  1. Understand abstract data types: stacks, queues, and lists.
  2. Utilize Python's list implementation for ADT stack, queue, and deque.
  3. Grasp the performance characteristics of basic linear data structure implementations.
  4. Familiarize with prefix, infix, and postfix expression formats.
  5. Implement postfix expressions using stacks.
  6. Convert expressions from infix to postfix using stacks.
  7. Employ queues for basic timing simulation.
  8. Develop the ability to identify issues with stacks, queues, and deques.
  9. Use data structures appropriately.
  10. Implement abstract data types lists as linked lists using nodes and references.
  11. Compare the performance achieved by our linked list implementation with Python's list implementation.

What is a linear data structure

We start our study of data structures with four simple but important concepts. stack, queue, deques, A list is a container of data in which the order of items is determined by the order in which they are added or deleted. Once a data item is added, it maintains its position relative to the elements before and after it. Data structures such as these are called linear data structures.

Linear data structures have two ends, sometimes called left and right, and in some cases front and back. You can also call them top and bottom, the names don't matter. What distinguishes two linear data structures is the way items are added and removed, specifically the positions at which items are added and removed. For example, some structures allow items to be added from one end, and others allow items to be removed from the other end.

These variant forms give rise to some of computer science's most useful data structures. They appear in a variety of algorithms and can be used to solve many important problems.


What is stack

Stack (sometimes called “Last in, first out”) is an ordered collection of items in which the addition and removal of new items always occur at the same end. This end is often called“top”. The end corresponding to the top is called “bottom”.

The bottom of the stack is important because items near the bottom of the stack are stored the longest. Recently added items are the first to be removed. This sorting principle is sometimes called LIFO, last in first out. It sorts based on the length of time within the collection. Newer items are closer to the top and older items are closer to the bottom.

Stack examples are very common. Almost all cafeterias have a stack of trays or plates, and you take one from the top and there's a new tray for the next customer. Imagine there is a pile of books on the table (Figure 1), Only the cover of the top book is visible. To see the covers of other books, you have to remove the books above them first. Figure 2 shows another stack containing many Python objects.

One of the most useful ideas about the stack comes from looking at it. Suppose you start with a clean desktop and now stack the books one on top of the other. You are constructing a stack. Consider what happens if you remove a book. The order of removal is the reverse of the order in which they were just placed. The stack is important because it reverses the order of items. Insertion and deletion are in reverse order. Figure 3 Shows Python During the process of creating and deleting data objects, pay attention to their order.

Think about this inverted property, and you can think of examples you encounter when using computers. For example, each Browser has a back button. When you browse the web, the web pages are placed in a stack (actually the URL of the web page). The page you are viewing now is at the top
Department, the page you viewed first is at the bottom. If you press the return button to browse the previous pages in reverse order.


Abstract data type of stack

The abstract data type of the stack is defined by the following structures and operations. As mentioned above, a stack is constructed as an ordered collection of items, where the positions where items are added and removed from the end are called “top”. The stack is ordered LIFO. The stack operates as follows.

Stack():

  • Create a new, empty stack.
  • It takes no parameters and returns an empty stack.

push(item):

  • Adds a new item to the top of the stack.
  • It takes an item as a parameter and does not return anything.

pop():

  • Removes the top item from the stack.
  • It takes no parameters and returns the item removed.
  • The stack is modified.

peek():

  • Returns the top item from the stack without deleting it.
  • No parameters are required.
  • The stack is not modified.

isEmpty():

  • Tests whether the stack is empty.
  • Requires no parameters and returns a Boolean value.

size():

  • Returns the number of items in the stack.
  • No parameters are required and it returns an integer.

For example, let's say "s" is an empty stack that has been created. Table 1 shows the results of a sequence of stack operations. On the stack, the top item is listed on the far right.


Python Implement stack

Now that we have the stack clearly defined with abstract data types, we will turn our attention to using Python to Implement the stack. Recall that when we give an abstract data type a physical implementation, we call the implementation a data structure.

As we did describe, in Python, as in any object-oriented programming language, the implementation of choice for an abstract data type (such as a stack) is to create a new class. Stack operations are implemented as methods of the class. Additionally, to implement a stack as a collection of elements, use Python The ability to provide a collection of primitives makes sense. We will use lists as the underlying implementation.

Think back, to Python The list class provides an ordered collection mechanism and a set of methods. For example, if we have the list[2,5,3,6,7,4], we need to determine which end of the list will be considered the top of the stack. Once determined, one can use something like the appendd and pop list method to implement the operation.

The following stack implementation (ActiveCode 1) assumes that the top element of the stack will be held at the end of the list. As the stack grows (push operation), the new item will be added to the end of the list. Similarly, the pop operation also operates on elements at the end of the list.

class Stack:
    def __init__(self):
        # Initialize an empty list to hold the stack items
        self.items = []

    def isEmpty(self):
        # Check if the stack is empty by comparing the list to an empty list
        return self.items == []

    def push(self, item):
        # Add an item to the top of the stack by appending it to the list
        self.items.append(item)

    def pop(self):
        # Remove and return the top item from the stack using the pop() method of lists
        return self.items.pop()

    def peek(self):
        # Return the top item from the stack without removing it
        # Access the last element of the list using list indexing
        return self.items[len(self.items)-1]

    def size(self):
        # Return the number of items in the stack using the len() method of lists
        return len(self.items)
Enter fullscreen mode Exit fullscreen mode

ActiveCode 1

Remember, we only define the implementation of the class; we need to create a stack and then use it. ActiveCode 2 demonstrates how we instantiate the Stack class and execute the operations listed in Table 1. Please note that the Stack class definition is imported from the Python module.

Note: The Python module contains implementations of all data structures discussed in this book. It encompasses basic data types, trees, and graphs. This module can be obtained from [pythonworks.org][5]/download.

from pythonds.basic.stack import Stack 
    s=Stack() 
    print(s.isEmpty()) 
    s.push(4) 
    s.push('dog') 
    print(s.peek()) 
    s.push(True) 
    print(s.size()) 
    print(s.isEmpty()) 
    s.push(8.4) 
    print(s.pop()) 
    print(s.pop()) 
    print(s.size())
Enter fullscreen mode Exit fullscreen mode

ActiveCode 2


Simple bracket matching

We now turn our attention to using the stack to solve real computer problems. How would you write an arithmetic expression like this?

(5+6)∗(7+8)/(4+3)

The parentheses are used for the execution of command operations. You may also have some experience with languages such as Lisp.

The structure:

(defunct square(n)
(* n n))

This code defines a function named square within a file. It calculates the square of the parameter ( n ). Lisp is renowned for its heavy use of parentheses.

In both instances, it's essential that the parentheses appear in a matching manner. Bracket matching ensures that each opening symbol has a corresponding closing symbol, and the brackets are nested correctly. For example, consider the following properly matched bracket string:

`(()()()())

(((())))

(()((())()))`

Compare those mismatched parentheses:

`((((((())

()))

(()()(()
`
The ability to distinguish matching parentheses is an important part of identifying constructs in many programming languages. The challenge is to write an algorithm that can read a string of symbols from left to right and decide whether the symbols are balanced. To address this issue, we need to make an important observation. When processing symbols from left to right, the most recent starting symbol must match the next closing symbol (See Figure 4). Additionally, the first start symbol of processing must wait until it matches the last symbol. End symbols match start symbols in reverse order. They match from the inside out. This is a clue that the stack can be used to solve the problem.

The algorithm is straightforward once you consider that the stack is an appropriate data structure for holding parentheses. Starting from an empty stack, the bracketed string is processed from left to right. If a symbol is a start symbol, treat it as a signal that the corresponding end symbol will appear later. On the other hand, if the symbol is a closing symbol and the stack is popped, the parentheses remain as long as the starting symbol popped from the stack can match each closing symbol.

Match status. If at any time there is no end symbol on the stack that matches the start symbol, the string does not match. Finally, when all symbols have been processed, the stack should be empty. Implement this algorithm in Python Code. See ActiveCode 1:

from pythonds.basic.stack import Stack

    def parChecker(symbolString):
        s = Stack()
        balanced = True
        index = 0

        while index < len(symbolString) and balanced:
            symbol = symbolString[index]

            if symbol == "(":
                s.push(symbol)
            else:
                if s.isEmpty():
                    balanced = False
                else:
                    s.pop()

            index = index + 1

        if balanced and s.isEmpty():
            return True
        else:
            return False

    print(parChecker('((()))'))
    print(parChecker('(()'))
Enter fullscreen mode Exit fullscreen mode

ActiveCode 1


Symbol matching

The matching parentheses problem shown above is a specific case of a general situation that occurs in many programming languages. It often happens that different kinds of start and end symbols are matched and nested. For example, inPython middle, square brackets [ and ] for lists, curly braces { and } for dictionaries. brackets ( and ) Used for tuples and arithmetic expressions. Symbols can be mixed as long as each symbol maintains its own start and end relationships. Symbol strings such as:

`{ { ( [ ] [ ] ) } ( ) }

[ [ { { ( ( ) ) } } ] ]

[ ] [ ] { }`

These are matched appropriately because not only does each start symbol have a corresponding end symbol, but the types of the symbols also match.

On the contrary, these strings cannot be matched:

`( [ ) ]

( ( ( ) ] ) )

[ { ( ) ]`

The simple bracket checker in the previous section can be easily extended to handle these new types of symbols. Recall that each start symbol is simply pushed onto the stack, waiting for a matching end symbol to appear. When the end symbol appears, the only difference is that we have to check to make sure it correctly matches the type of the start symbol at the top of the stack. If the two symbols do not match, the strings do not match. The string matches if the entire string has been processed and nothing is left on the stack.

Python See programActiveCode 1. The only change is 16 OK, let's call the helper function match. Each deleted symbol on the stack must be checked to see if it matches the current end symbol. Boolean variable if there is no matchbalanced set as False.

from pythons.basic.stack import Stack 
    def parChecker(symbolString): 
     s = Stack() 
     balanced = True 
     index = 0 
     while index < len(symbolString) and balanced:  symbol = symbolString[index] 
     if symbol in "([{": 
     s.push(symbol) 
     else: 
     if s.isEmpty(): 
     balanced = False 
     else: 
     top = s.pop() 
     if not matches(top,symbol):  balanced = False  index = index + 1 
     if balanced and s.isEmpty(): 
     return True 
     else: 
     return False 
    def matches(open,close): 
     opens = "([{" 
     closers = ")]}"
     return opens.index(open) == closers.index(close) 
    print(parChecker('{{([][])}()}')) 
    print(parChecker('[{()]'))
Enter fullscreen mode Exit fullscreen mode

ActiveCode 1

These two examples show that the stack is a very important data structure for computer language structure processing. Almost any nested symbol you can think of must be in a balanced matching order. The stack also has other important uses, which we will discuss in the next section.


Convert decimal to binary

In the process of learning computers, you may have come into contact with binary. Binary is important in computer science because all values stored in a computer are 0 and 1. Without the ability to convert between binary numbers and ordinary strings, our interactions with computers would be very tricky.

Integer values are common data items. They have always been used in computer programs and calculations. We learn them in math class, and of course, end up using decimals or radix10 to represent them. decimal233^10 and the corresponding binary representation11101001^2.

But how can we easily convert integer values to binary? the answer is the “remove 2” Algorithm that uses a stack to keep track of binary result numbers.

“remove 2” The algorithm assumes that we start from greater than 0 and start with an integer. Iteratively divide the decimal by 2, and keep track of the remainder. Divide the first 2 The remainder indicates whether the value is even or odd. There are even numbers The remainder is recorded as odd numbers have remainder1, recorded as1. We construct the resulting binary as a sequence of numbers, with the first remainder being the last number in the sequence. See Figure 5, Again we see the inverted property, indicating that the stack may be the data structure to solve this problem.

Activecode 1 neutralPython The code is implemented “remove 2” algorithm, function-divide By 2 A decimal parameter is passed in and divided repeatedly by 2. No.7 Lines use the built-in modulo operator% to extract the remainder, and No.8 Line pushes the remainder onto the stack. When divided by back, the 11-13 line constructs a binary string.

from pythonds.basic.stack import Stack

    def divideBy2(decNumber):
        remstack = Stack()

        while decNumber > 0:
            rem = decNumber % 2
            remstack.push(rem)
            decNumber = decNumber // 2

        binString = ""

        while not remstack.isEmpty():
            binString = binString + str(remstack.pop())

        return binString

    print(divideBy2(42))
Enter fullscreen mode Exit fullscreen mode

ActiveCode 1

This algorithm for binary conversion can be easily extended to perform conversions to any base. In computer science, various encodings are commonly used, including binary, octal, and hexadecimal. For example, the decimal number 233 corresponds to octal 352 and hexadecimal E9. The divide By 2 function can be modified to accept not only decimal arguments but also the base into which the conversion is expected. Instead of the specific concept of "divide by 2", a more general "divide by base" approach can be used.

ActiveCode2 presents a file named baseConverter function. It accepts decimal numbers and converts them to any base between 2 and 16 as parameters. The remainder is still pushed onto the stack until the converted value becomes 0. A set of symbols representing numbers greater than 9 is created to handle bases larger than 9.

from pythonds.basic.stack import Stack 
    def baseConverter(decNumber,base): 
     digits = "0123456789ABCDEF" 
     remstack = Stack() 
     while decNumber > 0: 
     rem = decNumber % base 
     remstack.push(rem) 
     decNumber = decNumber // base 
     newString = "" 
     while not remstack.isEmpty(): 
     newString = newString + digits[remstack.pop()]  return newString 
    print(baseConverter(25,2)) 
    print(baseConverter(25,16))
Enter fullscreen mode Exit fullscreen mode

ActiveCode2

Thank you! Cheers. Please send me an email message, if you find any bit of bugs in my generated content, or else I will take immediate action to fix and comfort your reading experience on this GUI platform. Love.

Support me on Linkedin and 𝕏 be followed on GitHub.

Top comments (0)