DEV Community: Hugo Estrada S.

Get Certified - Pass the AWS Cloud Practitioner 2022-2023

Hugo Estrada S. — Wed, 30 Nov 2022 15:26:40 +0000

A certification demonstrates competency, shows commitment and professionalism, and most importantly, motivates you to keep learning and gathering knowledge. It's also a great way to get a better job.

AWS is currently the number one cloud provider, and it continues to grow. Amazon constantly improves, changes and adds new services and resources to its cloud platform.

Every cloud provider has its pros and cons, but let's focus on the positive side of AWS:

AWS provides very strong and fast PaaS capabilities, which are a very important part of any cloud infrastructure in today’s world.
AWS offers an integrated environment for developing, testing and deploying cloud apps.
The new design of AWS is intuitive and easy to digest, and it includes a solid security development life cycle.
The developer tools are amazing and come in many flavors, from an IoT suite to machine learning.

So with all these cool features, it sounds like AWS is a very competitive cloud provider, and it's worth getting certified.

The AWS Cloud Practitioner exam is a multiple-choice, theoretical exam that covers the foundations upon which AWS designed its cloud platform, from standards to use cases for the services and products available on AWS.

If you're interested in getting the certification, download my document. This exam covers topics like security, privacy, compliance, general cloud concepts, and a basic understanding of pricing and support for several AWS services and resources. It doesn't matter if you're unfamiliar with AWS; my document will help you out!

DOWNLOAD
https://mega.nz/file/cU1nSJyQ#5bwSdl_VlSbgdHK2oOL7NZU-l9STRmYVQUnTCLMmY2A

Pandas Skills Pt. 2 🐼📊📚📐

Hugo Estrada S. — Wed, 16 Mar 2022 13:40:10 +0000

First of all, here is the link to the code for this lecture.

In this new installment, I will go over two very useful, fully documented notebooks.

I focused specifically on two very important topics: dealing with dates and missing values.

The two new notebooks are:

missing_values.ipynb
dealing_with_dates.ipynb

ENJOY!

Pandas Skills Pt. 1 🐼📊📚📐

Hugo Estrada S. — Mon, 06 Dec 2021 19:01:00 +0000

First of all, here is the link to the code for this lecture.

Part 1 - The Pandas DataFrame

The most fundamental aspect of Pandas is the DataFrame.
This is how your data is stored: a tabular format with rows and columns, as you'd find in a spreadsheet or a relational database table. So, before I dive into some more advanced Pandas topics, let me review the DataFrame concept.

import pandas as pd

After importing Pandas as 'pd', I'm going to create a dictionary called 'scores'. A dictionary is a Python structure that stores key-value pairs. In this dictionary the keys are 'name', 'city' and 'score', and the values are lists, as denoted by the square brackets, which are mapped to their corresponding keys.

scores = {'name': ['Hugo', 'David', 'René'],
          'city': ['Guatemala', 'Estanzuela', 'Zacapa'],
          'score': [50, 70, 100]}

Now, I'm going to transform this dictionary, into a Pandas DataFrame.

df = pd.DataFrame(scores)

To see the data, just type the name of the DataFrame.

df

You should see a table with 'name', 'city' and 'score' as column headers, and three rows of corresponding data.

Each column is a Series. Notice the values zero, one and two to the left: these are the index of our DataFrame, and they are useful for referencing and subsetting our data.

If we wanted to return just one column of our DataFrame, the notation is the DataFrame name followed by the column name or names in square brackets. Let's take a look at 'score'.

df['score']

You can also call df.score to return the same result.

Similarly, you can also create new columns in your DataFrame by passing a new column name into the square brackets and assigning it.

Here, I'm creating a new column that combines the 'name' and 'city' columns.

df['name_city'] = df['name'] + '_' + df['city']

Now let's say I wanted to subset my data to show only those folks with scores of 70 or above. To do that, I can create a boolean expression which returns True for scores greater than or equal to 70, and return only those records where this condition is true.

df[df['score'] >= 70]

Pandas is very flexible, in that you can import data from a wide variety of sources, including CSVs, Excel files, JSON files, databases, Parquet files, you name it.

I'm going to import the Iris dataset as a DataFrame called 'iris'. This is a common sample dataset for practicing data science.

You can find the dataset on Kaggle, here.

iris = pd.read_csv('iris.csv')

DataFrames have an attribute called 'shape', which tells us the dimensionality of our data. By calling the name of the DataFrame followed by '.shape', I can see the number of rows and columns that the DataFrame has.

To preview the data, the 'head' function returns the top records of the DataFrame.

Similarly, I can see the bottom rows with the 'tail' function.
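As a rough sketch, the three calls just described look like this. The snippet reuses the small 'scores' frame from earlier as stand-in data (an assumption made so it runs without iris.csv on disk); the same calls apply to 'iris'.

```python
import pandas as pd

# Stand-in data (assumption): the small 'scores' frame from earlier,
# used here so the snippet runs without iris.csv on disk.
df = pd.DataFrame({
    'name': ['Hugo', 'David', 'René'],
    'city': ['Guatemala', 'Estanzuela', 'Zacapa'],
    'score': [50, 70, 100],
})

print(df.shape)    # attribute, no parentheses: (rows, columns)
print(df.head(2))  # first two rows; head() with no argument returns five
print(df.tail(1))  # last row only
```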

When working with data in Pandas, datatypes are very important and will influence what operations can be performed. I have found Pandas to be pretty intelligent in how it assigns datatypes, but as the Russians say: "trust, but double check".

To do this, call the dtypes attribute on your DataFrame.

There are two datatypes represented in this DataFrame: float for all the measurement data, and object for the species.
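Here is a minimal sketch of checking inferred types. The frame and its column names are hypothetical, chosen only to show how decimals and whole numbers are inferred; how text columns are labelled can vary between pandas versions, so the snippet does not assume it.

```python
import pandas as pd

# Hypothetical frame: the column names and values are invented for illustration.
measurements = pd.DataFrame({
    'length_cm': [5.1, 4.9],   # decimals are inferred as a float type
    'count': [1, 2],           # whole numbers are inferred as an integer type
    'label': ['a', 'b'],       # text
})

# dtypes is an attribute that lists one datatype per column
print(measurements.dtypes)

length_is_float = pd.api.types.is_float_dtype(measurements['length_cm'])
count_is_integer = pd.api.types.is_integer_dtype(measurements['count'])
```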

Often when using Pandas, you'll want to subset your data. 'loc' subsets your data based on index labels, that is, the row index values or column names, while 'iloc' subsets by position, that is, the row number or column order.

I'm going to subset this DataFrame based on row indexes three, four and five, which are the fourth through sixth rows of our data. Note indexing begins at zero.

We can also return a single-cell value by passing a row label and a column name separated by a comma.

This returns 3.1, which is the measurement for sepal length for the row at index three in our data frame.

Using 'iloc' I can return the same value by referencing the same row index of three but a column index of zero.
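The difference between the two accessors can be sketched on the small 'scores' frame from earlier, used here as stand-in data (an assumption, so the snippet runs without iris.csv). Note that a label slice with 'loc' includes both endpoints, while a position slice with 'iloc' excludes the end.

```python
import pandas as pd

# Stand-in data (assumption): the 'scores' frame from earlier in the article.
df = pd.DataFrame({
    'name': ['Hugo', 'David', 'René'],
    'city': ['Guatemala', 'Estanzuela', 'Zacapa'],
    'score': [50, 70, 100],
})

rows_by_label = df.loc[1:2]          # label slice: both endpoints included
rows_by_position = df.iloc[1:3]      # position slice: end excluded, as in plain Python
cell_by_label = df.loc[1, 'score']   # row label 1, column 'score'
cell_by_position = df.iloc[1, 2]     # second row, third column

print(rows_by_label)
print(cell_by_label, cell_by_position)
```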

Often, after you've done a whole host of data transformation with Pandas, you want to export your DataFrame for analysis or visualization.
A handy way to do this is the 'to_csv' function.

iris.to_csv('iris-output.csv', index=False)

Note that you may want to pass index=False, so the index isn't included in your CSV.

Part 2 - Configuring Options in Pandas

Pandas has an option system, which allows you to customize how the package works for you. Most often, this can be useful to change how results are displayed in Pandas. Here's an example.

import pandas as pd

emissions = pd.DataFrame({
    "country": ['China', 'United States', 'India'],
    "year": ['2018', '2018', '2018'],
    "co2 emissions": [10060000000.0, 5410000000.0, 2650000000.0]})

emissions

I will start with this simple DataFrame. The first option that comes in handy configures the maximum number of rows displayed for a Pandas DataFrame.

If we set the max row size to two, here's what we get.

pd.set_option('display.max_rows', 2)
emissions

So you see two rows displayed, separated by an ellipsis; that's what this option does. You can use it either to limit the screen space your displayed DataFrames take up or, conversely, to expand the row limit and reveal more of your data. Similarly, the max columns display option will reveal or hide columns. I find this most useful when viewing the head of a DataFrame that has a lot of columns, as Pandas will truncate these by default.
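A small sketch of the max columns option follows. The 30-column frame and its 'col_N' names are made up for illustration, and 'option_context' is used so the setting applies only inside the with-block rather than globally.

```python
import pandas as pd

# Hypothetical wide frame: 30 columns named col_0 ... col_29, one row.
wide = pd.DataFrame({f'col_{i}': [i] for i in range(30)})

# option_context applies the setting only inside the with-block
with pd.option_context('display.max_columns', 4):
    truncated = repr(wide)   # middle columns are replaced by '...'

with pd.option_context('display.max_columns', None):  # None means no limit
    full = repr(wide)        # every column is shown, wrapped over several lines

print(truncated)
```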

By modifying the float format option, you can display values with lots of digits in normal (non-scientific) notation, and even add a comma as a thousands separator.

pd.options.display.float_format = '{:,.2f}'.format
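To see the effect without changing the global setting, here is a minimal sketch that applies the same format string temporarily. It uses a cut-down version of the emissions frame; using 'option_context' is my choice for the example, not something the option requires.

```python
import pandas as pd

# Cut-down version of the emissions frame from above
emissions = pd.DataFrame({'co2 emissions': [10060000000.0, 5410000000.0]})

# Apply the float format only inside the with-block
with pd.option_context('display.float_format', '{:,.2f}'.format):
    formatted = repr(emissions)

print(formatted)
```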

Part 3 - Advanced Calculations

One area where you might encounter some hurdles with Pandas is dealing with data types. As I said at the beginning of this article, Pandas is generally pretty good at assigning proper data types. Nonetheless, you'll find many instances where you need to convert data types.

To give you a couple of examples, I'm going to leverage the planets dataset, as it has a good variety of data types.

From looking over the DataFrame, you can probably infer what the data type assignments will be, but to be sure I can access the dtypes attribute of planets.

Now, the data types vary from object to integer to float. How Pandas handles your data depends on the data types you've designated. For example, I can use the 'mean' function to return the average of all float and integer columns in the dataset.

Aside from a warning, everything looks OK, but you might question whether it really makes sense to take an average of the year as I've done here.
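A hedged sketch of the same idea: depending on the pandas version, calling mean() on a frame that also holds text columns may warn or fail, so passing numeric_only=True is a way to make the intent explicit. The frame below is a hypothetical stand-in with invented values, not the real planets data.

```python
import pandas as pd

# Hypothetical stand-in for the planets data (values invented for illustration)
planets_sample = pd.DataFrame({
    'method': ['Radial Velocity', 'Transit', 'Imaging'],
    'number': [1, 2, 1],
    'mass': [7.1, 2.5, 4.8],
    'year': [2006, 2008, 2011],
})

# Average only the integer and float columns; the text column is skipped
means = planets_sample.mean(numeric_only=True)
print(means)
```

The 'year' average is computed without complaint, which is exactly the kind of result worth questioning.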

Let's see how different data types interact.

Here, I'm dividing an integer column by a float.

planets['number'][0] / planets['mass'][0]

The result is a float. Great, that's what you'd hope for.

I also have the ability to change data types using the 'astype' function. For instance, I can convert the integer value of the 'number' column to a float.

planets['number'][0].astype(float)

It's useful to see what happens when you convert a float to an int.

In this case, we've lost the decimal part. It's worth noting that this approach truncates rather than rounds, so positive floats are effectively rounded down as you convert to integers.
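A minimal sketch of that conversion, using a made-up Series rather than the planets data:

```python
import pandas as pd

# Invented values, including a negative one to show the direction of truncation
masses = pd.Series([7.9, 2.2, -1.5])

# astype(int) drops the fractional part (truncates toward zero)
as_int = masses.astype(int)
print(as_int.tolist())
```

If you want conventional rounding instead, calling round() before astype(int) is one option.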

I can also convert the 'year' to an object by calling 'astype(str)', for string.

planets['year'][0].astype(str)

To take advantage of the datetime data type in Pandas, I can convert the integer 'year' value to a datetime using 'to_datetime' and then specify how the data is currently formatted.

planets['year_dt'] = pd.to_datetime(planets['year'], format='%Y')
planets['year_dt']

Become a DevOps in 2021 pt. 1

Hugo Estrada S. — Sun, 07 Mar 2021 17:48:39 +0000

The DevOps Concept

DevOps is the combination of cultural philosophies, practices, and tools that increases an organization’s ability to deliver applications and services at high velocity: evolving and improving products at a faster pace than organizations using traditional software development and infrastructure management processes. This speed enables organizations to better serve their customers and compete more effectively in the market.

How DevOps Works

Under a DevOps model, development and operations teams are no longer “siloed.” Sometimes, these two teams are merged into a single team where the engineers work across the entire application lifecycle, from development and test to deployment to operations, and develop a range of skills not limited to a single function.

In some DevOps models, quality assurance and security teams may also become more tightly integrated with development and operations and throughout the application lifecycle. When security is the focus of everyone on a DevOps team, this is sometimes referred to as DevSecOps.
These teams use practices to automate processes that historically have been manual and slow. They use a technology stack and tooling which help them operate and evolve applications quickly and reliably. These tools also help engineers independently accomplish tasks (for example, deploying code or provisioning infrastructure) that normally would have required help from other teams, and this further increases a team’s velocity.

Benefits of DevOps:

Speed
Rapid delivery
Reliability
Improved collaboration
Security

Continuous Integration

Continuous integration is a software development practice where developers regularly merge their code changes into a central repository, after which automated builds and tests are run. The key goals of continuous integration are to find and address bugs quicker, improve software quality, and reduce the time it takes to validate and release new software updates.

Continuous Delivery

Continuous delivery is a software development practice where code changes are automatically built, tested, and prepared for a release to production. It expands upon continuous integration by deploying all code changes to a testing environment and/or a production environment after the build stage. When continuous delivery is implemented properly, developers will always have a deployment-ready build artifact that has passed through a standardized test process.

Infrastructure as Code

Infrastructure as code is a practice in which infrastructure is provisioned and managed using code and software development techniques, such as version control and continuous integration. The cloud’s API-driven model enables developers and system administrators to interact with infrastructure programmatically, and at scale, instead of needing to manually set up and configure resources. Thus, engineers can interface with infrastructure using code-based tools and treat infrastructure in a manner similar to how they treat application code. Because they are defined by code, infrastructure and servers can quickly be deployed using standardized patterns, updated with the latest patches and versions, or duplicated in repeatable ways.

Communication and Collaboration

Increased communication and collaboration in an organization is one of the key cultural aspects of DevOps. The use of DevOps tooling and automation of the software delivery process establishes collaboration by physically bringing together the workflows and responsibilities of development and operations. Building on top of that, these teams set strong cultural norms around information sharing and facilitating communication through the use of chat applications, issue or project tracking systems, and wikis. This helps speed up communication across developers, operations, and even other teams like marketing or sales, allowing all parts of the organization to align more closely on goals and projects.

Learn a Programming Language

DevOps teams can choose from many available programming languages. All languages have both strengths and weaknesses -- some inherent to the language itself, and others dependent on a given application or context in which the language is used.

DevOps explores the intersection between software development and traditional IT operations. While developers work most closely with programming languages, IT ops admins and DevOps engineers still need some level of familiarity with the languages used in their organizations to, for example, handle integrations and develop scripts.

There are some general tradeoffs to consider when choosing a programming language. For example, many IT operations admins use scripted or interpreted languages, as they enable rapid development. Compared to compiled languages, however, interpreted languages have a slower execution speed.

In addition, some programming languages are statically typed, while others are dynamically typed. Statically typed languages check data types for errors at compile time -- which results in fewer errors at runtime; dynamically typed languages don't check types until runtime. Statically typed languages also require DevOps teams to declare variables before use -- something dynamic typing does not require.
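To make the dynamic side of this tradeoff concrete, here is a small Python sketch. The function name and values are hypothetical; the point is only that the type mismatch is discovered when the line runs, not before.

```python
def add_port_offset(port, offset):
    # No types are declared for the parameters; Python checks them only
    # when this line actually executes.
    return port + offset

ok = add_port_offset(8000, 80)

try:
    add_port_offset("8000", 80)   # str + int is rejected only at runtime
    failed_at_runtime = False
except TypeError:
    failed_at_runtime = True

print(ok, failed_at_runtime)
```

In a statically typed language, the second call would typically be rejected by the compiler before the program ever ran.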

Go

Golang, also known as “Go,” is a fast, high-performance compiled language that is intended to be simple and easy to read and understand. Go was created at Google by Rob Pike, Robert Griesemer, and Ken Thompson, and it first appeared in November 2009.
The syntax of Golang is designed to be clean and accessible.

Here is a classic “hello world” example code with Golang:

Python

What’s Python’s role in DevOps? Python is one of the primary technologies used by teams practicing DevOps. Its flexibility and accessibility make Python a great fit for this job, enabling the whole team to build web applications, data visualizations, and to improve their workflow with custom utilities. On top of that, Ansible and other popular DevOps tools are written in Python or can be controlled via Python.

Unlike Go, Python has been around for a very long time. Python is an interpreted language, which means it is evaluated at runtime, but it supports quick development speeds -- a notable advantage in a fast-moving DevOps environment. Additionally, Python is very flexible, as it's a dynamically typed language; this enables it to interface with a variety of other tools within a DevOps workflow.

However, because it's an interpreted language, Python has a more complicated prerequisite setup and slower execution speed. The nature of dynamic typing can also introduce runtime errors more easily.

Here is a classic “hello world” example code with Python.
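A minimal version might look like the following; wrapping the greeting in a function is just one way to write it, and the function name is my own choice.

```python
def hello():
    # Return the greeting so it can be reused or checked as well as printed
    return "Hello, World!"

print(hello())
```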

C and C++

Both the C and C++ languages have a long and storied history. Powerful and developed extensively, these languages offer unparalleled capability across a variety of OSes. Execution speed and low-level access are among the most desirable features. C is a classic low-level procedural language, while C++ builds on C and adds object-oriented features on top.

Disadvantages include the languages' complexity, need for manual memory management, longer build times and the challenge to configure compilers correctly for the organization's needs.

Here is a classic “hello world” example code with C++:

Ruby

The biggest advantage of Ruby -- another interpreted language -- is its simplicity, as well as the industry's diverse support for Ruby Gems, or modules. Ruby's simplicity enables the rapid development and implementation of necessary scripts for DevOps processes.

Ruby, however, often has a slower execution speed, not only in terms of general performance, but also for boot speed in certain circumstances. Finally, if an IT organization uses Ruby for database access, its tight Active Record coupling means that admins might lose necessary flexibility, depending on the requirements.

Here is a classic “hello world” example code with Ruby:

Linux Basics

Modern Linux and DevOps have much in common from a philosophy perspective. Both are focused on functionality and scalability, as well as on the constant possibility of growth and improvement. While Windows may still be the most widely used operating system, and by extension the most common among DevOps practitioners, it is not the OS that many prefer. That honor goes to Linux. There are many DevOps practitioners who would like to try Linux for a variety of reasons but do not know exactly which distribution to use. This is a common problem and one that stems from a poor understanding of what each distribution offers.

No single distribution can be considered the best. One of the core principles behind Linux is customization. Different distributions, or versions, of Linux can be created depending on the exact needs of a particular individual or group, much the same way different cryptocurrencies are based on the same blockchain technology but with minor changes to fulfill specific functions; take Ethereum Classic vs Ethereum, for example.

DevOps and Linux

As previously mentioned, Linux and DevOps share very similar philosophies and perspectives; both are focused on customization and scalability. The customization aspect of Linux is of particular importance for DevOps. It allows design and security applications to be created for a particular development environment or particular development goals. There is much more freedom over how the operating system functions compared to Windows. Another convenience is that most software delivery pipelines use Linux-based servers. If the DevOps team is using a Linux-based operating system, they can do all testing in house and with extreme ease.

Because the Linux kernel can handle huge amounts of memory, any Linux-based system is highly scalable. If the hard drive or other hardware requirements change during the development process, these requirements can be added without losing processing power. The same cannot always be said of Windows.

There are many good reasons why DevOps practitioners should implement Linux distributions in their workplace. The benefits far outweigh the negatives and it can ultimately lead to a smoother, more efficient development environment. This being said, choosing the right distribution is not always easy. One must properly identify what exact requirements need to be fulfilled before making the decision.

As development requirements become more demanding, especially with the rise in cloud computing software, many more developers will begin turning to Linux not only for its customization and scalability but also because of its efficiency and superior processing capabilities compared to Windows and Apple.

Best Linux Distros for DevOps

Ubuntu
CentOS
Fedora
Cloud Linux OS
Debian

Shell Commands

$ ls

This command lists all the contents in the current working directory.
ls {path}
By specifying the path after ls, the content in that path will be displayed.
ls -l
Using the ‘l’ flag lists all the contents along with their owner, permissions and modification time.
ls -a
Using the ‘a’ flag lists all the contents, including hidden ones, in the specified directory.

$ sudo

The sudo command allows you to run programs with the security privileges of another user (by default, as the superuser). It prompts you for your personal password and confirms your request to execute a command by checking a file, called sudoers, which the system administrator configures.
sudo useradd
Adding a new user.
sudo passwd
Setting a password for the new user.
sudo userdel
Deleting the user.
sudo groupadd
Adding a new group.
sudo groupdel
Deleting the group.
sudo usermod -g
Adding a user to a primary group.

$ cat {flag}

This command reads and concatenates text files and displays their contents.
cat -b
This adds line numbers to non-blank lines.
cat -n
This adds line numbers to all lines.
cat -s
This squeezes blank lines into one line.
cat -E
This shows $ at the end of line.

$ grep {filename}

It’s used to search for a string of characters in a specified file. The text search pattern is called a regular expression. When it finds a match, it prints the line with the result. The grep command is handy when searching through large log files.
grep -i
Returns the results for case insensitive strings.
grep -n
Returns the matching strings along with their line number.
grep -v
Returns the result of lines not matching the search string.
grep -c
Returns the number of lines in which the results matched the search string.

$ sort {filename}

It’s used to sort a complete file by arranging the records in a specific order. By default, the sort command sorts files assuming that the contents are ASCII characters. The file is sorted line by line, and the blank space is used as the field separator.
sort -r
The flag returns the results in reverse order.
sort -f
The flag does case insensitive sorting.
sort -n
The flag returns the results as per numerical order.

$ head

The head command is a command-line utility for outputting the first part of files given to it as arguments or via standard input. It writes results to standard output. By default, head returns the first ten lines of each file that it is given.

$ tail

It is complementary to the head command. The tail command, as the name implies, prints the last N lines of the given input. By default, it prints the last 10 lines of the specified files. If you give more than one file name, the data from each file is preceded by its file name.

$ chown

Linux uses ownership and permissions to keep files secure and to restrict who can modify their contents. Linux distinguishes users and groups:
Each user has some properties associated with them, such as a user ID and a home directory. We can add users to a group to make managing them easier.
A group can have zero or more users. Each user is associated with a “default group” and can also be a member of other groups on the system.
Ownership and Permissions: To protect and secure files and directories in Linux, we use permissions to control what a user can do with a file or directory. Linux uses three types of permissions:
Read: For a file, this permission allows the user to read its contents. For a directory, it lets the user list the files and subdirectories stored in it.
Write: For a file, this permission allows the user to modify it. For a directory, it allows the user to change its contents (create, delete and rename files in it), but these changes only take effect if the directory also has the execute permission.
Execute: For a file, this permission allows the user to run it. For example, a shell script will not run unless it has the execute permission.
Types of file permissions:
User: This type of file permission affects the owner of the file.
Group: This type of file permission affects the group which owns the file. If the owner is also a member of this group, the user permissions apply instead of the group permissions.
Other: This type of file permission affects all other users on the system.
To view the permissions we use:
ls -l
The chown command is used to change the owner or group of a file.
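The user/group/other permission triplets can also be inspected from a script. The sketch below uses Python's standard library on a temporary file; the file name 'demo.sh' is hypothetical, and os.chmod is Python's counterpart of the chmod command. It assumes a Linux or other Unix-like system.

```python
import os
import stat
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'demo.sh')   # hypothetical file name
    with open(path, 'w') as fh:
        fh.write('#!/bin/sh\necho hello\n')

    os.chmod(path, 0o640)   # user: read+write, group: read, other: none
    before = stat.filemode(os.stat(path).st_mode)

    os.chmod(path, 0o750)   # add execute for user and group
    after = stat.filemode(os.stat(path).st_mode)

# Same notation as the first column of 'ls -l'
print(before)
print(after)
```

Each octal digit covers one triplet: 4 is read, 2 is write and 1 is execute, so 7 means all three and 5 means read plus execute.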

$ chmod {filename}

This command is used to change the access permissions of files and directories.

$ lsof [option] [username]

While working on a Linux/Unix system, many files and folders are in use at any time, some visible and some not. The lsof command stands for “list open files”. It lists the files that are currently open and shows which process has opened each one. Run without options, it lists all open files on the console in one go.

$ id [option]… [user]

It’s used to find out the user and group names and numeric IDs (UID or group ID) of the current user or any other user on the server. This command is useful for the following:
Show the user name and real user ID.
Find out a specific user's UID.
Show the UID and all groups associated with a user.
List all the groups a user belongs to.
Display the security context of the current user.
Options:
-g: Print only the effective group ID.
-G: Print all group IDs.
-n: Print a name instead of a number.
-r: Print the real ID instead of the effective ID.
-u: Print only the effective user ID.
--help: Display help messages and exit.
--version: Display the version information and exit.

$ cut

It’s used for extracting a portion of a file using columns and delimiters. If you want to list everything in a selected column, use the “-c” flag with the cut command. For example, let's select the first two columns from our demo1.txt file:
cut -c1-2 demo1.txt

$ sed

Sed is a text editor that performs editing operations in a non-interactive way. The sed command gets its input from standard input or from a file. Sed is a very powerful utility, and you can do a lot of file manipulation with it. I will explain the most important operation you might want to do with a text file.
If you want to search for a text in a file and replace it, you can use the sed command with the substitute “s” flag to search for the specific pattern and change it. For example, let's replace “mikesh” with “Mukesh” in the test.txt file:
sed 's/mikesh/Mukesh/' test.txt

$ diff

It’s used to find the difference between two files. This command analyses the files and prints the lines that differ. Let's say we have two files, test and test1. You can find the difference between the two files using the following command:
diff test test1

$ history

It’s used to view previously executed commands. This feature was not available in the Bourne shell. Bash and Korn support this feature, in which every command executed is treated as an event and is associated with an event number, by which it can be recalled and changed if required. These commands are saved in a history file. In the Bash shell, the history command shows the whole list of commands.

$ dd

It’s a command-line utility for Unix and Unix-like operating systems whose primary purpose is to convert and copy files.

$ find

The find command in UNIX is a command-line utility for walking a file hierarchy. It can be used to find files and directories and perform subsequent operations on them. It supports searching by file, folder, name, creation date, modification date, owner and permissions. By using ‘-exec’, other UNIX commands can be executed on the files or folders found.

$ free [option]

Linux provides a command-line utility for checking memory: the free command displays the total amount of free and used physical memory and swap memory in the system, as well as the buffers used by the kernel.
Without any option, the free command shows the used and free space of swap and physical memory in KB, as columnar output.
Options for free command:
-b, --bytes: It displays the memory in bytes.
-k, --kilo: It displays the amount of memory in kilobytes (default).
-m, --mega: It displays the amount of memory in megabytes.
-g, --giga: It displays the amount of memory in gigabytes.

$ ssh-keygen

Use the ssh-keygen command to generate a public/private authentication key pair. Authentication keys allow a user to connect to a remote system without supplying a password. Keys must be generated for each user separately. If you generate key pairs as the root user, only root can use the keys.

$ ip [ OPTIONS ] OBJECT { COMMAND | help }

The ip command in Linux is part of the iproute2 package and is used for performing several network administration tasks. It can show or manipulate routing, devices and tunnels. Typical tasks include assigning an address to a network interface, configuring network interface parameters, configuring and modifying default and static routes, setting up a tunnel over IP, listing IP addresses and property information, modifying the status of an interface, and assigning, deleting and setting up IP addresses and routes.

$ nslookup [option]

Nslookup (stands for “Name Server Lookup”) is a useful command for getting information from a DNS server. It is a network administration tool for querying the Domain Name System (DNS) to obtain domain name or IP address mapping or any other specific DNS record. It is also used to troubleshoot DNS related problems.

$ curl [options] [URL...]

curl is a command-line tool to transfer data to or from a server, using any of the supported protocols (HTTP, FTP, IMAP, POP3, SCP, SFTP, SMTP, TFTP, TELNET, LDAP or FILE). This command is powered by libcurl. This tool is preferred for automation, since it is designed to work without user interaction. It can transfer multiple files at once.

$ ps

Every process in Linux has a unique ID and can be seen using the command ps.
Options for the ps command:
-a: show processes for all users.
-u: display the process’s user/owner.
-x: also show processes not attached to a terminal.

$ kill

The kill command in Linux (located in /bin/kill) is a built-in command used to terminate processes manually. It sends a signal to a process, which terminates the process. If the user doesn’t specify a signal to send with the kill command, the default TERM signal is sent, which terminates the process.

$ df and $ du

The df (disk free) command reports the amount of disk space used and available on file systems. The du (disk usage) command reports the sizes of directory trees, inclusive of all of their contents, and the sizes of individual files.
The aim is to make sure you are not exceeding the 80% threshold. If you exceed the threshold, it’s time to scale or clean up the mess, because when you run out of resources your application starts to show erratic behavior.
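As a rough sketch, the same threshold check can be automated from a script using Python's standard library. The helper names and the default path are my own choices for illustration; 80% is the threshold mentioned above.

```python
import shutil

def usage_percent(path="."):
    # disk_usage reports total, used and free bytes for the file system
    # that contains the given path
    total, used, free = shutil.disk_usage(path)
    return used / total * 100

def over_threshold(percent, threshold=80.0):
    # True when usage has passed the threshold and it is time to scale or clean up
    return percent > threshold

current = usage_percent(".")
print(f"{current:.1f}% used, over threshold: {over_threshold(current)}")
```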

Linux Directory Structure

The directory separator in Linux is the forward slash (/). When talking about directories and speaking directory paths, “forward slash” is abbreviated to “slash.” Often the root of the file system is referred to as “slash” since the full path to it is /. If you hear someone say “look in slash” or “that file is in slash,” they are referring to the root directory.

/: The directory is called “root.” It is the starting point for the file system hierarchy. Note that this is not related to the root, or superuser, account.
/bin: Binaries and other executable programs.
/etc: System configuration files.
/home: Home directories.
/opt: Optional or third party software.
/tmp: Temporary space, typically cleared on reboot.
/usr: User related programs.
/var: Variable data, most notably log files.

Directory /bin

The /bin directory is where you will find binary or executable files. Programs are written in source code, which is human-readable text. Source code is then compiled into machine-readable binaries. They are called binaries because machine code is a series of zeros and ones. The important thing to know is that commands, programs, and applications that you can use are sometimes located in /bin.

Directory /etc

Configuration files live in the /etc directory. Configuration files control how the operating system or applications behave. For example, there is a file in /etc that tells the operating system whether to boot into a text mode or a graphical mode.

Directory /home

User home directories are located in /home. If your account name is “pat” your home directory will be /home/pat. Linux systems can and often do have multiple user accounts. Home directories allow each user to separate their data from the other users on the system. The pat directory is known as a subdirectory. A subdirectory is simply a directory that resides inside another directory.

Directory /opt

The /opt directory houses optional or third party software. Software that is not bundled with the operating system will often be installed in /opt. For example, the Google Earth application is not part of the standard Linux operating system and gets installed in the /opt/google/earth directory.

Directory /tmp

Temporary space is allocated in /tmp. Most Linux distributions clear the contents of /tmp at boot time. Be aware that if you put files in /tmp and the Linux system reboots, your files will more than likely be gone. The /tmp directory is a great place to store temporary files, but do not put anything in /tmp that you want to keep long term.

Directory /usr

The /usr directory is called “user.” You will find user related binary programs and executables in the /usr/bin directory.

Directory /var

The /var directory holds variable data, meaning data that changes often while the system runs. Most notably, this is where you will find log files.

Linux Directory Tree

SSH Management

SSH stands for Secure Shell and it is a protocol that is used to securely access a remote server on a local network or internet for configuration, management, monitoring, and troubleshooting, etc.

Official SSH Guide

https://www.ssh.com/ssh/command/

Networking Layers OSI Model

The Open Systems Interconnection (OSI) model is a seven-layer model used to visualize computer networks. The OSI model is often viewed as complicated and many fear having to learn it. However, it is an extremely useful tool for development and problem solving. The layers are numbered from one to seven, going up in increments of one as they get closer to the human user. Layer seven, the application layer, is closest to the person, while layer one, the physical layer, is where the network receives and transmits raw data. The OSI model belongs to the International Organization for Standardization (ISO) and is maintained under the identification ISO/IEC 7498-1. In this post, each of the seven layers of the OSI model will be explained in simple terms. The layers will be explained from layer seven to layer one, as this is where the data flow starts.

OSI Model Diagram

$ ip link

It’s for configuring, adding, and deleting network interfaces. Use ip link show command to display all network interfaces on the system.

$ ip address

Use the ip address command to display addresses, bind new addresses or delete old ones. The man page is ip-address.

$ ip route

Use the ip route command to print or display the routing table.

$ nmap

Nmap (“Network Mapper”) is a powerful utility used for network discovery, security auditing, and administration. Many system admins use it to determine which of their systems are online, and also for OS detection and service detection.
The default Nmap scan shows the ports, their state (open/closed), and protocols. It sends a packet to each of the 1000 most common ports and checks for a response.

$ ping

Use ping to see if a host is alive. This super simple command helps you check the status of a host or a network segment. Ping command sends an ICMP ECHO_REQUEST packet to the target host and waits to see if it replies.
However, some hosts block ICMP echo requests with a firewall. Some sites on the internet may also do the same.
By default, ping runs in an infinite loop. To send a defined number of packets, use -c flag.

$ iperf

While ping verifies the availability of a host, iPerf helps analyze and measure network performance between two hosts. With iPerf, you open a connection between two hosts and send some data. iPerf then shows the bandwidth available between the two hosts.
You can install iPerf using your distribution's package manager.

$ traceroute

If ping shows missing packets, you should use traceroute to see what route the packets are taking. Traceroute shows the sequence of gateways through which the packets travel to reach their destination.

$ tcpdump

tcpdump is a packet sniffing tool and can be of great help when resolving network issues. It listens to the network traffic and prints packet information based on the criteria you define.

$ netstat

Netstat command is used to examine network connections, routing tables, and various network settings and statistics.
Use -i flag to list the network interfaces on your system.

$ ss

Linux installations have a lot of services running by default. These should be disabled or preferably removed, as this helps in reducing the attack surface. You can see what services are running with the netstat command. While netstat is still available, most Linux distributions are transitioning to ss command.
Use the ss command with the -t and -a flags to list all TCP sockets. This displays both listening and non-listening sockets.

Data Visualization with Python pt. iii

Hugo Estrada S. — Mon, 18 Jan 2021 17:56:06 +0000

The notebooks of my Data Visualization Series are here:

https://github.com/hugoestradas/Data_Visualisation_with_Python.git

Let's talk about time series.

Anyone interested in data visualization should know and understand time series, but what on earth are time series <?>

In a nutshell, a time series is any chart that shows a trend over time; and often is a line chart.

Here's an example of a time series:

Usually time series in Python are built using Matplotlib, like the one shown above; and as you can see it's a combination of a line chart and a scatter plot.

Part 1: When should you use charts and time series <?>

Let's suppose you are a marketing manager for an online store.
You started selling some kind of popular product recently, and you want to see what kind of customers are buying this product.
So you started analyzing the sales data, and you found this piece of data:

When you looked at the sales volume on a particular Sunday, it turns out that there are more male buyers than female buyers, about 450 units sold for male customers, versus about 300 for female customers.

You might conclude, this product is more popular with males than females. With this information in mind you might then start targeting male customers in your marketing strategy. But, here's the question: "Using this graph alone, can you actually conclude that this product is more popular with male customers than female customers <?>".

Short answer: NOT NECESSARILY.

First of all, there are only about 750 units sold in total here, meaning the sample size is quite small. And even if the difference is statistically significant, it's possible that male customers tend to buy this product more than female customers only on Sundays.

In order to make an analysis much more robust, one approach is to plot a line chart over time, and make a time series for male and female customers.

After doing that, you might see a chart like this:

After seeing a chart like this, you can be more confident in your conclusion that male customers buy this product more than female customers, because the difference is consistent over time.

However, it is possible that you ended up with a chart like this:

Then, you wouldn't be able to make the same conclusion anymore.

Summarizing the main reasons why time series and line charts are so useful:

It's a consistent way to examine the trend over time.
If you have a particular hypothesis that you want to test, or an experiment that you're running; time series and line charts allow you to test it on a variety of conditions.
They make your analysis much more statistically robust, and reduce misinterpretation of your data.

Part 2: Creating Line Charts with Matplotlib

I am going to compare the GDP Per Capita growth in the US and China, using the same 'dataii.csv' file.

Since I want to compare GDP Per capita's trend over time; I'm going to create a time series with a line chart.

The important columns for this exercise are:

country
year
gdpPerCapita

Also, I'm going to use the iloc syntax to select an item in the Pandas series, and I'll be multiplying and dividing it with a scalar.

First, let's load the data into a Pandas Dataframe:

Time to examine how the GDP Per capita in the US has grown over time:

Now I'll grab the data for China, and compare it with the US data and plot it:

Now, for comparing the growth itself:

Now, to plot the final chart, I'm going to call the 'plt.plot()' function twice, putting US and China growth on the same graph, instead of the raw GDP Per Capita values:
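As a rough sketch of the growth comparison, the snippet below divides each series by its first value (selected with iloc) and multiplies by 100, then calls 'plt.plot()' twice. The numbers are placeholders made up for illustration, not values from 'dataii.csv', and the country labels are assumptions about how the dataset names them.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder numbers for illustration only; they are NOT real GDP figures.
data = pd.DataFrame({
    "country": ["United States"] * 3 + ["China"] * 3,
    "year": [1997, 2002, 2007] * 2,
    "gdpPerCapita": [100.0, 110.0, 120.0, 10.0, 15.0, 25.0],
})

us = data[data.country == "United States"]
china = data[data.country == "China"]

# Growth relative to the first year: divide by the first value and scale to 100.
us_growth = us.gdpPerCapita / us.gdpPerCapita.iloc[0] * 100
china_growth = china.gdpPerCapita / china.gdpPerCapita.iloc[0] * 100

# Two plt.plot() calls draw both lines on the same axes.
plt.plot(us.year, us_growth)
plt.plot(china.year, china_growth)
plt.legend(["United States", "China"])
plt.xlabel("year")
plt.ylabel("GDP per capita growth (first year = 100)")
plt.savefig("growth.png")
```

Because both lines start at 100, the chart compares how fast each country grew rather than how large the raw values are.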

And the final graph is the following:

Part 3: When to use Scatter Plots

In a nutshell, scatter plots provide a convenient way to visualize how two numeric variables are related in your data.

Here's a glimpse of a scatter plot that shows how weights and heights are related in a hundred people:

Part 4: Creating Scatter Plots with Matplotlib

We're going to examine how to create scatter plots with Matplotlib.

Suppose, as an example, you need to find out how GDP Per Capita and life expectancy are related to each other in different countries.

To do this, in the 'dataii.csv' file, our countries dataset; the columns we'll need to use are lifeExpectancy and gdpPerCapita, as well as year, so we can find the relationship between life expectancy and GDP Per Capita for each given year.

For this part I'm going to be importing the NumPy library, since I'm going to need the 'log10()' function:

Let's first examine how GDP Per Capita and life expectancy are related in 2007:

To create a scatter plot with gdpPerCapita and lifeExpectancy using the data of 'data2007' with plt, just type:
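A minimal sketch of such a scatter plot follows. The three rows are made-up placeholders standing in for 'data2007', and applying 'log10()' to gdpPerCapita is my assumption about why the function is needed (GDP values span several orders of magnitude).

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Placeholder rows for illustration only; these are NOT values from 'dataii.csv'.
data2007 = pd.DataFrame({
    "gdpPerCapita": [1000.0, 10000.0, 40000.0],
    "lifeExpectancy": [55.0, 70.0, 80.0],
})

# Plain scatter plot: one point per country.
plt.scatter(data2007.gdpPerCapita, data2007.lifeExpectancy)
plt.savefig("scatter_raw.png")
plt.clf()  # start a fresh figure for the second plot

# Same data with GDP on a log10 scale, which spreads out the low-GDP countries.
log_gdp = np.log10(data2007.gdpPerCapita)
plt.scatter(log_gdp, data2007.lifeExpectancy)
plt.xlabel("log10(GDP per capita)")
plt.ylabel("life expectancy")
plt.savefig("scatter_log.png")
```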

And the result plot is the following:

Data Visualization with Python pt. ii

Hugo Estrada S. — Sat, 16 Jan 2021 18:37:47 +0000

The notebook of this lecture is in my GitHub repo:

https://github.com/hugoestradas/Data_Visualisation_with_Python.git

Part 1: What on earth are "Histograms" <?>

Suppose you're in charge of making sure a website always loads fast, and one day the average page loading time in... let's say June, is significantly slower than in the previous 5 months.

These types of scenarios are where histograms really shine, because they show how the individual values are distributed instead of collapsing them into a single number.

Histograms help you understand the distribution of a numeric value in a way that you cannot with the mean or median alone.

Part 2: Histograms with Matplotlib

Let's import Pandas and Matplotlib:

For this example I'm going to be using a larger dataset called "dataii.csv", let's import it:

For this part, I'll create histograms using the 'subplot()' function.

To check which continents are included within the data I'll use the 'set()' function:

And this returns the following output, showing all the continents grouped:

Now, for example, if you need to select the data of Asia and Europe in 2007, first you need to select the data for 2007:

Then select the data for Asia out of the 'data2007' variable, and then the same procedure for Europe:

Check both 'asia2007' and 'europe2007' with the 'head()' function:

To check how many countries are in these two newly created datasets let's use the 'set()' function:

If you don't want to see the complete list of countries, but only the number of countries for each dataset, use the 'len()' function combined with the 'set()' function:

Use this combined with the 'print()' function for both datasets, and this should be the output:

Let's now find the mean and median of GDP per Capita in Asia and Europe in 2007:

To create a histogram of GDP per capita in Asia, type:

Now, to compare this histogram of the GDP Per Capita of Asia with the GDP Per Capita of Europe, both of 2007, let's use the 'subplot()' function:
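The steps of this part can be sketched end to end as follows. The five rows are placeholders invented for illustration (not values from 'dataii.csv'), and the column name "continent" is an assumption about the dataset.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder rows for illustration only; these are NOT values from 'dataii.csv'.
data = pd.DataFrame({
    "country": ["A", "B", "C", "D", "E"],
    "continent": ["Asia", "Asia", "Asia", "Europe", "Europe"],
    "year": [2007] * 5,
    "gdpPerCapita": [1000.0, 5000.0, 30000.0, 20000.0, 40000.0],
})

# Select the year first, then each continent out of that selection.
data2007 = data[data.year == 2007]
asia2007 = data2007[data2007.continent == "Asia"]
europe2007 = data2007[data2007.continent == "Europe"]

# Number of countries, then mean and median GDP per capita.
print(len(set(asia2007.country)), len(set(europe2007.country)))
print(asia2007.gdpPerCapita.mean(), asia2007.gdpPerCapita.median())
print(europe2007.gdpPerCapita.mean(), europe2007.gdpPerCapita.median())

# Two stacked histograms; sharing bins and range keeps them comparable.
plt.subplot(2, 1, 1)
plt.title("Distribution of GDP per capita")
plt.hist(asia2007.gdpPerCapita, 20, range=(0, 50000), edgecolor="black")
plt.ylabel("Asia")
plt.subplot(2, 1, 2)
plt.hist(europe2007.gdpPerCapita, 20, range=(0, 50000), edgecolor="black")
plt.ylabel("Europe")
plt.savefig("histograms.png")
```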

And the result is the following histogram:

Part 3: Comparing Complex Histograms

Now, let's compare Europe and America's life expectancy in 1997.

There are many ways to solve this problem, but my approach is the following:

First select only the data of 1997:

Then, from newly created dataset ('data97') extract America's and Europe's data:

Now, to check the number of countries in each new dataset:

Now to get the mean and median life expectancy of each new data set:

Now, finally to compare both datasets in histogram:

The final chart is the following:

Data Visualization with Python pt. i

Hugo Estrada S. — Fri, 15 Jan 2021 16:49:41 +0000

First things first, all the code I cover in this lecture is right here:

https://github.com/hugoestradas/Data_Visualisation_with_Python.git

Part 1: Using Matplotlib for the Very First Time

Matplotlib is a popular data visualization library for Python.

The reason I'm going to use it is that it's fairly easy to use and, out of the many Python data visualization libraries, it's the most commonly used one.

With Matplotlib you'll be able to create many different types of charts.

Let's see how to create a line chart with Matplotlib:

The first thing to do is to import the Matplotlib module:

Now to start plotting data I'll use the following line:

This line says "put 1, 2 and 3 in the 'x' axis; and 1, 4 and 9 in the 'y' axis".
To show this plot, the following line is necessary:

It is possible to add labels for the 'x' and 'y' axis and a title for the whole plot:

The whole cell should look like this, and the end plot should be the following:

It's also possible to plot multiple lines on the same plot:

And the plot looks like this:

To clarify the values of each line, it is possible to define them by name using the "plt.legend" method:

And the plot looks like this:

It is possible to export the plot as an image as well:
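Putting the steps of this part together, a minimal sketch could look like this. The second line's values, the label texts and the file name are placeholders I picked for the example.

```python
import os

import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside a notebook
import matplotlib.pyplot as plt

x = [1, 2, 3]
plt.plot(x, [1, 4, 9])      # first line: 1, 2, 3 on the x axis and 1, 4, 9 on the y axis
plt.plot(x, [10, 20, 30])   # second line on the same plot (placeholder values)

plt.xlabel("x label")
plt.ylabel("y label")
plt.title("Test plot")
plt.legend(["first line", "second line"])  # names follow the order of the plot() calls

plt.savefig("exported.png")  # export the plot as an image
print(os.path.exists("exported.png"))
```

In a notebook you would normally finish with 'plt.show()' instead of (or in addition to) 'plt.savefig()'.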

Part 2: Using Pandas

Pandas is a Python library that helps you import, organize and process data. Its dataframes are similar to "dataframes" in the R language.

Let's create a dataframe in Pandas, select data with Boolean indexing and finally plot using the same Pandas dataframe:

This is the data I'll be using:

To create a dataframe to store this data, I'm going to create a dummy data as dictionary with three attributes: 'year', 'attendees' and 'average age'.

And after executing the cell, the displayed dataframe should look like this one:

I can assign this newly created dataframe to a variable called 'df' (the standard variable name for a dataframe in Pandas):

And the result should be the same:

There are three columns in this dummy dataframe, you can select a single column out of this dataframe, for example:

The type of this new data is something called a "Pandas Series".

It's similar to a regular Python list and also to the NumPy array, if you're familiar with the NumPy library.

Knowing this, you can apply an inequality operation on the series with df['year'] < 2010:

This returns a series of Boolean values:

Let's store the output into a variable:

Using the Boolean Series you can select only the part of the data where the year is earlier than 2010, this is called "Boolean Indexing":
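Here is a small sketch of Boolean indexing. The three column names come from the text above, but every number in the dictionary is a placeholder invented for the example.

```python
import pandas as pd

# Dummy data: the values are placeholders, only the column names come from the text.
raw_data = {
    "year": [2008, 2009, 2010, 2011, 2012],
    "attendees": [100, 200, 300, 400, 500],
    "average age": [25, 30, 35, 40, 45],
}
df = pd.DataFrame(raw_data)

# Comparing a column with a scalar returns a Series of Boolean values.
earlier_than_2010 = df["year"] < 2010

# Passing that Boolean Series back into df keeps only the rows marked True.
subset = df[earlier_than_2010]
print(subset)
```

With this placeholder data, only the 2008 and 2009 rows are kept.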

Imagine that you want to examine how the number of attendees has changed for the last three events.

To best figure this out, you might want to plot the number of attendees against the year:

This line of course puts the year on the x axis and the attendees on the y axis, and the result is the following:

If you want to plot the number of attendees and average age on the same plot we can just call 'plt.plot()' multiple times:

Part 3: Importing Data with Pandas

For this example, the sample data that I'm going to use, is the following .csv file:

It is a list of countries and their basic demographics for each year, years ranging from 1952 to 2007 for every five years.

To import this .csv file, make sure that you either know the path of the file, or that both the notebook and the .csv file are located in the same location within the Jupyter instance.

This dataset is pretty small, but in real-world scenarios, if you want to have a glimpse of the data you're dealing with, all you need to do is use the 'head()' method:

This gives you the first five rows of the dataframe:

Now, suppose you would like to plot how GDP Per Capita has changed over time in Afghanistan.

To do that, it's necessary to isolate the data about Afghanistan from the data variable.
To select the country column you can write:

Either syntax does exactly the same thing:

Using this and Boolean Indexing you can select only the data about Afghanistan with:

And to plot it:

And the final plot is the following:

Another 5 Cool Python Exercises

Hugo Estrada S. — Wed, 11 Nov 2020 03:19:48 +0000

As usual, if you wanna go straight forward for the code, here it is: >>> https://github.com/hugoestradas/Python_Basics

1) Counting unique words.

Almost every modern word processor has a counting tool to get the total number of words in a document. I'll take this concept a little bit further, to practice both breaking down text and counting items.

For this exercise I'll write down a Python function to count the number of unique words and how often each occurs.

My input would be the path of a text file and the output or result will be the total number of words, the top 10 most frequent words and finally the number of occurrences for the top 10:

This time I'm importing two Python modules: "re" for regular expressions and "collections" for counting.
My function begins by opening the file, with a given "path" variable which stores the location of the file, then using a regular expression I find all the words within its text.
The search pattern looks for any sequence of one or more letters, numbers, hyphens, and/or apostrophes.

Then I convert the list of words that it finds into all uppercase and then print out the length of that list, which indicates the total number of words that were found.

On line 10 I'm creating a new "Counter" object and then use a for loop to iterate through the entire list of words and increment the entries within the counter's dictionary.

In the last block of code I use the Counter's "most_common" method
to retrieve a list of the 10 most common words, along with their count values to display:
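A minimal sketch of a function along those lines is shown below. The name `count_words` and the returned tuple are my own additions for illustration, and the line numbers mentioned in the text refer to the original listing, so they won't match this sketch.

```python
import collections
import re


def count_words(path):
    """Print and return the total word count and the 10 most frequent words."""
    with open(path, encoding="utf-8") as file:
        # One or more letters, numbers, hyphens and/or apostrophes.
        # The hyphen goes last in the character class so it is taken literally.
        all_words = re.findall(r"[0-9A-Za-z'-]+", file.read())

    # Uppercase everything so "The" and "the" are counted together.
    all_words = [word.upper() for word in all_words]
    print(f"Total words: {len(all_words)}")

    word_counts = collections.Counter()
    for word in all_words:
        word_counts[word] += 1

    top_ten = word_counts.most_common(10)
    print("Top 10 words:")
    for word, count in top_ten:
        print(f"{word}\t{count}")
    return len(all_words), top_ten
```

Calling `count_words("some_book.txt")` (a hypothetical file name) prints the total first and then one line per word with its number of occurrences.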

2) Merging CSV files

Comma-Separated Values (CSV) is a file format that stores tabular data in plain text.

I'm going to write a Python function to merge multiple CSV files into one.

I'm going to receive as input a list of files in a "path" variable.

The function should be robust enough to merge files whose headers don't even match.
The fields might be in different order, or a file could have additional fields that the other does not. It'll handle all of these cases, without losing any fields or data:

The first block of code builds up a list of field names, I start by creating an empty list then use a for loop to open up all of the files in the input list of CSV files to merge.
I used the CSV module's "DictReader" on line eight to extract all the field names from each file and then on line nine I add them to the fieldnames list, if they're not already in there from a previous input file.

The second part of the function, handles the writing of the records to the output file, based on those field names.
I used a context manager to open the output file to write on line 12, and then I created a new "DictWriter" object from the CSV module, passing in the list of field names I created.
Every record I add using this "DictWriter" includes all of the fields from that list.
On line 14, I write the first header row to the output file and then use the for loop to iterate through all of the input CSV files again.
I open each one up and create a new "DictReader" from it, then I use a for loop to iterate through each record in that input file and write it to the output file.
If that row I just read is missing certain fields, the "DictWriter" will leave them blank or empty in the output.
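The two passes described above can be sketched like this. The function name `merge_csv` and its parameters are illustrative, and the line numbers in the text refer to the original listing rather than to this sketch.

```python
import csv


def merge_csv(csv_list, output_path):
    """Merge several CSV files into one, keeping every field from every file."""
    # First pass: collect the field names, preserving first-seen order.
    fieldnames = []
    for path in csv_list:
        with open(path, newline="", encoding="utf-8") as input_csv:
            header = csv.DictReader(input_csv).fieldnames or []
            for name in header:
                if name not in fieldnames:
                    fieldnames.append(name)

    # Second pass: write the header once, then copy every record.
    with open(output_path, "w", newline="", encoding="utf-8") as output_csv:
        writer = csv.DictWriter(output_csv, fieldnames=fieldnames)
        writer.writeheader()
        for path in csv_list:
            with open(path, newline="", encoding="utf-8") as input_csv:
                for row in csv.DictReader(input_csv):
                    # Fields missing from this row are written as empty strings.
                    writer.writerow(row)
    return fieldnames
```

Reading every file twice keeps the code simple: the complete header must be known before the first record can be written.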

These are the CSV files I will be merging with my function:

And this is the final merged CSV file:

3) Save a dictionary.

Python's dictionaries are very popular among data scientists, data engineers and other data professionals. This is because they are awesome for storing and retrieving information. The only problem is that this data is kept in memory.

What if you need to use this dictionary later?

In this exercise I'll write a Python function that stores a dictionary into a file.

My two inputs are the dictionary to save, and the path for the output file.

I'll start by importing the "pickle" module, if you're not familiar with this library, I'll leave the official documentation here:

https://docs.python.org/3/library/pickle.html

And here's the code:
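Roughly, a function like this could look as follows. It's a minimal sketch: the names `save_dict` and `load_dict` are illustrative, and the loader is included only so the saved file can be checked.

```python
import pickle


def save_dict(dict_to_save, path):
    """Serialize a dictionary to a file with pickle."""
    with open(path, "wb") as file:  # pickle writes bytes, hence "wb"
        pickle.dump(dict_to_save, file)


def load_dict(path):
    """Read back a dictionary that was stored with save_dict."""
    with open(path, "rb") as file:
        return pickle.load(file)
```

Keep in mind that pickle files should only be loaded from sources you trust, since loading one can execute arbitrary code.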

For the execution of this program, I need to create a test dictionary object with the keys, after saving the dictionary into the file I can simply print the content from this file and show you the content remains there:

4) Create a ZIP archive

Okay, straightforward here's my code:

As you can see I imported the "os" module to search directories and to manipulate file paths and the "zipfile" module to actually build my zip file.

My function starts by opening the "output_zip" file using a context manager.
On the next line I use the "os.walk" function to explore and search in the directory.

As you can see, my for loop unpacks each result into a Linux-like directory structure: "root", "dirs" and finally "files".

I need to maintain the relative file path for files in the output archive, but if the user calls the "zip_all" function with an absolute path, the root path I get from the "walk" function will also be absolute. That's why on line 7 I use the "os.path.relpath" function.
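A minimal sketch of that approach is below. The two-parameter signature is an assumption for illustration (the original function may take other arguments), and the line number in the text refers to the original listing.

```python
import os
import zipfile


def zip_all(search_dir, output_path):
    """Add every file under search_dir to a new ZIP archive at output_path."""
    # Note: keep output_path outside search_dir, or the archive would include itself.
    with zipfile.ZipFile(output_path, "w") as output_zip:
        for root, dirs, files in os.walk(search_dir):
            # Path of the current folder relative to the starting directory,
            # so the archive never stores absolute paths.
            rel_root = os.path.relpath(root, search_dir)
            for name in files:
                output_zip.write(
                    os.path.join(root, name),
                    arcname=os.path.join(rel_root, name),
                )
```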

5) Find All List Items

Python's index method finds the index of the first item in a list;
but what if there are multiple instances of that item?

In this one, I'm writing a Python function to find the indexes for all of the
items in a list that are equal to a given value.

The inputs are the list to be searched and the value to search for.

The output should be the list of indices, each represented by a list of numbers.

It's good to keep in mind that Python lists can also contain other lists.
So, this function should be able to traverse multidimensional lists to find all
indices of the given value.
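One way to do this is with recursion. The sketch below is a minimal version under that assumption; the name `index_all` is illustrative.

```python
def index_all(search_list, item):
    """Return the index path (a list of numbers) of every occurrence of item."""
    indices = []
    for position, element in enumerate(search_list):
        if element == item:
            # Direct match at this level.
            indices.append([position])
        elif isinstance(element, list):
            # Search the nested list and prefix each result with our position.
            for sub_index in index_all(element, item):
                indices.append([position] + sub_index)
    return indices


example = [[[1, 2, 3], 2, [1, 3]], [1, 2, 3]]
print(index_all(example, 2))  # [[0, 0, 1], [0, 1], [1, 1]]
```

Each inner list is a path: [0, 0, 1] means "element 0, then element 0 inside it, then element 1 inside that".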

Result

5 Cool Python Exercises

Hugo Estrada S. — Mon, 09 Nov 2020 21:23:02 +0000

First things first, the repo with all the exercises of this lecture is right here:

https://github.com/hugoestradas/Python_Basics

Let's begin!

1) Find prime factors.

For the very basics, let's start with something unusual: Public Key Encryption. This technique relies on certain really large numbers being computationally hard to factor to keep data secure.
In this first exercise I'll factor some numbers that are easy to deal with; the goal is to create a Python function to find all prime factors, I'll do it by taking an integer value as input and the return or output will be a list of prime factors.

In this solution I decided to search for factors by dividing the given number by sequentially larger values (starting from 2) to see which ones divide evenly into it, without leaving a remainder behind:

As you can see, I'm calling the function with the number 500, so it will begin with 2 as the original divisor and keep dividing until the number no longer divides evenly, then move on to the next divisor, in this case resulting in 2, 2, 5, 5, and finally 5:
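The trial-division idea can be sketched as follows; the function name `get_prime_factors` is illustrative.

```python
def get_prime_factors(number):
    """Return the prime factors of a positive integer, smallest first."""
    factors = []
    divisor = 2
    while divisor <= number:
        if number % divisor == 0:
            # The divisor goes in evenly: record it and keep dividing by it.
            factors.append(divisor)
            number //= divisor
        else:
            # Otherwise move on to the next candidate.
            divisor += 1
    return factors


print(get_prime_factors(500))  # [2, 2, 5, 5, 5]
```

Composite candidates such as 4 never divide evenly, because their own prime factors have already been divided out by the time they are tried.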

2) Identifying Palindromes.

This is a very common programming and software engineering exercise; maybe you already did it in college, at school or while watching another tutorial. It's a very cool puzzle to solve because it involves pattern recognition, logic and of course coding.

In case it's your first time dealing with palindromes, a palindrome is a word or text that reads exactly the same, either forwards or backwards.

Again, I'll write a function to detect palindromes, where my input will be the string I'm checking and the result or output is going to be a boolean value (false/true):

Going line by line, first I'm importing the "re" library, which provides regular expressions to extract letters from an input string, then I'm defining a "palindrome" function that receives a "string" parameter.
Then I use the "lower" method on the input string to convert all of the letters to lowercase, and pass the result to the regular expression "findall" function with a pattern that searches for combinations of one or more letters. That produces a list with all of the matched sub-strings, which I merge together into a single string using the "join" function.

Then I slice the entire string, with the stride set to negative one, meaning I'll get a copy of the original string in reverse order.

Finally, I'm comparing both strings and return it:
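A minimal sketch of those steps is shown here; I named the function `is_palindrome` for the example.

```python
import re


def is_palindrome(phrase):
    """Return True if phrase reads the same forwards and backwards."""
    # Lowercase, keep only runs of letters, and join them into one string.
    forwards = "".join(re.findall(r"[a-z]+", phrase.lower()))
    # A slice with a stride of -1 gives a reversed copy.
    backwards = forwards[::-1]
    return forwards == backwards


print(is_palindrome("Go hang a salami, I'm a lasagna hog"))  # True
print(is_palindrome("hello"))  # False
```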

3) Sort a string.

Another common task in programming is sorting things.

The goal is to create a Python function that sorts the words within a given string.

The input will be a list of words separated by spaces, and the result or output will be the same string of words sorted alphabetically:

My "sorted" function starts with the "split" method, which breaks apart the input string at each of the spaces and gives me a list of the individual words.

Then, to ignore the capitalization (if there is any) in the loop I convert each word within the list into lower case, to later on sort the entire list:
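Here is a minimal sketch of that idea. I call the function `sort_words` so it doesn't shadow Python's built-in `sorted`; note that this version returns the words in lowercase.

```python
def sort_words(text):
    """Return the words of text sorted alphabetically, ignoring capitalization."""
    # Break the string apart at whitespace.
    words = text.split()
    # Lowercase every word so capital letters don't sort before lowercase ones.
    words = [word.lower() for word in words]
    words.sort()
    return " ".join(words)


print(sort_words("banana ORANGE apple"))  # apple banana orange
```

If you'd rather keep the original capitalization in the output, `sorted(text.split(), key=str.lower)` is an alternative.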

4) The waiting game.

For this exercise I'll write a Python function which, when invoked, prints a message asking the user to wait a random amount of time.

The user presses enter, then the timer starts. The user's goal is to wait the number of seconds specified in the message, and then press enter again.

For this exercise I used two modules: the "time" module to measure the amount of time,
and the "random" module to generate a random number of seconds.

The input function prompts the user to press enter to begin and then blocks the
execution until the user hits enter again.
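The game could be sketched like this. The 2 to 4 second range and the message texts are arbitrary choices of mine, and the `ask` and `clock` parameters are there only so the function can be tried without a keyboard.

```python
import random
import time


def waiting_game(ask=input, clock=time.perf_counter):
    """Challenge the user to wait a random number of seconds between two Enter presses."""
    target = random.randint(2, 4)  # arbitrary range picked for this sketch
    print(f"Your target time is {target} seconds.")
    ask("--- Press Enter to begin ---")  # blocks until the first Enter
    start = clock()
    ask(f"... Press Enter again after {target} seconds ...")  # blocks until the second Enter
    elapsed = clock() - start
    print(f"Elapsed time: {elapsed:.3f} seconds")
    if elapsed == target:
        print("Unbelievable! Perfect timing!")
    elif elapsed < target:
        print(f"{target - elapsed:.3f} seconds too fast")
    else:
        print(f"{elapsed - target:.3f} seconds too slow")
    return target, elapsed
```

Run `waiting_game()` from a terminal to play it with the real keyboard and clock.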

5) Generate a new password.

For this final example, I'll implement a function based on the "Diceware" method, which is a method for creating passphrases and passwords using the numbers rolled on an ordinary die as a hardware random number generator. It involves a list of over 7000 different words.

Instead of rolling a physical dice, I'll write a Python function that simulates this behavior.

The input will be a number of words in a passphrase and the output or result will be a string of random words, separated by spaces.

For this one, I could've used the "random" module, but instead I went for the "secrets" module, since the random module is not recommended when dealing with cryptographic procedures:

My function begins by getting the number of words, then opening the "diceware.wordlist.asc" file with a context manager and then uses "readlines" function to get a list with each of the lines within the file.

The top of the file diceware that I used has two extra lines before the word list actually begins, and at the bottom there are also several extra lines for a PGP signature:

So I indexed out the 7,776 lines from the middle of the file that I actually care about. Remembering that each of these lines contains both a five-digit number and the corresponding word, I used the split method to break them apart, and then built the list containing just the words.

Then I used the "secrets.choice" function within another list comprehension to build a list with the desired number of random words.

And finally, I used the join method to combine the random words into a single string with spaces between them:
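A minimal sketch of those steps follows. The function name and the `wordlist_path` parameter are illustrative, and the slice assumes the layout described above: exactly two header lines, then 7,776 word lines.

```python
import secrets


def generate_passphrase(num_words, wordlist_path="diceware.wordlist.asc"):
    """Return num_words random Diceware words separated by spaces."""
    with open(wordlist_path, encoding="utf-8") as file:
        # Skip the 2 header lines and keep the 7,776 word lines;
        # the PGP signature lines at the bottom fall outside the slice.
        lines = file.readlines()[2:7778]

    # Each line holds a five-digit number and a word; keep only the word.
    word_list = [line.split()[1] for line in lines]

    # secrets.choice draws from a cryptographically strong random source.
    words = [secrets.choice(word_list) for _ in range(num_words)]
    return " ".join(words)
```

For example, `generate_passphrase(6)` returns six words, as long as the word list file is in the working directory.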

Start your Journey with TensorFlow

Hugo Estrada S. — Fri, 09 Oct 2020 20:59:03 +0000

First things first, you can find the notebook with all the content of this lecture in my GitHub repo:

https://github.com/hugoestradas/DataScience-101

Let's find out how to start using TensorFlow from the very entry level.

For this lecture, use any TensorFlow 2.x version.
I'm going to use the Azure Databricks platform, but feel free to use your preferred notebook solution (Jupyter, Kaggle, Colab, etc).

Part i: What on earth is a Tensor <?>

According to Wikipedia: "Is an algebraic object that describes a (multilinear) relationship between sets of algebraic objects related to a vector space."

Using the previous definition, we can say for our needs that tensors are TensorFlow's multi-dimensional arrays with uniform type. They're very similar to NumPy arrays and they're immutable, meaning once they're created they cannot be modified or altered. You can only create a new copy with the edits.

Let's go to our environment to see how Tensors behave using a simple coding example.
Before actually making our first tensor, we need to (of course) import the TensorFlow library, and for now I'm importing NumPy as well.
Quick note here, in my Azure Databricks Cluster, I need to manually install TensorFlow before importing it:

Checking, the library was installed without problems:

Now, opening a fresh notebook I should be able to import TensorFlow without any issues:

Part ii: Creating Tensors

So far I installed TensorFlow with NumPy and imported both libraries into my notebook, time to work with Tensor Objects.

There are many ways to create a "tf.Tensor" object.

Here are a few examples.

First, you can create a tensor object with several TensorFlow functions.

Let's create a tensor object using "tf.constant":

It is possible to create tensor objects only consisting of "1s" with the "tf.ones" function:

Also, you can create tensor objects only consisting of "0s" using the "tf.zeros" function:

And finally, there's the "tf.range()" function to create tensor objects:

And here's the output of each method I used:

These are the easiest and most common ways of creating tensor objects in TensorFlow.

If you were observant, you noticed that we created tensor objects with the shape (1,5) with three different functions, and a fourth tensor object with shape (5,) using the function "tf.range()".

The "tf.ones" and "tf.zeros" functions accept the shape as the required argument, since their element values are pre-determined.
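For reference, here is a minimal sketch of the four creation functions, assuming TensorFlow 2.x is installed. The variable names and the concrete values are my own; only the shapes follow the description above.

```python
import tensorflow as tf

# tf.constant builds a tensor from the values you pass in.
constant_tensor = tf.constant([[1, 2, 3, 4, 5]])

# tf.ones and tf.zeros only need a shape, since every element is 1 or 0.
ones_tensor = tf.ones(shape=(1, 5))
zeros_tensor = tf.zeros(shape=(1, 5))

# tf.range produces a 1-D sequence: start is included, limit is excluded.
range_tensor = tf.range(start=1, limit=6, delta=1)

for tensor in (constant_tensor, ones_tensor, zeros_tensor, range_tensor):
    print(tensor)
```

The first three tensors have shape (1, 5), while the one from "tf.range()" has shape (5,).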

Part iii: Classifying Tensors

We use "tf.Tensor" in order to create TensorFlow tensor objects, and they have many characteristic features, the main 3 are:

The first one that is important to understand, is that they have a rank based on the number of dimensions they have.

The second most important characteristic of tensor objects in TensorFlow is that they have a shape, which is a list consisting of the lengths of all their dimensions. All tensors also have a size, which represents the total number of elements within a tensor.

And thirdly, their elements are all recorded in a uniform Dtype (data type).

Let's dig deeply into these features.

Tensors can be categorized based on the number of dimensions they have:

0-D (Scalar) Tensor: it's a tensor containing a single value and no axes.

1-D (Rank-1) Tensor: it's a tensor containing a list of values in a single axis.

2-D (Rank-2) Tensor: it's a tensor containing 2 axes.

N-D (Rank-n) Tensor: it's a tensor containing n axes.

Taking the above image as our reference for the dimensions (axes) of tensor objects, we can create a 3-D tensor by passing a three-level nested list to the "tf.constant" function. Here the numbers are split into a three-level nested list with three elements at each level:

The shape feature is another attribute that every tensor has. It represents the size of each dimension in the form of a list to make it easier to understand.
We can view the shape of the "tensor_3d" object we created with the ".shape" attribute:

Size is another feature tensor objects have, and it represents the total number of elements a tensor has.
It's not possible to measure the size with an attribute of the tensor object. Instead, the "tf.size()" function is used, and then the output is converted to NumPy with the instance function ".numpy()" to get a human-readable result:

Last but not least, we have the Dtypes. Tensors often contain data types such as ints and floats, but may contain many other data types such as complex numbers and strings. Each tensor object, however, must store all its elements in a single uniform data type. We can view the data type selected for a particular tensor object with the ".dtype" attribute:
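To tie rank, shape, size and dtype together, here is a small sketch. The "tensor_3d" name comes from the text; the values inside it are an assumption chosen so that each of the three levels holds three elements:

```python
import tensorflow as tf

# A three-level nested list with three elements at each level.
tensor_3d = tf.constant([
    [[0, 1, 2], [3, 4, 5], [6, 7, 8]],
    [[9, 10, 11], [12, 13, 14], [15, 16, 17]],
    [[18, 19, 20], [21, 22, 23], [24, 25, 26]],
])

rank = tf.rank(tensor_3d).numpy()    # number of axes
shape = tensor_3d.shape              # length of each axis
size = tf.size(tensor_3d).numpy()    # total number of elements
dtype = tensor_3d.dtype              # single data type shared by all elements

print("rank:", rank)
print("shape:", shape)
print("size:", size)
print("dtype:", dtype)
```

Python integers default to "tf.int32" when passed to "tf.constant", which is why no dtype had to be specified here.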

Part iv: Operating with Tensors

Now that we understand the basic properties and features of the tensor objects with TensorFlow, we can play a little bit with them.

Let's start with indexing. As you already know if you're reading this, an index represents an item's position in a sequence, and this sequence can refer to many things: a string of characters, a list, a sequence of values, etc.

Luckily, TensorFlow follows Python's standard for indexing, similar to list indexing or indexing with NumPy. The rules are:

Indexes start at zero (0).
Negative index values ("-n") mean counting backward from the end.
Colons (":") are used for slicing: "start:stop:step".
Commas (",") are used to reach deeper levels.

Following the rules above, let's create a 1-D tensor object:

Now I'll apply the 4 rules of indexing:

Rule #1 - Indexes start @ 0:

Rule #2 - Negative values means backward counting:

Rule #3 - Colons slice:

Rule #4 - Commas reach deeper levels:
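A compact sketch of the four rules follows. The tensors and their values are assumptions made for illustration; a second, 2-D tensor is introduced only because rule #4 needs more than one axis to be visible:

```python
import tensorflow as tf

rank_1 = tf.constant([0, 1, 1, 2, 3, 5, 8, 13, 21, 34])

# Rule #1: indexes start at 0.
first = rank_1[0].numpy()

# Rule #2: negative values count backward from the end.
last = rank_1[-1].numpy()

# Rule #3: colons slice as start:stop:step (stop is exclusive).
every_other = rank_1[2:7:2].numpy()

# Rule #4: commas reach deeper levels (row 1, column 2 of a 2-D tensor).
rank_2 = tf.constant([[1, 2, 3], [4, 5, 6]])
deep = rank_2[1, 2].numpy()

print(first, last, every_other, deep)
```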

Okay, now that I've shown you the basic indexing techniques with tensor objects, let's do some operations with them.

First, let's create two tensor objects to interact with later on:

Let's start by adding one tensor to another. For this we can use the "tf.add()" function and pass the tensors as arguments:

Next, I'll use element-wise multiplication. For this I'll use the "tf.multiply()" function and pass the tensors as arguments again:

We can even do matrix multiplication with the "tf.matmul()" function... yes! passing the tensors as arguments:

Let's say we'd like to know the maximum or minimum value within a tensor. For that there are the "tf.reduce_max()" and "tf.reduce_min()" functions:

Similarly, it's possible to find the index of the maximum element using the "tf.argmax()" function:
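The operations above can be sketched as follows. The two 2x2 tensors "a" and "b" are hypothetical inputs picked to keep the arithmetic easy to check by hand:

```python
import tensorflow as tf

a = tf.constant([[2, 4], [6, 8]], dtype=tf.float32)
b = tf.constant([[1, 3], [5, 7]], dtype=tf.float32)

added = tf.add(a, b)             # element-wise sum
multiplied = tf.multiply(a, b)   # element-wise product
matrix_product = tf.matmul(a, b) # row-by-column matrix product

largest = tf.reduce_max(a)       # biggest value in the whole tensor
smallest = tf.reduce_min(a)      # smallest value in the whole tensor
argmax_rows = tf.argmax(a)       # row index of the maximum in each column (axis 0)

print(added.numpy())
print(multiplied.numpy())
print(matrix_product.numpy())
print(largest.numpy(), smallest.numpy(), argmax_rows.numpy())
```

Note that "tf.argmax()" works along axis 0 unless another axis is passed, so on a 2-D tensor it returns one index per column rather than a single position.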

We can make operations and play with the shapes of the tensors.
If you're familiar with Pandas DataFrames or NumPy Arrays, then you'll understand the concept of "Reshaping a Tensor".

The "tf.reshape" operations are very fast, since the data does not need to be duplicated.
Let's see how it works:

Firstly, create a new tensor object:

and

then a third tensor:

Now, if we pass -1 as the "shape" argument of "tf.reshape", we flatten our tensor object:

Reshaping tensor objects with TensorFlow is ridiculously easy. But it's important to keep in mind that the new shape must be compatible with the number of elements in the tensor; otherwise the data ends up arranged in a way that no longer makes sense, or the operation raises an error.
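A short sketch of the reshape steps, assuming a starting tensor of six elements (the names "a", "b", "c" and "flat" are illustrative):

```python
import tensorflow as tf

a = tf.constant([[1, 2, 3, 4, 5, 6]])   # shape (1, 6)

b = tf.reshape(a, [6, 1])               # same 6 elements as a single column
c = tf.reshape(a, [3, 2])               # same 6 elements as 3 rows of 2
flat = tf.reshape(c, [-1])              # -1 lets TensorFlow infer the length

print(a.shape, b.shape, c.shape, flat.shape)

# An incompatible shape fails: 6 elements cannot fill a 4x2 tensor.
try:
    tf.reshape(a, [4, 2])
    reshape_failed = False
except (tf.errors.InvalidArgumentError, ValueError):
    reshape_failed = True
print("incompatible reshape failed:", reshape_failed)
```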

Part v: Special Types of Tensors

So far we've talked about tensors as rectangular shapes that store only numerical values.
But they are more powerful than that: tensors also support irregular or even specialized data. There are:

Ragged Tensors
String Tensors
Sparse Tensors

Starting with Ragged Tensors, these are tensors with different numbers of elements along one of their axes:

Moving forward, String Tensors are tensors that store string objects. They're created like a normal tensor object, the only difference being that the elements are strings:

And finally, we have the Sparse Tensors, which are rectangular tensors for sparse data.
These are useful when you have holes, null values or other messy kinds of things in your data. Sparse Tensors are the go-to objects for this kind of data; they take a bit more effort to build and deserve to be more mainstream:
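Here is a minimal sketch of the three special tensor types. All values, and the 5x5 dense shape of the sparse tensor, are assumptions made for illustration:

```python
import tensorflow as tf

# Ragged tensor: rows of different lengths.
ragged = tf.ragged.constant([[1, 2, 3], [4, 5], [6]])

# String tensor: created like any other tensor, but holding strings.
strings = tf.constant(["With this", "code, I am", "creating a String Tensor"])

# Sparse tensor: only the non-zero positions and their values are stored.
sparse = tf.sparse.SparseTensor(
    indices=[[0, 0], [2, 2], [4, 4]],
    values=[25, 50, 100],
    dense_shape=[5, 5],
)
dense = tf.sparse.to_dense(sparse)  # expand to a regular tensor for display

print(ragged.shape)
print(strings.dtype)
print(dense.numpy())
```

The ragged tensor reports its second dimension as "None" because the row lengths differ.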

Get Certified - Pass the Azure Administrator Exam (AZ-104)

Hugo Estrada S. — Sun, 16 Aug 2020 04:42:35 +0000

A certification provides competency, shows commitment, professionalism and most importantly: it motivates you to keep learning and gather knowledge. It's also a great way to get a better job.

Currently Microsoft Azure is the number 2 cloud provider behind AWS, and is still growing. Microsoft constantly improves, changes and adds new services and resources to its cloud platform.

Right now (Aug. 2020) more than 80% of the Fortune 500 companies are extensively using Microsoft Azure. With those numbers, it's easy to assume the huge demand for Azure skilled professionals.

The latest Microsoft Azure Administrator Exam, the AZ-104, is aimed at software and cloud engineers interested in mastering the skills needed to operate a Microsoft Azure-based cloud infrastructure.

In a nutshell, an Azure Administrator is responsible for implementing, monitoring and maintaining Microsoft Azure solutions, including major services related to Compute, Storage, Network and Security. The Azure Administrator will provision, size, monitor, and adjust resources as appropriate.

The exam is divided into 4 major areas:

1) Knowledge and Management of Azure Identity and Security.

2) Management of Azure Storage Services and Solutions.

3) Knowledge and Management of Azure Networking.

4) Management of Microsoft Azure Compute Services and Solutions.

If you're interested in getting the Azure Administrator Certification, download my "AZ-104 Practice and Answers" document. This document covers all the 4 areas of the AZ-104 Exam.

For this exam, you do need a solid background and familiarity with the Azure Portal, Azure Services and Azure Resources. If you already work with or feel comfortable in the Azure Ecosystem, then just practice, do some reading and use my document to measure your score in order to pass the exam.

If you're new to Azure and you're interested in getting the AZ-104 certification, I highly recommend practicing on your own, reading the Microsoft Documentation and getting the Azure Fundamentals certification first. Take a look at my AZ-900 document:
https://dev.to/hugoestradas/get-certified-pass-the-azure-fundamentals-exam-az-900-le7

A quick disclaimer worth mentioning is that Microsoft currently offers two Azure Administrator Exams: AZ-103 and AZ-104. The topics of my document are designed exclusively for the AZ-104 exam.

Comparing AZ-103 and AZ-104 helps in judging how well each exam fits modern Azure Administrator job roles. Both exams share the same second domain, the implementation and management of storage, but AZ-104 puts more focus on the management of data.

In this domain, the AZ-103 certification exam deals with topics such as the creation and configuration of storage accounts, while the new exam additionally deals with the management of storage accounts. The AZ-103 exam only included topics on importing and exporting data to Azure, whereas AZ-104 focuses on the management of data in Azure storage.

No matter which exam you choose, if you pass you'll get the Azure Administrator Certification, but since the AZ-104 is the latest version, in my personal opinion it's a better choice.

DOWNLOAD MY AZ-104 PRACTICE DOCUMENT
https://mega.nz/file/sA9FDBjA#eRVBcceAT90YaMa3MbC0Ka95ecZqZxO2ph8xM9NYn0Q

Pandas 101 - pt. iii: Data Science Challenges

Hugo Estrada S. — Tue, 28 Jul 2020 04:19:51 +0000

Like always, the repository with all the notebooks and data of this three-part series about Pandas is here:

https://github.com/hugoestradas/Pandas_101

In this final lecture about Pandas, I'm going to show you some practical scenarios that you might face as a Data Scientist in the real world, and some more advanced techniques for data analysis.

Let's start by importing Pandas and Matplotlib (a library for creating static, animated, and interactive visualizations in Python):

1) Know your Dataset, (I really mean it)

Data Science in a nutshell can be defined as the process by which we extract information from data. When doing Data Science, what we're really trying to do is explain what all of the data actually means in the real world, beyond the numbers.

In order to properly explain the information we've been working with, we need to understand what data we've just figured out, and the only way of doing this, is by really getting into the sources, the data-sources: the data set.

Depending on the needs and circumstances of the project you can always start by answering these 5 questions:

What is the question we are trying to answer or problem we are trying to solve? This involves understanding the problem and tasks involved. Do we have the skill and the resources needed?
What was the process by which the data arrived to us? This means understanding how it was created and transformed.
What does the data look like? This is exploratory analysis. What dimensions exist, what measures exist, what is the relationship between each measure and other measures and dimensions.
Are there issues with the data? If yes, how did they appear and can we solve them. This is related with outliers, NAs, NULLs, etc.
Can we answer the question or solve the problem with these data? If not, why and can I solve it?

2) Dealing with Dates and Times

A lot of the analysis you will do might relate to dates and times, for instance: finding the average number of sales over a given period, selecting a list of products to data mine if they were purchased in a given period, or trying to find the period with the most activity in an online discussion.

I'll explain some of the basics of working with time series analysis.

First, you should be aware that dates and times can be stored in many different ways. One of the most common legacy methods for storing the date and time in online transaction systems is based on the offset from the epoch, which is January 1, 1970.
There's a lot of historical cruft around this, but it's not uncommon to see systems storing the date of a transaction in seconds or milliseconds since this date. So if you see large numbers where you expect to see a date and time, you'll need to convert them to make sense of the data.

In Python, you can get the current time since the epoch using the time module:

From there, you can create a datetime object using the "fromtimestamp" function of the datetime class:

When we print this value out, we see that the year, month, day, and so forth are also printed out:

The datetime object has handy attributes to get the representative hour, day, seconds, etc.
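The steps above can be sketched with the standard library alone. The variable names "now" and "dtnow" and the module aliases are illustrative choices:

```python
import datetime as dt
import time as tm

# Seconds elapsed since the epoch (January 1, 1970), as a float.
now = tm.time()

# Convert the raw number into a datetime object in the local time zone.
dtnow = dt.datetime.fromtimestamp(now)

print(now)
print(dtnow)

# Individual components are available as attributes.
print(dtnow.year, dtnow.month, dtnow.day, dtnow.hour, dtnow.minute, dtnow.second)
```

If a system stores milliseconds instead of seconds, divide the value by 1000 before passing it to "fromtimestamp".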

3) Querying a DataFrame

Before we talk about how to query data frames, we need to talk about Boolean masking. Boolean masking is the heart of fast and efficient querying in NumPy. It's a bit analogous to masking used in other computational areas.

A Boolean mask is an array which can be of one dimension like a series, or two dimensions like a data frame, where each of the values in the array is either true or false. This array is essentially overlaid on top of the data structure that we're querying. Any cell aligned with a true value will be admitted into our final result, and any cell aligned with a false value will not.

Boolean masks are created by applying operators directly to the pandas series or DataFrame objects. For instance, in our Olympics data set, you might be interested in seeing only those countries who have achieved a gold medal at the summer Olympics.
To build a Boolean mask for this query, we project the gold column using the indexing operator and apply the greater than operator with a comparison value of zero. This is essentially broadcasting a comparison operator, greater than, with the results being returned as a Boolean series. The resultant series is indexed where the value of each cell is either true or false depending on whether a country has won at least one gold medal, and the index is the country name:

So this builds us the Boolean mask, which is half the battle. What we want to do next is overlay that mask on the data frame. We can do this using the where function. The where function takes a Boolean mask as a condition, applies it to the data frame or series, and returns a new data frame or series of the same shape. Let's apply this Boolean mask to our Olympics data and create a data frame of only those countries who have won a gold at a summer games:
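Since the Olympics data set itself is not reproduced here, the sketch below uses a small made-up DataFrame with hypothetical country names and medal counts. It builds the mask on the "Gold" column and then overlays it with "where":

```python
import pandas as pd

# Hypothetical stand-in for the Olympics data: one row per country.
df = pd.DataFrame(
    {"Gold": [0, 5, 0, 18], "Silver": [0, 2, 2, 24], "Bronze": [2, 8, 0, 28]},
    index=["Country A", "Country B", "Country C", "Country D"],
)

# Broadcasting the comparison gives a Boolean Series indexed by country.
gold_mask = df["Gold"] > 0
print(gold_mask)

# where() keeps the original shape; rows that fail the mask become NaN.
only_gold = df.where(gold_mask)
print(only_gold)

# Dropping the NaN rows leaves only the countries with at least one gold.
only_gold_clean = only_gold.dropna()

# Indexing with the mask directly filters the rows in one step.
same_result = df[gold_mask]
print(same_result)
```

Because "where" preserves the shape and fills non-matching rows with NaN, the integer columns come back as floats; indexing with the mask directly avoids that.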

4) Handling Missing Values (Brief intro to Data Cleansing)

Underneath, Pandas does some type conversion. If I create a list of strings and one element is of the "None" type, Pandas inserts it as None and uses the object type for the underlying array.
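A tiny sketch of that conversion, using made-up values. A numeric list is added as a contrast, because there the missing element is stored as NaN in a float array instead:

```python
import pandas as pd

# Strings plus None: the missing element is kept as a missing value.
# (The text describes an object array; newer pandas versions may pick
# a dedicated string dtype instead.)
animals = pd.Series(["Tiger", "Bear", None])
print(animals)

# Numbers plus None: pandas converts None to NaN and the dtype to float64.
numbers = pd.Series([1, 2, None])
print(numbers)

# isna() flags the missing entries in both cases.
print(animals.isna().tolist(), numbers.isna().tolist())
```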

For further examples, I'm going to load the "log.csv" file:

In this data the first column is a timestamp in the Unix epoch format. The next column is the user name followed by a web page they're visiting and the video that they're playing.

Each row of the DataFrame has a playback position. And we can see that as the playback position increases by one, the time stamp increases by about 30 seconds.

Except for user Bob. It turns out that Bob has paused his playback so as time increases the playback position doesn't change. Note too how difficult it is for us to try and derive this knowledge from the data, because it's not sorted by time stamp as one might expect. This is actually not uncommon on systems which have a high degree of parallelism.

There are a lot of missing values in the paused and volume columns. It's not efficient to send this information across the network if it hasn't changed. So this particular system just inserts null values into the database if there's no changes.

One of the handy functions that Pandas has for working with missing values is the filling function, "fillna".
This function takes a number of parameters. For instance, you could pass in a single value, called a scalar value, to change all of the missing data to one value. This isn't really applicable in this case, but it's a pretty common use case. Next up is the method parameter. The two common fill methods are ffill and bfill. ffill is for forward filling, and it updates an NA value for a particular cell with the value from the previous row. It's important to note that your data needs to be sorted in order for this to have the effect you might want. Data that comes from traditional database management systems usually has no order guarantee, just like this data. So be careful.

In Pandas we can sort either by index or by values. Here we'll just promote the time stamp to an index then sort on the index.

If we look closely at the output though we'll notice that the index isn't really unique. Two users seem to be able to use the system at the same time. Again, a very common case.

Let's reset the index, and use some multi-level indexing instead, and promote the user name to a second level of the index to deal with that issue.

Now that we have the data indexed and sorted appropriately, we can fill the missing data using ffill.
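The "log.csv" file is not included here, so this sketch uses a tiny hypothetical log with the same idea: a "time" column, a "user" column and a "volume" column with gaps. It promotes both columns to a multi-level index, sorts, and forward fills:

```python
import numpy as np
import pandas as pd

# Hypothetical, unsorted log: two users, two timestamps each.
df = pd.DataFrame({
    "time": [1469974454, 1469974424, 1469974424, 1469974454],
    "user": ["cheryl", "cheryl", "bob", "bob"],
    "volume": [np.nan, 10.0, 5.0, np.nan],
})

# Promote time and user to a two-level index and sort on it.
df = df.set_index(["time", "user"]).sort_index()

# Forward fill: each missing cell takes the value from the previous row.
filled = df.ffill()
print(filled)

# Assumption beyond the text: filling per user, so that one user's
# value is never carried into another user's row.
per_user = df.groupby(level="user").ffill()
print(per_user)
```

In this small example the plain forward fill gives bob's later row cheryl's volume, because cheryl's row sits directly above it after sorting. Grouping by user before filling is one way to avoid that, if that behaviour matters for your data.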