<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Limarc Ambalina</title>
    <description>The latest articles on DEV Community by Limarc Ambalina (@otakuhacks).</description>
    <link>https://dev.to/otakuhacks</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F468631%2Fe908960f-9f00-4779-9ecf-605d8c2e8258.png</url>
      <title>DEV Community: Limarc Ambalina</title>
      <link>https://dev.to/otakuhacks</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/otakuhacks"/>
    <language>en</language>
    <item>
      <title>What is a Conditional Generative Adversarial Network?</title>
      <dc:creator>Limarc Ambalina</dc:creator>
      <pubDate>Thu, 16 Nov 2023 03:46:47 +0000</pubDate>
      <link>https://dev.to/otakuhacks/what-is-a-conditional-generative-adversarial-network-i64</link>
      <guid>https://dev.to/otakuhacks/what-is-a-conditional-generative-adversarial-network-i64</guid>
      <description>&lt;p&gt;The rise of Generative Artificial Intelligence (GenAI) has introduced innovative services and cutting-edge tools to automate tasks, optimize processes, and speed up transactions. These benefits make it more enticing for businesses to deploy AI services for their expansion and growth strategies.&lt;/p&gt;

&lt;p&gt;One important technological breakthrough that has made this growth possible is the conditional generative adversarial network (CGAN).&lt;/p&gt;

&lt;h2&gt;
  What are Generative Adversarial Networks?
&lt;/h2&gt;

&lt;p&gt;Before diving in, we first need to explain the “GAN” in CGAN.&lt;/p&gt;

&lt;p&gt;The CGAN is a type of generative adversarial network (GAN), which is now a well-known structure in the field of machine learning, more specifically, deep learning.&lt;/p&gt;

&lt;p&gt;The concept behind the GAN is like a game between two adversarial neural networks or players. Player one is called the "generator." The generator’s role is to create or generate fake data and items – in many cases, these are images – that look as real as possible. It aims to trick the second player.&lt;/p&gt;

&lt;p&gt;Player two, on the other hand, is known as the “discriminator.” Its job is to determine which images are real (from a database/sample) and which are fake (made by the generator). If the discriminator gets it right, it gets good feedback. If it’s wrong, it gets bad feedback.&lt;/p&gt;

&lt;p&gt;Both of these players learn and improve over time. The generator gets better at creating convincing fakes, and the discriminator improves its ability to tell if something is genuine. Over time, the network reaches a point where the generator-produced data will look almost indistinguishable from real-world data.&lt;/p&gt;

&lt;h2&gt;
  
  
  How is a GAN Trained?
&lt;/h2&gt;

&lt;p&gt;In a strict sense, &lt;a href="https://developers.google.com/machine-learning/gan/training"&gt;GANs are considered an unsupervised learning method&lt;/a&gt; because they can learn from unlabeled data. However, during the training process, labels are used internally to guide the learning of the discriminator ("real" or "fake"). For each training iteration, the discriminator receives two kinds of inputs—real data with a "real" label, and generated data from the generator with a "fake" label.&lt;/p&gt;

&lt;p&gt;When the discriminator is being trained, it is given these correctly labeled instances, and its goal is to classify them correctly. So, it learns how to distinguish between the "real" and "fake" data, and the correctness of its judgment is checked against these predetermined labels.&lt;/p&gt;

&lt;p&gt;Meanwhile, when the generator is being trained, it aims to produce data that the discriminator will classify as "real." The discriminator's judgment is used to train the generator in this phase. If the discriminator is fooled, the generator has succeeded in producing sufficiently realistic data; either way, the discriminator's verdict provides the feedback signal the generator learns from.&lt;/p&gt;
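
&lt;p&gt;To make this labeling scheme concrete, here is a minimal NumPy sketch of the loss computation in one training step. The 1-D “data” and the fixed toy discriminator are illustrative stand-ins, not a real GAN; in practice both players are neural networks updated by gradient descent:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bce(predictions, labels):
    # Binary cross-entropy: small when predictions match the 0/1 labels.
    eps = 1e-7
    p = np.clip(predictions, eps, 1 - eps)
    return -np.mean(labels * np.log(p) + (1 - labels) * np.log(1 - p))

def discriminator(x):
    # Toy 1-D discriminator: estimated probability that x is real.
    return sigmoid(x - 2.0)

real_batch = rng.normal(loc=4.0, scale=0.5, size=8)  # samples of "real" data
fake_batch = rng.normal(loc=0.0, scale=0.5, size=8)  # untrained generator output

# Discriminator phase: real data is labeled 1 ("real"), generated data 0 ("fake").
d_loss = bce(discriminator(real_batch), np.ones(8)) + bce(discriminator(fake_batch), np.zeros(8))

# Generator phase: the generator wants its fakes classified as real,
# so its loss scores the fake batch against the label 1.
g_loss = bce(discriminator(fake_batch), np.ones(8))

print(d_loss, g_loss)
```

Here the generator's loss comes out high because the toy discriminator easily spots the fakes; training would push the generator's samples toward the real distribution until the two losses balance.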

&lt;p&gt;However, the ultimate check on whether a GAN has been successfully trained can't be fully automated. A human evaluator usually reviews the generator's output to ensure its quality, and what is evaluated depends on the use case. For example, if the GAN is used to generate images, humans would judge the quality of those images; if it is used to generate text, the output would be assessed for coherence, relevance, and realism.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is a CGAN?
&lt;/h2&gt;

&lt;p&gt;CGANs, short for Conditional Generative Adversarial Networks, guide the data creation process by incorporating specific parameters or labels into the GAN.&lt;/p&gt;

&lt;p&gt;Both adversarial networks—the generator and the discriminator—consider these parameters when producing their output. With this input, the generator creates faux data that imitates real data and adheres to the set condition. And just like in the regular GAN model, the discriminator will distinguish between the forged data produced by the generator and the genuine data corresponding to the given condition.&lt;/p&gt;
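
&lt;p&gt;A common way to implement this conditioning (an illustrative choice; architectures vary) is to concatenate a one-hot label vector onto each network’s input. A minimal NumPy sketch, with dimensions loosely modeled on MNIST:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)

num_classes = 10   # e.g. the ten digit classes of MNIST
noise_dim = 100
sample_dim = 784   # e.g. a flattened 28x28 image
batch_size = 4

def one_hot(labels, num_classes):
    out = np.zeros((labels.size, num_classes))
    out[np.arange(labels.size), labels] = 1.0
    return out

labels = rng.integers(0, num_classes, size=batch_size)
condition = one_hot(labels, num_classes)

# Generator input: random noise with the condition appended, so the
# generator knows which class of sample it is asked to produce.
noise = rng.normal(size=(batch_size, noise_dim))
generator_input = np.concatenate([noise, condition], axis=1)

# Discriminator input: a (real or generated) sample with the same condition
# appended, so "real" means "realistic AND matching the given label".
samples = rng.normal(size=(batch_size, sample_dim))
discriminator_input = np.concatenate([samples, condition], axis=1)

print(generator_input.shape, discriminator_input.shape)  # (4, 110) (4, 794)
```

The generator thus learns to map (noise, label) pairs to samples of the requested class, while the discriminator judges whether a sample is both realistic and consistent with its label.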

&lt;p&gt;With the conditional aspect included, CGANs can produce precise, highly specific data for tasks that require bespoke results. This control over the kind of data generated allows businesses to cater to their unique needs, making CGANs a versatile tool for data creation and augmentation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real-World Applications of CGAN
&lt;/h2&gt;

&lt;p&gt;Here are some innovative applications and use cases of CGANs, demonstrating this AI model's groundbreaking adaptation capabilities:&lt;/p&gt;

&lt;h3&gt;
  GauGAN
&lt;/h3&gt;

&lt;p&gt;Introduced by NVIDIA, GauGAN converts segmented sketches into highly realistic images in line with the specific conditions the user sets. For example, GauGAN will fill a sketch of a tree with leaves, branches, or any other details associated with trees. This technology utilizes a variant of CGANs called spatially-adaptive normalization, which applies the input condition in each layer of the generator to control the synthesis of the output image at a much more detailed level. This technology is a compelling tool in architecture, urban planning, and video game design sectors.&lt;/p&gt;

&lt;h3&gt;
  Pix2Pix
&lt;/h3&gt;

&lt;p&gt;Developed by researchers at the University of California, this image-to-image translation tool utilizes a machine-learning algorithm based on the CGAN structure to transform one image into another. Pix2Pix takes an input image, such as a sketch or an abstract depiction, and transforms it into a more elaborate or realistic image. A common example is adding colors to an originally grayscale image or turning a sketch into a photorealistic image. This technology has the potential to be exceedingly beneficial in sectors requiring detailed visualizations from simple frameworks, such as architectural planning, product design, and various aspects of digital media and marketing.&lt;/p&gt;

&lt;h3&gt;
  StackGAN
&lt;/h3&gt;

&lt;p&gt;StackGAN is a text-to-image translation model that uses CGANs to generate realistic images from textual descriptions in two stages. In the first stage, the model generates a low-resolution image from the text description, which serves as the condition. In the second stage, it takes that low-resolution image together with the same text condition and produces a high-resolution image. This division of labor between the stages lets the network handle complex shapes and fine-grained details better than a single-stage process could, addressing the challenge of producing detailed images of different objects from random noise and a text description.&lt;/p&gt;

&lt;p&gt;These examples show how these innovative networks are instrumental across numerous business functions.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is a DCGAN?
&lt;/h2&gt;

&lt;p&gt;Deep Convolutional Generative Adversarial Networks (DCGAN) improve how GANs process visual data by incorporating convolutional layers in both the generator and discriminator sections, leading to the generation of high-definition and superior-quality images. A convolutional layer works as a filter, aiding the generator in crafting progressively intricate visual data to outsmart the discriminator. Conversely, this filter simplifies incoming images, assisting the discriminator in distinguishing more effectively between genuine and fabricated images.&lt;/p&gt;
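
&lt;p&gt;To illustrate the “filter” idea, here is a self-contained NumPy sketch of a 2-D convolution with a hand-coded vertical-edge kernel. In a real DCGAN the kernel weights are learned rather than hand-written, and deep-learning frameworks provide much faster implementations:&lt;/p&gt;

```python
import numpy as np

def conv2d(image, kernel):
    # "Valid" 2-D convolution: slide the kernel over the image and
    # take a weighted sum of the pixels at each position.
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A tiny image with a vertical edge: dark on the left, bright on the right.
image = np.zeros((5, 5))
image[:, 3:] = 1.0

# A vertical-edge filter, the kind of feature detector a DCGAN
# discriminator learns in its convolutional layers.
kernel = np.array([[-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0]])

response = conv2d(image, kernel)
print(response)
```

The response is zero over flat regions and large where the kernel straddles the edge: exactly the kind of localized feature a convolutional discriminator uses to tell real images from fakes.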

&lt;h2&gt;
  
  
  Comparing CGANs and DCGANs
&lt;/h2&gt;

&lt;p&gt;Both CGANs and DCGANs are based on the original GAN architecture and share several characteristics.&lt;/p&gt;

&lt;h3&gt;
  Basic Structure
&lt;/h3&gt;

&lt;p&gt;CGANs and DCGANs retain the fundamental GAN structure, consisting of a generator and a discriminator interacting in a constant, competitive loop.&lt;/p&gt;

&lt;h3&gt;
  Mode of Operation
&lt;/h3&gt;

&lt;p&gt;Both types utilize the unique adversarial learning process, in which the generator and discriminator constantly learn from each other and improve over time to outdo the other.&lt;/p&gt;

&lt;h3&gt;
  Data Generation
&lt;/h3&gt;

&lt;p&gt;The two models can generate new and synthetic information that closely mimics the real world, reframing the existing boundaries of data limitations.&lt;/p&gt;

&lt;h3&gt;
  Unsupervised Learning
&lt;/h3&gt;

&lt;p&gt;They both fall under unsupervised learning, meaning they can automatically learn and discover patterns in the input data without externally provided labels (the "real"/"fake" labels used during training are generated internally).&lt;/p&gt;

&lt;h3&gt;
  Deep Learning Models
&lt;/h3&gt;

&lt;p&gt;Both variations leverage deep learning techniques to handle data. They use multiple layers of artificial neural networks to learn from data, extract relevant features, and generate believable outputs.&lt;/p&gt;

&lt;p&gt;But while they share the core GAN structure, CGANs and DCGANs differ in specifications and functionalities due to the unique alterations introduced in their architecture.&lt;/p&gt;

&lt;h3&gt;
  Input and Control
&lt;/h3&gt;

&lt;p&gt;The main distinction between CGANs and DCGANs lies in their input method. CGANs receive conditions or labels alongside random noise as inputs, offering control over the generated data type. DCGANs, on the other hand, cannot accommodate explicit conditions and rely purely on random noise for data production. It is worth noting that these ideas can be combined. &lt;/p&gt;

&lt;p&gt;A Conditional DCGAN would use convolutional layers, like a DCGAN, and also take a conditional input, like a CGAN. This would enable the controlled generation of complex data, such as images.&lt;/p&gt;
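
&lt;p&gt;One common way to feed the condition into a convolutional discriminator (an illustrative assumption; other conditioning schemes exist, such as projection or conditional batch normalization) is to broadcast the one-hot label into constant feature maps and stack them onto the image as extra input channels:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)

num_classes = 10
height, width = 28, 28

label = 3
one_hot = np.zeros(num_classes)
one_hot[label] = 1.0

# Broadcast the label into num_classes constant feature maps: the map for
# the chosen class is all ones, the others are all zeros.
label_maps = np.ones((num_classes, height, width)) * one_hot[:, None, None]

image = rng.normal(size=(1, height, width))  # a 1-channel (grayscale) image

# Stack image and label maps along the channel axis before the first conv layer.
conditioned_input = np.concatenate([image, label_maps], axis=0)
print(conditioned_input.shape)  # (11, 28, 28): 1 image channel + 10 label channels
```

The convolutional layers then see the label at every spatial location; the generator side can be conditioned analogously by concatenating the label onto its noise vector.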

&lt;h3&gt;
  Network Architecture
&lt;/h3&gt;

&lt;p&gt;CGANs have a flexible architecture that can be built from various types of neural networks depending on the task. DCGANs, by contrast, prescribe a specific convolutional architecture designed for generating highly detailed images.&lt;/p&gt;

&lt;h3&gt;
  Specificity vs. Detail
&lt;/h3&gt;

&lt;p&gt;Given conditional inputs, CGANs are proficient at creating specific data types tailored to a particular requirement. While DCGANs may lack specificity, they can produce more detailed, high-resolution images.&lt;/p&gt;

&lt;h3&gt;
  Training Stability
&lt;/h3&gt;

&lt;p&gt;Although CGANs have been successful, they are less recognized for training stability than DCGANs, whose architecture incorporates distinct stabilizing practices such as batch normalization.&lt;/p&gt;

&lt;h3&gt;
  Use Cases
&lt;/h3&gt;

&lt;p&gt;These two adversarial networks cater to unique use cases due to their differences. CGANs are well-suited to specific data creation and translation, while DCGANs are more apt for generating detailed images.&lt;/p&gt;

&lt;p&gt;With abundant variations from CGANs to DCGANs, the diversity in generative adversarial networks ensures businesses can source a machine-learning model tailored to their unique organizational demands and prerequisites.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;In conclusion, Generative Adversarial Networks (GANs), and their derived variants, Conditional Generative Adversarial Networks (CGANs) and Deep Convolutional Generative Adversarial Networks (DCGANs), are unlocking a variety of innovative applications in the realm of artificial intelligence. &lt;/p&gt;

&lt;p&gt;The unique adversarial learning system, consisting of a generator and a discriminator, allows for the automated creation of synthetic data that closely mimics real-world instances. While the base structure, mode of operation, and learning models remain similar across these variations, subtle changes to inputs and architecture make a distinct difference in their functionality. &lt;/p&gt;

&lt;p&gt;CGANs allow more control over generated data using conditional variables, making them well-suited for tailored data creation. &lt;/p&gt;

&lt;p&gt;DCGANs, on the other hand, specialize in creating high-definition, detailed data, particularly in image generation. &lt;/p&gt;

&lt;p&gt;In today's age of rapid digital transformation, adopting GANs, CGANs, and DCGANs provides businesses with cutting-edge tools to drive innovation, streamline processes, and craft unique solutions tailored to their requirements. As we continue to explore and enhance these networks, they are bound to revolutionize the technological landscape and redefine the boundaries of what AI can accomplish.&lt;/p&gt;

&lt;h2&gt;
  References
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://arxiv.org/pdf/1411.1784.pdf"&gt;Conditional Generative Adversarial Nets&lt;/a&gt;&lt;br&gt;
Conditional GAN (CGAN) in PyTorch and TensorFlow&lt;/p&gt;

&lt;p&gt;Also published on TaskUs: &lt;a href="https://www.taskus.com/insights/cgans-101-what-is-a-conditional-generative-adversarial-network/"&gt;Cgans 101&lt;/a&gt;&lt;/p&gt;

</description>
      <category>gan</category>
      <category>ai</category>
      <category>machinelearning</category>
      <category>computervision</category>
    </item>
    <item>
      <title>Coding Apps For Kids that Help Gamify Programming</title>
      <dc:creator>Limarc Ambalina</dc:creator>
      <pubDate>Mon, 08 Feb 2021 18:02:45 +0000</pubDate>
      <link>https://dev.to/otakuhacks/coding-apps-for-kids-that-help-gamify-programming-41pj</link>
      <guid>https://dev.to/otakuhacks/coding-apps-for-kids-that-help-gamify-programming-41pj</guid>
      <description>&lt;p&gt;Most parents fail to provide their children with proper coding education because of a lack of coding courses in elementary school. Many parents don't have the ability to teach coding themselves. If you want your kids to stay up to date with modern education and are looking for a way to teach your kids the basics of computer programming and coding, check out some of these best coding apps for kids.&lt;/p&gt;

&lt;p&gt;These apps are available either on the web or on the Apple iPad.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Gamestar Mechanic
&lt;/h2&gt;

&lt;p&gt;Gamestar Mechanic caters to kids between the ages of 7 and 14 and helps them design their own video games. It offers interactive, self-paced quests for students to complete while they learn to build game levels. Using this site, children can develop critical thinking and problem-solving skills.&lt;/p&gt;

&lt;p&gt;Platform: Web&lt;br&gt;
Cost: $2 per student&lt;/p&gt;

&lt;h2&gt;
  
  
  2. The Scratch Coding App for Kids
&lt;/h2&gt;

&lt;p&gt;Scratch is specifically designed to introduce young minds to the world of computer programming. It was developed by MIT students and staff in 2003 and has since been a go-to program for teaching kids the fundamentals of coding.&lt;/p&gt;

&lt;p&gt;Scratch offers a visual programming language made up of blocks that are dragged and dropped to build a program. Putting various blocks together creates loops and variables, initiates interactivity, plays sounds, and more. You don’t have to be a programming genius to learn and understand Scratch; it is designed to be easy to pick up.&lt;/p&gt;

&lt;p&gt;Platform: Web&lt;br&gt;
Cost: Free&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Tynker
&lt;/h2&gt;

&lt;p&gt;Tynker is a relative newcomer in the sea of coding platforms but has gained enough traction to be called one of the best coding apps for kids. Its interface is similar to Scratch's, but with entirely different intentions.&lt;/p&gt;

&lt;p&gt;While Scratch was designed to help kids program, Tynker aims to teach programming. Tynker consists of various lesson plans, management tools, and a showcase of programs that students have created. It is easy to follow and lets even a complete beginner progress without additional help.&lt;/p&gt;

&lt;p&gt;Platform: Web&lt;br&gt;
Cost: Free (with Premium upgrade option)&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Move The Turtle: A Coding Game for Kids
&lt;/h2&gt;

&lt;p&gt;Move The Turtle is a gamified program that teaches the basics of programming. Its story arc, with characters and plot lines to follow, keeps kids engaged as they learn the fundamentals of coding.&lt;/p&gt;

&lt;p&gt;With each new level, the difficulty rises and kids are taught new commands that make the Turtle move, make a sound, draw a line, etc. There is also a free play mode called “Compose” which allows kids to move the Turtle as they like.&lt;/p&gt;

&lt;p&gt;Platform: iPad&lt;br&gt;
Cost: $2.99&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Hopscotch
&lt;/h2&gt;

&lt;p&gt;Hopscotch is another visual coding program, similar to Scratch and Tynker, with a drag-and-drop interface. The app only runs on iPads.&lt;/p&gt;

&lt;p&gt;While its characters and mechanics are not as polished or extensive, Hopscotch remains one of the best coding apps for kids, teaching the basics of computer programming, logical thinking, and problem-solving.&lt;/p&gt;

&lt;p&gt;Platform: iPad&lt;br&gt;
Cost: Free&lt;/p&gt;

&lt;h2&gt;
  
  
  6. Daisy The Dinosaur
&lt;/h2&gt;

&lt;p&gt;Daisy The Dinosaur comes from the same camp as Hopscotch but is aimed at younger students. The app consists of a dinosaur that can be moved using commands.&lt;/p&gt;

&lt;p&gt;Since the app is limited to performing basic programming functions, it is a great introduction to coding classes for kids who are just starting to learn to code.&lt;/p&gt;

&lt;p&gt;Platform: iPad&lt;br&gt;
Cost: Free&lt;/p&gt;

&lt;h2&gt;
  
  
  7. The Cargo-Bot Coding App for Kids
&lt;/h2&gt;

&lt;p&gt;Cargo-Bot is a game-based app that teaches computer programming to kids. Each level consists of coloured crates that kids move by programming a claw crane to shift left or right and to pick up and drop crates.&lt;/p&gt;

&lt;p&gt;Designed for iPads, the app was developed using a touch-based app called Codea, which is based on the programming language Lua. &lt;/p&gt;

&lt;p&gt;We hope this list of the best coding apps for kids will help you guide your children along their path to becoming a computer programmer. With the world becoming more and more digitized each day, we wouldn't be surprised if coding became a mandatory class in schools alongside mathematics and science.&lt;/p&gt;

&lt;p&gt;Get ahead of the game by instilling in your children a love of programming from an early age!&lt;/p&gt;

&lt;p&gt;Originally published at: &lt;a href="https://hackernoon.com/7-best-coding-apps-for-kids-that-help-gamify-programming-2q6r3z27"&gt;7 Best Coding Apps For Kids that Help Gamify Programming&lt;/a&gt;&lt;/p&gt;

</description>
      <category>codenewbie</category>
      <category>programming</category>
    </item>
    <item>
      <title>Business Development Tools in 2021</title>
      <dc:creator>Limarc Ambalina</dc:creator>
      <pubDate>Sun, 03 Jan 2021 07:57:18 +0000</pubDate>
      <link>https://dev.to/otakuhacks/business-development-tools-in-2021-2605</link>
      <guid>https://dev.to/otakuhacks/business-development-tools-in-2021-2605</guid>
      <description>&lt;p&gt;Slack, Rocketbolt, and Trello are among the best business development tools in 2020. In this article, we'll go over 10 tools your business should consider, especially if a part of your team is working remotely.&lt;/p&gt;

&lt;p&gt;Whether you have a new startup or an established business, top business development tools are the need of the hour to manage the flurry of activities and keep your team productive.&lt;/p&gt;

&lt;p&gt;From relationship-building to managing pitches, reporting, and prospecting, technology is empowering us with the latest innovations and business development tools to smooth out our workflows, ease excess workload, and enhance employee productivity.&lt;/p&gt;

&lt;p&gt;According to recent surveys covering social media monitoring, CRM, research services, SEO, list-building software, social media, prospecting, marketing automation, and project management, the following are the top benefits these tools provide.&lt;/p&gt;

&lt;h2&gt;
  
  
  4 Primary Benefits of Business Development Tools
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Better Integration of Data - Smart business software brings all activities together under one roof.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Simple, Practical CRM - An easy-to-learn CRM with a user-friendly interface, well integrated with marketing automation platforms, is a definite priority for business managers.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Improved Tracking - Tools that integrate well with other software keep data accurate and up to date, making lead and project tracking far easier.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Smart Reporting Tools - Business managers are always on the lookout for efficient reporting on key metrics for quick and easy workflow management. Each tool comes with specialty features suited to particular organizations, so it is crucial to analyze your business needs before picking the tool that best fits them.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Below are my picks for the top 10 business development tools to help your business survive cut-throat competition.&lt;/p&gt;

&lt;h3&gt;
  1. Twitter
&lt;/h3&gt;

&lt;p&gt;Twitter is strongly emerging as one of the most important business development tools, with the special ability to provide real-time information.&lt;/p&gt;

&lt;p&gt;It is an American microblogging and social networking service founded in 2006. Twitter facilitates public conversation, allowing you to send &amp;amp; receive posts called tweets. It is often seen as common ground for the whole world to connect, share, learn, and solve problems.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. LinkedIn
&lt;/h3&gt;

&lt;p&gt;LinkedIn is an American business- and employment-oriented platform founded in 2003. Operating via its website and mobile apps, it has carved out a niche connecting business professionals all over the world. LinkedIn is mainly used for professional networking, where employers post jobs and job seekers post their CVs.&lt;/p&gt;

&lt;p&gt;Mainly, LinkedIn offers an online professional network, job and people search, an address book, advertising, company search, professional identity, and group collaboration.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Join.me
&lt;/h3&gt;

&lt;p&gt;Join.me is cloud-based business development SaaS that specializes in collaboration, mobility, screen sharing, collaborative whiteboards, connectivity, and meetings.&lt;/p&gt;

&lt;p&gt;It is considered one of the essential business development tools for remote teams with budget constraints, allowing instant online meetings and effortless task sharing.&lt;/p&gt;

&lt;p&gt;Coming from the LogMeIn suite of services, Join.me is a big success among small businesses. It has an easy-to-use, modern interface with plenty of features to enhance your online meetings.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Slack
&lt;/h3&gt;

&lt;p&gt;Slack offers connectivity, group work, mobility to remote teams staying continents apart. It is one of the most useful business development tools for sharing files, group collaborations, and connecting on calls.&lt;/p&gt;

&lt;p&gt;This tool offers IRC-style features including persistent chat rooms organized by private groups, topics, and direct messaging. It is one of the simplest applications to connect with your team and work together.&lt;/p&gt;

&lt;h3&gt;
  5. RocketBolt
&lt;/h3&gt;

&lt;p&gt;RocketBolt is one of the key business development tools for lead activation, conversion optimization, engagement, SaaS, B2B, email productivity, and lead tracking. It seamlessly integrates with existing marketing and sales workflows to record all lead activity.&lt;/p&gt;

&lt;p&gt;RocketBolt is a simple yet powerful lead tracking tool that saves time on researching and monitoring leads.&lt;/p&gt;

&lt;p&gt;This trendy app manages your emails and makes them easy to read. It can also be added to any website in minutes to drive more sales &amp;amp; social media engagement without any additional maintenance.&lt;/p&gt;

&lt;p&gt;Overall, it provides a great UX/UI with clean features in a super easy-to-use application!&lt;/p&gt;

&lt;h3&gt;
  
  
  6. Trello
&lt;/h3&gt;

&lt;p&gt;Trello is a smart task management application with special features for organization, collaboration, software, and projects for teams to work more collaboratively.&lt;/p&gt;

&lt;p&gt;Trello’s boards, cards, and lists enable teams to organize &amp;amp; prioritize projects in a fun, flexible way.&lt;/p&gt;

&lt;p&gt;This business development tool gives a visual overview of what is being done and who is working on what, making a systematic approach to track the project and contribute where it is most needed.&lt;/p&gt;

&lt;h3&gt;
  
  
  7. Google Drive
&lt;/h3&gt;

&lt;p&gt;Launched by Google in 2012, Google Drive is a file storage and synchronization service that allows users to store files on Google's servers, share files, and synchronize them across devices.&lt;/p&gt;

&lt;p&gt;Google Drive is a specialty business development tool for online storage and file sharing. It is one of the best free cloud-based storage services, giving users 15 GB of free space to back up important files and access them online.&lt;/p&gt;

&lt;p&gt;It allows simultaneous data sharing with the whole team preventing the hassle of sending emails separately.&lt;/p&gt;

&lt;h3&gt;
  
  
  8. Prezi
&lt;/h3&gt;

&lt;p&gt;Prezi is a cloud-based presentation platform launched in 2009 by a Hungarian presentation software company. This business development tool helps you connect more powerfully with the audience.&lt;/p&gt;

&lt;p&gt;Prezi is growing its base strongly with each passing year and has more than 150 million users who have created more than 400 million presentations worldwide.&lt;/p&gt;

&lt;p&gt;Its open canvas allows you to navigate freely through topics, encouraging easy interaction and collaboration between you and your viewers, resulting in conversational presentations that are more natural, engaging, and memorable.&lt;/p&gt;

&lt;h3&gt;
  
  
  9. Canva
&lt;/h3&gt;

&lt;p&gt;Canva has successfully established itself as an efficient business development tool for graphic design. It allows users to create social media graphics, posters, presentations, documents, and other useful visual content.&lt;/p&gt;

&lt;p&gt;With Canva, you can design anything from business cards to consultant-worthy presentations and proposals in a matter of minutes. It is built intuitively for design efficiency: users can choose from many professionally designed templates, edit the designs, and upload their own photos through a drag-and-drop interface.&lt;/p&gt;

&lt;p&gt;You start by selecting the type of project you want to build, like a presentation, letter, or Facebook cover. Then you can browse through the provided templates, pick one, customize it, and you are ready with a stunning presentation to send your prospects.&lt;/p&gt;

&lt;p&gt;And you do not have to be an expert designer to make these presentations; anybody with an interest in design and a keen eye for detail can use this application with ease and confidence.&lt;/p&gt;

&lt;h3&gt;
  
  
  10. Fileboard
&lt;/h3&gt;

&lt;p&gt;Fileboard specializes in online meetings, screen sharing, tracking analytics, and sales presentations.&lt;/p&gt;

&lt;p&gt;This business development tool is AI-enabled software that gives sales professionals insights into how prospects engage with their materials, automates routine tasks, and provides a suite of tools to enhance productivity.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrap Up
&lt;/h2&gt;

&lt;p&gt;There are endless business development tools out there, each suited to the needs of a particular organization.&lt;/p&gt;

&lt;p&gt;We have presented the top 10 tools that are useful for almost any organization's work structure.&lt;/p&gt;

&lt;p&gt;You surely want the best business development tools and software to gain an edge over your competition. The ease of work these tools provide is unmatched.&lt;/p&gt;

&lt;p&gt;We recommend selecting business tools only after thoroughly scrutinizing their pricing plans and feature benefits, to extract maximum value from them.&lt;/p&gt;

&lt;p&gt;This article was originally published on &lt;a href="https://hackernoon.com/top-10-business-development-tools-in-2020-49583zmf"&gt;Hacker Noon&lt;/a&gt;&lt;/p&gt;

</description>
      <category>tooling</category>
      <category>devops</category>
    </item>
    <item>
      <title>Text Annotation Tools for Machine Learning Projects</title>
      <dc:creator>Limarc Ambalina</dc:creator>
      <pubDate>Mon, 14 Dec 2020 07:52:49 +0000</pubDate>
      <link>https://dev.to/otakuhacks/text-annotation-tools-for-machine-learning-projects-2h87</link>
      <guid>https://dev.to/otakuhacks/text-annotation-tools-for-machine-learning-projects-2h87</guid>
      <description>&lt;p&gt;From search engines and sentiment analysis to virtual assistants and chatbots, there are numerous areas of research within machine learning that require text annotation tools and services.&lt;/p&gt;

&lt;p&gt;In the AI research and development industries, annotated data is gold, and large quantities of high-quality annotated data are a goldmine. However, finding or creating this data can be an expensive and arduous task for your team. Fortunately, a variety of text annotation tools and services can provide you with the data you need. Some of these services include entity extraction, part-of-speech tagging, and sentiment analysis.&lt;/p&gt;

&lt;h2&gt;
  What are the Best Text Annotation Tools and Services?
&lt;/h2&gt;

&lt;p&gt;Read on below to find out which text annotation service or tool is best for your project.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Tagtog
&lt;/h3&gt;

&lt;p&gt;Based in Poland, Tagtog is a text annotation tool that can be used to annotate text either automatically or manually. Tagtog supports native PDF annotation and includes pre-trained NER models for automatic text annotation. On top of the tool itself, the company has a network of expert workers from various fields who can annotate specialized texts.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. LightTag
&lt;/h3&gt;

&lt;p&gt;The LightTag text annotation tool is a platform for annotators and companies to label their text data in house. While the starter package is free, each package level rises in cost and comes with a monthly annotation limit, starting at 1,000 annotations a month.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Lionbridge AI
&lt;/h3&gt;

&lt;p&gt;With a specialization in linguistics, Lionbridge has a community of 1 million annotators fluent in over 300 languages. Some of their text annotation services include text extraction, sentiment classification, entity annotation, named entity recognition, and linguistic component analysis. Furthermore, Lionbridge also offers custom data annotation software that your team can license and use for a variety of text annotation projects.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Scale
&lt;/h3&gt;

&lt;p&gt;Based in San Francisco, Scale is a provider of computer vision and NLP data annotation services. Through a combination of human work and Scale’s platform, the company provides the following text annotation services: OCR transcription, text categorization, and comparison.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. KConnect
&lt;/h3&gt;

&lt;p&gt;One problem many AI researchers and developers face is getting access to AI training data for highly specialized fields. The team at KConnect seeks to help annotators quickly and efficiently classify and annotate medical data. Specifically, KConnect provides semantic annotation, text analysis, and semantic search services for medical information.&lt;/p&gt;

&lt;h3&gt;
  
  
  6. Clickworker
&lt;/h3&gt;

&lt;p&gt;Based in the United States and Germany, Clickworker is a crowdsourcing company that has a huge workforce able to perform a variety of tasks. Some of their services include sentiment analysis and categorization.&lt;/p&gt;

&lt;h3&gt;
  
  
  7. ParallelDots Text Annotation APIs
&lt;/h3&gt;

&lt;p&gt;ParallelDots is a provider of numerous text annotation tools and APIs. Some of their solutions include sentiment analysis, emotion analysis, keyword extractors, and named entity recognition.&lt;/p&gt;

&lt;h3&gt;
  
  
  8. Appen
&lt;/h3&gt;

&lt;p&gt;With a huge source of crowdworkers from various countries, Appen is a provider of numerous forms of AI training data. For instance, some of their text annotation services include sentiment annotation, intent annotation, and named entity annotation.&lt;/p&gt;

&lt;h3&gt;
  
  
  9. Dandelion API
&lt;/h3&gt;

&lt;p&gt;Based in Italy, Dandelion API provides a variety of automatic text annotation tools. While they are a relatively new startup company, their tools can be used for entity extraction, sentiment analysis, and text and content classification.&lt;/p&gt;

&lt;h3&gt;
  
  
  10. Dataturks Text Annotation Tools
&lt;/h3&gt;

&lt;p&gt;With an in-house API for data annotation and thousands of partnered outsourcing companies, Dataturks provides various image annotation and text annotation tools. Specifically, some of their text labeling capabilities include text classification, named entity recognition, and part-of-speech labeling.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;a href="https://lionbridge.ai/articles/10-best-text-annotation-services-and-tools/"&gt;Click here&lt;/a&gt; for the original article with link to each tool.
&lt;/h4&gt;

</description>
      <category>datascience</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Crime Datasets for Machine Learning</title>
      <dc:creator>Limarc Ambalina</dc:creator>
      <pubDate>Thu, 10 Dec 2020 03:42:50 +0000</pubDate>
      <link>https://dev.to/otakuhacks/crime-datasets-for-machine-learning-59pj</link>
      <guid>https://dev.to/otakuhacks/crime-datasets-for-machine-learning-59pj</guid>
      <description>&lt;p&gt;For those looking to build text analysis models, analyze crime rates or trends over a specific area or time period, we have compiled a list of the 16 best crime datasets made available for public use. The datasets come from various locations around the world and most of the data covers large time periods. &lt;/p&gt;

&lt;h2&gt;
  
  
  Canada Crime Datasets
&lt;/h2&gt;

&lt;p&gt;Crime in Vancouver – This dataset covers crime in Vancouver, Canada from 2003 to July 2017. The data contains the type of crime, date, street it occurred on, coordinates, and district. &lt;/p&gt;

&lt;p&gt;Ontario Crime Statistics – Available on the Government of Canada website, this dataset includes crime statistics from the province of Ontario from 1998 to 2018. The data includes crime rate per 100,000 people, number of cleared cases, cases cleared by charge, people charged, adults charged, youth charged, and more.&lt;/p&gt;

&lt;p&gt;Toronto Assault Crime – Provided by the Toronto Police Service over the Public Safety Data Portal, this dataset includes an interactive map with every assault incident from 2014 to 2018 plotted on the map. The data is downloadable as a spreadsheet with over 59,000 rows. &lt;/p&gt;

&lt;h2&gt;
  
  
  United Kingdom Crime Datasets
&lt;/h2&gt;

&lt;p&gt;Crime in England and Wales – Published by the Home Office, this dataset contains crime statistics from 2008 – 2009. The data was compiled from the British Crime Survey and recorded crime data from the police. The dataset includes statistics on violent crime, property crime, and more in XLS format.&lt;/p&gt;

&lt;p&gt;London Crime – This dataset contains 13 million rows of data with the following columns: borough, type of crime, and date. &lt;/p&gt;

&lt;h2&gt;
  
  
  United States Crime Datasets
&lt;/h2&gt;

&lt;p&gt;Austin Crime Statistics – With data covering crimes reported in Austin between 2014 and 2016, this dataset contains 159,000 rows of data with 18 columns. The data includes location info, date and time, area, district, and description of the crime.&lt;/p&gt;

&lt;p&gt;Baton Rouge Crime –  This crime dataset contains all incidents handled by the Baton Rouge Police Department. The crimes covered in this dataset include: narcotics, theft, assault, nuisance, vice, battery, damage to property, sexual assaults, and homicide. Due to privacy issues for assault victims, the data is not geocoded. &lt;/p&gt;

&lt;p&gt;Crimes in Boston – This Boston crime dataset includes information about incidents to which Boston PD officers responded from August 2015 to date. The dataset includes information about the type of crime, the date and time of the crime, and the location where it occurred. The CSV file includes the following columns: incident number, offense code, offense code group, offense description, district, reporting area, shooting, date, year, month, day of the week, hour, street, latitude, and longitude.&lt;/p&gt;

&lt;p&gt;Crimes in Chicago – The Chicago crime dataset includes reported crimes dating back to 2001 and is updated constantly with a seven-day lag between updates. The dataset includes location info, incident type and description, year of the incident, and date the record was updated. &lt;/p&gt;

&lt;p&gt;Denver Crime Data – Updated regularly, the Denver Crime Dataset covers criminal offenses in Denver over the past five years and also the current year. The data within this crime dataset comes from the National Incident Based Reporting system and includes the following information: offense codes, offense types, date of crime, reported date, address, and location.&lt;/p&gt;

&lt;p&gt;FBI National Incident Based Reporting System (NIBRS) – This dataset is a great resource for crime or policing analysis in the United States. The original data has been cleaned and organized into one convenient database. &lt;/p&gt;

&lt;p&gt;Los Angeles Crime and Arrest Data – Based on open data from the city of Los Angeles, this dataset includes crime data from 2010 to 2019. The dataset includes the report ID, arrest date, time, area, suspect data, type of charge, charge description, and location info. &lt;/p&gt;

&lt;p&gt;NYC Complaint Data – This New York City crime dataset includes all crimes reported to the New York City Police Department from 2006 to 2017. The data includes 6.5 million rows and 35 columns including: incident date, complaint number, location, coordinates, suspect info, victim info, and more. &lt;/p&gt;

&lt;p&gt;Oakland Crime Statistics – This dataset contains crime data from Oakland between 2011 and 2016. Each year has its own separate CSV file for a combined total of over 1 million rows of data and 10 – 11 columns. &lt;/p&gt;

&lt;p&gt;Open Baltimore Crime Data – This crime dataset is updated every week with a lag time of nine days to allow for changes to the data and processing time. The dataset covers crimes in Baltimore and has 16 columns of data, including date, crime code, location, description, coordinates, and number of incidents. &lt;/p&gt;

&lt;p&gt;Phoenix Crime Data – Updated daily, the Phoenix Crime Dataset is a CSV file that contains crime data from November 2015 to date with a seven-day lag. The data includes information about homicides, rapes, robberies, aggravated assaults, burglaries, thefts, motor vehicle thefts, arson, and drug offenses.&lt;/p&gt;

&lt;p&gt;San Francisco Crime Classification – Containing crime data from 2003 to 2015, this dataset includes the following information: timestamp of incident, category, description of incident, day of the week, district, resolution, address, and coordinates.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;See the original article with links to each dataset &lt;a href="https://lionbridge.ai/datasets/16-best-crime-datasets-for-machine-learning/"&gt;here&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Top 12 Machine Learning Slack Groups for Data Scientists</title>
      <dc:creator>Limarc Ambalina</dc:creator>
      <pubDate>Mon, 16 Nov 2020 06:21:58 +0000</pubDate>
      <link>https://dev.to/otakuhacks/top-12-machine-learning-slack-groups-for-data-scientists-4e40</link>
      <guid>https://dev.to/otakuhacks/top-12-machine-learning-slack-groups-for-data-scientists-4e40</guid>
      <description>&lt;p&gt;Slack is a growing chat client that allows teams to communicate and collaborate on projects in one place. You can make group channels (group chats) for different teams within an organization, where the members can also share documents and comments. You can also make a secure private channel where you direct message one or more people.&lt;/p&gt;

&lt;p&gt;Over the past few years, Slack has been gaining popularity among web developers, data scientists, engineers, bloggers, digital marketers, and more. There are now 10 million daily active users on Slack.&lt;/p&gt;

&lt;p&gt;It’s no longer just used by tech companies to send internal messages. People are also using Slack to get connected with people and resources around the world. We at Lionbridge AI have created this list of machine learning Slack groups for data scientists to meet like-minded people and stay updated on the latest AI and ML trends.&lt;/p&gt;

&lt;h2&gt;
  
  
  Slack Groups for Data Scientists
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Data Quest&lt;/strong&gt;: Data Quest (Data Science Community) is the largest Slack community for data practitioners. Join and chat with data scientists all over the world. Data scientists use this Slack group to swap tutorials and resources, find people to work on projects together, get feedback on their machine learning algorithms and architecture, and discuss machine learning trends and new technologies.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;VR Theory&lt;/strong&gt;: VR Theory is a Slack group discussing the latest in Virtual reality and Augmented reality. Join in on the conversation and nominate the next AMA.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;WebVR&lt;/strong&gt;: WebVR is one of the most active and popular of all VR and AR Slack groups, with some members being well-respected influencers and developers in the field.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data Science Salon&lt;/strong&gt;: The official Slack group for Data Science Salon, an organization that hosts conferences for senior executives, data scientists, developers, analysts, and other technical industry professionals.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Kaggle Noobs&lt;/strong&gt;: A community for Kaggle users and the wider data science industry. This Slack group is a great place to meet other people interested in data science.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AI Crush&lt;/strong&gt;: AI Crush will help you find investors for your AI projects. This Slack group is also a great way to expand your network in the AI field generally.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Silverpond’s Machine Learning and Artificial Intelligence&lt;/strong&gt;: This Slack group was created so that the machine learning and AI community can chat amongst themselves, discuss research, and share their interesting projects.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Commercial Drones&lt;/strong&gt;: Join commercial drone industry experts and discuss topics including regulatory news, aerial media, surveying, inspection, construction, mining, and real estate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;R-Team for Data Analysis&lt;/strong&gt;: A team of R users from all around the world, helping each other learn and explore R for data analysis.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data Scientist / Spark ML Group&lt;/strong&gt;: An online Slack team for communication and knowledge sharing, focused on data science and machine learning with Apache Spark, Python scikit-learn, Scala Breeze, R, and other topics in the big data domain.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TWiML Online Meetup&lt;/strong&gt;: Listeners of This Week in Machine Learning &amp;amp; AI podcast and participants in its monthly online meetup and ongoing study groups. Join via this registration form.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dataparis&lt;/strong&gt;: Dataparis has a total of seven Slack channels for big data analytics and data science: general, random, recruiting, meetups, neural networks, big data, and python.&lt;/p&gt;

&lt;p&gt;If you found these Slack groups helpful and are looking for other platforms to connect with like-minded data scientists, we also recommend joining these Facebook groups and following these AI influencers on Twitter.&lt;/p&gt;

&lt;p&gt;See original article here with links to each Slack group: &lt;a href="https://lionbridge.ai/articles/machine-learning-slack-groups-data-scientists/"&gt;https://lionbridge.ai/articles/machine-learning-slack-groups-data-scientists/&lt;/a&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
    </item>
    <item>
      <title>15 Best Chatbot Datasets for Machine Learning</title>
      <dc:creator>Limarc Ambalina</dc:creator>
      <pubDate>Fri, 13 Nov 2020 07:14:45 +0000</pubDate>
      <link>https://dev.to/otakuhacks/15-best-chatbot-datasets-for-machine-learning-1add</link>
      <guid>https://dev.to/otakuhacks/15-best-chatbot-datasets-for-machine-learning-1add</guid>
      <description>&lt;p&gt;An effective chatbot requires a massive amount of training data in order to quickly solve user inquiries without human intervention. However, the primary bottleneck in chatbot development is obtaining realistic, task-oriented dialog data to train these machine learning-based systems.&lt;/p&gt;

&lt;p&gt;We’ve put together the ultimate list of the best conversational datasets to train a chatbot, broken down into question-answer data, customer support data, dialogue data and multilingual data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Question-Answer Datasets for Chatbot Training
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Question-Answer Dataset: This corpus includes Wikipedia articles, manually-generated factoid questions from them, and manually-generated answers to these questions, for use in academic research.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The WikiQA Corpus: A publicly available set of question and sentence pairs, collected and annotated for research on open-domain question answering. In order to reflect the true information need of general users, the creators used Bing query logs as the question source. Each question is linked to a Wikipedia page that potentially has the answer.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Yahoo Language Data: This page features manually curated QA datasets from Yahoo Answers.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;TREC QA Collection: TREC has had a question answering track since 1999. In each track, the task was defined such that systems were to retrieve small snippets of text containing an answer to open-domain, closed-class questions.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Customer Support Datasets for Chatbot Training
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Ubuntu Dialogue Corpus: Consists of almost one million two-person conversations extracted from the Ubuntu chat logs, used to receive technical support for various Ubuntu-related problems. The full dataset contains 930,000 dialogues and over 100,000,000 words.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Relational Strategies in Customer Service Dataset: A collection of travel-related customer service data from four sources: the conversation logs of three commercial customer service IVAs and the airline forums on TripAdvisor.com during August 2016.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Customer Support on Twitter: This dataset on Kaggle includes over 3 million tweets and replies from the biggest brands on Twitter.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Dialogue Datasets for Chatbot Training
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Semantic Web Interest Group IRC Chat Logs: This automatically generated IRC chat log is available in RDF on a daily basis going back to 2004, including timestamps and nicknames.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Cornell Movie-Dialogs Corpus: This corpus contains a large metadata-rich collection of fictional conversations extracted from raw movie scripts: 220,579 conversational exchanges between 10,292 pairs of movie characters involving 9,035 characters from 617 movies.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;ConvAI2 Dataset: The dataset contains more than 2000 dialogues for a PersonaChat competition, where human evaluators recruited via the crowdsourcing platform Yandex.Toloka chatted with bots submitted by teams.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Santa Barbara Corpus of Spoken American English: This dataset includes approximately 249,000 words of transcription, audio, and timestamps at the level of individual intonation units.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The NPS Chat Corpus: This corpus consists of 10,567 posts out of approximately 500,000 posts gathered from various online chat services in accordance with their terms of service.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Maluuba Goal-Oriented Dialogue: Open dialogue dataset where the conversation aims at accomplishing a task or making a decision – specifically, finding flights and a hotel. The dataset contains complex conversations and decision-making covering 250+ hotels, flights, and destinations.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Multi-Domain Wizard-of-Oz dataset (MultiWOZ): A fully-labeled collection of written conversations spanning over multiple domains and topics. The dataset contains 10k dialogues, and is at least one order of magnitude larger than all previous annotated task-oriented corpora.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Multilingual Chatbot Training Datasets
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;NUS Corpus: This corpus was created for social media text normalization and translation. It was built by randomly selecting 2,000 messages from the NUS English SMS corpus, which were then translated into formal Chinese.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;EXCITEMENT Datasets: These datasets, available in English and Italian, contain negative feedback from customers in which they state their reasons for dissatisfaction with a given company.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;View the original article here for links to all datasets:&lt;br&gt;
&lt;a href="https://lionbridge.ai/datasets/15-best-chatbot-datasets-for-machine-learning/"&gt;https://lionbridge.ai/datasets/15-best-chatbot-datasets-for-machine-learning/&lt;/a&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Intro to 4 Types of Audio Classification</title>
      <dc:creator>Limarc Ambalina</dc:creator>
      <pubDate>Wed, 11 Nov 2020 07:54:55 +0000</pubDate>
      <link>https://dev.to/otakuhacks/intro-to-4-types-of-audio-classification-c7e</link>
      <guid>https://dev.to/otakuhacks/intro-to-4-types-of-audio-classification-c7e</guid>
      <description>&lt;p&gt;Audio classification is the process of listening to and analyzing audio recordings. Also known as sound classification, this process is at the heart of a variety of modern AI technology including virtual assistants, automatic speech recognition, and text to speech applications. You can also find it in predictive maintenance, smarthome security systems, and multimedia indexing and retrieval.&lt;/p&gt;

&lt;p&gt;Audio classification projects like those mentioned above start with annotated audio data. Machines require this data to learn how to hear and what to listen for. Using this data, they develop the ability to differentiate between sounds to complete specific tasks. The annotation process often involves classifying audio files based on project-specific needs through the help of dedicated audio classification services.&lt;/p&gt;

&lt;p&gt;In this article we look at four types of classification and related use-cases for each.&lt;/p&gt;

&lt;h2&gt;
  
  
  Types of Audio Classification
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Acoustic Data Classification:
&lt;/h3&gt;

&lt;p&gt;Also known as acoustic event detection, this type of classification identifies where an audio signal was recorded. This means differentiating between environments such as restaurants, schools, homes, offices, streets, etc. One use of acoustic data classification is building and maintaining sound libraries for audio multimedia. It also plays a role in ecosystem monitoring; one example is estimating the abundance of fish in a particular part of the ocean based on their acoustic data.&lt;/p&gt;

&lt;h3&gt;
  
  
  Environmental Sound Classification:
&lt;/h3&gt;

&lt;p&gt;Just as the name implies, this is the classification of sounds found within different environments. For example, recognizing urban sound samples such as car horns, roadwork, sirens, human voices, etc. This is used in security systems to detect sounds like breaking glass. It is also used for predictive maintenance by detecting sound discrepancies in factory machinery. It is even used to differentiate animal calls for wildlife observation and preservation.&lt;/p&gt;
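&lt;p&gt;As a rough illustration of the kind of acoustic feature such classifiers rely on, the sketch below uses the zero-crossing rate, one very simple feature, to separate a tonal sound (e.g. a siren) from broadband noise (e.g. breaking glass). The signals are synthetic and the feature choice is only a toy example; production systems use richer features and learned models:&lt;/p&gt;

```python
import math
import random

def zero_crossing_rate(signal):
    """Fraction of consecutive sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(signal, signal[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(signal) - 1)

sr = 8000  # sample rate in Hz
# A pure 440 Hz tone crosses zero twice per cycle: ZCR near 2*440/8000 = 0.11
tone = [math.sin(2 * math.pi * 440 * t / sr) for t in range(sr)]
# Uniform noise flips sign about half the time: ZCR near 0.5
random.seed(0)
noise = [random.uniform(-1, 1) for _ in range(sr)]

print("tone ZCR: ", round(zero_crossing_rate(tone), 3))
print("noise ZCR:", round(zero_crossing_rate(noise), 3))
```

&lt;p&gt;A threshold on features like this gives a crude detector; annotated recordings of each target sound are what let a trained model do the same job reliably.&lt;/p&gt;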

&lt;h3&gt;
  
  
  Music classification:
&lt;/h3&gt;

&lt;p&gt;Music classification is the process of classifying music based on factors such as genre or instruments played. This classification plays a key role in organizing audio libraries by genre, improving recommendation algorithms, and discovering trends and listener preferences through data analysis.&lt;/p&gt;

&lt;h3&gt;
  
  
  Natural Language Utterance Classification:
&lt;/h3&gt;

&lt;p&gt;This is the classification of natural language recordings based on language spoken, dialect, semantics, or other language features. In other words, the classification of human speech. This kind of audio classification is most common in chatbots and virtual assistants, but is also prevalent in machine translation and text to speech applications.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Importance of Audio Data Quality
&lt;/h3&gt;

&lt;p&gt;For projects involving audio classification, the quality of your dataset will largely determine the quality of your project results. Therefore, to ensure an accurate level of audio classification, you’ll need a good volume of high-quality, accurately annotated data.&lt;/p&gt;

&lt;p&gt;This article was originally published on: &lt;a href="https://lionbridge.ai/articles/what-is-audio-classification/"&gt;https://lionbridge.ai/articles/what-is-audio-classification/&lt;/a&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Climate Change Datasets</title>
      <dc:creator>Limarc Ambalina</dc:creator>
      <pubDate>Wed, 28 Oct 2020 05:43:52 +0000</pubDate>
      <link>https://dev.to/otakuhacks/climate-change-datasets-4igl</link>
      <guid>https://dev.to/otakuhacks/climate-change-datasets-4igl</guid>
      <description>&lt;p&gt;Data is a central piece of the climate change debate. With the climate change datasets on this list, many data scientists have created visualizations and models to measure and track the change in surface temperatures, sea ice levels, and more. Many of these datasets have been made public to allow people to contribute and add valuable insight into the way the climate is changing and its causes. &lt;/p&gt;

&lt;p&gt;We hope this collection provides you with a jumping off point to use your skills to contribute to one of the biggest and most important challenges of our time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Global Climate Change Datasets
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--8z5ytKlI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/psl5crrp2puzswgtqf0s.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--8z5ytKlI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/psl5crrp2puzswgtqf0s.jpg" alt="Alt Text" width="800" height="370"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;1. Berkeley Earth Surface Temperature Data – From the Berkeley Earth Data page, this dataset is made up of temperature recordings from the Earth’s surface.&lt;/p&gt;

&lt;p&gt;The data ranges from November 1st, 1743 to December 1st, 2015. The dataset is divided into several files including: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GlobalTemperatures&lt;/li&gt;
&lt;li&gt;GlobalLandTemperaturesByCountry&lt;/li&gt;
&lt;li&gt;GlobalLandTemperaturesByState&lt;/li&gt;
&lt;li&gt;GlobalLandTemperaturesByMajorCity&lt;/li&gt;
&lt;li&gt;GlobalLandTemperaturesByCity&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;2. Global Climate Change Data – This dataset includes information from the Climate Change Knowledge Portal and World Development Indicators. It covers various topics such as greenhouse gas emissions, energy consumption, and more. The total time period of the data covers 1990 – 2011.&lt;/p&gt;

&lt;p&gt;3. International Greenhouse Gas Emissions – Created by the United Nations, this Kaggle dataset contains Greenhouse Gas Inventory Data from 1990 to 2014. The official UN website has updated the dataset up to 2017. It includes emission levels by country and region for the following gases:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;carbon dioxide (CO2)&lt;/li&gt;
&lt;li&gt;methane (CH4)&lt;/li&gt;
&lt;li&gt;nitrous oxide (N2O)&lt;/li&gt;
&lt;li&gt;hydrofluorocarbons (HFCs)&lt;/li&gt;
&lt;li&gt;perfluorocarbons (PFCs)&lt;/li&gt;
&lt;li&gt;unspecified mix of HFCs and PFCs&lt;/li&gt;
&lt;li&gt;sulphur hexafluoride (SF6)&lt;/li&gt;
&lt;li&gt;nitrogen trifluoride (NF3)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;4. Daily Sea Ice Extent Data – From the National Snow and Ice Data Center, this climate change dataset has information on the Earth’s cryosphere, and includes glacier, ice, snow, and frozen ground data. The dataset has seven columns: year, month, day, extent, missing, source, and hemisphere. Extent refers to the area of the ocean that includes portions of sea ice.&lt;/p&gt;
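&lt;p&gt;Given the seven-column layout described above, a typical first step is to aggregate daily extent values into yearly means. The sketch below does this for a few invented rows; the numbers are placeholders, not real measurements:&lt;/p&gt;

```python
import csv
import io
from collections import defaultdict

# Invented sample rows in the dataset's described layout
sample = """year,month,day,extent,missing,source,hemisphere
2015,1,1,13.50,0,NSIDC,north
2015,1,2,13.42,0,NSIDC,north
2016,1,1,12.98,0,NSIDC,north
2016,1,2,12.90,0,NSIDC,north
"""

# Group daily extent readings by year, then average them
by_year = defaultdict(list)
for row in csv.DictReader(io.StringIO(sample)):
    by_year[row["year"]].append(float(row["extent"]))

means = {year: sum(values) / len(values) for year, values in by_year.items()}
for year in sorted(means):
    print(year, round(means[year], 2))  # 2015 13.46, then 2016 12.94
```

&lt;p&gt;Swapping io.StringIO for open() on the downloaded CSV, and handling the missing column, gets you to a real yearly trend line.&lt;/p&gt;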


&lt;p&gt;5. Climate Change Adaptation of Coffee Production – From the Harvard Dataverse, this dataset was created to determine the impact of climate change on coffee production quality in Nicaragua. The dataset is divided into six GeoTIFF raster files.&lt;/p&gt;

&lt;p&gt;6. Climate Change in Russia – As Russia is one of the largest producers of CO2 emissions worldwide, this portal on Statista highlights Russia’s CO2 emissions volume from 1985 to 2019. It also includes information about the percentage of the Russian population who have been exposed to pollution.&lt;/p&gt;

&lt;p&gt;Please note that this dataset is from Statista. Some of the charts and statistics within this dataset may require a premium Statista account. &lt;/p&gt;

&lt;p&gt;7. The Climate Change Knowledge Portal – This portal from the World Bank Group is an easy-to-navigate platform where you can view climate change data visualizations based on historical data and projections. You can browse the data by impact sector: energy, water, agriculture, and health. Alternatively, you can also browse by country, region, and watershed. Most importantly, the data is available for free download.&lt;/p&gt;

&lt;h3&gt;
  
  
  United States Data
&lt;/h3&gt;

&lt;p&gt;8. Climate Change Projections and Impacts for New York State – This dataset is curated on the New York State government website. It contains climate data projections for three time periods: the 2020s, 2050s, and 2080s. The dataset includes the following data variables:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Average annual temperature&lt;/li&gt;
&lt;li&gt;Average annual rainfall&lt;/li&gt;
&lt;li&gt;Extreme weather events&lt;/li&gt;
&lt;li&gt;Rise of sea levels&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;9. SGMA Climate Change Resources – From the California Natural Resources Agency, the SGMA Climate Change Resources Dataset includes data on changes in precipitation and bodies of water within the state of California. Some of the data provided includes climate condition projections for 2030 and 2070.&lt;/p&gt;

&lt;h2&gt;
  
  
  Social Media Climate Change Datasets
&lt;/h2&gt;

&lt;p&gt;10. Harvard Dataset of Climate Change Tweet IDs – Collected between September 2017 and May 2019, the Climate Change Tweet IDs Dataset contains the IDs from over 39 million tweets about climate change. The tweets were tracked and curated using these hashtags related to climate change:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;#climatechange&lt;/li&gt;
&lt;li&gt;#climatechangeisreal&lt;/li&gt;
&lt;li&gt;#actonclimate&lt;/li&gt;
&lt;li&gt;#globalwarming&lt;/li&gt;
&lt;li&gt;#climatechangehoax&lt;/li&gt;
&lt;li&gt;#climatedeniers&lt;/li&gt;
&lt;li&gt;#climatechangeisfalse &lt;/li&gt;
&lt;li&gt;#globalwarminghoax &lt;/li&gt;
&lt;li&gt;#climatechangenotreal&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;11. Sentiment of Climate Change – From Crowdflower, this dataset includes tweets that were classified for their sentiment by human contributors. The tweets were classified as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Yes = Content suggests global warming is happening&lt;/li&gt;
&lt;li&gt;No = Content suggests global warming is not happening&lt;/li&gt;
&lt;li&gt;I can’t tell = Content is not clear or completely not related to global warming&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We hope you found this list of climate change datasets useful.&lt;/p&gt;

&lt;p&gt;Please see the original &lt;a href="https://lionbridge.ai/datasets/11-best-climate-change-datasets-for-machine-learning/"&gt;climate change datasets&lt;/a&gt; article for links to each dataset.&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>5 Must-read Papers on Product Categorization for Data Scientists</title>
      <dc:creator>Limarc Ambalina</dc:creator>
      <pubDate>Mon, 12 Oct 2020 05:30:34 +0000</pubDate>
      <link>https://dev.to/otakuhacks/5-must-read-papers-on-product-categorization-for-data-scientists-3mih</link>
      <guid>https://dev.to/otakuhacks/5-must-read-papers-on-product-categorization-for-data-scientists-3mih</guid>
      <description>&lt;p&gt;Product categorization/product classification is the organization of products into their respective departments or categories. As well, a large part of the process is the design of the product taxonomy as a whole.&lt;/p&gt;

&lt;p&gt;Product categorization was initially a text classification task that analyzed the product's title to choose the appropriate category. However, numerous methods have been developed which take into account the product title, description, images, and other available metadata. The following papers on product categorization represent essential reading in the field and offer novel approaches to &lt;a href="https://lionbridge.ai/articles/what-is-product-categorization/"&gt;product classification&lt;/a&gt; tasks.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Don't Classify, Translate
&lt;/h2&gt;

&lt;p&gt;In this paper, researchers from the National University of Singapore and the Rakuten Institute of Technology propose and explain a novel machine translation approach to product categorization. The experiment uses the Rakuten Data Challenge and Rakuten Ichiba datasets. Their method translates or converts a product's description into a sequence of tokens which represent a root-to-leaf path to the correct category. Using this method, they are also able to propose meaningful new paths in the taxonomy.&lt;/p&gt;

&lt;p&gt;The researchers state that their method outperforms many of the existing classification algorithms commonly used in machine learning today.&lt;/p&gt;
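&lt;p&gt;To make the idea concrete, here is a minimal sketch of how product titles and root-to-leaf category paths can be paired as source and target sequences for a translation-style model. This is not the paper’s actual implementation; the field names and tokenization below are purely illustrative:&lt;/p&gt;

```python
def make_seq2seq_pairs(products):
    """Build (source, target) training pairs for a 'translation' approach
    to categorization: the source is the product title tokens, and the
    target is the sequence of category labels from root to leaf."""
    pairs = []
    for product in products:
        source = product["title"].lower().split()
        target = product["category_path"].split(" > ")
        pairs.append((source, target))
    return pairs
```

&lt;p&gt;A seq2seq model trained on such pairs emits the category path token by token, which is what lets it propose new, previously unseen root-to-leaf paths in the taxonomy.&lt;/p&gt;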

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Published/Last Updated&lt;/strong&gt; - Dec. 14, 2018&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Authors and Contributors&lt;/strong&gt; - Maggie Yundi Li (National University of Singapore), Stanley Kok (National University of Singapore), and Liling Tan (Rakuten Institute of Technology)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://arxiv.org/pdf/1812.05774v1.pdf"&gt;Read Now&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Large-Scale Categorization of Japanese Product Titles Using Neural Attention Models
&lt;/h2&gt;

&lt;p&gt;The authors of this paper propose attention convolutional neural network (ACNN) models over baseline convolutional neural network (CNN) models and gradient boosted tree (GBT) classifiers. The study uses Japanese product titles taken from Rakuten Ichiba as training data. Using this data, the authors compare the performance of the three methods (ACNN, CNN, and GBT) for large-scale product categorization. While differences in accuracy can be less than 5%, even minor improvements in accuracy can result in millions of additional correct categorizations.&lt;br&gt;
Lastly, the authors explain how an ensemble of ACNN and GBT models can further minimize false categorizations.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Published/Last Updated&lt;/strong&gt; - April 2017, for EACL 2017&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Authors and Contributors&lt;/strong&gt; - From the Rakuten Institute of Technology: Yandi Xia, Aaron Levine, Pradipto Das, Giuseppe Di Fabbrizio, Keiji Shinzato, and Ankur Datta&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://www.aclweb.org/anthology/E17-2105.pdf"&gt;Read Now&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Atlas: A Dataset and Benchmark for Ecommerce Clothing Product Classification
&lt;/h2&gt;

&lt;p&gt;Researchers at the University of Colorado and Ericsson Research (Chennai, India) have created a large product dataset known as Atlas. In this paper, the team presents their dataset, which includes over 186,000 images of clothing products along with their product titles. Furthermore, they introduce related work in the field that has influenced their study. Finally, they benchmark the dataset using a ResNet34 classification model and a sequence-to-sequence model to categorize the products.&lt;/p&gt;

&lt;p&gt;The data is taken from Indian ecommerce stores, so some of the categories used may not be applicable to Western markets. However, the dataset has been open-sourced and is available on GitHub.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Published/Last Updated&lt;/strong&gt; - Aug. 19, 2019&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Authors and Contributors&lt;/strong&gt; - Venkatesh Umaashankar (Ericsson Research), Girish Shanmugam (Ericsson Research), and Aditi Prakash (University of Colorado)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://arxiv.org/pdf/1908.08984v1.pdf"&gt;Read Now&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Large Scale Product Categorization using Structured and Unstructured Attributes
&lt;/h2&gt;

&lt;p&gt;In this study, a team at WalmartLabs compares hierarchical models to flat models for product categorization. The researchers employ deep-learning-based models that extract features from each product to create a product signature.&lt;/p&gt;

&lt;p&gt;In the paper, the researchers describe a multi-LSTM and multi-CNN based approach to this extreme classification task. Furthermore, they present a novel way to use structured attributes. The team states that their methods can be scaled to take into account any number of product attributes during categorization.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Published/Last Updated&lt;/strong&gt; - Mar. 1, 2019&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Authors and Contributors&lt;/strong&gt; - From WalmartLabs: Abhinandan Krishnan and Abilash Amarthaluri&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://arxiv.org/pdf/1903.04254v1.pdf"&gt;Read Now&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Multi-Label Product Categorization Using Multi-Modal Fusion Models
&lt;/h2&gt;

&lt;p&gt;In this paper, researchers from New York University and U.S. Bank investigate multi-modal approaches to categorize products on Amazon. Their approach utilizes multiple classifiers trained on each type of input data from the product listings. Using a dataset of 9.4 million Amazon products, they developed a tri-modal model for product classification based on product images, titles, and descriptions. Their tri-modal late fusion model achieves an F1 score of 88.2%.&lt;/p&gt;

&lt;p&gt;The findings of their study demonstrate that increasing the number of modalities could improve performance in multi-label product categorization.&lt;/p&gt;
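&lt;p&gt;As a rough illustration of the late-fusion idea (a simplified sketch, not the authors’ exact model), each modality’s classifier outputs class probabilities for a product, and the fused prediction is a weighted average of those distributions:&lt;/p&gt;

```python
def late_fusion(per_modality_probs, weights=None):
    """Late fusion by weighted averaging of per-modality class
    probabilities. per_modality_probs is a list of probability vectors
    (e.g. one each from the image, title, and description classifiers)
    for a single product; returns the index of the winning class."""
    k = len(per_modality_probs)
    weights = weights or [1.0 / k] * k  # default: equal weighting
    n_classes = len(per_modality_probs[0])
    fused = [sum(w * probs[c] for w, probs in zip(weights, per_modality_probs))
             for c in range(n_classes)]
    return max(range(n_classes), key=fused.__getitem__)
```

&lt;p&gt;Because fusion happens on the classifier outputs rather than the raw features, each per-modality model can be trained and tuned independently before being combined.&lt;/p&gt;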

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Published/Last Updated&lt;/strong&gt; - June 30, 2019&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Authors and Contributors&lt;/strong&gt; - Pasawee Wirojwatanakul (New York University) and Artit Wangperawong (U.S. Bank)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://arxiv.org/pdf/1907.00420.pdf"&gt;Read Now&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;In the papers on product categorization above, the researchers trained their models on open datasets which included millions of products. However, if you are building a product categorization model for commercial use, many open datasets may not be available to you.&lt;/p&gt;

&lt;p&gt;Looking for training data for your product classification model? Check out this &lt;a href="https://lionbridge.ai/training-data-guide/"&gt;training data guide&lt;/a&gt; and these &lt;a href="https://lionbridge.ai/datasets/ultimate-dataset-aggregator-for-machine-learning/"&gt;open datasets&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
    </item>
    <item>
      <title>How Self-Agreement Can Improve Your Training Data</title>
      <dc:creator>Limarc Ambalina</dc:creator>
      <pubDate>Mon, 14 Sep 2020 02:31:41 +0000</pubDate>
      <link>https://dev.to/otakuhacks/how-self-agreement-can-improve-your-training-data-a54</link>
      <guid>https://dev.to/otakuhacks/how-self-agreement-can-improve-your-training-data-a54</guid>
      <description>&lt;p&gt;Finding, creating, and annotating training data is one of the most intricate and painstaking tasks in machine learning (ML) model development. Many crowdsourced data annotation solutions often employ inter-annotator agreement checks to make sure their labelling team understands the labeling tasks well and is performing up to the client’s standards. However, some studies have shown that &lt;strong&gt;self-agreement checks are as important&lt;/strong&gt; or even more important than inter-annotator agreement when evaluating your annotation team for quality.&lt;/p&gt;

&lt;p&gt;In this article, we will explain what self-agreement is and introduce an ML study where self-agreement checks were crucial to the quality of the team’s training data and the accuracy of their model.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Self-Agreement in Machine Learning?
&lt;/h2&gt;

&lt;p&gt;Simply put, self-agreement is a QA protocol you can use in data annotation to evaluate the abilities of individual annotators. Whereas inter-annotator agreement protocols check to see if two or more annotators agree with each other, self-agreement checks whether or not a single annotator is consistent in their own annotations. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--TfVicGwC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/ziv7rqrv8iph9e2ry0a6.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--TfVicGwC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/ziv7rqrv8iph9e2ry0a6.jpg" alt="Alt Text" width="800" height="370"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Self-agreement Checks&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For example, a simple inter-annotator agreement workflow would be to send two separate annotators the same piece of data. Then, you would check to see if their annotations are the same. If they are not the same, you could then bring that piece of data to a supervisor to make a ruling on which annotation is correct. &lt;/p&gt;

&lt;p&gt;On the other hand, with self-agreement protocols, you would &lt;strong&gt;send the same annotator the same piece of data twice&lt;/strong&gt; to see if they provide the same label both times. For example, if they are tasked with annotating 100 images, you could set image 1 and image 35 as the same image, evaluate the result, and repeat this process many times. Theoretically, you could send an annotator the same data more than twice, but the effect diminishes as the annotator starts to recognize data points they have seen before.&lt;/p&gt;
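&lt;p&gt;The workflow above can be sketched in a few lines of Python. The function names and queue format here are illustrative, not taken from any particular annotation tool:&lt;/p&gt;

```python
import random

def build_queue_with_duplicates(items, dup_fraction=0.05, seed=0):
    """Insert hidden duplicates into an annotation queue so the annotator
    unknowingly labels a fraction of items twice. Returns the queue of
    (item_id, item) pairs plus the ids of the duplicated items."""
    rng = random.Random(seed)
    n_dups = max(1, int(len(items) * dup_fraction))
    dup_ids = rng.sample(range(len(items)), n_dups)
    queue = list(enumerate(items))
    for i in dup_ids:
        # re-insert the duplicate at a random slot so the two copies
        # are unlikely to appear back to back
        queue.insert(rng.randrange(len(queue)), (i, items[i]))
    return queue, dup_ids

def self_agreement_rate(labels_by_item):
    """labels_by_item maps item_id to the list of labels one annotator
    gave it; returns the fraction of re-labeled items answered the
    same way both times."""
    pairs = [ls for ls in labels_by_item.values() if len(ls) > 1]
    consistent = sum(1 for ls in pairs if len(set(ls)) == 1)
    return consistent / len(pairs)
```

&lt;p&gt;An annotator who labels a duplicated item differently on the second pass lowers their self-agreement rate, flagging them for review.&lt;/p&gt;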

&lt;h3&gt;
  
  
  Why Use Self-Agreement in Your Data Annotation Workflow?
&lt;/h3&gt;

&lt;p&gt;The point of self-agreement is to evaluate the abilities of the annotator and make sure they are annotating each piece of data correctly, rather than simply rushing through the project to get it done as quickly as possible. Furthermore, a 2016 study provides concrete evidence that self-agreement checks can help weed out low-quality annotators and improve the quality of your dataset.&lt;/p&gt;

&lt;h2&gt;
  
  
  Self-Agreement Tests Can Improve Data Quality
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--khTFSEfV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/ahany8xgfezxgaskqewt.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--khTFSEfV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/ahany8xgfezxgaskqewt.jpg" alt="Alt Text" width="800" height="370"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In a &lt;a href="https://arxiv.org/abs/1602.07563"&gt;2016 sentiment analysis&lt;/a&gt; study by researchers from the Jozef Stefan Institute, the team found that the quality of human annotators could play a larger role in the accuracy of the model than the type of model itself.&lt;/p&gt;

&lt;p&gt;The team’s goal was to create a sentiment classifier for Twitter posts in multiple languages, so they analyzed 1.6 million tweets in 13 different languages. These tweets were all labeled for sentiment by human annotators. Ultimately, the researchers said:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Our main conclusion is that the choice of a particular classifier type is not so important, but that the training data has a major impact on the results."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Many of the team’s insights concerned the relationship between inter-annotator agreement and self-agreement, and how those values relate to the quality of the data. Firstly, they found that self-agreement will almost always be higher than inter-annotator agreement.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ui7qWP2p--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/bkzsz6sye3hxulxj4wka.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ui7qWP2p--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/bkzsz6sye3hxulxj4wka.png" alt="Alt Text" width="512" height="465"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Figure 1 from Multilingual Twitter Sentiment Classification&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In the above diagram and the rest of this article, Alpha refers to &lt;a href="https://repository.upenn.edu/cgi/viewcontent.cgi?article=1043&amp;amp;context=asc_papers"&gt;Krippendorff’s Alpha&lt;/a&gt;, a coefficient used to calculate agreement between observers. An Alpha of 1 is the highest possible score, indicating perfect agreement.&lt;/p&gt;
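&lt;p&gt;For nominal labels, Krippendorff’s Alpha can be computed from a coincidence matrix of label pairs. The sketch below is a simplified implementation for the nominal case only; the full statistic also covers ordinal, interval, and other metrics:&lt;/p&gt;

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(units):
    """Krippendorff's Alpha for nominal labels. Each unit is the list of
    labels assigned to one item (by one annotator twice for self-agreement,
    or by several annotators for inter-annotator agreement).
    Returns 1.0 for perfect agreement and values near 0 for chance."""
    coincidence = Counter()
    for labels in units:
        m = len(labels)
        if m > 1:  # items with a single label carry no agreement information
            for a, b in permutations(labels, 2):
                coincidence[(a, b)] += 1.0 / (m - 1)
    marginal = Counter()
    for (a, _), weight in coincidence.items():
        marginal[a] += weight
    n = sum(marginal.values())
    observed = sum(w for (a, b), w in coincidence.items() if a != b)
    expected = sum(marginal[a] * marginal[b]
                   for a in marginal for b in marginal if a != b)
    return 1.0 - (n - 1) * observed / expected
```

&lt;p&gt;Feeding it each annotator’s pairs of repeated labels gives a self-agreement Alpha; feeding it the labels from different annotators on shared items gives the inter-annotator Alpha.&lt;/p&gt;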

&lt;p&gt;If you have low self-agreement, you will have even lower inter-annotator agreement in most, if not all, cases. As a result, self-agreement tests may be an easier and quicker way to track the overall quality of your dataset by analyzing the performance of your annotators. &lt;/p&gt;

&lt;p&gt;For example, if you are aiming for an inter-annotator agreement Alpha of 0.6, but the self-agreement levels of most of your annotators are at 0.4, chances are you aren’t going to hit the inter-annotator agreement Alpha you were hoping for. Therefore, you may want to focus on raising self-agreement above your desired levels before proceeding with inter-annotator agreement checks.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--vCvv9C-o--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/bjryr4thlss7itrrbqyd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--vCvv9C-o--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/bjryr4thlss7itrrbqyd.png" alt="Alt Text" width="800" height="321"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Low Quality Annotators and Low self-agreement&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In the study, one example of this was the low quality of the Spanish tweets sentiment dataset (see image above). The researchers found that the self-agreement was at 0.244, while the inter-annotator agreement was at 0.120. &lt;/p&gt;

&lt;h2&gt;
  
  
  Low Quality Spanish Annotations Degraded the Emojis Dataset
&lt;/h2&gt;

&lt;p&gt;As part of the overall project, the team created an Emojis Dataset which included tweets from various languages that had emojis. They collected 70,000 tweets in total across various languages. Around 20,000 of these tweets were from the poorly-annotated Spanish dataset mentioned in the previous section. As a result, the total self-agreement of the Emojis Dataset was at Alpha 0.544. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--eW0dmneC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/1ulc0fe976dagj5fnp5q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--eW0dmneC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/1ulc0fe976dagj5fnp5q.png" alt="Alt Text" width="512" height="465"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As a result, the emojis dataset (as seen in Figure 1 above) was the only dataset where the self-agreement was lower than the inter-annotator agreement. However, after removing all of the Spanish tweets from the Emojis Dataset, the Alpha of the Emojis Dataset jumped to 0.720. &lt;/p&gt;

&lt;p&gt;This insight reconfirmed the team’s conclusion: “Low quality annotators have to be excluded and their annotations removed from the datasets.” This is especially true when you have a large project using multiple annotators or even multiple annotation teams.&lt;/p&gt;

&lt;h2&gt;
  
  
  Should You Use Self-Agreement or Inter-Annotator Agreement to Improve Training Data Quality?
&lt;/h2&gt;

&lt;p&gt;The safe answer is: both. In most cases, you should not abandon inter-annotator agreement testing entirely. In fact, the research team stated that the two measures can provide you with different insights:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“It turns out that the self-agreement is a good measure to identify low-quality annotators, and that the inter-annotator agreement provides a good estimate of the objective difficulty of the task”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So if the self-agreement of an annotator is extremely low, then they either aren’t prepared for the labeling task or they’re simply the wrong person for your project. If inter-annotator agreement is low, but self-agreement is at acceptable levels, then the task is either too difficult or calls for subjective reasoning, as is often the case with sentiment classification projects.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Key Takeaway
&lt;/h2&gt;

&lt;p&gt;There is no one-size-fits-all method when it comes to testing for inter-annotator agreement and self-agreement. It depends on the task and on what your acceptable Alpha levels are. Throughout their paper, the researchers consistently emphasized that both levels should be tested continually throughout the training data creation process.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Annotators should be informed when levels drop too low&lt;/strong&gt;, and actions should be taken to maintain the quality of the training data. Sometimes that will unfortunately mean removing low-quality annotators from your project and labeling their data again with a better annotator. &lt;/p&gt;

&lt;p&gt;Hopefully, this guide helped you understand the power of self-agreement checks and how they can improve the quality of your data. If you’re looking to learn more about how to improve your data, check out this &lt;a href="https://lionbridge.ai/training-data-guide/"&gt;in-depth training data guide&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
