<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: crowintelligence</title>
    <description>The latest articles on DEV Community by crowintelligence (@crowintelligence).</description>
    <link>https://dev.to/crowintelligence</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F379928%2F9d1c9d1b-d654-4852-8a80-936d82156d81.jpg</url>
      <title>DEV Community: crowintelligence</title>
      <link>https://dev.to/crowintelligence</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/crowintelligence"/>
    <language>en</language>
    <item>
      <title>Graph Theory and Network Science for Natural Language Processing – Part 2, Databases and Analytics Engines </title>
      <dc:creator>crowintelligence</dc:creator>
      <pubDate>Tue, 30 Jun 2020 10:20:01 +0000</pubDate>
      <link>https://dev.to/crowintelligence/graph-theory-and-network-science-for-natural-language-processing-part-2-databases-and-analytics-engines-1942</link>
      <guid>https://dev.to/crowintelligence/graph-theory-and-network-science-for-natural-language-processing-part-2-databases-and-analytics-engines-1942</guid>
      <description>&lt;p&gt;From keyword extraction to knowledge graphs, graph and network science offer a good framework to deal with natural language. We love using graph-based methods in our work so much, like&lt;a href="https://crowintelligence.org/2020/03/27/what-if-you-need-more-labeled-data-label-spreading-and-propagation/"&gt; generating more labeled data&lt;/a&gt;, &lt;a href="https://crowintelligence.org/2020/03/20/from-babbling-to-talking-visualizing-language-acquisition/"&gt;visualizing language acquisition&lt;/a&gt; and &lt;a href="https://crowintelligence.org/2020/04/03/the-marriage-of-artificial-intelligence-and-art/"&gt;shedding light on hidden biases in language&lt;/a&gt;, that we decided to start a series on the topic. &lt;a href="https://crowintelligence.org/2020/06/22/graph-theory-and-network-science-for-natural-language-processing-part-1/"&gt;The first part&lt;/a&gt; explored the theoretical background of network science and dealt with graphs using Python. This part focuses on graph processing frameworks and graph databases.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why do we need graph databases and frameworks?
&lt;/h2&gt;

&lt;p&gt;The question may seem naive to everyone but newbies. Keep in mind that at some point:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Your data doesn’t fit into your computer’s memory.&lt;/li&gt;
&lt;li&gt;Data processing lasts for ages even if you use parallelization techniques.&lt;/li&gt;
&lt;li&gt;Working with csv, json, parquet or any other file format becomes too complicated.&lt;/li&gt;
&lt;li&gt;You must manage your data, because it is changing over time.&lt;/li&gt;
&lt;li&gt;You need to process your data frequently to answer various questions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As we mentioned in the first part of the series, NetworkX is not good at handling large networks, i.e. ones with more than about 100,000 nodes, though it really depends on the structure of the network. If you work with a large dataset, you need two tools: one for processing it (e.g. to compute centrality measures or find clusters) and another for storing it and running analytic queries on it (e.g. finding the shortest path between two nodes, or listing all nodes that can be reached from a given node within five or fewer steps).&lt;/p&gt;
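&lt;p&gt;To make that second kind of query concrete, here is a minimal, dependency-free Python sketch of the "all nodes within five or fewer steps" question. The toy graph is made up for illustration; a real graph database would answer this with a traversal query instead of hand-written BFS.&lt;/p&gt;

```python
def nodes_within_k_steps(adjacency, start, k):
    """Collect every node reachable from `start` in at most k steps (BFS)."""
    seen = {start}
    frontier = [start]
    for _ in range(k):          # expand the frontier k times
        next_frontier = []
        for node in frontier:
            for neighbor in adjacency.get(node, []):
                if neighbor not in seen:
                    seen.add(neighbor)
                    next_frontier.append(neighbor)
        frontier = next_frontier
    seen.discard(start)
    return seen

# A toy graph: node mapped to its neighbors
graph = {
    "alice": ["bob"],
    "bob": ["carol"],
    "carol": ["dave"],
    "dave": ["eve"],
    "eve": ["frank"],
    "frank": [],
}

reachable = nodes_within_k_steps(graph, "alice", 5)
```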

&lt;h2&gt;
  
  
  Graph Databases
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--KMsBo0-s--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i2.wp.com/crowintelligence.org/wp-content/uploads/2020/06/Gremlin_programming_language.png%3Fw%3D800%26ssl%3D1" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--KMsBo0-s--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i2.wp.com/crowintelligence.org/wp-content/uploads/2020/06/Gremlin_programming_language.png%3Fw%3D800%26ssl%3D1" alt=""&gt;&lt;/a&gt;Source: &lt;a href="https://en.wikipedia.org/wiki/Gremlin_(query_language)#/media/File:Gremlin_(programming_language).png"&gt;https://en.wikipedia.org/wiki/Gremlin_(query_language)#/media/File:Gremlin_(programming_language).png&lt;/a&gt;The landscape of graph databases is huge and complicated. Read &lt;a href="https://graphaware.com/graphaware/2020/02/17/graph-technology-landscape-2020.html"&gt;this post&lt;/a&gt; if you want to get a systematic overview of it. We have a very opinionated position on graph databases: we like open source and open standards, so we favor graph databases that support the &lt;a href="http://tinkerpop.apache.org/gremlin.html"&gt;Gremlin&lt;/a&gt; graph traversal machine and language. The Gremlin language supports host language embedding, so you can use it from your own language in a very idiomatic way.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--cAByrRnI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i2.wp.com/crowintelligence.org/wp-content/uploads/2020/06/Bechberger-GD-MEAP-HI.jpg%3Fw%3D800%26ssl%3D1" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--cAByrRnI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i2.wp.com/crowintelligence.org/wp-content/uploads/2020/06/Bechberger-GD-MEAP-HI.jpg%3Fw%3D800%26ssl%3D1" alt=""&gt;&lt;/a&gt;Source: &lt;a href="https://images.manning.com/book/b/7825565-46a5-4846-b899-a0dfb64e54bb/Bechberger-GD-MEAP-HI.png"&gt;https://images.manning.com/book/b/7825565-46a5-4846-b899-a0dfb64e54bb/Bechberger-GD-MEAP-HI.png&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you want to learn more about what graph databases offer, how to model your data and what kinds of queries can be run on such databases, read &lt;em&gt;Graph Databases in Action&lt;/em&gt; by Bechberger and Perryman – it’s freely available on &lt;a href="https://livebook.manning.com/book/graph-databases-in-action/welcome/v-9/"&gt;its website&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--DLelGuYX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i0.wp.com/crowintelligence.org/wp-content/uploads/2020/06/janusgraph.png%3Fw%3D800%26ssl%3D1" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--DLelGuYX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i0.wp.com/crowintelligence.org/wp-content/uploads/2020/06/janusgraph.png%3Fw%3D800%26ssl%3D1" alt=""&gt;&lt;/a&gt;Source: &lt;a href="https://janusgraph.org/img/janusgraph.png"&gt;https://janusgraph.org/img/janusgraph.png&lt;/a&gt;There are countless graph databases, but we especially love &lt;a href="https://janusgraph.org/"&gt;JanusGraph&lt;/a&gt;. It is 100% open source and, in our experience, it works fine, though it is not perfect.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--bRfc3jIW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i0.wp.com/crowintelligence.org/wp-content/uploads/2020/06/neo4j.png%3Ffit%3D800%252C322%26ssl%3D1" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--bRfc3jIW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i0.wp.com/crowintelligence.org/wp-content/uploads/2020/06/neo4j.png%3Ffit%3D800%252C322%26ssl%3D1" alt=""&gt;&lt;/a&gt;Source: &lt;a href="https://en.wikipedia.org/wiki/Neo4j#/media/File:Neo4j-2015-logo.png"&gt;https://en.wikipedia.org/wiki/Neo4j#/media/File:Neo4j-2015-logo.png&lt;/a&gt;&lt;a href="https://neo4j.com/"&gt;Neo4j&lt;/a&gt; is probably the most comprehensive and most advanced graph database, and it is widely used in the industry. We think it is superior to the others, but it is not fully open. Of course, you can use its community edition for learning and testing. It also supports Gremlin, so it is a good choice to work with.&lt;/p&gt;

&lt;h2&gt;
  
  
  Graph Frameworks – really it’s just Spark
&lt;/h2&gt;

&lt;p&gt;All graph processing frameworks build on &lt;a href="https://dl.acm.org/doi/pdf/10.1145/1807167.1807184"&gt;a paper from Google&lt;/a&gt; that describes its internal system for large-scale graph processing. The system is called Pregel, after the river of Königsberg, and yes, this is the river with those seven bridges.&lt;/p&gt;
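&lt;p&gt;The core idea of Pregel is vertex-centric computation in supersteps: each vertex sends messages along its edges, then updates its own value from the messages it received. The following is a drastically simplified, single-machine Python sketch of that model (the graph and parameter values are made up; a real Pregel job runs distributed across many workers), using PageRank as the classic example:&lt;/p&gt;

```python
def pregel_pagerank(out_edges, supersteps=20, damping=0.85):
    """Vertex-centric PageRank in the Pregel style: in every superstep each
    vertex emits its rank split over its out-edges, then recomputes its
    rank from the incoming messages."""
    n = len(out_edges)
    rank = {v: 1.0 / n for v in out_edges}
    for _ in range(supersteps):
        # message-passing phase: each vertex sends rank/outdegree to neighbors
        inbox = {v: 0.0 for v in out_edges}
        for v, targets in out_edges.items():
            share = rank[v] / max(len(targets), 1)
            for t in targets:
                inbox[t] += share
        # compute phase: each vertex updates its value from its inbox
        rank = {v: (1 - damping) / n + damping * inbox[v] for v in out_edges}
    return rank

# A tiny made-up graph: vertex mapped to the vertices it links to
edges = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pregel_pagerank(edges)
```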

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--pUVEwhxU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i2.wp.com/crowintelligence.org/wp-content/uploads/2020/06/Apache_Spark_logo.svg_.png%3Ffit%3D800%252C416%26ssl%3D1" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--pUVEwhxU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i2.wp.com/crowintelligence.org/wp-content/uploads/2020/06/Apache_Spark_logo.svg_.png%3Ffit%3D800%252C416%26ssl%3D1" alt=""&gt;&lt;/a&gt;Source: &lt;a href="https://upload.wikimedia.org/wikipedia/commons/f/f3/Apache_Spark_logo.svg"&gt;https://upload.wikimedia.org/wikipedia/commons/f/f3/Apache_Spark_logo.svg&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There are three major frameworks for graph processing: &lt;a href="https://hadoop.apache.org/"&gt;Apache Hadoop&lt;/a&gt;, &lt;a href="https://giraph.apache.org/"&gt;Apache Giraph&lt;/a&gt; and &lt;a href="https://spark.apache.org/"&gt;Apache Spark&lt;/a&gt;. Apache Giraph is the only one built solely for graph processing. Sadly, it is neither actively maintained nor well documented. Hadoop and Spark are big data analytics engines with graph processing capabilities. These days Spark seems to be the more popular of the two, at least among data scientists.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--E9fbgSB5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i0.wp.com/crowintelligence.org/wp-content/uploads/2020/06/graphx_logo.png%3Fw%3D800%26ssl%3D1" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--E9fbgSB5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i0.wp.com/crowintelligence.org/wp-content/uploads/2020/06/graphx_logo.png%3Fw%3D800%26ssl%3D1" alt=""&gt;&lt;/a&gt;Source: &lt;a href="https://spark.apache.org/docs/latest/img/graphx_logo.png"&gt;https://spark.apache.org/docs/latest/img/graphx_logo.png&lt;/a&gt;&lt;a href="https://spark.apache.org/graphx/"&gt;GraphX&lt;/a&gt; is the graph and parallel computing API of Spark. Although it is far from being a perfect tool, it is widely used in the industry, very robust, and well supported by documentation and a large user base.&lt;/p&gt;

&lt;h2&gt;
  
  
  OK, but how are these things used in NLP/ML?
&lt;/h2&gt;

&lt;p&gt;Deep learning is the sexiest thing on earth these days, but it needs lots of data. Google uses its Pregel system to feed its algorithms in a semi-supervised way. &lt;a href="https://arxiv.org/pdf/1512.01752.pdf"&gt;This paper&lt;/a&gt; explains how Pregel powers a kind of &lt;a href="https://crowintelligence.org/2020/03/27/what-if-you-need-more-labeled-data-label-spreading-and-propagation/"&gt;label spreading&lt;/a&gt; method to boost training data. Such a system was used to train the Smart Reply feature of Gmail, and it helped to improve Google’s sentiment analyzer.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--af2Svl6x--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i1.wp.com/crowintelligence.org/wp-content/uploads/2020/06/1_64AZ80NoAO8wH1RVGToSKg.png%3Ffit%3D800%252C271%26ssl%3D1" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--af2Svl6x--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i1.wp.com/crowintelligence.org/wp-content/uploads/2020/06/1_64AZ80NoAO8wH1RVGToSKg.png%3Ffit%3D800%252C271%26ssl%3D1" alt=""&gt;&lt;/a&gt;Source: &lt;a href="https://encrypted-tbn0.gstatic.com/images?q=tbn%3AANd9GcT46T0yTqDOiXRqTY3Fts9LRYwcBKIgAZ29UQ&amp;amp;usqp=CAU"&gt;https://encrypted-tbn0.gstatic.com/images?q=tbn%3AANd9GcT46T0yTqDOiXRqTY3Fts9LRYwcBKIgAZ29UQ&amp;amp;usqp=CAU&lt;/a&gt;Graph databases can be used for various tasks, but Knowledge Graphs are the most well-known examples. Historically, Google developed its Knowledge Graph service to enhance its search results with factual information on the basis of Freebase, a semantic database. Now the name of the service is a synonym for semantic databases. Building knowledge graphs is a very common NLP task in the industry. E.g. by using a named entity recognizer you can build a very simple one based on the co-occurrence of entities, or you can go a step further and use relation mining to determine the type of the connection between the co-occurring entities. Read &lt;a href="https://www.analyticsvidhya.com/blog/2019/10/how-to-build-knowledge-graph-text-using-spacy/"&gt;this post&lt;/a&gt; to see a simple example of building a knowledge graph from unstructured text. The knowledge graph is usually stored in a graph database. Graph analytics is used to enrich the data with centrality measures, cluster memberships, and other metrics. Graph analytics also helps to filter out unwanted data points.&lt;/p&gt;
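&lt;p&gt;The co-occurrence step mentioned above can be sketched in a few lines of pure Python. Here we assume a named entity recognizer has already been run, so the input is simply the list of entities found in each sentence; the entity names below are made up for illustration, and in practice the resulting weighted edges would be loaded into a graph database.&lt;/p&gt;

```python
from collections import Counter
from itertools import combinations

def cooccurrence_edges(entity_sentences):
    """Count how often each pair of entities appears in the same sentence.
    Each input item is the list of entities an NER system found in one
    sentence; the result maps sorted entity pairs to edge weights."""
    weights = Counter()
    for entities in entity_sentences:
        # deduplicate within a sentence, sort so (a, b) and (b, a) merge
        for a, b in combinations(sorted(set(entities)), 2):
            weights[(a, b)] += 1
    return weights

# Hypothetical NER output for three sentences
sentences = [
    ["Google", "Freebase"],
    ["Google", "Freebase", "Knowledge Graph"],
    ["Google", "Knowledge Graph"],
]
edges = cooccurrence_edges(sentences)
```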

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--LaEGtvEH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i0.wp.com/crowintelligence.org/wp-content/uploads/2020/06/OReilly-Graph-Algorithms_v2_ol1.jpg%3Fw%3D800%26ssl%3D1" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--LaEGtvEH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i0.wp.com/crowintelligence.org/wp-content/uploads/2020/06/OReilly-Graph-Algorithms_v2_ol1.jpg%3Fw%3D800%26ssl%3D1" alt=""&gt;&lt;/a&gt;Source: &lt;a href="https://dist.neo4j.com/wp-content/uploads/20190326120839/OReilly-Graph-Algorithms_v2_ol1.jpg"&gt;https://dist.neo4j.com/wp-content/uploads/20190326120839/OReilly-Graph-Algorithms_v2_ol1.jpg&lt;/a&gt;&lt;em&gt;Graph Algorithms: Practical Examples in Apache Spark and Neo4j&lt;/em&gt; by Needham and Hodler is full of great examples of using graph analytics and graph databases. You can download it for free after filling out a form &lt;a href="https://neo4j.com/graph-algorithms-book/"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;No one works alone in the real world. Data engineers tend to provide data scientists with the necessary infrastructure. So you don’t have to become an expert in graph databases and processing frameworks, but you should know enough to work with your peers and communicate with them.&lt;/p&gt;

&lt;h2&gt;
  
  
  What’s coming up next?
&lt;/h2&gt;

&lt;p&gt;If you are interested in this topic, we have good news. Alessandro Negro of GraphAware, author of &lt;a href="https://www.manning.com/books/graph-powered-machine-learning"&gt;Graph-Powered Machine Learning&lt;/a&gt;, will speak about &lt;em&gt;Using Knowledge Graphs to predict customer needs, improve product quality and save costs&lt;/em&gt; at our upcoming meetup. He will also present a demo, &lt;em&gt;Fighting corona virus with Knowledge Graph and Hume&lt;/em&gt;. &lt;a href="https://www.meetup.com/Hungarian-nlp/events/271201765/"&gt;Register here&lt;/a&gt; to attend the online event, or watch the recorded talk later on our &lt;a href="https://www.youtube.com/channel/UCPDpTte5_zC9IX-iv8UnNqA"&gt;YouTube channel&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In the third part of this blog series we will introduce the open source tools to visualize smallish and large graphs. Stay tuned!&lt;/p&gt;

&lt;h2&gt;
  
  
  Subscribe to our newsletter
&lt;/h2&gt;

&lt;p&gt;Get highlights on NLP, AI, and applied cognitive science straight into your inbox.&lt;/p&gt;


&lt;p&gt;&lt;a href="https://tinyletter.com"&gt;powered by TinyLetter&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="http://creativecommons.org/licenses/by-nc-sa/4.0/"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--XPRdnNRf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i2.wp.com/i.creativecommons.org/l/by-nc-sa/4.0/88x31.png%3Fw%3D800%26ssl%3D1" alt="Creative Commons License"&gt;&lt;/a&gt;&lt;br&gt;
This work is licensed under a &lt;a href="http://creativecommons.org/licenses/by-nc-sa/4.0/"&gt;Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>networkscience</category>
      <category>graphtheory</category>
      <category>spark</category>
      <category>janusgraph</category>
    </item>
    <item>
      <title>Graph Theory and Network Science for Natural Language Processing - Part 1 </title>
      <dc:creator>crowintelligence</dc:creator>
      <pubDate>Tue, 30 Jun 2020 10:18:48 +0000</pubDate>
      <link>https://dev.to/crowintelligence/graph-theory-and-network-science-for-natural-language-processing-part-1-32i0</link>
      <guid>https://dev.to/crowintelligence/graph-theory-and-network-science-for-natural-language-processing-part-1-32i0</guid>
      <description>&lt;p&gt;From keyword extraction to knowledge graphs, graph and network science offer a good framework to deal with natural language. We love using graph-based methods in our work, like &lt;a href="https://crowintelligence.org/2020/03/27/what-if-you-need-more-labeled-data-label-spreading-and-propagation/"&gt;generating more labeled data&lt;/a&gt;, &lt;a href="https://crowintelligence.org/2020/03/20/from-babbling-to-talking-visualizing-language-acquisition/"&gt;visualizing language acquisition&lt;/a&gt; and &lt;a href="https://crowintelligence.org/2020/04/03/the-marriage-of-artificial-intelligence-and-art/"&gt;shedding light on hidden biases in language&lt;/a&gt;. This series gives you tips on how to get started with graph and network theory, which Python tools to use, where to look for graph databases and how to visualize networks, finally we offer a few resources on Graph Neural Networks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Graph Theory and Network Science
&lt;/h2&gt;

&lt;p&gt;First of all, one might ask what the difference is between Graph Theory and Network Science. We argue that there is no sharp boundary between the two fields. It seems that NLP practitioners tend to prefer graphs to networks, while cognitive scientists and AI researchers tend to have the reverse preference. We’ll be sloppy and use the two terms interchangeably here. But for the sake of those who stick to the separation of the two fields, let’s see how Wikipedia defines them:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Network science is an academic field which studies complex networks such as telecommunication networks, computer networks, biological networks, cognitive and semantic networks, and social networks, considering distinct elements or actors represented by nodes (or vertices) and the connections between the elements or actors as links (or edges). “&lt;/p&gt;

&lt;p&gt;Wikipedia: &lt;a href="https://en.wikipedia.org/wiki/Network_science"&gt;Network Science&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;“In mathematics, graph theory is the study of &lt;em&gt;graphs&lt;/em&gt;, which are mathematical structures used to model pairwise relations between objects. “&lt;/p&gt;

&lt;p&gt;&lt;a href="https://en.wikipedia.org/wiki/Graph_theory"&gt;Wikipedia: Graph theory&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Graphs and networks are versatile fields on their own. Here we focus on the very basics of the theory behind them. For the practical parts, we only deal with resources available to Pythonistas.&lt;/p&gt;

&lt;h2&gt;
  
  
  Theoretical Background
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--LpwbRT9F--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i1.wp.com/crowintelligence.org/wp-content/uploads/2020/06/Truedeau-1.jpg%3Ffit%3D705%252C1024%26ssl%3D1" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--LpwbRT9F--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i1.wp.com/crowintelligence.org/wp-content/uploads/2020/06/Truedeau-1.jpg%3Ffit%3D705%252C1024%26ssl%3D1" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Richard J. Trudeau’s &lt;em&gt;Introduction to Graph Theory&lt;/em&gt; is a short, cheap, and accessible introduction into the field. It is a math classic from Dover, containing just enough material to get started with graphs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--BjTSozFE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i2.wp.com/crowintelligence.org/wp-content/uploads/2020/06/barabasi.png%3Fw%3D800%26ssl%3D1" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--BjTSozFE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i2.wp.com/crowintelligence.org/wp-content/uploads/2020/06/barabasi.png%3Fw%3D800%26ssl%3D1" alt=""&gt;&lt;/a&gt;Barabási, Network Science&lt;/p&gt;

&lt;p&gt;&lt;a href="http://networksciencebook.com/"&gt;&lt;em&gt;Network Science&lt;/em&gt;&lt;/a&gt; by Albert-László Barabási is a comprehensive, freely available textbook. It can be used as a reference work to look up the gritty nitty details of network theory from time to time. Don’t be scared by the long chapters of the book. To understand graph-based NLP, you don’t need the second half of it (from chapter 6).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--34-ud3Dw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i0.wp.com/crowintelligence.org/wp-content/uploads/2020/06/graphnlp.jpg%3Ffit%3D708%252C1024%26ssl%3D1" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--34-ud3Dw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i0.wp.com/crowintelligence.org/wp-content/uploads/2020/06/graphnlp.jpg%3Ffit%3D708%252C1024%26ssl%3D1" alt=""&gt;&lt;/a&gt;&lt;em&gt;Graph-Based Natural Language Processing and Information Retrieval&lt;/em&gt; by Mihalcea and Radev is a short (less than 190 pages) yet comprehensive book. The authors are top-tier researchers in their field; their TextRank algorithm is one of the best unsupervised algorithms for keyword extraction and extractive summarization. The book gives you a comprehensive overview of graph-based methods in NLP. You can use it as a textbook as well as a reference work. Its first and second chapters (which are devoted to Graph Theory and Graph Based Algorithms respectively) are not suitable for complete beginners. Instead, we recommend Trudeau’s and Barabási’s books to learn the basics of graph theory and network science. If you want to learn more about graph algorithms, read &lt;a href="https://crowintelligence.org/2020/03/09/so-you-wanna-learn-algorithms/"&gt;our post on resources to learn the basics of algorithms&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Python way
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--xGRKhYn3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i0.wp.com/crowintelligence.org/wp-content/uploads/2020/06/networkx_logo.jpg%3Fw%3D800%26ssl%3D1" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--xGRKhYn3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i0.wp.com/crowintelligence.org/wp-content/uploads/2020/06/networkx_logo.jpg%3Fw%3D800%26ssl%3D1" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Although there is a plethora of network packages, &lt;a href="https://networkx.github.io/"&gt;NetworkX&lt;/a&gt; stands out as one of the most comprehensive Python packages, and it has an active group of maintainers. It is awesome for small and medium-sized networks of up to about 100,000 nodes. Check out &lt;a href="https://www.timlrx.com/2020/05/10/benchmark-of-popular-graph-network-packages-v2/"&gt;this post&lt;/a&gt; benchmarking all major graph libraries to select the one that best suits your needs. Unless you have a very specific problem, we strongly recommend using NetworkX. If your network is too large, you should use a graph processing framework to analyze it. You’ll also need a graph database to store it and run analytic queries on it. We’ll cover these topics in the second part of this series. Now let’s turn back to Python tools.&lt;/p&gt;
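&lt;p&gt;To give a feel for NetworkX, here is a tiny sketch on a made-up word co-occurrence graph: build the graph, find the best-connected term via degree centrality, and run a shortest-path query. The node names are invented for illustration.&lt;/p&gt;

```python
import networkx as nx

# A toy word co-occurrence network (made-up data)
g = nx.Graph()
g.add_edges_from([
    ("graph", "theory"),
    ("graph", "network"),
    ("graph", "database"),
    ("network", "science"),
])

centrality = nx.degree_centrality(g)       # normalized degree for every node
hub = max(centrality, key=centrality.get)  # the best-connected term
path = nx.shortest_path(g, "theory", "science")
```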

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--0GYNNgfZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i1.wp.com/crowintelligence.org/wp-content/uploads/2020/06/sna.png%3Fw%3D800%26ssl%3D1" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--0GYNNgfZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i1.wp.com/crowintelligence.org/wp-content/uploads/2020/06/sna.png%3Fw%3D800%26ssl%3D1" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Social Network Analysis for Startups&lt;/em&gt; by Tsvetovat and Kouznetsov is a fantastic book despite its misleading title. It is a practical introduction to graph theory/network science and social &lt;a href="https://en.wikipedia.org/wiki/Social_network_analysis"&gt;network analysis&lt;/a&gt; using Python. The chapters follow each other in a logical manner, the examples are really good, and the explanations are superb. The only problem with this book is its age: having been published in 2011, it requires you to adapt the example code to present-day versions of Python, matplotlib, NetworkX and other tools.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--KC1udZCY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i1.wp.com/crowintelligence.org/wp-content/uploads/2020/06/51flQ-bF8L._SX415_BO1204203200_.jpg%3Fw%3D800%26ssl%3D1" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--KC1udZCY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i1.wp.com/crowintelligence.org/wp-content/uploads/2020/06/51flQ-bF8L._SX415_BO1204203200_.jpg%3Fw%3D800%26ssl%3D1" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Complex Network Analysis in Python&lt;/em&gt; by Zinoviev is a more recent title. It uses NetworkX to teach network science in a pragmatic manner. The first part deals with the basics, the second is devoted to classic explicit networks. The third and fourth parts are rare gems: they deal with creating networks based on co-occurrence and similarity, topics which are hardly found in other sources! The last part is devoted to directed networks, though sadly it contains only one chapter.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--MWJtS5-f--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i1.wp.com/crowintelligence.org/wp-content/uploads/2020/06/Negro-GP-MEAP-HI.jpg%3Fw%3D800%26ssl%3D1" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--MWJtS5-f--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i1.wp.com/crowintelligence.org/wp-content/uploads/2020/06/Negro-GP-MEAP-HI.jpg%3Fw%3D800%26ssl%3D1" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you are interested in graph-based methods in machine learning in general, &lt;em&gt;Graph-Powered Machine Learning&lt;/em&gt; by Alessandro Negro is the best resource to use. It is freely available &lt;a href="https://livebook.manning.com/book/graph-powered-machine-learning/welcome/v-4/"&gt;here&lt;/a&gt;. By the way, Alessandro will speak at our meetup soon. &lt;a href="https://www.meetup.com/Hungarian-nlp/events/271201765/"&gt;Register here&lt;/a&gt; to attend the online event, or you can watch the recorded talk later on our &lt;a href="https://www.youtube.com/channel/UCPDpTte5_zC9IX-iv8UnNqA"&gt;YouTube channel&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What’s coming up next?
&lt;/h2&gt;

&lt;p&gt;We hope you enjoyed our journey in the world of graphs and networks. In the next part, we will collect the best resources on graph analytics frameworks and graph databases. The third part will be devoted to visualization. Stay tuned!&lt;/p&gt;

&lt;h2&gt;
  
  
  Subscribe to our newsletter
&lt;/h2&gt;

&lt;p&gt;Get highlights on NLP, AI, and applied cognitive science straight into your inbox.&lt;/p&gt;


&lt;p&gt;&lt;a href="https://tinyletter.com"&gt;powered by TinyLetter&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="http://creativecommons.org/licenses/by-nc-sa/4.0/"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--XPRdnNRf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i2.wp.com/i.creativecommons.org/l/by-nc-sa/4.0/88x31.png%3Fw%3D800%26ssl%3D1" alt="Creative Commons License"&gt;&lt;/a&gt;&lt;br&gt;
This work is licensed under a &lt;a href="http://creativecommons.org/licenses/by-nc-sa/4.0/"&gt;Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>networkscience</category>
      <category>graphtheory</category>
      <category>python</category>
      <category>datascience</category>
    </item>
    <item>
      <title>How to fuel your data-driven business with text data? – Part 2, Strategies and Tools</title>
      <dc:creator>crowintelligence</dc:creator>
      <pubDate>Tue, 30 Jun 2020 10:16:37 +0000</pubDate>
      <link>https://dev.to/crowintelligence/how-to-fuel-your-data-driven-business-with-text-data-part-2-strategies-and-tools-3hb4</link>
      <guid>https://dev.to/crowintelligence/how-to-fuel-your-data-driven-business-with-text-data-part-2-strategies-and-tools-3hb4</guid>
      <description>&lt;p&gt;If data is the new oil, then getting and enriching data is like fracking and refining it, at least in the case of textual data. Our &lt;a href="https://crowintelligence.org/2020/06/11/how-to-fuel-your-data-driven-business-with-text-data/"&gt;previous post&lt;/a&gt; introduced the basic idea of data gathering and annotation. Now we help you with the strategies and tools you can employ to fuel your algorithms.&lt;/p&gt;

&lt;p&gt;Both data gathering and annotation are complex enterprises. Think carefully about who you trust to carry out these tasks and which tools you employ. Let’s see our tips on data gathering and annotation strategies and tools.&lt;/p&gt;

&lt;h2&gt;
  
  
  Data gathering options
&lt;/h2&gt;

&lt;h3&gt;
  
  
  In-house solution
&lt;/h3&gt;

&lt;p&gt;As we mentioned in the &lt;a href="https://crowintelligence.org/2020/06/11/how-to-fuel-your-data-driven-business-with-text-data/"&gt;first part &lt;/a&gt;of this series, data gathering should be thought of as a process. That’s why most of our clients want to build their in-house capabilities. This way they can be flexible and react very fast to changes in the requirements. If one goes for the in-house solution, there are plenty of tools to use. Our favorite one is &lt;a href="https://scrapy.org/"&gt;Scrapy&lt;/a&gt;, the lingua franca of scraping and crawling the web. It is a mature and well-maintained Python framework with excellent documentation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--jlG7G6sP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i0.wp.com/crowintelligence.org/wp-content/uploads/2020/05/scrapy.jpg%3Ffit%3D800%252C321%26ssl%3D1" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--jlG7G6sP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i0.wp.com/crowintelligence.org/wp-content/uploads/2020/05/scrapy.jpg%3Ffit%3D800%252C321%26ssl%3D1" alt="" width="800" height="321"&gt;&lt;/a&gt;Source: &lt;a href="https://miro.medium.com/max/1200/1*YJNS0JVl7RsVDTmORGZ6xA.png"&gt;https://miro.medium.com/max/1200/1*YJNS0JVl7RsVDTmORGZ6xA.png&lt;/a&gt;You can learn the basics of Scrapy and web scraping within a short time. A few minutes of googling will provide you with excellent tutorials. Our favorite resource is Mitchell’s &lt;em&gt;Web Scraping with Python&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--RDWXR_x7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i0.wp.com/crowintelligence.org/wp-content/uploads/2020/05/lrg.jpg%3Fw%3D800%26ssl%3D1" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--RDWXR_x7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i0.wp.com/crowintelligence.org/wp-content/uploads/2020/05/lrg.jpg%3Fw%3D800%26ssl%3D1" alt="" width="500" height="656"&gt;&lt;/a&gt;Source: &lt;a href="https://covers.oreillystatic.com/images/0636920078067/lrg.jpg"&gt;https://covers.oreillystatic.com/images/0636920078067/lrg.jpg&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If your company is not a Python shop and/or you are interested in other technologies, take a look at &lt;a href="http://nutch.apache.org/"&gt;Apache Nutch&lt;/a&gt; and &lt;a href="https://github.com/USCDataScience/sparkler"&gt;sparkler&lt;/a&gt;, which is a Spark-based crawler.&lt;/p&gt;

&lt;p&gt;No matter which tool you use, you’ll have to manage your scrapers and the infrastructure around them. Your DevOps team should be prepared for this! You can also go for cloud solutions, e.g. Scrapinghub’s Scrapy Cloud.&lt;/p&gt;

&lt;h3&gt;
  
  
  Outsourcing
&lt;/h3&gt;

&lt;p&gt;Web scraping and crawling seem like easy tasks. Google has been doing them for ages! That’s only partly true: Google can do it by employing an army of developers and running probably the largest hardware infrastructure in the world. We have learned from our own mistakes that scraping is not that simple. We’ve already discussed the problem of modern JavaScript frameworks and locked sites. There are sites that ban a particular IP address after a certain number of requests, so it is a good tactic to &lt;a href="https://www.scrapehero.com/how-to-rotate-proxies-and-ip-addresses-using-python-3/"&gt;rotate your IP address&lt;/a&gt;. Sites change constantly, so scrapers must be maintained if you need up-to-date data.&lt;/p&gt;
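&lt;p&gt;The linked article covers proxy rotation in detail; as a rough sketch, it boils down to cycling through a pool and attaching the next proxy to each request. The proxy addresses below are placeholders and the &lt;code&gt;fetch&lt;/code&gt; helper is hypothetical; it only shows where a real HTTP call would go.&lt;/p&gt;

```python
from itertools import cycle

# Hypothetical pool of proxy addresses; in practice these come from a
# paid proxy provider or your own fleet.
PROXIES = ["http://10.0.0.1:8080", "http://10.0.0.2:8080", "http://10.0.0.3:8080"]
_pool = cycle(PROXIES)

def next_proxy():
    """Return the next proxy in round-robin order."""
    return next(_pool)

def fetch(url):
    # Sketch only: with the `requests` library this would be
    # requests.get(url, proxies={"http": proxy, "https": proxy}).
    proxy = next_proxy()
    return f"GET {url} via {proxy}"

print(fetch("https://example.com/page1"))  # via 10.0.0.1
print(fetch("https://example.com/page2"))  # via 10.0.0.2
```

&lt;p&gt;Production setups also retire proxies that get banned and add per-proxy rate limiting, but the round-robin core stays the same.&lt;/p&gt;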

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--K0qDRh5f--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i2.wp.com/crowintelligence.org/wp-content/uploads/2020/06/Mistakes-to-avoid-when-hiring-freelancers-1.jpg%3Ffit%3D800%252C480%26ssl%3D1" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--K0qDRh5f--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i2.wp.com/crowintelligence.org/wp-content/uploads/2020/06/Mistakes-to-avoid-when-hiring-freelancers-1.jpg%3Ffit%3D800%252C480%26ssl%3D1" alt="" width="800" height="480"&gt;&lt;/a&gt;Source: &lt;a href="https://commons.wikimedia.org/wiki/File:Mistakes-to-avoid-when-hiring-freelancers.jpg"&gt;https://commons.wikimedia.org/wiki/File:Mistakes-to-avoid-when-hiring-freelancers.jpg&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you want to get your data from many sources, and you want to update it on a regular basis, you need to manage your scrapers. There are companies specialized in exactly such tasks! Scraping and crawling are highly specialized skills, and most companies don’t need employees with such skills all the time. Chances are high that the easiest way to collect your data is not to compete with these specialized companies for developers, but to become their client. Of course, there are plenty of firms offering similar solutions. To find the most suitable one, don’t forget that Google is your friend!&lt;/p&gt;

&lt;h3&gt;
  
  
  Crowdsourcing, employing an army of developers
&lt;/h3&gt;

&lt;p&gt;As another option, you can look for a specialist on big freelancer sites who can write or update a specific scraper for you. This is the crowdsourcing solution. By splitting up data gathering into small tasks, you can dramatically reduce your costs. You can group the sites into work packages, or treat one site as one job, and post them on freelancer sites. However, this option means more administrative work for you. You need to manage your contractors and constantly check the quality of their work. It also presupposes a robust architecture for managing and deploying the scrapers/crawlers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Annotation options
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The importance of annotation
&lt;/h3&gt;

&lt;p&gt;Why do we need annotation? The industry usually uses &lt;a href="https://machinelearningmastery.com/supervised-and-unsupervised-machine-learning-algorithms/"&gt;supervised&lt;/a&gt; algorithms, which need labeled or annotated data. Raw data has to be cleaned and labeled before it can fuel any training algorithm. When you read about &lt;a href="https://www.infoworld.com/article/3228245/the-80-20-data-science-dilemma.html"&gt;the 80% rule in data science&lt;/a&gt;, the articles usually tell you that 80% of each project is spent on collecting, cleaning and reshaping data. In the case of projects involving textual data, this underestimates the effort: in our experience, even more time is needed to get your data right and annotated. We would say that &lt;strong&gt;90-95% of the time&lt;/strong&gt; should be devoted to gathering, cleaning, transforming and annotating your data. Sometimes even more.&lt;/p&gt;

&lt;p&gt;Regarding textual data, annotation can be carried out at different levels. A label can be given either to the whole text (e.g. its genre, like criminal news), to each sentence (e.g. whether the sentence expresses positive or negative sentiment), or to words/phrases (e.g. &lt;a href="https://en.wikipedia.org/wiki/Named_entity"&gt;Named Entities&lt;/a&gt; like names of persons, firms, institutions, etc.). The more data you have, the better your chances are to build a good model on it.&lt;/p&gt;
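&lt;p&gt;The three levels differ mainly in what the label attaches to. Here is a sketch of how one sentence might carry all three kinds of annotation; the field names are illustrative, not any particular tool’s actual schema.&lt;/p&gt;

```python
# Illustrative record showing the three annotation levels for one text.
# Field names are made up for the sketch; real tools (doccano, Prodigy, ...)
# each have their own export schema.
record = {
    "text": "Apple opened a new store in Berlin.",
    # Document-level label, e.g. the genre of the whole text.
    "doc_label": "business_news",
    # Sentence-level label, e.g. sentiment of each sentence.
    "sentence_labels": [{"sentence": 0, "label": "neutral"}],
    # Span-level labels: character offsets plus an entity type.
    "spans": [
        {"start": 0, "end": 5, "label": "ORG"},    # "Apple"
        {"start": 28, "end": 34, "label": "LOC"},  # "Berlin"
    ],
}

# Span offsets can be validated against the text itself.
for span in record["spans"]:
    print(record["text"][span["start"]:span["end"]], span["label"])
```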

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---ishdU-p--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i1.wp.com/crowintelligence.org/wp-content/uploads/2020/06/zyxoairzm1z31.jpg%3Fw%3D800%26ssl%3D1" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---ishdU-p--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i1.wp.com/crowintelligence.org/wp-content/uploads/2020/06/zyxoairzm1z31.jpg%3Fw%3D800%26ssl%3D1" alt="" width="640" height="853"&gt;&lt;/a&gt;Heavily annotated text!&lt;br&gt;
Source: &lt;a href="https://www.reddit.com/r/step1/comments/dx6f8t/mistake_for_those_who_recently_started_preparing/"&gt;https://www.reddit.com/r/step1/comments/dx6f8t/mistake_for_those_who_recently_started_preparing/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Good annotation software makes it possible to upload texts in raw format, manage annotators, and define the annotation scheme, i.e. what kind of labels annotators can assign to texts or words. Annotators should be prepared for their tasks, which means they need some training and a guideline at hand during their work. It is a good quality assurance practice to have every item, or at least a certain percentage of the whole corpus, annotated by at least three annotators and measure their &lt;a href="https://en.wikipedia.org/wiki/Inter-rater_reliability"&gt;agreement&lt;/a&gt;. One can easily think that annotation is a tedious and very time consuming task – and it is! However, thanks to recent advances in the field of &lt;a href="https://www.datacamp.com/community/tutorials/active-learning"&gt;active learning&lt;/a&gt;, the costs and time horizon of annotation tasks can be dramatically reduced. (Read more on this topic in &lt;a href="https://crowintelligence.org/2020/06/04/active-learning-for-natural-language-processing/#more-1399"&gt;Robert Munro&lt;/a&gt;‘s book, &lt;a href="https://www.manning.com/books/human-in-the-loop-machine-learning"&gt;Human-in-the-Loop Machine Learning&lt;/a&gt;.) When planning your annotation strategy, keep all these issues in mind, no matter whether you build up an in-house solution, run your annotation tasks on crowdsourcing sites, or hire a specialist company.&lt;/p&gt;
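&lt;p&gt;For two annotators, agreement is commonly quantified with Cohen’s kappa: observed agreement corrected for the agreement expected by chance. The hand-rolled version below is a sketch; in practice you would reach for &lt;code&gt;sklearn.metrics.cohen_kappa_score&lt;/code&gt;, or a multi-annotator measure such as Krippendorff’s alpha for three or more annotators.&lt;/p&gt;

```python
from collections import Counter

def cohen_kappa(a, b):
    """Chance-corrected agreement between two annotators' label lists."""
    assert len(a) == len(b) and a
    n = len(a)
    # Observed agreement: fraction of items both annotators labeled the same.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement by chance, from each annotator's label distribution.
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[lab] * cb[lab] for lab in ca) / (n * n)
    if p_e == 1.0:  # both annotators used a single identical label
        return 1.0
    return (p_o - p_e) / (1 - p_e)

# Perfect agreement -> 1.0; perfect disagreement on balanced labels -> -1.0
print(cohen_kappa(["pos", "pos", "neg", "neg"], ["pos", "pos", "neg", "neg"]))  # 1.0
print(cohen_kappa(["pos", "neg", "pos", "neg"], ["neg", "pos", "neg", "pos"]))  # -1.0
```

&lt;p&gt;A kappa well below your target (conventions vary by task, but many teams aim above roughly 0.7) is a signal to revise the guideline or retrain the annotators before scaling up.&lt;/p&gt;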

&lt;h3&gt;
  
  
  In-house solution
&lt;/h3&gt;

&lt;p&gt;If you’d like to keep the data annotation task within your organization, you’ll need a good annotation tool. You can find free, open source tools like &lt;a href="https://github.com/doccano/doccano"&gt;doccano&lt;/a&gt;. It doesn’t support active learning out of the box, so integrating it with an active learning library is a good task for your Python developers. The creators of spaCy made Prodigy, an annotation tool that supports active learning. It’s not free, but it is reasonably priced.&lt;/p&gt;
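&lt;p&gt;doccano exports annotations as JSON Lines, and a small loader like the one below is typically the first piece of that glue code. The exact field names (&lt;code&gt;text&lt;/code&gt;, &lt;code&gt;labels&lt;/code&gt;) vary by doccano version and project type, so treat them as assumptions and adjust to your own export.&lt;/p&gt;

```python
import json

# Sample lines in the shape of a doccano text-classification export.
# Field names vary by doccano version/project type; adjust to your export.
export = """\
{"text": "Great product, would buy again.", "labels": ["positive"]}
{"text": "Arrived broken and late.", "labels": ["negative"]}
"""

def load_annotations(lines):
    """Parse a JSONL export into (text, labels) pairs."""
    pairs = []
    for line in lines.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        pairs.append((record["text"], record["labels"]))
    return pairs

data = load_annotations(export)
print(data[0])  # ('Great product, would buy again.', ['positive'])
```

&lt;p&gt;From here, feeding the pairs into a classifier and pushing the most uncertain unlabeled items back into doccano gives you a basic active learning loop.&lt;/p&gt;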

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--nuNhXFDC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://i2.wp.com/crowintelligence.org/wp-content/uploads/2020/05/doccano.gif%3Fw%3D800%26ssl%3D1" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--nuNhXFDC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://i2.wp.com/crowintelligence.org/wp-content/uploads/2020/05/doccano.gif%3Fw%3D800%26ssl%3D1" alt="" width="800" height="545"&gt;&lt;/a&gt;Source: &lt;a href="https://raw.githubusercontent.com/doccano/doccano/master/docs/images/demo/demo.gif"&gt;https://raw.githubusercontent.com/doccano/doccano/master/docs/images/demo/demo.gif&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now you have data and an annotation tool, so you are ready to plan your annotation task. Read &lt;em&gt;Natural Language Annotation for Machine Learning&lt;/em&gt; by Pustejovsky and Stubbs to learn more about it. Keep in mind, annotation is not a black art, but you need experience to plan and execute it correctly.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--taiMCSew--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i0.wp.com/crowintelligence.org/wp-content/uploads/2020/05/nallangannot.jpg%3Fw%3D800%26ssl%3D1" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--taiMCSew--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i0.wp.com/crowintelligence.org/wp-content/uploads/2020/05/nallangannot.jpg%3Fw%3D800%26ssl%3D1" alt="" width="383" height="499"&gt;&lt;/a&gt;Source: &lt;a href="https://images-na.ssl-images-amazon.com/images/I/51n62wukauL._SX381_BO1,204,203,200_.jpg"&gt;https://images-na.ssl-images-amazon.com/images/I/51n62wukauL._SX381_BO1,204,203,200_.jpg&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Crowdsourcing
&lt;/h3&gt;

&lt;p&gt;If building in-house competencies is not a viable option, it’s worth considering crowdsourcing. You still need someone who describes the tasks, manages the annotation process and takes care of quality issues, but you don’t have to deal much with the annotators themselves. Tools like Amazon’s Mechanical Turk allow you to slice tasks into small micro-tasks and present them to remote workers via a platform. You don’t have to deal with hiring workers and putting them on your payroll, since the crowdsourcing site manages these tasks. Usually, you can set some sort of experience threshold, so you can select among applicants on the basis of their expertise. It is good practice to provide workers with clear instructions and a trial task before accepting their application.&lt;/p&gt;

&lt;p&gt;Crowdsourcing can be extremely fast, and if it is done wisely, the results can be of good quality for a relatively low price. However, the more complex the task, the harder it is to find good workers. Crowdsourcing also raises ethical and methodological questions both for &lt;a href="https://blogs.lse.ac.uk/impactofsocialsciences/2017/04/05/crowdsourcing-raises-methodological-and-ethical-questions-for-academia/"&gt;academia&lt;/a&gt; and for the &lt;a href="https://www.zdnet.com/article/crowdsourcing-faces-ethical-legal-risks/"&gt;industry&lt;/a&gt;, and it can raise privacy issues too.&lt;/p&gt;

&lt;h3&gt;
  
  
  Outsourcing
&lt;/h3&gt;

&lt;p&gt;There are data annotation companies that offer solutions to the problems of crowdsourcing. Such companies employ (permanently or for a limited time) lots of annotators, so their people are well trained, precise and paid better than workers on crowdsourcing sites. They can also help in planning the annotation task. Moreover, such companies are aware of the legal environment, like GDPR. Completely outsourcing the annotation task to a company may seem expensive; however, sometimes it is the best way to get data. The market of such companies is huge, and it is relatively easy to find one: you can go for a global provider (Appen, for example) or look for local companies in your region.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---qdYee9X--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i2.wp.com/crowintelligence.org/wp-content/uploads/2020/06/Crowdsourcing.png%3Fw%3D800%26ssl%3D1" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---qdYee9X--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i2.wp.com/crowintelligence.org/wp-content/uploads/2020/06/Crowdsourcing.png%3Fw%3D800%26ssl%3D1" alt="" width="800" height="565"&gt;&lt;/a&gt;Source: &lt;a href="https://upload.wikimedia.org/wikipedia/commons/7/72/Crowdsourcing.png"&gt;https://upload.wikimedia.org/wikipedia/commons/7/72/Crowdsourcing.png&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Do you need help? Hire Us!
&lt;/h2&gt;

&lt;p&gt;Considering such options can be daunting. Don’t panic! Contact us, and we’ll help you to make the right decision so your algorithms will be fueled by the finest oil.&lt;/p&gt;

&lt;h2&gt;
  
  
  Sources
&lt;/h2&gt;

&lt;p&gt;The header image was downloaded from the following link: &lt;a href="https://www.flickr.com/photos/sfupamr/14601885300"&gt;https://www.flickr.com/photos/sfupamr/14601885300&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Subscribe to our newsletter
&lt;/h2&gt;

&lt;p&gt;Get highlights on NLP, AI, and applied cognitive science straight into your inbox.&lt;/p&gt;

&lt;p&gt;Enter your email address&lt;/p&gt;

&lt;p&gt;&lt;a href="https://tinyletter.com"&gt;powered by TinyLetter&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="http://creativecommons.org/licenses/by-nc-sa/4.0/"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--XPRdnNRf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i2.wp.com/i.creativecommons.org/l/by-nc-sa/4.0/88x31.png%3Fw%3D800%26ssl%3D1" alt="Creative Commons License" width="88" height="31"&gt;&lt;/a&gt;&lt;br&gt;
This work is licensed under a &lt;a href="http://creativecommons.org/licenses/by-nc-sa/4.0/"&gt;Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>naturallanguageprocessing</category>
      <category>machinelearning</category>
      <category>ai</category>
    </item>
    <item>
      <title>How to fuel your data-driven business with text data? – Part 1, Data gathering and annotation</title>
      <dc:creator>crowintelligence</dc:creator>
      <pubDate>Tue, 30 Jun 2020 10:14:37 +0000</pubDate>
      <link>https://dev.to/crowintelligence/how-to-fuel-your-data-driven-business-with-text-data-part-1-data-gathering-and-annotation-51d8</link>
      <guid>https://dev.to/crowintelligence/how-to-fuel-your-data-driven-business-with-text-data-part-1-data-gathering-and-annotation-51d8</guid>
      <description>&lt;p&gt;If data is the new oil, then getting and enriching your own data is like fracking and refining it, at least in the case of textual data. This post gives you an overall picture on how to think about gathering and labeling data. You also get some tips on what kind of business questions should be considered.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Data Science Hierarchy of Needs
&lt;/h2&gt;

&lt;p&gt;These days more and more people try to build a so-called &lt;a href="https://bradfordcross.com/vertical-ai-startups-solving-industry-specific-problems-by-combining-ai-and-subject-matter-expertise/"&gt;vertical AI startup/solution&lt;/a&gt;. These endeavors intend to solve industry-specific problems by combining AI and subject matter expertise. They have four distinct features: 1) they are full-stack products, 2) they rely on subject matter expertise, 3) they are built on top of a proprietary dataset, and 4) AI delivers the core value. Our experience suggests that the third point – getting the right proprietary dataset – is the hardest and most decisive factor in every data-driven project, whether it is an intra- or entrepreneurial endeavor.&lt;/p&gt;

&lt;p&gt;Most people take data for granted. We get news about the newest deep learning algorithms every day. We live in the era of big data. We hear (at least those who work in the tech field) about new machine learning/artificial intelligence startups every day. So it must be easy to get data!&lt;/p&gt;

&lt;p&gt;On the one hand, yes, there are awesome data repositories, like the &lt;a href="https://archive.ics.uci.edu/ml/index.php"&gt;UCI Machine Learning Repository&lt;/a&gt;. Governments are opening up and publishing their data via their own platforms or via tools like &lt;a href="https://ckan.org/"&gt;CKAN&lt;/a&gt;. But keep in mind, your competitors can access this data too!&lt;/p&gt;

&lt;p&gt;On the other hand, you have to get your own, domain-specific dataset, and annotate it to train your model(s)! Deep learning and other fancy ML algorithms are just the tip of the iceberg. There are plenty of things to do underneath. If you can’t get the underlying levels right, even the sexiest new deep learning algorithm will perform badly on your specific problem. Again, you can start by combining open datasets, but your competitors are doing the same thing. If you want to deliver real value that differs from your competitors’ (i.e. better or more precise), you have to build and annotate your own dataset. The popular data science hierarchy of needs pyramid should look as follows.&lt;/p&gt;

&lt;p&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---krA-nm9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i2.wp.com/crowintelligence.org/wp-content/uploads/2020/05/datascience_hierarch_of_needs-1.jpg%3Ffit%3D800%252C359%26ssl%3D1" alt="" width="800" height="359"&gt;Source of the original picture: &lt;a href="https://miro.medium.com/max/3760/1*jmk4Q2GAeUM_eqUtMh99oQ.png"&gt;https://miro.medium.com/max/3760/1*jmk4Q2GAeUM_eqUtMh99oQ.png&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Separate your tasks
&lt;/h2&gt;

&lt;p&gt;Harvesting and annotating data are two separate tasks done by two different groups. Data collection is often carried out by traditional software engineers or by the data infrastructure team, while annotation is often led (and sometimes even done) by Data Scientists/Analysts. A good product manager keeps his or her hands on the data and involves every stakeholder in the process. A PM should keep reminding everyone that getting and annotating data is a process, so you should constantly check the quality and scope of your raw and annotated data. The performance of the model you built using the data should also be monitored. You can use evaluation metrics and even user feedback to plan further data gathering and annotation tasks, which will help you build even better models.&lt;/p&gt;

&lt;p&gt;Before you consider various options to gather and label data, keep in mind that you should build your initial dataset AND a pipeline/process that will help you train better and better models. Choosing a solution at one phase doesn’t mean that you cannot move to another one at a later phase. But note that transitioning from outsourcing to in-house scraping and labeling can be hard and very costly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Your options
&lt;/h2&gt;

&lt;p&gt;In theory, you have an idea for a product, and you need a special-purpose dataset to train its magical AI part. Before you think over your options, you have to answer a few questions. What kind of data do you need in order to train a model? How can you get the data? Should you clean up the raw data before annotation? How much data should be annotated for the first model(s)? What does it mean to make a representative dataset in your case? You probably won’t get final answers at first, but don’t be afraid: a rough idea is enough initially.&lt;/p&gt;

&lt;p&gt;As a next step you should consider your options of data gathering and annotation, like&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;building in-house competency&lt;/li&gt;
&lt;li&gt;crowdsourcing&lt;/li&gt;
&lt;li&gt;outsourcing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Pp6wUKvn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i2.wp.com/crowintelligence.org/wp-content/uploads/2020/06/traffic-sign-3015228_960_720.png%3Fw%3D800%26ssl%3D1" alt="" width="720" height="720"&gt;Source: &lt;a href="https://cdn.pixabay.com/photo/2017/12/12/17/59/traffic-sign-3015228_960_720.png"&gt;https://cdn.pixabay.com/photo/2017/12/12/17/59/traffic-sign-3015228_960_720.png&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Your constraints
&lt;/h2&gt;

&lt;p&gt;You should know about your constraints like&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;budget&lt;/li&gt;
&lt;li&gt;time&lt;/li&gt;
&lt;li&gt;law&lt;/li&gt;
&lt;li&gt;ethics&lt;/li&gt;
&lt;li&gt;technology&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you know your data sources, check them! Are they plain text or HTML? If they are websites, do you have to log in to these sites? Do they use modern JavaScript frameworks, like React? Do these sites/texts contain sensitive information about humans? If you have to scrape a site, check its &lt;a href="https://support.google.com/webmasters/answer/6062608?hl=en"&gt;robots.txt&lt;/a&gt; to learn what the owners let you scrape! Different regions have different laws regulating the scraping and storing of publicly available data, and re-using data gained from scraping is often regulated by law too. Although it can be pretty expensive, ask your lawyer first!&lt;/p&gt;
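&lt;p&gt;Checking robots.txt can even be automated in your pipeline: Python’s standard library ships &lt;code&gt;urllib.robotparser&lt;/code&gt; for exactly this. The rules below are a made-up example; normally you would point &lt;code&gt;set_url&lt;/code&gt; at the live file and call &lt;code&gt;read()&lt;/code&gt; instead of parsing a string.&lt;/p&gt;

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt; in a real pipeline you would do
#   rp.set_url("https://example.com/robots.txt"); rp.read()
rules = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("*", "https://example.com/private/page"))  # False
print(rp.can_fetch("*", "https://example.com/blog/post"))     # True
```

&lt;p&gt;Note that robots.txt expresses the owner’s wishes, not the law; respecting it is necessary but not sufficient for staying compliant.&lt;/p&gt;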

&lt;p&gt;Keep in mind that if something is legal, it is not necessarily ethical. Your project should be legal AND ethical. It is hard to define what ethical means. Ideally, your colleagues follow the ethical regulations and guidelines published by professional bodies and governments in your region. If not, ask them to do so! Also, the team should agree that the goal of the project is in accordance with the members’ ethical norms. Scraping sites that require login is a shady part of the business. Imagine that your colleague thinks it is actually stealing data and harming the privacy of the users of that site. Will such a colleague build the best scraper for the task? Presumably, no. So, even if you have nothing against scraping data from certain sources, accept the fact that someone may think it is not acceptable, even if it is legal.&lt;/p&gt;

&lt;p&gt;Furthermore, getting data from the web is not as easy as it sounds. For example, sites built with modern JavaScript frameworks require a so-called pre-renderer, like Selenium, which drives a real browser so that the site actually renders its content before you scrape it.&lt;/p&gt;
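&lt;p&gt;A quick way to spot such sites: the raw HTML of a typical React app contains little more than an empty root element, with the content filled in by JavaScript afterwards. The heuristic below is a rough sketch on hardcoded stand-in pages; when it fires, plain HTTP fetching won’t do and you need Selenium or a similar browser-driving tool.&lt;/p&gt;

```python
import re

def looks_js_rendered(html):
    """Rough heuristic: an 'empty' page body suggests a client-rendered app."""
    # Strip scripts, then see whether any visible text remains.
    body = re.sub(r"(?s)<script.*?</script>", "", html)
    text = re.sub(r"<[^>]+>", " ", body)
    return len(text.split()) < 5  # almost no server-rendered text

# Stand-ins for fetched pages (real code would download these over HTTP).
react_app = '<html><body><div id="root"></div><script src="app.js"></script></body></html>'
static_page = "<html><body><h1>News</h1><p>Server-rendered article text goes here.</p></body></html>"

print(looks_js_rendered(react_app))    # True  -> needs a pre-renderer
print(looks_js_rendered(static_page))  # False -> plain HTTP fetch is enough
```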

&lt;p&gt;Last but not least, you have budgetary and time constraints too. The more ready-made a solution is, the more expensive it is, but usually the less time it requires to deliver the data. In-house solutions require hiring permanent and temporary workers. Finding the right people takes time. You can employ juniors who are willing to learn a new field, but again, this takes time. If you have enough money, start by outsourcing the tasks to reliable partners; later you can build up your own capabilities. If you are very short of money, bring data scraping in-house and crowdsource the annotation. Otherwise, read on and consider the tools and options you have.&lt;/p&gt;

&lt;p&gt;That’s all for now. If you’d like to learn more about tools used for data gathering and annotation, stay tuned. The second part of this series will come soon!&lt;/p&gt;

&lt;h2&gt;
  
  
  Hire us
&lt;/h2&gt;

&lt;p&gt;If you face any issues during data gathering and annotation, don’t hesitate to contact us at &lt;a href="mailto:crowintelligence@gmail.com"&gt;crowintelligence@gmail.com&lt;/a&gt;&lt;/p&gt;


&lt;p&gt;&lt;a href="http://creativecommons.org/licenses/by-nc-sa/4.0/"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--XPRdnNRf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i2.wp.com/i.creativecommons.org/l/by-nc-sa/4.0/88x31.png%3Fw%3D800%26ssl%3D1" alt="Creative Commons License" width="88" height="31"&gt;&lt;/a&gt;&lt;br&gt;
This work is licensed under a &lt;a href="http://creativecommons.org/licenses/by-nc-sa/4.0/"&gt;Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>naturallanguageprocessing</category>
      <category>machinelearning</category>
      <category>ai</category>
    </item>
    <item>
      <title>Spark NLP: State of the art natural language processing at scale</title>
      <dc:creator>crowintelligence</dc:creator>
      <pubDate>Tue, 30 Jun 2020 10:13:01 +0000</pubDate>
      <link>https://dev.to/crowintelligence/spark-nlp-state-of-the-art-natural-language-processing-at-scale-173g</link>
      <guid>https://dev.to/crowintelligence/spark-nlp-state-of-the-art-natural-language-processing-at-scale-173g</guid>
      <description>&lt;p&gt;Natural language processing is a key component in many data science systems that must understand or reason about text. Common use cases include question answering, paraphrasing or summarization, sentiment analysis, natural language BI, language modeling, and disambiguation. This talk introduces the Spark NLP library – the most widely used NLP library in the enterprise, thanks to implementing production-grade, trainable, and scalable versions of state-of-the-art deep learning &amp;amp; transfer learning NLP research, as a permissive open-source library backed by a highly active community and team.&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/WxPARvMtkK8"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;Spark NLP natively extends the Spark ML pipeline APIs, enabling zero-copy, distributed, unified pipelines that leverage all of Spark’s built-in optimizations. Benchmarks and design best practices for building NLP, ML and DL pipelines will be shared. The library implements core NLP algorithms including lemmatization, part of speech tagging, dependency parsing, named entity recognition, spell checking and sentiment detection. The talk will demonstrate using these algorithms to solve commonly used tasks, using Python notebooks that will be made publicly available after the talk.&lt;/p&gt;

&lt;p&gt;Bio: David Talby is a chief technology officer at John Snow Labs, helping fast-growing companies apply big data and data science techniques to solve real-world problems in healthcare &amp;amp; life science. Previously, he was with Microsoft, where he led business operations for Bing Shopping in the US and Europe, and before that at Amazon in Seattle and in the UK, where he built and ran distributed teams that helped scale global financial systems. David holds a PhD in computer science and master’s degrees in both computer science and business administration.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You can find David’s slides here &lt;a href="https://drive.google.com/file/d/1JY69DNcoBPkGlNTd2HyWvnUmDorkvH6f/view?usp=sharing"&gt;https://drive.google.com/file/d/1JY69DNcoBPkGlNTd2HyWvnUmDorkvH6f/view?usp=sharing&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Spark NLP homepage:
&lt;a href="https://nlp.johnsnowlabs.com/"&gt;https://nlp.johnsnowlabs.com/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Public notebooks about the open-source library:
&lt;a href="https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/tutorials/colab"&gt;https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/tutorials/colab&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="http://creativecommons.org/licenses/by-nc-sa/4.0/"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--XPRdnNRf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i2.wp.com/i.creativecommons.org/l/by-nc-sa/4.0/88x31.png%3Fw%3D800%26ssl%3D1" alt="Creative Commons License"&gt;&lt;/a&gt;&lt;br&gt;
This work is licensed under a &lt;a href="http://creativecommons.org/licenses/by-nc-sa/4.0/"&gt;Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>spark</category>
      <category>naturallanguageprocessing</category>
      <category>ai</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Active Learning for Natural Language Processing</title>
      <dc:creator>crowintelligence</dc:creator>
      <pubDate>Tue, 30 Jun 2020 10:10:17 +0000</pubDate>
      <link>https://dev.to/crowintelligence/active-learning-for-natural-language-processing-a0d</link>
      <guid>https://dev.to/crowintelligence/active-learning-for-natural-language-processing-a0d</guid>
      <description>&lt;p&gt;More than 90% of machine learning applications improve with human feedback. For example, a model that classifying news articles into pre-defined topics has been trained on 1000s of examples where humans have manually annotated the topics. However, if there are tens of millions of news articles, it might not be feasible to manually annotate even 1% of them. If we only sample randomly, we will mostly get popular topics like “politics” that the machine learning model can already identify accurately. So, we need to be smarter about how we sample. This talk is about “Active Learning”, the process of deciding what raw data is the most optimal for human review, covering: Uncertainty Sampling; Diversity Sampling; and some advanced methods like Active Transfer Learning.&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/13dyBvIAa1E"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;Robert Munro has worked as a leader at several Silicon Valley machine learning companies and also led AWS’s first Natural Language Processing and Machine Translation solutions. Robert is the author of Human-in-the-Loop Machine Learning, covering practical methods for Active Learning, Transfer Learning, and Annotation. Robert organizes Bay Area NLP, the world’s largest community of Language Technology professionals. Robert is also a disaster responder and is currently helping with the response to COVID-19.&lt;/p&gt;

&lt;p&gt;The slides are available on &lt;a href="https://docs.google.com/presentation/d/1lgZRRcJR2V0Ih8LDw4ezCIyqA8-dJBltwKIgF5rWV4M/edit?usp=sharing"&gt;this link&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="http://creativecommons.org/licenses/by-nc-sa/4.0/"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--XPRdnNRf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i2.wp.com/i.creativecommons.org/l/by-nc-sa/4.0/88x31.png%3Fw%3D800%26ssl%3D1" alt="Creative Commons License"&gt;&lt;/a&gt;&lt;br&gt;
This work is licensed under a &lt;a href="http://creativecommons.org/licenses/by-nc-sa/4.0/"&gt;Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>activelearning</category>
      <category>ai</category>
    </item>
    <item>
      <title>Growth Hacking with NLP and Sentiment Analysis </title>
      <dc:creator>crowintelligence</dc:creator>
      <pubDate>Tue, 02 Jun 2020 13:39:25 +0000</pubDate>
      <link>https://dev.to/crowintelligence/growth-hacking-with-nlp-and-sentiment-analysis-1d</link>
      <guid>https://dev.to/crowintelligence/growth-hacking-with-nlp-and-sentiment-analysis-1d</guid>
      <description>&lt;p&gt;We have spent the past months developing a course, &lt;em&gt;&lt;a href="https://www.manning.com/liveproject/growth-hacking-with-nlp-and-sentiment-analysis"&gt;Growth Hacking with NLP and Sentiment Analysis&lt;/a&gt;&lt;/em&gt;. We loved working with Manning, and now we are excited to start mentoring our students. Join us if you’d like to learn about applied sentiment analysis using Python and libraries like simpletransformers and scikit-learn.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--84gACC5P--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i0.wp.com/crowintelligence.org/wp-content/uploads/2020/06/screencapture-manning-liveproject-growth-hacking-with-nlp-and-sentiment-analysis-2020-06-02-09_45_06.png%3Ffit%3D800%252C670%26ssl%3D1" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--84gACC5P--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i0.wp.com/crowintelligence.org/wp-content/uploads/2020/06/screencapture-manning-liveproject-growth-hacking-with-nlp-and-sentiment-analysis-2020-06-02-09_45_06.png%3Ffit%3D800%252C670%26ssl%3D1" alt=""&gt;&lt;/a&gt;&lt;a href="http://creativecommons.org/licenses/by-nc-sa/4.0/"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--XPRdnNRf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i2.wp.com/i.creativecommons.org/l/by-nc-sa/4.0/88x31.png%3Fw%3D800%26ssl%3D1" alt="Creative Commons License"&gt;&lt;/a&gt;&lt;br&gt;
This work is licensed under a &lt;a href="http://creativecommons.org/licenses/by-nc-sa/4.0/"&gt;Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>naturallanguageprocessing</category>
      <category>sentimentanalysis</category>
      <category>python</category>
    </item>
    <item>
      <title>Corpus Linguistics - the theoretical minimum </title>
      <dc:creator>crowintelligence</dc:creator>
      <pubDate>Fri, 22 May 2020 11:49:23 +0000</pubDate>
      <link>https://dev.to/crowintelligence/corpus-linguistics-the-theoretical-minimum-38jo</link>
      <guid>https://dev.to/crowintelligence/corpus-linguistics-the-theoretical-minimum-38jo</guid>
      <description>&lt;p&gt;Corpus Linguistics is a neglected field of linguistics. Linguists tend to think that it cannot offer much beyond some methodological tools to support their ideas, yet they are quick to blame it when it contradicts their results. Corpus Linguistics is often considered the historic predecessor of Natural Language Processing from the pre-Big Data era. In this post, we claim that Corpus Linguistics offers a unique perspective on language and provides experts with a theoretical and practical framework for analyzing linguistic data. Keep reading for the best Corpus Linguistics resources!&lt;/p&gt;

&lt;h2&gt;
  
  
  The Corpus MOOC
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---EWUuyeb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i2.wp.com/crowintelligence.org/wp-content/uploads/2020/05/corpus_mooc.jpg%3Ffit%3D800%252C281%26ssl%3D1" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---EWUuyeb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i2.wp.com/crowintelligence.org/wp-content/uploads/2020/05/corpus_mooc.jpg%3Ffit%3D800%252C281%26ssl%3D1" alt=""&gt;&lt;/a&gt;Lancaster University is the epicenter of Corpus Linguistics and you can take their superb &lt;a href="https://www.futurelearn.com/courses/corpus-linguistics"&gt;Corpus Linguistics: Method, Analysis, Interpretation&lt;/a&gt; MOOC course on FutureLearn for free! This is the easiest way to get into Corpus Linguistics. It is strongly recommended even for professional NLP and text/content analysis experts, since it gives a different perspective on linguistic data than other disciplines do.&lt;/p&gt;

&lt;p&gt;Take a look at the &lt;a href="http://cass.lancs.ac.uk/"&gt;ESRC Centre for Corpus Approaches to Social Science (CASS)&lt;/a&gt; website to get an idea of how corpus methods can be applied to content analysis. If you are a student, consider applying to the &lt;a href="http://wp.lancs.ac.uk/corpussummerschools/"&gt;Lancaster Summer Schools in Corpus Linguistics&lt;/a&gt;, which have a reputation for giving students a fantastic experience.&lt;/p&gt;

&lt;h2&gt;
  
  
  Books on the theory and methodology of Corpus Linguistics
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--IfDnjqNL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i1.wp.com/crowintelligence.org/wp-content/uploads/2020/05/cl_book.jpg%3Fw%3D800%26ssl%3D1" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--IfDnjqNL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i1.wp.com/crowintelligence.org/wp-content/uploads/2020/05/cl_book.jpg%3Fw%3D800%26ssl%3D1" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Corpus Linguistics&lt;/em&gt; by Tony McEnery and Andrew Hardie is a perfect introduction to the field. OK, it is not the most exciting book on earth, because it has to deal with questions of data sources and ethics. It shines when it describes use cases in neo-Firthian/functional and cognitive linguistics – but don’t be afraid of those very technical terms! This is a textbook, so it explains everything that you need to know about the topics.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--bavN5BI_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i0.wp.com/crowintelligence.org/wp-content/uploads/2020/05/statforcl.jpg%3Fw%3D800%26ssl%3D1" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--bavN5BI_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i0.wp.com/crowintelligence.org/wp-content/uploads/2020/05/statforcl.jpg%3Fw%3D800%26ssl%3D1" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Oakes’s &lt;em&gt;Statistics for Corpus Linguistics&lt;/em&gt; is our favorite book in the field. We first used it as a textbook during our studies in the early 2000s, and we have often opened it as a reference ever since.&lt;/p&gt;

&lt;h2&gt;
  
  
  Software tools for the non-programmers
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--y-7A3FSy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i1.wp.com/crowintelligence.org/wp-content/uploads/2020/05/maxresdefault.jpg%3Ffit%3D800%252C450%26ssl%3D1" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--y-7A3FSy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i1.wp.com/crowintelligence.org/wp-content/uploads/2020/05/maxresdefault.jpg%3Ffit%3D800%252C450%26ssl%3D1" alt=""&gt;&lt;/a&gt;Source: &lt;a href="https://i.ytimg.com/vi/ryYKHbPQof8/maxresdefault.jpg"&gt;https://i.ytimg.com/vi/ryYKHbPQof8/maxresdefault.jpg&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For years, Laurence Anthony’s &lt;a href="https://www.laurenceanthony.net/software/antconc/"&gt;AntConc&lt;/a&gt; was the one and only free and comprehensive corpus analysis toolkit for non-programmers. The &lt;a href="https://www.youtube.com/user/AntlabJPN"&gt;accompanying YouTube tutorials&lt;/a&gt; are the best resources for learning how to use it in practice. We’ve been using AntConc for years now; although its user interface is spartan, we have learned to love it, since we haven’t found a better tool yet.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--XFV3xnaU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i1.wp.com/crowintelligence.org/wp-content/uploads/2020/05/lancbox.jpeg%3Fw%3D800%26ssl%3D1" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--XFV3xnaU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i1.wp.com/crowintelligence.org/wp-content/uploads/2020/05/lancbox.jpeg%3Fw%3D800%26ssl%3D1" alt=""&gt;&lt;/a&gt;Source: &lt;a href="https://img.scoop.it/t8KfHWF_eh_GfK8O-7kfojl72eJkfbmt4t8yenImKBVvK0kTmF0xjctABnaLJIm9"&gt;https://img.scoop.it/t8KfHWF_eh_GfK8O-7kfojl72eJkfbmt4t8yenImKBVvK0kTmF0xjctABnaLJIm9&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="http://corpora.lancs.ac.uk/lancsbox/"&gt;#LancsBox&lt;/a&gt;: Lancaster University corpus toolbox is “a new-generation software package for the analysis of language data and corpora developed at Lancaster University ” Developed by the best corpus linguistics research center, #LancsBox seems to be the heir apparent to AntConc. Its user interface is more user-friendly and its functionality is more versatile. We esp. love its collocation network visualization capabilities.&lt;/p&gt;

&lt;h2&gt;
  
  
  Practical Programming for Corpus Linguistics
&lt;/h2&gt;

&lt;p&gt;God knows why, but corpus linguists prefer the R programming language, so here we list the best resources for learning R and corpus linguistics hand in hand.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--UDBHu0PU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i2.wp.com/crowintelligence.org/wp-content/uploads/2020/05/baayen.jpg%3Ffit%3D721%252C1024%26ssl%3D1" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--UDBHu0PU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i2.wp.com/crowintelligence.org/wp-content/uploads/2020/05/baayen.jpg%3Ffit%3D721%252C1024%26ssl%3D1" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;R. Harald Baayen is one of the early pioneers of quantitative linguistics. His &lt;em&gt;Analyzing Linguistic Data&lt;/em&gt; is an excellent introduction to corpus/quantitative methods and to programming with R. The book came out in 2008 and shows its age now, so we don’t recommend it to complete beginners in R.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--HnMCBxBX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i2.wp.com/crowintelligence.org/wp-content/uploads/2020/05/gries01.jpg%3Fw%3D800%26ssl%3D1" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--HnMCBxBX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i2.wp.com/crowintelligence.org/wp-content/uploads/2020/05/gries01.jpg%3Fw%3D800%26ssl%3D1" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you read only one book on corpus linguistics, and you are not afraid of coding, Gries’s &lt;em&gt;Quantitative Corpus Linguistics with R&lt;/em&gt; should be that book. Gries is an exceptional teacher who wrote a pedagogically brilliant textbook. It helps you acquire the skills needed to analyze linguistic data in a step-by-step fashion, providing the reader with lucid explanations at every stage. Read our interview with Gries from 2010 &lt;a href="http://szamitogepesnyelveszet.blogspot.com/2010/11/on-computational-corpus-linguistics.html"&gt;on our previous blog&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--A5vzJsS9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i2.wp.com/crowintelligence.org/wp-content/uploads/2020/05/gries02.jpg%3Fw%3D800%26ssl%3D1" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--A5vzJsS9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i2.wp.com/crowintelligence.org/wp-content/uploads/2020/05/gries02.jpg%3Fw%3D800%26ssl%3D1" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Written in the same vein as &lt;em&gt;Quantitative Corpus Linguistics&lt;/em&gt;, &lt;em&gt;Statistics for Linguistics with R&lt;/em&gt; introduces the main statistical methods and their use in linguistics. Just like Baayen’s book, this one covers topics of corpus and quantitative linguistics. Although it is a masterpiece, we only recommend it to those who have a strong interest in linguistics.&lt;/p&gt;

&lt;h2&gt;
  
  
  Sources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;The header image was generated for the meetup on visualizing linguistic data. If you speak Hungarian, you can read more about it &lt;a href="https://www.nyest.hu/hirek/egy-kep-tobbet-mond-ezer-szonal"&gt;here&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Each book cover was downloaded from Amazon via Google Image Search.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>corpuslinguistics</category>
      <category>statistics</category>
      <category>rstats</category>
      <category>naturallanguageprocessing</category>
    </item>
    <item>
      <title>Getting started with Statistics</title>
      <dc:creator>crowintelligence</dc:creator>
      <pubDate>Thu, 07 May 2020 15:16:21 +0000</pubDate>
      <link>https://dev.to/crowintelligence/getting-started-with-statistics-2m8g</link>
      <guid>https://dev.to/crowintelligence/getting-started-with-statistics-2m8g</guid>
      <description>&lt;p&gt;&lt;em&gt;“I keep saying the sexy job in the next ten years will be statisticians. People think I’m joking, but who would’ve guessed that computer engineers would’ve been the sexy job of the 1990s?”&lt;/em&gt;, said Hal Varian, chief economist at Google, in 2009. These days machine learning and artificial intelligence are the sexiest fields, but their practitioners should be undercover statisticians. If you are looking for an intro to stats, this is a must-read post for you.&lt;/p&gt;

&lt;h2&gt;
  
  
  Warm-up readings
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi2.wp.com%2Fcrowintelligence.org%2Fwp-content%2Fuploads%2F2020%2F02%2FIMG_20200208_163342.jpg%3Fresize%3D800%252C600%26ssl%3D1" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi2.wp.com%2Fcrowintelligence.org%2Fwp-content%2Fuploads%2F2020%2F02%2FIMG_20200208_163342.jpg%3Fresize%3D800%252C600%26ssl%3D1"&gt;&lt;/a&gt;Charles Wheelan’s &lt;strong&gt;Naked Statistics&lt;/strong&gt; is the most entertaining book about statistics, which doesn’t use any equation, but explains the main concepts through real-world examples. It is absolutely beginner-friendly and provides you just with the first steps in your journey towards mastering statistics.&lt;/p&gt;

&lt;p&gt;David Salsburg’s &lt;strong&gt;The Lady Tasting Tea&lt;/strong&gt; is the best book on the history of statistics. Salsburg tells the story of the field’s development and of modern scientific thinking without using heavy math. If you study stats, you will soon come across the names of Pearson, Spearman and others. You’ll find out who Student was, why he developed his t-test, and how computers superseded statistical tables and calculations on paper.&lt;/p&gt;

&lt;h2&gt;
  
  
  Learning by doing
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi0.wp.com%2Fcrowintelligence.org%2Fwp-content%2Fuploads%2F2020%2F04%2Fhf_statistics-rotated.jpg%3Ffit%3D800%252C912%26ssl%3D1" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi0.wp.com%2Fcrowintelligence.org%2Fwp-content%2Fuploads%2F2020%2F04%2Fhf_statistics-rotated.jpg%3Ffit%3D800%252C912%26ssl%3D1"&gt;&lt;/a&gt;The Head First series by O’Reilly is using a unique approach to teaching that is based on the cognitive science of learning. This learning method involves lots of activities, pictures, and the explanation of the same concept several times from different angles. We love the series, especially &lt;strong&gt;Head First Statistics&lt;/strong&gt; by Dawn Griffiths. If you do the exercises of the book, and not just read it, you will have a solid foundation of the very basics of statistics.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi0.wp.com%2Fcrowintelligence.org%2Fwp-content%2Fuploads%2F2020%2F04%2Fhf_data_analysis.jpg%3Ffit%3D800%252C994%26ssl%3D1" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi0.wp.com%2Fcrowintelligence.org%2Fwp-content%2Fuploads%2F2020%2F04%2Fhf_data_analysis.jpg%3Ffit%3D800%252C994%26ssl%3D1"&gt;&lt;/a&gt;Do you want to get some experience of how data analysts work? Milton’s &lt;strong&gt;Head First Data Analysis&lt;/strong&gt; is the best resource for you! You’ll learn about how to use a spreadsheet to analyze data, how to clean messy real-world data, and how to put your statistical knowledge into practice.&lt;/p&gt;

&lt;h2&gt;
  
  
  Think Python and Stats
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi0.wp.com%2Fcrowintelligence.org%2Fwp-content%2Fuploads%2F2020%2F04%2Fthink_stats_comp.png%3Fw%3D800%26ssl%3D1" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi0.wp.com%2Fcrowintelligence.org%2Fwp-content%2Fuploads%2F2020%2F04%2Fthink_stats_comp.png%3Fw%3D800%26ssl%3D1"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Allen B. Downey publishes high quality open books on computer science, statistics and complexity. &lt;strong&gt;&lt;a href="https://greenteapress.com/thinkstats/" rel="noopener noreferrer"&gt;Think Stats&lt;/a&gt;&lt;/strong&gt; is an excellent book written for programmers. You can get the most from it if you’re a confident intermediate pythonista and you’ve already mastered the basics of statistics. Having worked through the book, you will be ready to use advanced statistical Python modules.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi1.wp.com%2Fcrowintelligence.org%2Fwp-content%2Fuploads%2F2020%2F04%2Fstatsmodel.jpg%3Fw%3D800%26ssl%3D1" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi1.wp.com%2Fcrowintelligence.org%2Fwp-content%2Fuploads%2F2020%2F04%2Fstatsmodel.jpg%3Fw%3D800%26ssl%3D1"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Although Python has a built-in statistics module, it is only convenient for the most basic tasks. If you are into classical statistics, the &lt;a href="https://www.statsmodels.org/" rel="noopener noreferrer"&gt;statsmodels&lt;/a&gt; module is made for you.&lt;/p&gt;
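&lt;p&gt;For a sense of where that convenience ends: the built-in module handles descriptive basics in a couple of lines, and not much more.&lt;/p&gt;

```python
# The standard-library statistics module covers descriptive basics;
# regression, hypothesis testing, etc. are statsmodels territory.
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]

print(statistics.mean(data))    # 5
print(statistics.median(data))  # 4.5
print(statistics.pstdev(data))  # population standard deviation: 2.0
```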

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi0.wp.com%2Fcrowintelligence.org%2Fwp-content%2Fuploads%2F2020%2F04%2Flogo.png%3Fw%3D800%26ssl%3D1" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi0.wp.com%2Fcrowintelligence.org%2Fwp-content%2Fuploads%2F2020%2F04%2Flogo.png%3Fw%3D800%26ssl%3D1"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi2.wp.com%2Fcrowintelligence.org%2Fwp-content%2Fuploads%2F2020%2F04%2Fscikit-learn-logo-small.png%3Fw%3D800%26ssl%3D1" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi2.wp.com%2Fcrowintelligence.org%2Fwp-content%2Fuploads%2F2020%2F04%2Fscikit-learn-logo-small.png%3Fw%3D800%26ssl%3D1"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.scipy.org/" rel="noopener noreferrer"&gt;SciPy&lt;/a&gt; and &lt;a href="https://scikit-learn.org/stable/" rel="noopener noreferrer"&gt;scikit-learn&lt;/a&gt; provides you a plethora of statistical and machine learning algorithms.&lt;/p&gt;

&lt;h2&gt;
  
  
  Advanced topics
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://crowintelligence.org/2020/02/28/math-for-machine-learning-and-artificial-intelligence/" rel="noopener noreferrer"&gt;Math for Machine Learning and Artificial Intelligence&lt;/a&gt;&lt;/strong&gt; : in our previous post, we gave you some advice on learning higher math for ML, AI, and Statistics&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://crowintelligence.org/2020/02/21/getting-started-with-sql/" rel="noopener noreferrer"&gt;Getting Started with SQL&lt;/a&gt;&lt;/strong&gt; : if you are serious about data analysis, you should learn the basics of (relational) databases. You can learn from our post where to start your journey.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Sources
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;The header image was downloaded from xkcd. Its source can be found &lt;a href="https://xkcd.com/552/" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;The Think Stats cover image was downloaded from &lt;a href="https://greenteapress.com/thinkstats/think_stats_comp.png" rel="noopener noreferrer"&gt;this link&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="http://creativecommons.org/licenses/by-nc-sa/4.0/" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi2.wp.com%2Fi.creativecommons.org%2Fl%2Fby-nc-sa%2F4.0%2F88x31.png%3Fw%3D800%26ssl%3D1" alt="Creative Commons License"&gt;&lt;/a&gt;&lt;br&gt;
This work is licensed under a &lt;a href="http://creativecommons.org/licenses/by-nc-sa/4.0/" rel="noopener noreferrer"&gt;Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>python</category>
      <category>statistics</category>
      <category>machinelearning</category>
      <category>ai</category>
    </item>
    <item>
      <title>How to treat your robot?</title>
      <dc:creator>crowintelligence</dc:creator>
      <pubDate>Thu, 07 May 2020 15:14:01 +0000</pubDate>
      <link>https://dev.to/crowintelligence/how-to-treat-your-robot-15p8</link>
      <guid>https://dev.to/crowintelligence/how-to-treat-your-robot-15p8</guid>
      <description>
&lt;p&gt;Children and women had no rights for a long time in human history. Universal suffrage and women’s rights were unimaginable for centuries before the modern era. These days, most developed countries protect the rights of animals and the (living) environment to some extent. Technological development raises the question of whether we should give rights to machines. Should we stop beating our robots?&lt;/p&gt;

&lt;p&gt;Is it possible that robots and creatures with artificial intelligence will acquire rights? Will overworked carmaker robots establish their union one day? Shall abolitionists help sex robots? Whom do we blame when a robot harms a worker in a factory? Will an artificial intelligence go to jail, and if so, will it be incarcerated with human inmates? These questions seem impractical and like science fiction these days, but remember that children’s, women’s and animal rights were once not topics of public discourse either. So let’s look, one by one, at the features that make something a subject of moral consideration.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Moral agency and patiency&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The morality of machine intelligence can be approached from two distinct directions. 1) When we wonder whether machines can be held liable for their acts, can set their own goals, and are capable of deliberate and conscious actions, we raise issues of moral agency. 2) When we speculate about whether machines can be used as sex toys or can be beaten, we inquire whether they are mere artifacts or entities that we should take care of. This consideration is called moral patiency.&lt;/p&gt;

&lt;p&gt;As the title of this post suggests, we deal with moral patiency in detail, but this does not mean that moral agency is excluded from our argumentation, since we assume that moral agency entails patiency. More precisely, patiency is a necessary condition of agency. We don’t argue for this position here, but take it as a premise.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Sentience and patiency&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;As part of the Cartesian tradition, Western culture thought of animals as machines till the 1970s. The treatment of animals has radically changed since then, thanks to activists and books like &lt;em&gt;Animal Liberation&lt;/em&gt; by Peter Singer. According to Singer, one can be the subject of moral considerations if it is a sentient being, or to put it simply, if it can suffer. If we accept this point of view, we have to examine if machines and artificial minds can be sentient beings.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--WBdF9CeD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i2.wp.com/crowintelligence.org/wp-content/uploads/2020/04/animalliberation-rotated.jpg%3Ffit%3D768%252C1024%26ssl%3D1" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--WBdF9CeD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i2.wp.com/crowintelligence.org/wp-content/uploads/2020/04/animalliberation-rotated.jpg%3Ffit%3D768%252C1024%26ssl%3D1" alt="" width="768" height="1024"&gt;&lt;/a&gt;No robot or artificial intelligence can feel anything. At the moment, the technology is far from producing a sentient machine. However there are lots of projects aiming to develop some sort of digital or robotic companion. The most well-known ones are chatbots for customer relations, chatbots for therapeutic use, supporting robots for the elderly, and robots as sex toys, just to mention a few. These projects don’t aim to build a fully autonomous general artificial intelligence, but to create reliable and useful tools that can be used in social interactions.&lt;/p&gt;

&lt;p&gt;Human-Computer Interaction researchers illustrate the companion machines of the future with the analogy of working and companion dogs. Guide dogs are generally very smart and are trained to excel at helping humans move freely; in this way, they are similar to companion machines. Moreover, &lt;a href="https://www.sciencedirect.com/science/article/pii/S0747563217306234"&gt;this study&lt;/a&gt; from the &lt;a href="https://familydogproject.elte.hu/"&gt;Family Dog Project&lt;/a&gt; argues that qualities of companion dogs, such as faithfulness, kindness, and smartness, should be implemented in companion robots to help humans accept machines. In this way, we may ascribe similar attributes, feelings, and emotional states to robots as to dogs.&lt;/p&gt;

&lt;p&gt;The projection of these qualities raises an important question. If machines exhibit feelings and emotions, must they actually be in an emotional state? Or, to take it even further, can robots be in an emotional state identical to that of humans?&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The problem of other minds&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Maybe it sounds like a stupid question at first, but why do we attribute sentience to animals? Is it just another form of anthropomorphism, or do they really have feelings? And how does one know that another human being shares the same feelings and emotions? On what basis can one attribute a rational mind to others? Philosophers of mind call this “the problem of other minds”.&lt;/p&gt;

&lt;p&gt;According to Wittgenstein, this is a linguistic question. If I hit my finger while driving a nail, I cry out loud and say “awwwww!”, because I learned this behavior from my environment. My parents and all the adults around me did the same when I was a child, so I learned to do it too. I learned what to say when I feel terrible physical pain, just as I learned to say “Hello” to my neighbors when I meet them. All these things constitute a language game, or a way of life, and they are social by their very nature. I cannot feel pain without expressing it. I cannot feel anything if I cannot name it. Hence language is a precondition of other minds. This is how Wittgenstein’s argument goes. Consequently, the condition of emotional states and mental activity is speaking.&lt;/p&gt;

&lt;p&gt;Our everyday experience contradicts the view described above. We do attribute mental and emotional states to animals, although they cannot speak. We even speak about physical objects as if they were persons, e.g. “Why does my computer not want to work?” Philosophers call this the “intentional strategy”, which is a funny name for attributing mental states. If something behaves like an intentional agent, the best way to deal with it is to assume that it really is intentional.&lt;/p&gt;

&lt;p&gt;But what can we know about the mental states of other creatures? Can we imagine what it is like to be a bat? More precisely, can we put ourselves in the place of a bat? What would it be like to navigate using only our ears? Some philosophers of mind think that echolocation cannot be imagined and we cannot know what it is like to be a bat, since being a bat or being a human comes with different qualia, i.e. distinct ways of perceiving and experiencing the world around us.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--eSUdsKay--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i0.wp.com/crowintelligence.org/wp-content/uploads/2020/04/milyen-lehet-denevernek-lenni.jpg%3Fw%3D800%26ssl%3D1" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--eSUdsKay--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i0.wp.com/crowintelligence.org/wp-content/uploads/2020/04/milyen-lehet-denevernek-lenni.jpg%3Fw%3D800%26ssl%3D1" alt="" width="410" height="244"&gt;&lt;/a&gt;Source: Hans Holbein / Wikimedia Commons / Public Domain&lt;/p&gt;

&lt;p&gt;If we want to attribute sentience to animals and machines, we need something more than the intentional strategy. We have to identify similar behavioral patterns that animals share and we have to find their physiological structure. Some behavioral patterns are produced by very similar physiological structures, while others are not, but are functionally very similar. If a behavioral pattern can be “implemented” by various organic structures, it can be implemented by inorganic ones as well. Using the philosophers’ terminology, if functionalism works, we can build sentient machines.&lt;/p&gt;

&lt;p&gt;One of the first lessons of robotics came from phenomenology and cognitive science. The minds of autonomous biological agents do not end at their skulls. Humans and animals have bodies, and they sense the world through their organs. Also, they do not just passively navigate their environment, but actively use it to extend their minds for various tasks; for example, they use landmarks for navigation. So human and animal cognition is embodied and extended at the same time. These embodied and extended minds created the abstract space of morality, or more exactly, they are constantly creating morality.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Rights and obligations&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Although there is still much to do, animal rights are established and codified in almost all developed countries around the globe. The most common argument for laws protecting the rights of animals is that animals, or at least vertebrates, are sentient beings. Although rights are granted to animals, they are not exercised by them. In the case of minors and animals, it is the caretaker and the public who act on their behalf and exercise their rights. Also, animals are aware neither of their rights nor of the moral consequences of their acts.&lt;/p&gt;

&lt;p&gt;Let’s study the case of a dog which bit a postman. No one would blame the dog for its act, but its owner would be in big trouble. On the one hand, he’d be charged with causing harm to the postman, on the other hand with treating his dog badly, which might have caused its aggressive behavior. But who should be blamed when an intelligent machine does harm? Its owner, its manufacturer or the programmer who trained it? What shall we do with such a machine? Can we simply switch it off or would it count as an execution?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--wU3Dhq5C--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i2.wp.com/crowintelligence.org/wp-content/uploads/2020/04/actroid-der-feltetelezhetoen-meg-nem-erez-semmit.jpg%3Fw%3D800%26ssl%3D1" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--wU3Dhq5C--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i2.wp.com/crowintelligence.org/wp-content/uploads/2020/04/actroid-der-feltetelezhetoen-meg-nem-erez-semmit.jpg%3Fw%3D800%26ssl%3D1" alt="" width="307" height="410"&gt;&lt;/a&gt;Source: Wikimedia Commons / Gnsin – Gnsin / CC BY-SA 3.0&lt;/p&gt;

&lt;p&gt;During the course of history, minors, women and minorities were treated as sentient beings with limited rationality. As a result, they were deprived of the rights that adults, mostly privileged and rich men, had. They also had specific obligations, e.g. to follow the orders of the head of the household, who was usually an adult man, and they were subjected to the orders of those above them in the societal hierarchy.&lt;/p&gt;

&lt;p&gt;Machines have no obligations, since they are not living beings, but they are built to handle and execute various tasks. If you hire a gardener, she has an obligation to trim your lawn, but the lawnmower has no obligation, although it was built to trim the lawn. Likewise, horses and companion dogs have no obligations, but they are kept for various tasks by their owners. If a lawnmower doesn’t work, its owner can throw it away. If a horse is sick, or it doesn’t want to jump over fences all day, its owner cannot simply throw it away. How about a sentient machine? What if a sex robot becomes sentient one day and has negative feelings when someone uses it? What if fashion changes and its model becomes outdated? Can its owner throw it away?&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;It’s not about the future, it’s about the present of humanity&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;If you happen to think the issues we raised are neither reasonable nor practical, it’s high time to shed light on why philosophizing about beating robots matters. When we consider the moral acceptability of beating a robot, we are thinking not only about the moral status of robots, but about that of ourselves. What kind of traits do we want to cultivate in ourselves?&lt;/p&gt;

&lt;p&gt;The questions of ethics are perennial, although there are no exact, timeless answers to them. The recent surge of Artificial Intelligence has made us chew over these problems again and again, as technology is evolving rapidly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Sources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Peter Singer: Animal Liberation, Harper Perennial Modern Classics, 2009&lt;/li&gt;
&lt;li&gt;Ludwig Wittgenstein: Philosophical Investigations, John Wiley and Sons, 2016&lt;/li&gt;
&lt;li&gt;Thomas Nagel: What Is It Like to Be a Bat? In: Thomas Nagel: Mortal Questions, Cambridge University Press, 2003&lt;/li&gt;
&lt;li&gt;Paul M. Churchland: Matter and Consciousness, MIT Press, 1998&lt;/li&gt;
&lt;li&gt;Hursthouse, Rosalind and Pettigrove, Glen, “Virtue Ethics”, The Stanford Encyclopedia of Philosophy (Winter 2018 Edition), Edward N. Zalta (ed.), &lt;a href="https://plato.stanford.edu/archives/win2018/entries/ethics-virtue/"&gt;https://plato.stanford.edu/archives/win2018/entries/ethics-virtue/&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="http://creativecommons.org/licenses/by-nc-sa/4.0/"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--XPRdnNRf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i2.wp.com/i.creativecommons.org/l/by-nc-sa/4.0/88x31.png%3Fw%3D800%26ssl%3D1" alt="Creative Commons License" width="88" height="31"&gt;&lt;/a&gt;&lt;br&gt;
This work is licensed under a &lt;a href="http://creativecommons.org/licenses/by-nc-sa/4.0/"&gt;Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>ethics</category>
      <category>philosophy</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Getting started with SQL </title>
      <dc:creator>crowintelligence</dc:creator>
      <pubDate>Tue, 05 May 2020 17:18:46 +0000</pubDate>
      <link>https://dev.to/crowintelligence/getting-started-with-sql-5af6</link>
      <guid>https://dev.to/crowintelligence/getting-started-with-sql-5af6</guid>
      <description>&lt;p&gt;description: SQL and databases are among the most needed data science skills, it is #3 right afrer Python and R according to this empirical study. However, the need for a database isn’t obvious for the beginner programmer at first.At some point the aspiring data scientist will grow out the world of csvs and plain text files. Using databases becomes handy, when someone starts building Rest APIs or one has to connect to a remote SQL server full of gigabytes of valuable data. Here are our tips to get started with SQL and how to use it the Pythonic way.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://crowintelligence.org/2020/02/21/getting-started-with-sql/"&gt;canonical_url&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;SQL and databases are among the most needed data science skills. According to a recent &lt;a href="https://www.kaggle.com/discdiver/the-most-in-demand-skills-for-data-scientists"&gt;study&lt;/a&gt;, SQL is the third most demanded skill, right after Python and R. Surprisingly, a beginner programmer can happily live without a database for a long time. However, at some point the aspiring data scientist will outgrow the world of CSVs and plain text files. Databases come in handy when you start building REST APIs or have to connect to a remote SQL server full of gigabytes of valuable data. Here are our tips on getting started with SQL and using it in a Pythonic way.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Which SQL implementation should I use?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;SQL is a standard; its latest revision came out in 2016. There are many closed and open source vendors who built their own implementations of the standard. Each extends it with its own flavor, but the differences are minor (at least for a beginner). We encourage you to use &lt;a href="https://mariadb.org/"&gt;MariaDB&lt;/a&gt;, unless you have a good reason to ignore it (e.g. your company uses MySQL at work, or you are learning about databases with Postgres at school).&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Should I install it?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Absolutely not, you shouldn’t install it on your computer! Use the official Docker image of your preferred SQL implementation. If you don’t use Docker yet, invest some time into learning its basics. &lt;a href="https://www.tutorialspoint.com/docker/index.htm"&gt;This tutorial&lt;/a&gt; helps you install Docker and start a container on your machine (the first twelve lessons, up to “Docker – Containers and Shells”, are enough at first). Don’t simply start your Docker image; attach a volume to it, since this is the way to preserve (i.e. save) your databases. Your effort turns out to be a bonus, as knowing some Docker is a very valuable data science skill!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--f_pwe7nQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i1.wp.com/crowintelligence.org/wp-content/uploads/2020/02/1920px-Docker_container_engine_logo.svg_.png%3Ffit%3D800%252C190%26ssl%3D1" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--f_pwe7nQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i1.wp.com/crowintelligence.org/wp-content/uploads/2020/02/1920px-Docker_container_engine_logo.svg_.png%3Ffit%3D800%252C190%26ssl%3D1" alt=""&gt;&lt;/a&gt;We strongly recommend you to start the &lt;a href="https://en.wikipedia.org/wiki/PhpMyAdmin"&gt;phpMyAdmin&lt;/a&gt;, the free administration tool for SQL, Docker image along with your SQL implementation. phpMyAdmin provides a simple and intuitive interface to manage your databases and execute various SQL statements.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--aS_mgnr6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i2.wp.com/crowintelligence.org/wp-content/uploads/2020/02/PhpMyAdmin_logo.png%3Fw%3D800%26ssl%3D1" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--aS_mgnr6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i2.wp.com/crowintelligence.org/wp-content/uploads/2020/02/PhpMyAdmin_logo.png%3Fw%3D800%26ssl%3D1" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://hackernoon.com/mariadb-phpmyadmin-docker-running-local-database-ok9q36ji"&gt;This short tutorial&lt;/a&gt; helps you to set up MariaDB and phpMyAdmin and persisting your databases using a &lt;a href="https://docs.docker.com/compose/"&gt;docker-compose&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;The pythonic way&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;SQL is a kind of programming language (actually, a so-called non-procedural programming language) and it is very different from Python. The easiest way to start using SQL in your Python projects is the pymysql package, which lets you easily connect to your database. On top of that, you can write SQL statements as simple strings and pass them to a function that sends them to the database engine for execution.&lt;/p&gt;
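&lt;p&gt;A minimal sketch of that pattern: pymysql implements Python’s DB-API 2.0, so the example below, written with the standard-library sqlite3 module (also DB-API 2.0) so it runs without a database server, carries over to pymysql almost unchanged. With pymysql you would call pymysql.connect(host=..., user=..., ...) with your own credentials and use %s instead of ? as the placeholder.&lt;/p&gt;

```python
# Hypothetical sketch: SQL statements as plain strings handed to execute().
# sqlite3 stands in for pymysql here so the snippet is self-contained.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
cur.executemany("INSERT INTO users (name) VALUES (?)", [("Alice",), ("Bob",)])
conn.commit()

cur.execute("SELECT name FROM users ORDER BY name")
names = [row[0] for row in cur.fetchall()]
print(names)  # ['Alice', 'Bob']
conn.close()
```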


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
&lt;br&gt;
&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--QdQT3OUb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i0.wp.com/crowintelligence.org/wp-content/uploads/2020/02/sqlalchemy-logo.png%3Fw%3D800%26ssl%3D1" alt=""&gt;

&lt;p&gt;Using string variables to store your SQL statements isn’t pythonic. Although you can use f-strings to substitute your Python variables into your expressions, this method becomes very tedious, especially when you are working with complex statements. SQLAlchemy is the de facto standard way to use SQL in Python programs. It comes in two flavors, namely Core and ORM (which stands for object-relational mapping). ORM is very advanced, so chances are high that you won’t need it as a data scientist. Core provides you with the ability to use SQL statements as methods, so you can even chain them together. Also, you can use strings as SQL statements, aka “textual SQL”. Using SQLAlchemy Core makes your code more pythonic and readable, which means more maintainable code. To soften the switch from pymysql to SQLAlchemy, you can start with Core’s textual SQL and gradually transition to Core objects and their methods. &lt;a href="https://docs.sqlalchemy.org/en/13/core/tutorial.html"&gt;This part&lt;/a&gt; of the official documentation of the toolkit is a pretty nice intro to using Core.&lt;/p&gt;
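&lt;p&gt;A minimal SQLAlchemy Core sketch (assuming SQLAlchemy 1.4 or newer) showing both styles, Core objects and textual SQL. It uses an in-memory SQLite engine so it runs anywhere; for MariaDB you would swap the URL for something like "mysql+pymysql://user:password@host/dbname" (a placeholder, not a real connection string).&lt;/p&gt;

```python
# SQLAlchemy Core: define a table, insert rows, then query it twice,
# once with chained Core methods and once with "textual SQL".
from sqlalchemy import (
    Column, Integer, MetaData, String, Table, create_engine, select, text,
)

engine = create_engine("sqlite://")  # in-memory database
metadata = MetaData()
users = Table(
    "users", metadata,
    Column("id", Integer, primary_key=True),
    Column("name", String(50)),
)
metadata.create_all(engine)

with engine.begin() as conn:  # begin() commits automatically on success
    conn.execute(users.insert(), [{"name": "Alice"}, {"name": "Bob"}])

with engine.connect() as conn:
    # Core objects and methods, chained together...
    core_rows = conn.execute(select(users.c.name).order_by(users.c.name)).fetchall()
    # ...and the same query as textual SQL
    text_rows = conn.execute(text("SELECT name FROM users ORDER BY name")).fetchall()

names = [row[0] for row in core_rows]
print(names)  # ['Alice', 'Bob']
```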


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;h3&gt;
  
  
  &lt;strong&gt;DataFrames and SQL tables – How to integrate all this into your workflow?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;You can easily make a pandas DataFrame from an SQL table and vice versa. &lt;a href="https://hackersandslackers.com/connecting-pandas-to-a-sql-database-with-sqlalchemy/"&gt;This short tutorial&lt;/a&gt; shows you how easy it is to achieve this.&lt;/p&gt;
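&lt;p&gt;A short sketch of that round trip. The table name "results" and the data are made-up examples, and sqlite3 stands in for a real server; with MariaDB you would pass an SQLAlchemy engine instead of the sqlite3 connection.&lt;/p&gt;

```python
# pandas round trip: DataFrame -> SQL table -> DataFrame.
import sqlite3

import pandas as pd

conn = sqlite3.connect(":memory:")
df = pd.DataFrame({"name": ["Alice", "Bob"], "score": [85, 92]})
df.to_sql("results", conn, index=False)  # DataFrame -> SQL table

back = pd.read_sql_query(  # SQL table -> DataFrame
    "SELECT name, score FROM results ORDER BY score", conn
)
print(back["name"].tolist())  # ['Alice', 'Bob']
conn.close()
```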

&lt;h3&gt;
  
  
  &lt;strong&gt;Resources&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Although there are plenty of tutorials on the net, and we linked some of them in this post, we strongly recommend the following two books.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="http://shop.oreilly.com/product/9780596007270.do"&gt;Learning SQL, 2nd edition&lt;/a&gt;&lt;/strong&gt; by Alan Beaulieu: This title is a short, practice oriented intro into SQL. It is language and implementation agnostic and despite its age it is superb.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="http://shop.oreilly.com/product/0636920035800.do"&gt;Essential SQLAlchemy, 2nd edition&lt;/a&gt;&lt;/strong&gt; by Myers and Copeland: SQLAlchemy has got an extensive and very usable documentation, but it lacks user-friendly tutorials. This book is the only comprehensive intro into SQLAlchemy, as per our best knowledge.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Image sources&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Header image &lt;a href="https://cdn.pixabay.com/photo/2017/06/12/04/21/database-2394312_960_720.jpg"&gt;https://cdn.pixabay.com/photo/2017/06/12/04/21/database-2394312_960_720.jpg&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;phpMyAdmin logo &lt;a href="https://upload.wikimedia.org/wikipedia/commons/thumb/4/4f/PhpMyAdmin_logo.svg/115px-PhpMyAdmin_logo.svg.png"&gt;https://upload.wikimedia.org/wikipedia/commons/thumb/4/4f/PhpMyAdmin_logo.svg/115px-PhpMyAdmin_logo.svg.png&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;MariaDB logo &lt;a href="https://upload.wikimedia.org/wikipedia/commons/c/c9/MariaDB_Logo.png"&gt;https://upload.wikimedia.org/wikipedia/commons/c/c9/MariaDB_Logo.png&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;SQLAlchemy logo &lt;a href="https://quintagroup.com/cms/python/images/sqlalchemy-logo.png"&gt;https://quintagroup.com/cms/python/images/sqlalchemy-logo.png&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="http://creativecommons.org/licenses/by-nc-sa/4.0/"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--XPRdnNRf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i2.wp.com/i.creativecommons.org/l/by-nc-sa/4.0/88x31.png%3Fw%3D800%26ssl%3D1" alt="Creative Commons License"&gt;&lt;/a&gt;&lt;br&gt;
This work is licensed under a &lt;a href="http://creativecommons.org/licenses/by-nc-sa/4.0/"&gt;Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>python</category>
      <category>ai</category>
      <category>unsupervised</category>
    </item>
  </channel>
</rss>
