A Talk on Natural Language Processing and the Data Science Job Market

PART 1: Natural Language Processing Short Talk

Natural Language Processing (NLP) is technology that helps computers understand human languages. It is one of the largest branches of Artificial Intelligence, and it spans a broad range of approaches because voice and text-based data are so diverse.

Two important functions of NLP are sentiment analysis and text categorization.

Figure 1: Two important functions of NLP

Sentiment analysis identifies the mood or subjective opinions expressed in large collections of text. It is useful for:

1. Customer satisfaction: given data on customers, e.g. customer reviews, sentiment analysis identifies the moods and opinions of the customers.

2. Credibility of news

3. Concept/Entity Extraction

Text categorization is a linguistics-based way of summarizing documents; its uses include search and indexing, content alerts, and duplication detection. Classification can be manual or automatic. In manual classification, a human annotator interprets the context of the text and categorizes it accordingly. Automatic classification applies ML, NLP, and other techniques to classify text faster and more cost-effectively.

Other approaches in NLP include topic discovery modeling, contextual extraction, speech-to-text and text-to-speech translation, and document summarization:

a. Topic discovery modeling

  • Accurately captures the meaning and themes in text collections, and applies optimization and forecasting

b. Contextual extraction

  • Automatically pulls structured information from text-based sources.

c. Speech-to-text and text-to-speech translation

  • Transforms voice commands into written text and vice versa

d. Document Summarization

  • Automatically generates synopses of large bodies of text. Relation modeling links related entities, e.g. Toronto and the Blue Jays, NYC and the Yankees

Figure 2: Broad range of functions of NLP
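As a small illustration of contextual extraction (item b above), the sketch below pulls structured fields out of free text with regular expressions. The sample sentence, the patterns, and the `extract_fields` helper are all hypothetical and were not part of the talk; production systems usually rely on trained entity recognizers rather than hand-written patterns.

```python
import re

# Hypothetical patterns for illustration only; not taken from the talk.
DATE_PATTERN = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")
EMAIL_PATTERN = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
AMOUNT_PATTERN = re.compile(r"\$\d+(?:\.\d{2})?")


def extract_fields(text):
    """Return the dates, e-mail addresses, and dollar amounts found in text."""
    return {
        "dates": DATE_PATTERN.findall(text),
        "emails": EMAIL_PATTERN.findall(text),
        "amounts": AMOUNT_PATTERN.findall(text),
    }


# Made-up sentence standing in for an unstructured text source.
sample = "On 2020-03-14 jane.doe@example.com paid $25.00 for the workshop."
fields = extract_fields(sample)
print(fields)
```

The result is a dictionary of lists, which is the kind of structured record that can then be stored or queried like any other tabular data.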

One example of NLP in use is the classification of customer reviews about drugs. Reviews are rated on a scale from bad to good (1 being bad and 9 being good), and a binary classifier labels each review as either good or bad.

Using the following test cases:

1. “It is great”, “I like it”

2. “It sucks”

The word-frequency data from these test cases is fed into a logistic regression model, which outputs either 0 or 1 to indicate whether the test case is a good review or a bad review.
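The flow described above can be sketched in a few lines of plain Python. This is a minimal sketch under several assumptions that the talk did not confirm: the frequency data is taken to be simple bag-of-words counts, 1 is taken to mean a good review and 0 a bad one, the training sentences are made up, and the model is fitted with hand-written gradient descent rather than a library implementation.

```python
import math
from collections import Counter

# Toy training data (assumed encoding: 1 = good review, 0 = bad review).
# These sentences and labels are invented for illustration.
train_texts = ["it is great", "i like it", "great results",
               "it sucks", "this sucks", "i hate it"]
train_labels = [1, 1, 1, 0, 0, 0]


def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


# Vocabulary: every distinct word in the training texts, in sorted order.
vocab = sorted({word for text in train_texts for word in tokenize(text)})


def to_counts(text):
    """Turn a text into a vector of word frequencies over the vocabulary."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocab]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(features, labels, learning_rate=0.5, epochs=500):
    """Fit logistic regression weights with stochastic gradient descent."""
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            prediction = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
            error = prediction - y  # gradient of the log loss w.r.t. the logit
            weights = [w - learning_rate * error * xi for w, xi in zip(weights, x)]
            bias -= learning_rate * error
    return weights, bias


def predict(text, weights, bias):
    """Return 1 if the predicted probability of a good review is at least 0.5."""
    x = to_counts(text)
    probability = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
    return 1 if probability >= 0.5 else 0


weights, bias = train([to_counts(t) for t in train_texts], train_labels)

for review in ["It is great", "I like it", "It sucks"]:
    print(review, "->", predict(review, weights, bias))
```

Words that never appear in the training texts are simply ignored by `to_counts`, so a review made up entirely of unseen words is classified from the bias term alone. That is one reason a realistic model needs far more training data than this toy set.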

The limitation of this method is that text cleaning takes up most of the time, because much of the text data is irrelevant or unintelligible.
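To give a sense of what text cleaning involves, here is a minimal sketch of a cleaning step. The specific steps (lowercasing, stripping HTML-like tags and punctuation, dropping stop words), the `clean_text` helper, and the tiny stop-word list are assumptions chosen for illustration; the talk did not say which cleaning steps were used.

```python
import re
import string

# A tiny, hypothetical stop-word list; real projects usually use a larger one.
STOP_WORDS = {"a", "an", "the", "is", "it", "and", "of", "to"}


def clean_text(text):
    """Lowercase, strip HTML-like tags and punctuation, and drop stop words."""
    text = text.lower()
    text = re.sub(r"<[^>]+>", " ", text)  # replace HTML-like tags with a space
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()
    return [token for token in tokens if token not in STOP_WORDS]


# Made-up review text standing in for a raw, uncleaned record.
raw_review = "<p>It is GREAT, and the side effects were mild!!!</p>"
cleaned = clean_text(raw_review)
print(cleaned)
```

Even this short function encodes several judgment calls, such as which words count as noise, and that is part of why cleaning tends to dominate the time spent on a project like this.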

Logistic regression is a simple model for classifying reviews; more complex models such as Random Forest or a Recurrent Neural Network can be more accurate.

PART 2: Career Talk by Lisa from Mango Tech

The daily tasks of a data scientist include customer clustering, market campaign prediction, and building recommendation systems.

The daily tasks of a data analyst in finance include time series analysis, stock price forecasting, credit score prediction, and fraud detection.

General functions of jobs related to data science:

Data scientists face the challenge of communicating technical, domain-specific topics to consultants and/or customers. Clients view the processes as a “black box”, and this knowledge gap makes it challenging to deliver exactly what the client needs.

The learning paths for careers as a data analyst, data scientist, research scientist, data engineer, or machine-learning engineer consist of:

1. Statistics/Math foundations

2. Machine learning/Deep learning

3. Data wrangling, cleaning and visualization

  a. Relevant to the data scientist and data engineer

4. Business acumen/case study

  a. Relevant to the data analyst

5. Big data ecosystem, data streaming

  a. Relevant to the data engineer

6. Data pipeline, DevOps, automation

7. Deployment

University of Toronto Machine Intelligence Team

Written by University of Toronto Machine Intelligence Team

UTMIST’s Technical Writing Team publishes articles on topics within machine learning to our official publication: https://medium.com/demistify
