IST718 Master Syllabus
NOTE TO INSTRUCTORS: To maintain consistency among class sections of this course, all syllabi should contain this information, cover the schedule of topics, and follow the guidelines herein.
A broad introduction to analytical processing tools and techniques for information professionals. Students will develop a portfolio of resources, demonstrations, recipes, and examples of various analytical techniques.
Detailed Course Description
Upon the successful completion of this course, you will be able to:
- Translate a business challenge into an analytics challenge;
- Deploy a structured lifecycle approach to data science and big data analytics projects;
- Analyze big data, create statistical models, and identify insights that can lead to actionable results;
- Use software tools such as R, SQL, Python and the Hadoop stack in database analytics;
- Select visualization techniques and communicate analytic insights to business sponsors and others;
- Explain how advanced analytics can be leveraged to create competitive advantage;
- Define and distinguish a data scientist from a traditional business intelligence analyst.
Prerequisite Knowledge required
Students taking this course should be familiar with command-line interfaces and should possess basic quantitative skills, including elementary statistics, and basic programming skills in SQL and either R or Python.
- HDP Analyst: Data Science Student Guide Rev 4
This can be purchased from https://marketplace.mimeo.com/StudentMaterials with a major credit card for $15. Make sure to purchase the student guide and not the lab guide.
- An Introduction to Statistical Learning with Applications in R (ISLR) by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani [PDF]
- Deep Learning (DL) by Ian Goodfellow, Yoshua Bengio, and Aaron Courville [Book]
- There will be other online readings specific to each unit.
Methods of Evaluation
NOTE TO INSTRUCTORS: It is important to mix several methods of evaluation, such as individual practice, group work, and assessment. The following table should be used as a guideline for weighting each activity:
| Assessment | Examples of Activity | At Least | No More Than |
|---|---|---|---|
| Individual Homework | Labs, Homework, Papers, Problem Sets, Discussion, Programming Exercises | 20% | 50% |
| Individual Assessment | Exams, Tests, Quizzes | 20% | 50% |
| Group Activities | Group Projects, Group Papers, Group Homework | 20% | 30% |
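The minimums in the table add up to 60% and the maximums to 130%, so each section has room to choose its own split. As a minimal sketch, an instructor could check a proposed weighting against the guideline as follows. The function name `check_weights` and the example weights are hypothetical, and the requirement that weights total exactly 100% is an assumption not stated in the table.

```python
# Bounds copied from the guideline table: (at least, no more than), in percent.
BOUNDS = {
    "Individual Homework": (20, 50),
    "Individual Assessment": (20, 50),
    "Group Activities": (20, 30),
}


def check_weights(weights):
    """Return a list of problems; an empty list means the scheme fits the guideline."""
    problems = []
    for category, (low, high) in BOUNDS.items():
        weight = weights.get(category, 0)
        if not low <= weight <= high:
            problems.append(f"{category}: {weight}% is outside {low}%-{high}%")
    # Assumption: the category weights should account for the whole grade.
    total = sum(weights.values())
    if total != 100:
        problems.append(f"weights sum to {total}%, not 100%")
    return problems


# Hypothetical section weighting, for illustration only.
example = {
    "Individual Homework": 40,
    "Individual Assessment": 35,
    "Group Activities": 25,
}
print(check_weights(example))  # prints []
```

A weighting such as 60% homework, 20% assessment, and 20% group work would be flagged, because homework exceeds its 50% ceiling.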
Topics to be covered
NOTE TO INSTRUCTORS: At a minimum, the following topics should be covered in the course. Full course preparations are provided for these topics:
This course will revolve around three use cases: sentiment analysis, prediction with Random Forests, and object recognition with Deep Learning.
Students will first learn to program in a big data analytics environment with Hadoop and Apache Spark. Then they will learn some fundamental machine learning concepts.
The following is the outline:
- Part I: Introduction
- Unit 1: Linux environment
- Unit 2: Python programming
- Part II: Machine learning and artificial intelligence fundamentals
- Unit 3: A statistical perspective on learning
- Unit 4: Assessing model accuracy
- Part III: Big data analytics environments
- Unit 5: Hadoop
- Unit 6: HDFS, MapReduce and YARN
- Unit 7: Hive, Pig, and HCatalog
- Unit 8: In-memory analytics with Apache Spark + Python
- Part IV: Use cases
- Unit 9: Sentiment analysis on Twitter
- Unit 10: Predicting X with Random Forests
- Unit 11: Object recognition with Deep Learning
- Unit 12: Object recognition with Deep Learning (2)
- The machine learning component will be based on Chapter 2 of ISLR, and Random Forests will be based on Chapter 8 of ISLR.
- The introduction to probability will be based on I.3 of DL, and the Deep Learning component will be based on I.2, I.5, 4.3, and II.6 of DL.
Individual instructors have latitude to cover their own topics of interest, ideally using no more than two weeks of the course. Some suggestions:
- Big Data in the Cloud
- Machine learning in the Cloud
- Spark Streaming
- Spark GraphX
- Future trends in the Hadoop Ecosystem
- Kaggle Competitions
- Operationalizing a big data analytics project
- Deep learning on Hadoop