Assignment

  1. By using a case-study, describe the stages in a Data Science process highlighting the core importance of each stage in the process. List two top Data Science programming languages and discuss their advantages and disadvantages possibly using real-life scenarios.

Understanding the business problem, data gathering, data processing, data analysis, data visualization, data cleaning, and creating and testing a model.

(1) Julia

Advantages:

1. Fast

Julia was designed from the beginning for high performance. Julia programs compile to efficient native code for multiple platforms via LLVM.

2. Dynamic

Julia is dynamically typed, feels like a scripting language, and has good support for interactive use.

3. Reproducible

Reproducible environments make it possible to recreate the same Julia environment every time, across platforms, with pre-built binaries.

4. Composable

Julia uses multiple dispatch as a paradigm, making it easy to express many object-oriented and functional programming patterns. The talk on the Unreasonable Effectiveness of Multiple Dispatch explains why it works so well.

5. General

Julia provides asynchronous I/O, metaprogramming, debugging, logging, profiling, a package manager, and more. One can build entire Applications and Microservices in Julia.

6. Open source

Julia is an open source project with over 1,000 contributors. It is made available under the MIT license. The source code is available on GitHub.

Disadvantages:

1. Package downloads

On some networks (for example, from behind a firewall), many packages can only be downloaded through a VPN or proxy. Once downloaded, they may still fail to install, and when installation succeeds, compilation can take a long time.

2. The cost of learning the syntax

Learning the syntax rules takes considerable effort.

3. Performance requires skill

Achieving the advertised high performance requires learning many additional techniques.

4. Outdated example code

Code found online for older versions of Julia often no longer runs, and examples written for current versions are scarce.

(2) Python

Advantages:

1. Simple syntax

Compared with traditional languages such as C/C++, Java, and C#, Python has less strict requirements for code format. This looseness lets users write code comfortably without spending much energy on details.

2. Python is open source

Open source means that all users can see the source code.

Python's openness shows in two ways:

① Code written by programmers in Python is open source.

For example, suppose we develop a BBS (bulletin board) system in Python and put it online for users to download. What users download is the full source code of the system, which they can modify at will. This is a characteristic of interpreted languages: to run the program, you need the source code.

② Python interpreters and modules are open source.

The Python interpreter and module code are officially open source in the hope that all Python users will help improve Python's performance and fix its vulnerabilities. The more widely code is studied, the more robust it becomes.

There are always people who keep strengthening and improving Python, whether out of indifference to fame and fortune or in pursuit of a longer-term goal. Not everyone is motivated by immediate gain: some play a long game, and some enthusiasts simply enjoy building interesting things.

3. Python is free

Open source does not mean free of charge. Open source software and free software are two different concepts, although most open source software is also free. Python is both open source and free.

Users can develop and publish their own programs in Python without paying any fees or worrying about copyright. Python is free even for commercial purposes.

4. Python is a high-level language

"High-level" here means that Python is heavily abstracted and hides many low-level details. For example, Python manages memory automatically (allocating it when needed and releasing it when it is no longer needed).

The advantage of a high-level language is that it is easy to use without worrying about details. The disadvantage is that users tend to learn what works without understanding why.
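The automatic memory management described above can be observed with a minimal sketch. It assumes the standard CPython interpreter; the `Buffer` class is a hypothetical name used only for illustration.

```python
import gc
import weakref


class Buffer:
    """A small object whose lifetime we want to observe."""

    def __init__(self, size):
        # Memory is allocated here; there is no explicit malloc-style call.
        self.data = bytearray(size)


buf = Buffer(1024)
probe = weakref.ref(buf)  # a weak reference does not keep the object alive
alive_before = probe() is not None

del buf       # drop the last strong reference; there is no explicit free()
gc.collect()  # not needed in CPython, but harmless on other interpreters
alive_after = probe() is not None

print(alive_before, alive_after)
```

The programmer never states how many bytes to reserve or when to release them; the interpreter reclaims the object once nothing refers to it.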

5. Python is an object-oriented programming language

Object orientation is a common feature of modern programming languages. Without it, a language struggles when used to develop medium and large programs.

Python supports object orientation but does not force it. Java, by contrast, is a typical object-oriented programming language that requires code to be organized into classes and objects.
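As a small illustration of this point, the same calculation can be written as a plain function or as a class. The names `area` and `Rectangle` are made up for this sketch.

```python
# Procedural style: a plain function, no class required.
def area(width, height):
    return width * height


# Object-oriented style: the same logic organized as a class.
class Rectangle:
    def __init__(self, width, height):
        self.width = width
        self.height = height

    def area(self):
        return self.width * self.height


print(area(3, 4))
print(Rectangle(3, 4).area())
```

Both calls print 12. Python accepts either style in the same file, so a short script can stay procedural while a larger program moves to classes.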

6. Python is powerful (with many modules)

Python has many modules that cover nearly all common needs, from simple string processing to complex 3D graphics rendering.

The Python community is well developed. In addition to the core modules provided officially, many third parties contribute modules, including software giants such as Google, Facebook, and Microsoft. Even for niche functions, Python often has a corresponding open source module, and sometimes more than one.

7. Python is highly extensible

Python's extensibility is reflected in its modules. Python has some of the richest and most powerful libraries among scripting languages. These libraries cover most application scenarios, such as file I/O, GUI, network programming, database access, and text processing.

The underlying code of these libraries is not necessarily Python; much of it is C/C++. When a critical piece of code needs to run faster, it can be implemented in C/C++ and then called from Python. Because Python can "glue" other languages together in this way, it is called a "glue language".

This extensibility compensates, to some extent, for Python's slow execution speed.

Disadvantages:

1. Slow running speed

Slow running speed is a common problem of interpreted languages, and Python is no exception.

Python is slow not only because it "translates" the source code while running, but also because it is a high-level language that hides many low-level details. That convenience has a cost: the interpreter does a lot of work on the programmer's behalf, and some of it, such as managing memory, consumes significant resources.

Python is among the slowest mainstream languages: much slower than C/C++ and also slower than Java.
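A rough way to see this cost is to time the same addition done by an interpreted loop and by the built-in `sum()`, whose loop runs in C in the standard CPython interpreter. This is only a sketch: `python_sum` is a hypothetical helper, and the measured times depend on the machine, so none are quoted here.

```python
import timeit


def python_sum(values):
    """Add numbers with an explicit loop executed by the interpreter."""
    total = 0
    for v in values:
        total += v
    return total


values = list(range(100_000))

# Both approaches give the same answer...
same_result = python_sum(values) == sum(values)

# ...but the built-in sum() runs its loop in C (in CPython).
loop_time = timeit.timeit(lambda: python_sum(values), number=20)
builtin_time = timeit.timeit(lambda: sum(values), number=20)
print(f"Python loop: {loop_time:.4f} s, built-in sum: {builtin_time:.4f} s")
```

On typical CPython installations the built-in version is noticeably faster, which is the same effect that C-backed libraries rely on.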

However, slow speed often does not cause serious problems. First, hardware keeps getting faster, and spending more money buys higher-performance hardware. Better hardware can make up for slower software.

Second, some application scenarios can tolerate slow execution, such as websites. When users open a web page, they spend most of the time waiting for the network request rather than for the server to execute the web program. Whether the server takes 1 ms or 20 ms to execute the program makes no perceptible difference to users, because the network connection often takes 500 ms or even 2000 ms.
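The arithmetic behind this argument can be checked directly. The sketch below uses the 1 ms, 20 ms, and 500 ms figures quoted above; the function name `total_response_ms` is made up for illustration.

```python
NETWORK_MS = 500  # network time, the lower figure quoted above


def total_response_ms(server_ms, network_ms=NETWORK_MS):
    """Total time the user waits: server execution plus network time."""
    return server_ms + network_ms


fast = total_response_ms(1)   # server takes 1 ms
slow = total_response_ms(20)  # server takes 20 ms
increase = (slow - fast) / fast * 100
print(f"{fast} ms vs {slow} ms: {increase:.1f}% longer")
# prints "501 ms vs 520 ms: 3.8% longer"
```

A server that is twenty times slower lengthens the total wait by under 4 percent in this scenario, and by even less when the network takes 2000 ms.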

2. Difficult code encryption

Compiled languages turn source code into executable programs, whereas Python programs are distributed and run as source code, so it is difficult to hide or encrypt the source.

  1. Data acquisition/collection represents one of the stages of a Data Science process. Discuss at least four forms of gathering a dataset when carrying out a Data Science project. Also, support your description with real-life examples, and describe any three data formats that a dataset can assume at a time.

  • Downloading a data file (or files) manually
  • Querying data from a database
  • Querying an API (usually web-based)
  • Scraping data from a web page
  • Collecting the data oneself
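Two of these forms, querying a database and scraping a web page, can be sketched with the Python standard library alone. Everything here is a stand-in chosen so the example runs offline: the in-memory SQLite database, the `sales` table, the region names, and the hard-coded HTML string are all hypothetical.

```python
import sqlite3
from html.parser import HTMLParser

# Query data from a database (an in-memory SQLite database stands in for a real one).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("North", 120.0), ("South", 80.5), ("North", 45.0)],
)
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
conn.close()
print(rows)


# Scrape data from a web page (a hard-coded string stands in for a downloaded page).
class CellCollector(HTMLParser):
    """Collect the text of every <td> cell in the page."""

    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.cells.append(data.strip())


page = "<table><tr><td>North</td><td>165</td></tr></table>"
collector = CellCollector()
collector.feed(page)
print(collector.cells)
```

In a real project the connection would point at an existing database and the page text would come from an HTTP request; the parsing and querying steps stay the same.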

The three most common formats (judging by my completely subjective experience):

  • CSV (comma-separated values) files
  • JSON (JavaScript Object Notation) files and strings
  • HTML/XML (HyperText Markup Language / Extensible Markup Language)
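To make the formats concrete, the sketch below stores the same two made-up records as CSV, JSON, and XML and parses each with the Python standard library. The records and field names are invented for illustration.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

csv_text = "name,age\nAda,36\nLinus,28\n"
json_text = '[{"name": "Ada", "age": 36}, {"name": "Linus", "age": 28}]'
xml_text = "<people><person name='Ada' age='36'/><person name='Linus' age='28'/></people>"

# CSV: every value is read as a string, so numbers must be converted explicitly.
from_csv = [
    {"name": row["name"], "age": int(row["age"])}
    for row in csv.DictReader(io.StringIO(csv_text))
]

# JSON: the parser preserves types such as numbers, strings, and lists.
from_json = json.loads(json_text)

# XML: attribute values are strings, like CSV fields.
from_xml = [
    {"name": p.get("name"), "age": int(p.get("age"))}
    for p in ET.fromstring(xml_text).iter("person")
]

print(from_csv == from_json == from_xml)
```

The final line prints True: the three formats carry the same information and differ only in how structure and types are encoded.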

  1. Data cleaning/cleansing is one of the stages of a Data Science process in which issues such as missing values, unit conversion, misspellings, duplicate rows, inconsistent formats, and unspecified units are resolved to put the dataset in good shape for the subsequent stages. With this understanding and what we have learned in class, you are to work with the “Crime Incidence Report” dataset which I handed over to you several weeks ago, performing four different data cleaning operations/tasks. Afterwards, you are expected to clearly report your results (screenshots of the outcome of the cleaning tasks and the code used). The implementation can be done using tools/programming languages that we used in class (Python, Jupyter Notebook) or other tools of your choice.
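One possible shape for the four cleaning tasks is sketched below with pandas. The rows and the column names (`incident_id`, `offense`, `district`, `date`) are hypothetical stand-ins, since the actual layout of the “Crime Incidence Report” file is not shown here; the real data would be loaded with `pd.read_csv`.

```python
import pandas as pd

# Hypothetical rows and column names; replace with pd.read_csv(...) on the real file.
raw = pd.DataFrame(
    {
        "incident_id": [1, 2, 2, 3, 4],
        "offense": ["Larceny", "larceny ", "larceny ", "ASSAULT", None],
        "district": ["A1", "B2", "B2", None, "C3"],
        "date": ["2021-01-05", "2021-01-06", "2021-01-06", "2021-01-07", "2021-01-08"],
    }
)

clean = (
    raw.drop_duplicates()            # 1. remove exact duplicate rows
    .dropna(subset=["offense"])      # 2. drop rows missing a key field...
    .assign(
        district=lambda d: d["district"].fillna("UNKNOWN"),  # ...and fill other gaps
        offense=lambda d: d["offense"].str.strip().str.title(),  # 3. consistent text format
        date=lambda d: pd.to_datetime(d["date"], format="%Y-%m-%d"),  # 4. real date type
    )
    .reset_index(drop=True)
)
print(clean)
```

Whether a missing value should be dropped or filled is a judgment that depends on the column, so the choices above are examples rather than rules.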

 

  1. Briefly discuss why Data visualization is an important stage of a Data Science process. Using the New York “Airbnb” dataset which I shared with you a few weeks ago, perform at least four different kinds of Data visualization tasks using functions/methods from the Matplotlib and Seaborn libraries. Also, clearly report your results with screenshots of the outcome of the data visualization tasks and their respective code.

Data visualization is used in many fields, such as finance, transportation, logistics, healthcare, and electric power. Its goals vary. Some visualizations are built to observe and track data, so they emphasize real-time updates and computing capacity and produce continuously changing, readable charts. Some present data for analysis and produce searchable, interactive charts. Some produce multidimensional charts in order to reveal potential associations within the data. To help ordinary or business users quickly grasp what the data means or how it is changing, some use attractive colors and animation to create vivid, clear charts. Others serve education, publicity, or politics: made into posters and courseware, they appear in streets, advertisements, magazines, and gatherings, and they persuade strongly by using sharp contrast and similar devices to create powerful images.

The key to visualizing data is good visual design: presenting multidimensional, abstract data as two- or three-dimensional graphics and letting people interact with the information. Because charts are intuitive and convenient to interact with, they improve decision-making and work efficiency inside an organization and let an outside audience understand its main business at a glance.
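Four kinds of plot that the task calls for can be sketched with Matplotlib and Seaborn as follows. The rows and the column names (`room_type`, `price`, `number_of_reviews`) are assumptions made for illustration, not values from the “Airbnb” file, which would normally be loaded with `pd.read_csv`.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical rows; the column names are assumptions about the Airbnb file.
listings = pd.DataFrame(
    {
        "room_type": [
            "Private room", "Entire home/apt", "Entire home/apt",
            "Shared room", "Private room", "Entire home/apt",
        ],
        "price": [60, 150, 220, 35, 75, 180],
        "number_of_reviews": [12, 40, 5, 2, 30, 18],
    }
)

fig, axes = plt.subplots(2, 2, figsize=(10, 8))

# 1. Histogram: how prices are distributed.
axes[0, 0].hist(listings["price"], bins=5)
axes[0, 0].set_title("Price distribution")

# 2. Count plot: number of listings per room type.
sns.countplot(data=listings, x="room_type", ax=axes[0, 1])

# 3. Box plot: price spread within each room type.
sns.boxplot(data=listings, x="room_type", y="price", ax=axes[1, 0])

# 4. Scatter plot: relationship between reviews and price.
sns.scatterplot(data=listings, x="number_of_reviews", y="price", ax=axes[1, 1])

fig.tight_layout()
```

In a Jupyter notebook the figure renders automatically; in a script, `plt.show()` or `fig.savefig(...)` displays or stores it.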

  1. Exploratory Data Analysis (EDA) constitutes an important aspect of a Data Science project. You are required to give a concise but informative description of the “FIFA World Cup” dataset handed to you in the previous lecture and carry out the following EDA tasks: (a) Find the summary statistics of the data and provide a brief explanation of the statistics for any (single) variable in the dataset, (b) Find the mean, median, and mode of the variable “QualifiedTeams”, (c) Generate a histogram plot for the variable “Winner” to understand the number of times a country won the World Cup between 1930 and 2014, (d) Find and visualize the correlation among the numeric variables (Year, GoalsScored, QualifiedTeams, MatchesPlayed) in the data using a “heatmap” plot, (e) Plot the pairplots of the 'GoalsScored', 'QualifiedTeams', and 'MatchesPlayed' variables and summarize the information in the plot. Note: Clearly report your results with screenshots of the outcome of the EDA tasks and the respective code.
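The numeric parts of tasks (a) to (d) can be sketched with pandas as below. The column names follow the task description, but every value in the table is a placeholder chosen for illustration and is not copied from the “FIFA World Cup” dataset; the real file would be loaded with `pd.read_csv`.

```python
import pandas as pd

# Placeholder values, not taken from the dataset; load the real file with pd.read_csv(...).
cups = pd.DataFrame(
    {
        "Year": [1930, 1934, 1938, 1950, 1954],
        "Winner": ["Team A", "Team B", "Team B", "Team A", "Team C"],
        "GoalsScored": [60, 72, 81, 90, 120],
        "QualifiedTeams": [13, 16, 16, 13, 16],
        "MatchesPlayed": [18, 17, 18, 22, 26],
    }
)
numeric = ["Year", "GoalsScored", "QualifiedTeams", "MatchesPlayed"]

# (a) Summary statistics: count, mean, std, min, quartiles, max per column.
summary = cups[numeric].describe()

# (b) Mean, median, and mode of one variable; mode() can return several values.
teams = cups["QualifiedTeams"]
mean, median, modes = teams.mean(), teams.median(), teams.mode().tolist()

# (c) The counts that a plot of "Winner" would display.
wins = cups["Winner"].value_counts()

# (d) The correlation matrix that a heatmap would display.
corr = cups[numeric].corr()

print(summary)
print(mean, median, modes)
```

For the plots themselves, Seaborn provides `sns.countplot` for (c), `sns.heatmap(corr, annot=True)` for (d), and `sns.pairplot` for (e).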
