Working as a Data Scientist
Nowadays it sometimes seems like everything is about data. Data policies are debated in public, and we are frequently asked to agree to data protection terms. And usually we do agree. Data has become a kind of currency: we pay for digital services with information about our interests and habits. It is even called the “gold of the 21st century”, with the small difference that, unlike gold, the supply is huge and keeps growing, because we generate and provide data all the time.
But why is data so valuable, and what happens to all the heaps of information we produce every day? At first glance it is only an unstructured mass of bits and bytes. But Data Science offers a big toolbox that Data Scientists have at their disposal to structure this mass and to dig up the valuable information hidden in it.
So in general, Data Science is about gaining knowledge from data. It is an interdisciplinary field combining Computer Science, Mathematics and Domain Knowledge. The areas of application are diverse, and so are the problems that occur and the methods to solve them. A task can be a simple statistical analysis for a business case, or the investigation of complex structures and the detection of subtle anomalies within them. The best model has to be found for each problem. For this, the toolbox offers nearly everything, ranging from plain Statistics to Machine Learning and multilayer Neural Networks.
As you can see, working with data is complex and requires a wide range of skills. The perfect candidate for a job in Data Science brings programming skills, mathematical understanding, business knowledge, the will to keep working their way into new topics, and the ability to see things from various perspectives in order to communicate between all the parties involved.
First, you need to understand the problem. Data Science can be applied in many areas. Even if the methods are similar, they need to be adapted to the particular use case. There may be peculiarities that demand special treatment, or you may simply need domain knowledge to interpret and handle the data correctly and, finally, to verify the results.
Data Science, of course, does not work without data. By the law of large numbers, estimates become statistically more reliable the more data you have, and algorithms behave more stably. Thus, you need to collect plenty of data before you can start analyzing it or implementing any code. Today, data is collected in many different ways and places. User behavior is tracked on websites, and images or texts are recorded on various physical devices or taken from social media. A more classical way to gain information is to conduct surveys, or maybe there are some dusty folders with handwritten notes in the archive room. Wherever the information comes from, it can potentially improve your work.
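The effect of the law of large numbers can be made visible with a small simulation. The following is only a minimal sketch in Python using the standard library; the die-rolling experiment and the function names are assumptions chosen for illustration and are not taken from a real project. It repeats the same experiment many times and measures how much the estimated average varies for small and large samples.

```python
import random
import statistics


def sample_mean_of_die_rolls(n_rolls, rng):
    """Average of n_rolls simulated rolls of a fair six-sided die."""
    return statistics.fmean(rng.randint(1, 6) for _ in range(n_rolls))


def spread_of_estimates(n_rolls, n_repeats=200, seed=0):
    """Repeat the experiment n_repeats times and return the standard
    deviation of the resulting sample means (smaller = more reliable)."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    means = [sample_mean_of_die_rolls(n_rolls, rng) for _ in range(n_repeats)]
    return statistics.pstdev(means)


if __name__ == "__main__":
    # The true mean of a fair die is 3.5; larger samples scatter less around it.
    for n in (10, 100, 1000):
        print(n, "rolls -> spread of the estimate:", round(spread_of_estimates(n), 3))
```

The spread shrinks as the number of rolls grows, which is exactly why more data tends to give more reliable results.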
Processing data is an important and time-consuming part of Data Science. It comprises going through all possible data sources, merging the extracted data into a suitable file format and eliminating false information. The quality of the data is essential for the performance of the algorithms and the accuracy of the analyses. Editing the data further, by adding new features that combine existing information or by discarding unnecessary details, can therefore improve performance.
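To make these processing steps more tangible, here is a small sketch in plain Python. All records, column names and helper functions are made up for illustration; in practice a library such as pandas would typically be used for this kind of work. The sketch merges two sources, removes an obviously false record and derives a new feature from two existing columns.

```python
# Hypothetical records from two sources, linked by a made-up customer id.
web_tracking = [
    {"customer_id": 1, "page_views": 12},
    {"customer_id": 2, "page_views": 40},
    {"customer_id": 3, "page_views": -5},  # implausible value, e.g. a logging error
]
survey = [
    {"customer_id": 1, "purchases": 3},
    {"customer_id": 2, "purchases": 4},
    {"customer_id": 3, "purchases": 1},
]


def merge_by_key(left, right, key):
    """Inner join of two lists of dicts on a shared key."""
    right_by_key = {row[key]: row for row in right}
    return [
        {**row, **right_by_key[row[key]]}
        for row in left
        if row[key] in right_by_key
    ]


def drop_invalid(rows):
    """Discard rows with negative counts, which cannot be real measurements."""
    return [r for r in rows if r["page_views"] >= 0 and r["purchases"] >= 0]


def add_purchases_per_view(rows):
    """Add a new feature that combines two existing columns."""
    return [
        {
            **r,
            "purchases_per_view": (
                r["purchases"] / r["page_views"] if r["page_views"] else 0.0
            ),
        }
        for r in rows
    ]


merged = merge_by_key(web_tracking, survey, "customer_id")
cleaned = add_purchases_per_view(drop_invalid(merged))

if __name__ == "__main__":
    for row in cleaned:
        print(row)
```

Keeping each step in its own small function makes it easy to check the data after every stage, which helps when the sources are messier than in this toy example.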
Having so much data soon leads to physical problems. Big Data is characterized by its volume: it requires large storage capacities on the one hand and high computational power to go through these heaps of data on the other. This cannot be accomplished by simple methods on ordinary computers. It is the job of a Data Engineer to choose suitable hardware and software, to maintain the Data Warehouse and to prepare data for further processing. The role of a Data Steward is to monitor the quality and correctness of the data, which also means ensuring compliance with guidelines and policies.
Once the data is prepared, its exploration can start. For statistical analyses, Data Analysts or Business Analysts are usually consulted. They mediate between the business side and the IT department. By applying statistical methods or using special analysis software, they draw conclusions from data and turn them into guidance and business strategies. With their visualization and communication skills, they translate the analytical results into business language. For this, it is vital to know how business partners think and to understand how the business processes work.
If more elaborate analyses are desired, Data Scientists implement Machine Learning algorithms, a subfield of Artificial Intelligence. This is done especially when the algorithm is supposed to detect complex patterns in the dataset. Depending on the nature of the data, various methods are available, e.g. k-Nearest Neighbors (kNN), Linear Regression, Decision Trees, Random Forests or Neural Networks. If the problem is even more complex and enough data is available, the number of layers in the Neural Network can be increased, which is what is called Deep Learning.
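To give a first impression of what such a method looks like in code, here is a minimal sketch of kNN classification in plain Python. The toy data points, the labels and the function name are assumptions for illustration only; real projects usually rely on an established library such as scikit-learn instead of a hand-written version. The idea is simple: a new point gets the label that is most common among its k closest training points.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points,
    using the Euclidean distance."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Made-up toy data: two clearly separated groups in a 2D feature space.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels = ["low", "low", "low", "high", "high", "high"]

if __name__ == "__main__":
    print(knn_predict(points, labels, (1.1, 1.0)))
    print(knn_predict(points, labels, (5.1, 5.0)))
```

The choice of k is a trade-off: a small k reacts strongly to single outliers, while a large k smooths over local structure in the data.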
Here, too, it matters that the Data Scientist visualizes the method and the results. To avoid mistakes or tunnel vision, it is important to discuss and question the task and the applied methods over and over again.
So being a Data Scientist offers a bright and diverse career. If you know your tools and are experienced in using them, you can choose from a wide range of advanced and value-adding jobs. But how do you write the code? What are the differences between the methods? All of this, and some stunning examples of Data Science and AI, will follow in future blog posts. Stay tuned!
About the Author
Christine studied Applied Mathematics in Flensburg and chose Wind Energy as her field of application. During the practical phase of her studies she developed Python code for wind energy applications at DNV GL Renewables Certification and Fraunhofer IWES. This year she attended the Data Science Bootcamp at neuefische to take on new professional challenges as a Data Scientist.