What determines a good data scientist?

Mar 16, 2019

In the big data world, more and more people hope to become data scientists, using the power of data and machine learning to solve real-world problems. It has been almost a year since I graduated and became a data scientist at an e-commerce company. In that time I have realized that there is a deep gap between academic programs in data mining and real work in industry. So, in this post, I try to summarize five abilities from my work experience that I think differentiate good data scientists.

Interpreting Machine Learning models

Many people say machine learning models are ‘black boxes’, in the sense that they can make good predictions but you can’t understand the logic behind those predictions. However, data scientists need to try their best to explain machine learning results and tell a good story about them. Example questions: A retailer builds a model to predict next month’s sales; can you tell which feature is most important for increasing sales? A bank builds a model to detect fraudulent transactions; which type of user is more likely to commit fraud?

If you cannot answer questions like these, my suggestion is to explore Local Interpretable Model-Agnostic Explanations (LIME), permutation importance, partial dependence plots, and SHAP. These are powerful tools for explaining your machine learning models and getting insights from the data.
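To make one of these tools concrete, here is a minimal sketch of permutation importance in plain Python. The idea is to shuffle one feature at a time and measure how much the model's error grows; a feature whose shuffling hurts the most matters the most. The toy sales model, the feature names (`ad_spend`, `discount`, `noise`), and the synthetic data are all assumptions made up for illustration, not anything from my actual work, and in practice you would more likely reach for a library implementation.

```python
import random


def mean_squared_error(y_true, y_pred):
    """Average squared difference between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Return, per feature, the mean increase in error after shuffling it."""
    rng = random.Random(seed)
    baseline = mean_squared_error(y, [predict(row) for row in X])
    importances = []
    for col in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            # Shuffle one column only, leaving the other features untouched.
            shuffled = [row[col] for row in X]
            rng.shuffle(shuffled)
            X_perm = [row[:col] + [v] + row[col + 1:] for row, v in zip(X, shuffled)]
            error = mean_squared_error(y, [predict(row) for row in X_perm])
            increases.append(error - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


# Hypothetical stand-in for a trained sales model: it relies heavily on
# ad_spend, slightly on discount, and ignores the noise feature.
def toy_sales_model(row):
    ad_spend, discount, noise = row
    return 3.0 * ad_spend + 0.5 * discount


# Synthetic data, generated only so the sketch runs end to end.
data_rng = random.Random(42)
X = [[data_rng.uniform(0, 10) for _ in range(3)] for _ in range(200)]
y = [toy_sales_model(row) for row in X]

importances = permutation_importance(toy_sales_model, X, y)
for name, score in zip(["ad_spend", "discount", "noise"], importances):
    print(f"{name}: {score:.2f}")
```

In this toy setup the ranking should come out as ad_spend first, discount second, and noise last with no increase in error at all, which is the kind of answer the retailer in the example question is looking for.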

Deploying models to ML systems

A lot of ML courses and tutorials teach us how to build models and get better predictions, but in practice the ML code is only a small part of the whole system. If you work at an internet company, making real-time predictions is critical to serving your customers well. As a data scientist, you should not only build the models, but also deploy them into production. In other words, you should provide your business partners with end-to-end solutions, from clarifying business objectives to implementing the solutions.

Also, the final models should be considered from both a business and a technical perspective. That’s why my closest colleagues are engineers and product managers. Sometimes you should think outside the box, and remember that done is better than perfect. If you need to learn how to deploy models, these tools and resources are worth checking: Kafka, Apache Flink, Apache Beam, and End-to-End Machine Learning with TensorFlow on GCP.
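As a rough illustration of what "deploying" adds on top of the model itself, here is a small sketch of a request handler that a real-time prediction service might wrap around a model. Everything in it is hypothetical: the `fraud_score` rule stands in for a trained model, and the field names and JSON shape are assumptions of mine, not a description of any production system. The point is only that input validation and a stable response format are part of the job too.

```python
import json

REQUIRED_FIELDS = ("amount", "is_new_account")


# Hypothetical stand-in for a trained model loaded once at service start-up.
def fraud_score(features):
    score = 0.1
    if features["amount"] > 1000:
        score += 0.5
    if features["is_new_account"]:
        score += 0.3
    return score


def handle_prediction_request(body):
    """Take a JSON request body (str) and return a JSON response (str)."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return json.dumps({"error": "invalid JSON"})
    if not isinstance(payload, dict):
        return json.dumps({"error": "expected a JSON object"})
    # Reject incomplete requests instead of letting the model fail later.
    missing = [field for field in REQUIRED_FIELDS if field not in payload]
    if missing:
        return json.dumps({"error": "missing fields", "fields": missing})
    return json.dumps({"fraud_score": round(fraud_score(payload), 3)})


print(handle_prediction_request('{"amount": 2500, "is_new_account": true}'))
print(handle_prediction_request('{"amount": 5}'))
```

A function like this could sit behind a web framework or a stream consumer; keeping it separate from the transport layer makes it easy to test without starting a server.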

Simplifying complex problems from real world

The real world is noisy and the data is dirty, so your work will never be like a Kaggle project, where the problem is defined and the data is ready to use. Think like a professional consultant: structure the problem and divide it into subtasks. One mistake many data scientists make is trying to see what they can do with the datasets they currently have. The correct mindset, I think, is: don’t start with the data, start with the problem space, and use data to redefine the problem. Problem -> relevant KPIs -> product requirements -> necessary analysis -> data.

Communicating with business partners

It is worth noting that communication is essential for data scientists working in a team. In my case, I have a weekly meeting with engineers and product managers, a monthly meeting with the whole data science team, and a quarterly showcase with other teams and business partners. The crucial part of the work is establishing trust, which is your key asset in the real world. As my mentor told me, people generally want to work with people who are friendly and who work well together in a collaborative fashion (vs. trying to individually take all the credit for the team’s work, for example). Those skills get better the more you use them, and you learn quickly from any mistakes you make. Through communication, you and your colleagues can understand each other better, which greatly improves work efficiency.

Also, document your models and your ideas in your company’s internal wiki. That way, others can learn more about what you are working on. I publish at least one article or PowerPoint deck per week in our internal system, and in my situation it makes it much easier for me to share my ideas and work (especially since English is my second language).

Lifelong learning

I didn’t stop learning and trying new things after graduation, and I have actually gained a lot from my peers, my colleagues, and the resources of our data science team. Thanks to my boss giving me a flexible working environment, I don’t need to worry about deadlines and have more opportunities to learn something new and apply it in my own projects, iteratively.

Let’s make an analogy: imagine that you are a machine learning model, not necessarily a supervised one, and the input data is the information you receive every day: the people you meet, the movies you watch, the newspapers and books you read, etc. One possible way to improve yourself is to clean and filter your input data, because garbage in, garbage out.

In data science, the ‘clean’ data would be high-quality blogs, papers from both academia and industry, and online courses from MOOCs. Prepare your own input data, and tune your parameters with your passion.