I would like to know what your data challenges are. When you think about data, and its ability to become information to actually drive action (or inaction, as the case may be), how would it be useful to you? Now and a year from now? What do you wish you had that you don't have? What's hard for you to do that you wish were easier? What's impossible for you to do that you wish were possible?
I would love to get your thoughts to get a better understanding of people's and organizations' actual pain points.
Maybe I should make clear that I have a business (not a data science) background. So these 3 points are a brief business perspective:
1. Lack of interdisciplinarity
I frequently run into problems because entrepreneurs and managers lack an understanding of what data science / statistics can (and cannot) do, while data scientists lack domain expertise, so together they fall short of producing actionable insights. As a consequence, many big data projects are paused or not implemented at all because they don't deliver meaningful results. Some claim there is a “shortage of talent”, but I feel the talent on both sides (business and data science) is there. We just need better communication between the two and some kind of “cross-education”.
2. Data quality
I guess this problem will become more and more prevalent as big data grows bigger and bigger. In part, this is caused by the companies themselves, as so many collect as much data as possible and then try to clean it later. It should be the other way around: know what kind of information/insights you want, and then collect the data wisely (e.g. by focusing on a few relevant channels rather than on as many as possible, building a relationship of trust with customers rather than sending e-mails/spam to everyone you get to know, etc.).
3. Data (systems) scalability
This refers to both infrastructure and (wo)man power. Data projects must be scaled up and down as the business demands. History may prove me wrong, but I don't feel that the cloud is a solution in the long run because, in my view, data security and privacy will take precedence. Can the blockchain be the answer here?
@Marian Thank you for the response!
1. I think there's a 3rd way, and that's what I'm working on. I seek to take subject-matter-experts and equip them with the knowledge and capability to do analytics and data science by themselves. I firmly believe that this is a tooling and user experience problem, and that users with the intellectual sophistication required to be SMEs can, with minimal training, perform the work that dedicated analysts and data scientists do. Beyond the technical skills, I do not believe data scientists offer value other than a healthy notion of what empiricism is. SMEs should be able to internalize that philosophy with minimal training. Our product is targeted at any SME that understands how spreadsheet formulas work. That's the mental model we anchor to.
2. I totally agree. The "record everything in case you need it" approach is a bad one. Worse yet is the "let's spam people before we even have a specific use" approach. This is where inter-organizational collaboration is key. There's no way to know a priori whether a piece of information will be relevant or not. That has to be empirically tested. However, once one or two organizations have tested it, there's no need for every other organization in the vertical to test it. One of the properties of the computer science research I have done allows for safe sharing of analytical logic between users within an organization and even across organizations, in a way that can be automated (suggestions as well as user-led discovery).
3. Automatic scaling (referred to as "elasticity") is definitely important, from both a capabilities perspective and a cost perspective. Blockchain is a pseudonymous public log of transactions that cannot be amended. It is the opposite of privacy. I'm trying to avoid self-promotion, but the computer science research I have done achieves elasticity.
I have a million questions, but I'll try to limit myself to just 3 :-)
ad 1) I understand what you intend to do but not how. Is this some kind of (online?) course? Statistical software? A little bit of both?
ad 2) "Safe sharing of analytical logic between users intra-organization and even inter-organization" might require a high degree of data anonymisation, doesn't it? I have one company on my radar in Europe that seems to do the same thing you described. But what exactly do you mean by sharing the "logic"? Do you share some results or also the raw data?
ad 3) I understand what you said regarding blockchain and privacy, but what about security? Is the blockchain a safe haven for data (maybe except for a majority attack)?
I apologize for my curiosity ...
@Marian Thank you for the questions!
1) Infallisys is my startup. We are building a data platform. It's software that can be deployed on commodity hardware - your choice of being a tenant on our cloud, your own cloud, or your own bare metal. The software itself lets you specify and maintain a data warehouse or data lake, complete with data ingestion and data pipelines that process ingested data continuously. Those data pipelines are coded completely differently than current generation approaches; this is what allows non-programmer subject-matter-experts to do it themselves, and is the secret sauce. These data pipelines can also integrate with a wide variety of existing data science tools, including R, Pandas, etc., so there's no need for data engineers to rewrite data scientists' models to deploy them. Analysts can also do everything with SQL if they wish. And of course, total non-programmers have a graphical interface very similar to spreadsheet formulas to empower them.
2) Analytical logic refers to code. The platform will not share data across organizations. What the underlying engine can do is make intelligent suggestions about code to apply based on the objects (tables, etc.) that you are working with in the scope of a "unit of analysis" (report, visualization, or query). The engine can't generate code - the search space is too big. But what it can do is catalog what other users have done and draw upon that for suggestions. An alternate approach is to specify the desired result, assuming the user doesn't know how to write the code to get there. The engine will evaluate existing compatible pieces of code by all users and all organizations, and bring back any code that computes the desired result. This is possible because of the way in which code written for the platform is composed. One concrete use case is prospect tracking for all organizations that use both Salesforce and Marketo. Neither Salesforce nor Marketo will have pre-canned reports for this. The platform Infallisys is building allows for seamless sharing.
3) Security has many components. Privacy is one. Resistance to modification is another. There's also resilience. On resistance to modification, yes, blockchain is more resistant. But that comes at a heavy cost in transaction processing time, i.e. latency. The blockchains that promise quick transaction processing sacrifice resistance to modification more than anything. And resistance to modification is arguably the fundamental reason to use blockchain. So organizations face a stark choice. On resilience, blockchain is very prone to denial-of-service attacks. The argument from blockchain advocates is that there are economic disincentives: you have to pay to process transactions. However, you do not need to pay to "apply" to process a transaction, so you can flood the cluster with these applications to process.
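To make the "specify the desired result" idea from point 2) a bit more concrete, here is a toy sketch in Python. It is not the Infallisys engine or its API; the catalog, the function names, and the sample rows are all invented for illustration. It only shows the general shape: keep a catalog of small, composable pieces of code, run each against the user's data, and return the ones whose output equals the result the user asked for.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]
Transform = Callable[[List[Row]], Any]

# Hypothetical catalog of shared analytical logic (code only, no data).
# Names and functions are made up for this example.
CATALOG: Dict[str, Transform] = {
    "count_rows": lambda rows: len(rows),
    "total_amount": lambda rows: sum(r["amount"] for r in rows),
    "distinct_customers": lambda rows: len({r["customer"] for r in rows}),
}


def find_matching(rows: List[Row], desired: Any) -> List[str]:
    """Return the names of catalog entries that compute `desired` from `rows`."""
    matches = []
    for name, fn in CATALOG.items():
        try:
            if fn(rows) == desired:
                matches.append(name)
        except (KeyError, TypeError):
            # This piece of code is not compatible with the given table.
            continue
    return matches


# The data stays with the user; only the catalogued code is shared.
sample = [
    {"customer": "a", "amount": 10},
    {"customer": "b", "amount": 5},
    {"customer": "a", "amount": 7},
]
print(find_matching(sample, 22))
```

A real system would need a far smarter notion of "compatible" than catching exceptions, but the sketch shows why searching existing code is tractable where generating code from scratch is not.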
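For readers wondering what "resistance to modification" in point 3) means mechanically, the following is a minimal hash-chain sketch in Python. It is an assumption-laden illustration, not a real blockchain: the block layout and helper names are invented here, and it leaves out consensus among many nodes entirely, which is where the latency cost discussed above comes from. It shows only that each block stores the hash of the previous one, so altering an old record breaks every later link.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first block


def block_hash(index, payload, prev_hash):
    """SHA-256 over the block's contents, including the previous block's hash."""
    body = json.dumps(
        {"index": index, "payload": payload, "prev": prev_hash}, sort_keys=True
    )
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def build_chain(payloads):
    """Link the payloads into a list of blocks, each pointing at its predecessor."""
    chain, prev = [], GENESIS
    for index, payload in enumerate(payloads):
        digest = block_hash(index, payload, prev)
        chain.append(
            {"index": index, "payload": payload, "prev": prev, "hash": digest}
        )
        prev = digest
    return chain


def is_valid(chain):
    """Recompute every hash and check every link back to the genesis value."""
    prev = GENESIS
    for block in chain:
        expected = block_hash(block["index"], block["payload"], block["prev"])
        if block["prev"] != prev or block["hash"] != expected:
            return False
        prev = block["hash"]
    return True


chain = build_chain(["pay A 5", "pay B 3", "pay C 9"])
print(is_valid(chain))            # untouched chain
chain[1]["payload"] = "pay B 300"  # tamper with an old record
print(is_valid(chain))            # tampering is detected
```

Note that the payloads are stored in the clear, which fits the point made earlier that a public log of this kind does nothing for privacy.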
Thanks for taking the time to answer in such depth.
I am using Pandas, so that sounds interesting to me, especially the suggestions about the code. So you probably have a first customer :-)
I'll be watching out for infallisys.com