Where I work, the data scientists are more educated and experienced in containerization, CI tooling, unit testing, profiling tools, web service prototyping (including API validation tools), caching layers, queues, GPU systems programming, and so on.
We are constantly thwarted by infrastructure teams that use superficial policy, essentially so they can complain that they don’t want to have to provide support for the heavily researched and tested implementations we create.
They don’t care that different technologies and database systems are chosen to solve customer use cases, or that growing our business means supporting “Lovecraftian concoctions of R and Python” — they just don’t want to do their jobs (which do indeed require providing infrastructure support for crazy screwball data services that repeatedly break all the assumptions)...
If the data scientists know more about CI, unit testing, profiling, and caching than the engineers do, then they are the better engineers, and I'd wonder a bit about their math/stats chops and whether their role was just re-branded "data scientist" to keep up with trends.
It’s quite common to start out with a PhD or master’s in math/stats, with deep specialization in fields like NLP, computer vision, or MCMC sampling, and then to become an experienced expert in GPU computing, containerization, web service layers, etc., while working on implementations of ML models.
This was true for me, anyway. The main thing I do is deep learning for computer vision and image search, but I think it’s fair to say I have significant experience with Docker, GPU architectures, various CI tooling, Linux systems programming, the deep internals of CPython, the internals of MySQL and Postgres, plenty of frustrating performance tradeoffs with py4j in the pyspark world, as well as all the usual crap with pandas, sklearn, data visualization tools, and a lot more.
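For what it’s worth, the py4j tradeoff mostly comes down to per-row serialization between the JVM and Python workers. Here’s a toy, stdlib-only sketch (no actual pyspark, and the row contents are made up) of why row-at-a-time pickling costs more than batching:

```python
import pickle

# Hypothetical "rows" standing in for data crossing the JVM/Python boundary.
rows = [(i, f"user_{i}") for i in range(1000)]

# Row-at-a-time: each object pays pickle's framing overhead separately,
# roughly what a plain Python UDF does in pyspark.
per_row_bytes = sum(len(pickle.dumps(r)) for r in rows)

# Batched: one serialized blob amortizes that overhead, roughly what
# Arrow-backed pandas UDFs buy you.
batched_bytes = len(pickle.dumps(rows))

print(per_row_bytes, batched_bytes)
```

In real pyspark the same principle shows up as the gap between row-at-a-time `udf` and vectorized `pandas_udf`; the toy numbers just make the per-object overhead visible.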
I’d say almost all the data scientists I’ve worked with are just like me, only with different specialization areas, except possibly for very young data scientists right out of undergrad.
If you don't mind me asking, where do you work? (Or even a family of companies that has this type of role.)
What you describe sounds like where I want to be, and it would help to know where I could go looking. (Deep learning / vision specialization with a math-focused CS background, but I want to learn more core software skills going forward.)
I work in a fairly mature startup that’s been around for over 10 years. It began primarily with an app, but shifted focus to other business areas. The image processing products are mostly related to offering information retrieval and search services for app users that have curated personal image collections.
I would say your description is accurate, but only accidentally. The reason we have to learn more core engineering skills is that infrastructure will not take responsibility for bringing our solutions to production, and they seek to limit the tools in our toolbox with policy.
It’s not fun when your everyday life is a constant impedance mismatch with the tools infra will allow you to use. You can constantly see better / faster / safer / cheaper ways to solve problems, with no downsides at all relative to the bad, slow, insecure, expensive ways infra currently makes everyone solve them, but you just feel constantly sad that you are superficially prevented from having the autonomy to select the efficient solution according to your creativity and skill.
I feel like you are roughly describing research programmers versus system administrators or operators in academic computing environments.
I think a big difference between research programmers and production/ops people is that as researchers we often chase a transient goal: build some complex and horrible integration to compute a result or put something in a paper. We used to call these Rube Goldberg machines rather than Lovecraftian horrors, but we mean the same thing. Something that belongs on a movie set, with some Jacob’s ladders arcing in the background. In some circles, it is called the heroic demo.
In the past, we might substitute other fads for your CI tooling or API validation tools. I remember when some research programmers were all-in on enterprise junk like J2EE/managed code, SOAP/WSDL, and other stovepipe tooling. There is a lot of cargo culting of such tools. When you have furnished your lab with rapid prototyping tools and focused on crazy integration stunts, you are almost always deluding yourself to think these tools are also giving you "production" system qualities.
Building something at the hairy edge of possibility is inherently about experimentation and risk-taking. Building reliable, production operations is inherently about conservative design and risk-mitigation. There seems to be a new cargo cult of devops which believes you can somehow mash these together and the conflict disappears. You don't have to map the negotiation onto two teams with opposing objectives, but the negotiation has to live somewhere.
Magically erasing the negotiation just means that you have chosen to default on the optimization task and jettison concern for at least one of functionality, cost, or risk. Startups commonly do this because the VC funding has mitigated the risk elsewhere: you can fail because they've also funded your competitor who may succeed...
I’m not referring to transient research prototypes, but to robust long-lived systems needed for experimentation and reproducible results tracking, and services that are directly customer facing.
We are often required to create new services and functionality because it is how our company can grow, and we have to have ease of access to experimental working space, with freedom to do things like custom compilations of ML frameworks, using programming languages that haven’t been widely used in the company yet to gain access to an important library, define complex assumption-breaking deployment constraints relative to GPU runtimes or containerized notebook servers, etc.
I think people who see how these things grow out of prototypes and conclude they were designed with only transient concerns in mind, and thus aren’t robust in some way, are rushing to judgment. They discount the fact that the ML expert who also wrote the web service layer, the Jenkinsfile, and the container definition, and who also knows how to tune indices in the database, really made their choices for serious, pragmatic engineering reasons that solve the business problem efficiently, and that they already anticipated and accounted for the shallow tradeoffs and caveats that IT will use as potshots to try to circumvent the responsibility of helping maintain it.
We may be talking past each other. I am on the research side of academic computing/informatics and have faced these struggles my whole career, encountering some very reluctant IT divisions.
We have had to bite the bullet and use colo facilities to self-host internet-facing deployments that the overhead-funded IT groups would not touch with a ten-foot pole. From these experiences, I also acquired a more nuanced view of the IT division's perspective and constraints, and how they derive from overall organizational policy and economics. We also had funny situations where we tried to help other PIs benefit from our new-found independence, and immediately regretted it. They did not understand what self-hosting means. I think anybody trying to toss integrations over the fence to an ops team needs to have an extended tour of duty trying to operate their own solutions in production WITHOUT assistance before they form bold opinions about operations constraints.
When there are strong time-to-market constraints (which includes publishing papers in academia), you are forced to find solution points that are different from those you would choose if you were planning to run something for long periods at low overhead and low accumulated risk. These solution points also have to take into account the staffing and resources for that ongoing production.
Things like bleeding-edge libraries and assumption-breaking deployment constraints are exactly the headache for ongoing operations and maintenance. It's not enough to have an existence proof that some complex integration can be built and passes its tests. You need a plan for how all the components will be maintained, patched, and upgraded. You need contingency planning for when some of those bleeding-edge components become deprecated. You need to consider what staff capabilities are assigned to do that maintenance work, and what will happen when the institutional knowledge used to form the original integration is not on-call to reintegrate it in the face of unexpected events.
> “I think anybody trying to toss integrations over the fence to an ops team needs to have an extended tour of duty trying to operate their own solutions in production WITHOUT assistance before they form bold opinions about operations constraints.”
I think this is one of the worst possible attitudes to have. It’s a petty way to feel, desiring some type of “I’ve seen some shit” tough guy credential more than supporting the stuff needed to actually solve business problems.
If you hire people whose value-add to your company is inventing completely new things, including the deployment, ops, and scaling that go along with them, then it is the job of infrastructure on the other side of that fence to happily and eagerly accept whatever comes over it, to understand why developer teams made the choices they made, and to take an attitude of supporting as much as possible.
> “You need a plan for how all the components will be maintained, patched, and upgraded. You need contingency planning when some of those bleeding edge components are going to become deprecated. You need to consider what staff capabilities are assigned to do that maintenance work or what will happen when the institutional knowledge used to form the original integration is not on-call to reintegrate it in the face of unexpected events.”
Yes, of course. But all of this is already what dev teams are doing. Ops / infra is not taking a hare-brained plan and adding these robustness aspects to it. Not at all. Instead, they take plans from application teams and try to use policy to minimize their own maintenance burden, even when that optimization is antithetical to what the company requires at a more fundamental level.
A lot of companies languish and die because of sociological dysfunction in the policy interface between dev teams and infrastructure. The more political control infrastructure has over that interface, the closer that company is to death.
It’s like a body that is disallowed from generating white blood cells in response to a new immune challenge. Even if the bleeding edge integrations are really hard, the alternative world where you slow them down with policy is death and attrition.