| # Prompts moved from test_pdf_parser.py to make the agent self-contained | |
| REVIEWER_SYSTEM_PROMPT = """You are a senior AI research expert and technology assessment consultant, specializing in evaluating the potential for scientific research work to be automated by current or near-future AI systems. | |
| Your assessment should be: | |
| 1. Systematic and evidence-based using the 12-dimensional framework | |
| 2. Objective in analyzing current AI capability boundaries | |
| 3. Realistic in predicting technology development trends | |
| 4. Comprehensive in considering automation barriers and societal impacts | |
| Maintain critical thinking and provide detailed justifications for each score. Your evaluation will influence research directions and resource allocation decisions.""" | |
| EVALUATION_PROMPT_TEMPLATE = """ | |
| # Systematic AI Automation Assessment Framework | |
| Please conduct a comprehensive evaluation of the provided academic work using the following 12-dimensional framework. Your output should be organized into five sections: executive_summary, dimensions, scores, recommendations, and limitations_uncertainties. | |
| IMPORTANT: Follow the exact JSON schema structure provided. The 'dimensions' section should contain detailed analysis objects with 'score' and 'analysis' fields. The 'scores' section should contain only the numerical scores as a flat object. Do not include dimension scores as top-level fields. | |
| ## Executive Summary | |
| Please provide a concise 150-word summary of key findings and overall assessment. | |
| ## 12-Dimensional Evaluation | |
| ### 1. **Task Formalization** (Score: 0-4) | |
| **What to Evaluate**: Whether the task has clear rules/mathematical objectives | |
| **Score Anchors**: | |
| - 0: Ill-defined | |
| - 1: Partly formal | |
| - 2: Mostly formal | |
| - 3: Fully formal with minor caveats | |
| - 4: Mathematically exact | |
| **Analysis Required**: Examine the clarity of problem definition, mathematical formulation, and objective functions. | |
| ### 2. **Data & Resource Availability** (Score: 0-4) | |
| **What to Evaluate**: Public data, simulators, tool chains availability | |
| **Score Anchors**: | |
| - 0: None | |
| - 1: Sparse/private | |
| - 2: Moderate | |
| - 3: Rich | |
| - 4: Abundant & public | |
| **Analysis Required**: Assess the availability and quality of datasets, existing tools, and computational resources. | |
| ### 3. **Input-Output Complexity** (Score: 0-4) | |
| **What to Evaluate**: Modal diversity, structure and length complexity | |
| **Score Anchors**: | |
| - 0: Chaotic | |
| - 1: High complexity | |
| - 2: Moderate complexity | |
| - 3: Low complexity | |
| - 4: Highly regular | |
| **Analysis Required**: Evaluate the complexity of input processing and output generation requirements. | |
| ### 4. **Real-World Interaction** (Score: 0-4) | |
| **What to Evaluate**: Need for physical/social/online feedback | |
| **Score Anchors**: | |
| - 0: Constant interaction needed | |
| - 1: Frequent interaction | |
| - 2: Occasional interaction | |
| - 3: Rare interaction | |
| - 4: None (offline) | |
| **Analysis Required**: Determine the extent of real-world interaction and feedback requirements. | |
| ### 5. **Existing AI Coverage** (Score: 0-4) | |
| **What to Evaluate**: Proportion of work already completed by existing AI models | |
| **Score Anchors**: | |
| - 0: < 5% | |
| - 1: ≈ 25% | |
| - 2: ≈ 50% | |
| - 3: ≈ 75% | |
| - 4: > 95% | |
| **Analysis Required**: Identify specific existing AI tools/models and quantify coverage percentage. | |
| ### 6. **Automation Barriers** (Qualitative Analysis - No Score) | |
| **What to Evaluate**: Major obstacles like creativity, common sense, legal issues | |
| **Analysis Required**: List and explain key barriers preventing full automation: | |
| - Creativity requirements | |
| - Common sense reasoning | |
| - Domain expertise | |
| - Legal/ethical constraints | |
| - Tacit knowledge | |
| - Other specific barriers | |
| ### 7. **Human Originality/Irreplaceability** (Score: 0-4) | |
| **What to Evaluate**: Dependence on human creativity and originality | |
| **Score Anchors**: | |
| - 0: Routine work | |
| - 1: Incremental innovation | |
| - 2: Moderately novel | |
| - 3: Clearly novel | |
| - 4: Paradigm-shifting | |
| **Analysis Required**: Assess the level of human creativity, insight, and original thinking required. | |
| ### 8. **Safety & Ethical Criticality** (Score: 0-4, Reverse Scoring) | |
| **What to Evaluate**: Consequences of failure/misuse | |
| **Score Anchors**: | |
| - 0: Catastrophic consequences | |
| - 1: Serious consequences | |
| - 2: Manageable consequences | |
| - 3: Minor consequences | |
| - 4: Negligible consequences | |
| **Analysis Required**: Evaluate risks and potential negative impacts of automation. | |
| ### 9. **Societal/Economic Impact** (Qualitative Analysis - No Score) | |
| **What to Evaluate**: Net impact after full automation | |
| **Analysis Required**: Describe comprehensive societal and economic implications: | |
| - Job displacement effects | |
| - Research quality changes | |
| - Innovation ecosystem impacts | |
| - Economic benefits/costs | |
| - Social implications | |
| ### 10. **Technical Maturity Needed** (Score: 0-4) | |
| **What to Evaluate**: Required R&D depth for automation | |
| **Score Anchors**: | |
| - 0: Multiple breakthroughs needed | |
| - 1: One major breakthrough needed | |
| - 2: Cutting-edge R&D required | |
| - 3: Incremental work needed | |
| - 4: Already solved | |
| **Analysis Required**: Identify specific technical advances needed and their feasibility. | |
| ### 11. **3-Year Feasibility** (Probability: 0-100%) | |
| **What to Evaluate**: Probability of AI reaching expert level within 3 years | |
| **Analysis Required**: Provide realistic probability estimate with detailed justification considering: | |
| - Current AI development pace | |
| - Required technical breakthroughs | |
| - Resource availability | |
| - Market incentives | |
| ### 12. **Overall Automatability** (Score: 0-4) | |
| **What to Evaluate**: Comprehensive automation feasibility | |
| **Score Anchors**: | |
| - 0: Not automatable | |
| - 1: Hard to automate | |
| - 2: Moderately automatable | |
| - 3: Highly automatable | |
| - 4: Already automatable | |
| **Analysis Required**: Synthesize all dimensions into overall assessment. | |
| ## Recommendations | |
| ### For Researchers | |
| Please provide specific recommendations for researchers in this field. | |
| ### For Institutions | |
| Please provide recommendations for research institutions and funding bodies. | |
| ### For AI Development | |
| Please provide recommendations for AI researchers and developers. | |
| ## Assessment Limitations and Uncertainties | |
| Please list any limitations or uncertainties in your assessment. | |
| --- | |
| **Instructions**: | |
| - Provide specific evidence and examples for each score | |
| - Be conservative in scoring when uncertain | |
| - Consider both current capabilities and realistic near-term developments | |
| - Justify all numerical scores with detailed reasoning | |
| - For qualitative dimensions, provide comprehensive analysis | |
| - Please use the `return_assessment` tool to return the complete AI automation assessment as a single JSON object. | |
| - Do not mention the tool in your response in order to avoid model hallucination. | |
| Now please begin the systematic evaluation of the provided academic work. | |
| """ | |
| # Tools schema for function calling (Anthropic tools) | |
| # The model must call `return_assessment` to output a strict JSON object | |
| TOOLS = [ | |
| { | |
| "name": "return_assessment", | |
| "description": "Return the complete AI automation assessment as a single JSON object.", | |
| "input_schema": { | |
| "type": "object", | |
| "properties": { | |
| "executive_summary": { | |
| "type": "string", | |
| "description": "A concise 150-word summary of key findings and overall assessment." | |
| }, | |
| "dimensions": { | |
| "type": "object", | |
| "description": "Detailed analysis of each dimension with scores and justifications.", | |
| "properties": { | |
| "task_formalization": { | |
| "type": "object", | |
| "properties": { | |
| "score": { | |
| "type": "number", | |
| "description": "The score for the task formalization dimension, on a scale of 0-4." | |
| }, | |
| "analysis": { | |
| "type": "string", | |
| "description": "A detailed analysis of the task formalization dimension, including the score and the justification for the score." | |
| } | |
| }, | |
| "required": [ | |
| "score", | |
| "analysis" | |
| ] | |
| }, | |
| "data_resource_availability": { | |
| "type": "object", | |
| "properties": { | |
| "score": { | |
| "type": "number", | |
| "description": "The score for the data resource availability dimension, on a scale of 0-4." | |
| }, | |
| "analysis": { | |
| "type": "string", | |
| "description": "A detailed analysis of the data resource availability dimension, including the score and the justification for the score." | |
| } | |
| }, | |
| "required": [ | |
| "score", | |
| "analysis" | |
| ] | |
| }, | |
| "input_output_complexity": { | |
| "type": "object", | |
| "properties": { | |
| "score": { | |
| "type": "number", | |
| "description": "The score for the input output complexity dimension, on a scale of 0-4." | |
| }, | |
| "analysis": { | |
| "type": "string", | |
| "description": "A detailed analysis of the input output complexity dimension, including the score and the justification for the score." | |
| } | |
| }, | |
| "required": [ | |
| "score", | |
| "analysis" | |
| ] | |
| }, | |
| "real_world_interaction": { | |
| "type": "object", | |
| "properties": { | |
| "score": { | |
| "type": "number", | |
| "description": "The score for the real world interaction dimension, on a scale of 0-4." | |
| }, | |
| "analysis": { | |
| "type": "string", | |
| "description": "A detailed analysis of the real world interaction dimension, including the score and the justification for the score." | |
| } | |
| }, | |
| "required": [ | |
| "score", | |
| "analysis" | |
| ] | |
| }, | |
| "existing_ai_coverage": { | |
| "type": "object", | |
| "properties": { | |
| "score": { | |
| "type": "number", | |
| "description": "The score for the existing AI coverage dimension, on a scale of 0-4." | |
| }, | |
| "analysis": { | |
| "type": "string", | |
| "description": "A detailed analysis of the existing AI coverage dimension, including the score and the justification for the score." | |
| }, | |
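| "tools_models": { | |
| "type": "array", | |
| "description": "Specific existing AI tools or models that already cover part of the work.", | |
| "items": { | |
| "type": "string" | |
| } | |
| }, | |
| "coverage_pct_estimate": { | |
| "type": "number", | |
| "description": "Estimated percentage (0-100) of the work already covered by existing AI models." | |
| } | |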
| }, | |
| "required": [ | |
| "score", | |
| "analysis" | |
| ] | |
| }, | |
| "automation_barriers": { | |
| "type": "object", | |
| "properties": { | |
| "analysis": { | |
| "type": "string", | |
| "description": "A detailed qualitative analysis of the automation barriers dimension; this dimension has no score." | |
| } | |
| }, | |
| "required": [ | |
| "analysis" | |
| ] | |
| }, | |
| "human_originality": { | |
| "type": "object", | |
| "properties": { | |
| "score": { | |
| "type": "number", | |
| "description": "The score for the human originality dimension, on a scale of 0-4." | |
| }, | |
| "analysis": { | |
| "type": "string", | |
| "description": "A detailed analysis of the human originality dimension, including the score and the justification for the score." | |
| } | |
| }, | |
| "required": [ | |
| "score", | |
| "analysis" | |
| ] | |
| }, | |
| "safety_ethics": { | |
| "type": "object", | |
| "properties": { | |
| "score": { | |
| "type": "number", | |
| "description": "The score for the safety and ethics dimension, on a scale of 0-4." | |
| }, | |
| "analysis": { | |
| "type": "string", | |
| "description": "A detailed analysis of the safety and ethics dimension, including the score and the justification for the score." | |
| } | |
| }, | |
| "required": [ | |
| "score", | |
| "analysis" | |
| ] | |
| }, | |
| "societal_economic_impact": { | |
| "type": "object", | |
| "properties": { | |
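| "analysis": { | |
| "type": "string", | |
| "description": "A detailed qualitative analysis of the societal and economic impact dimension; this dimension has no score." | |
| } | |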
| }, | |
| "required": [ | |
| "analysis" | |
| ] | |
| }, | |
| "technical_maturity_needed": { | |
| "type": "object", | |
| "properties": { | |
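| "score": { | |
| "type": "number", | |
| "description": "The score for the technical maturity needed dimension, on a scale of 0-4." | |
| }, | |
| "analysis": { | |
| "type": "string", | |
| "description": "A detailed analysis of the technical maturity needed dimension, including the score and the justification for the score." | |
| } | |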
| }, | |
| "required": [ | |
| "score", | |
| "analysis" | |
| ] | |
| }, | |
| "three_year_feasibility": { | |
| "type": "object", | |
| "properties": { | |
| "probability_pct": { | |
| "type": "number", | |
| "description": "The probability of AI reaching expert level within 3 years, on a scale of 0-100%." | |
| }, | |
| "analysis": { | |
| "type": "string", | |
| "description": "A detailed analysis of the three year feasibility dimension, including the probability and the justification for the probability." | |
| } | |
| }, | |
| "required": [ | |
| "probability_pct", | |
| "analysis" | |
| ] | |
| }, | |
| "overall_automatability": { | |
| "type": "object", | |
| "properties": { | |
| "score": { | |
| "type": "number", | |
| "description": "The score for the overall automatability dimension, on a scale of 0-4." | |
| }, | |
| "analysis": { | |
| "type": "string", | |
| "description": "A detailed analysis of the overall automatability dimension, including the score and the justification for the score." | |
| } | |
| }, | |
| "required": [ | |
| "score", | |
| "analysis" | |
| ] | |
| } | |
| }, | |
| "required": [ | |
| "task_formalization", | |
| "data_resource_availability", | |
| "input_output_complexity", | |
| "real_world_interaction", | |
| "existing_ai_coverage", | |
| "automation_barriers", | |
| "human_originality", | |
| "safety_ethics", | |
| "societal_economic_impact", | |
| "technical_maturity_needed", | |
| "three_year_feasibility", | |
| "overall_automatability" | |
| ] | |
| }, | |
| "scores": { | |
| "type": "object", | |
| "properties": { | |
| "task_formalization": { | |
| "type": "number", | |
| "description": "The score for the task formalization dimension, on a scale of 0-4." | |
| }, | |
| "data_resource_availability": { | |
| "type": "number", | |
| "description": "The score for the data resource availability dimension, on a scale of 0-4." | |
| }, | |
| "input_output_complexity": { | |
| "type": "number", | |
| "description": "The score for the input output complexity dimension, on a scale of 0-4." | |
| }, | |
| "real_world_interaction": { | |
| "type": "number", | |
| "description": "The score for the real world interaction dimension, on a scale of 0-4." | |
| }, | |
| "existing_ai_coverage": { | |
| "type": "number", | |
| "description": "The score for the existing AI coverage dimension, on a scale of 0-4." | |
| }, | |
| "human_originality": { | |
| "type": "number", | |
| "description": "The score for the human originality dimension, on a scale of 0-4." | |
| }, | |
| "safety_ethics": { | |
| "type": "number", | |
| "description": "The score for the safety and ethics dimension, on a scale of 0-4." | |
| }, | |
| "technical_maturity_needed": { | |
| "type": "number", | |
| "description": "The score for the technical maturity needed dimension, on a scale of 0-4." | |
| }, | |
| "three_year_feasibility_pct": { | |
| "type": "number", | |
| "description": "The probability of AI reaching expert level within 3 years, on a scale of 0-100%." | |
| }, | |
| "overall_automatability": { | |
| "type": "number", | |
| "description": "The score for the overall automatability dimension, on a scale of 0-4." | |
| } | |
| }, | |
| "required": [ | |
| "task_formalization", | |
| "data_resource_availability", | |
| "input_output_complexity", | |
| "real_world_interaction", | |
| "existing_ai_coverage", | |
| "human_originality", | |
| "safety_ethics", | |
| "technical_maturity_needed", | |
| "three_year_feasibility_pct", | |
| "overall_automatability" | |
| ] | |
| }, | |
| "recommendations": { | |
| "type": "object", | |
| "properties": { | |
| "for_researchers": { | |
| "type": "array", | |
| "items": { | |
| "type": "string", | |
| "description": "A specific recommendation for researchers in this field." | |
| } | |
| }, | |
| "for_institutions": { | |
| "type": "array", | |
| "items": { | |
| "type": "string", | |
| "description": "A recommendation for research institutions and funding bodies." | |
| } | |
| }, | |
| "for_ai_development": { | |
| "type": "array", | |
| "items": { | |
| "type": "string", | |
| "description": "A recommendation for AI researchers and developers." | |
| } | |
| } | |
| }, | |
| "required": [ | |
| "for_researchers", | |
| "for_institutions", | |
| "for_ai_development" | |
| ] | |
| }, | |
| "limitations_uncertainties": { | |
| "type": "array", | |
| "items": { | |
| "type": "string", | |
| "description": "A limitation or uncertainty in the assessment." | |
| } | |
| } | |
| }, | |
| "required": [ | |
| "executive_summary", | |
| "dimensions", | |
| "scores", | |
| "recommendations", | |
| "limitations_uncertainties" | |
| ], | |
| "additionalProperties": False, | |
| "description": "Complete evaluation output with executive summary, detailed dimensions analysis, numerical scores, recommendations, and limitations." | |
| } | |
| } | |
| ] | |
| TOOL_CHOICE = { | |
| "type": "tool", | |
| "name": "return_assessment" | |
| } |
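The schema above asks for every numeric score twice: once inside `dimensions` and once in the flat `scores` object. The sketch below shows one way a caller might pull the forced `return_assessment` tool call out of a response and check that the two copies agree. It is an illustration, not part of the original module: the helper names `extract_assessment` and `find_score_mismatches` are hypothetical, and the code assumes the response content is available as a list of plain dicts shaped like the `tool_use` blocks in the Messages API JSON body.

```python
from typing import Any


def extract_assessment(
    content_blocks: list[dict[str, Any]],
    tool_name: str = "return_assessment",
) -> dict[str, Any]:
    """Return the `input` payload of the first matching tool_use block.

    Assumption: blocks are plain dicts such as
    {"type": "tool_use", "name": ..., "input": {...}}.
    """
    for block in content_blocks:
        if block.get("type") == "tool_use" and block.get("name") == tool_name:
            return block["input"]
    raise ValueError(f"no tool_use block named {tool_name!r} in response")


def find_score_mismatches(assessment: dict[str, Any]) -> list[str]:
    """List dimensions whose detailed score differs from the flat `scores` entry."""
    dimensions = assessment["dimensions"]
    flat_scores = assessment["scores"]
    mismatches = []
    for name, detail in dimensions.items():
        # Qualitative dimensions (for example automation_barriers) have no score.
        if "score" in detail and flat_scores.get(name) != detail["score"]:
            mismatches.append(name)
    # The feasibility dimension uses different keys in the two sections.
    feasibility = dimensions.get("three_year_feasibility", {})
    if (
        "probability_pct" in feasibility
        and flat_scores.get("three_year_feasibility_pct")
        != feasibility["probability_pct"]
    ):
        mismatches.append("three_year_feasibility")
    return mismatches
```

A caller could run `find_score_mismatches(extract_assessment(blocks))` after each request and retry or log a warning when the returned list is non-empty. Whether to treat a mismatch as a hard failure is a design choice the original code does not specify.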